Regular Expressions
Learn how to use regular expressions to search, match, and manipulate text.
Replacing Text with Regular Expressions in Python
Regular expressions provide a powerful way to search and manipulate text based on patterns. A common task is to find text that matches a specific pattern and replace it with something else. Python's re module provides the sub() function for exactly this purpose. This document explains how to use the re.sub() function effectively.
Understanding Regular Expressions
Before diving into the sub() function, it's essential to have a basic understanding of regular expressions. Regular expressions are sequences of characters that define a search pattern. They can include special characters (metacharacters) that represent various patterns, such as any digit, any whitespace, or specific character ranges. For example:
- . (dot): Matches any single character except newline.
- * (asterisk): Matches zero or more occurrences of the preceding character or group.
- + (plus): Matches one or more occurrences of the preceding character or group.
- ? (question mark): Matches zero or one occurrence of the preceding character or group.
- \d: Matches any digit (0-9).
- \s: Matches any whitespace character (space, tab, newline).
- \w: Matches any word character (a-z, A-Z, 0-9, and underscore).
- []: Defines a character class (e.g., [aeiou] matches any vowel).
- (): Groups parts of the regular expression. These groups can be referenced in the replacement string.
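The metacharacters listed above can be tried out in a few lines. The following is a minimal sketch; the sample strings are made up for illustration, and re.findall() (not otherwise covered in this document) is used only to display what each pattern matches.

```python
import re

sample = "Order 66 shipped to bay 7 at 10:45."

# \d+ matches each run of one or more digits.
digits = re.findall(r"\d+", sample)

# [aeiou] is a character class matching any single lowercase vowel.
no_vowels = re.sub(r"[aeiou]", "", "regular")

# colou?r: the ? makes the preceding "u" optional.
spellings = re.findall(r"colou?r", "color and colour")

# . matches any single character except a newline.
three_letter = re.findall(r"b.y", sample)

print(digits)        # ['66', '7', '10', '45']
print(no_vowels)     # rglr
print(spellings)     # ['color', 'colour']
print(three_letter)  # ['bay']
```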
The re.sub() Function
The re.sub() function takes three required arguments and two optional ones:
- pattern: The regular expression pattern to search for, given as a string (a compiled pattern object is also accepted).
- replacement: The string to replace the matched text with (named repl in the function signature). This can also be a function (see later examples).
- string: The string in which to search for the pattern and perform the replacement.
- count (optional): The maximum number of replacements to make. If omitted or zero, all occurrences are replaced.
- flags (optional): Modifiers that affect how the regular expression is interpreted (e.g., re.IGNORECASE for case-insensitive matching).
The function returns a new string with the replacements made. The original string is not modified.
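As a quick illustration of the argument order and of the fact that the input string is left untouched, here is a small sketch. The sample text and variable names are invented for this example.

```python
import re

original = "one two three two"

# Required arguments, in order: pattern, replacement, string.
replaced_all = re.sub(r"two", "2", original)

# Optional arguments passed by keyword: replace at most one match,
# ignoring case while matching.
replaced_once = re.sub(r"TWO", "2", original, count=1, flags=re.IGNORECASE)

print(replaced_all)   # one 2 three 2
print(replaced_once)  # one 2 three two
print(original)       # one two three two (unchanged)
```

Passing count and flags by keyword keeps the call readable and avoids accidentally putting a flag in the count position.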
Basic Examples
Here are some basic examples of using re.sub():
Example 1: Replacing a specific word
import re
text = "The quick brown fox jumps over the lazy dog."
new_text = re.sub(r"fox", "cat", text)
print(new_text) # Output: The quick brown cat jumps over the lazy dog.
Example 2: Removing all digits from a string
import re
text = "My phone number is 123-456-7890."
new_text = re.sub(r"\d", "", text)
print(new_text) # Output: My phone number is --.
Example 3: Replacing multiple spaces with a single space
import re
text = "This  string   has    multiple     spaces."
new_text = re.sub(r"\s+", " ", text)
print(new_text) # Output: This string has multiple spaces.
Using Backreferences
Backreferences allow you to refer to captured groups within the regular expression in the replacement string. Captured groups are defined using parentheses () in the regular expression. You can access them using \1, \2, etc., where \1 refers to the first captured group, \2 to the second, and so on.
Example 4: Swapping two words
import re
text = "Hello, World!"
new_text = re.sub(r"(\w+), (\w+)", r"\2, \1", text)
print(new_text) # Output: World, Hello!
In this example, (\w+) captures one or more word characters. The first group is "Hello" and the second is "World". The replacement string r"\2, \1" swaps the order of these groups.
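Backreferences also work with more than two groups, and the re module additionally supports named groups and the \g<...> form in replacement strings. The sketch below reorders a date; the sample text and the group names year, month, and day are chosen purely for illustration.

```python
import re

date = "Released on 2024-01-15."

# Numbered groups: \1 = year, \2 = month, \3 = day.
numbered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

# Named groups: (?P<name>...) in the pattern, \g<name> in the replacement.
named = re.sub(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    r"\g<day>/\g<month>/\g<year>",
    date,
)

# \g<1> avoids ambiguity when a digit follows the group reference:
# r"\10" would mean group 10, while r"\g<1>0" is group 1 followed by "0".
padded = re.sub(r"(\d)x", r"\g<1>0", "5x")

print(numbered)  # Released on 15/01/2024.
print(named)     # Released on 15/01/2024.
print(padded)    # 50
```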
Using a Function as a Replacement
Instead of a string, you can provide a function as the replacement argument to re.sub(). This function will be called for each match found, and the return value of the function will be used as the replacement string.
The function receives a Match object as its argument, which contains information about the match, such as the matched string, the captured groups, and the match position.
Example 5: Converting temperatures from Celsius to Fahrenheit
import re
def celsius_to_fahrenheit(match):
    celsius = float(match.group(1))
    fahrenheit = (celsius * 9/5) + 32
    return str(fahrenheit) + "F"
text = "The temperature is 25C today."
new_text = re.sub(r"(\d+)C", celsius_to_fahrenheit, text)
print(new_text) # Output: The temperature is 77.0F today.
In this example, the regular expression (\d+)C captures the temperature in Celsius. The celsius_to_fahrenheit function converts the Celsius value to Fahrenheit and returns the result as a string with "F" appended.
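The Match object passed to the replacement function exposes more than the captured groups. As a rough sketch, the hypothetical annotate function below uses group(0) for the full matched text and start() for the match position.

```python
import re

def annotate(match):
    # group(0) is the whole match; start() is its index in the input string.
    word = match.group(0)
    return f"{word.upper()}@{match.start()}"

text = "red green blue"
result = re.sub(r"\w+", annotate, text)
print(result)  # RED@0 GREEN@4 BLUE@10
```

The function must return a string, which is why the position is formatted into the result rather than returned as a number.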
Using the count Parameter
The count parameter limits the number of replacements made. If count=1, only the first occurrence of the pattern will be replaced.
Example 6: Replacing only the first occurrence
import re
text = "apple banana apple cherry"
new_text = re.sub(r"apple", "orange", text, count=1)
print(new_text) # Output: orange banana apple cherry
Case-Insensitive Replacement
To perform a case-insensitive replacement, you can use the re.IGNORECASE or re.I flag.
Example 7: Case-insensitive replacement
import re
text = "The Cat sat on the mat."
new_text = re.sub(r"cat", "dog", text, flags=re.IGNORECASE)
print(new_text) # Output: The dog sat on the mat.
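One thing to keep in mind is that a fixed replacement string is inserted as written, whatever the capitalization of the matched text. If the original casing should carry over, one possible approach is to combine re.IGNORECASE with a replacement function. The helper match_case below is a hypothetical name and handles only three simple cases (all caps, capitalized, anything else).

```python
import re

def match_case(replacement):
    """Return a replacement function that mimics the matched word's casing."""
    def _replace(match):
        word = match.group(0)
        if word.isupper():
            return replacement.upper()
        if word[0].isupper():
            return replacement.capitalize()
        return replacement
    return _replace

text = "Cat, cat, and CAT."
result = re.sub(r"cat", match_case("dog"), text, flags=re.IGNORECASE)
print(result)  # Dog, dog, and DOG.
```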
Common Mistakes and Best Practices
- Escaping Special Characters: Remember to escape special characters in your regular expression pattern using a backslash (\) if you want to match them literally. For example, to match a literal dot (.), use \.
- Raw Strings: Use raw strings (r"...") for regular expression patterns to avoid issues with backslash escaping. This makes the code more readable and less prone to errors.
- Testing: Always test your regular expressions thoroughly to ensure they match the intended patterns and produce the desired results. Online regex testers are helpful for this.
- Readability: For complex regular expressions, consider adding comments to explain the different parts of the pattern.
- Performance: Compiling regular expressions (using re.compile()) can improve performance if you're using the same pattern multiple times.
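Several of these practices can be combined in one place. The sketch below is illustrative only: it assumes the re.VERBOSE flag as the way to add comments inside a pattern and re.escape() as a way to escape special characters automatically, neither of which is named in the list above, and the phone numbers and variable names are made up.

```python
import re

# re.VERBOSE lets the pattern span several lines and carry comments;
# whitespace inside the pattern is ignored under this flag.
phone_pattern = re.compile(
    r"""
    (\d{3})   # area code
    -         # literal hyphen
    (\d{3})   # prefix
    -         # literal hyphen
    (\d{4})   # line number
    """,
    re.VERBOSE,
)

# A compiled pattern has its own sub() method and can be reused.
masked = phone_pattern.sub(r"\1-***-****", "Call 123-456-7890 or 555-867-5309.")
print(masked)  # Call 123-***-**** or 555-***-****.

# re.escape() backslash-escapes metacharacters in a literal string,
# so the dot in "1.5" matches only a real dot and not the "4" in "145".
literal = "1.5"
price_fixed = re.sub(re.escape(literal), "2.0", "1.5 and 145")
print(price_fixed)  # 2.0 and 145
```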
By mastering the re.sub() function and regular expression syntax, you can efficiently manipulate text and perform complex replacements in your Python programs.