Regular Expressions

Learn how to use regular expressions to search, match, and manipulate text.


Replacing Text with Regular Expressions in Python

Regular expressions provide a powerful way to search and manipulate text based on patterns. A common task is to find text that matches a specific pattern and replace it with something else. Python's re module provides the sub() function for exactly this purpose. This document explains how to use the re.sub() function effectively.

Understanding Regular Expressions

Before diving into the sub() function, it's essential to have a basic understanding of regular expressions. Regular expressions are sequences of characters that define a search pattern. They can include special characters (metacharacters) that represent various patterns, such as any digit, any whitespace, or specific character ranges. For example:

  • . (dot): Matches any single character except newline.
  • * (asterisk): Matches zero or more occurrences of the preceding character or group.
  • + (plus): Matches one or more occurrences of the preceding character or group.
  • ? (question mark): Matches zero or one occurrence of the preceding character or group.
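  • \d: Matches any decimal digit (0-9; in str patterns, also digits from other scripts unless re.ASCII is set).
  • \s: Matches any whitespace character (space, tab, newline, and similar).
  • \w: Matches any word character (a-z, A-Z, 0-9, and underscore; in str patterns, also Unicode letters and digits unless re.ASCII is set).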
  • []: Defines a character class (e.g., [aeiou] matches any vowel).
  • (): Groups parts of the regular expression. These groups can be referenced in the replacement string.
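
The metacharacters above can be tried out directly. The following is a minimal sketch; the sample strings and variable names are made up for illustration, and re.findall() is used only because it makes it easy to see what each pattern matches.

```python
import re

sample = "Order 66 shipped to bay 7 at 09:30."

# \d+ matches each run of one or more digits.
numbers = re.findall(r"\d+", sample)
print(numbers)    # ['66', '7', '09', '30']

# [aeiou] is a character class: any single lowercase vowel.
masked = re.sub(r"[aeiou]", "*", "regular")
print(masked)     # r*g*l*r

# ? makes the preceding character optional, so both spellings match.
spellings = re.findall(r"colou?r", "color colour")
print(spellings)  # ['color', 'colour']

# () captures part of the match; findall returns only the captured group.
hours = re.findall(r"(\d+):\d+", sample)
print(hours)      # ['09']
```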

The re.sub() Function

The re.sub() function takes three required arguments and two optional ones:

  1. pattern: The regular expression to search for, given as a string or as a pattern object compiled with re.compile().
  2. repl: The replacement for each match. This is either a string, which may contain backreferences, or a function (see later examples).
  3. string: The string in which to search for the pattern and perform the replacement.
  4. count (optional): The maximum number of replacements to make. If omitted or zero, all occurrences are replaced.
  5. flags (optional): Modifiers that affect how the regular expression is interpreted (e.g., re.IGNORECASE for case-insensitive matching).

Pass count and flags by keyword (count=1, flags=re.IGNORECASE). A flag passed positionally as the fourth argument is interpreted as count, which is a common source of bugs.

The function returns a new string with the replacements made. The original string is not modified.
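
As a quick sketch of the full call, the snippet below passes both optional arguments by keyword and confirms that the input string is left untouched. The sample text and variable names are illustrative only.

```python
import re

original = "Python is fun. python is fast. PYTHON is popular."

# pattern, repl, and string are positional; count and flags go by keyword.
updated = re.sub(r"python", "Ruby", original, count=2, flags=re.IGNORECASE)

print(updated)   # Ruby is fun. Ruby is fast. PYTHON is popular.
print(original)  # Python is fun. python is fast. PYTHON is popular.
```

Because count=2, the third occurrence is left alone even though it would match case-insensitively.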

Basic Examples

Here are some basic examples of using re.sub():

Example 1: Replacing a specific word

import re

text = "The quick brown fox jumps over the lazy dog."
new_text = re.sub(r"fox", "cat", text)
print(new_text)  # Output: The quick brown cat jumps over the lazy dog. 

Example 2: Removing all digits from a string

import re

text = "My phone number is 123-456-7890."
new_text = re.sub(r"\d", "", text)
print(new_text)  # Output: My phone number is --.

Example 3: Replacing runs of whitespace (spaces, tabs, newlines) with a single space

import re

text = "This  string   has    multiple    spaces."
new_text = re.sub(r"\s+", " ", text)
print(new_text)  # Output: This string has multiple spaces. 

Using Backreferences

Backreferences allow you to refer to captured groups within the regular expression in the replacement string. Captured groups are defined using parentheses () in the regular expression. You can access them using \1, \2, etc., where \1 refers to the first captured group, \2 to the second, and so on.

Example 4: Swapping two words

import re

text = "Hello, World!"
new_text = re.sub(r"(\w+), (\w+)", r"\2, \1", text)
print(new_text)  # Output: World, Hello! 

In this example, (\w+) captures one or more word characters. The first group is "Hello" and the second is "World". The replacement string r"\2, \1" swaps the order of these groups.
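
Numbered backreferences become ambiguous when the replacement continues with a digit, and hard to read when there are many groups. Python's replacement syntax also accepts \g<number> and \g<name> for these cases. The sketch below uses made-up inputs and group names (year, month, day) to illustrate both forms.

```python
import re

# \g<1> is an unambiguous form of \1: here it is followed by a literal "0".
# Writing r"\10" instead would be read as a reference to group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")
print(padded)  # a10b20

# Named groups, written (?P<name>...), can be referenced with \g<name>.
date = "2024-03-15"
reordered = re.sub(
    r"(?P<year>\d+)-(?P<month>\d+)-(?P<day>\d+)",
    r"\g<day>/\g<month>/\g<year>",
    date,
)
print(reordered)  # 15/03/2024
```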

Using a Function as a Replacement

Instead of a string, you can provide a function as the replacement argument to re.sub(). This function will be called for each match found, and the return value of the function will be used as the replacement string.

The function receives a Match object as its argument, which contains information about the match, such as the matched string, the captured groups, and the match position.

Example 5: Converting temperatures from Celsius to Fahrenheit

import re

def celsius_to_fahrenheit(match):
    celsius = float(match.group(1))
    fahrenheit = (celsius * 9/5) + 32
    return str(fahrenheit) + "F"

text = "The temperature is 25C today."
new_text = re.sub(r"(\d+)C", celsius_to_fahrenheit, text)
print(new_text)  # Output: The temperature is 77.0F today. 

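In this example, the regular expression (\d+)C captures the temperature in Celsius. The celsius_to_fahrenheit function converts the Celsius value to Fahrenheit and returns the result as a string with "F" appended. Because \d+ matches only digits, this pattern handles whole, non-negative temperatures only: in an input such as 21.5C it would match just the trailing 5C and produce a wrong result.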
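
The Match object passed to the replacement function exposes more than the captured groups. The following sketch, with a hypothetical helper named describe and invented sample text, shows group(0) for the whole match, group(1) for the first group, and start() for the match position. It also shows that a lambda works for short replacements.

```python
import re

def describe(match):
    # group(0) is the whole match, group(1) the first captured group,
    # and start() the index where the match begins in the input string.
    return f"[{match.group(0)} at {match.start()}, value={match.group(1)}]"

annotated = re.sub(r"(\d+)C", describe, "Low 5C, high 25C")
print(annotated)  # Low [5C at 4, value=5], high [25C at 13, value=25]

# A lambda is convenient when the replacement logic fits on one line.
doubled = re.sub(r"\d+", lambda m: str(int(m.group(0)) * 2), "3 apples and 10 pears")
print(doubled)    # 6 apples and 20 pears
```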

Using the count Parameter

The count parameter limits the number of replacements made. If count=1, only the first occurrence of the pattern will be replaced.

Example 6: Replacing only the first occurrence

import re

text = "apple banana apple cherry"
new_text = re.sub(r"apple", "orange", text, count=1)
print(new_text)  # Output: orange banana apple cherry 

Case-Insensitive Replacement

To perform a case-insensitive replacement, you can use the re.IGNORECASE or re.I flag.

Example 7: Case-insensitive replacement

import re

text = "The Cat sat on the mat."  # capital "C" still matches the lowercase pattern
new_text = re.sub(r"cat", "dog", text, flags=re.IGNORECASE)
print(new_text)  # Output: The dog sat on the mat. 

Common Mistakes and Best Practices

  • Escaping Special Characters: Escape special characters in your pattern with a backslash (\) if you want to match them literally. For example, to match a literal dot, write \. (a backslash followed by a dot). To match arbitrary literal text, re.escape() adds the necessary backslashes for you.
  • Raw Strings: Use raw strings (r"...") for regular expression patterns to avoid issues with backslash escaping. This makes the code more readable and less prone to errors.
  • Testing: Always test your regular expressions thoroughly to ensure they match the intended patterns and produce the desired results. Online regex testers are helpful for this.
  • Readability: For complex regular expressions, add comments that explain the different parts of the pattern. The re.VERBOSE flag lets you spread a pattern over several lines and put # comments inside it.
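  • Performance: Compiling a regular expression once with re.compile() and reusing the resulting pattern object can improve performance when the same pattern is applied many times, for example inside a loop. The re module also caches recently used patterns internally, so the gain is often modest.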
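
Several of the practices above can be combined in a few lines. This is a sketch under assumptions rather than a recommended recipe: the phone-number layout, sample strings, and variable names are invented for illustration.

```python
import re

# re.escape() backslash-escapes metacharacters so the text matches literally.
literal = re.escape("3.14")
replaced = re.sub(literal, "pi", "3.14 is not 3x14")
print(replaced)  # pi is not 3x14

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
# {3} and {4} mean "exactly three" and "exactly four" of the preceding item.
phone = re.compile(
    r"""
    (\d{3})   # area code
    -
    (\d{3})   # exchange
    -
    (\d{4})   # line number
    """,
    re.VERBOSE,
)

# A compiled pattern has its own sub() method and can be reused.
lines = ["Call 123-456-7890.", "Fax: 555-867-5309"]
formatted = [phone.sub(r"(\1) \2-\3", line) for line in lines]
print(formatted)  # ['Call (123) 456-7890.', 'Fax: (555) 867-5309']
```

Without re.escape(), the unescaped dot in "3.14" would also match the "x" in "3x14".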

By mastering the re.sub() function and regular expression syntax, you can efficiently manipulate text and perform complex replacements in your Python programs.