Regular Expressions

Learn how to use regular expressions to search, match, and manipulate text.


Advanced Regular Expression Techniques in Python

Regular expressions (regex) are a powerful tool for pattern matching and text manipulation. While basic regex syntax covers many common use cases, advanced techniques allow for more sophisticated and precise operations. This document explores advanced concepts like lookarounds, backreferences, and conditional matching in Python using the re module.

Lookarounds

Lookarounds are zero-width assertions: they test a condition at a position in the string without consuming any characters. This lets you require (or forbid) text before or after a pattern without including that text in the match itself. There are four kinds, a positive and a negative form for each direction:

Positive Lookahead (?=...): Succeeds if the subpattern ... matches starting at the current position.
Negative Lookahead (?!...): Succeeds if the subpattern ... does not match starting at the current position.
Positive Lookbehind (?<=...): Succeeds if the subpattern ... matches text ending immediately before the current position.
Negative Lookbehind (?<!...): Succeeds if the subpattern ... does not match text ending immediately before the current position.

In Python's re module, the subpattern inside either kind of lookbehind must match strings of a fixed length: variable-length quantifiers such as *, +, ?, or {m,n} are rejected, while a fixed count such as {3} is allowed.
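The zero-width behaviour and the lookbehind restriction can both be checked directly. The following is a minimal sketch; the sample strings and variable names are illustrative only.

```python
import re

# A lookahead tests what follows but does not consume it:
# the match ends before "bar" even though "bar" was required.
m = re.search(r"foo(?=bar)", "foobar")
matched_text = m.group()   # 'foo'
match_end = m.end()        # 3, so "bar" is still available to later matches

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

# A fixed repeat count is fine because the width is known in advance.
fixed = re.findall(r"(?<=a{2})b", "ab aab")   # only the "b" that follows "aa"

print(matched_text, match_end)
print(lookbehind_error)
print(fixed)
```

The compile error is raised before any text is searched, so an invalid lookbehind is caught as soon as the pattern is built.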

Examples:

Positive Lookahead:

Find the part of each word that comes before "ing" (the "ing" itself is checked but not included in the match):

import re

text = "He is singing a song, and dancing now."
pattern = r"\w+(?=ing)"  # One or more word characters that are immediately followed by "ing"
matches = re.findall(pattern, text)
print(matches)  # Output: ['sing', 'danc']
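Because a lookahead consumes nothing, several of them can be stacked at one position so that each checks an independent condition. A common illustration is a password rule; the function name and the specific rule below are assumptions made for this sketch, not part of the original text.

```python
import re

# Hypothetical rule: at least 8 characters, with a digit, a lowercase
# letter and an uppercase letter somewhere in the string.
PASSWORD_RE = re.compile(r"(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}")

def is_strong(password: str) -> bool:
    """Return True if the whole password satisfies every lookahead."""
    return PASSWORD_RE.fullmatch(password) is not None

for candidate in ["Passw0rdX", "password", "SHORT1a"]:
    print(candidate, is_strong(candidate))
```

Each lookahead starts from the same position (the beginning of the string), so the order in which the conditions are written does not affect the result.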

Negative Lookahead:

Find words that do not end in "ing". The lookahead is anchored with \b so that it is tested once, at the start of each word. An unanchored \w+(?!ing) would return every word in full, because the greedy \w+ runs to the end of the word, where whatever follows cannot be "ing":

import re

text = "He is singing a song, and dancing now, also running."
pattern = r"\b(?!\w*ing\b)\w+"  # A whole word, provided it does not end in "ing"
matches = re.findall(pattern, text)
print(matches)  # Output: ['He', 'is', 'a', 'song', 'and', 'now', 'also']

Positive Lookbehind:

Find words preceded by "a ":

import re

text = "This is a test, a sample, a demo."
pattern = r"(?<=a )\w+"  # Matches one or more word characters preceded by "a "
matches = re.findall(pattern, text)
print(matches)  # Output: ['test', 'sample', 'demo']

Negative Lookbehind:

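Find words not preceded by "a ". The \b anchor makes the lookbehind apply only at the start of a word; without it, the search would resume one character later and return fragments such as 'est':

import re

text = "This is a test, a sample, a demo."
pattern = r"\b(?<!a )\w+"  # A whole word that is not immediately preceded by "a "
matches = re.findall(pattern, text)
print(matches)  # Output: ['This', 'is', 'a', 'a', 'a']  (each article "a" is itself not preceded by "a ")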
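Lookbehinds are handy for pulling out a value that follows a fixed marker without capturing the marker itself. As a small sketch (the sample text is made up for illustration):

```python
import re

text = "Cost: $42.50, tax $3, code 777"

# Digits (with optional cents) that come directly after a dollar sign.
# The "$" is checked by the lookbehind but is not part of the result.
prices = re.findall(r"(?<=\$)\d+(?:\.\d{2})?", text)

# Numbers that are NOT part of a dollar amount: the lookbehind rejects a
# preceding "$", digit, or "." so no fragment of a price slips through.
others = re.findall(r"(?<![$\d.])\d+", text)

print(prices)  # ['42.50', '3']
print(others)  # ['777']
```

Both lookbehinds are one character wide, which satisfies the fixed-width requirement of the re module.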

Backreferences

Backreferences allow you to refer to the text matched by a previously captured group within the same regular expression. They are written \1, \2, \3, and so on, where the number identifies the capturing group by the position of its opening parenthesis, counting from left to right.

Example:

Find repeated words:

import re

text = "This is is a test test string."
pattern = r"(\b\w+)\s+\1\b"  # Matches a word (\b\w+), followed by whitespace \s+, followed by the same word \1, followed by a word boundary \b
matches = re.findall(pattern, text)
print(matches)  # Output: ['is', 'test']
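Groups can also be named, in which case the pattern refers back to them with (?P=name) instead of a number. The sketch below, with made-up sample text, uses a named backreference to make sure a quoted string closes with the same quote character that opened it.

```python
import re

text = "He said \"hello\" and then 'bye', but not \"mixed'."

# (?P<quote>...) names the group; (?P=quote) must match the same text again.
pattern = re.compile(r"(?P<quote>['\"])(?P<body>.*?)(?P=quote)")

bodies = [m.group("body") for m in pattern.finditer(text)]
print(bodies)  # ['hello', 'bye']
```

The last quoted fragment opens with a double quote and ends with a single quote, so the backreference fails and it is skipped.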

Backreferences are also useful in re.sub for replacing text based on captured groups:

import re

text = "John Smith"
pattern = r"(\w+)\s(\w+)" # Capture first and last names
replacement = r"\2, \1" # Reverse them, separated by comma
new_text = re.sub(pattern, replacement, text)
print(new_text) # Output: Smith, John
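In a replacement string, groups can also be referenced as \g<number> or \g<name>. This form is needed when a digit follows the reference, since \10 would be read as group 10. A brief sketch with illustrative inputs:

```python
import re

# Named groups make the replacement template self-describing.
name_pattern = r"(?P<first>\w+)\s(?P<last>\w+)"
swapped = re.sub(name_pattern, r"\g<last>, \g<first>", "John Smith")

# \g<1>0 means "group 1, then a literal 0"; r"\10" would refer to group 10.
times_ten = re.sub(r"(\d+)", r"\g<1>0", "7 apples, 12 pears")

print(swapped)    # Smith, John
print(times_ten)  # 70 apples, 120 pears
```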

Conditional Matching (Less Common)

Conditional matching lets you choose between two subpatterns depending on whether an earlier capturing group took part in the match. Python's standard re module supports this construct directly, so no extra installation is needed. The third-party regex module accepts the same syntax alongside its other extensions and can be installed if you want those:

Install the regex module (optional):

pip install regex

Syntax:

(?(id)yes-pattern|no-pattern)

id: The number or name of the capturing group.
yes-pattern: The pattern to match if the group id has matched.
no-pattern: The pattern to match if the group id has not matched. The |no-pattern part is optional.

Example:

Match a string only if it starts with "prefix-" followed by either a four-digit number with a leading zero or a word. Strings without the prefix are rejected.

import re

texts = ["prefix-0123", "prefix-word", "1234", "word"]

# Group 1 captures the optional prefix. If it matched, require 0ddd or a word;
# otherwise the empty negative lookahead (?!) always fails, rejecting the string.
pattern = r"^(prefix-)?(?(1)(?:0\d{3}|[A-Za-z]+)|(?!))$"

for t in texts:
    match = re.match(pattern, t)
    if match:
        print(f"'{t}' matches")
    else:
        print(f"'{t}' does not match")

# Output:
# 'prefix-0123' matches
# 'prefix-word' matches
# '1234' does not match
# 'word' does not match

Explanation:

^(prefix-)?: Matches the beginning of the string (^) followed by the optional text "prefix-", captured in group 1. The question mark makes the group optional.
(?(1)...|...): This is the conditional. It checks whether capturing group 1 (the prefix) took part in the match.
If group 1 matched:
- (?:0\d{3}|[A-Za-z]+): Matches either a zero followed by three digits (0\d{3}) or one or more letters ([A-Za-z]+). The literal 0 is what enforces the leading zero; \w+ is avoided here because it would also accept digits.
Otherwise (group 1 did not match, meaning no "prefix-"):
- (?!): An empty negative lookahead, which never succeeds, so strings without the prefix are rejected.
$: Matches the end of the string.
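A more typical use of a conditional is to require a closing delimiter only when the opening one was present. The sketch below follows the angle-bracket e-mail example given in the Python re documentation; the helper name is hypothetical and the address pattern is deliberately simplistic.

```python
import re

# Group 1 is the optional "<". If it matched, a ">" must close the address;
# if it did not, nothing may follow the address.
ADDRESS_RE = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

def is_well_formed(s: str) -> bool:
    """Return True if s is an address with balanced (or no) angle brackets."""
    return ADDRESS_RE.fullmatch(s) is not None

samples = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
for s in samples:
    print(s, is_well_formed(s))
```

Only the first two samples are accepted; the other two have an unbalanced bracket on one side.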

Conditional matching can become quite complex, but it provides a way to create very specific and flexible patterns.
