Regular Expressions
Learn how to use regular expressions to search, match, and manipulate text.
Advanced Regular Expression Techniques in Python
Regular expressions (regex) are a powerful tool for pattern matching and text manipulation. While basic regex syntax covers many common use cases, advanced techniques allow for more sophisticated and precise operations. This document explores advanced concepts like lookarounds, backreferences, and conditional matching in Python using the re module.
Lookarounds
Lookarounds are zero-width assertions: they match a position in a string without consuming any characters. They let you require that something appears (or does not appear) before or after a pattern without including it in the match itself. There are four types of lookarounds:
- Positive Lookahead (?=...): Matches if the subpattern ... matches at the current position.
- Negative Lookahead (?!...): Matches if the subpattern ... does not match at the current position.
- Positive Lookbehind (?<=...): Matches if the subpattern ... matches immediately before the current position.
- Negative Lookbehind (?<!...): Matches if the subpattern ... does not match immediately before the current position.
In Python's re module, the subpattern inside a lookbehind (positive or negative) must have a fixed width: variable-length quantifiers such as *, +, or ? cannot be used inside it.
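A short sketch of what "zero-width" and "fixed width" mean in practice, assuming Python's standard re module. The sample strings are made up for illustration.

```python
import re

# Zero-width: the lookahead checks for "bar" but does not consume it.
m = re.search(r"foo(?=bar)", "foobar")
print(m.group())  # foo
print(m.span())   # (0, 3)

# Fixed-width lookbehind: a constant-length subpattern such as \d{3}- is accepted...
fixed = re.findall(r"(?<=\d{3}-)\w+", "123-abc 45-def")
print(fixed)  # ['abc']

# ...but a variable-length one is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=\d+-)\w+")
    variable_ok = True
except re.error as exc:
    variable_ok = False
    print(f"re.error: {exc}")
```

The span (0, 3) shows that "bar" was inspected but left out of the match, so a later match could still start at it.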
Examples:
Positive Lookahead:
Find the word characters that come immediately before "ing":
import re
text = "He is singing a song, and dancing now."
pattern = r"\w+(?=ing)" # Matches one or more word characters followed by "ing"
matches = re.findall(pattern, text)
print(matches) # Output: ['sing', 'danc']
Negative Lookahead:
Match word characters at positions that are not followed by "ing". Because \w+ is greedy, it consumes each whole word first, so the lookahead is tested at the end of the word and succeeds. The pattern therefore does not filter out words ending in "ing":
import re
text = "He is singing a song, and dancing now, also running."
pattern = r"\w+(?!ing)" # One or more word characters, with no "ing" directly after the match
matches = re.findall(pattern, text)
print(matches) # Output: ['He', 'is', 'singing', 'a', 'song', 'and', 'dancing', 'now', 'also', 'running']
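To exclude whole words that end in "ing", the lookahead has to be evaluated at the start of the word rather than after \w+ has already consumed it. One possible sketch, assuming word boundaries (\b) are an acceptable definition of a word:

```python
import re

text = "He is singing a song, and dancing now, also running."

# At each word start (\b), reject the word if it ends in "ing",
# then consume the whole word.
pattern = r"\b(?!\w*ing\b)\w+"
matches = re.findall(pattern, text)
print(matches)  # ['He', 'is', 'a', 'song', 'and', 'now', 'also']
```

The leading \b matters: without it, the engine could retry from the middle of a rejected word and return fragments.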
Positive Lookbehind:
Find words preceded by "a ":
import re
text = "This is a test, a sample, a demo."
pattern = r"(?<=a )\w+" # Matches one or more word characters preceded by "a "
matches = re.findall(pattern, text)
print(matches) # Output: ['test', 'sample', 'demo']
Negative Lookbehind:
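Find words not preceded by "a ". The \b anchors the match at the start of a word; without it, the pattern would also return word tails such as "est", because positions inside "test" are not preceded by "a ":
import re
text = "This is a test, a sample, a demo."
pattern = r"\b(?<!a )\w+" # A whole word whose start is not preceded by "a "
matches = re.findall(pattern, text)
print(matches) # Output: ['This', 'is', 'a', 'a', 'a'] (the three articles are themselves not preceded by "a ")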
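Lookbehind and lookahead can also be combined to extract text between delimiters while leaving the delimiters out of the match. A small sketch; the bracketed sample text is invented for illustration:

```python
import re

text = "see [alpha] and [beta] here"

# (?<=\[) requires "[" just before the match, (?=\]) requires "]" just after.
# [^\]]+ consumes everything up to, but not including, the closing bracket.
pattern = r"(?<=\[)[^\]]+(?=\])"
inside = re.findall(pattern, text)
print(inside)  # ['alpha', 'beta']
```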
Backreferences
Backreferences allow you to refer to a previously captured group within the same regular expression. They are denoted by \1, \2, \3, and so on, where the number corresponds to the capturing group's position in the pattern (counting from left to right).
Example:
Find repeated words:
import re
text = "This is is a test test string."
pattern = r"(\b\w+)\s+\1\b" # Matches a word (\b\w+), followed by whitespace \s+, followed by the same word \1, followed by a word boundary \b
matches = re.findall(pattern, text)
print(matches) # Output: ['is', 'test']
Backreferences are also useful in re.sub for replacing text based on captured groups:
import re
text = "John Smith"
pattern = r"(\w+)\s(\w+)" # Capture first and last names
replacement = r"\2, \1" # Reverse them, separated by comma
new_text = re.sub(pattern, replacement, text)
print(new_text) # Output: Smith, John
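Numbered references get harder to read as patterns grow. Python's re also supports named groups, written (?P<name>...), with (?P=name) as the in-pattern backreference and \g<name> in replacement strings. A brief sketch; the group names word, first, and last are chosen here for illustration:

```python
import re

text = "This is is a test test string."

# Named backreference inside the pattern: (?P=word) must repeat group "word".
repeated = re.findall(r"\b(?P<word>\w+)\s+(?P=word)\b", text)
print(repeated)  # ['is', 'test']

# Collapse each doubled word to a single occurrence.
deduped = re.sub(r"\b(?P<word>\w+)\s+(?P=word)\b", r"\g<word>", text)
print(deduped)  # This is a test string.

# Named references in a replacement string.
swapped = re.sub(r"(?P<first>\w+)\s(?P<last>\w+)", r"\g<last>, \g<first>", "John Smith")
print(swapped)  # Smith, John
```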
Conditional Matching (Less Common)
Conditional matching lets you choose between two subpatterns depending on whether a capturing group has matched. Python's standard re module supports this directly with the (?(id/name)yes-pattern|no-pattern) syntax. The third-party regex module, which provides extended regex features, accepts the same syntax. The example below imports regex, but the pattern also runs unchanged with re.
Install the regex module (only needed if you want to use it instead of re):
pip install regex
Syntax:
(?(id)yes-pattern|no-pattern)
- id: The number or name of the capturing group.
- yes-pattern: The pattern to match if the group id has matched.
- no-pattern: The pattern to match if the group id has not matched. The |no-pattern part is optional.
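As a minimal illustration of the form without a no-pattern, the sketch below accepts a word that is either unquoted or wrapped in a matching pair of double quotes. It assumes the standard-library re module, whose documentation lists the (?(id/name)yes-pattern|no-pattern) syntax; the sample inputs are made up.

```python
import re

# Group 1 optionally captures an opening quote.
# (?(1)") requires a closing quote only if group 1 matched.
pattern = re.compile(r'^(")?\w+(?(1)")$')

samples = ['"word"', 'word', '"word', 'word"']
results = {s: bool(pattern.match(s)) for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {'matches' if ok else 'does not match'}")
# '"word"': matches
# 'word': matches
# '"word': does not match
# 'word"': does not match
```

When group 1 did not match and no no-pattern is given, the conditional matches the empty string, which is why the unquoted word is accepted and the stray closing quote is not.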
Example:
Match a string that starts with "prefix-" followed by either a four-digit number with a leading zero or an alphabetic word. Strings without the prefix are rejected.
import regex
texts = ["prefix-0123", "prefix-word", "1234", "word"]
# (prefix-)? optionally captures the prefix; the conditional body applies only if group 1 matched
pattern = r"^(prefix-)?(?(1)(?:0\d{3}|[A-Za-z]+))$"
for t in texts:
    match = regex.match(pattern, t)
    if match:
        print(f"'{t}' matches")
    else:
        print(f"'{t}' does not match")
# Output:
# 'prefix-0123' matches
# 'prefix-word' matches
# '1234' does not match
# 'word' does not match
Explanation:
- ^(prefix-)?: Matches the beginning of the string (^) followed by the optional group "prefix-" (captured in group 1). The question mark makes the whole prefix optional.
- (?(1)...): This is the conditional. It checks whether capturing group 1 (the prefix) matched.
- (?:0\d{3}|[A-Za-z]+): The yes-pattern, used when group 1 matched. It matches either a zero followed by three digits (0\d{3}) or one or more letters ([A-Za-z]+). Writing the zero literally is what enforces the leading zero; \d{4} alone would accept any four digits, and \w+ would accept digits as well as letters.
- There is no no-pattern, so when the prefix is absent the conditional matches the empty string. $ then fails for "1234" and "word" because characters remain. (An empty input string would match.)
- $: Matches the end of the string.
Conditional matching can become quite complex, but it provides a way to create very specific and flexible patterns.