Regular Expressions
Learn how to use regular expressions to search, match, and manipulate text.
Regular Expressions in Python
Regular expressions (often shortened to "regex" or "regexp") are sequences of characters that define a search pattern. They are powerful tools for searching, matching, and manipulating text based on patterns. Python's re module provides a robust way to work with regular expressions.
The re Module
To use regular expressions in Python, import the re module:
import re
Key Functions in the re Module
The re module provides several key functions for working with regular expressions. Here's an overview of the most commonly used ones:
re.search(pattern, string, flags=0)
The re.search() function searches the string for the first occurrence of the pattern. If the pattern is found, it returns a match object; otherwise, it returns None. The optional flags argument can be used to modify how the pattern is matched (e.g., case-insensitive matching).
Example:
import re
text = "The quick brown fox jumps over the lazy dog."
pattern = "fox"
match = re.search(pattern, text)
if match:
    print("Pattern found:", match.group())  # Output: Pattern found: fox
    print("Start index:", match.start())  # Output: Start index: 16
    print("End index:", match.end())  # Output: End index: 19
else:
    print("Pattern not found.")
re.match(pattern, string, flags=0)
The re.match() function attempts to match the pattern at the beginning of the string. If the pattern matches at the beginning, it returns a match object; otherwise, it returns None. Like re.search(), it also accepts optional flags.
Example:
import re
text = "The quick brown fox jumps over the lazy dog."
pattern = "The"
pattern2 = "quick"
match = re.match(pattern, text)
match2 = re.match(pattern2, text)
if match:
    print("Pattern found at the beginning:", match.group())  # Output: Pattern found at the beginning: The
else:
    print("Pattern not found at the beginning.")
if match2:
    print("Pattern found at the beginning:", match2.group())
else:
    print("Pattern not found at the beginning.")  # Printed for "quick", which is not at the start of the text
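The difference between re.match() and re.search() is easiest to see when the same pattern is passed to both. The following is a minimal sketch reusing the sentence from the example above; the variable names at_start and anywhere are chosen here for illustration only.

```python
import re

text = "The quick brown fox jumps over the lazy dog."

# re.match() only tries to match at position 0, so "quick" is not found.
at_start = re.match("quick", text)

# re.search() scans forward and returns the first occurrence anywhere.
anywhere = re.search("quick", text)

print(at_start)          # None
print(anywhere.group())  # quick
print(anywhere.start())  # 4
```

If the pattern may appear anywhere in the text, re.search() is usually the function to reach for; re.match() fits cases where the text must start with the pattern.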
re.findall(pattern, string, flags=0)
The re.findall() function returns a list of all non-overlapping matches of the pattern in the string. If the pattern contains a single capturing group, it returns a list of the strings matched by that group. If it contains multiple capturing groups, it returns a list of tuples, where each tuple contains the matches for each group. If no capturing groups are present, it returns a list of the matched strings.
Example:
import re
text = "The cat sat on the mat. Another cat is here."
pattern = "cat"
matches = re.findall(pattern, text)
print("All matches:", matches) # Output: All matches: ['cat', 'cat']
# Example with capturing groups
text = "user1@example.com, user2@domain.net"
pattern = r"(\w+)@(\w+\.\w+)" # Capture username and domain
matches = re.findall(pattern, text)
print("Matches with groups:", matches)  # Output: Matches with groups: [('user1', 'example.com'), ('user2', 'domain.net')]
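The shape of the list returned by re.findall() depends on how many capturing groups the pattern has. With exactly one group, the result is a list of strings for that group rather than a list of tuples. The sketch below illustrates this with the same sample addresses; the non-capturing group syntax (?:...) is a standard re feature not covered elsewhere on this page, and the variable names are illustrative.

```python
import re

text = "user1@example.com, user2@domain.net"

# One capturing group: each result is the text of that group only.
usernames = re.findall(r"(\w+)@\w+\.\w+", text)
print(usernames)  # ['user1', 'user2']

# A non-capturing group (?:...) groups without capturing,
# so findall() returns the whole match for each address.
addresses = re.findall(r"(?:\w+)@\w+\.\w+", text)
print(addresses)  # ['user1@example.com', 'user2@domain.net']
```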
re.sub(pattern, repl, string, count=0, flags=0)
The re.sub() function returns a new string in which the non-overlapping occurrences of the pattern in the string are replaced with the replacement (the repl argument). The count argument specifies the maximum number of replacements to make. If count is 0 (the default), all occurrences are replaced. The replacement can be a string or a function; a function is called with each match object and must return the replacement string.
Example:
import re
text = "The quick brown fox jumps over the lazy dog."
pattern = "fox"
replacement = "wolf"
new_text = re.sub(pattern, replacement, text)
print("Original text:", text) # Output: Original text: The quick brown fox jumps over the lazy dog.
print("Modified text:", new_text) # Output: Modified text: The quick brown wolf jumps over the lazy dog.
# Example with count
text2 = "apple banana apple cherry apple"
pattern2 = "apple"
replacement2 = "orange"
new_text2 = re.sub(pattern2, replacement2, text2, count=2)
print("Original text:", text2)  # Output: Original text: apple banana apple cherry apple
print("Modified text:", new_text2)  # Output: Modified text: orange banana orange cherry apple
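The examples above pass a string as the replacement. When the replacement is a function, re.sub() calls it once per match with the match object and uses the returned string. Here is a minimal sketch; the helper double_number and the sample text are hypothetical and exist only for illustration.

```python
import re

def double_number(match):
    # Hypothetical helper: receives a match object and
    # returns the replacement text for that match.
    return str(int(match.group()) * 2)

text = "3 apples and 12 oranges"
result = re.sub(r"\d+", double_number, text)
print(result)  # 6 apples and 24 oranges
```

A function replacement is useful when the new text has to be computed from what was matched, which a fixed replacement string cannot do.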
Regular Expression Syntax
Regular expressions use a special syntax to define patterns. Here are some common elements:
. (dot): Matches any single character (except newline).
*: Matches zero or more occurrences of the preceding character or group.
+: Matches one or more occurrences of the preceding character or group.
?: Matches zero or one occurrence of the preceding character or group.
[]: Defines a character class (e.g., [a-z] matches any lowercase letter).
[^]: Negates a character class (e.g., [^0-9] matches any character that is not a digit).
\d: Matches any digit (0-9).
\D: Matches any non-digit character.
\w: Matches any word character (alphanumeric and underscore).
\W: Matches any non-word character.
\s: Matches any whitespace character (space, tab, newline).
\S: Matches any non-whitespace character.
^: Matches the beginning of the string.
$: Matches the end of the string.
|: Acts as an "or" operator.
(): Defines a capturing group.
\: Used to escape special characters or represent character classes (e.g., \. matches a literal dot).
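Several of these elements can be tried out on one short string. The following sketch uses a made-up sample sentence and illustrative variable names; it is meant as a quick way to see the elements in action, not as an exhaustive reference.

```python
import re

text = "Order 66 shipped on 2024-05-17 to cat and dog owners."

# \d and + : runs of one or more digits
digits = re.findall(r"\d+", text)
print(digits)  # ['66', '2024', '05', '17']

# | : either alternative
animals = re.findall(r"cat|dog", text)
print(animals)  # ['cat', 'dog']

# [] , \w and * : an uppercase letter followed by zero or more word characters
capitalized = re.findall(r"[A-Z]\w*", text)
print(capitalized)  # ['Order']

# () : capturing groups give access to parts of the match
date = re.search(r"(\d+)-(\d+)-(\d+)", text)
print(date.groups())  # ('2024', '05', '17')

# ^ , $ and an escaped dot: anchors at the start and end of the string
starts_with_order = re.search(r"^Order", text) is not None
ends_with_dot = re.search(r"\.$", text) is not None
print(starts_with_order, ends_with_dot)  # True True

# ? : the preceding character is optional
spellings = re.findall(r"colou?r", "color colour")
print(spellings)  # ['color', 'colour']
```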
Remember to use raw strings (r"...") for your regular expressions, especially when they contain backslashes. This prevents Python from interpreting backslashes as escape sequences.
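A short sketch of why raw strings matter: it uses \b, which the re module treats as a word boundary (an element not listed above) but which an ordinary Python string literal turns into a backspace character. The sample text and variable names are illustrative.

```python
import re

# In a normal string, "\b" is a single backspace character.
# In a raw string, r"\b" is a backslash followed by "b",
# which the re module interprets as a word boundary.
print(len("\b"), len(r"\b"))  # 1 2

text = "cat concatenate cat"

without_raw = re.findall("\bcat\b", text)  # looks for backspace + "cat" + backspace
with_raw = re.findall(r"\bcat\b", text)    # looks for "cat" as a whole word

print(without_raw)  # []
print(with_raw)     # ['cat', 'cat']
```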
Example: Validating Email Addresses
Here's an example of using regular expressions to validate email addresses:
import re
def is_valid_email(email):
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    return re.match(pattern, email) is not None
email1 = "test@example.com"
email2 = "invalid-email"
print(f"{email1}: {is_valid_email(email1)}") # Output: test@example.com: True
print(f"{email2}: {is_valid_email(email2)}") # Output: invalid-email: False
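One detail worth knowing about the pattern above: $ also matches just before a newline at the very end of the string, so an address followed by "\n" still passes. A possible variant, sketched below, uses re.fullmatch(), which requires the entire string to match. The names EMAIL_PATTERN and is_valid_email_strict are hypothetical, and neither version is a complete email validator.

```python
import re

# Same character classes as the pattern above, without ^ and $,
# because re.fullmatch() already requires the whole string to match.
EMAIL_PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

def is_valid_email_strict(email):
    # Hypothetical stricter variant of is_valid_email()
    return re.fullmatch(EMAIL_PATTERN, email) is not None

anchored = r"^" + EMAIL_PATTERN + r"$"

# With re.match() and $, a trailing newline is tolerated.
print(re.match(anchored, "test@example.com\n") is not None)  # True

# With re.fullmatch(), it is rejected.
print(is_valid_email_strict("test@example.com\n"))  # False
print(is_valid_email_strict("test@example.com"))    # True
```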
Flags
Regular expression operations can be further customized by specifying flags. Some common flags include:
re.IGNORECASE or re.I: Perform case-insensitive matching.
re.MULTILINE or re.M: Enable multi-line matching, allowing ^ and $ to match the beginning and end of each line.
re.DOTALL or re.S: Make the . (dot) match any character, including newline characters.
Example using re.IGNORECASE:
import re
text = "Python is a great language."
pattern = "python"
match = re.search(pattern, text)
print(f"Case-sensitive search: {match}")  # Output: Case-sensitive search: None
match = re.search(pattern, text, re.IGNORECASE)
print(f"Case-insensitive search: {match.group() if match else None}") # Case-insensitive search: Python
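The other two flags, re.MULTILINE and re.DOTALL, can be illustrated in the same way. The sketch below uses a made-up three-line string and illustrative variable names; it also shows that flags can be combined with the | operator.

```python
import re

text = "first line\nsecond line\nthird line"

# Without re.MULTILINE, ^ matches only at the very start of the string.
first_only = re.findall(r"^\w+", text)
each_line = re.findall(r"^\w+", text, re.MULTILINE)
print(first_only)  # ['first']
print(each_line)   # ['first', 'second', 'third']

# Without re.DOTALL, . does not match a newline, so the search fails.
no_dotall = re.search(r"first.*second", text)
with_dotall = re.search(r"first.*second", text, re.DOTALL)
print(no_dotall)                  # None
print(repr(with_dotall.group()))  # 'first line\nsecond'

# Flags can be combined with the | operator.
combined = re.findall(r"^SECOND", text, re.IGNORECASE | re.MULTILINE)
print(combined)  # ['second']
```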
Regular expressions are a powerful tool for text processing. By mastering the syntax and functions of the re module, you can efficiently search, match, and manipulate text in Python.