Regular Expressions
Learn how to use regular expressions to search, match, and manipulate text.
Grouping and Capturing in Python Regular Expressions
What is Grouping and Capturing?
In regular expressions, grouping lets you treat multiple characters as a single unit. This is achieved using parentheses (). Beyond treating them as a unit, grouping also enables capturing: the portion of the input string that matches the group can be extracted and used later.
Capturing is crucial when you need to extract specific parts of a string that match a pattern. For example, you might want to extract the day, month, and year from a date string.
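To make the "single unit" idea concrete, here is a minimal sketch that applies a quantifier to a group. The sample string and the variable names are illustrative assumptions, not part of the original tutorial.

```python
import re

# Without parentheses, + repeats only the preceding character "a".
no_group = re.fullmatch(r"ha+", "hahaha")

# With parentheses, + repeats the whole unit "ha".
grouped = re.fullmatch(r"(ha)+", "hahaha")

print(no_group)          # None: "ha+" means "h" followed by one or more "a"
print(grouped.group(0))  # hahaha
print(grouped.group(1))  # ha (a repeated group keeps only its last repetition)
```

Note that when a capturing group is repeated, only the text from its final repetition is available afterwards.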
How to Use Parentheses for Grouping and Capturing
Parentheses () are the key to both grouping and capturing. Any part of a regular expression enclosed in parentheses is a group, and by default it is also captured. Here's a basic example:
import re
text = "My phone number is 123-456-7890."
pattern = r"(\d{3})-(\d{3})-(\d{4})"  # Groups: area code, prefix, line number
match = re.search(pattern, text)
if match:
    print("Full match:", match.group(0))   # Entire match
    print("Area code:", match.group(1))    # First group (area code)
    print("Prefix:", match.group(2))       # Second group (prefix)
    print("Line number:", match.group(3))  # Third group (line number)
    # Accessing groups via groups() method
    area_code, prefix, line_number = match.groups()
    print("Area code (using groups()):", area_code)
else:
    print("No match found.")
In this example, the regular expression (\d{3})-(\d{3})-(\d{4}) is broken down as follows:
- \d{3} matches exactly three digits, and \d{4} matches exactly four.
- The - between the groups matches a literal hyphen character.
- Each (\d{3}) and the final (\d{4}) is a capturing group, capturing three digits, three digits, and four digits respectively.
The match.group(n) method is used to access the captured groups, where n is the group number (starting from 1). match.group(0) represents the entire matched string. The match.groups() method returns a tuple containing all captured groups.
More Examples
Extracting Date Components
import re
date_string = "Today is 2023-10-27."
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"
date_match = re.search(date_pattern, date_string)
if date_match:
    year, month, day = date_match.groups()
    print(f"Year: {year}, Month: {month}, Day: {day}")
else:
    print("Date not found.")
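Captured groups are always strings, and \d{2} happily matches values such as 13 or 45 that are not valid months or days. As a rough sketch of one way to handle this, the hypothetical helper below (parse_iso_date is a name invented for this illustration) converts the captures to integers and lets datetime.date reject impossible dates.

```python
import re
from datetime import date

def parse_iso_date(text):
    """Return a datetime.date for the first YYYY-MM-DD in text, or None."""
    m = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
    if m is None:
        return None
    # groups() returns strings, so convert each capture to int.
    year, month, day = (int(part) for part in m.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The digits matched the pattern but do not form a real date.
        return None

print(parse_iso_date("Today is 2023-10-27."))  # 2023-10-27
print(parse_iso_date("Bogus: 2023-13-45"))     # None
```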
Extracting Email Usernames and Domains
import re
email = "user@example.com"
# Simplified pattern: \w+ does not match dots, hyphens, or plus signs,
# so it is not a general-purpose email validator.
email_pattern = r"(\w+)@(\w+\.\w+)"
email_match = re.search(email_pattern, email)
if email_match:
    username, domain = email_match.groups()
    print(f"Username: {username}, Domain: {domain}")
else:
    print("Invalid email format.")
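The examples so far extract groups from a single match. When a string contains several matches, re.findall returns one tuple of groups per match, and re.finditer yields full match objects. The sketch below reuses the simplified email pattern; the sample text is made up for illustration.

```python
import re

text = "Contact alice@example.com or bob@test.org for details."
email_pattern = r"(\w+)@(\w+\.\w+)"

# With two or more groups, findall returns a list of tuples of captures.
pairs = re.findall(email_pattern, text)
print(pairs)  # [('alice', 'example.com'), ('bob', 'test.org')]

# finditer gives match objects, so group() and span() remain available.
for m in re.finditer(email_pattern, text):
    print(m.group(1), m.group(2), m.span())
```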
Non-Capturing Groups
Sometimes you need parentheses for grouping but don't want to capture the matched text. You can create a non-capturing group using (?:...). This keeps group numbering simple when you only need certain parts of the matched text, and it can save the engine a small amount of work because nothing is recorded for that group.
import re
text = "Protocol: http, Port: 80"
pattern = r"(?:Protocol: )(\w+), (?:Port: )(\d+)"
match = re.search(pattern, text)
if match:
    protocol, port = match.groups()
    print(f"Protocol: {protocol}, Port: {port}")
else:
    print("No match found.")
In this example, (?:Protocol: ) and (?:Port: ) are non-capturing groups. They match the "Protocol: " and "Port: " prefixes, but the prefixes are not captured. Only the protocol and port values are captured. (Literal text outside any group is never captured either, so the non-capturing groups here mainly illustrate the syntax.)
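Non-capturing groups matter most when the parentheses are needed for alternation or for a quantifier. The following sketch, using made-up URLs, groups the scheme alternatives and an optional port section without adding them to the numbered captures.

```python
import re

# (?:https?|ftp) groups the alternatives without creating a capture.
# (?::(\d+))? makes the whole ":port" part optional but captures only the digits.
pattern = r"(?:https?|ftp)://([\w.-]+)(?::(\d+))?"

with_port = re.search(pattern, "Fetched https://example.com:8080/index.html")
print(with_port.groups())  # ('example.com', '8080')

no_port = re.search(pattern, "see ftp://files.example.org/pub")
print(no_port.groups())    # ('files.example.org', None)
```

Only the host and port are numbered groups here, and a group that did not participate in the match is reported as None.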
Named Groups
You can also assign names to your capturing groups for better readability and maintainability. This is done using the syntax (?P<name>...).
import re
log_line = "127.0.0.1 - frank [10/Oct/2023:13:55:36 +0000] \"GET /index.html HTTP/1.1\" 200 1038"
pattern = r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (?P<user>\w+) \[(?P<timestamp>[^\]]+)\] \"(?P<request>[^\"]+)\" (?P<status>\d+) (?P<size>\d+)"
match = re.search(pattern, log_line)
if match:
    print("IP Address:", match.group("ip"))
    print("User:", match.group("user"))
    print("Timestamp:", match.group("timestamp"))
    print("Request:", match.group("request"))
    print("Status Code:", match.group("status"))
    print("Size:", match.group("size"))
    # Accessing named groups using match.groupdict()
    log_data = match.groupdict()
    print("Log data dictionary:", log_data)
else:
    print("No match found.")
Using named groups makes your code more readable and easier to understand, especially when dealing with complex regular expressions.
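Named groups can also be referenced in a replacement string, which is one way captured text gets "used later". As a small sketch (the sample text and the target date format are assumptions chosen for illustration), re.sub accepts \g<name> to insert a named capture:

```python
import re

iso = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
text = "Released 2023-10-27, patched 2023-11-03."

# \g<name> in the replacement refers to the named group from each match.
result = iso.sub(r"\g<day>/\g<month>/\g<year>", text)
print(result)  # Released 27/10/2023, patched 03/11/2023.
```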