Regular Expressions

Learn how to use regular expressions to search, match, and manipulate text.


Grouping and Capturing in Python Regular Expressions

What is Grouping and Capturing?

In regular expressions, grouping lets you treat several characters as a single unit, so that a quantifier or an alternation applies to the whole unit rather than to a single character. Groups are written with parentheses (). Grouping also enables capturing: the portion of the input string that matches the group is stored and can be extracted or referenced later.

Capturing is crucial when you need to extract specific parts of a string that match a pattern. For example, you might want to extract the day, month, and year from a date string.
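To make the "single unit" idea concrete, here is a minimal sketch comparing a quantifier applied with and without parentheses. The sample string "abab" is only an illustration.

```python
import re

# Without parentheses, + applies only to the preceding character "b",
# so the pattern means "a" followed by one or more "b".
no_group = re.fullmatch(r"ab+", "abab")

# With parentheses, + applies to the whole unit "ab".
with_group = re.fullmatch(r"(ab)+", "abab")

print(no_group)             # None
print(with_group.group(0))  # abab
# A repeated capturing group keeps only its last repetition.
print(with_group.group(1))  # ab
```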

How to Use Parentheses for Grouping and Capturing

Parentheses () are the key to both grouping and capturing. Any part of a regular expression enclosed in parentheses is considered a group, and by default, it's also captured. Here's a basic example:

import re

text = "My phone number is 123-456-7890."
pattern = r"(\d{3})-(\d{3})-(\d{4})"  # Groups: area code, prefix, line number

match = re.search(pattern, text)

if match:
    print("Full match:", match.group(0))   # Entire match
    print("Area code:", match.group(1))  # First group (area code)
    print("Prefix:", match.group(2))     # Second group (prefix)
    print("Line number:", match.group(3))  # Third group (line number)

    # Accessing groups via groups() method
    area_code, prefix, line_number = match.groups()
    print("Area code (using groups()):", area_code)

else:
    print("No match found.") 

In this example, the regular expression (\d{3})-(\d{3})-(\d{4}) is broken down as follows:

  • \d{3} matches exactly three digits.
  • - matches the hyphen character.
  • Each (\d{3}) and (\d{4}) is a capturing group, capturing three digits, three digits, and four digits respectively.

The match.group(n) method is used to access the captured groups, where n is the group number (starting from 1). match.group(0) represents the entire matched string. The match.groups() method returns a tuple containing all captured groups, starting from group 1; the entire match is not included.
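Two related behaviors are worth seeing side by side: group() accepts several group numbers at once, and re.findall() returns tuples when the pattern has more than one capturing group. The following is a small sketch; the second phone number in the sample text is made up for illustration.

```python
import re

text = "Office: 123-456-7890, Mobile: 987-654-3210"
pattern = r"(\d{3})-(\d{3})-(\d{4})"

first = re.search(pattern, text)

# group() with several arguments returns a tuple of those groups.
area_and_line = first.group(1, 3)
print(area_and_line)  # ('123', '7890')

# With more than one capturing group, findall() returns one tuple per match.
all_numbers = re.findall(pattern, text)
print(all_numbers)  # [('123', '456', '7890'), ('987', '654', '3210')]
```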

More Examples

Extracting Date Components

import re

date_string = "Today is 2023-10-27."
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

date_match = re.search(date_pattern, date_string)

if date_match:
    year, month, day = date_match.groups()
    print(f"Year: {year}, Month: {month}, Day: {day}")
else:
    print("Date not found.") 
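Captured groups can also be reused rather than just printed. As a sketch under the same date pattern, the example below reorders the components with re.sub() backreferences and, separately, converts them to a datetime.date, which also rejects impossible dates. The output format "DD/MM/YYYY" is an arbitrary choice for illustration.

```python
import re
from datetime import date

date_string = "Today is 2023-10-27."
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

# In the replacement, \1, \2 and \3 refer to the captured year, month and day.
reordered = re.sub(date_pattern, r"\3/\2/\1", date_string)
print(reordered)  # Today is 27/10/2023.

# Captured groups are always strings; convert them before doing arithmetic.
date_match = re.search(date_pattern, date_string)
parsed = date(*map(int, date_match.groups()))  # raises ValueError for e.g. month 13
print(parsed.isoformat())  # 2023-10-27
```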

Extracting Email Usernames and Domains

import re

email = "user@example.com"
email_pattern = r"(\w+)@(\w+\.\w+)"

email_match = re.search(email_pattern, email)

if email_match:
    username, domain = email_match.groups()
    print(f"Username: {username}, Domain: {domain}")
else:
    print("Invalid email format.") 
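Note that re.search() accepts a match anywhere in the string, so the simple pattern above can silently capture only part of an address. The sketch below shows that behavior and one possible tightening with re.fullmatch(). The broader pattern is an assumption made for illustration and is still far from a complete email validator.

```python
import re

address = "user.name@example.co.uk"

# The simple pattern matches only a fragment: \w does not include "."
simple = re.search(r"(\w+)@(\w+\.\w+)", address)
print(simple.groups())  # ('name', 'example.co')

# Illustrative broader pattern: dots, plus signs and hyphens in the username,
# and one or more dot-separated labels in the domain (non-capturing repeat).
broader_pattern = r"([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)"

# fullmatch() requires the whole string to match, not just a part of it.
broader = re.fullmatch(broader_pattern, address)
print(broader.groups())  # ('user.name', 'example.co.uk')

rejected = re.fullmatch(broader_pattern, "not an email")
print(rejected)  # None
```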

Non-Capturing Groups

Sometimes you need parentheses for grouping but don't want to capture the matched text. You can create a non-capturing group using (?:...). Non-capturing groups are not numbered and do not appear in match.groups(), which keeps the numbering of the groups you do care about simple. They also spare the regex engine from storing text you will never read, although the performance difference is usually small.

import re

text = "Protocol: http, Port: 80"
pattern = r"(?:Protocol: )(\w+), (?:Port: )(\d+)"

match = re.search(pattern, text)

if match:
    protocol, port = match.groups()
    print(f"Protocol: {protocol}, Port: {port}")
else:
    print("No match found.") 

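In this example, (?:Protocol: ) and (?:Port: ) are non-capturing groups. They match the "Protocol: " and "Port: " prefixes, but the prefixes are not captured, so match.groups() contains only the protocol and port values. Because no quantifier or alternation is applied to these prefixes, the pattern would match the same text without the (?:...) wrappers; they are used here only to demonstrate the syntax.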
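Non-capturing groups matter most when the parentheses are genuinely required, for instance around an alternation or a repeated sub-pattern. The following sketch compares the capturing and non-capturing forms; the URL and host strings are invented sample data.

```python
import re

text = "Mirror: ftp://files.example.org/pub"

# The alternation needs parentheses either way; only the capture differs.
capturing = re.search(r"(https?|ftp)://([\w.-]+)", text)
non_capturing = re.search(r"(?:https?|ftp)://([\w.-]+)", text)

print(capturing.groups())      # ('ftp', 'files.example.org')
print(non_capturing.groups())  # ('files.example.org',)

# A repeated sub-pattern: (?:...){3} repeats "digits plus dot" three times
# without creating an extra group; the outer parentheses capture the address.
ip_match = re.search(r"((?:\d{1,3}\.){3}\d{1,3})", "Host 192.168.0.1 is up")
print(ip_match.groups())  # ('192.168.0.1',)
```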

Named Groups

You can also assign names to your capturing groups for better readability and maintainability. This is done using the syntax (?P<name>...).

import re

log_line = '127.0.0.1 - frank [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1038'
pattern = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (?P<user>\w+) \[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>\d+) (?P<size>\d+)'

match = re.search(pattern, log_line)

if match:
    print("IP Address:", match.group("ip"))
    print("User:", match.group("user"))
    print("Timestamp:", match.group("timestamp"))
    print("Request:", match.group("request"))
    print("Status Code:", match.group("status"))
    print("Size:", match.group("size"))

    # Accessing named groups using match.groupdict()
    log_data = match.groupdict()
    print("Log data dictionary:", log_data)
else:
    print("No match found.") 

Using named groups makes your code more readable and easier to understand, especially when dealing with complex regular expressions.
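Named groups can also be referenced in other places than match.group(). As a brief sketch, the example below shows that named groups keep their numbers, that a match object supports subscript access (Python 3.6 and later), that re.sub() accepts \g<name> in the replacement, and that (?P=name) refers back to a named group inside the pattern itself. The sample strings are illustrative only.

```python
import re

text = "Protocol: http, Port: 80"
pattern = r"Protocol: (?P<protocol>\w+), Port: (?P<port>\d+)"

match = re.search(pattern, text)

# Named groups are still numbered, so both access styles work.
same_value = match.group(1) == match.group("protocol")
print(same_value)     # True
print(match["port"])  # 80

# \g<name> in a replacement string inserts the text of a named group.
swapped = re.sub(pattern, r"\g<port>/\g<protocol>", text)
print(swapped)  # 80/http

# (?P=name) inside a pattern matches the same text the named group matched,
# which is handy for finding accidentally doubled words.
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "this is is a test")
print(doubled.group("word"))  # is
```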