Regular Expressions

Learn how to use regular expressions to search, match, and manipulate text.


Regular Expressions in Python

Regular expressions (often shortened to "regex" or "regexp") are sequences of characters that define a search pattern. They are powerful tools for searching, matching, and manipulating text based on patterns. Python's re module provides a robust way to work with regular expressions.

The re Module

To use regular expressions in Python, you need to import the re module:

import re

Key Functions in the re Module

The re module provides several key functions for working with regular expressions. Here's an overview of the most commonly used ones:

re.search(pattern, string, flags=0)

The re.search() function searches the string for the first occurrence of the pattern. If the pattern is found, it returns a match object; otherwise, it returns None. The optional flags argument can be used to modify how the pattern is matched (e.g., case-insensitive matching).

Example:

import re

text = "The quick brown fox jumps over the lazy dog."
pattern = "fox"

match = re.search(pattern, text)

if match:
    print("Pattern found:", match.group())  # Output: Pattern found: fox
    print("Start index:", match.start())    # Output: Start index: 16
    print("End index:", match.end())      # Output: End index: 19
else:
    print("Pattern not found.") 

re.match(pattern, string, flags=0)

The re.match() function attempts to match the pattern at the beginning of the string. If the pattern matches at the beginning, it returns a match object; otherwise, it returns None. Like re.search(), it also accepts optional flags.

Example:

import re

text = "The quick brown fox jumps over the lazy dog."
pattern = "The"
pattern2 = "quick"

match = re.match(pattern, text)
match2 = re.match(pattern2, text)


if match:
    print("Pattern found at the beginning:", match.group())  # Output: Pattern found at the beginning: The
else:
    print("Pattern not found at the beginning.")

if match2:
    print("Pattern found at the beginning:", match2.group())
else:
    print("Pattern not found at the beginning.")  # This is printed for "quick", which is not at the start
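The difference between re.match() and re.search() is easiest to see when both run against the same text. The sketch below also uses re.fullmatch(), a standard re function that is not covered in this article; it succeeds only when the pattern accounts for the entire string. The variable names are illustrative.

```python
import re

text = "The quick brown fox"

# re.match() only succeeds if the pattern matches starting at index 0.
at_start = re.match("quick", text)

# re.search() scans the whole string and returns the first match it finds.
anywhere = re.search("quick", text)

# re.fullmatch() requires the pattern to cover the string from start to end.
whole = re.fullmatch(r"The \w+ \w+ fox", text)

print(at_start)           # None
print(anywhere.span())    # (4, 9)
print(whole is not None)  # True
```

As a rule of thumb, re.search() fits "does this appear anywhere?", re.match() fits "does the text start with this?", and re.fullmatch() fits "is the whole text of this form?".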

re.findall(pattern, string, flags=0)

The re.findall() function returns a list of all non-overlapping matches of the pattern in the string, in the order they are found. The shape of each item depends on the capturing groups in the pattern: with no groups, each item is the full matched string; with exactly one group, each item is the string captured by that group; with two or more groups, each item is a tuple containing one string per group.

Example:

import re

text = "The cat sat on the mat. Another cat is here."
pattern = "cat"

matches = re.findall(pattern, text)

print("All matches:", matches)  # Output: All matches: ['cat', 'cat']

# Example with capturing groups
text = "user1@example.com, user2@domain.net"
pattern = r"(\w+)@(\w+\.\w+)"  # Capture username and domain

matches = re.findall(pattern, text)

print("Matches with groups:", matches)  # Output: Matches with groups: [('user1', 'example.com'), ('user2', 'domain.net')]
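How the return value of re.findall() changes with the number of capturing groups can be sketched by running three variants of the same pattern. The last lines use re.finditer(), a standard re function not covered above, which yields match objects instead of strings and so also exposes positions. The sample text is the one from the previous example.

```python
import re

text = "user1@example.com, user2@domain.net"

# No groups: each item is the full matched string.
full = re.findall(r"\w+@\w+\.\w+", text)

# One group: each item is the string captured by that group.
names = re.findall(r"(\w+)@\w+\.\w+", text)

# Two or more groups: each item is a tuple with one entry per group.
pairs = re.findall(r"(\w+)@(\w+\.\w+)", text)

# re.finditer() yields match objects, so start/end positions are available.
spans = [m.span() for m in re.finditer(r"\w+@\w+\.\w+", text)]

print(full)   # ['user1@example.com', 'user2@domain.net']
print(names)  # ['user1', 'user2']
print(pairs)  # [('user1', 'example.com'), ('user2', 'domain.net')]
print(spans)  # [(0, 17), (19, 35)]
```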

re.sub(pattern, replacement, string, count=0, flags=0)

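The re.sub() function returns a new string in which occurrences of the pattern are replaced by the replacement; the original string is not modified. The count argument specifies the maximum number of replacements to make. If count is 0 (the default), all occurrences are replaced. The replacement can be a string, which may contain backreferences such as \1 to captured groups, or a function that takes a match object and returns the replacement string.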

Example:

import re

text = "The quick brown fox jumps over the lazy dog."
pattern = "fox"
replacement = "wolf"

new_text = re.sub(pattern, replacement, text)

print("Original text:", text)           # Output: Original text: The quick brown fox jumps over the lazy dog.
print("Modified text:", new_text)     # Output: Modified text: The quick brown wolf jumps over the lazy dog.

# Example with count
text2 = "apple banana apple cherry apple"
pattern2 = "apple"
replacement2 = "orange"

new_text2 = re.sub(pattern2, replacement2, text2, count=2)

print("Original text:", text2)      # Output: Original text: apple banana apple cherry apple
print("Modified text:", new_text2)  # Output: Modified text: orange banana orange cherry apple
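The description of re.sub() notes that the replacement can be a function, which the examples above do not show. The following is a minimal sketch of both a backreference replacement and a function replacement; the sample strings and the helper name double are made up for illustration.

```python
import re

dates = "2024-01-15 and 2023-12-31"

# Backreferences (\1, \2, \3) in the replacement refer to the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", dates)

# A function replacement is called once per match with the match object
# and must return the replacement string.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

print(swapped)  # 15/01/2024 and 31/12/2023
print(doubled)  # 6 apples and 20 pears
```

A function replacement is useful when the new text has to be computed from the matched text rather than written out in advance.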

Regular Expression Syntax

Regular expressions use a special syntax to define patterns. Here are some common elements:

  • . (dot): Matches any single character (except newline).
  • *: Matches zero or more occurrences of the preceding character or group.
  • +: Matches one or more occurrences of the preceding character or group.
  • ?: Matches zero or one occurrence of the preceding character or group.
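  • [...]: Defines a character class (e.g., [a-z] matches any lowercase ASCII letter).
  • [^...]: Negates a character class (e.g., [^0-9] matches any character that is not an ASCII digit).
  • \d: Matches any decimal digit (0-9; for str patterns this also includes other Unicode digits unless the re.ASCII flag is set).
  • \D: Matches any non-digit character.
  • \w: Matches any word character (letters, digits, and underscore).
  • \W: Matches any non-word character.
  • \s: Matches any whitespace character (space, tab, newline, and others such as carriage return).
  • \S: Matches any non-whitespace character.
  • ^: Matches the beginning of the string (and of each line when re.MULTILINE is set).
  • $: Matches the end of the string, or just before a newline at the end of the string (and at the end of each line when re.MULTILINE is set).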
  • |: Acts as an "or" operator.
  • (): Defines a capturing group.
  • \: Used to escape special characters or represent character classes (e.g., \. matches a literal dot).
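The elements in the list above can be tried out one at a time. The sketch below runs re.findall() over a set of small, made-up sample strings, one per element, and then shows a capturing group with re.search(); the expected results are given in the comments.

```python
import re

# (pattern, text) pairs, each exercising one element from the list above.
examples = [
    (r"c.t", "cat cot c\nt"),      # ['cat', 'cot']  (. does not match newline)
    (r"ab*", "a ab abbb"),         # ['a', 'ab', 'abbb']
    (r"ab+", "a ab abbb"),         # ['ab', 'abbb']
    (r"colou?r", "color colour"),  # ['color', 'colour']
    (r"[a-c]", "abcd"),            # ['a', 'b', 'c']
    (r"[^0-9]", "a1b2"),           # ['a', 'b']
    (r"\d+", "10 cats, 3 dogs"),   # ['10', '3']
    (r"\w+", "hi_there, you!"),    # ['hi_there', 'you']
    (r"\S+", "a b\tc"),            # ['a', 'b', 'c']
    (r"^The|end$", "The end"),     # ['The', 'end']
    (r"cat|dog", "cat dog bird"),  # ['cat', 'dog']
    (r"3\.14", "3.14 3x14"),       # ['3.14']  (\. is a literal dot)
]

results = [re.findall(pattern, text) for pattern, text in examples]
for (pattern, _), found in zip(examples, results):
    print(pattern, "->", found)

# (): each capturing group is available through the match object.
groups = re.search(r"(\d+)-(\d+)", "pages 10-20").groups()
print(groups)  # ('10', '20')
```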

Remember to use raw strings (r"...") for your regular expressions, especially when they contain backslashes. This prevents Python from interpreting backslashes as escape sequences.
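The effect of raw strings can be seen with a pattern that uses \b. In a regular expression \b is a word boundary (an element not listed above), but in an ordinary Python string literal "\b" is the backspace character. This is a small sketch with made-up sample text.

```python
import re

# In a normal string literal, "\b" is a single backspace character.
plain = "\bcat\b"

# In a raw string literal, r"\b" stays as a backslash followed by "b",
# which the regex engine interprets as a word boundary.
raw = r"\bcat\b"

print(len(plain), len(raw))             # 5 7
print(re.findall(plain, "cat concat"))  # []
print(re.findall(raw, "cat concat"))    # ['cat']

# Without a raw string, every backslash has to be doubled.
print("\\d+" == r"\d+")                 # True
```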

Example: Validating Email Addresses

Here's an example of using regular expressions to validate email addresses:

import re

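def is_valid_email(email):
    # Simplified format check: local part, "@", domain, dot, 2+ letter suffix.
    pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    # fullmatch() requires the whole string to match; match() with a trailing $
    # would also accept an address followed by a newline.
    return re.fullmatch(pattern, email) is not None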

email1 = "test@example.com"
email2 = "invalid-email"

print(f"{email1}: {is_valid_email(email1)}")  # Output: test@example.com: True
print(f"{email2}: {is_valid_email(email2)}")  # Output: invalid-email: False 

Flags

Regular expression operations can be further customized by specifying flags. Some common flags include:

  • re.IGNORECASE or re.I: Perform case-insensitive matching.
  • re.MULTILINE or re.M: Enable multi-line matching, allowing ^ and $ to match the beginning and end of each line.
  • re.DOTALL or re.S: Make the . (dot) match any character, including newline characters.

Example using re.IGNORECASE:

import re

text = "Python is a great language."
pattern = "python"

match = re.search(pattern, text)
print(f"Case-sensitive search: {match}")  # Output: Case-sensitive search: None

match = re.search(pattern, text, re.IGNORECASE)
print(f"Case-insensitive search: {match.group() if match else None}")  # Output: Case-insensitive search: Python
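The other two flags listed above, re.MULTILINE and re.DOTALL, can be sketched in the same way. The two-line sample text is made up for illustration; the last lines show that flags can be combined with the bitwise OR operator.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ matches only at the start of the whole string.
single = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
multi = re.findall(r"^\w+", text, re.MULTILINE)

# Without re.DOTALL, . stops at the newline, so the match stays on line one.
one_line = re.search(r"first.*line", text).group()
# With re.DOTALL, . also matches the newline, so .* spans both lines.
both_lines = re.search(r"first.*line", text, re.DOTALL).group()

# Flags are combined with | (here: case-insensitive and multi-line).
combined = re.findall(r"^second.*$", text.upper(), re.I | re.M)

print(single)            # ['first']
print(multi)             # ['first', 'second']
print(repr(one_line))    # 'first line'
print(repr(both_lines))  # 'first line\nsecond line'
print(combined)          # ['SECOND LINE']
```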

Regular expressions are a powerful tool for text processing. By mastering the syntax and functions of the re module, you can efficiently search, match, and manipulate text in Python.