Regular Expressions

Learn how to use regular expressions to search, match, and manipulate text.


Searching for Patterns in Strings (Python)

Introduction

String manipulation is a fundamental programming skill, and a common task is finding specific patterns within strings. Python's re (regular expression) module provides powerful tools for this. We'll focus on the search() and match() functions and show how to use them to locate and extract information from strings.

Understanding Regular Expressions

Before diving into the functions, it's crucial to understand regular expressions (regex). A regex is a sequence of characters that defines a search pattern. It can be simple (e.g., a literal string) or complex (e.g., a pattern that finds email addresses).

Some common regex elements include:

  • . (dot): Matches any single character except a newline (unless the re.DOTALL flag is set).
  • * (asterisk): Matches the preceding element (a character, character set, or group) zero or more times.
  • + (plus): Matches the preceding element one or more times.
  • ? (question mark): Matches the preceding element zero or one time.
  • [] (square brackets): Matches any single character within the brackets (e.g., [aeiou] matches any vowel).
  • ^ (caret): Matches the beginning of the string (or line if using multiline mode). Inside square brackets, it negates the character set (e.g., [^aeiou] matches any character that is not a vowel).
  • $ (dollar sign): Matches the end of the string, or just before a trailing newline at the end of the string (and the end of each line if using multiline mode).
  • \d: Matches any digit. For str patterns this includes all Unicode decimal digits; with the re.ASCII flag it matches only 0-9.
  • \w: Matches any word character: letters, digits, and _. With the re.ASCII flag this is exactly a-z, A-Z, 0-9, and _.
  • \s: Matches any whitespace character (space, tab, newline, etc.).
  • () (parentheses): Groups parts of the pattern, allowing you to extract the matched groups.
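
The elements listed above can be tried out with a short script. The following is a minimal sketch: the sample strings are made up for illustration, and re.findall() is used only because it returns every match in one call.

```python
import re

# Each tuple: (pattern, sample text, result expected from re.findall)
examples = [
    (r"c.t", "cat cot cut ct", ["cat", "cot", "cut"]),   # . = any one character
    (r"ab*", "a ab abbb", ["a", "ab", "abbb"]),          # * = zero or more b's
    (r"ab+", "a ab abbb", ["ab", "abbb"]),               # + = one or more b's
    (r"colou?r", "color colour", ["color", "colour"]),   # ? = optional u
    (r"[aeiou]", "regex", ["e", "e"]),                   # character set
    (r"[^aeiou]", "regex", ["r", "g", "x"]),             # negated character set
    (r"^\d+", "42 apples", ["42"]),                      # ^ anchors to the start
    (r"\w+$", "hello world", ["world"]),                 # $ anchors to the end
    (r"\s", "a b\tc", [" ", "\t"]),                      # whitespace characters
    (r"(\d+)-(\d+)", "10-20", [("10", "20")]),           # groups come back as tuples
]

for pattern, text, expected in examples:
    found = re.findall(pattern, text)
    print(f"{pattern!r:16} on {text!r:18} -> {found}")
```

Note that when a pattern contains more than one group, findall() returns a list of tuples of the groups rather than the full matches.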

The search() Function

The re.search(pattern, string) function searches the entire string for the first occurrence of the pattern. If a match is found, it returns a match object; otherwise, it returns None.

import re

string = "This is a test string with the number 123."
pattern = r"\d+"  # Matches one or more digits

match = re.search(pattern, string)

if match:
    print("Match found!")
    print("Start index:", match.start())
    print("End index:", match.end())
    print("Matched string:", match.group(0))  # or match.group()
else:
    print("No match found.")

In this example, \d+ matches the sequence of digits "123". The match.start(), match.end(), and match.group() methods provide information about the matched substring. match.group(0) returns the entire matched string.
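
Because search() returns None when nothing matches, calling a method on the result without checking it first raises an AttributeError. One way to package the check is a small helper; the function name first_number is hypothetical and chosen only for this sketch.

```python
import re

def first_number(text):
    """Return (matched_text, start, end) for the first run of digits, or None."""
    match = re.search(r"\d+", text)
    if match is None:
        return None  # nothing to extract; the caller decides what to do
    return match.group(), match.start(), match.end()

result = first_number("This is a test string with the number 123.")
print(result)

if result:
    matched, start, end = result
    # start and end are slice indices into the original string
    print("This is a test string with the number 123."[start:end] == matched)

print(first_number("No digits here."))
```

The end index is exclusive, so string[start:end] always reproduces the matched text.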

The match() Function

The re.match(pattern, string) function tries to match the pattern at the *beginning* of the string. It only returns a match object if the pattern matches from the very start. If the pattern does not match at the beginning, it returns None.

import re

string1 = "Python is a great language."
string2 = "A great language is Python."
pattern = r"Python"

match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)

if match1:
    print("match1 found at the beginning!")
else:
    print("match1 not found at the beginning.")

if match2:
    print("match2 found at the beginning!")
else:
    print("match2 not found at the beginning.")

In this example, match1 holds a match object because "Python" is at the beginning of string1. match2 is None because "Python" is not at the beginning of string2.
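
To make the difference between the two functions concrete, the sketch below runs match() and search() on the same string. It also shows re.fullmatch(), a related function from the same module that is not covered above and that succeeds only when the entire string matches. The variable names are illustrative.

```python
import re

text = "A great language is Python."

at_start = re.match(r"Python", text)            # None: "Python" is not at index 0
anywhere = re.search(r"Python", text)           # match object: found later in the string
anchored = re.search(r"^Python", text)          # None: ^ pins search() to the start
whole = re.fullmatch(r"Python", "Python")       # match object: the entire string matches
partial = re.fullmatch(r"Python", "Python 3")   # None: trailing text is left over

results = [
    ("match", at_start),
    ("search", anywhere),
    ("search with ^", anchored),
    ("fullmatch", whole),
    ("fullmatch with extra text", partial),
]
for name, result in results:
    print(f"{name}: {'matched' if result else 'no match'}")
```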

Extracting Groups with Parentheses

Parentheses in a regular expression define capturing groups. You can extract the text matched by each group using the group() method, with the group number as an argument (starting from 1). Group 0 is the entire match.

import re

string = "My phone number is 555-123-4567."
pattern = r"(\d{3})-(\d{3})-(\d{4})"  # Capture area code, exchange, and line number

match = re.search(pattern, string)

if match:
    print("Full number:", match.group(0))
    print("Area code:", match.group(1))
    print("Exchange:", match.group(2))
    print("Line number:", match.group(3))
else:
    print("No phone number found.")

In this example, the pattern is structured to capture the different parts of a phone number. match.group(1) gives the area code, match.group(2) gives the exchange, and match.group(3) gives the line number.
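
Numbered groups can become hard to follow as a pattern grows. As a possible variation on the example above, the sketch below uses match.groups() to get all groups at once and the (?P<name>...) syntax to name them. The group names area, exchange, and line are assumptions made for this illustration.

```python
import re

text = "My phone number is 555-123-4567."
# Same pattern as before, but each group has an illustrative name
pattern = r"(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})"

match = re.search(pattern, text)

if match:
    print(match.groups())         # all captured groups as a tuple
    print(match.group("area"))    # look up a group by name
    print(match.group(1))         # named groups still have numbers
    print(match.groupdict())      # mapping of group name to matched text
else:
    print("No phone number found.")
```

Named groups keep the extraction code readable if the pattern is later reordered or extended.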

Practical Examples and Exercises

  1. Extracting Email Addresses: Write a regular expression and code to extract all email addresses from a text string. Consider different email address formats.
    import re

    text = "Contact us at support@example.com or sales@another-example.org for more information.  Also try info@example.net."
    # Simplified email pattern: covers common addresses, not every valid one
    pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

    # findall() returns every match; search() would return only the first
    emails = re.findall(pattern, text)

    print("Email addresses found:", emails)
  2. Validating a Date Format: Create a regular expression to validate dates in the format YYYY-MM-DD. Test your regex with valid and invalid dates.
    import re

    def validate_date(date_string):
        # Checks the YYYY-MM-DD shape only, not whether the date exists
        # (e.g., "2023-13-45" passes). fullmatch() requires the whole string
        # to match, so no ^ or $ anchors are needed, and a trailing newline
        # is rejected.
        pattern = r"\d{4}-\d{2}-\d{2}"
        if re.fullmatch(pattern, date_string):
            print(f"{date_string} is a valid date format.")
        else:
            print(f"{date_string} is an invalid date format.")

    validate_date("2023-10-27")
    validate_date("2023/10/27")
    validate_date("2023-10-27T12:00:00")
  3. Finding Specific Words: Use findall() to find all occurrences of a particular word (e.g., "the") in a string, regardless of case. (search() would return only the first occurrence.)
    import re

    text = "The quick brown fox jumps over the lazy dog.  THE end."
    pattern = r"\bthe\b"  # Word boundaries so that only the whole word 'the' matches
    matches = re.findall(pattern, text, re.IGNORECASE)  # Ignore case

    print(f"The word 'the' appears {len(matches)} times.")
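
The third exercise counts matches with findall(), which returns plain strings. When the position of each occurrence is also needed, re.finditer() yields one match object per occurrence, the same kind of object that search() returns. The following is a minimal sketch reusing the exercise's text; compiling the pattern first is optional and done here only to keep the flag next to the pattern.

```python
import re

text = "The quick brown fox jumps over the lazy dog.  THE end."
pattern = re.compile(r"\bthe\b", re.IGNORECASE)

# Collect (start index, matched text) for every occurrence
positions = [(m.start(), m.group()) for m in pattern.finditer(text)]

for start, word in positions:
    print(f"Found {word!r} at index {start}")

print(f"The word 'the' appears {len(positions)} times.")
```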