Regular Expressions
Learn how to use regular expressions to search, match, and manipulate text.
Searching for Patterns in Strings (Python)
Introduction
String manipulation is a fundamental skill in programming. A common task is to find specific patterns within strings. Python's re module (short for "regular expression") provides powerful tools for this. We'll focus on the search() and match() functions, exploring how to use them to locate and extract information from strings.
Understanding Regular Expressions
Before diving into the functions, it's crucial to understand regular expressions (regex). A regex is a sequence of characters that defines a search pattern. Patterns can be simple (e.g., finding a literal string) or complex (e.g., finding email addresses).
Some common regex elements include:
- . (dot): Matches any single character except a newline (unless the re.DOTALL flag is set).
- * (asterisk): Matches the preceding element zero or more times.
- + (plus): Matches the preceding element one or more times.
- ? (question mark): Matches the preceding element zero or one time.
- [] (square brackets): Matches any single character within the brackets (e.g., [aeiou] matches any lowercase vowel).
- ^ (caret): Matches the beginning of the string (or of each line in multiline mode). As the first character inside square brackets, it negates the character set (e.g., [^aeiou] matches any character that is not a lowercase vowel).
- $ (dollar sign): Matches the end of the string or just before a trailing newline (or the end of each line in multiline mode).
- \d: Matches any digit (0-9; for str patterns, other Unicode decimal digits as well unless re.ASCII is set).
- \w: Matches any word character (a-z, A-Z, 0-9, and _; for str patterns, other Unicode letters and digits as well unless re.ASCII is set).
- \s: Matches any whitespace character (space, tab, newline, etc.).
- () (parentheses): Groups parts of the pattern, allowing you to extract the matched groups.
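The elements listed above can be tried out directly. The following is a minimal sketch: the sample string is made up for illustration, and re.findall (which returns all non-overlapping matches as a list) is used here only because it makes the results easy to inspect.

```python
import re

# Hypothetical sample text, chosen only to exercise the elements listed above.
text = "cat cot coat ct 42"

dot_matches = re.findall(r"c.t", text)          # . matches exactly one character
star_matches = re.findall(r"co*t", text)        # * allows zero or more "o"
plus_matches = re.findall(r"co+t", text)        # + requires at least one "o"
optional_matches = re.findall(r"coa?t", text)   # ? makes the "a" optional
vowel_matches = re.findall(r"c[aeiou]t", text)  # [] matches one listed character
digit_matches = re.findall(r"\d+", text)        # \d+ matches a run of digits

print(dot_matches)       # ['cat', 'cot']
print(star_matches)      # ['cot', 'ct']
print(plus_matches)      # ['cot']
print(optional_matches)  # ['cot', 'coat']
print(vowel_matches)     # ['cat', 'cot']
print(digit_matches)     # ['42']
```

Note that "coat" is not matched by c.t, because the dot stands for exactly one character.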
The search() Function
The re.search(pattern, string) function scans the string for the first occurrence of the pattern. If a match is found, it returns a match object; otherwise, it returns None.
import re
string = "This is a test string with the number 123."
pattern = r"\d+"  # Matches one or more digits
match = re.search(pattern, string)
if match:
    print("Match found!")
    print("Start index:", match.start())
    print("End index:", match.end())
    print("Matched string:", match.group(0))  # or match.group()
else:
    print("No match found.")
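In this example, \d+ matches the sequence of digits "123". The match.start(), match.end(), and match.group() methods provide information about the matched substring: match.start() is the index of the first matched character, match.end() is the index just past the last matched character, and match.group(0) returns the entire matched string.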
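One detail worth seeing in code is that search() reports only the first match, even when the pattern occurs several times. The sketch below uses a made-up sample string to show this, along with match.span(), which returns the start and end indices together as a tuple.

```python
import re

text = "Order 66 shipped on day 7."  # hypothetical sample string

first = re.search(r"\d+", text)
# search() stops at the first match, so only "66" is reported, not "7".
first_digits = first.group() if first else None
first_span = first.span() if first else None  # (start, end) as a tuple

# With no match, search() returns None rather than an empty match object.
missing = re.search(r"\d+", "no digits here")

print(first_digits)  # 66
print(first_span)    # (6, 8)
print(missing)       # None
```

Because a failed search returns None, calling .group() on the result without checking it first raises an AttributeError, which is why the examples test the match object before using it.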
The match() Function
The re.match(pattern, string) function tries to match the pattern at the *beginning* of the string. It only returns a match object if the pattern matches from the very start. If the pattern does not match at the beginning, it returns None.
import re
string1 = "Python is a great language."
string2 = "A great language is Python."
pattern = r"Python"
match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)
if match1:
    print("match1 found at the beginning!")
else:
    print("match1 not found at the beginning.")
if match2:
    print("match2 found at the beginning!")
else:
    print("match2 not found at the beginning.")
In this example, match1 will be a match object because "Python" is at the beginning of string1. match2 will be None because "Python" is not at the beginning of string2.
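To make the difference between the two functions concrete, the following sketch runs both against string2 from the example above. It also shows re.fullmatch(), which is not covered in this section but is part of the same module and requires the whole string to match.

```python
import re

text = "A great language is Python."  # string2 from the example above
pattern = r"Python"

at_start = re.match(pattern, text)           # None: text does not start with "Python"
anywhere = re.search(pattern, text)          # match object: found later in the string
anchored = re.search(r"^Python", text)       # None: ^ anchors search() to the start
whole = re.fullmatch(pattern, "Python")      # match object: the entire string matches
partial = re.fullmatch(pattern, "Python 3")  # None: trailing text is left over

print(at_start)          # None
print(anywhere.start())  # 20
print(anchored)          # None
print(whole is not None) # True
print(partial)           # None
```

In other words, re.match(pattern, s) behaves much like re.search with a pattern anchored at the start of the string, while search() without an anchor scans forward until it finds a match.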
Extracting Groups with Parentheses
Parentheses in a regular expression define capturing groups. You can extract the text matched by each group using the group() method, with the group number as an argument (starting from 1). Group 0 is the entire match.
import re
string = "My phone number is 555-123-4567."
pattern = r"(\d{3})-(\d{3})-(\d{4})"  # Capture area code, exchange, and line number
match = re.search(pattern, string)
if match:
    print("Full number:", match.group(0))
    print("Area code:", match.group(1))
    print("Exchange:", match.group(2))
    print("Line number:", match.group(3))
else:
    print("No phone number found.")
In this example, the pattern is structured to capture the different parts of a phone number. match.group(1) gives the area code, match.group(2) gives the exchange, and match.group(3) gives the line number.
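As a possible extension of the phone number example, the sketch below collects all groups at once with match.groups() and then repeats the search with named groups. The group names (area, exchange, line) are chosen here purely for illustration; the (?P<name>...) syntax and groupdict() are standard parts of the re module.

```python
import re

text = "My phone number is 555-123-4567."

match = re.search(r"(\d{3})-(\d{3})-(\d{4})", text)
if match:
    # groups() returns every captured group as a tuple, in order.
    area_code, exchange, line_number = match.groups()

# Named groups use the (?P<name>...) syntax; the names are illustrative.
named = re.search(r"(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})", text)
parts = named.groupdict() if named else {}

print(area_code, exchange, line_number)  # 555 123 4567
print(parts)  # {'area': '555', 'exchange': '123', 'line': '4567'}
```

Named groups can make longer patterns easier to read, since the calling code refers to named.group("area") rather than a positional number.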
Practical Examples and Exercises
- Extracting Email Addresses: Write a regular expression and code to extract all email addresses from a text string. Consider different email address formats.
import re
text = "Contact us at support@example.com or sales@another-example.org for more information. Also try info@example.net."
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"  # Improved email regex
emails = re.findall(pattern, text)  # Using findall instead of search to find all instances
print("Email addresses found:", emails)
- Validating a Date Format: Create a regular expression to validate dates in the format YYYY-MM-DD. Test your regex with valid and invalid dates.
import re
def validate_date(date_string):
    # Date format YYYY-MM-DD. ^ and $ added for exact match.
    # This checks only the shape of the string, not whether the date exists.
    pattern = r"^\d{4}-\d{2}-\d{2}$"
    match = re.match(pattern, date_string)
    if match:
        print(f"{date_string} is a valid date format.")
    else:
        print(f"{date_string} is an invalid date format.")
validate_date("2023-10-27")
validate_date("2023/10/27")
validate_date("2023-10-27T12:00:00")
- Finding Specific Words: Use findall() to find all occurrences of a particular word (e.g., "the") in a string, regardless of case. Note that search() alone would return only the first occurrence.
import re
text = "The quick brown fox jumps over the lazy dog. THE end."
pattern = r"\bthe\b"  # Word boundary to match only whole word 'the'
matches = re.findall(pattern, text, re.IGNORECASE)  # Ignore case
print(f"The word 'the' appears {len(matches)} times.")
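If the position of each occurrence matters as well as the count, one option is re.finditer(), which yields a match object per occurrence. The sketch below reuses the sentence from the last exercise and contrasts it with search(), which stops at the first hit; the variable names are illustrative.

```python
import re

text = "The quick brown fox jumps over the lazy dog. THE end."
# Compile once so the same pattern and flag can be reused.
word_pattern = re.compile(r"\bthe\b", re.IGNORECASE)

# search() returns only the first occurrence.
first = word_pattern.search(text)

# finditer() yields one match object per occurrence, left to right.
occurrences = [(m.start(), m.group()) for m in word_pattern.finditer(text)]

print(first.group())  # The
print(occurrences)    # [(0, 'The'), (31, 'the'), (45, 'THE')]
```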