Regular Expressions
Learn how to use regular expressions to search, match, and manipulate text.
Regular Expressions in Python
Introduction to Regular Expressions
Regular expressions (regex) are sequences of characters that define a search pattern. They are used to match patterns in strings and are a powerful tool for text manipulation, data validation, and search.
In Python, the re
module provides support for working with regular expressions.
Real-World Examples and Use Cases
- Data Validation: Validating email addresses, phone numbers, and postal codes.
- Data Extraction: Extracting specific information from unstructured text, such as dates, prices, or product names.
- Data Cleaning: Removing unwanted characters, standardizing text formats, and correcting inconsistencies.
- Web Scraping: Extracting data from websites based on HTML patterns.
- Log File Analysis: Identifying errors, warnings, and other important events in log files.
- Text Editors and IDEs: Find and replace functionality.
- Network Security: Intrusion detection and prevention systems use regex to identify malicious patterns in network traffic.
Practical Examples in Python
1. Data Cleaning: Removing Unwanted Characters
Suppose you have a string with unwanted characters like special symbols and extra spaces.
import re
text = " This is a string with !@#$%^&*()_+ some unwanted characters. "
cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text).strip()
print(f"Original text: '{text}'")
print(f"Cleaned text: '{cleaned_text}'")
Explanation:
re.sub(pattern, replacement, string)
replaces all occurrences of thepattern
in thestring
with thereplacement
.r'[^a-zA-Z0-9\s]'
is the regex pattern.[^...]
means "match any character that is NOT in the set...". In this case, the set is a-z, A-Z, 0-9, and whitespace (\s). So it matches any character that is not a letter, number, or whitespace.''
is the replacement string (empty string, effectively removing the matched characters)..strip()
removes leading and trailing whitespace.
2. Data Validation: Validating Email Addresses
A common use case is validating email addresses. A robust email regex can be complex, but here's a basic example.
import re
email = "test@example.com"
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
if re.match(pattern, email):
print(f"'{email}' is a valid email address.")
else:
print(f"'{email}' is not a valid email address.")
email = "invalid-email"
if re.match(pattern, email):
print(f"'{email}' is a valid email address.")
else:
print(f"'{email}' is not a valid email address.")
Explanation:
r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
is the regex pattern.^
matches the beginning of the string.[a-zA-Z0-9._%+-]+
matches one or more alphanumeric characters, dots, underscores, percentage signs, plus signs, or hyphens (before the @).@
matches the "@" symbol.[a-zA-Z0-9.-]+
matches one or more alphanumeric characters, dots, or hyphens (after the @).\.
matches a literal dot (escaped with a backslash because.
has a special meaning in regex).[a-zA-Z]{2,}
matches two or more alphabetic characters (for the top-level domain, e.g., "com", "org", "net").$
matches the end of the string.re.match(pattern, string)
checks if thepattern
matches at the *beginning* of thestring
.
3. Web Scraping: Extracting Links from HTML
This example demonstrates extracting all links from a simplified HTML string.
import re
html = 'Example WebsiteSome text
Another Site'
pattern = r''
links = re.findall(pattern, html)
print("Extracted Links:")
for link in links:
print(link)
Explanation:
r''
is the regex pattern.matches the literal string ">".
re.findall(pattern, string)
returns a list of all the strings matched by the capturing group in thepattern
within thestring
.
4. Log File Analysis: Identifying Error Messages
This example demonstrates how to extract error messages from a log file.
import re
log_data = """
2023-10-27 10:00:00 INFO: Application started
2023-10-27 10:00:05 ERROR: Database connection failed
2023-10-27 10:00:10 WARNING: Low disk space
2023-10-27 10:00:15 ERROR: Invalid user input
"""
pattern = r'ERROR: (.*)'
errors = re.findall(pattern, log_data)
print("Error Messages:")
for error in errors:
print(error)
Explanation:
r'ERROR: (.*)'
is the regex pattern.ERROR:
matches the literal string "ERROR: ".(.*)
matches any character (.
) zero or more times (*
) in a greedy way. The parentheses()
create a capturing group.re.findall(pattern, string)
returns a list of all the strings matched by the capturing group in thepattern
within thestring
.
5. Replacing Text: Standardizing Date Formats
Convert different date formats to a standardized format.
import re
dates = ["10/27/2023", "2023-10-27", "Oct 27, 2023"]
pattern = r'(\d{4})-(\d{2})-(\d{2})|(\d{2})/(\d{2})/(\d{4})|([A-Za-z]+) (\d{2}), (\d{4})' # Matches all three formats
def standardize_date(date_string):
match = re.match(pattern, date_string)
if match:
if match.group(1): # YYYY-MM-DD format
year, month, day = match.group(1, 2, 3)
elif match.group(4): # MM/DD/YYYY format
month, day, year = match.group(4, 5, 6)
else: # Month DD, YYYY format
month_name, day, year = match.group(7, 8, 9)
from datetime import datetime
month = str(datetime.strptime(month_name, "%b").month).zfill(2) # Convert month name to number
return f"{year}-{month}-{day}"
else:
return "Invalid date format"
for date in dates:
standardized_date = standardize_date(date)
print(f"Original date: {date}, Standardized date: {standardized_date}")
Explanation:
- This example is more complex, demonstrating the power of regex for handling multiple possible patterns.
- The pattern
(\d{4})-(\d{2})-(\d{2})|(\d{2})/(\d{2})/(\d{4})|([A-Za-z]+) (\d{2}), (\d{4})
uses the|
(OR) operator to match one of three different date formats: YYYY-MM-DD, MM/DD/YYYY, and Month DD, YYYY. - Each alternative format has its own set of capturing groups (parentheses).
- The
standardize_date
function checks which group was matched and extracts the corresponding year, month, and day. - The
datetime
module is used to convert the month name (e.g., "Oct") into a numerical month (e.g., "10"). - The
zfill(2)
method ensures that the month and day are padded with a leading zero if they are single digits (e.g., "9" becomes "09").