Regular Expressions

Learn how to use regular expressions to search, match, and manipulate text.


Regular Expressions in Python

Introduction to Regular Expressions

Regular expressions (regex) are sequences of characters that define a search pattern. They are used to match patterns in strings and are a powerful tool for text manipulation, data validation, and search.

In Python, the re module provides support for working with regular expressions.
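
As a first taste, the short sketch below exercises three commonly used functions from the re module: re.search, re.findall, and re.sub. The sample sentence and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, used only for illustration.
sentence = "Order 66 shipped on day 12 to bay 7."

# re.search returns a match object for the first match, or None if there is none.
first_number = re.search(r'\d+', sentence)

# re.findall returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r'\d+', sentence)

# re.sub replaces every match with the replacement string.
masked = re.sub(r'\d+', '#', sentence)

print(first_number.group())  # 66
print(all_numbers)           # ['66', '12', '7']
print(masked)                # Order # shipped on day # to bay #.
```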

Real-World Examples and Use Cases

  • Data Validation: Validating email addresses, phone numbers, and postal codes.
  • Data Extraction: Extracting specific information from unstructured text, such as dates, prices, or product names.
  • Data Cleaning: Removing unwanted characters, standardizing text formats, and correcting inconsistencies.
  • Web Scraping: Extracting data from websites based on HTML patterns.
  • Log File Analysis: Identifying errors, warnings, and other important events in log files.
  • Text Editors and IDEs: Find and replace functionality.
  • Network Security: Intrusion detection and prevention systems use regex to identify malicious patterns in network traffic.

Practical Examples in Python

1. Data Cleaning: Removing Unwanted Characters

Suppose you have a string with unwanted characters like special symbols and extra spaces.

import re

text = "  This is a string with !@#$%^&*()_+ some unwanted characters.  "
cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text).strip()
print(f"Original text: '{text}'")
print(f"Cleaned text: '{cleaned_text}'") 

Explanation:

  • re.sub(pattern, replacement, string) replaces all occurrences of the pattern in the string with the replacement.
  • r'[^a-zA-Z0-9\s]' is the regex pattern. [^...] means "match any character that is NOT in the set...". In this case, the set is a-z, A-Z, 0-9, and whitespace (\s). So it matches any character that is not a letter, number, or whitespace.
  • '' is the replacement string (empty string, effectively removing the matched characters).
  • .strip() removes leading and trailing whitespace.
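
Note that the pattern above also removes the underscore and the final period, and it leaves a double space where the symbols used to be. One possible follow-up step, sketched here with a hypothetical helper named clean_text, collapses each run of whitespace into a single space.

```python
import re

def clean_text(text):
    """Remove non-alphanumeric symbols, then squeeze runs of whitespace."""
    # Step 1: drop everything that is not a letter, digit, or whitespace.
    without_symbols = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Step 2: replace each run of whitespace with one space and trim the ends.
    return re.sub(r'\s+', ' ', without_symbols).strip()

text = "  This is a string with !@#$%^&*()_+ some unwanted characters.  "
print(clean_text(text))  # This is a string with some unwanted characters
```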

2. Data Validation: Validating Email Addresses

A common use case is validating email addresses. A robust email regex can be complex, but here's a basic example.

import re

email = "test@example.com"
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

if re.match(pattern, email):
    print(f"'{email}' is a valid email address.")
else:
    print(f"'{email}' is not a valid email address.")

email = "invalid-email"
if re.match(pattern, email):
    print(f"'{email}' is a valid email address.")
else:
    print(f"'{email}' is not a valid email address.") 

Explanation:

  • r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' is the regex pattern.
  • ^ matches the beginning of the string.
  • [a-zA-Z0-9._%+-]+ matches one or more letters, digits, dots, underscores, percent signs, plus signs, or hyphens (the part before the @).
  • @ matches the "@" symbol.
  • [a-zA-Z0-9.-]+ matches one or more alphanumeric characters, dots, or hyphens (after the @).
  • \. matches a literal dot (escaped with a backslash because . has a special meaning in regex).
  • [a-zA-Z]{2,} matches two or more alphabetic characters (for the top-level domain, e.g., "com", "org", "net").
  • $ matches the end of the string.
  • re.match(pattern, string) only attempts a match at the beginning of the string; the $ anchor in the pattern is what requires the match to run to the end.
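
Because the pattern is anchored with ^ and $, re.match works here, but $ also matches just before a trailing newline, so "test@example.com\n" would be accepted. One way to avoid that, sketched below with a hypothetical is_valid_email helper, is to precompile the pattern and call fullmatch, which requires the whole string to match.

```python
import re

# Same character classes as the pattern above; fullmatch makes ^ and $ unnecessary.
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def is_valid_email(email):
    """Return True if the entire string matches the basic email pattern."""
    return EMAIL_PATTERN.fullmatch(email) is not None

for candidate in ["test@example.com", "invalid-email", "test@example.com\n"]:
    print(repr(candidate), is_valid_email(candidate))
```

Precompiling with re.compile is optional, but it keeps the pattern in one place when the same check runs many times.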

3. Web Scraping: Extracting Links from HTML

This example demonstrates extracting all links from a simplified HTML string.

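import re

# Simplified sample HTML; the URLs are placeholders chosen for illustration.
html = '''<a href="https://www.example.com">Example Website</a>
<p>Some text</p>
<a href="https://www.example.org">Another Site</a>'''

# Capture the value of the href attribute of each anchor tag.
pattern = r'<a\s+href="([^"]*)"'
links = re.findall(pattern, html)

print("Extracted Links:")
for link in links:
    print(link)

Explanation:

  • <a\s+href=" matches the start of an anchor tag up to the opening quote of its href attribute.
  • ([^"]*) captures everything up to the next double quote, which is the URL.
  • re.findall(pattern, string) returns a list of the captured groups, one per match.
  • A regex like this only suits simple, predictable markup; for real-world pages, an HTML parser is more reliable.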
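
When the link text matters as well as the URL, re.finditer with named groups is one option. The sketch below is illustrative only: the HTML snippet and its example.com and example.org URLs are placeholders, and the pattern assumes that href is the first attribute of each anchor tag.

```python
import re

# Placeholder markup for illustration; the URLs are assumptions, not real data.
html = (
    '<a href="https://www.example.com">Example Website</a>'
    '<p>Some text</p>'
    '<a href="https://www.example.org">Another Site</a>'
)

# Named groups capture the href value and the text between the tags.
link_pattern = re.compile(r'<a\s+href="(?P<url>[^"]*)"[^>]*>(?P<text>.*?)</a>')

links = [(m.group('text'), m.group('url')) for m in link_pattern.finditer(html)]
for text, url in links:
    print(f"{text}: {url}")
```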

4. Log File Analysis: Identifying Error Messages

This example demonstrates how to extract error messages from a log file.

import re

log_data = """
2023-10-27 10:00:00 INFO: Application started
2023-10-27 10:00:05 ERROR: Database connection failed
2023-10-27 10:00:10 WARNING: Low disk space
2023-10-27 10:00:15 ERROR: Invalid user input
"""

pattern = r'ERROR: (.*)'
errors = re.findall(pattern, log_data)

print("Error Messages:")
for error in errors:
    print(error) 

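Explanation:

  • ERROR: followed by a space matches that literal text in each error line.
  • (.*) captures the rest of the line. By default, . does not match a newline, so each capture stops at the end of its line.
  • re.findall(pattern, string) returns the captured group for every match, so errors holds only the message text: "Database connection failed" and "Invalid user input".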
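
To go beyond the message text, each line can be split into its parts with named groups. The sketch below assumes every log line follows the "date time LEVEL: message" layout of the sample data; the group names and the Counter summary are illustrative choices.

```python
import re
from collections import Counter

log_data = """
2023-10-27 10:00:00 INFO: Application started
2023-10-27 10:00:05 ERROR: Database connection failed
2023-10-27 10:00:10 WARNING: Low disk space
2023-10-27 10:00:15 ERROR: Invalid user input
"""

# re.MULTILINE lets ^ and $ match at the start and end of every line.
line_pattern = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'(?P<level>[A-Z]+): (?P<message>.*)$',
    re.MULTILINE,
)

# One dict per log line, keyed by group name.
entries = [m.groupdict() for m in line_pattern.finditer(log_data)]
level_counts = Counter(entry['level'] for entry in entries)

print(level_counts)
for entry in entries:
    if entry['level'] == 'ERROR':
        print(entry['timestamp'], entry['message'])
```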

5. Replacing Text: Standardizing Date Formats

Convert different date formats to a standardized format.

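import re
from datetime import datetime

dates = ["10/27/2023", "2023-10-27", "Oct 27, 2023"]

# One alternative per supported format: YYYY-MM-DD, MM/DD/YYYY, and "Mon DD, YYYY".
pattern = r'(\d{4})-(\d{2})-(\d{2})|(\d{2})/(\d{2})/(\d{4})|([A-Za-z]+) (\d{2}), (\d{4})'

def standardize_date(date_string):
    # fullmatch rejects strings with trailing text, such as "2023-10-27abc".
    match = re.fullmatch(pattern, date_string)
    if not match:
        return "Invalid date format"
    if match.group(1):  # YYYY-MM-DD format
        year, month, day = match.group(1, 2, 3)
    elif match.group(4):  # MM/DD/YYYY format
        month, day, year = match.group(4, 5, 6)
    else:  # Month DD, YYYY format
        month_name, day, year = match.group(7, 8, 9)
        try:
            # %b parses an abbreviated month name such as "Oct".
            month = str(datetime.strptime(month_name, "%b").month).zfill(2)
        except ValueError:
            # The regex accepts any word, so reject names that are not months.
            return "Invalid date format"
    return f"{year}-{month}-{day}"

for date in dates:
    standardized_date = standardize_date(date)
    print(f"Original date: {date}, Standardized date: {standardized_date}")

Explanation:

  • The pattern joins three alternatives with |, one per supported format. Groups 1 to 3 capture YYYY-MM-DD, groups 4 to 6 capture MM/DD/YYYY, and groups 7 to 9 capture "Month DD, YYYY".
  • Groups that belong to an alternative that did not match are None, so testing match.group(1) and match.group(4) reveals which format was found.
  • match.group(1, 2, 3) returns several groups at once as a tuple, which is unpacked into year, month, and day.
  • re.fullmatch(pattern, string) succeeds only if the entire string matches, so input with trailing text is rejected.
  • datetime.strptime(month_name, "%b") converts an abbreviated month name to a month number, and zfill(2) pads it to two digits. A word that is not a month abbreviation raises ValueError, which is reported as an invalid format.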
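
The same kind of conversion can be applied to dates embedded in longer text by using re.sub with backreferences. This is a minimal sketch that covers only the MM/DD/YYYY format; the sample sentence is invented for illustration.

```python
import re

# Hypothetical sentence, used only for illustration.
text = "Invoice issued 10/27/2023, payment due 11/26/2023."

# \1, \2, and \3 in the replacement refer to the month, day, and year groups.
iso_text = re.sub(r'\b(\d{2})/(\d{2})/(\d{4})\b', r'\3-\1-\2', text)

print(iso_text)  # Invoice issued 2023-10-27, payment due 2023-11-26.
```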