Regular Expressions

Learn how to use regular expressions to search, match, and manipulate text.


Regular Expressions in Python

Introduction to Regular Expressions

Regular expressions (regex) are sequences of characters that define a search pattern. They are used to match patterns in strings and are a powerful tool for text manipulation, data validation, and search.

In Python, the re module provides support for working with regular expressions.
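As a quick orientation, the sketch below exercises three commonly used functions of the re module on a made-up sentence. The sample text and variable names are illustrative assumptions, not part of any particular application.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "Order 66 shipped on 2023-10-27; order 67 is pending."

# re.search returns the first match object, or None if nothing matches.
first_number = re.search(r"\d+", sample)

# re.findall returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r"\d+", sample)

# re.sub returns a new string with each match replaced.
masked = re.sub(r"\d", "#", sample)

print(first_number.group())
print(all_numbers)
print(masked)
```

Each of these functions takes the pattern first and the string to search second; the examples that follow use the same convention.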

Real-World Examples and Use Cases

Data Validation: Validating email addresses, phone numbers, and postal codes.
Data Extraction: Extracting specific information from unstructured text, such as dates, prices, or product names.
Data Cleaning: Removing unwanted characters, standardizing text formats, and correcting inconsistencies.
Web Scraping: Extracting data from websites based on HTML patterns.
Log File Analysis: Identifying errors, warnings, and other important events in log files.
Text Editors and IDEs: Find and replace functionality.
Network Security: Intrusion detection and prevention systems use regex to identify malicious patterns in network traffic.

Practical Examples in Python

1. Data Cleaning: Removing Unwanted Characters

Suppose you have a string with unwanted characters like special symbols and extra spaces.

import re

text = "  This is a string with !@#$%^&*()_+ some unwanted characters.  "
cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text).strip()
print(f"Original text: '{text}'")
print(f"Cleaned text: '{cleaned_text}'")

Explanation:

re.sub(pattern, replacement, string) replaces all occurrences of the pattern in the string with the replacement.
r'[^a-zA-Z0-9\s]' is the regex pattern. [^...] means "match any character that is NOT in the set...". In this case, the set is a-z, A-Z, 0-9, and whitespace (\s). So it matches any character that is not a letter, number, or whitespace.
'' is the replacement string (empty string, effectively removing the matched characters).
.strip() removes leading and trailing whitespace.
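Removing the symbols leaves a double space where they used to be, and the pattern also drops the underscore and the final period. If single spacing matters, a second substitution can collapse runs of whitespace. The following is a minimal sketch of that extra step; the variable names are illustrative.

```python
import re

text = "  This is a string with !@#$%^&*()_+ some unwanted characters.  "

# Step 1: drop everything except ASCII letters, digits, and whitespace.
no_symbols = re.sub(r"[^a-zA-Z0-9\s]", "", text)

# Step 2: collapse each run of whitespace into a single space, then trim.
normalized = re.sub(r"\s+", " ", no_symbols).strip()

print(repr(normalized))
```

To keep sentence punctuation, the period could be added to the character set in step 1, for example r"[^a-zA-Z0-9\s.]".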

2. Data Validation: Validating Email Addresses

A common use case is validating email addresses. A robust email regex can be complex, but here's a basic example.

import re

pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

# Check one well-formed and one malformed address against the pattern
for email in ["test@example.com", "invalid-email"]:
    if re.match(pattern, email):
        print(f"'{email}' is a valid email address.")
    else:
        print(f"'{email}' is not a valid email address.")

Explanation:

r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' is the regex pattern.
^ matches the beginning of the string.
[a-zA-Z0-9._%+-]+ matches one or more alphanumeric characters, dots, underscores, percentage signs, plus signs, or hyphens (before the @).
@ matches the "@" symbol.
[a-zA-Z0-9.-]+ matches one or more alphanumeric characters, dots, or hyphens (after the @).
\. matches a literal dot (escaped with a backslash because . has a special meaning in regex).
[a-zA-Z]{2,} matches two or more alphabetic characters (for the top-level domain, e.g., "com", "org", "net").
$ matches the end of the string.
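re.match(pattern, string) checks whether the pattern matches at the beginning of the string. Because the pattern ends with $, the match must also run to the end of the string. Note that $ also matches just before a single trailing newline, so "test@example.com\n" is accepted; re.fullmatch is stricter.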
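When the same pattern is applied many times, it can be compiled once and reused. The sketch below wraps the basic email pattern in a helper and uses fullmatch, which requires the entire string to match and therefore rejects a trailing newline. The names EMAIL_RE and is_valid_email are hypothetical, chosen for this illustration.

```python
import re

# Same character classes as the pattern above. The ^ and $ anchors are
# omitted because fullmatch already requires the whole string to match.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def is_valid_email(candidate: str) -> bool:
    """Return True if the whole string matches the basic email pattern."""
    return EMAIL_RE.fullmatch(candidate) is not None


print(is_valid_email("test@example.com"))
print(is_valid_email("invalid-email"))
print(is_valid_email("test@example.com\n"))
```

This remains a basic check: it accepts common address shapes but does not attempt to cover every address the email standards allow.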

3. Web Scraping: Extracting Links from HTML

This example demonstrates extracting all links from a simplified HTML string.

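import re

# Simplified HTML; the tags and URLs are placeholders.
html = ('<a href="https://example.com">Example Website</a><p>Some text</p>\n'
        '<a href="https://example.org">Another Site</a>')
pattern = r'<a\s+href="([^"]*)"'
links = re.findall(pattern, html)

print("Extracted Links:")
for link in links:
    print(link)

Explanation:

<a\s+href=" matches the start of an anchor tag up to the opening quote of its href attribute.
([^"]*) is a capturing group that matches everything up to the next double quote, which is the URL.
re.findall(pattern, string) returns a list of all non-overlapping matches. When the pattern contains exactly one capturing group, the list holds only the captured text.
This approach suits simple, predictable markup; for arbitrary real-world HTML, a dedicated HTML parser is more reliable.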
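Real markup often has extra attributes, single quotes, or uppercase tag names. The sketch below shows one way a pattern might tolerate those variations while capturing both the URL and the link text. The HTML snippet, its URLs, and the name LINK_RE are assumptions made up for this illustration, and the pattern is still a heuristic rather than a full HTML parser.

```python
import re

# Hypothetical snippet: extra attributes, mixed quote styles, uppercase tag.
html = (
    '<a class="nav" href="https://example.com">Example Website</a>'
    "<p>Some text</p>"
    "<A HREF='https://example.org' target='_blank'>Another Site</A>"
)

# (["']) captures the opening quote and \1 requires the same closing quote.
# Named groups make the two captured values easy to tell apart.
LINK_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(["'])(?P<url>.*?)\1[^>]*>(?P<text>.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)

links = [(m.group("url"), m.group("text")) for m in LINK_RE.finditer(html)]
for url, text in links:
    print(f"{text}: {url}")
```

re.finditer yields match objects instead of plain strings, which gives access to named groups; re.findall would return tuples that also include the captured quote character.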

4. Log File Analysis: Identifying Error Messages

This example demonstrates how to extract error messages from a log file.

import re

log_data = """
2023-10-27 10:00:00 INFO: Application started
2023-10-27 10:00:05 ERROR: Database connection failed
2023-10-27 10:00:10 WARNING: Low disk space
2023-10-27 10:00:15 ERROR: Invalid user input
"""

pattern = r'ERROR: (.*)'
errors = re.findall(pattern, log_data)

print("Error Messages:")
for error in errors:
    print(error)

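Explanation:

ERROR: matches the literal text "ERROR: ".
(.*) is a capturing group that matches the rest of the line. By default . does not match a newline, so each match stops at the end of its line.
re.findall(pattern, string) returns a list of the captured groups, here the two error messages.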
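Log analysis usually needs more than the message text. As a sketch under the assumption that every line follows the "date time LEVEL: message" layout of the sample, named groups can split each line into fields, which can then be counted or filtered. The names LINE_RE, entries, and counts are illustrative.

```python
import re
from collections import Counter

log_data = """\
2023-10-27 10:00:00 INFO: Application started
2023-10-27 10:00:05 ERROR: Database connection failed
2023-10-27 10:00:10 WARNING: Low disk space
2023-10-27 10:00:15 ERROR: Invalid user input
"""

# One entry per line: timestamp, level, message. re.MULTILINE makes ^ and $
# match at the start and end of every line instead of the whole string.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+): (?P<message>.*)$",
    re.MULTILINE,
)

# groupdict() turns each match into a dict keyed by group name.
entries = [m.groupdict() for m in LINE_RE.finditer(log_data)]
counts = Counter(entry["level"] for entry in entries)
errors = [(e["timestamp"], e["message"]) for e in entries if e["level"] == "ERROR"]

print(counts)
for timestamp, message in errors:
    print(timestamp, message)
```

Lines that do not fit the assumed layout are silently skipped, which may or may not be the desired behavior for a real log.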

5. Replacing Text: Standardizing Date Formats

Convert different date formats to a standardized format.

import re
from datetime import datetime

dates = ["10/27/2023", "2023-10-27", "Oct 27, 2023"]
# Three alternatives: YYYY-MM-DD, MM/DD/YYYY, and "Mon DD, YYYY"
pattern = r'(\d{4})-(\d{2})-(\d{2})|(\d{2})/(\d{2})/(\d{4})|([A-Za-z]+) (\d{2}), (\d{4})'

def standardize_date(date_string):
    # fullmatch rejects strings with extra text after the date
    match = re.fullmatch(pattern, date_string)
    if not match:
        return "Invalid date format"
    if match.group(1):  # YYYY-MM-DD format
        year, month, day = match.group(1, 2, 3)
    elif match.group(4):  # MM/DD/YYYY format
        month, day, year = match.group(4, 5, 6)
    else:  # Month DD, YYYY format
        month_name, day, year = match.group(7, 8, 9)
        try:
            # %b parses abbreviated month names, so use the first three letters
            month = f"{datetime.strptime(month_name[:3], '%b').month:02d}"
        except ValueError:  # not a recognizable month name
            return "Invalid date format"
    return f"{year}-{month}-{day}"

for date in dates:
    standardized_date = standardize_date(date)
    print(f"Original date: {date}, Standardized date: {standardized_date}")

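Explanation:

The pattern joins three alternatives with |, one per date format, for nine capturing groups in total. Only the groups of the alternative that matched are populated; the others are None.
match.group(1) and match.group(4) are therefore used to detect which format matched.
match.group(1, 2, 3) returns several groups at once as a tuple.
datetime.strptime with the "%b" format parses an abbreviated month name such as "Oct"; the resulting month number is then zero-padded to two digits.
The function returns the date as YYYY-MM-DD, or "Invalid date format" if no alternative matched.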
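When dates are embedded in longer text, re.sub can rewrite them in place by referring back to captured groups in the replacement string. The sketch below handles only the MM/DD/YYYY form; the sample sentence and the name US_DATE_RE are assumptions made for this illustration.

```python
import re

# Hypothetical sentence containing US-style dates.
text = "Invoice issued 10/27/2023, due 11/30/2023."

# \b keeps the pattern from matching inside a longer run of digits.
US_DATE_RE = re.compile(r"\b(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})\b")

# \g<name> in the replacement refers back to a named group.
iso_text = US_DATE_RE.sub(r"\g<year>-\g<month>-\g<day>", text)

print(iso_text)
```

The substitution only reorders digits; it does not check that the month and day are in a valid range.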
