Regular Expressions

Learn how to use regular expressions to search, match, and manipulate text.


Basic Regular Expression Syntax

Introduction

Regular expressions (regex or regexp) are a concise, flexible notation for describing text patterns, which makes them useful for searching, extracting, and replacing text. This document covers the basic regular expression syntax supported by Python's re module: character classes, anchors, quantifiers, grouping and alternation, and escaping special characters.

Character Classes

Character classes define a set of characters that you want to match. Here are some common character classes:

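  • . (Dot): Matches any single character except a newline (with the DOTALL flag it matches newlines too).
  • \d: Matches any decimal digit. For str patterns this includes Unicode digits from other scripts; with the ASCII flag it matches only 0-9.
  • \D: Matches any non-digit character.
  • \w: Matches any word character: letters, digits, and the underscore. With the ASCII flag this is exactly a-z, A-Z, 0-9, and _.
  • \W: Matches any non-word character.
  • \s: Matches any whitespace character (such as space, tab, newline, and carriage return).
  • \S: Matches any non-whitespace character.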
  • [abc]: Matches any one of the characters a, b, or c.
  • [^abc]: Matches any character except a, b, or c.
  • [a-z]: Matches any lowercase letter from a to z.
  • [A-Z]: Matches any uppercase letter from A to Z.
  • [0-9]: Matches any digit from 0 to 9.

Python Example:

import re

text = "The quick brown fox jumps over the lazy dog 123."

# Find all digits
digits = re.findall(r"\d+", text)
print(f"Digits found: {digits}")  # Output: Digits found: ['123']

# Find all runs of word characters
words = re.findall(r"\w+", text)
print(f"Words found: {words}")  # Output: Words found: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '123']

# Find all runs of non-word characters
non_words = re.findall(r"\W+", text)
print(f"Non-words found: {non_words}")  # Output: Non-words found: [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.']

# Find all runs of whitespace
spaces = re.findall(r"\s+", text)
print(f"Spaces found: {spaces}")  # Output: Spaces found: [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
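
The example above uses only the shorthand classes. As a minimal sketch of the bracket forms from the list ([abc], [^abc], and ranges), the following uses made-up sample strings chosen purely for illustration:

```python
import re

# Hypothetical sample string, used only for illustration.
sample = "Order A7 shipped to Zone-3 on day 12"

# [aeiou] matches any single lowercase vowel.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)  # ['e', 'e']

# [^aeiou] matches any single character that is NOT a lowercase vowel.
non_vowels = re.findall(r"[^aeiou]", "regex")
print(non_vowels)  # ['r', 'g', 'x']

# [A-Z][0-9] matches an uppercase letter immediately followed by a digit.
codes = re.findall(r"[A-Z][0-9]", sample)
print(codes)  # ['A7']
```

Note that each bracket expression matches exactly one character; add a quantifier such as + to match a run of them.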

Anchors

Anchors don't match characters but rather positions within the string. They assert that the match must occur at a specific location.

  • ^ (Caret): Matches the beginning of the string (or the beginning of a line if the MULTILINE flag is set).
  • $ (Dollar): Matches the end of the string (or the end of a line if the MULTILINE flag is set).
  • \b: Matches a word boundary: the position between a word character and a non-word character, or between a word character and the start or end of the string.
  • \B: Matches any position that is not a word boundary.

Python Example:

import re

text = "The quick brown fox\nLazy dog"

# Match lines starting with 'The'
starting_lines = re.findall(r"^The.*", text, re.MULTILINE)
print(f"Lines starting with 'The': {starting_lines}") # Output: Lines starting with 'The': ['The quick brown fox']

# Match lines ending with 'dog'
ending_lines = re.findall(r".*dog$", text, re.MULTILINE)
print(f"Lines ending with 'dog': {ending_lines}")  # Output: Lines ending with 'dog': ['Lazy dog']

# Find words starting with 'L' using a word boundary
word_boundary_l = re.findall(r"\bL\w+", text)
print(f"Words starting with 'L': {word_boundary_l}")  # Output: Words starting with 'L': ['Lazy']
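
To make the effect of the MULTILINE flag and the difference between \b and \B more concrete, here is a small sketch. The sample strings are invented for illustration, and re.finditer is used only so the match positions can be shown:

```python
import re

text = "The quick brown fox\nThe lazy dog"

# Without re.MULTILINE, ^ matches only at the very start of the string.
first_only = re.findall(r"^The", text)
print(first_only)  # ['The']

# With re.MULTILINE, ^ also matches immediately after each newline.
every_line = re.findall(r"^The", text, re.MULTILINE)
print(every_line)  # ['The', 'The']

words = "cat concat catalog"

# \bcat\b matches 'cat' only when it stands alone as a whole word.
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", words)]
print(whole_word_starts)  # [0]

# \Bcat matches 'cat' only when it is preceded by another word character,
# which here is the 'cat' inside 'concat'.
inside_word_starts = [m.start() for m in re.finditer(r"\Bcat", words)]
print(inside_word_starts)  # [7]
```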

Quantifiers

Quantifiers specify how many times a preceding element must occur to constitute a match.

  • * (Asterisk): Matches zero or more occurrences of the preceding element.
  • + (Plus): Matches one or more occurrences of the preceding element.
  • ? (Question Mark): Matches zero or one occurrence of the preceding element (optional).
  • {n}: Matches exactly n occurrences of the preceding element.
  • {n,}: Matches n or more occurrences of the preceding element.
  • {n,m}: Matches between n and m occurrences of the preceding element (inclusive).

Python Example:

import re

text = "abbcccddddeeeee"

# Match 'b' followed by one or more 'c's
match_plus = re.findall(r"bc+", text)
print(f"Match 'bc+': {match_plus}")  # Output: Match 'bc+': ['bccc']

# Match runs of two to four consecutive 'd's
match_range = re.findall(r"d{2,4}", text)
print(f"Match 'd{{2,4}}': {match_range}")  # Output: Match 'd{2,4}': ['dddd']

# Match zero or more 'e's. Because zero occurrences is a valid match, the
# result contains 'eeeee' plus an empty string at each position with no 'e'.
match_star = re.findall(r"e*", text)
print(f"Match 'e*': {match_star}")

# Match zero or one 'e'. The result contains each 'e' separately, plus an
# empty string at each position with no 'e'.
match_question = re.findall(r"e?", text)
print(f"Match 'e?': {match_question}")
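
The example above does not exercise {n} or {n,}, and it may also help to see * and + side by side. The following is a minimal sketch using made-up input strings:

```python
import re

numbers = "7 42 123 4567"

# {3}: exactly three digits; the \b anchors keep longer numbers from matching.
exactly_three = re.findall(r"\b\d{3}\b", numbers)
print(exactly_three)  # ['123']

# {3,}: three or more digits.
three_or_more = re.findall(r"\d{3,}", numbers)
print(three_or_more)  # ['123', '4567']

# * allows zero 'b's, so a bare 'a' matches; + requires at least one 'b'.
star = re.findall(r"ab*", "a ab abb")
plus = re.findall(r"ab+", "a ab abb")
print(star)  # ['a', 'ab', 'abb']
print(plus)  # ['ab', 'abb']

# ?: the 'u' is optional, so both spellings match.
spellings = re.findall(r"colou?r", "color colour")
print(spellings)  # ['color', 'colour']
```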

Grouping and Alternation

  • ( ): Groups together the expressions contained inside it, used to apply quantifiers to the entire group or to capture parts of the match.
  • |: Acts like an "or" operator, matching either the expression before or after the pipe.

Python Example:

import re

text = "cat dog bird fish"

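# Match either 'cat' or 'dog'
match_alternation = re.findall(r"cat|dog", text)
print(f"Match 'cat|dog': {match_alternation}")  # Output: Match 'cat|dog': ['cat', 'dog']

# Group 'cat' or 'dog' and require a trailing 's' (matches 'cats' or 'dogs').
# Because the pattern contains one capturing group, findall returns the text
# captured by the group rather than the whole match.
text2 = "cats dogs bird fish"
match_grouping = re.findall(r"(cat|dog)s", text2)
print(f"Match '(cat|dog)s': {match_grouping}")  # Output: Match '(cat|dog)s': ['cat', 'dog']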
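
As a further sketch of the two uses of parentheses described above (applying a quantifier to a group and capturing part of a match), the following uses re.search and re.finditer, whose match objects expose the whole match as group(0) and the first captured group as group(1). The input strings are invented for illustration:

```python
import re

# Retrieve the whole match instead of only the captured group.
whole_matches = [m.group(0) for m in re.finditer(r"(cat|dog)s", "cats dogs bird fish")]
print(whole_matches)  # ['cats', 'dogs']

# A quantifier after a group applies to the entire group.
match = re.search(r"(ab)+", "xxabababyy")
print(match.group(0))  # 'ababab' (the whole match)
print(match.group(1))  # 'ab' (the last repetition captured by the group)

# Alternation has low precedence, so grouping changes what is matched:
# 'gr(a|e)y' means 'gray' or 'grey', while 'gra|ey' means 'gra' or 'ey'.
with_group = re.search(r"gr(a|e)y", "grey").group(0)
without_group = re.search(r"gra|ey", "grey").group(0)
print(with_group)     # 'grey'
print(without_group)  # 'ey'
```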

Escaping Special Characters

Since regular expressions use certain characters to represent special meanings (like ., *, +, ?, (, ), [, ], {, }, \, |, ^, $), you need to escape these characters with a backslash (\) if you want to match them literally.

Python Example:

import re

text = "1 + 1 = 2"

# To match the '+' sign literally, escape it with a backslash
match_plus_literal = re.findall(r"\+", text)
print(f"Match '+': {match_plus_literal}")  # Output: Match '+': ['+']
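
Forgetting to escape a special character often produces a pattern that still runs but matches more than intended. The sketch below illustrates this with the dot, and also shows re.escape, a standard-library helper that escapes a literal string for you. The sample text is made up for illustration:

```python
import re

text = "Version 1.5 replaced version 105."

# Unescaped, the dot matches any character, so '105' matches as well.
unescaped = re.findall(r"1.5", text)
print(unescaped)  # ['1.5', '105']

# Escaped, the dot matches only a literal '.'.
escaped = re.findall(r"1\.5", text)
print(escaped)  # ['1.5']

# re.escape builds the escaped pattern from a literal string, which is
# useful when the text to match is not known in advance.
auto_escaped = re.findall(re.escape("1.5"), text)
print(auto_escaped)  # ['1.5']
```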

Conclusion

This document has provided an overview of the basic syntax elements used in regular expressions. Mastering these elements will enable you to effectively search, extract, and manipulate text in Python. Remember to consult the official Python documentation for a comprehensive understanding of regular expressions.