Regular Expressions
Learn how to use regular expressions to search, match, and manipulate text.
Regular Expressions in Python: Character Classes and Special Characters
Introduction
Regular expressions (regex) are powerful tools for pattern matching in strings. Python's re module provides comprehensive support for working with regex. Understanding character classes and special characters is crucial for building effective regular expressions.
Character Classes
Character classes are shorthand ways to represent a set of characters. They make regular expressions more concise and readable. Here are some common character classes:
- \d: Matches any digit (0-9). Equivalent to [0-9].
- \w: Matches any word character (letters, digits, and underscore). Equivalent to [a-zA-Z0-9_].
- \s: Matches any whitespace character (space, tab, newline, etc.). Equivalent to [ \t\n\r\f\v].
- \D: Matches any non-digit character. Equivalent to [^0-9].
- \W: Matches any non-word character. Equivalent to [^a-zA-Z0-9_].
- \S: Matches any non-whitespace character. Equivalent to [^ \t\n\r\f\v].
- [abc]: Matches any character within the brackets (in this case, 'a', 'b', or 'c').
- [^abc]: Matches any character not within the brackets (in this case, anything except 'a', 'b', or 'c').
- [a-z]: Matches any lowercase letter from 'a' to 'z'.
- [A-Z]: Matches any uppercase letter from 'A' to 'Z'.
- [0-9]: Matches any digit from '0' to '9'.
The bracket equivalents given for \d, \w, \s and their negations are exact for ASCII text or when the re.ASCII flag is set. With ordinary Python 3 string patterns, these classes also match the corresponding Unicode digits, letters, and whitespace.
Example: The regex \d{3}-\d{3}-\d{4} matches a US phone number format like "123-456-7890".
import re
pattern = r"\d{3}-\d{3}-\d{4}"
text = "My phone number is 123-456-7890."
match = re.search(pattern, text)
if match:
    print("Match found:", match.group(0))  # Output: Match found: 123-456-7890
else:
    print("No match found.")
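To see several of the character classes listed above side by side, the following sketch runs them over a sample string with re.findall. The sample strings and variable names are made up for illustration; the last two lines show the effect of the re.ASCII flag on \d.

```python
import re

sample = "Order 66: 3 apples, 12 pears_ok!"  # hypothetical sample text

digits = re.findall(r"\d+", sample)                   # runs of digits
words = re.findall(r"\w+", sample)                    # runs of word characters
non_vowels = re.findall(r"[^aeiou\s]", "a big cat")   # neither a vowel nor whitespace
upper = re.findall(r"[A-Z]", sample)                  # uppercase ASCII letters

print(digits)      # ['66', '3', '12']
print(words)       # ['Order', '66', '3', 'apples', '12', 'pears_ok']
print(non_vowels)  # ['b', 'g', 'c', 't']
print(upper)       # ['O']

# "\u0663" is the Arabic-Indic digit three: \d matches it unless re.ASCII is set.
mixed = "\u0663" + "5"
unicode_digits = re.findall(r"\d", mixed)
ascii_digits = re.findall(r"\d", mixed, re.ASCII)
print(len(unicode_digits), len(ascii_digits))  # 2 1
```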
Special Characters
Special characters have predefined meanings within regular expressions. To match them literally, you usually need to escape them with a backslash (\).
- . (dot): Matches any single character except a newline (the re.DOTALL flag makes it match newlines as well).
- * (asterisk): Matches the preceding element zero or more times.
- + (plus): Matches the preceding element one or more times.
- ? (question mark): Matches the preceding element zero or one time (makes the preceding element optional).
- ^ (caret): Matches the beginning of the string (or of each line if the MULTILINE flag is used).
- $ (dollar sign): Matches the end of the string (or of each line if the MULTILINE flag is used).
- \ (backslash): Used to escape special characters or to denote character classes. For example, \. matches a literal dot.
- | (pipe): Acts as an "or" operator. For example, a|b matches either "a" or "b".
- () (parentheses): Used to group part of a pattern and capture what it matches. You can access these captured groups using match.group(n), where n is the group number (starting from 1; group 0 is the whole match).
- [] (square brackets): Defines a character set (as discussed in the Character Classes section).
- {} (curly braces): Specifies a quantifier for the preceding character or group. For example, a{3} matches exactly three "a"s, and a{2,4} matches two to four "a"s.
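A few of the special characters above are easier to follow in action. The sketch below is illustrative only: the date string, log text, and variable names are invented for the example.

```python
import re

# Capture groups: group(0) is the whole match, group(1), group(2), ... are the
# parenthesized parts, numbered left to right.
date_match = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2024-03-15.")
whole = date_match.group(0)
year, month, day = date_match.group(1), date_match.group(2), date_match.group(3)
print(whole, year, month, day)  # 2024-03-15 2024 03 15

# Alternation: the pipe chooses between the alternatives on either side.
pets = re.findall(r"cat|dog", "a cat, a dog, and a bird")
print(pets)  # ['cat', 'dog']

# ^ matches only at the start of the string unless re.MULTILINE is given.
log = "ok: start\nerror: disk full\nok: done"
first_only = re.findall(r"^\w+", log)
every_line = re.findall(r"^\w+", log, re.MULTILINE)
print(first_only)  # ['ok']
print(every_line)  # ['ok', 'error', 'ok']

# An escaped dot matches only a literal "."; an unescaped dot matches any character.
literal_dot = re.findall(r"3\.14", "3.14 and 3x14")
any_char = re.findall(r"3.14", "3.14 and 3x14")
print(literal_dot)  # ['3.14']
print(any_char)     # ['3.14', '3x14']
```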
Example: The regex a.*b matches a stretch of text that starts with "a" and ends with "b", with any characters (other than newlines) in between. With re.search, that stretch can occur anywhere in the string.
import re
pattern = r"a.*b"
text1 = "acdefgb"
text2 = "cdefg"
match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)
if match1:
    print("Match found in text1:", match1.group(0))  # Output: Match found in text1: acdefgb
else:
    print("No match found in text1.")
if match2:
    print("Match found in text2:", match2.group(0))
else:
    print("No match found in text2.")  # Output: No match found in text2.
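Example: Matching an IPv4 address. This pattern only checks the shape (four groups of one to three digits separated by dots), so it would also accept out-of-range values such as 999.999.999.999.
import re
pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
text = "The server's IP address is 192.168.1.100."
match = re.search(pattern, text)
if match:
    print("IP Address Found:", match.group(0))  # Output: IP Address Found: 192.168.1.100
else:
    print("No IP address found.")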
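The pattern above accepts any run of one to three digits in each position, including values above 255. One possible refinement, sketched here under the assumption that checking the numeric range in plain Python is acceptable, is to capture the four parts and test them after the match. The names IPV4_SHAPE and find_ipv4 are hypothetical helpers, not part of the re module.

```python
import re

# Same shape as the pattern above, with each part captured in a group.
IPV4_SHAPE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")

def find_ipv4(text):
    """Return the first dotted quad in text whose four parts are all 0-255, else None."""
    for match in IPV4_SHAPE.finditer(text):
        if all(int(part) <= 255 for part in match.groups()):
            return match.group(0)
    return None

print(find_ipv4("The server's IP address is 192.168.1.100."))  # 192.168.1.100
print(find_ipv4("Bogus value: 999.999.999.999"))               # None
```

This is only a sketch: it does not guard against digits adjacent to the match, and for strict validation of a whole string the standard library's ipaddress module may be a better fit.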
Example: Matching an email address.
import re
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
text = "Contact us at support@example.com for assistance."
match = re.search(pattern, text)
if match:
    print("Email Address Found:", match.group(0))  # Output: Email Address Found: support@example.com
else:
    print("No email address found.")
Quantifiers
Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found. They are crucial for creating flexible and powerful regex patterns.
- *: Matches the preceding element zero or more times. Equivalent to {0,}.
- +: Matches the preceding element one or more times. Equivalent to {1,}.
- ?: Matches the preceding element zero or one time. Equivalent to {0,1}. Makes the preceding element optional.
- {n}: Matches the preceding element exactly n times.
- {n,}: Matches the preceding element n or more times.
- {n,m}: Matches the preceding element between n and m times (inclusive).
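The quantifiers listed above can be compared by testing each one against the same small set of strings. In this sketch the candidates list and the matching helper are made up for the example; re.fullmatch is used so that a pattern must account for the entire string.

```python
import re

candidates = ["ab", "aab", "aaab", "aaaaab", "b"]  # hypothetical test strings

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(matching(r"a*b"))      # ['ab', 'aab', 'aaab', 'aaaaab', 'b']
print(matching(r"a+b"))      # ['ab', 'aab', 'aaab', 'aaaaab']
print(matching(r"a?b"))      # ['ab', 'b']
print(matching(r"a{3}b"))    # ['aaab']
print(matching(r"a{2,}b"))   # ['aab', 'aaab', 'aaaaab']
print(matching(r"a{2,3}b"))  # ['aab', 'aaab']
```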
Greedy vs. Lazy Quantifiers:
By default, quantifiers in regular expressions are greedy. This means they will try to match as much as possible while still allowing the overall pattern to match. To make a quantifier lazy (or reluctant), you append a ? to it. A lazy quantifier will match as little as possible.
For example, given the string "abbbbbc" and the regex "ab+", the greedy quantifier + matches all the "b"s, giving "abbbbb". With the regex "ab+?", the lazy quantifier +? matches only one "b", giving "ab". Laziness never overrides the rest of the pattern, though: "ab+?c" still matches the whole of "abbbbbc", because the "c" can only be reached after every "b" has been consumed.
import re
text = "abbbbbc"
# Greedy quantifier
pattern_greedy = r"ab+"
match_greedy = re.search(pattern_greedy, text)
print(f"Greedy match: {match_greedy.group(0) if match_greedy else None}")  # Greedy match: abbbbb
# Lazy quantifier
pattern_lazy = r"ab+?"
match_lazy = re.search(pattern_lazy, text)
print(f"Lazy match: {match_lazy.group(0) if match_lazy else None}")  # Lazy match: ab
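The difference between greedy and lazy matching tends to matter most when a pattern is bracketed by delimiters that occur more than once. The following sketch uses an invented snippet of markup to illustrate this; it is not a recommendation to parse HTML with regular expressions.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

# Greedy: .+ runs to the end of the string, then backs off to the last ">".
greedy = re.findall(r"<.+>", html)
# Lazy: .+? stops at the first ">" that lets the pattern succeed.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```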
Common Mistakes
- Forgetting to escape special characters: If you want to match a literal ., *, +, ?, etc., remember to escape it with a backslash (e.g., \.).
- Incorrectly using character classes: Make sure you understand the meaning of each character class (\d, \w, \s, etc.) and use the appropriate one for your needs.
- Overly complex regex: Sometimes a simple regex is better than a complex one. Break down your pattern into smaller, more manageable parts if necessary.
- Not accounting for case sensitivity: Remember that regex matching is case-sensitive by default. Use the re.IGNORECASE flag or the (?i) inline flag if you want to perform a case-insensitive match.
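Two of the mistakes above, unescaped special characters and case sensitivity, can be demonstrated in a few lines. The strings below are invented for illustration. re.escape, which is not mentioned earlier in this section, is a standard re function that escapes every special character in a piece of text so it can be matched literally.

```python
import re

# An unescaped dot matches any character, so "file_txt" slips through.
unescaped_hit = bool(re.fullmatch(r"file.txt", "file_txt"))
escaped_hit = bool(re.fullmatch(r"file\.txt", "file_txt"))
print(unescaped_hit, escaped_hit)  # True False

# re.escape turns arbitrary text into a pattern that matches it literally.
price = "$4.99 (approx.)"
found = re.search(re.escape(price), "Total: $4.99 (approx.) per unit") is not None
print(found)  # True

# Matching is case-sensitive unless a flag says otherwise.
sensitive = re.findall(r"python", "Python and PYTHON")
with_flag = re.findall(r"python", "Python and PYTHON", re.IGNORECASE)
with_inline = re.findall(r"(?i)python", "Python and PYTHON")
print(sensitive)    # []
print(with_flag)    # ['Python', 'PYTHON']
print(with_inline)  # ['Python', 'PYTHON']
```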