Regular Expressions
Learn how to use regular expressions to search, match, and manipulate text.
Regular Expressions in Python
What are Regular Expressions?
Regular expressions (regex) are sequences of characters that define a search pattern. They are a powerful tool for pattern matching within strings and are widely used for tasks like data validation, searching and replacing text, and data extraction. In Python, the re module provides support for working with regular expressions.
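As a minimal sketch of the three tasks mentioned above (searching, extracting, and replacing), the snippet below uses re.search, re.findall, and re.sub. The sample text and variable names are made up for illustration.

```python
import re

text = "Order 1042 shipped on 2024-03-15; order 1043 ships on 2024-03-18."

# Searching: find the first date-like substring (YYYY-MM-DD).
first_date = re.search(r"\d{4}-\d{2}-\d{2}", text)
first_date_text = first_date.group() if first_date else None

# Extraction: collect every date-like substring.
all_dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Replacing: mask the order numbers but keep the word "order" as written.
masked = re.sub(r"(order )\d+", r"\1####", text, flags=re.IGNORECASE)

print(first_date_text)
print(all_dates)
print(masked)
```

re.search returns a match object (or None when nothing matches), which is why the result is checked before calling group().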
Common Regular Expression Patterns
Here's a breakdown of common regex patterns and their meanings:
- . (dot): Matches any single character except a newline (unless the re.DOTALL flag is set).
- ^ (caret): Matches the beginning of the string (and, with re.MULTILINE, the beginning of each line).
- $ (dollar): Matches the end of the string, or just before a newline at the end of the string (and, with re.MULTILINE, the end of each line).
- * (asterisk): Matches zero or more occurrences of the preceding character or group.
- + (plus): Matches one or more occurrences of the preceding character or group.
- ? (question mark): Matches zero or one occurrence of the preceding character or group. When placed directly after another quantifier (as in *?, +?, or {n,m}?), it makes that quantifier lazy (non-greedy).
- [] (square brackets): Defines a character class. Matches any single character within the brackets, e.g., [abc] matches 'a', 'b', or 'c'.
- [^] (negated character class): Matches any single character *not* within the brackets, e.g., [^abc] matches any character except 'a', 'b', or 'c'.
- \d: Matches any digit. With the re.ASCII flag this is equivalent to [0-9]; by default, for str patterns, it also matches other Unicode decimal digits.
- \D: Matches any non-digit character.
- \w: Matches any word character (alphanumeric and underscore). With the re.ASCII flag this is equivalent to [a-zA-Z0-9_]; by default, for str patterns, it also matches Unicode letters and digits.
- \W: Matches any non-word character.
- \s: Matches any whitespace character (space, tab, newline, etc.).
- \S: Matches any non-whitespace character.
- | (pipe): Acts as an "or" operator, e.g., a|b matches either 'a' or 'b'.
- () (parentheses): Creates a capturing group. Allows you to extract portions of the matched string.
- {n}: Matches exactly n occurrences of the preceding character or group.
- {n,}: Matches n or more occurrences of the preceding character or group.
- {n,m}: Matches between n and m occurrences of the preceding character or group.
- \ (backslash): Used to escape special characters, allowing you to match them literally, e.g., \. matches a literal dot. Also introduces special sequences like \d, \w, and \s.
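A few of these patterns are easier to grasp side by side. The following sketch contrasts greedy and lazy quantifiers, pulls fields out with capturing groups, and uses alternation and a negated character class. The sample strings are invented for illustration.

```python
import re

quoted = 'say "hello" and "goodbye"'

# Greedy: .* grabs as much as possible, so one match spans both quotations.
greedy = re.findall(r'".*"', quoted)
# Lazy: .*? stops at the first closing quote, giving one match per quotation.
lazy = re.findall(r'".*?"', quoted)

# Capturing groups with {n} quantifiers pull out the parts of a date.
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released 2023-11-05")
year, month, day = match.groups()

# Alternation: match either word.
pets = re.findall(r"cat|dog", "cat, dog, bird")

# Negated character class: delete everything that is not a vowel.
vowels_only = re.sub(r"[^aeiou]", "", "regular")

print(greedy)       # ['"hello" and "goodbye"']
print(lazy)         # ['"hello"', '"goodbye"']
print(year, month, day)
print(pets, vowels_only)
```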
Common Validation Examples in Python
Here are examples of using regular expressions for common validation tasks:
Email Address Validation
This regex is a common starting point for email validation. It's not perfect and doesn't cover *all* valid email formats, but it's a good balance between complexity and effectiveness.
import re

def validate_email(email):
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    # fullmatch() also rejects a trailing newline, which match() with $ would accept
    return re.fullmatch(pattern, email) is not None

# Examples
email1 = "test@example.com"
email2 = "invalid-email"
email3 = "another.test@sub.example.co.uk"
print(f"{email1}: {validate_email(email1)}")  # True
print(f"{email2}: {validate_email(email2)}")  # False
print(f"{email3}: {validate_email(email3)}")  # True
Explanation:
- ^[a-zA-Z0-9._%+-]+: Matches one or more alphanumeric characters, dots, underscores, percent signs, plus signs, or hyphens at the beginning of the string.
- @: Matches the "@" symbol.
- [a-zA-Z0-9.-]+: Matches one or more alphanumeric characters, dots, or hyphens.
- \.: Matches a literal dot.
- [a-zA-Z]{2,}: Matches two or more alphabetic characters (for the domain extension).
- $: Matches the end of the string.
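To make the "not perfect" caveat concrete, the sketch below runs the same pattern over a few edge cases. It accepts some addresses that are not well formed and rejects some that are syntactically allowed. EMAIL_RE and is_plausible_email are illustrative names, not part of any library.

```python
import re

# Same pattern as above, compiled once; fullmatch() makes ^ and $ unnecessary.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_plausible_email(email):
    """Return True if the address fits the simplified pattern."""
    return EMAIL_RE.fullmatch(email) is not None

samples = [
    "a..b@example.com",        # accepted, although consecutive dots are not allowed unquoted
    "user@-example.com",       # accepted, although a host label should not start with a hyphen
    "user@localhost",          # rejected: the pattern requires a dot in the domain
    '"john doe"@example.com',  # rejected: quoted local parts are not covered
]
results = {sample: is_plausible_email(sample) for sample in samples}

for sample, ok in results.items():
    print(f"{sample}: {ok}")
```

If stricter checking matters, a common approach is to keep the regex loose and confirm the address by sending a verification message.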
Phone Number Validation
This regex validates a simple US phone number format (e.g., 123-456-7890 or 1234567890). It can be adjusted to fit other formats.
import re

def validate_phone_number(phone_number):
    pattern = r"^\d{3}[-]?\d{3}[-]?\d{4}$"
    # fullmatch() also rejects a trailing newline, which match() with $ would accept
    return re.fullmatch(pattern, phone_number) is not None

# Examples
phone1 = "123-456-7890"
phone2 = "1234567890"
phone3 = "123.456.7890"
phone4 = "123-4567-890"
print(f"{phone1}: {validate_phone_number(phone1)}")  # True
print(f"{phone2}: {validate_phone_number(phone2)}")  # True
print(f"{phone3}: {validate_phone_number(phone3)}")  # False
print(f"{phone4}: {validate_phone_number(phone4)}")  # False
Explanation:
- ^\d{3}: Matches three digits at the beginning of the string.
- [-]?: Matches an optional hyphen (zero or one occurrence).
- \d{3}: Matches three digits.
- [-]?: Matches an optional hyphen (zero or one occurrence).
- \d{4}: Matches four digits.
- $: Matches the end of the string.
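One way the pattern might be adjusted for other formats is sketched below: it also accepts dots or spaces as separators and an area code in parentheses, then normalizes the result to bare digits. PHONE_RE and normalize_phone are hypothetical names, and the set of accepted formats is an assumption rather than a standard.

```python
import re

# Area code either as "(123)" with an optional space, or as "123" with an
# optional separator; re.ASCII restricts \d to the digits 0-9.
PHONE_RE = re.compile(
    r"(?:\(\d{3}\)\s?|\d{3}[-. ]?)\d{3}[-. ]?\d{4}",
    re.ASCII,
)

def normalize_phone(phone_number):
    """Return the ten digits if the input looks like a US number, else None."""
    if PHONE_RE.fullmatch(phone_number) is None:
        return None
    # Strip everything that is not a digit.
    return re.sub(r"\D", "", phone_number, flags=re.ASCII)

for candidate in ["(123) 456-7890", "123.456.7890", "(123-456-7890", "123-4567-890"]:
    print(f"{candidate}: {normalize_phone(candidate)}")
```

Using alternation for the area code keeps an unbalanced parenthesis such as "(123-456-7890" from slipping through, which a simple \(?\d{3}\)? would allow.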
URL Validation
This regex provides a basic check for a URL format. More robust URL validation often involves using a dedicated library or service.
import re

def validate_url(url):
    # The path class uses a single * quantifier; a nested form such as
    # ([/\w.-]*)* can backtrack catastrophically on input that fails to match.
    pattern = r"^(https?://)?([\da-z.-]+)\.([a-z.]{2,6})([/\w.-]*)\/?$"
    return re.fullmatch(pattern, url) is not None

# Examples
url1 = "http://www.example.com"
url2 = "https://example.com/path/to/page.html"
url3 = "www.example.com"
url4 = "invalid-url"
print(f"{url1}: {validate_url(url1)}")  # True
print(f"{url2}: {validate_url(url2)}")  # True
print(f"{url3}: {validate_url(url3)}")  # True
print(f"{url4}: {validate_url(url4)}")  # False
Explanation:
- ^(https?://)?: Matches an optional "http://" or "https://" at the beginning of the string.
- ([\da-z.-]+): Matches one or more lowercase letters, digits, dots, or hyphens (the domain name).
- \.: Matches a literal dot.
- ([a-z.]{2,6}): Matches two to six lowercase letters or dots (the top-level domain).
- ([/\w.-]*): Matches zero or more slashes, word characters (letters, digits, underscores), dots, or hyphens (the path).
- \/?: Matches an optional trailing slash.
- $: Matches the end of the string.
Note that the pattern only matches lowercase schemes and domain names; pass re.IGNORECASE if uppercase input should be accepted.
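As a sketch of the library-based approach, the standard library's urllib.parse.urlsplit can split a URL into components that are then checked individually. The function name looks_like_http_url and the specific rules (http or https scheme, hostname containing a dot) are assumptions chosen for illustration, not a complete validator.

```python
from urllib.parse import urlsplit

def looks_like_http_url(url):
    """Rough check: http(s) scheme plus a hostname containing a dot."""
    try:
        parts = urlsplit(url)
        host = parts.hostname or ""
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs,
        # e.g. an unclosed IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and "." in host

for candidate in [
    "http://www.example.com",
    "https://example.com/path/to/page.html",
    "www.example.com",
    "invalid-url",
]:
    print(f"{candidate}: {looks_like_http_url(candidate)}")
```

Unlike the regex, this sketch rejects "www.example.com" because it has no scheme; whether scheme-less input should pass is a design decision for the caller.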
Important Considerations
- Complexity: Regular expressions can become complex and difficult to read. Commenting your regex and breaking down complex patterns into smaller, more manageable parts is good practice.
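- Performance: Complex regex patterns can impact performance, especially when used on large datasets. If performance becomes an issue, consider simplifying the pattern (for example, avoiding nested quantifiers) and compiling patterns you reuse with re.compile.
- Security: Be cautious when applying regular expressions to user-supplied input. Patterns with nested or overlapping quantifiers can backtrack excessively on crafted input, which makes them vulnerable to Regular Expression Denial of Service (ReDoS) attacks.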
- Limitations: Regular expressions are not always the best tool for parsing complex or highly structured data (e.g., HTML, XML). Dedicated parsers are often a better choice in these scenarios.
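The points on complexity and security can be sketched together: re.VERBOSE lets a pattern be split across lines with inline comments, and a length check before matching bounds how much work untrusted input can trigger. MAX_INPUT_LENGTH is a hypothetical limit, and the names below are illustrative only.

```python
import re

# Hypothetical limit; choose one that suits the inputs you expect.
MAX_INPUT_LENGTH = 32

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment.
PHONE_RE = re.compile(
    r"""
    \d{3}    # area code
    -?       # optional hyphen
    \d{3}    # exchange
    -?       # optional hyphen
    \d{4}    # line number
    """,
    re.VERBOSE,
)

def is_valid_phone(value):
    # Rejecting oversized input first bounds the work a crafted string can cause.
    if len(value) > MAX_INPUT_LENGTH:
        return False
    return PHONE_RE.fullmatch(value) is not None

print(is_valid_phone("123-456-7890"))
print(is_valid_phone("123.456.7890"))
```

A length cap does not make a badly structured pattern safe by itself, but it limits the damage while the pattern is reviewed.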