Regular Expressions
Learn how to use regular expressions to search, match, and manipulate text.
Introduction to Regular Expressions in Python
What are Regular Expressions?
Regular expressions (regex or regexp) are sequences of characters that define a search pattern. They are a powerful tool for searching, matching, and manipulating text based on specific patterns. Think of them as highly flexible and specialized wildcards.
Why are Regular Expressions Important?
Regular expressions are fundamental to text processing because they allow you to:
- Search: Quickly find specific patterns within large bodies of text.
- Validate: Ensure data conforms to a required format (e.g., email addresses, phone numbers).
- Extract: Pull out relevant information from text (e.g., extracting all dates from a document).
- Replace: Modify text by substituting patterns with different values.
- Split: Divide text into smaller chunks based on a pattern.
These capabilities are essential in various applications, including data cleaning, web scraping, log file analysis, and more.
Fundamental Concepts of Regular Expressions
1. Literals
The simplest regex pattern is a literal, which matches that exact sequence of characters. For example, the regex "hello" matches the text "hello" wherever it appears in the searched string, including inside a longer word such as "othello".
Example in Python:
import re

text = "hello world"
pattern = "hello"

# re.search scans the whole string and returns a match object, or None.
match = re.search(pattern, text)
if match:
    print("Match found!")
else:
    print("No match found!")
2. Metacharacters
Metacharacters are special characters that have predefined meanings in regular expressions. They provide the power and flexibility to define complex patterns.
Some common metacharacters include:
- . (dot): Matches any single character (except newline).
- * (asterisk): Matches zero or more occurrences of the preceding character or group.
- + (plus): Matches one or more occurrences of the preceding character or group.
- ? (question mark): Matches zero or one occurrence of the preceding character or group.
- [] (square brackets): Defines a character class, matching any character within the brackets.
- ^ (caret): Matches the beginning of a string or line (depending on context).
- $ (dollar sign): Matches the end of a string or line.
- \ (backslash): Escapes a metacharacter to treat it as a literal, or introduces a special sequence.
- | (pipe): Represents "or", allowing you to match one of several alternatives.
- () (parentheses): Groups characters together and captures the matched group.
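The metacharacters listed above can be tried out in a short sketch. The sample strings and variable names here are illustrative assumptions, not part of the original tutorial.

```python
import re

# Dot: any single character except a newline, so "c\nt" is not matched.
dot_matches = re.findall(r"c.t", "cat cot cut c\nt")
print(dot_matches)  # ['cat', 'cot', 'cut']

# Anchors: ^ matches at the start of the string, $ at the end.
starts_with_hello = bool(re.search(r"^hello", "hello world"))
ends_with_world = bool(re.search(r"world$", "hello world"))
print(starts_with_hello, ends_with_world)  # True True

# Alternation and grouping: (cat|dog) matches either word and captures it.
match = re.search(r"(cat|dog)s", "I like dogs")
print(match.group(0))  # dogs (the whole match)
print(match.group(1))  # dog  (the captured group)

# Backslash: escape the dot so it matches a literal period only.
decimals = re.findall(r"[0-9]\.[0-9]", "3.5 and 4x2")
print(decimals)  # ['3.5']
```

Writing patterns as raw strings (the r"..." prefix) keeps Python from interpreting the backslashes before the re module sees them.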
3. Character Classes
Character classes define a set of characters that can be matched at a single position. They are defined using square brackets [].
- [abc]: Matches either 'a', 'b', or 'c'.
- [a-z]: Matches any lowercase letter.
- [A-Z]: Matches any uppercase letter.
- [0-9]: Matches any digit.
- [^abc]: Matches any character *except* 'a', 'b', or 'c'. The ^ as the first character inside the brackets negates the character class.
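A minimal sketch of the character classes above, using re.findall to list every character each class matches. The sample text and variable names are made up for illustration.

```python
import re

text = "Order A7 shipped to Bob on day 42"

# [aeiou]: any one of the listed characters.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)  # ['e', 'e']

# [A-Z]: any uppercase ASCII letter.
uppercase = re.findall(r"[A-Z]", text)
print(uppercase)  # ['O', 'A', 'B']

# [0-9]: any ASCII digit.
digits = re.findall(r"[0-9]", text)
print(digits)  # ['7', '4', '2']

# [^aeiou]: any character that is NOT one of the listed characters.
non_vowels = re.findall(r"[^aeiou]", "regex")
print(non_vowels)  # ['r', 'g', 'x']
```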
4. Quantifiers
Quantifiers specify how many times a preceding character or group should be repeated.
- *: Zero or more times.
- +: One or more times.
- ?: Zero or one time.
- {n}: Exactly n times.
- {n,}: n or more times.
- {n,m}: Between n and m times (inclusive).
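To see how each quantifier changes what is matched, here is a small sketch. The test strings are arbitrary examples chosen so that every quantifier accepts some inputs and rejects others.

```python
import re

# * : 'g' followed by zero or more 'o' characters.
star = re.findall(r"go*", "g go goo")
print(star)  # ['g', 'go', 'goo']

# + : 'g' followed by one or more 'o' characters.
plus = re.findall(r"go+", "g go goo")
print(plus)  # ['go', 'goo']

# ? : the 'u' is optional.
optional = re.findall(r"colou?r", "color colour")
print(optional)  # ['color', 'colour']

# {n} : exactly two 'b' characters between 'a' and 'c'.
exact = re.findall(r"ab{2}c", "abc abbc abbbc")
print(exact)  # ['abbc']

# {n,} : two or more 'b' characters.
at_least = re.findall(r"ab{2,}c", "abc abbc abbbbc")
print(at_least)  # ['abbc', 'abbbbc']

# {n,m} : between two and three 'b' characters, inclusive.
ranged = re.findall(r"ab{2,3}c", "abc abbc abbbc abbbbc")
print(ranged)  # ['abbc', 'abbbc']
```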
5. Special Sequences
Special sequences, introduced by a backslash \, provide shortcuts for common character classes. The bracket equivalents below hold for ASCII text; with Python 3 string patterns, \d, \w, and \s also match the corresponding Unicode characters unless the re.ASCII flag is set.
- \d: Matches any digit (equivalent to [0-9]).
- \D: Matches any non-digit character (equivalent to [^0-9]).
- \w: Matches any word character (alphanumeric and underscore) (equivalent to [a-zA-Z0-9_]).
- \W: Matches any non-word character (equivalent to [^a-zA-Z0-9_]).
- \s: Matches any whitespace character (space, tab, newline, etc.).
- \S: Matches any non-whitespace character.
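The special sequences can be combined with quantifiers, as in this short sketch. The sample text and variable names are illustrative assumptions.

```python
import re

text = "Call 555-0199 at 9am, room_12!"

# \d+ : runs of digits.
digit_runs = re.findall(r"\d+", text)
print(digit_runs)  # ['555', '0199', '9', '12']

# \w+ : runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", text)
print(words)  # ['Call', '555', '0199', 'at', '9am', 'room_12']

# \D : remove every non-digit character from a phone number.
digits_only = re.sub(r"\D", "", "555-0199")
print(digits_only)  # 5550199

# \S+ : runs of non-whitespace, separated by any mix of spaces, tabs, newlines.
tokens = re.findall(r"\S+", "a  b\tc\nd")
print(tokens)  # ['a', 'b', 'c', 'd']
```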
Regular Expressions in Python
Python provides the re module for working with regular expressions. The re module offers various functions for searching, matching, and manipulating strings based on regex patterns.
Common re Module Functions
- re.search(pattern, string): Searches the string for the first occurrence of the pattern. Returns a match object if found, otherwise None.
- re.match(pattern, string): Attempts to match the pattern at the *beginning* of the string. Returns a match object if successful, otherwise None.
- re.findall(pattern, string): Finds all non-overlapping occurrences of the pattern in the string and returns them as a list of strings. If the pattern contains capturing groups, the list holds the group contents instead (tuples when there is more than one group).
- re.finditer(pattern, string): Finds all non-overlapping occurrences of the pattern in the string and returns them as an iterator of match objects.
- re.sub(pattern, replacement, string): Replaces all occurrences of the pattern in the string with the replacement string and returns the new string.
- re.split(pattern, string): Splits the string into a list of substrings based on the occurrences of the pattern.
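The functions above can be compared side by side on one input. This is a minimal sketch; the log-like sample text, the date pattern, and the variable names are assumptions made for illustration.

```python
import re

text = "2024-01-15 error, 2024-02-20 warning"
date = r"\d{4}-\d{2}-\d{2}"

# search: first match anywhere in the string.
first = re.search(date, text)
print(first.group())  # 2024-01-15

# match: only succeeds at the beginning, so "error" is not found here.
at_start = re.match(r"error", text)
print(at_start)  # None

# findall: every match as a list of strings.
all_dates = re.findall(date, text)
print(all_dates)  # ['2024-01-15', '2024-02-20']

# findall with capturing groups returns tuples of the groups instead.
year_month = re.findall(r"(\d{4})-(\d{2})", text)
print(year_month)  # [('2024', '01'), ('2024', '02')]

# finditer: match objects, which also expose positions.
positions = [m.start() for m in re.finditer(date, text)]
print(positions)  # [0, 18]

# sub: replace every match.
masked = re.sub(date, "<date>", text)
print(masked)  # <date> error, <date> warning

# split: cut the string at each comma and any whitespace after it.
parts = re.split(r",\s*", text)
print(parts)  # ['2024-01-15 error', '2024-02-20 warning']
```

Because re.search and re.match return None when nothing matches, check the result before calling methods such as group() on it.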