Regular expressions (often abbreviated as regex or regexp) are one of the most powerful tools in programming, especially in Python. They allow you to search, match, and manipulate text with precision and efficiency. If you’ve ever wondered how to validate an email, extract data from logs, or parse complex text, then regex is your new best friend. Let’s dive deep into Python’s re module, starting from the basics and building up to advanced concepts.
What is a Regular Expression?
At its core, a regular expression is a sequence of characters that defines a search pattern. Think of it as a special language for finding patterns in text. For example:
- Pattern: cat
- Text: “The cat sat on the mat.”
- Match: Yes, the word “cat” is present.
Python provides the re module to work with regular expressions. To use it, simply import the module:

    import re

Basic Concepts and Syntax
1. Matching Literal Characters
A regex matches characters exactly unless you use special symbols. For example:

    pattern = r"cat"
    text = "The cat is here."
    match = re.search(pattern, text)
    print(match.group() if match else "No match")

- r"cat": The r prefix denotes a raw string, ensuring backslashes are treated literally.
- Output: cat
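To see why the raw-string prefix matters, here is a small sketch. It uses \b (a word boundary), which is not covered above and is assumed here only for illustration; the variable names are likewise illustrative.

```python
import re

text = "The cat is here."

# Without the r prefix, Python turns "\b" into a backspace character
# before the regex engine ever sees the pattern.
plain = "\bcat\b"
# With the r prefix, the backslash survives, and the regex engine
# reads \b as a word boundary.
raw = r"\bcat\b"

plain_match = re.search(plain, text)
raw_match = re.search(raw, text)

print(plain_match)        # None
print(raw_match.group())  # cat
```

Writing every pattern as a raw string avoids having to remember which backslash sequences Python itself interprets.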
2. Metacharacters
Metacharacters are symbols with special meanings in regex. Some of the most common ones are:
Metacharacter | Description
---|---
. | Matches any character except \n.
^ | Matches the start of a string.
$ | Matches the end of a string.
* | Matches 0 or more repetitions of the preceding element.
+ | Matches 1 or more repetitions of the preceding element.
? | Matches 0 or 1 repetition of the preceding element.
`\|` | Acts as an OR operator between alternatives.
Example:

    pattern = r"ca."
    text = "cat, car, cab"
    matches = re.findall(pattern, text)
    print(matches)  # ['cat', 'car', 'cab']

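The remaining metacharacters from the table can be tried out in the same way. The sample strings below are made up for illustration:

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_the = re.search(r"^The", "The cat sat.")
ends_with_mat = re.search(r"mat$", "The cat sat on the mat")

# *, + and ? control how often the preceding element may repeat.
star = re.findall(r"ab*", "a ab abb")              # b repeated 0 or more times
plus = re.findall(r"ab+", "a ab abb")              # b repeated 1 or more times
optional = re.findall(r"colou?r", "color colour")  # u present 0 or 1 times

# | matches either alternative.
either = re.findall(r"cat|dog", "cat, dog, bird")

print(bool(starts_with_the), bool(ends_with_mat))  # True True
print(star)      # ['a', 'ab', 'abb']
print(plus)      # ['ab', 'abb']
print(optional)  # ['color', 'colour']
print(either)    # ['cat', 'dog']
```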
3. Character Classes
Character classes let you specify a set of characters to match. For example:
- [abc]: Matches any one of a, b, or c.
- [a-z]: Matches any lowercase letter.
- [^a-z]: Matches anything except lowercase letters.
Example:

    pattern = r"[aeiou]"
    text = "Python is awesome."
    matches = re.findall(pattern, text)
    print(matches)  # ['o', 'i', 'a', 'e', 'o', 'e']

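Ranges and negation work the same way. This short sketch uses invented sample strings to show each form from the list above:

```python
import re

# [a-z] matches one lowercase letter at a time.
lowercase = re.findall(r"[a-z]", "Hi Bob")

# [^a-z] matches every character that is NOT a lowercase letter,
# which includes uppercase letters and the space.
not_lowercase = re.findall(r"[^a-z]", "Hi Bob")

# Several ranges can be combined inside one class.
hex_like = re.findall(r"[0-9a-f]+", "ff 1a zz 09")

print(lowercase)      # ['i', 'o', 'b']
print(not_lowercase)  # ['H', ' ', 'B']
print(hex_like)       # ['ff', '1a', '09']
```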
Intermediate Techniques
1. Quantifiers
Quantifiers define how many times a character or group can repeat:
Quantifier | Description
---|---
{n} | Exactly n times.
{n,} | At least n times.
{n,m} | Between n and m times.
Example:

    pattern = r"a{2,3}"
    text = "aaa a aa aaa"
    matches = re.findall(pattern, text)
    print(matches)  # ['aaa', 'aa', 'aaa'] (the single "a" is too short to match)

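Each quantifier form from the table can be compared on the same input. In this sketch the \b word boundaries are an added assumption (they are not introduced above); they keep a longer run of digits from being matched in part.

```python
import re

text = "1 22 333 4444 55555"

# {n}: exactly three digits
exactly_three = re.findall(r"\b\d{3}\b", text)
# {n,}: three or more digits
at_least_three = re.findall(r"\b\d{3,}\b", text)
# {n,m}: between two and four digits
two_to_four = re.findall(r"\b\d{2,4}\b", text)

print(exactly_three)   # ['333']
print(at_least_three)  # ['333', '4444', '55555']
print(two_to_four)     # ['22', '333', '4444']
```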
2. Grouping and Capturing
Parentheses () group parts of a regex and capture matched text:

    pattern = r"(\w+)@(\w+)\.com"  # \. matches a literal period
    text = "Contact us at support@example.com."
    match = re.search(pattern, text)
    print(match.groups())  # ('support', 'example')

- \w+: Matches one or more word characters.
- groups(): Returns all captured groups.
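Captured groups can also be read one at a time, and re.findall returns one tuple per match when the pattern contains several groups. A brief sketch with made-up addresses:

```python
import re

pattern = r"(\w+)@(\w+)\.com"
text = "Write to alice@example.com or bob@test.com."

match = re.search(pattern, text)
print(match.group(0))  # alice@example.com (the whole match)
print(match.group(1))  # alice
print(match.group(2))  # example

# With two capturing groups, findall yields a (group1, group2) tuple per match.
pairs = re.findall(pattern, text)
print(pairs)  # [('alice', 'example'), ('bob', 'test')]
```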
3. Escape Sequences
Escape special characters with \ if you want to match them literally. A backslash followed by certain letters also forms a shorthand character class:
- \. matches a literal period (.).
- \d matches any digit (equivalent to [0-9] for ASCII text).
- \D matches non-digit characters.
- \s matches any whitespace (spaces, tabs, etc.).
Example:

    pattern = r"\d{4}"
    text = "Year: 2023."
    match = re.search(pattern, text)
    print(match.group())  # 2023

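The difference between an escaped and an unescaped metacharacter is easiest to see side by side. The sketch below also uses re.escape, a standard-library helper not mentioned above that escapes a whole string for literal matching; the sample strings are illustrative.

```python
import re

# An escaped period matches only a period; a bare . matches any character.
literal_dot = re.findall(r"\d\.\d", "3.5 and 3-5")
any_char = re.findall(r"\d.\d", "3.5 and 3-5")

# \D removes everything that is not a digit; \s+ splits on runs of whitespace.
digits_only = re.sub(r"\D", "", "(555) 123-4567")
words = re.split(r"\s+", "a  b\tc\nd")

# re.escape builds a pattern that matches the given text literally.
unescaped = re.findall("1+1", "1+1=2, 11=11")
escaped = re.findall(re.escape("1+1"), "1+1=2, 11=11")

print(literal_dot)  # ['3.5']
print(any_char)     # ['3.5', '3-5']
print(digits_only)  # 5551234567
print(words)        # ['a', 'b', 'c', 'd']
print(unescaped)    # ['11', '11']
print(escaped)      # ['1+1']
```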
Advanced Topics
1. Lookahead and Lookbehind
These are special assertions to match patterns without including them in the result.
- Positive Lookahead (?=): Ensures a pattern is followed by another.
- Negative Lookahead (?!): Ensures a pattern is not followed by another.
- Positive Lookbehind (?<=): Ensures a pattern is preceded by another.
- Negative Lookbehind (?<!): Ensures a pattern is not preceded by another.
Example:

    pattern = r"\d+(?=\sUSD)"
    text = "Price: 50 USD, 70 EUR"
    matches = re.findall(pattern, text)
    print(matches)  # ['50']

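All four assertions can be compared on one string. This is a sketch with an invented price list; the \b word boundaries are an added assumption that stops the engine from matching only part of a number.

```python
import re

text = "Price: 50 USD, 70 EUR, $30"

# Positive lookahead: numbers followed by " USD"
followed_by_usd = re.findall(r"\d+(?=\sUSD)", text)
# Negative lookahead: whole numbers NOT followed by " USD"
not_followed_by_usd = re.findall(r"\b\d+\b(?!\sUSD)", text)
# Positive lookbehind: numbers preceded by a dollar sign
after_dollar = re.findall(r"(?<=\$)\d+", text)
# Negative lookbehind: whole numbers NOT preceded by a dollar sign
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

print(followed_by_usd)      # ['50']
print(not_followed_by_usd)  # ['70', '30']
print(after_dollar)         # ['30']
print(not_after_dollar)     # ['50', '70']
```

In every case the surrounding text (USD, $) is checked but left out of the returned match.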
2. Flags
Flags modify regex behavior. Common flags include:
- re.IGNORECASE or re.I: Makes matching case-insensitive.
- re.MULTILINE or re.M: Allows ^ and $ to match at the start and end of each line.
- re.DOTALL or re.S: Makes . match newline characters as well.
Example:

    pattern = r"^hello"
    text = "Hello\nhello"
    matches = re.findall(pattern, text, re.I | re.M)
    print(matches)  # ['Hello', 'hello']

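To isolate what each flag changes, the same pattern can be run with and without it. A small sketch using a two-line sample string:

```python
import re

text = "first line\nSecond line"

# Without re.MULTILINE, ^ only matches at the very start of the string.
no_flags = re.findall(r"^\w+", text)
multiline = re.findall(r"^\w+", text, re.MULTILINE)

# Without re.DOTALL, . refuses to match the newline between the lines.
dot_default = re.search(r"line.Second", text)
dot_all = re.search(r"line.Second", text, re.DOTALL)

# re.IGNORECASE lets a lowercase pattern match "Second".
ignore_case = re.findall(r"second", text, re.IGNORECASE)

print(no_flags)             # ['first']
print(multiline)            # ['first', 'Second']
print(dot_default)          # None
print(dot_all is not None)  # True
print(ignore_case)          # ['Second']
```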
3. Substitutions
The re.sub function replaces patterns with a specified string:

    pattern = r"\d+"
    text = "Replace 123 with 456."
    result = re.sub(pattern, "456", text)
    print(result)  # Replace 456 with 456.

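re.sub also accepts group references in the replacement, a count limit, and a function as the replacement. These features are not described above, so treat the following as an illustrative sketch with made-up inputs:

```python
import re

# \1, \2, \3 in the replacement refer to the captured groups.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2025-01-20")

# count limits how many matches are replaced.
first_only = re.sub(r"\d+", "#", "1 2 3", count=1)

# A function can compute each replacement from the match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

print(reordered)   # 20/01/2025
print(first_only)  # # 2 3
print(doubled)     # 6 apples, 20 pears
```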
Real-World Examples
1. Email Validation
Validate email addresses with regex:

    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    email = "user@example.com"  # placeholder address
    if re.match(pattern, email):
        print("Valid email")
    else:
        print("Invalid email")

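The same check can be wrapped in a reusable function. The sketch below assumes re.fullmatch in place of the ^ and $ anchors; is_valid_email and the sample addresses are hypothetical names and data chosen for illustration. A pattern this simple accepts common address shapes only.

```python
import re

# Compiled once so the pattern is not re-parsed on every call.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def is_valid_email(address: str) -> bool:
    """Return True if the whole string matches the email pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None


samples = [
    "user@example.com",
    "first.last@mail.example.org",
    "no-at-sign.com",
    "user@host",
    "user@example.c",
]
results = [is_valid_email(s) for s in samples]
print(results)  # [True, True, False, False, False]
```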
2. Extracting URLs
Find all URLs in text:
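
    # Requiring each dot to be followed by more characters keeps the
    # sentence-ending period out of the last URL.
    pattern = r"https?://[\w-]+(?:\.[\w-]+)+"
    text = "Visit https://example.com and http://test.org."
    urls = re.findall(pattern, text)
    print(urls)  # ['https://example.com', 'http://test.org']
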
3. Data Cleaning
Remove non-alphanumeric characters:

    pattern = r"[^a-zA-Z0-9 ]"
    text = "Hello, World! 123."
    cleaned_text = re.sub(pattern, "", text)
    print(cleaned_text)  # Hello World 123

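Stripping characters often leaves uneven spacing behind. As a possible follow-up step (an assumption, not part of the example above), runs of whitespace can be collapsed with a second substitution:

```python
import re

text = "Hello,   World!  123."

# Step 1: drop everything except letters, digits, and spaces.
stripped = re.sub(r"[^a-zA-Z0-9 ]", "", text)
# Step 2: collapse runs of whitespace into a single space and trim the ends.
normalized = re.sub(r"\s+", " ", stripped).strip()

print(repr(stripped))    # 'Hello   World  123'
print(repr(normalized))  # 'Hello World 123'
```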
4. Log Parsing
Extract timestamps from logs:

    pattern = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
    log = "2025-01-20 10:30:45 - INFO: Task completed"
    timestamps = re.findall(pattern, log)
    print(timestamps)  # ['2025-01-20 10:30:45']

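A log line usually carries more than a timestamp. The sketch below assumes the "timestamp - LEVEL: message" layout of the sample line and uses named groups ((?P<name>...)), a feature not introduced above; LOG_LINE and the group names are illustrative choices.

```python
import re
from datetime import datetime

# Named groups label each captured part of the line.
LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r" - (?P<level>[A-Z]+): (?P<message>.*)"
)

log = "2025-01-20 10:30:45 - INFO: Task completed"
match = LOG_LINE.match(log)
fields = match.groupdict() if match else {}

print(fields["level"])    # INFO
print(fields["message"])  # Task completed

# The timestamp text can then be converted into a datetime object.
parsed = datetime.strptime(fields["timestamp"], "%Y-%m-%d %H:%M:%S")
print(parsed.hour)  # 10
```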
Tips for Mastery
- Test Patterns Online: Tools like regex101.com allow you to experiment with regex interactively.
- Start Simple: Break complex patterns into smaller pieces and build up.
- Use Comments: Python lets you write verbose regex for clarity:

    pattern = re.compile(r"""
        ^        # Start of string
        (\w+)    # Capture a word
        \s+      # One or more spaces