How to Use Python Regular Expressions

Regular expressions (often abbreviated as regex or regexp) are one of the most powerful text-processing tools available to programmers, and Python supports them out of the box. They let you search, match, and manipulate text with precision. If you’ve ever wondered how to validate an email, extract data from logs, or parse complex text, regex is your new best friend. Let’s dive into Python’s re module, starting from the basics and building up to advanced concepts.


What is a Regular Expression?

At its core, a regular expression is a sequence of characters that defines a search pattern. Think of it as a special language for finding patterns in text. For example:

  • Pattern: cat
    Text: “The cat sat on the mat.”
    Match: Yes, the word “cat” is present.

Python provides the re module to work with regular expressions. To use it, simply import the module:

import re

Basic Concepts and Syntax

1. Matching Literal Characters

Ordinary characters in a pattern match themselves exactly; only the special symbols (metacharacters) change that. For example:

pattern = r"cat"
text = "The cat is here."
match = re.search(pattern, text)
print(match.group() if match else "No match")
  • r"cat": The r prefix denotes a raw string, so Python passes backslashes to the regex engine unchanged instead of treating them as string escapes.
  • Output: cat
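To see why the raw-string prefix matters, here is a small sketch comparing the same pattern written with and without it. The word-boundary token \b is used only for illustration; it is not part of the example above.

```python
import re

text = "The cat is here."

# Raw string: the regex engine receives \b (word boundary) intact.
raw_match = re.search(r"\bcat\b", text)

# Ordinary string: Python turns "\b" into a backspace character first,
# so the pattern looks for backspace + "cat" + backspace and finds nothing.
plain_match = re.search("\bcat\b", text)

print(raw_match.group())      # cat
print(plain_match)            # None
print(len(r"\b"), len("\b"))  # 2 1
```

Note that re.search returns None when nothing matches, which is why the earlier example checks the result before calling group().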

2. Metacharacters

Metacharacters are symbols with special meanings in regex. Some of the most common ones are:

Metacharacter    Description
.                Matches any character except a newline (\n).
^                Matches the start of a string.
$                Matches the end of a string.
*                Matches 0 or more repetitions of the preceding element.
+                Matches 1 or more repetitions of the preceding element.
?                Matches 0 or 1 repetition of the preceding element.
|                Acts as an OR operator between alternatives.

Example:

pattern = r"ca."
text = "cat, car, cab"
matches = re.findall(pattern, text)
print(matches)  # ['cat', 'car', 'cab']
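The other metacharacters in the table can be tried out the same way. The following is a minimal sketch with made-up sample strings, one line per symbol:

```python
import re

# ? makes the preceding "u" optional.
spellings = re.findall(r"colou?r", "color and colour")

# * allows zero or more "b"; + requires at least one.
star = re.findall(r"ab*", "a ab abb")
plus = re.findall(r"ab+", "a ab abb")

# | matches either alternative.
pets = re.findall(r"cat|dog", "cat, dog, bird")

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_the = bool(re.search(r"^The", "The end"))
ends_with_end = bool(re.search(r"end$", "The end"))

print(spellings)  # ['color', 'colour']
print(star)       # ['a', 'ab', 'abb']
print(plus)       # ['ab', 'abb']
print(pets)       # ['cat', 'dog']
print(starts_with_the, ends_with_end)  # True True
```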

3. Character Classes

Character classes let you specify a set of characters to match. For example:

  • [abc]: Matches any one of a, b, or c.
  • [a-z]: Matches any single lowercase letter from a to z.
  • [^a-z]: Matches any single character that is not a lowercase letter from a to z.

Example:

pattern = r"[aeiou]"
text = "Python is awesome."
matches = re.findall(pattern, text)
print(matches)  # ['o', 'i', 'a', 'e', 'o', 'e']
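Ranges and negated classes work the same way as the vowel set above. Here is a short sketch with an invented sample sentence; the third pattern combines two classes to pick out capitalized words:

```python
import re

text = "Python 3 rocks!"

# Runs of lowercase letters only.
lower_runs = re.findall(r"[a-z]+", text)

# Any single character that is neither a lowercase letter nor a space.
not_lower = re.findall(r"[^a-z ]", text)

# One uppercase letter followed by one or more lowercase letters.
capitalized = re.findall(r"[A-Z][a-z]+", "Alice met Bob in Paris")

print(lower_runs)   # ['ython', 'rocks']
print(not_lower)    # ['P', '3', '!']
print(capitalized)  # ['Alice', 'Bob', 'Paris']
```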

Intermediate Techniques

1. Quantifiers

Quantifiers define how many times a character or group can repeat:

Quantifier    Description
{n}           Exactly n times.
{n,}          At least n times.
{n,m}         Between n and m times.

Example:

pattern = r"a{2,3}"
text = "aaa a aa aaa"
matches = re.findall(pattern, text)
print(matches)  # ['aaa', 'aa', 'aaa']
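Here is a quick sketch of all three quantifier forms applied to digits. The sample text is invented, and \b (a word boundary) is added in the first pattern so that longer numbers are not partially matched:

```python
import re

text = "7 42 123 4567"

# {3}: exactly three digits, bounded on both sides.
exactly_three = re.findall(r"\b\d{3}\b", text)

# {2,}: two or more digits.
two_or_more = re.findall(r"\d{2,}", text)

# {2,3}: two to three digits; the engine is greedy and takes three if it can.
two_to_three = re.findall(r"\d{2,3}", text)

print(exactly_three)  # ['123']
print(two_or_more)    # ['42', '123', '4567']
print(two_to_three)   # ['42', '123', '456']
```

In the last result, 4567 yields only 456: the leftover 7 is a single digit and cannot satisfy the two-digit minimum.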

2. Grouping and Capturing

Parentheses () group parts of a regex and capture matched text:

pattern = r"(\w+)@(\w+)\.com"
text = "Contact us at support@example.com."
match = re.search(pattern, text)
print(match.groups())  # ('support', 'example')
  • \w+: Matches one or more word characters.
  • groups(): Returns all captured groups.
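Captured groups can also be read one at a time, by number or by name. The sketch below uses Python's named-group syntax (?P<name>...); the group names user and domain are chosen here purely for illustration:

```python
import re

text = "Contact us at support@example.com."
match = re.search(r"(?P<user>\w+)@(?P<domain>\w+)\.com", text)

whole = match.group(0)          # the entire match
user = match.group(1)           # first group, by position
domain = match.group("domain")  # second group, by name
as_dict = match.groupdict()     # all named groups as a dict

print(whole)    # support@example.com
print(user)     # support
print(domain)   # example
print(as_dict)  # {'user': 'support', 'domain': 'example'}
```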

3. Escape Sequences

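A backslash does two jobs: placed before a metacharacter it makes that character literal, and placed before certain letters it forms a shorthand character class:

  • \. matches a literal period (.).
  • \d matches any digit (equivalent to [0-9] for ASCII text; with str patterns it also matches other Unicode decimal digits unless re.ASCII is set).
  • \D matches any non-digit character.
  • \s matches any whitespace character (spaces, tabs, newlines, etc.).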

Example:

pattern = r"\d{4}"
text = "Year: 2023."
match = re.search(pattern, text)
print(match.group())  # 2023
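The difference between an escaped and an unescaped period is easy to miss, so here is a small sketch with invented sample strings. It also shows \D and \s in action, plus re.escape, which adds the backslashes for you:

```python
import re

text = "v1.25 and 3x14"

# \. only matches a real period; a bare . matches any character.
escaped = re.findall(r"\d+\.\d+", text)
unescaped = re.findall(r"\d+.\d+", text)

# \D removes everything that is not a digit.
digits_only = re.sub(r"\D", "", "555-1234")

# \s+ splits on any run of whitespace (spaces, tabs, ...).
words = re.split(r"\s+", "a  b\tc")

# re.escape builds a pattern that matches the given text literally.
literal = re.escape("a.b")

print(escaped)      # ['1.25']
print(unescaped)    # ['1.25', '3x14']
print(digits_only)  # 5551234
print(words)        # ['a', 'b', 'c']
print(literal)      # a\.b
```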

Advanced Topics

1. Lookahead and Lookbehind

These are zero-width assertions: they check what comes before or after the current position without consuming it, so the checked text is not included in the result.

  • Positive Lookahead (?=...): Ensures the match is followed by the given pattern.
  • Negative Lookahead (?!...): Ensures the match is not followed by the given pattern.
  • Positive Lookbehind (?<=...): Ensures the match is preceded by the given pattern.
  • Negative Lookbehind (?<!...): Ensures the match is not preceded by the given pattern.

Example:

pattern = r"\d+(?=\sUSD)"
text = "Price: 50 USD, 70 EUR"
matches = re.findall(pattern, text)
print(matches)  # ['50']
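The other three assertions follow the same idea. This is a sketch with invented sample strings; the \b word boundaries are added so that a number cannot be matched partway through:

```python
import re

prices = "Price: 50 USD, 70 EUR"
costs = "Cost: $30, 40 units"

# Negative lookahead: whole numbers NOT followed by " USD".
not_usd = re.findall(r"\b\d+\b(?!\sUSD)", prices)

# Positive lookbehind: digits preceded by a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", costs)

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
not_dollars = re.findall(r"(?<!\$)\b\d+", costs)

print(not_usd)      # ['70']
print(dollars)      # ['30']
print(not_dollars)  # ['40']
```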

2. Flags

Flags modify regex behavior. Common flags include:

  • re.IGNORECASE or re.I: Makes matching case-insensitive.
  • re.MULTILINE or re.M: Allows ^ and $ to match at the start and end of each line.
  • re.DOTALL or re.S: Makes . match newline characters as well.

Example:

pattern = r"^hello"
text = "Hello\nhello"
matches = re.findall(pattern, text, re.I | re.M)
print(matches)  # ['Hello', 'hello']
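To isolate what each flag changes, the following sketch runs the same pattern with and without it. The sample strings are made up for the demonstration:

```python
import re

html = "<b>one\ntwo</b>"

# Without DOTALL, . stops at the newline, so the tag pair is never matched.
no_dotall = re.search(r"<b>.*</b>", html)
with_dotall = re.search(r"<b>.*</b>", html, re.DOTALL)

lines = "a1\nb2"

# Without MULTILINE, ^ only matches at the very start of the string.
first_only = re.findall(r"^\w+", lines)
each_line = re.findall(r"^\w+", lines, re.MULTILINE)

print(no_dotall)                   # None
print(with_dotall.group() == html) # True
print(first_only)                  # ['a1']
print(each_line)                   # ['a1', 'b2']
```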

3. Substitutions

The re.sub function replaces every non-overlapping match of a pattern with a replacement string:

pattern = r"\d+"
text = "Replace 123 with 456."
result = re.sub(pattern, "456", text)
print(result)  # Replace 456 with 456.
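re.sub can do more than insert a fixed string. The sketch below, using invented sample text, shows three variations: reusing captured groups in the replacement, computing the replacement with a function, and limiting the number of replacements with count:

```python
import re

# Backreferences \1, \2, \3 reuse the captured groups in a new order.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "Due 2025-01-20")

# A function receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2),
                 "3 apples and 10 pears")

# count=1 replaces only the first match.
first_only = re.sub(r"\d+", "#", "1 2 3", count=1)

print(swapped)     # Due 20/01/2025
print(doubled)     # 6 apples and 20 pears
print(first_only)  # # 2 3
```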

Real-World Examples

1. Email Validation

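Check that a string looks like an email address. This pattern is a practical approximation, not a full implementation of the email address standard: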

pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
email = "user@example.com"  # sample address
if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")
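If you run this check in many places, it may be convenient to wrap it in a helper. The sketch below is one possible way to do that; the names EMAIL_RE and is_valid_email are hypothetical, and it swaps re.match for re.fullmatch, which makes the ^ and $ anchors unnecessary and also rejects a trailing newline that $ would otherwise tolerate:

```python
import re

# Same character classes as the pattern above, compiled once for reuse.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email(address: str) -> bool:
    """Return True if the whole string looks like an email address."""
    return EMAIL_RE.fullmatch(address) is not None

print(is_valid_email("user@example.com"))    # True
print(is_valid_email("user@example"))        # False (no top-level domain)
print(is_valid_email("user@example.com\n"))  # False (trailing newline)

# For comparison: re.match with $ accepts the trailing newline.
anchored = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
lenient = re.match(anchored, "user@example.com\n") is not None
print(lenient)  # True
```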

2. Extracting URLs

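Find the scheme and host of every URL in a text. Each dot must be followed by more host characters, so a sentence-ending period is not swallowed into the match:

pattern = r"https?://[\w-]+(?:\.[\w-]+)+"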
text = "Visit https://example.com and http://test.org."
urls = re.findall(pattern, text)
print(urls)  # ['https://example.com', 'http://test.org']
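When you need the parts of each URL rather than the whole string, re.finditer with named groups is one option. This is a sketch only: the group names scheme and host are illustrative, and the pattern still ignores paths and query strings:

```python
import re

text = "Visit https://example.com and http://test.org."

# Each dot must be followed by more host characters, so the period
# that ends the sentence is not part of the match.
url_re = re.compile(r"(?P<scheme>https?)://(?P<host>[\w-]+(?:\.[\w-]+)+)")

# finditer yields one match object per URL.
parts = [(m["scheme"], m["host"]) for m in url_re.finditer(text)]

print(parts)  # [('https', 'example.com'), ('http', 'test.org')]
```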

3. Data Cleaning

Remove everything except letters, digits, and spaces:

pattern = r"[^a-zA-Z0-9 ]"
text = "Hello, World! 123."
cleaned_text = re.sub(pattern, "", text)
print(cleaned_text)  # Hello World 123
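Stripping punctuation often leaves doubled spaces behind. A possible follow-up step, sketched here with a hypothetical clean_text helper, is to collapse whitespace runs afterwards:

```python
import re

def clean_text(raw: str) -> str:
    """Drop punctuation, then squeeze whitespace runs into single spaces."""
    no_punct = re.sub(r"[^a-zA-Z0-9 ]", "", raw)  # keep letters, digits, spaces
    collapsed = re.sub(r"\s+", " ", no_punct)     # one space per whitespace run
    return collapsed.strip()                      # trim both ends

cleaned = clean_text("  Hello,   World!  123. ")
print(cleaned)  # Hello World 123
```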

4. Log Parsing

Extract timestamps from logs:

pattern = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
log = "2025-01-20 10:30:45 - INFO: Task completed"
timestamps = re.findall(pattern, log)
print(timestamps)  # ['2025-01-20 10:30:45']
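A timestamp is usually only one of the fields you want from a log line. The sketch below assumes every line follows the "timestamp - LEVEL: message" layout of the sample above; the group names ts, level, and msg are made up for this example:

```python
import re
from datetime import datetime

LOG_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (?P<level>[A-Z]+): (?P<msg>.*)"
)

log = "2025-01-20 10:30:45 - INFO: Task completed"
match = LOG_RE.match(log)

# Convert the captured text into a real datetime object.
when = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
level = match.group("level")
message = match.group("msg")

print(when)     # 2025-01-20 10:30:45
print(level)    # INFO
print(message)  # Task completed
```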

Tips for Mastery

  1. Test Patterns Online: Tools like regex101.com allow you to experiment with regex interactively.
  2. Start Simple: Break complex patterns into smaller pieces and build up.
  3. Use Comments: With the re.VERBOSE flag, Python ignores whitespace and # comments inside a pattern, so you can annotate each piece:
pattern = re.compile(r"""
    ^               # Start of string
    (\w+)           # Capture a word
    \s+             # One or more spaces
""", re.VERBOSE)
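
As a complete, runnable illustration of the verbose style, here is a sketch of a date pattern compiled with re.VERBOSE. The pattern and the names DATE_RE, year, month, and day are invented for this example:

```python
import re

# With re.VERBOSE, whitespace and # comments inside the pattern are ignored.
DATE_RE = re.compile(r"""
    (?P<year>\d{4})    # four-digit year
    -                  # literal hyphen
    (?P<month>\d{2})   # two-digit month
    -                  # literal hyphen
    (?P<day>\d{2})     # two-digit day
""", re.VERBOSE)

match = DATE_RE.search("Released on 2025-01-20.")
parts = match.groupdict()

print(parts)  # {'year': '2025', 'month': '01', 'day': '20'}
```

Because whitespace is ignored in this mode, a literal space in the pattern has to be written as \s, "\ ", or [ ].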
