Pattern Matching

Intro to Regular Expressions

Regular Expressions (Regex) are a language within a language. They allow you to find needles in haystacks, validate emails, and scrape data with surgical precision.

A Regular Expression is a special sequence of characters that helps you match or find other strings or sets of strings. It is rooted in theoretical computer science (finite state machines), but in practice it's a superpower for text processing.

In Python, we use the standard library module re.

What You'll Learn

Raw Strings: Why r"..." is the safe default for regex.
The API: match vs search vs findall.
Quantifiers: Matching one, many, or optional characters (* + ?).
Groups: Extracting specific data (and why Named Groups are best).
Performance: Using re.compile() to cache patterns.

The `r""` Revolution

Before writing a single regex, you must understand Raw Strings. In standard Python strings, the backslash \ is an escape character (e.g., \n is newline). In Regex, \ is also a special character (e.g., \d is a digit).

This creates "Backslash Hell": in a normal string you need \\\\ to match a single backslash. Always prefix regex strings with r so that Python leaves backslashes untouched and passes them through to the regex engine.

PYTHON
import re

# ❌ BAD: Python interprets \b as the 'backspace' character
pattern = "\bword\b"

# ✅ GOOD: Raw string passes \b literally to the regex engine
pattern = r"\bword\b"
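
To see what the r prefix actually changes, the two literals can be compared directly. This is a minimal sketch; the variable names and the sample text are illustrative and not part of the lesson.

```python
import re

# Without the r prefix, Python turns \b into one backspace character (ASCII 8).
plain = "\bword\b"
# With the r prefix, the backslash and the 'b' both reach the regex engine.
raw = r"\bword\b"

print(len(plain))  # 6: backspace + 'word' + backspace
print(len(raw))    # 8: backslash, 'b', 'word', backslash, 'b'

text = "a word here"
print(re.search(plain, text))  # None: the text contains no backspace characters
print(re.search(raw, text))    # matches 'word' at a word boundary

# One literal backslash: four backslashes in a plain string, two in a raw one.
path = "C:\\temp"  # the actual string is C:\temp
print(re.search("\\\\", path).group() == re.search(r"\\", path).group())  # True
```

Both patterns in the last line reach the regex engine as the same two characters, which is why the raw form is easier to read and maintain.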

The Cheat Sheet

Regex syntax is dense. Here are the 20% of symbols you'll use 80% of the time.

Symbol	Meaning
`.`	Matches any character (except newline).
`^` / `$`	Anchors: Start / End of the string.
`\d` / `\w`	Matches Digit (0-9) / Word Char (a-z, A-Z, 0-9, _).
`*`	Matches 0 or more repetitions.
`+`	Matches 1 or more repetitions.
`?`	Matches 0 or 1 (Optional).
`[]`	Set: `[aeiou]` matches any vowel.
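
Each symbol in the table can be tried out on a short string. The sample strings below are made up for illustration; the behavior shown is that of the standard re module with no flags set.

```python
import re

# . matches any single character except a newline
print(re.findall(r"c.t", "cat cot c\nt cut"))   # ['cat', 'cot', 'cut']

# ^ and $ anchor the pattern to the start and end of the string
print(bool(re.search(r"^\d+$", "12345")))       # True
print(bool(re.search(r"^\d+$", "123a5")))       # False

# * allows zero or more, + requires at least one, ? makes the item optional
print(re.findall(r"ab*", "a ab abbb"))          # ['a', 'ab', 'abbb']
print(re.findall(r"ab+", "a ab abbb"))          # ['ab', 'abbb']
print(re.findall(r"colou?r", "color colour"))   # ['color', 'colour']

# [] matches any one character from the set
print(re.findall(r"[aeiou]", "regex"))          # ['e', 'e']

# \w matches letters, digits, and underscore
print(re.findall(r"\w+", "snake_case, v2!"))    # ['snake_case', 'v2']
```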

Functions: Search vs Match

A common beginner bug: re.match() checks ONLY the start of the string. re.search() checks ANYWHERE.

The Subtle Difference
PYTHON
import re

text = "Error: File not found"

# ❌ re.match fails because text starts with 'Error', not 'File'
result = re.match(r"File", text)
print(result) # None

# ✅ re.search scans the whole string
result = re.search(r"File", text)
print(result) # <re.Match object; span=(7, 11), match='File'>

# ✅ re.findall returns a LIST of all matches
text = "10 apples, 20 oranges, 30 bananas"
numbers = re.findall(r"\d+", text)
print(numbers) # ['10', '20', '30']
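
Once a call returns a match object, its methods report what was matched and where. The sketch below reuses the same sample text and shows one way the result might be used; the variable names are illustrative.

```python
import re

text = "Error: File not found"

# re.match succeeds when the pattern sits at position 0
start = re.match(r"Error", text)
print(start.group())                   # Error

# A match object is truthy and None is falsy, so it works directly in an if
found = re.search(r"File", text)
if found:
    print(found.start(), found.end())  # 7 11
    print(text[found.start():])        # File not found

# For a single-line string with no flags, re.search with ^ behaves like re.match
print(re.search(r"^File", text))       # None
```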

Advanced: Capturing Groups

Often you don't just want to know whether a pattern matched; you want to extract specific parts. Parentheses () define Groups.

Extracting Email Parts
PYTHON
import re

email = "support@example.com"

# Define groups with (); \. matches a literal dot
pattern = r"(\w+)@(\w+\.\w+)"

match = re.search(pattern, email)
if match:
    # group(0) is the entire match
    print(match.group(0)) # "support@example.com"
    
    # group(1) is the first parentheses (Username)
    print(match.group(1)) # "support"
    
    # group(2) is the second parentheses (Domain)
    print(match.group(2)) # "example.com"
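
Groups can also be read all at once, and they change what re.findall returns. The following sketch assumes a made-up string containing two addresses. The pattern is the same simplified one used above, so it is not a full email validator (it would miss usernames containing dots or hyphens, for example).

```python
import re

log = "alice@example.com wrote to bob@test.org"  # illustrative sample text
pattern = r"(\w+)@(\w+\.\w+)"

# groups() returns all captured groups of one match as a tuple
first = re.search(pattern, log)
user, domain = first.groups()
print(user, domain)  # alice example.com

# With two or more groups, findall returns a list of tuples, one per match
print(re.findall(pattern, log))
# [('alice', 'example.com'), ('bob', 'test.org')]
```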

Pro Tip: Named Groups

Numbered groups get hard to follow as a pattern grows. Use (?P<name>...) to give each group a name!

PYTHON

import re

email = "support@example.com"
pattern = r"(?P<user>\w+)@(?P<domain>\w+\.\w+)"
match = re.search(pattern, email)
print(match.group("user"))   # "support"
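
Named groups can also be collected into a dictionary, which is handy when the result is passed on to other code. A short sketch, using the same sample address:

```python
import re

email = "support@example.com"
pattern = r"(?P<user>\w+)@(?P<domain>\w+\.\w+)"

match = re.search(pattern, email)
if match:
    # groupdict() maps each group name to the text it captured
    print(match.groupdict())  # {'user': 'support', 'domain': 'example.com'}
    # Subscript access is available on match objects in Python 3.6 and later
    print(match["domain"])    # example.com
    # Named groups still keep their numbers
    print(match.group(1))     # support
```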

Performance: Compilation

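If you are using a regex pattern inside a loop (e.g., checking 10,000 strings), passing the pattern string to re.search() on every iteration adds avoidable overhead. The re module keeps an internal cache of recently compiled patterns, so the string is not fully re-parsed each time, but every call still pays for a cache lookup, and a pattern can be pushed out of the cache when a program uses many different patterns.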

Instead, use re.compile() to parse the regex once into a reusable object.
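
A minimal sketch of that approach, assuming a small list of sample lines (the name number_re and the data are illustrative):

```python
import re

# Parse the pattern once, outside the loop
number_re = re.compile(r"\d+")

lines = ["order 66", "no digits here", "room 101, floor 3"]  # sample data

for line in lines:
    # A compiled pattern offers the same methods as the module: search, match, findall
    print(number_re.findall(line))
# ['66']
# []
# ['101', '3']
```

Giving the compiled pattern a descriptive name also documents what it is meant to match, which helps when the same pattern is used in several places.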