Intro to Regular Expressions
Regular Expressions (Regex) are a language within a language. They allow you to find needles in haystacks, validate emails, and scrape data with surgical precision.
A Regular Expression is a special sequence of characters that helps you match or find other strings or sets of strings. It is heavily based on mathematical Computer Science theory (Finite State Machines), but in practice, it's a superpower for text processing.
In Python, we use the standard library module re.
What You'll Learn
- Raw Strings: Why r"..." is mandatory for regex.
- The API: match vs search vs findall.
- Quantifiers: Matching one, many, or optional characters (* + ?).
- Groups: Extracting specific data (and why Named Groups are best).
- Performance: Using re.compile() to cache patterns.
The r"" Revolution
Before writing a single regex, you must understand Raw Strings. In standard Python strings, the backslash \ is an escape character (e.g., \n is newline). In Regex, \ is also a special character (e.g., \d is a digit).
This creates "Backslash Hell": in an ordinary string literal you need "\\\\" to match a single backslash. Always prefix regex strings with r so that Python leaves the backslashes untouched and passes them straight to the regex engine.
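To make the difference concrete, here is a minimal sketch comparing a plain and a raw string for the same pattern. The sample text and the path value are hypothetical, chosen only for illustration.

```python
import re

# "\b" in an ordinary string literal is ONE character: backspace (0x08).
plain = "\bword"
# In a raw string the backslash survives, so the regex engine sees \b (word boundary).
raw = r"\bword"

print(len(plain))  # 5
print(len(raw))    # 6

text = "a word here"
print(re.search(plain, text))        # None: the text contains no backspace character
print(re.search(raw, text).group())  # word

# Matching one literal backslash: r"\\" and "\\\\" are the same two-character pattern.
path = "C:\\temp"               # the string C:\temp
print(r"\\" == "\\\\")          # True
print(re.findall(r"\\", path))  # ['\\'] (a list holding one backslash)
```

The raw form is shorter and reads the same way the regex engine will read it, which is the main reason to prefer it.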
import re
# ❌ BAD: Python interprets \b as the 'backspace' character
pattern = "\bword"
# ✅ GOOD: Raw string passes \b (word boundary) literally to the regex engine
pattern = r"\bword"
The Cheat Sheet
Regex syntax is dense. Here are the 20% of symbols you'll use 80% of the time.
| Symbol | Meaning |
|---|---|
| . | Matches any character (except newline). |
| ^ / $ | Anchors: Start / End of the string. |
| \d / \w | Matches Digit (0-9) / Word Char (a-z, A-Z, 0-9, _). |
| * | Matches 0 or more repetitions. |
| + | Matches 1 or more repetitions. |
| ? | Matches 0 or 1 (Optional). |
| [] | Set: [aeiou] matches any vowel. |
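The symbols in the table can be tried out side by side. This is a small sketch on a made-up sample string (the words in it are arbitrary and only there to exercise each symbol):

```python
import re

# Hypothetical sample text, chosen so each symbol gives a different result.
text = "ct cot coot cat"

print(re.findall(r"c.t", text))     # ['cot', 'cat']         . is exactly one character
print(re.findall(r"co*t", text))    # ['ct', 'cot', 'coot']  * allows zero or more 'o'
print(re.findall(r"co+t", text))    # ['cot', 'coot']        + needs at least one 'o'
print(re.findall(r"co?t", text))    # ['ct', 'cot']          ? allows zero or one 'o'
print(re.findall(r"c[ao]t", text))  # ['cot', 'cat']         [] picks one char from the set
print(re.findall(r"^\w+", text))    # ['ct']                 ^ anchors at the start
print(re.findall(r"\w+$", text))    # ['cat']                $ anchors at the end
print(re.findall(r"\d+", "room 101"))  # ['101']             \d matches digits
```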
Functions: Search vs Match
A common beginner bug: re.match() checks ONLY the start of the string. re.search() checks ANYWHERE.
import re
text = "Error: File not found"
# ❌ re.match fails because text starts with 'Error', not 'File'
result = re.match(r"File", text)
print(result) # None
# ✅ re.search scans the whole string
result = re.search(r"File", text)
print(result) # <re.Match object; span=(7, 11), match='File'>
# ✅ re.findall returns a LIST of all matches
text = "10 apples, 20 oranges, 30 bananas"
numbers = re.findall(r"\d+", text)
print(numbers) # ['10', '20', '30']
Advanced: Capturing Groups
Often you don't just want to know whether a pattern matched; you want to extract specific parts of the match. We use parentheses () to define Groups.
import re
email = "support@example.com"
# Define groups with ()
pattern = r"(\w+)@(\w+\.\w+)"
match = re.search(pattern, email)
if match:
    # group(0) is the entire match
    print(match.group(0)) # "support@example.com"
    # group(1) is the first parentheses (Username)
    print(match.group(1)) # "support"
    # group(2) is the second parentheses (Domain)
    print(match.group(2)) # "example.com"
Pro Tip: Named Groups
Numbered groups are confusing. Use (?P<name>...) to name them!
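As a rough sketch of why names help, the example below pulls a date apart. The input string and the group names year, month, and day are hypothetical; groupdict() is the standard Match method that returns every named group at once.

```python
import re

# Hypothetical input; the group names are illustrative, not prescribed.
date_pattern = r"(?P<year>\d+)-(?P<month>\d+)-(?P<day>\d+)"
match = re.search(date_pattern, "Released on 2024-03-15.")

if match:
    print(match.group("year"))  # 2024
    print(match.group(1))       # 2024 (named groups keep their numbers too)
    print(match.groupdict())    # {'year': '2024', 'month': '03', 'day': '15'}
```

Because each value is looked up by name, adding or reordering groups later does not silently shift which number means what.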
pattern = r"(?P<user>\w+)@(?P<domain>\w+\.\w+)"
match = re.search(pattern, email)
print(match.group("user")) # "support"
Performance: Compilation
If you are using a regex pattern inside a loop (e.g., checking 10,000 strings), calling re.search() directly adds avoidable overhead: on every call Python has to look the pattern string up in the re module's internal cache, and re-parse it if it is no longer cached.
Instead, use re.compile() to parse the regex once into a reusable object.
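import re
# ✅ Optimized for multiple uses
email_pattern = re.compile(r"[\w.-]+@[\w.-]+")
users = ["a@b.com", "invalid-email", "c@d.org"]
valid_users = []
for u in users:
    # match() on the compiled object anchors at the start of the string
    if email_pattern.match(u):
        valid_users.append(u)
print(valid_users) # ['a@b.com', 'c@d.org']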
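A compiled pattern object exposes the same methods as the module-level functions (match, search, findall), so it can be reused for different tasks. The sketch below assumes a deliberately loose email pattern and a few hypothetical input lines; it is an illustration, not a production-grade email validator.

```python
import re

# Compiled once, reused for every line below.
email_pattern = re.compile(r"[\w.-]+@[\w.-]+")

# Hypothetical input lines.
lines = [
    "contact: a@b.com",
    "no address here",
    "cc c@d.org and e@f.net",
]

found = []
for line in lines:
    # findall on the compiled object reuses the already-parsed pattern.
    found.extend(email_pattern.findall(line))

print(found)  # ['a@b.com', 'c@d.org', 'e@f.net']

# The match/search distinction applies to compiled patterns too.
print(email_pattern.match(lines[0]))           # None: line starts with 'contact'
print(email_pattern.search(lines[0]).group())  # a@b.com
```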