Table of Contents
- What Are Regular Expressions in Ruby?
- Ruby Regex Syntax Basics
- 2.1 Literals and Delimiters
- 2.2 Modifiers (Flags)
- 2.3 Core Components: Anchors, Character Classes, and Quantifiers
- Essential Ruby Regex Methods
- Practical Applications
- 4.1 Validation: Emails, URLs, and Phone Numbers
- 4.2 Data Parsing: Extracting Information from Strings
- 4.3 Text Manipulation: Substitution and Formatting
- 4.4 Advanced Use Cases: Lookarounds, Backreferences, and Unicode
- Common Pitfalls and How to Avoid Them
- Best Practices
- Conclusion
- References
1. What Are Regular Expressions in Ruby?
A regular expression is a sequence of characters defining a search pattern. In Ruby, regex patterns are created using /pattern/ literals (e.g., /hello/) or the Regexp.new constructor (e.g., Regexp.new("hello")). Ruby’s regex engine (Onigmo) uses a Perl-style syntax and supports POSIX bracket expressions (e.g., [[:alpha:]]), plus extensions like named captures and Unicode properties.
Regex in Ruby is tightly integrated with strings, enabling operations like matching, extracting, and replacing text with methods like String#match, String#scan, and String#gsub.
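As a minimal sketch of the two construction styles described above, the following compares a literal with Regexp.new and shows the =~ operator. The helper greeting_regex and the constant names are hypothetical, chosen only for illustration.

```ruby
# Two equivalent ways to build the same pattern.
LITERAL     = /hello/
CONSTRUCTED = Regexp.new("hello")

puts LITERAL == CONSTRUCTED   # true: same source and options

# Regexp.new is handy when the pattern is assembled at runtime.
# `greeting_regex` is a hypothetical helper, not part of Ruby.
def greeting_regex(word)
  Regexp.new("hello #{Regexp.escape(word)}", Regexp::IGNORECASE)
end

puts greeting_regex("world").match?("Hello World")   # true

# =~ returns the index of the first match, or nil when nothing matches.
p("say hello" =~ LITERAL)   # 4
p("goodbye" =~ LITERAL)     # nil
```

Literals are usually the clearer choice for fixed patterns; the constructor earns its place when part of the pattern is only known at runtime.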
2. Ruby Regex Syntax Basics
2.1 Literals and Delimiters
Ruby regex literals are typically enclosed in forward slashes (/.../), but you can use other delimiters like %r{...} for readability (useful if the pattern contains slashes, e.g., %r{https?://}).
# Basic literal
simple_regex = /ruby/
# Alternative delimiter (avoids escaping slashes)
url_regex = %r{https?://example\.com}
2.2 Modifiers (Flags)
Modifiers alter regex behavior. Common ones include:
- i: Case-insensitive matching.
- m: Multiline mode (. matches newlines).
- x: Free-spacing mode (ignores whitespace, enabling comments).
# Case-insensitive match
/cat/i.match("Cat") # => #<MatchData "Cat">
# Free-spacing mode (comments and whitespace ignored)
complex_regex = %r{
\d{3} # Area code (3 digits)
- # Hyphen
\d{3} # Central office code (3 digits)
- # Hyphen
\d{4} # Line number (4 digits)
}x
complex_regex.match("123-456-7890") # => #<MatchData "123-456-7890">
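The m modifier is the one flag not exercised above. A short sketch, with hypothetical constant names, of how it changes what . matches:

```ruby
# In Ruby, "." skips newlines unless the m modifier is set.
TWO_LINES = "first\nsecond"

p TWO_LINES.match?(/first.second/)    # false: "." does not match "\n"
p TWO_LINES.match?(/first.second/m)   # true:  m lets "." match "\n"

# Modifiers can be combined on one literal.
COMBINED = /first.SECOND/mi
p COMBINED.match?(TWO_LINES)          # true
```

Note that Ruby's m corresponds to what some other regex flavors call "dotall" or s.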
2.3 Core Components
Anchors: Define Position
- ^ / \A: ^ matches at the start of every line (in Ruby this is always the case, no modifier required); \A matches only at the start of the string.
- $ / \z: $ matches at the end of every line; \z matches only at the absolute end of the string.
- \b: Word boundary (between a word character [a-zA-Z0-9_] and a non-word character).
# \A and \z ensure the entire string matches
email_regex = /\Auser@domain\.com\z/
email_regex.match("user@domain.com") # => #<MatchData "user@domain.com">
email_regex.match("a user@domain.com") # => nil (extra character at start)
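The difference between line anchors and string anchors matters most for validation, because ^ and $ match at every line boundary in Ruby. The sketch below uses hypothetical constant names and a made-up two-line input to show the effect:

```ruby
# A two-line input: unexpected first line, valid-looking second line.
SNEAKY = "anything goes here\nuser@domain.com"

LINE_ANCHORED   = /^user@domain\.com$/     # ^ and $ work per line
STRING_ANCHORED = /\Auser@domain\.com\z/   # \A and \z cover the whole string

p LINE_ANCHORED.match?(SNEAKY)     # true: the second line matches on its own
p STRING_ANCHORED.match?(SNEAKY)   # false: the string does not start with "user"

# \Z tolerates one trailing newline, \z does not.
p(/\Aabc\Z/.match?("abc\n"))   # true
p(/\Aabc\z/.match?("abc\n"))   # false

# \b marks the edge of a word, so "concat" is not counted.
p "cat concat cat.".scan(/\bcat\b/).length   # 2
```

For input validation, \A and \z are therefore the safer default.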
Character Classes: Match Specific Characters
Enclose characters in [] to match any one of them. Use - for ranges and ^ to negate.
# Match vowels (a, e, i, o, u)
vowel_regex = /[aeiou]/i # i modifier for case insensitivity
vowel_regex.match("Ruby") # => #<MatchData "u">
# Negated class: match non-digits
non_digit_regex = /[^0-9]/
non_digit_regex.match("123abc") # => #<MatchData "a">
Quantifiers: Specify Repetitions
Quantifiers define how many times a preceding element should match:
- *: 0 or more times (greedy).
- +: 1 or more times (greedy).
- ?: 0 or 1 time (optional).
- {n}: Exactly n times.
- {n,}: n or more times.
- {n,m}: Between n and m times.
# Match "g" followed by one or more "o"s (e.g., "go", "goo", "gooo")
go_regex = /go+/
go_regex.match("goo") # => #<MatchData "goo">
# Match 3-5 digits
digit_regex = /\d{3,5}/
digit_regex.match("1234") # => #<MatchData "1234">
Groups and Alternations
- (...): Capture group (extracts matched text for later use).
- (?:...): Non-capturing group (groups for quantifiers without capturing).
- |: Alternation (matches either the left or right pattern).
# Capture group: extract first and last name
name_regex = /(\w+) (\w+)/
match = name_regex.match("John Doe")
match[1] # => "John" (first group)
match[2] # => "Doe" (second group)
# Alternation: match "cat" or "dog"
pet_regex = /cat|dog/
pet_regex.match("dog") # => #<MatchData "dog">
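Named captures and non-capturing groups can be sketched as follows. The constant names are hypothetical; (?<name>...), MatchData#[] with a symbol, and MatchData#named_captures are core Ruby.

```ruby
# Named groups make the intent of each capture explicit.
FULL_NAME = /(?<first>\w+) (?<last>\w+)/

m = FULL_NAME.match("John Doe")
puts m[:first]   # John
puts m[:last]    # Doe
p m.named_captures == { "first" => "John", "last" => "Doe" }   # true

# A non-capturing group applies the quantifier without creating a capture.
REPEATED = /\A(?:ab)+\z/
p REPEATED.match?("ababab")           # true
p REPEATED.match("ababab").captures   # []
```

Named groups tend to survive pattern changes better than numeric indexes, since adding a group does not renumber them.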
3. Essential Ruby Regex Methods
Ruby strings and the Regexp class provide methods to work with regex:
- String#match(regex): Returns a MatchData object (or nil) with details about the match.
- String#scan(regex): Returns an array of all non-overlapping matches.
- String#sub(regex, replacement): Replaces the first match with replacement.
- String#gsub(regex, replacement): Replaces all matches with replacement (global substitute).
- Regexp#===: Used in case statements for pattern matching.
# match: Get details about the first match
text = "Ruby is fun, Ruby is powerful"
match_data = text.match(/Ruby/)
match_data[0] # => "Ruby" (full match)
match_data.begin(0) # => 0 (start index)
# scan: Extract all matches
text.scan(/Ruby/) # => ["Ruby", "Ruby"]
# sub: Replace first occurrence
text.sub(/Ruby/, "Python") # => "Python is fun, Ruby is powerful"
# gsub: Replace all occurrences
text.gsub(/Ruby/, "Python") # => "Python is fun, Python is powerful"
# Case statement with Regexp#===
case "user@example.com"
when /\A\w+@\w+\.\w+\z/ then puts "Valid email"
else puts "Invalid email"
end # prints "Valid email"
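A few related behaviors are worth seeing alongside the methods above: scan with capture groups, String#[] with a regex, and the $~ variable set by =~. The sample string and constant name are made up for illustration.

```ruby
ORDER_LINE = "apple=3, pear=12, fig=7"

# With capture groups, scan returns one array per match.
p ORDER_LINE.scan(/(\w+)=(\d+)/)
# [["apple", "3"], ["pear", "12"], ["fig", "7"]]

# String#[] with a regex returns the matched text (or a chosen group).
p ORDER_LINE[/\d+/]           # "3"
p ORDER_LINE[/(\w+)=12/, 1]   # "pear"

# =~ sets the special variable $~ to the last MatchData.
if ORDER_LINE =~ /fig=(\d+)/
  p $~[1]                     # "7"
end
```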
4. Practical Applications
4.1 Validation: Emails, URLs, and Phone Numbers
Regex is ideal for validating structured input like emails, URLs, and phone numbers.
Example 1: Email Validation
A basic email regex checks for a local part, @, domain, and TLD:
EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i
def valid_email?(email)
EMAIL_REGEX.match?(email)
end
valid_email?("user@example.com") # => true
valid_email?("invalid-email") # => false
Breakdown:
- \A/\z: Ensure the entire string is checked.
- [\w+\-.]+: Local part (letters, digits, _, +, -, .).
- @: Separator.
- [a-z\d\-]+(\.[a-z\d\-]+)*: Domain (e.g., example.co.uk).
- \.[a-z]+: TLD (e.g., .com, .org).
Example 2: URL Validation
Validate HTTP/HTTPS URLs with optional paths:
URL_REGEX = %r{\Ahttps?://[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+(/\S*)?\z}i
def valid_url?(url)
URL_REGEX.match?(url)
end
valid_url?("https://example.com") # => true
valid_url?("ftp://invalid.com") # => false
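Phone numbers are named in this section's title, so here is a minimal sketch in the same style. It assumes a North American 10-digit format with an optional +1 prefix; PHONE_REGEX and valid_phone? are hypothetical names, and real-world numbers vary far more than this pattern allows.

```ruby
# Assumed format: optional "+1", then 3-3-4 digits separated by "-", ".",
# or a space, with optional parentheses around the area code.
# The pattern does not check that parentheses are balanced.
PHONE_REGEX = /\A(?:\+1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\z/

def valid_phone?(phone)
  PHONE_REGEX.match?(phone)
end

p valid_phone?("555-123-4567")       # true
p valid_phone?("(555) 123-4567")     # true
p valid_phone?("+1 555.123.4567")    # true
p valid_phone?("555-1234")           # false: too few digits
p valid_phone?("555-123-4567 ext")   # false: trailing text
```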
4.2 Data Parsing: Extracting Information
Regex excels at extracting structured data from unstructured text (e.g., logs, user input).
Example: Parsing Log Files
Extract IP addresses, timestamps, and requests from a web server log:
log_line = '192.168.1.1 - - [15/Mar/2024:10:30:45 +0000] "GET /index.html HTTP/1.1" 200 1234'
# Regex to capture IP, timestamp, request, status, size
LOG_REGEX = /
(\d+\.\d+\.\d+\.\d+) # IP address
.*? # Ignore middle part (non-greedy)
\[(.*?)\] # Timestamp (inside [])
\s+"([^"]+)" # Request (inside quotes); \s+ is needed because x mode ignores literal spaces
\s+(\d+) # Status code
\s+(\d+) # Response size
/x
matches = log_line.match(LOG_REGEX)
ip = matches[1] # => "192.168.1.1"
timestamp = matches[2] # => "15/Mar/2024:10:30:45 +0000"
request = matches[3] # => "GET /index.html HTTP/1.1"
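The same extraction can be sketched with named captures and a small wrapper that returns a hash. NAMED_LOG_REGEX, ACCESS_LINE, and parse_log_line are hypothetical names; the log format is the one shown above.

```ruby
ACCESS_LINE = '192.168.1.1 - - [15/Mar/2024:10:30:45 +0000] "GET /index.html HTTP/1.1" 200 1234'

# \s+ is written explicitly because free-spacing mode ignores
# literal whitespace in the pattern.
NAMED_LOG_REGEX = /
  (?<ip>\d+\.\d+\.\d+\.\d+)   # IP address
  .*?                         # identity and user fields
  \[(?<timestamp>[^\]]+)\]    # timestamp inside brackets
  \s+"(?<request>[^"]+)"      # request line inside quotes
  \s+(?<status>\d{3})         # status code
  \s+(?<size>\d+)             # response size in bytes
/x

def parse_log_line(line)
  m = NAMED_LOG_REGEX.match(line)
  return nil unless m   # unparseable lines yield nil instead of raising

  {
    ip: m[:ip],
    timestamp: m[:timestamp],
    request: m[:request],
    status: m[:status].to_i,
    size: m[:size].to_i
  }
end

entry = parse_log_line(ACCESS_LINE)
p entry[:ip]                          # "192.168.1.1"
p entry[:status]                      # 200
p entry[:size]                        # 1234
p parse_log_line("not a log line")    # nil
```

Returning nil for lines that do not match keeps the caller from hitting NoMethodError on a failed match.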
4.3 Text Manipulation: Substitution and Formatting
Use gsub to transform text, with support for dynamic replacements via blocks.
Example 1: Redact Sensitive Data
Mask credit card numbers (show first 4 and last 4 digits):
credit_card = "4111-1111-1111-1111"
redacted = credit_card.gsub(/(\d{4})[-\d]+(\d{4})/, '\1-XXXX-XXXX-\2')
# => "4111-XXXX-XXXX-1111"
Example 2: Format Names
Swap first and last names (e.g., “Doe, John” → “John Doe”):
name = "Doe, John"
formatted_name = name.gsub(/(\w+), (\w+)/) { "#{$2.capitalize} #{$1.capitalize}" }
# => "John Doe"
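Besides a string or a block, gsub also accepts a hash as the replacement. The sketch below shows the hash form next to the block form; escape_html and double_numbers are hypothetical helpers for illustration, not a substitute for a real HTML-escaping library.

```ruby
# Hash replacement: each matched text is looked up as a key.
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

def escape_html(text)
  text.gsub(/[&<>]/, HTML_ESCAPES)
end

p escape_html("a < b && b > c")   # "a &lt; b &amp;&amp; b &gt; c"

# Block replacement: compute each replacement from the match.
def double_numbers(text)
  text.gsub(/\d+/) { |digits| (digits.to_i * 2).to_s }
end

p double_numbers("3 apples and 10 pears")   # "6 apples and 20 pears"
```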
4.4 Advanced Use Cases
Lookarounds: Assertions Without Capturing
Lookarounds check for patterns before or after a match without including them in the result.
- Positive Lookahead ((?=...)): Matches if ... follows.
- Negative Lookahead ((?!...)): Matches if ... does not follow.
Example: Validate passwords with at least one uppercase letter:
PASSWORD_REGEX = /\A(?=.*[A-Z]).{8,}\z/ # At least 8 chars + 1 uppercase
PASSWORD_REGEX.match?("pass1234") # => false (no uppercase)
PASSWORD_REGEX.match?("Pass1234") # => true
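Lookbehinds, written (?<=...) and (?<!...), work the same way for text before the match. A short sketch on a made-up price list:

```ruby
PRICE_LIST = 'apple $3, pear $12, 7 figs'

# Positive lookbehind: digits preceded by "$" (the "$" is not part of the match).
p PRICE_LIST.scan(/(?<=\$)\d+/)       # ["3", "12"]

# Negative lookbehind: digits not preceded by "$" or by another digit.
p PRICE_LIST.scan(/(?<![\$\d])\d+/)   # ["7"]

# Negative lookahead: "foo" only when "bar" does not follow.
p "foobar foobaz".scan(/foo(?!bar)\w*/)   # ["foobaz"]
```

The extra \d in the negative lookbehind stops the engine from matching the "2" in "$12" on its own.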
Backreferences: Reuse Captured Groups
Reference previously captured groups with \1, \2, etc.
Example: Swap “first last” with “last, first”:
name = "Alice Smith"
reversed = name.gsub(/(\w+) (\w+)/, '\2, \1') # => "Smith, Alice"
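Backreferences can also appear inside the pattern itself, where \1 must match the same text that group 1 captured. A minimal sketch with hypothetical names (DOUBLED_WORD, dedupe_words, QUOTED):

```ruby
# \1 inside the pattern repeats whatever group 1 matched.
DOUBLED_WORD = /\b(\w+)\s+\1\b/i

p DOUBLED_WORD.match?("This is is a typo")    # true
p DOUBLED_WORD.match?("This is a sentence")   # false

# Collapse each doubled word to a single occurrence.
def dedupe_words(text)
  text.gsub(DOUBLED_WORD, '\1')
end

p dedupe_words("the the quick brown fox fox")   # "the quick brown fox"

# Named groups are referenced with \k<name>: the closing quote
# must be the same character as the opening one.
QUOTED = /(?<quote>["']).*?\k<quote>/
puts QUOTED.match(%q(say 'bye' now))[0]   # 'bye'
```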
Unicode Support
Ruby regex natively handles Unicode. Use \p{...} to match Unicode properties (e.g., emojis, scripts).
Example: Extract emojis from text:
text = "Hello! 😊 Ruby is awesome 🚀"
emojis = text.scan(/\p{Emoji}/) # => ["😊", "🚀"]
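Properties are also useful for letters and scripts. The sketch below contrasts \w, which is ASCII-only in Ruby, with \p{L} and a POSIX bracket; the sample strings and constant names are made up. One caveat: per the Unicode emoji data, the Emoji property also includes ASCII digits, # and *, so text containing numbers may need extra filtering.

```ruby
# \w is ASCII-only in Ruby, while \p{L} covers letters in any script.
ACCENTED = "caf\u00e9 na\u00efve"   # "café naïve"

p ACCENTED.scan(/\w+/)                   # ["caf", "na", "ve"]
p ACCENTED.scan(/\p{L}+/).length         # 2: whole words
p ACCENTED.scan(/[[:alpha:]]+/).length   # 2: POSIX brackets are Unicode-aware

# Script properties select characters from one writing system.
GREEK_MIX = "Ruby \u03b1\u03b2\u03b3"   # "Ruby αβγ"
p GREEK_MIX[/\p{Latin}+/]               # "Ruby"
p GREEK_MIX[/\p{Greek}+/].length        # 3
```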
5. Common Pitfalls and How to Avoid Them
Greedy vs. Lazy Quantifiers
By default, quantifiers (*, +, {n,m}) are greedy (match as much as possible). Use ? to make them lazy (match as little as possible).
# Greedy: Matches from first <p> to last </p> (too much!)
text = "<p>First</p> <p>Second</p>"
text.match(/<p>.*<\/p>/)[0] # => "<p>First</p> <p>Second</p>"
# Lazy: Matches the first <p>...</p>
text.match(/<p>.*?<\/p>/)[0] # => "<p>First</p>"
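When the content cannot contain the delimiter, a negated character class is a common alternative to a lazy quantifier. A short sketch, with a hypothetical constant name:

```ruby
HTML_SNIPPET = "<p>First</p> <p>Second</p>"

# Lazy quantifier with a capture group: scan returns the inner text of each pair.
p HTML_SNIPPET.scan(/<p>(.*?)<\/p>/).flatten    # ["First", "Second"]

# Negated class: needs no lazy quantifier, as long as the content
# itself never contains "<".
p HTML_SNIPPET.scan(/<p>([^<]*)<\/p>/).flatten  # ["First", "Second"]

# The same ? suffix makes other quantifiers lazy as well.
p "aaaa"[/a+/]    # "aaaa" (greedy)
p "aaaa"[/a+?/]   # "a"    (lazy)
```

This is fine for small, predictable snippets; for real HTML a parser is the more robust tool.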
Escaping Special Characters
Special regex characters (., *, +, ?, etc.) must be escaped with \ to match them literally. Use Regexp.escape for dynamic patterns.
# Problem: "." matches any character, not a literal dot
invalid_regex = /example.com/
invalid_regex.match("exampleXcom") # => #<MatchData "exampleXcom"> (oops!)
# Solution: Escape the dot
valid_regex = /example\.com/
valid_regex.match("exampleXcom") # => nil
# Escape dynamic input with Regexp.escape
user_input = "user@example.com"
safe_regex = /\A#{Regexp.escape(user_input)}\z/
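To make the effect of Regexp.escape visible, here is a small sketch. contains_literal? and KEYWORDS are hypothetical names; Regexp.escape and Regexp.union are core Ruby.

```ruby
# Regexp.escape backslash-escapes characters that have special meaning.
puts Regexp.escape("1.5*2")   # 1\.5\*2

# `contains_literal?` is a hypothetical helper for searching user-supplied text.
def contains_literal?(haystack, needle)
  haystack.match?(/#{Regexp.escape(needle)}/)
end

p contains_literal?("price: 1.5*2", "1.5*2")   # true
p contains_literal?("price: 1x5*2", "1.5*2")   # false: "." is literal now

# Regexp.union escapes string arguments and joins them with "|".
KEYWORDS = Regexp.union("c++", "c#", "f#")
p "I like c++ and f#".scan(KEYWORDS)           # ["c++", "f#"]
```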
Performance
Complex regex can slow down your code. Optimize by:
- Using specific patterns (e.g., \d{4} instead of \d+ for 4-digit numbers).
- Avoiding nested quantifiers (e.g., (a+)* can cause “catastrophic backtracking”).
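The nested-quantifier advice can be sketched as follows: a flattened pattern, an atomic group, and, on Ruby 3.2 or later, a global timeout as a safety net. The constant names are hypothetical, and the one-second timeout is an arbitrary value chosen for illustration.

```ruby
# Nested quantifier: (a+)* can split a run of "a"s in exponentially many ways.
RISKY = /\A(a+)*b\z/
# Flattened pattern that accepts exactly the same strings.
SAFE  = /\Aa*b\z/

agree = %w[b ab aaab aaa ba].all? { |s| RISKY.match?(s) == SAFE.match?(s) }
p agree   # true

# An atomic group (?>...) never gives back what it matched, which also
# prevents the backtracking blow-up.
ATOMIC = /\A(?>a+)b\z/
p ATOMIC.match?("aaab")   # true
p ATOMIC.match?("aaa")    # false, decided without retrying shorter runs

# Ruby 3.2+ offers a global timeout (guarded here for older versions).
Regexp.timeout = 1.0 if Regexp.respond_to?(:timeout=)
```

Rewriting the pattern is the primary fix; the timeout only limits the damage if a slow pattern slips through.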
6. Best Practices
- Test Regex Early: Use tools like Rubular to test patterns interactively.
- Document Complex Patterns: Use free-spacing mode (x modifier) to add comments.
- Reuse Regex Objects: Define regex as constants (e.g., EMAIL_REGEX = /.../) so each pattern has one named definition; this also avoids rebuilding patterns created with Regexp.new or interpolation on every call.
- Prefer match? Over match for boolean checks (avoids creating a MatchData object).
- Handle Edge Cases: Test with invalid inputs, empty strings, and Unicode text.
7. Conclusion
Ruby’s regex support empowers developers to tackle text processing tasks efficiently. From validation to parsing and advanced manipulation, regex reduces boilerplate and improves readability when used correctly. By mastering the basics, avoiding common pitfalls, and following best practices, you’ll unlock regex’s full potential in your Ruby projects.
8. References
- Ruby Regexp Documentation
- Rubular: Ruby Regex Tester
- Unicode Property Escapes
- Mastering Regular Expressions (Book by Jeffrey Friedl)