Understanding Regex Patterns for URL Manipulation
Introduction
In this article, we’ll explore how to manipulate URLs using regular expressions (regex) in Python. We’ll focus on the basics of regex patterns and apply them to extract domain information from URLs.
What is a Regular Expression?
A regular expression (regex) is a pattern used to match character combinations in strings. Regex patterns are used extensively in text processing, data validation, and extraction tasks.
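As a small illustration, the sketch below uses Python's built-in re module to test whether a string contains something shaped like a domain name. The pattern name domain_like and the sample strings are assumptions chosen for this example; they are not the patterns used in the solution later in the article.

```python
import re

# Illustrative pattern (an assumption, not the article's re1):
# word characters or hyphens, a literal dot, then 2-5 lowercase letters.
domain_like = re.compile(r'[\w-]+\.[a-z]{2,5}')

samples = ['hello.tv', 'nourlhere', 'test.com/example/test']

# re.search scans the whole string and returns None when nothing matches.
matches = {s: bool(domain_like.search(s)) for s in samples}
print(matches)
```

Only the entries containing a dot followed by a short lowercase suffix are reported as matches.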
The Problem: Manipulating URLs with Regex
The problem at hand is to normalize URLs by adding the protocol (http or https) and/or path components with the help of regex patterns. The input is a pandas Series of messy URL data in which some entries are full URLs, some are bare domains, and some are not URLs at all.
import re
import pandas as pd
# Example URL data
example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl',
'https://www.qwer.com/example/qwer', 'None',
'test.com/example/test', 'None', '123135123',
'nourlhere', 'lol', 'hello.tv', 'nolink',
'ihavenowebsite.com'])
Solving the Problem: Refactored Code
The refactored code combines string replacement and regex matching with urllib.parse to normalize each entry.
import re
import urllib.parse
import pandas as pd
# Example URL data
example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl',
'https://www.qwer.com/example/qwer', 'None',
'test.com/example/test', 'None', '123135123',
'nourlhere', 'lol', 'hello.tv', 'nolink',
'ihavenowebsite.com'])
# Regex patterns
re1 = r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*)'
re3 = r'www\.([\w]*)'
def modurl(s):
    # Append "<domain>.<tld>" only when the URL has a host and its path is exactly "/example"
    u = urllib.parse.urlparse(s)
    if u.netloc == "" or u.path != "/example":
        return s
    labels = u.netloc.split('.')
    return f"{s}/{labels[-2]}.{labels[-1]}"
# URL manipulation
example = (example
    # Strip any existing protocol and "www." prefix
    .map(lambda x: x.replace('https://www.', ''))
    .map(lambda x: x.replace('www.', ''))
    .map(lambda x: x.replace('https://', ''))
    .map(lambda x: x.replace('http://', ''))
    # Prefix entries that contain a domain-like substring with "http://www."
    .map(lambda x: "http://www." + x if re.search(re1, x) else x)
    # re3 only matches strings that start with "www."; none remain in the sample data at this point
    .map(lambda x: re.match(re3, x) and f"{x}/{urllib.parse.urlparse(x).netloc.split('.')[-2]}.{urllib.parse.urlparse(x).netloc.split('.')[-1]}" or x)
    .map(modurl)
)
# Print the result
print(example.to_string())
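The same normalization can also be sketched as a single function, which may be easier to test than a chain of lambdas. The names PREFIX_RE and normalize are hypothetical and not part of the original code. Unlike the str.replace calls above, this version removes the protocol and www. prefix only at the start of the string, and a plain list stands in for the pandas Series.

```python
import re

# Same domain pattern as re1 in the article.
DOMAIN_RE = re.compile(r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*)')
# Hypothetical helper pattern: an optional protocol and optional "www." at the start.
PREFIX_RE = re.compile(r'^(?:https?://)?(?:www\.)?')


def normalize(value):
    """Return value with an "http://www." prefix if it looks like a URL, else unchanged."""
    bare = PREFIX_RE.sub('', value)
    if DOMAIN_RE.search(bare):
        return "http://www." + bare
    return value


entries = ['None', 'http://fakeurl.com/example/fakeurl',
           'https://www.qwer.com/example/qwer', 'test.com/example/test',
           '123135123', 'hello.tv']
normalized = [normalize(e) for e in entries]
for before, after in zip(entries, normalized):
    print(f"{before} -> {after}")
```

With a pandas Series, the function could be applied as example.map(normalize).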
Explanation
The refactored code uses urllib.parse to break URLs down into their constituent parts, such as the protocol (scheme), the domain (netloc), and the path. The regex patterns decide which entries look like URLs in the first place.
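To make the role of urllib.parse concrete, the short sketch below parses one URL with a protocol and one without. The variable names are illustrative. Note that without a protocol, urlparse places the whole string in path and leaves netloc empty, which is why the code adds a protocol before calling modurl.

```python
from urllib.parse import urlparse

with_scheme = urlparse('http://www.test.com/example/test')
print(with_scheme.scheme)   # http
print(with_scheme.netloc)   # www.test.com
print(with_scheme.path)     # /example/test

# Without a protocol, nothing is recognized as the network location.
without_scheme = urlparse('test.com/example/test')
print(repr(without_scheme.netloc))  # ''
print(without_scheme.path)          # test.com/example/test
```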
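Regex Pattern 1 (re1): Matches a domain-like substring: a run of allowed characters, a literal dot, and a top-level domain (TLD) of two to five lowercase letters, optionally followed by slashes or dots.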
([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*)
Regex Pattern 2 (re3): Matches a www. prefix followed by a run of word characters. Because it is applied with re.match, it only succeeds when the string starts with www.
www\.([\w]*)
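The behavior of the two patterns can be checked directly. The sketch below copies re1 and re3 from the code above and applies them to a few sample strings; the variable names m_path, m_bare, and so on are illustrative.

```python
import re

re1 = r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*)'
re3 = r'www\.([\w]*)'

# re1 captures the name, the TLD, and any trailing slashes or dots.
m_path = re.search(re1, 'test.com/example/test')
print(m_path.groups())   # ('test', 'com', '/')

m_bare = re.search(re1, 'hello.tv')
print(m_bare.groups())   # ('hello', 'tv', '')

# No dot followed by lowercase letters, so no match.
print(re.search(re1, '123135123'))   # None

# re.match anchors at the start of the string; re.search does not.
print(re.match(re3, 'www.qwer.com').group(1))          # qwer
print(re.match(re3, 'http://www.qwer.com'))            # None
print(re.search(re3, 'http://www.qwer.com').group(1))  # qwer
```

The difference between re.match and re.search in the last three lines explains why re3 has no effect once a protocol has been prepended.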
The modurl function does not use the regex patterns itself. It parses the string with urllib.parse.urlparse and checks two things: that a network location (domain) is present and that the path is exactly /example. If both hold, it returns a new URL of the form {original_url}/{domain}.{tld}. Otherwise, it returns the original string. The regex checks happen earlier, in the map chain, where re.search applies re1 and re.match applies re3.
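A standalone sketch of modurl, with the parsing done once, shows which inputs it changes. The sample URLs are assumptions made for illustration.

```python
from urllib.parse import urlparse


def modurl(s):
    u = urlparse(s)
    # Leave the string alone unless it has a host and the path is exactly "/example".
    if u.netloc == "" or u.path != "/example":
        return s
    labels = u.netloc.split('.')
    # Append the last two labels of the host, e.g. "test.com".
    return f"{s}/{labels[-2]}.{labels[-1]}"


changed = modurl('http://www.test.com/example')
longer_path = modurl('http://www.test.com/example/test')
no_scheme = modurl('test.com/example')

print(changed)      # http://www.test.com/example/test.com
print(longer_path)  # http://www.test.com/example/test
print(no_scheme)    # test.com/example
```

None of the sample entries in the Series has a path of exactly /example, so modurl leaves them all unchanged. A host with a single label, such as localhost, would raise an IndexError at labels[-2], so a length check may be worth adding for real data.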
Conclusion
In this article, we explored how to manipulate URLs using regex patterns in Python. We applied these techniques to extract domain information from URLs and construct new URLs with added protocol (http or https) and/or path components. The refactored code uses urllib.parse to simplify the URL manipulation process.
Last modified on 2024-03-28