Manipulating URLs Using Regular Expressions in Python

Understanding Regex Patterns for URL Manipulation

Introduction

In this article, we’ll explore how to manipulate URLs using regular expressions (regex) in Python. We’ll cover the basics of regex patterns and apply them to messy URL data: detecting which strings look like URLs and rebuilding them with a consistent protocol and host prefix.

What is a Regular Expression?

A regular expression (regex) is a pattern used to match character combinations in strings. Regex patterns are used extensively in text processing, data validation, and extraction tasks.
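
As a quick illustration of matching character combinations, here is a minimal sketch using Python's built-in re module. The pattern and the sample strings are made up for this example and are not part of the URL data used later.

```python
import re

# Hypothetical pattern for illustration: one or more word characters,
# a literal dot, then 2 to 5 lowercase letters (a rough "name.tld" shape).
pattern = r'\w+\.[a-z]{2,5}'

# re.search scans the whole string and returns the first match.
found = re.search(pattern, 'visit example.com today')
print(found.group(0))  # example.com

# re.match only succeeds if the pattern matches at the very start.
print(re.match(pattern, 'visit example.com today'))     # None
print(re.match(pattern, 'example.com today').group(0))  # example.com
```

The difference between re.search and re.match matters later, because the article's code uses both.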

The Problem: Manipulating URLs with Regex

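The task is to normalize URLs by adding a missing protocol (http or https) and, where needed, path components using regex patterns. We’re given a pandas Series of messy URL data: some entries are complete URLs, some are missing the protocol, and some are not URLs at all, such as the placeholder string 'None'.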

import pandas as pd

# Example URL data
example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl', 
                     'https://www.qwer.com/example/qwer', 'None',
                     'test.com/example/test', 'None', '123135123', 
                     'nourlhere', 'lol', 'hello.tv', 'nolink',
                     'ihavenowebsite.com'])

Solving the Problem: Refactored Code

The refactored code uses urllib.parse to finalize the URL manipulation.

import re
import urllib.parse
import pandas as pd

# Example URL data
example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl', 
                     'https://www.qwer.com/example/qwer', 'None',
                     'test.com/example/test', 'None', '123135123', 
                     'nourlhere', 'lol', 'hello.tv', 'nolink',
                     'ihavenowebsite.com'])

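# Regex patterns
# re1: a name, a dot, a TLD of 2 to 5 lowercase letters, then optional slashes or dots
re1 = r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*)'
# re3: a literal "www." followed by a captured run of word characters
re3 = r'www\.([\w]*)'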

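def modurl(s):
    """Append "domain.tld" to a URL whose path is exactly "/example"."""
    u = urllib.parse.urlparse(s)
    # Leave the string alone if it has no host or a different path
    if u.netloc == "" or u.path != "/example":
        return s
    # Last two labels of the host, e.g. ['www', 'test', 'com'] -> 'test.com'
    parts = u.netloc.split('.')
    return f"{s}/{parts[-2]}.{parts[-1]}"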

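# URL manipulation
example = (example
          # Strip any existing protocol and "www." prefix
          .map(lambda x: x.replace('https://www.', ''))
          .map(lambda x: x.replace('www.', ''))
          .map(lambda x: x.replace('https://', ''))
          .map(lambda x: x.replace('http://', ''))
          # Prepend "http://www." to anything that looks like a URL
          .map(lambda x: "http://www." + x if re.search(re1, x) else x)
          # Applies only to strings that still begin with "www."; none do in the example data
          .map(lambda x: f"{x}/{urllib.parse.urlparse(x).netloc.split('.')[-2]}.{urllib.parse.urlparse(x).netloc.split('.')[-1]}" if re.match(re3, x) else x)
          # Append the domain when the path is exactly "/example"
          .map(modurl)
)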

# Print the result
print(example.to_string())
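
To make the effect of the chain concrete, the sketch below applies the strip-then-prefix steps to single strings without pandas. The helper name normalize is hypothetical, and the re3 and modurl steps are left out because they leave the example data unchanged.

```python
import re

re1 = r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*)'

def normalize(x):
    # Hypothetical helper: same replacements as the map chain, in the same order.
    for prefix in ('https://www.', 'www.', 'https://', 'http://'):
        x = x.replace(prefix, '')
    # Prepend the prefix only when the string looks like a URL.
    return "http://www." + x if re.search(re1, x) else x

for raw in ['None', 'https://www.qwer.com/example/qwer',
            'test.com/example/test', 'hello.tv']:
    print(raw, '->', normalize(raw))
# None -> None
# https://www.qwer.com/example/qwer -> http://www.qwer.com/example/qwer
# test.com/example/test -> http://www.test.com/example/test
# hello.tv -> http://www.hello.tv
```

Note that the https entry comes out as http, since every protocol is stripped before the fixed prefix is added.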

Explanation

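The refactored code works in two stages. First, string replacement and regex matching strip any existing protocol and www. prefix, then prepend http://www. to every string that looks like a URL. Second, urllib.parse breaks each resulting URL into its constituent parts, such as protocol (scheme), domain (netloc), and path, so that modurl can inspect them.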
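
As a small sketch of what urllib.parse reports, the example below parses one of the sample URLs with and without a protocol. The variable names are chosen for illustration only.

```python
from urllib.parse import urlparse

# With a protocol, the host lands in netloc and the rest in path.
with_scheme = urlparse('http://www.test.com/example/test')
print(with_scheme.scheme)  # http
print(with_scheme.netloc)  # www.test.com
print(with_scheme.path)    # /example/test

# Without a protocol, urlparse treats the whole string as a path,
# so netloc is empty.
no_scheme = urlparse('test.com/example/test')
print(repr(no_scheme.netloc))  # ''
print(no_scheme.path)          # test.com/example/test
```

This is why the protocol has to be in place before any step that reads netloc.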

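Regex Pattern 1 (re1): Matches a name followed by a dot and a top-level domain (TLD) of 2 to 5 lowercase letters, optionally followed by slashes or dots. It is used with re.search to decide whether a string looks like a URL.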

([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*) 
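
A short sketch of how this pattern classifies a few entries from the example data follows. The dictionary name looks_like_url is just for illustration.

```python
import re

re1 = r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.([a-z]{2,5})([/.]*)'

samples = ['None', 'test.com/example/test', '123135123', 'hello.tv', 'lol']

# True where re.search finds a "name.tld" shape anywhere in the string.
looks_like_url = {s: bool(re.search(re1, s)) for s in samples}
print(looks_like_url)
# {'None': False, 'test.com/example/test': True, '123135123': False,
#  'hello.tv': True, 'lol': False}

# The three capture groups: name, TLD, trailing slashes or dots.
print(re.search(re1, 'test.com/example/test').groups())  # ('test', 'com', '/')
```

Strings without a dot can never match, which is what keeps entries like 'None' and 'lol' untouched.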

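Regex Pattern 2 (re3): Matches the literal prefix www. followed by a run of word characters, which it captures. It is used with re.match, so it only matches strings that begin with www.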

www\.([\w]*) 
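
The following sketch shows how this pattern behaves under re.match versus re.search, using a host from the example data.

```python
import re

re3 = r'www\.([\w]*)'

# re.match anchors at the start of the string.
m = re.match(re3, 'www.qwer.com/example/qwer')
print(m.group(1))  # qwer

# Once a protocol precedes "www.", re.match no longer finds it...
print(re.match(re3, 'http://www.qwer.com'))  # None

# ...whereas re.search would, because it scans the whole string.
print(re.search(re3, 'http://www.qwer.com').group(1))  # qwer
```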

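The modurl function does not use the regex patterns itself; it relies on urllib.parse. It returns the input unchanged when the string has no network location (for example, because it lacks a protocol) or when its path is anything other than /example. Otherwise, it appends the last two labels of the host, producing a new URL with the format {original_url}/{domain}.{tld}. The regex checks happen earlier, in the chain of map calls: re.search with re1 decides whether to prepend http://www., and re.match with re3 guards a step that appends the domain in the same way. With the example data, every path is longer than /example and no string begins with www. once the prefixes are stripped, so neither of these last two steps changes anything.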
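
To see when modurl actually changes its input, the sketch below restates the function in an equivalent form and calls it on three strings. The URL ending in /example is a hypothetical input that does not appear in the example data.

```python
import urllib.parse

def modurl(s):
    u = urllib.parse.urlparse(s)
    if u.netloc == "" or u.path != "/example":
        return s
    parts = u.netloc.split('.')
    return f"{s}/{parts[-2]}.{parts[-1]}"

# Path is exactly "/example": the domain is appended.
print(modurl('http://www.test.com/example'))
# http://www.test.com/example/test.com

# Longer path: returned unchanged.
print(modurl('http://www.test.com/example/test'))
# http://www.test.com/example/test

# No protocol, so netloc is empty: returned unchanged.
print(modurl('test.com/example'))
# test.com/example
```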

Conclusion

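In this article, we explored how to manipulate URLs using regex patterns in Python. We applied these techniques to identify which strings in a messy Series look like URLs and to rebuild them with a consistent http://www. prefix, leaving non-URL entries untouched. Because every protocol is stripped first, URLs that originally used https come out as http. The refactored code uses urllib.parse to inspect the host and path of each result rather than parsing them by hand.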


Last modified on 2024-03-28