Filtering DataFrames with Strings in Pandas
Introduction
This article shows how to filter rows of a pandas DataFrame based on the strings they contain. It also covers cleaning and preprocessing text data before a filter is applied, using tweets as the running example.
Why Filter Rows by String?
Text data usually needs cleaning before it is filtered or analyzed. Here the goal is to select tweets that contain specific words. Raw tweets carry mentions, hashtag symbols, retweet markers, and links, any of which can contain the search term by accident (a mention such as @dogefan, for example). Removing them first means a match reflects the wording of the tweet itself.
Cleaning Text Data
Before applying a filter, we need to clean the text data. This involves removing unwanted characters, such as mentions, hashtags, and hyperlinks. We’ll use Python’s re module to perform these operations.
Removing Mentions
A mention is the @ symbol followed by one or more word characters (letters, digits, or underscores). We can remove mentions using the following regular expression:
`{< highlight python >}
import re
text = "Hello @john, how are you?"
clean_text = re.sub(r'@\w+', "", text)
print(clean_text) # Output: Hello , how are you?
{</highlight>}`
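The same pattern can be applied to a whole column at once instead of one string at a time. The sketch below assumes the tweets are already in a pandas Series; the sample strings are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for real tweets
tweets = pd.Series(["Hello @john, how are you?", "Thanks @jane_doe and @bob!"])

# regex=True tells pandas to treat the pattern as a regular expression
no_mentions = tweets.str.replace(r"@\w+", "", regex=True)
print(no_mentions.tolist())
```

Note that the punctuation and spaces around each mention are left in place, so the result may still need whitespace cleanup.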
Removing Hashtags
Hashtags begin with the # symbol. The following regular expression removes only the symbol and keeps the tagged word as ordinary text:
`{< highlight python >}
import re
text = "I love #python"
clean_text = re.sub(r"#", "", text)
print(clean_text) # Output: I love python
{</highlight>}`
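Whether to keep the tagged word is a design choice. A minimal sketch comparing the two options is shown here; the second pattern, which drops the whole hashtag, is an alternative and not the approach used in the rest of this article.

```python
import re

text = "I love #python and #pandas"

# Strip only the symbol, keeping the tagged word as ordinary text
symbol_removed = re.sub(r"#", "", text)

# Alternative: drop the whole hashtag (symbol plus the word characters after it)
tag_removed = re.sub(r"#\w+", "", text)

print(symbol_removed)
print(repr(tag_removed))
```

Keeping the word is usually the better fit when the goal is keyword filtering, since a tweet tagged #doge would otherwise stop matching a search for "doge".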
Removing Retweets
Retweets are marked by the prefix RT followed by one or more whitespace characters. We’ll remove that marker using the following regular expression:
`{< highlight python >}
import re
text = "RT @john, how are you?"
clean_text = re.sub(r"RT[\s]+", "", text)
print(clean_text) # Output: @john, how are you?
{</highlight>}`
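One caveat with this pattern is that it matches the letters RT anywhere in the text, including at the end of an uppercase word. The sketch below, using a made-up tweet, shows the side effect and one possible way to avoid it by anchoring the pattern to the start of the string. The anchored variant is a suggestion, not part of the original cleaning function.

```python
import re

text = "RT @john: a SMART move"

# Unanchored: also removes the "RT " at the end of "SMART "
unanchored = re.sub(r"RT\s+", "", text)

# Anchored to the start of the string: only the retweet marker is removed
anchored = re.sub(r"^RT\s+", "", text)

print(unanchored)
print(anchored)
```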
Removing Hyperlinks
Hyperlinks start with http or https, followed by :// and one or more non-space characters. We’ll remove them using the following regular expression:
`{< highlight python >}
import re
text = "Hello https://www.example.com"
clean_text = re.sub(r"https?:\/\/\S+", "", text)
print(clean_text) # Output: Hello
{</highlight>}`
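Each substitution in this section deletes text but leaves the surrounding spaces behind, as in the "Hello , how are you?" result earlier. A possible extra step, not part of the original cleaning function, is to collapse runs of whitespace and trim the ends:

```python
import re

text = "Hello https://www.example.com and goodbye"

# Remove the link; the spaces on either side of it remain
no_links = re.sub(r"https?://\S+", "", text)

# Collapse consecutive whitespace into one space and trim both ends
tidy = re.sub(r"\s+", " ", no_links).strip()

print(repr(no_links))
print(repr(tidy))
```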
Cleaning Tweets with Python’s re Module
We can create a function to clean the tweets using the regular expressions above:
`{< highlight python >}
import re
import pandas as pd
import tweepy

def cleantext(text):
    text = re.sub(r'@\w+', "", text) # Remove mentions
    text = re.sub(r"#", "", text) # Remove the hashtag symbol (keeps the word)
    text = re.sub(r"RT[\s]+", "", text) # Remove the retweet marker
    text = re.sub(r"https?:\/\/\S+", "", text) # Remove hyperlinks
    return text
# Authenticate API
api = tweepy.API(tweepy.OAuthHandler(
consumer_key='your_consumer_key',
consumer_secret='your_consumer_secret',
access_token='your_access_token',
access_token_secret='your_access_token_secret'
))
# Get tweets (the timeline endpoint returns at most 200 tweets per request)
tweets = api.user_timeline(screen_name="elonmusk", count=200, lang="en", tweet_mode="extended")
# Clean the tweets
df = pd.DataFrame([tweet.full_text for tweet in tweets], columns = ["tweet"])
df["tweet"] = df["tweet"].apply(cleantext)
# Filter tweets containing 'doge' in any capitalisation ('Doge', 'DOGE', ...)
filtered_tweets = df.loc[df['tweet'].str.contains('doge', case=False)]
print(filtered_tweets)
{</highlight>}`
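The script above needs valid API credentials to run. To try the cleaning step without them, the function can be applied to a small hand-written DataFrame. The sketch below repeats the same four substitutions; the sample tweets are invented for illustration and are not real API output.

```python
import re
import pandas as pd

def cleantext(text):
    text = re.sub(r"@\w+", "", text)            # Remove mentions
    text = re.sub(r"#", "", text)               # Remove the hashtag symbol
    text = re.sub(r"RT[\s]+", "", text)         # Remove the retweet marker
    text = re.sub(r"https?:\/\/\S+", "", text)  # Remove hyperlinks
    return text

# Made-up tweets for illustration only
sample = [
    "RT @dogefan: Doge to the moon! https://example.com/x",
    "Working on #rockets today",
    "doge is the people's crypto",
]

df = pd.DataFrame(sample, columns=["tweet"])
df["tweet"] = df["tweet"].apply(cleantext)
print(df["tweet"].tolist())
```

The first tweet keeps its leading colon and a trailing space, which shows that these substitutions remove the targeted tokens but do not tidy the punctuation and whitespace around them.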
Using df.loc to Filter Rows
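To filter rows based on a string, we can use the loc indexer, which selects a group of rows and columns by label(s) or by a boolean array. Here, str.contains produces a boolean Series that marks the rows whose text contains the search string, and df.loc keeps only those rows.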
`{< highlight python >}
import pandas as pd
# Create a DataFrame
d = {'tweet': ['elon tweets about doge coin', 'elon tweets about bitcoin']}
df = pd.DataFrame(data=d)
# Filter tweets containing 'doge' (the match is case-sensitive by default)
filtered_tweets = df.loc[df['tweet'].str.contains('doge')]
print(filtered_tweets)
{</highlight>}`
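Real tweet text varies in capitalisation and may contain missing values, both of which affect str.contains. The following sketch uses the case and na parameters of str.contains to handle them, and a regular expression alternation to match more than one keyword. The sample rows are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"tweet": [
    "Doge to the moon",
    "elon tweets about doge coin",
    "elon tweets about bitcoin",
    None,  # a missing value
]})

# case=False ignores capitalisation; na=False treats missing values as non-matches
mask = df["tweet"].str.contains("doge", case=False, na=False)
doge_tweets = df.loc[mask]

# The pattern is a regular expression by default, so | matches either keyword
either = df.loc[df["tweet"].str.contains(r"doge|bitcoin", case=False, na=False)]

print(doge_tweets)
print(either)
```

Without na=False, the missing row yields a missing value in the mask instead of False, which makes the mask unsuitable for boolean indexing.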
Conclusion
In this article, we explored how to filter rows from a DataFrame based on strings using pandas. We cleaned the text data by removing mentions, hashtag symbols, retweet markers, and hyperlinks, and then applied the filter using df.loc. This technique is useful in data analysis and machine learning tasks where a dataset must be narrowed to the relevant text.
Additional Resources
For more information on working with DataFrames, we recommend checking out the Pandas documentation. Additionally, for more advanced text processing techniques, you may want to explore Python’s NLTK library.
Last modified on 2023-12-04