Filtering DataFrames with Strings in Pandas

Introduction

This article shows how to filter the rows of a pandas DataFrame based on the strings they contain. It also covers cleaning and preprocessing text data before a filter is applied.

Why Filter Rows by String?

Raw text is often noisy, so it helps to clean and preprocess it before filtering or analysis. In this case, we want to find tweets that contain specific words. Cleaning the text first means the filter matches the words of the tweet itself rather than fragments of mentions or links.

Cleaning Text Data

Before applying a filter, we need to clean the text data. This involves removing unwanted elements such as mentions, hashtag symbols, retweet markers, and hyperlinks. We’ll use Python’s re module to perform these operations.

Removing Mentions

Mentions can be represented by the @ symbol followed by one or more word characters. We can remove them using the following regular expression:

`{< highlight python >}
import re

text = "Hello @john, how are you?"
clean_text = re.sub(r'@\w+', "", text)
print(clean_text)  # Output: Hello , how are you?
{</highlight>}`

Removing Hashtags

Hashtags start with the # symbol. We’ll remove only the symbol and keep the word that follows it, using the following regular expression:

`{< highlight python >}
import re

text = "I love #python"
clean_text = re.sub(r"#", "", text)
print(clean_text)  # Output: I love python
{</highlight>}`
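Stripping only the `#` keeps the tag’s word in the text, which is useful when the word carries meaning. If the whole tag should go instead, a pattern such as `#\w+` is one option. The sketch below compares the two on a made-up sentence; the variable names are illustrative only.

```python
import re

text = "I love #python and #pandas"

# Keep the tag word, drop only the symbol
symbol_removed = re.sub(r"#", "", text)

# Drop the whole hashtag: the symbol plus the word characters after it
tag_removed = re.sub(r"#\w+", "", text)

print(symbol_removed)  # I love python and pandas
print(repr(tag_removed))
```

Removing whole tags leaves gaps in the sentence, so this variant usually needs a whitespace cleanup afterwards.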

Removing Retweets

Retweets are marked by the prefix RT followed by one or more whitespace characters. We’ll remove the marker using the following regular expression:

`{< highlight python >}
import re

text = "RT @john, how are you?"
clean_text = re.sub(r"RT[\s]+", "", text)
print(clean_text)  # Output: @john, how are you?
{</highlight>}`
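One caveat with this pattern is that it also matches the letters RT at the end of a longer word, such as "START". A possible refinement, offered here as an assumption rather than part of the original recipe, is to require a word boundary before RT. The names `loose` and `strict` are illustrative.

```python
import re

loose = r"RT[\s]+"    # the pattern used above
strict = r"\bRT\s+"   # assumption: only match RT when it starts a word

text = "START now, RT @john hello"

loose_result = re.sub(loose, "", text)
strict_result = re.sub(strict, "", text)

print(loose_result)   # STAnow, @john hello
print(strict_result)  # START now, @john hello
```

The stricter pattern leaves ordinary words intact while still removing the standalone retweet marker.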

Removing Hyperlinks

Hyperlinks can be represented by the http or https protocol followed by one or more non-space characters. We’ll remove them using the following regular expression:

`{< highlight python >}
import re

text = "Hello https://www.example.com"
clean_text = re.sub(r"https?:\/\/\S+", "", text)
print(clean_text)  # Output: Hello 
{</highlight>}`
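The substitutions in this section leave stray spaces behind, for example the space before the comma in "Hello , how are you?" or the trailing space after a removed link. A minimal sketch of a tidy-up step follows; `tidy_whitespace` is a hypothetical helper, not part of the re module.

```python
import re

def tidy_whitespace(text):
    """Hypothetical helper: collapse whitespace runs and trim the ends."""
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    text = re.sub(r"\s+([,.!?])", r"\1", text)  # drop a space left before punctuation
    return text.strip()

print(tidy_whitespace("Hello , how are you?"))  # Hello, how are you?
print(tidy_whitespace("Hello "))                # Hello
```

Whether this step is worth adding depends on how the cleaned text is used; for a plain substring filter the extra spaces are harmless.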

Cleaning Tweets with Python’s `re` Module

We can create a function to clean the tweets using the regular expressions above:

`{< highlight python >}
import re

import pandas as pd
import tweepy

def cleantext(text):
    text = re.sub(r'@\w+', "", text) # Remove Mentions
    text = re.sub(r"#", "", text) # Remove Hashtags Symbol
    text = re.sub(r"RT[\s]+", "", text) # Remove Retweets
    text = re.sub(r"https?:\/\/\S+", "", text) # Remove The Hyper Link
    
    return text

# Authenticate with the Twitter API (replace the placeholders with your own credentials)
auth = tweepy.OAuthHandler('your_consumer_key', 'your_consumer_secret')
auth.set_access_token('your_access_token', 'your_access_token_secret')
api = tweepy.API(auth)

# Get tweets (the endpoint returns at most 200 tweets per request)
tweets = api.user_timeline(screen_name="elonmusk", count=200, tweet_mode="extended")

# Clean the tweets
df = pd.DataFrame([tweet.full_text for tweet in tweets], columns = ["tweet"])
df["tweet"] = df["tweet"].apply(cleantext)

# Filter tweets containing 'doge' (case-insensitive, so 'Doge' also matches)
filtered_tweets = df.loc[df['tweet'].str.contains('doge', case=False)]
print(filtered_tweets)
{</highlight>}`
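The same cleaning can be tried without API credentials. The sketch below applies the four patterns with pandas’ vectorized `Series.str.replace` on two made-up tweets that stand in for the API results; it is one possible alternative to `apply`, not the only way to do it.

```python
import pandas as pd

# Made-up sample tweets standing in for the API results
df = pd.DataFrame({"tweet": [
    "RT @elonmusk: Doge to the moon https://example.com/x",
    "Working on #rockets today",
]})

# The same four patterns as in cleantext, applied in the same order
patterns = [r"@\w+", r"#", r"RT[\s]+", r"https?://\S+"]

cleaned = df["tweet"]
for pattern in patterns:
    cleaned = cleaned.str.replace(pattern, "", regex=True)

# Trim the spaces left at the ends by the removals
df["tweet"] = cleaned.str.strip()

print(df["tweet"].tolist())
```

Passing `regex=True` explicitly matters here, because recent pandas versions treat the pattern as a literal string by default.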

Using `df.loc` to Filter Rows

To filter rows based on a string, we can use the loc indexer. loc selects a group of rows and columns by label(s) or by a boolean array. The str.contains method produces such a boolean array: it holds True for every row whose text contains the given pattern.

`{< highlight python >}
import pandas as pd

# Create a DataFrame
d = {'tweet': ['elon tweets about doge coin', 'elon tweets about bitcoin']}
df = pd.DataFrame(data=d)

# Filter tweets containing 'doge' (str.contains is case-sensitive by default)
filtered_tweets = df.loc[df['tweet'].str.contains('doge')]

print(filtered_tweets)
{</highlight>}`
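Real tweet data tends to have mixed capitalisation and missing values, and a search often involves more than one keyword. The sketch below shows one way to handle these cases with the `case` and `na` parameters of `str.contains`; the sample rows and the keyword list are made up for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame({"tweet": [
    "Doge is up today",
    "elon tweets about bitcoin",
    None,                       # missing text
    "nothing about crypto",
]})

# case=False ignores capitalisation; na=False treats missing text as "no match"
doge_mask = df["tweet"].str.contains("doge", case=False, na=False)
print(df.loc[doge_mask])

# Several keywords: join them into one alternation, escaping each word
keywords = ["doge", "bitcoin"]
pattern = "|".join(re.escape(word) for word in keywords)
any_mask = df["tweet"].str.contains(pattern, case=False, na=False)
print(df.loc[any_mask])
```

Escaping each keyword with `re.escape` keeps characters such as `.` or `+` from being read as regular expression syntax.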

Conclusion

In this article, we explored how to filter rows from a DataFrame based on strings using pandas. We cleaned the text data by removing mentions, hashtag symbols, retweet markers, and hyperlinks, and then applied the filter using df.loc together with str.contains. This technique is useful in data analysis and machine learning tasks where a dataset must be narrowed down to the rows that mention a given term.

Additional Resources

For more information on working with DataFrames, we recommend checking out the Pandas documentation. Additionally, for more advanced text processing techniques, you may want to explore Python’s NLTK library.

Last modified on 2023-12-04