Searching for Specific Values in Pandas DataFrames: A Step-by-Step Guide

Searching an Entire DataFrame for a Specific Value

When working with dataframes in pandas, it’s not uncommon to need to search for specific values within the dataframe. In this article, we’ll explore how to achieve this using the contains function and return the value next to each match.

Understanding the Problem

Let’s start by looking at the sample dataset provided:

Protocol Number:    xx-yzm2 
        
Section Major       Task                        Budget
1                   Study Setup                 25303.18
2                   Study Setup Per-Location    110037.8
3                   Site Identified by CRO      29966.25
4                   Pre-study Site Visit (PSSV) 130525.92

We want to search for the keyword 'protocol' and return the value next to each match. This means we need to find all rows where the value in a specific column contains the keyword.

The Solution

One way to achieve this is by using the apply function along with a lambda function that checks if the specified column contains the keyword. We’ll also use the np.where function to get the indices of the matching rows and columns.

Here’s how you can do it:

import pandas as pd
import numpy as np

data = {0: ['Protocol Number:', np.nan, 'Section Major', '1', '2', '3', '4'],
         1: ['xx-yzm2', np.nan, 'Task', 'Study Setup', 'Study Setup Per-Location', 
             'Site Identified by CRO', 'Pre-study Site Visit (PSSV)'],
         2: [np.nan, np.nan, 'Budget', '25303.18', '110037.8', '29966.25', '130525.92']}

df = pd.DataFrame(data)

                  0                            1          2
0  Protocol Number:                      xx-yzm2        NaN
1               NaN                          NaN        NaN
2     Section Major                         Task     Budget
3                 1                  Study Setup   25303.18
4                 2     Study Setup Per-Location   110037.8
5                 3       Site Identified by CRO   29966.25
6                 4  Pre-study Site Visit (PSSV)  130525.92

keyword = 'protocol'

# case-insensitive: case=False
# row: array([0], dtype=int64), col: array([0], dtype=int64)
row, col = np.where(df.apply(lambda x: x.astype(str).str.contains(keyword, case=False)))

result = df.iat[row[0],col[0]+1]
print(result)

# xx-yzm2

Why case=False?

In the lambda function, we’re using the str.contains method to check if the specified column contains the keyword. By default, this method is case-sensitive, meaning it will only match if the exact same characters appear in both strings (ignoring case). To make it case-insensitive, we set case=False. This allows us to find matches regardless of whether they’re uppercase or lowercase.

Handling Multiple Matches

If you want to get all matches instead of just the first one, you can use a loop. Here’s how:

for i in range(len(row)):
    if not col[i]+1 == len(df.columns):
        print(df.iat[row[i],col[i]+1])
    else:
        # error handle, you're keyword was found in last column, 
        # i.e. there is no `next` col
        pass

This code will print the value next to each match found in the dataframe.

Error Handling

As you can see, if the keyword is found in a column that has only one element (i.e., it’s not followed by another column), this code will throw an error because col[i]+1 would be equal to the total number of columns. To handle such cases, we could add some basic error checking:

for i in range(len(row)):
    if col[i] == len(df.columns)-1:
        print(df.iat[row[i],col[i]])

Conclusion

In this article, we explored how to search an entire dataframe for a specific value. We used the contains function and returned the value next to each match by using the apply, np.where, and indexing functions in pandas.


Last modified on 2024-12-11