Comparing Two Lists from SQL in Python and Showing Result Using Pandas.IO
When working with data in Python, we often need to compare two datasets or tables stored in a database. In this blog post, we will explore how to compare two lists of data stored in SQL databases using Python and the popular pandas library.
Introduction to pandas and SQL Data Retrieval
Pandas is a powerful library for data manipulation and analysis in Python. It provides data structures like Series (one-dimensional labeled array) and DataFrame (two-dimensional labeled data structure with columns of potentially different types).
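We will use the SQL readers in pandas.io, the subpackage that handles reading and writing data in various formats. These readers are exposed at the top level as functions such as pd.read_sql_table and pd.read_sql_query.
To start, we must first install the necessary libraries. The most essential one is pandas. The examples below also use SQLAlchemy to connect to the database, plus a database driver (psycopg2 for PostgreSQL). We can install them via pip, Python’s package manager:
pip install pandas sqlalchemy psycopg2-binary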
Retrieving Data from SQL Database
SQL databases store our data in tables, which are similar to spreadsheets. Each row of a table represents a single record or entry.
We will retrieve the data from two different SQL tables using the pandas SQL readers.
Let us first create a sample database with two tables:
CREATE TABLE df1 (
PN INT,
Stock VARCHAR(10),
WHS VARCHAR(5),
Cost DECIMAL(3,2)
);
CREATE TABLE df2 (
PN INT,
Stock VARCHAR(10),
WHS VARCHAR(5),
Cost DECIMAL(3,2),
Time VARCHAR(10)
);
We will insert some data into these tables:
INSERT INTO df1 (PN, Stock, WHS, Cost) VALUES
(1111, '1', 'VLN', 0.20);
INSERT INTO df1 (PN, Stock, WHS, Cost) VALUES
(1111, '2', 'VLN', 0.20);
INSERT INTO df2 (PN, Stock, WHS, Cost, Time) VALUES
(1111, '1', 'VLN', 0.20, '15:00');
INSERT INTO df2 (PN, Stock, WHS, Cost, Time) VALUES
(1111, '3', 'VLN', 0.20, '16:00');
Retrieving Data with pandas.IO
Now that we have created our database and inserted some data into it, let us use pd.read_sql_table to retrieve the data from these tables.
import pandas as pd
from sqlalchemy import create_engine
# Create an engine that can connect to a SQL database.
# user, password, host, port and dbname are placeholders; replace them with your own values.
engine = create_engine('postgresql://user:password@host:port/dbname')
# Read the df1 table from our database into a DataFrame
df1_data = pd.read_sql_table('df1', engine)
# Read the df2 table from our database into a DataFrame
df2_data = pd.read_sql_table('df2', engine)
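For readers without a PostgreSQL server at hand, the same retrieval step can be tried against an in-memory SQLite database. This is only a sketch under stated assumptions, not the setup used above: it swaps PostgreSQL for SQLite, uses pd.read_sql_query on a plain sqlite3 connection instead of pd.read_sql_table (which requires SQLAlchemy), and the helper name load_sample_tables is hypothetical.

```python
import sqlite3

import pandas as pd


def load_sample_tables():
    """Create the sample df1/df2 tables in an in-memory SQLite database
    and return them as two DataFrames (hypothetical helper for illustration)."""
    conn = sqlite3.connect(":memory:")
    try:
        # Same schema and rows as the SQL statements above, adapted to SQLite
        conn.executescript(
            """
            CREATE TABLE df1 (PN INT, Stock VARCHAR(10), WHS VARCHAR(5),
                              Cost DECIMAL(3,2));
            CREATE TABLE df2 (PN INT, Stock VARCHAR(10), WHS VARCHAR(5),
                              Cost DECIMAL(3,2), Time VARCHAR(10));
            INSERT INTO df1 (PN, Stock, WHS, Cost) VALUES
                (1111, '1', 'VLN', 0.20),
                (1111, '2', 'VLN', 0.20);
            INSERT INTO df2 (PN, Stock, WHS, Cost, Time) VALUES
                (1111, '1', 'VLN', 0.20, '15:00'),
                (1111, '3', 'VLN', 0.20, '16:00');
            """
        )
        # read_sql_query accepts a sqlite3 connection directly
        df1_data = pd.read_sql_query("SELECT * FROM df1", conn)
        df2_data = pd.read_sql_query("SELECT * FROM df2", conn)
    finally:
        conn.close()
    return df1_data, df2_data


df1_data, df2_data = load_sample_tables()
print(df1_data)
print(df2_data)
```

Note that column-name casing can differ between database engines, so it is worth checking df1_data.columns before merging.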
Comparing Data with Pandas
Now that we have retrieved the data from both tables, let us compare these two datasets.
To do this, we will use the pandas.merge function, which joins two DataFrames on one or more key columns. When no “on” argument is given, it joins on every column the two DataFrames have in common.
# Perform an outer join of df1 and df2 on the columns they share
merged_data = pd.merge(df1_data, df2_data, how='outer', indicator=True)
# Keep rows that exist in df2 (drop rows found only in df1)
filtered_data = merged_data[(merged_data['_merge'] == 'right_only') | (merged_data['_merge'] == 'both')]
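To see what the indicator column looks like without a database, the merge can be reproduced on two small DataFrames built by hand. This is a minimal sketch: the rows mirror the sample INSERT statements, the column names are written as in the CREATE TABLE statements (your database may report them in a different case), and the explicit key_columns list is an illustrative choice rather than part of the original code.

```python
import pandas as pd

# Sample rows mirroring the df1 and df2 tables
df1_data = pd.DataFrame(
    {"PN": [1111, 1111], "Stock": ["1", "2"],
     "WHS": ["VLN", "VLN"], "Cost": [0.20, 0.20]}
)
df2_data = pd.DataFrame(
    {"PN": [1111, 1111], "Stock": ["1", "3"],
     "WHS": ["VLN", "VLN"], "Cost": [0.20, 0.20],
     "Time": ["15:00", "16:00"]}
)

# Naming the join keys explicitly documents which columns define a "match"
key_columns = ["PN", "Stock", "WHS", "Cost"]
merged_data = pd.merge(
    df1_data, df2_data, how="outer", on=key_columns, indicator=True
)

# Map each Stock value to the label pandas assigned in the _merge column
labels = dict(zip(merged_data["Stock"], merged_data["_merge"].astype(str)))
print(merged_data)
print(labels)
```

For these rows, Stock '1' is labeled both, Stock '2' is labeled left_only, and Stock '3' is labeled right_only.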
Explanation and Context
Here’s a breakdown of what each line does:
pd.merge(df1_data, df2_data, how='outer', indicator=True)
: This line performs an outer join on the two DataFrames. The “how” parameter specifies whether to perform an inner join (default), left join, right join, or full outer join. In this case, we used ‘outer’ because we want to keep every row from df1 and every row from df2, whether or not it has a match in the other table. Because no “on” argument is given, the join uses all columns the two DataFrames share.
indicator=True
: This tells pandas.merge to create a new column called ‘_merge’ in the resulting DataFrame. The value of this column will be ‘left_only’, ‘both’, or ‘right_only’. A left_only row was found only in the left operand (df1), a right_only row was found only in the right operand (df2), and a both row was found in both. For a left_only row, columns that exist only in df2 (such as Time) are filled with NaN.
filtered_data = merged_data[(merged_data['_merge'] == 'right_only') | (merged_data['_merge'] == 'both')]
: This line keeps the rows that exist in df2 and drops the rows found only in df1.
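The same ‘_merge’ column can also be used to split the comparison into three groups at once. The sketch below is one possible way to do this, not the only one: compare_frames is a hypothetical helper, and the two small DataFrames are simplified stand-ins for the sample tables.

```python
import pandas as pd


def compare_frames(left, right, keys):
    """Outer-merge two DataFrames on `keys` and split the rows by origin
    (hypothetical helper for illustration)."""
    merged = pd.merge(left, right, how="outer", on=keys, indicator=True)
    return {
        "only_left": merged[merged["_merge"] == "left_only"].drop(columns="_merge"),
        "only_right": merged[merged["_merge"] == "right_only"].drop(columns="_merge"),
        "in_both": merged[merged["_merge"] == "both"].drop(columns="_merge"),
    }


# Simplified stand-ins for the df1 and df2 sample tables
left = pd.DataFrame({"PN": [1111, 1111], "Stock": ["1", "2"]})
right = pd.DataFrame(
    {"PN": [1111, 1111], "Stock": ["1", "3"], "Time": ["15:00", "16:00"]}
)

result = compare_frames(left, right, keys=["PN", "Stock"])
for name, frame in result.items():
    print(name, frame["Stock"].tolist())
```

Returning the groups separately makes it easy to report, for example, which records were added to or removed from the second table.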
Conclusion
In this blog post, we covered how to compare two lists of data from SQL databases using Python and pandas. We used pandas to retrieve the data from two tables, then performed an outer join on these DataFrames with an indicator column that shows whether each record appears in df1, in df2, or in both.
We also explained how to use the ‘outer’ join option of the pandas.merge function and how to drop the rows that have no match in df2.
By following this guide, you should be able to compare two lists of data from SQL databases using Python and pandas.
Last modified on 2025-04-23