How To Compare Two Dataframes in Python

There are many scenarios in which we need to compare two dataframes column-wise and row-wise. In this article, we will learn how to compare two dataframes in Python.

Two dataframes can be compared in Python using the compare() method. This method is part of the Pandas library. It identifies the differences between two dataframes and returns a new dataframe that displays these differences in a clear and concise manner.

We will look into this in more detail.

Introduction

In the world of data analysis, comparing two dataframes is a common task that is often required to identify differences or similarities between datasets. Python, being a popular programming language for data analysis, offers several libraries to work with tabular data, including Pandas.

Pandas is a fast, flexible, and efficient library for data manipulation and analysis. Among its many functions for working with dataframes is the ability to compare two of them. In this article, we will discuss how to compare two dataframes in Python using the compare() method available in the Pandas library.

Compare Method

Use the compare() method available in the Pandas library to compare two dataframes in Python. The method is called on the first dataframe and takes the second dataframe as its argument. It returns a new dataframe that contains the differences between them. Both dataframes must have identical row and column labels. The differences can be laid out either side by side in columns or stacked in rows.
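The requirement for identical labels can be seen in a minimal sketch. The two small dataframes below, left and right, are hypothetical and exist only for illustration. Their column labels differ, so pandas raises a ValueError instead of returning a comparison.

```python
import pandas as pd

# Hypothetical dataframes for illustration; their column labels differ
left = pd.DataFrame({'Price': [100, 50]})
right = pd.DataFrame({'Cost': [100, 50]})

try:
    left.compare(right)
    labels_match = True
except ValueError as err:
    # pandas refuses to compare dataframes whose labels are not identical
    labels_match = False
    print(f"Cannot compare: {err}")
```

If the labels differ only by accident (for example, a renamed column), renaming or reindexing before calling compare() is one way to proceed.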

Syntax

DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False)
Parameters
  • other: The second dataframe, which is compared with the first. This argument is required; calling compare() without it raises a TypeError.
  • align_axis: The axis along which the compared values are aligned. The default is 1, which places the values from the two dataframes side by side in columns. Setting it to 0 stacks them in alternating rows instead.
  • keep_shape: Whether to keep all rows and columns of the original dataframes. The default is False, which means the result contains only the rows and columns that have at least one difference. If set to True, every row and column is kept.
  • keep_equal: Whether to show values that are equal in the two dataframes. The default is False, which means equal values are shown as NaN. If set to True, equal values are shown as they are.

How to Use compare() Method

To use the compare() method, we need to:

  • Import the Pandas library.
  • Create two dataframes to compare. 
  • Call the compare() method to get the comparison.
# import pandas library
import pandas as pd

# create first dataframe
df1 = pd.DataFrame({
    'Items': ['Bag', 'Shoes', 'Ring', 'Jacket'],
    'Price': [100, 50, 50, 70],
    'Items_Sold': [10, 300, 505, 90]},
    columns=['Items', 'Price', 'Items_Sold'])

# create second dataframe
df2 = pd.DataFrame({
    'Items': ['Bag', 'Shoes', 'Ring', 'Jacket'],
    'Price': [10, 70, 55, 90],
    'Items_Sold': [5, 110, 5, 90]},
    columns=['Items', 'Price', 'Items_Sold'])

In the above code, two dataframes, df1 and df2, are created using pd.DataFrame(). Both have three columns named Items, Price, and Items_Sold. The Items column is identical in the two dataframes, every Price value differs, and the Items_Sold values differ for every item except Jacket.

Let's use the compare() method to compare df1 with df2.

# Compare the two dataframes
df1.compare(df2)

When compare() is called on df1 with df2 as its argument, it returns a new dataframe that contains only the columns in which the two dataframes differ. The Items column is identical in both, so it does not appear. Under each remaining column there is an additional level of column headers with two entries, self and other. These headers indicate which dataframe each value comes from and help to distinguish the differences between the two dataframes.

For instance, in the output of the compare() method in the example code, we can see that the Price column has self and other sub-columns. The self sub-column contains the values from the original df1 dataframe, while the other sub-column contains the values from the df2 dataframe. The same applies to the Items_Sold column.

The self and other sub-columns enable us to quickly see how the values in each column differ between the two dataframes. In the above output, we can see that the Price for the Bag item in df1 was 100 in the self column, while in df2 it was only 10 in the other column. Similarly, the Items_Sold for the Shoes item was 300 in the self column, but only 110 in the other column. By comparing the values in the self and other sub-columns, we can easily identify the differences between the two dataframes.

In the output, the Items_Sold value for the Jacket item is NaN in both the self and other columns. This happens because the Items_Sold value for the Jacket item is 90 in df1 and also 90 in df2. Since the values are the same, the compare() method does not report a difference and shows NaN in both sub-columns. The Jacket row still appears because its Price differs. In this output, NaN therefore marks a value that is equal in both dataframes rather than a missing value in the original data.
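The structure described above can be checked directly. The following sketch rebuilds the same df1 and df2 and inspects the result. The variable names diff and jacket_is_nan are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Items': ['Bag', 'Shoes', 'Ring', 'Jacket'],
    'Price': [100, 50, 50, 70],
    'Items_Sold': [10, 300, 505, 90]})
df2 = pd.DataFrame({
    'Items': ['Bag', 'Shoes', 'Ring', 'Jacket'],
    'Price': [10, 70, 55, 90],
    'Items_Sold': [5, 110, 5, 90]})

diff = df1.compare(df2)

# Column headers are (column name, 'self' or 'other') pairs
print(list(diff.columns))

# A single sub-column is selected with a tuple key
print(diff[('Price', 'self')])

# Jacket (row 3) has the same Items_Sold in both dataframes, so it is NaN
jacket_is_nan = pd.isna(diff.loc[3, ('Items_Sold', 'self')])
print(jacket_is_nan)
```

Selecting with a tuple such as ('Price', 'other') is handy when only one side of the comparison is needed for further processing.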

Note: The compare() method is only available in pandas version 1.1.0 or above.

# Check pandas version
import pandas as pd
print(pd.__version__)

The Pandas version on my machine is 1.4.4, so I do not need to update it. If you are using an older version, run the following command to update it.

# Update pandas version
pip3 install pandas --upgrade
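The version requirement can also be checked in code before calling compare(). This is a minimal sketch that assumes the version string starts with a numeric major and minor part, as in 1.4.4. The name supports_compare is hypothetical.

```python
import pandas as pd

# Take the first two components of the version string, e.g. "1.4.4" -> (1, 4)
major, minor = (int(part) for part in pd.__version__.split(".")[:2])

# compare() was added in pandas 1.1.0
supports_compare = (major, minor) >= (1, 1)
print(pd.__version__, supports_compare)
```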

Now let's see how to compare the dataframes row-wise in the example below:

df1.compare(df2, align_axis=0)

This code compares df1 and df2 using the compare() method with the align_axis parameter set to 0. The same differences are found as before, but the values from the two dataframes are now stacked in alternating rows labelled self and other, rather than placed side by side in columns. In other words, align_axis determines whether the self and other values are aligned along rows or along columns.
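To make the row-wise layout concrete, the sketch below rebuilds the same two dataframes and reads the Bag prices from the stacked result. The names stacked, bag_self, and bag_other are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Items': ['Bag', 'Shoes', 'Ring', 'Jacket'],
    'Price': [100, 50, 50, 70],
    'Items_Sold': [10, 300, 505, 90]})
df2 = pd.DataFrame({
    'Items': ['Bag', 'Shoes', 'Ring', 'Jacket'],
    'Price': [10, 70, 55, 90],
    'Items_Sold': [5, 110, 5, 90]})

stacked = df1.compare(df2, align_axis=0)

# Each original row now appears twice: once as 'self' and once as 'other'
print(stacked.index.tolist())

# Rows are addressed with (original row label, 'self' or 'other')
bag_self = stacked.loc[(0, 'self'), 'Price']
bag_other = stacked.loc[(0, 'other'), 'Price']
print(bag_self, bag_other)
```

With align_axis=0 the column headers stay flat (Price, Items_Sold), which can be easier to read when the dataframes have many columns.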

Let's set the keep_shape parameter to True and see how it affects the resulting dataframe.

df1.compare(df2, keep_shape=True)

In the above code snippet, the keep_shape parameter is set to True, which ensures that all rows and columns of the input dataframes are kept in the output. As a result, the Items column also appears in the result, filled with NaN because its values are identical in both dataframes. By default, the keep_shape parameter is set to False, which means that the resulting dataframe contains only the rows and columns that have at least one difference.
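The effect of keep_shape on the size of the result can be verified with a short sketch that uses the same data. The names trimmed and full are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Items': ['Bag', 'Shoes', 'Ring', 'Jacket'],
    'Price': [100, 50, 50, 70],
    'Items_Sold': [10, 300, 505, 90]})
df2 = pd.DataFrame({
    'Items': ['Bag', 'Shoes', 'Ring', 'Jacket'],
    'Price': [10, 70, 55, 90],
    'Items_Sold': [5, 110, 5, 90]})

trimmed = df1.compare(df2)                 # only columns with differences
full = df1.compare(df2, keep_shape=True)   # all original columns kept

# Each kept column contributes a 'self' and an 'other' sub-column
print(trimmed.shape, full.shape)

# Items is identical in both dataframes, so it holds only NaN
items_all_nan = full[('Items', 'self')].isna().all()
print(items_all_nan)
```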

Now, set the keep_equal parameter to True:

df1.compare(df2, keep_equal=True)

In the above code, the keep_equal parameter is set to True, so values that are equal in the two dataframes are shown as they are instead of being replaced with NaN. For example, the Items_Sold value for the Jacket item now appears as 90 in both the self and other columns. The Items column is still left out, because keep_shape remains False and that column contains no differences. By default, keep_equal is False and equal values are shown as NaN.
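The two options can also be combined. The sketch below, again on the same data, shows keep_equal on its own and together with keep_shape. The names with_equal and everything are chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Items': ['Bag', 'Shoes', 'Ring', 'Jacket'],
    'Price': [100, 50, 50, 70],
    'Items_Sold': [10, 300, 505, 90]})
df2 = pd.DataFrame({
    'Items': ['Bag', 'Shoes', 'Ring', 'Jacket'],
    'Price': [10, 70, 55, 90],
    'Items_Sold': [5, 110, 5, 90]})

# Equal values are shown instead of NaN; unchanged columns are still dropped
with_equal = df1.compare(df2, keep_equal=True)
print(with_equal.loc[3, ('Items_Sold', 'self')],
      with_equal.loc[3, ('Items_Sold', 'other')])

# Keeping the shape as well yields a full side-by-side view of both dataframes
everything = df1.compare(df2, keep_shape=True, keep_equal=True)
print(everything)
```

The combined form is useful when a complete side-by-side view is wanted, at the cost of no longer highlighting where the differences are.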

Conclusion

In conclusion, comparing two dataframes in Python can be a very useful technique when analyzing data. The compare() method of the Pandas library allows us to easily identify the differences between two dataframes and create a new dataframe that displays these differences in a clear and concise manner. By using the various parameters of the compare() method, such as keep_shape, keep_equal, and align_axis, we can customize the output to fit our specific needs. 

Overall, the compare() method is a powerful tool for data analysis that can save time and effort in identifying differences between two dataframes, which helps in making data-driven decisions.
