The Book of Gehn

Diff Between Data Frames for Testing

October 1, 2022

Let’s say that we want to compare two Pandas’ dataframes for unit testing.

One is the expected dataframe, crafted by us and it will be the source of truth for the test.

The other is the obtained dataframe that is the result of the experiment that we want to check.

Doing a naive comparison will not work: first we may want to tolerate some minor differences due computation imprecision; and second, and most important, we don’t want to know just if the dataframes are different or not

We want to know where are the differences.

Knowing exactly what is different makes the debugging much easier – trying to figure out which column in which row there is a different by hand is not fun.

Comparison row-by-row (the naive approach)

Pandas has a beautiful assert_frame_equal function for unit testing: it compares the dataframes row by row and shows which column has some differences and how many there are.

Considere the following dataframes:

>>> import pandas as pd

>>> expected_df = pd.DataFrame({
...     'name':    ['alice',  'bob',   'alice', 'charlie'],
...     'subject': [ 'math', 'math', 'physics',    'math'],
...     'grade':   [      5,      4,         4,         5],
... })

>>> obtained_df = pd.DataFrame({
...     'name':    [  'alice',   'bob',  'alice', 'charlie'],
...     'subject': ['physics',  'math',   'math',    'math'],
...     'grade':   [        4,       4,        3,         2],
... })

Now let’s use assert_frame_equal:

>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(
...     obtained_df,
...     expected_df,
...     check_exact=True,
...     check_like=True,
...     check_dtype=False
... )
Traceback (most recent call last):
<...>
AssertionError: DataFrame.iloc[:, 1] (column name="subject") are different
<...>
DataFrame.iloc[:, 1] (column name="subject") values are different (50.0 %)
[index]: [0, 1, 2, 3]
[left]:  [physics, math, math, math]
[right]: [math, math, physics, math]

So assert_frame_equal detected that half of the rows are different on the "subject" column.

With a closer inspection on the left and right series we see that the first and third values are different (physics/math and math/physics).

Now if you see dataframes obtained_df and expected_df you will see that it is more likely to interpret this difference as an reorder of the rows and not as a real discrepancy (unless you are really wanting to have a specific order).

Make the comparison order-insensitive

assert_frame_equal alone is not enough as it is sensible to reorders.

The reset_index is required because sort_values preserves the index: each row remember its original position. assert_frame_equal checks by index so it will revert the sort.

Let’s try to sort the rows before the check then:

>>> sort_by = ["subject"]

>>> ob = obtained_df.sort_values(sort_by).reset_index(drop=True)
>>> ex = expected_df.sort_values(sort_by).reset_index(drop=True)

>>> assert_frame_equal(
...     ob,
...     ex,
...     check_exact=True,
...     check_like=True,
...     check_dtype=False
... )
Traceback<...>
AssertionError: DataFrame.iloc[:, 0] (column name="name") are different
<...>
DataFrame.iloc[:, 0] (column name="name") values are different (50.0 %)
[index]: [0, 1, 2, 3]
[left]:  [bob, alice, charlie, alice]
[right]: [alice, bob, charlie, alice]

Mmm… more noise… perhaps sorting by "name" and "subject"?

>>> sort_by = ["name", "subject"]

>>> ob = obtained_df.sort_values(sort_by).reset_index(drop=True)
>>> ex = expected_df.sort_values(sort_by).reset_index(drop=True)

>>> assert_frame_equal(
...     ob,
...     ex,
...     check_exact=True,
...     check_like=True,
...     check_dtype=False
... )
Traceback<...>
AssertionError: DataFrame.iloc[:, 2] (column name="grade") are different
<...>
DataFrame.iloc[:, 2] (column name="grade") values are different (50.0 %)
[index]: [0, 1, 2, 3]
[left]:  [3, 4, 4, 2]
[right]: [5, 4, 4, 5]

Better, I guess.

Sorting solves the reorder problem but it is still hard to interpret the results.

Even with this example of just 4 rows, there are too many rows!

More over, assert_frame_equal is not helpful when the dataframes has a different size, it just says that and that it :|

>>> ob.loc[len(ob.index)] = ["alice", "algebra", 3]

>>> assert_frame_equal(
...     ob,
...     ex,
...     check_exact=True,
...     check_like=True,
...     check_dtype=False
... )
<...>
AssertionError: DataFrame are different
<...>
DataFrame shape mismatch
[left]:  (5, 3)
[right]: (4, 3)

Comparing group-by-group

Instead of comparing row by row we could group the rows under some key and compare then each row within each group.

>>> group_by = ["name", "subject"]

>>> obtained_g = obtained_df.groupby(group_by, dropna=False)
>>> expected_g = expected_df.groupby(group_by, dropna=False)

The intuition is that we will get fewer differences between groups and possibly more meaningful.

Here we get which groups do we have in common:

>>> obtained_group_names = {key for key, item in obtained_g}
>>> expected_group_names = {key for key, item in expected_g}

>>> common_group_names = obtained_group_names & expected_group_names
>>> sorted(common_group_names)
[('alice', 'math'), ('alice', 'physics'), ('bob', 'math'), ('charlie', 'math')]

Now for each group we compare them using assert_frame_equal:

>>> name = ('alice', 'math')
>>> ob = pd.DataFrame(obtained_g.get_group(name).reset_index(drop=True))
>>> ex = pd.DataFrame(expected_g.get_group(name).reset_index(drop=True))

>>> assert_frame_equal(
...     ob,
...     ex,
...     check_exact=True,
...     check_like=True,
...     check_dtype=False
... )
Traceback<...>
AssertionError: DataFrame.iloc[:, 2] (column name="grade") are different
<...>
DataFrame.iloc[:, 2] (column name="grade") values are different (100.0 %)
[index]: [0]
[left]:  [3]
[right]: [5]

Now it is clear that Alice got a grade of 3 in her Math class but it was expected to get a 5.

pd.concat will put the rows of one dataframe after the rows of the other preserving their original indexes.

Sorting by index then will make the first row of the expected appear after the first row of the obtained, the second row of the expected appear after the second row of the obtained and so on.

We could pretty print the group interleaving the rows of the obtained and expected dataframes so we can compare them line by line:

>>> ob["DF"] = "obtained"
>>> ex["DF"] = "expected"

>>> merged = pd.concat([ob, ex]).sort_index()
>>> merged
    name subject  grade        DF
0  alice    math      3  obtained
0  alice    math      5  expected

Much better!

Putting all this together

The plan is:

Updated version:  code

This is the code:

>>> def assert_df_equal(obtained_df, expected_df, group_by=[], sort_by=[], check_exact=True, rtol=1e-5, atol=1e-8, to_string_kwargs={}):
...     diffs = []
...
...     # See the whole dataframe as the single group in case of
...     # the user don't wanting to group the rows
...     singleton_group = False
...     if not group_by:
...         group_by = lambda *args: singleton_group
...         singleton_group = True
...
...     obtained_g = obtained_df.groupby(group_by, dropna=False)
...     expected_g = expected_df.groupby(group_by, dropna=False)
...
...     obtained_group_names = {key for key, item in obtained_g}
...     expected_group_names = {key for key, item in expected_g}
...
...     common_group_names = obtained_group_names & expected_group_names
...     unexpected_group_names = obtained_group_names - expected_group_names
...     missing_group_names = expected_group_names - obtained_group_names
...
...     # Check if we have unexpected groups or missing groups.
...     # These are groups that cannot be compared with any other row
...     # in the opposite dataframes and therefore they are straight
...     # differences by definition
...     for name in sorted(unexpected_group_names):
...         ob = pd.DataFrame(obtained_g.get_group(name))
...         if sort_by:
...             ob.sort_values(sort_by, inplace=True)
...
...         ob["DF"] = "obtained"
...
...         msg = f"Unexpected (not expected) group {name!r}\n\n{ob.to_string(**to_string_kwargs)}"
...         diffs.append(msg)
...
...     for name in sorted(missing_group_names):
...         ex = pd.DataFrame(expected_g.get_group(name))
...         if sort_by:
...             ex.sort_values(sort_by, inplace=True)
...
...         ex["DF"] = "expected"
...
...         msg = f"Missing (not obtained) group {name!r}\n\n{ex.to_string(**to_string_kwargs)}"
...         diffs.append(msg)
...
...     # Compare group by group, sorting them if sort_by is given.
...     for name in sorted(common_group_names):
...         ob = pd.DataFrame(obtained_g.get_group(name))
...         ex = pd.DataFrame(expected_g.get_group(name))
...
...         if sort_by:
...             ob.sort_values(sort_by, inplace=True)
...             ex.sort_values(sort_by, inplace=True)
...
...         ob.reset_index(drop=True, inplace=True)
...         ex.reset_index(drop=True, inplace=True)
...
...         try:
...             assert_frame_equal(ob, ex, check_exact=check_exact, rtol=rtol, atol=atol, check_like=True, check_dtype=False)
...         except Exception as err:
...             ob["DF"] = "obtained"
...             ex["DF"] = "expected"
...
...             merged = pd.concat([ob, ex]).sort_index()
...
...             if singleton_group:
...                 msg = f"{err}"
...             else:
...                 msg = f"For group {name!r}: {err}"
...
...             msg = f"{msg}\n\n{merged.to_string(**to_string_kwargs)}"
...             diffs.append(msg)
...
...     if diffs:
...         msg = f"Found {len(diffs)} difference{'s' if len(diffs) > 1 else ''}.\nDetails follows:\n\n"
...         raise AssertionError(msg + "\n\n".join(diffs))

Examples

Let’s begin with the same dataframes as before:

>>> expected_df = pd.DataFrame({
...     'name':    ['alice',  'bob',   'alice', 'charlie'],
...     'subject': [ 'math', 'math', 'physics',    'math'],
...     'grade':   [      5,      4,         4,         5],
... })

>>> obtained_df = pd.DataFrame({
...     'name':    [  'alice',   'bob',  'alice', 'charlie'],
...     'subject': ['physics',  'math',   'math',    'math'],
...     'grade':   [        4,       4,        3,         2],
... })

Calling assert_df_equal without any group by or sort by does not gives you much. As before a straight call to assert_frame_equal says that there is a difference on "subject" when we already know that it is not.

>>> assert_df_equal(obtained_df, expected_df)
<...>
AssertionError: Found 1 difference.
Details follows:
<...>
DataFrame.iloc[:, 1] (column name="subject") are different
<...>
DataFrame.iloc[:, 1] (column name="subject") values are different (50.0 %)
[index]: [0, 1, 2, 3]
[left]:  [physics, math, math, math]
[right]: [math, math, physics, math]
<...>
      name  subject  grade        DF
0    alice  physics      4  obtained
0    alice     math      5  expected
1      bob     math      4  obtained
1      bob     math      4  expected
2    alice     math      3  obtained
2    alice  physics      4  expected
3  charlie     math      2  obtained
3  charlie     math      5  expected

Sorting quickly makes the real difference visible:

>>> assert_df_equal(obtained_df, expected_df, sort_by=['name', 'subject'])
<...>
AssertionError: Found 1 difference.
Details follows:
<...>
DataFrame.iloc[:, 2] (column name="grade") are different
<...>
DataFrame.iloc[:, 2] (column name="grade") values are different (50.0 %)
[index]: [0, 1, 2, 3]
[left]:  [3, 4, 4, 2]
[right]: [5, 4, 4, 5]
<...>
      name  subject  grade        DF
0    alice     math      3  obtained
0    alice     math      5  expected
1    alice  physics      4  obtained
1    alice  physics      4  expected
2      bob     math      4  obtained
2      bob     math      4  expected
3  charlie     math      2  obtained
3  charlie     math      5  expected

But the dataframes are too large and it is not easy to see exactly where is the problem.

Moreover adding an extra row breaks the comparison.

>>> obtained_df.loc[len(obtained_df.index)] = ["alice", "algebra", 3]
>>> assert_df_equal(obtained_df, expected_df, sort_by=['name', 'subject'])
<...>
AssertionError: Found 1 difference.
Details follows:
<...>
DataFrame are different
<...>
DataFrame shape mismatch
[left]:  (5, 3)
[right]: (4, 3)
<...>

group_by to rescue!

Partitioning the dataframes in smaller groups makes much easier the debugging and much more robust against shape mismatches

>>> assert_df_equal(obtained_df, expected_df, group_by=['name', 'subject'])
Traceback (most recent call last):
<...>
AssertionError: Found 3 differences.
Details follows:
<...>
Unexpected (not expected) group ('alice', 'algebra')
<...>
    name  subject  grade        DF
4  alice  algebra      3  obtained
<...>
For group ('alice', 'math'): DataFrame.iloc[:, 2] (column name="grade") are different
<...>
DataFrame.iloc[:, 2] (column name="grade") values are different (100.0 %)
[index]: [0]
[left]:  [3]
[right]: [5]
<...>
    name subject  grade        DF
0  alice    math      3  obtained
0  alice    math      5  expected
<...>
For group ('charlie', 'math'): DataFrame.iloc[:, 2] (column name="grade") are different
<...>
DataFrame.iloc[:, 2] (column name="grade") values are different (100.0 %)
[index]: [0]
[left]:  [2]
[right]: [5]
<...>
      name subject  grade        DF
0  charlie    math      2  obtained
0  charlie    math      5  expected

It looks complicated but now we know exactly where are the differences:

Diff Between Data Frames for Testing - October 1, 2022 - Martin Di Paola