DataComPy and the Fugue project: achieving remarkable success

Reaching a major milestone in collaboration with the Fugue project.

Since the last time we wrote about DataComPy in October of 2021, a lot has changed. According to pepy.tech, which tracks downloads from the Python Package Index (PyPI), the package has been downloaded over 12 million times. DataComPy was also granted critical status on PyPI due to its large number of downloads, as outlined in the PyPI 2FA Security Key Giveaway post. This is a huge milestone and a testament to the applicability of a simple yet well-defined tool, making it a breeze to understand detailed differences between two Pandas or Spark DataFrames. More importantly, our decision to open source the package has been emphatically validated by the community.

The Fugue project

During PyData Seattle 2023, I had the opportunity to connect with the maintainers of Fugue, a project that defines an abstraction layer so users can scale their native Python code to work against distributed data types like Spark or Dask.

After learning more about the project it became evident that DataComPy would benefit from adopting Fugue; with help from Han and Kevin, the maintainers of Fugue, we identified two main improvements that we could make to DataComPy:

  1. Extending the functionality to the backends that Fugue supports (Spark, Dask, Ray, Polars, DuckDB, Arrow, etc.)

  2. Comparison across dataset types (e.g., a Pandas DataFrame vs. a Spark DataFrame)

Comparing data sets: DataComPy Python and Fugue synergy

Usage is very similar to the existing Pandas experience. The only difference is that, unlike the Pandas workflow, there is no instantiation of the Compare class:

    from io import StringIO
    import pandas as pd
    import datacompy

    data1 = """acct_id,dollar_amt,name,float_fld,date_fld
    10000001234,123.45,George Maharis,14530.1555,2017-01-01
    10000001235,0.45,Michael Bluth,1,2017-01-01
    10000001236,1345,George Bluth,,2017-01-01
    10000001237,123456,Bob Loblaw,345.12,2017-01-01
    10000001239,1.05,Lucille Bluth,,2017-01-01
    """

    data2 = """acct_id,dollar_amt,name,float_fld
    10000001234,123.4,George Michael Bluth,14530.155
    10000001235,0.45,Michael Bluth,
    10000001236,1345,George Bluth,1
    10000001237,123456,Robert Loblaw,345.12
    10000001238,1.05,Loose Seal Bluth,111
    """

    df1 = pd.read_csv(StringIO(data1))
    df2 = pd.read_csv(StringIO(data2))

    datacompy.is_match(
        df1,
        df2,
        join_columns='acct_id',  # You can also specify a list of columns
        abs_tol=0,  # Optional, defaults to 0
        rel_tol=0,  # Optional, defaults to 0
        df1_name='Original',  # Optional, defaults to 'df1'
        df2_name='New'  # Optional, defaults to 'df2'
    )
    # False

    # This method prints a human-readable report summarizing and sampling differences
    print(datacompy.report(
        df1,
        df2,
        join_columns='acct_id',  # You can also specify a list of columns
        abs_tol=0,  # Optional, defaults to 0
        rel_tol=0,  # Optional, defaults to 0
        df1_name='Original',  # Optional, defaults to 'df1'
        df2_name='New'  # Optional, defaults to 'df2'
    ))

DataComPy uses Fugue to partition the two DataFrames into chunks and compare each chunk in parallel using the Pandas-based Compare. The comparison results are then aggregated to produce the final result. Unlike the join operation used in Compare and SparkCompare, the Fugue version uses cogroup -> map-like semantics (not exactly the same, since Fugue adopts a coarser version to achieve great performance), which guarantees a full data comparison with results consistent with the Pandas-based Compare.
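As a rough illustration of that partition-and-compare idea, here is a simplified sketch in plain Pandas (not DataComPy's actual implementation): rows are routed to chunks by hashing the join key, each chunk pair is compared independently, and the per-chunk results are aggregated.

```python
import pandas as pd

def chunked_match(df1: pd.DataFrame, df2: pd.DataFrame,
                  join_col: str, n_chunks: int = 4) -> bool:
    """Hash-partition both frames on the join key, compare chunk by chunk,
    then aggregate. Assumes both frames share the same columns."""
    results = []
    for i in range(n_chunks):
        # Rows with the same join key always hash to the same chunk,
        # so a chunk-local comparison is also globally correct.
        c1 = df1[df1[join_col].apply(hash) % n_chunks == i]
        c2 = df2[df2[join_col].apply(hash) % n_chunks == i]
        c1 = c1.sort_values(join_col).reset_index(drop=True)
        c2 = c2.sort_values(join_col).reset_index(drop=True)
        results.append(c1.equals(c2))  # stand-in for the per-chunk Compare
    return all(results)                # aggregate the per-chunk results

df_a = pd.DataFrame({'acct_id': [1, 2, 3, 4], 'amt': [10.0, 20.0, 30.0, 40.0]})
df_b = df_a.copy()
print(chunked_match(df_a, df_b, 'acct_id'))  # True
df_b.loc[0, 'amt'] = 99.0
print(chunked_match(df_a, df_b, 'acct_id'))  # False -- one chunk now differs
```

In the real implementation each chunk comparison runs as a Fugue task on the distributed backend, so the work scales out instead of running in a local loop.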

Cross-type DataFrames comparison

To compare DataFrames of different backends, just pass in the DataFrames you have: Pandas DataFrames, DuckDB relations, Polars DataFrames, Arrow tables, Spark DataFrames, Dask DataFrames or Ray datasets. For example, to compare a Pandas DataFrame with a Spark DataFrame:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark_df2 = spark.createDataFrame(df2)
    datacompy.is_match(
        df1,
        spark_df2,
        join_columns='acct_id',
    )

To use a specific backend, you need to have the corresponding library installed. For example, if you want to compare Ray datasets, you must install the ray extra:

    pip install datacompy[ray]

Implementing this type of functionality natively within DataComPy would have been a large effort, but Fugue gives us this capability for free! Not to mention the new functionality we will receive as Fugue continues to mature! This, in my mind, is the true power of open source. Collaboration like this can unlock opportunities where they did not exist before!

The future of DataComPy: User-driven enhancements and Fugue integration

The next objective is to ensure we have method parity between the Fugue functionality and our core library (see issue #214). We also want to investigate whether or not we can deprecate our native Spark functionality in favor of the Fugue-based alternative.

Ultimately we want this to be a package for users, and its direction will be heavily influenced by user input. If you have thoughts, suggestions or contributions, we highly encourage you to participate. You can find the repository on GitHub, with instructions on how to contribute and open discussions.

The strategic collaboration with the Fugue project has propelled DataComPy to new heights, introducing enhanced functionality and opportunities for users. As DataComPy continues to evolve, it exemplifies the power of open source collaboration and stands ready to meet the data analysis needs of a dynamic and ever-changing landscape.

Finally, a huge thank you to all the contributors who have helped us reach 12 million downloads. Here's to the next 12 million!


Faisal Dosani, Director, Data Science

Faisal Dosani joined Capital One in 2014 and is currently a Director for the Canada Card Data Science team focusing on our machine learning tooling and infrastructure. Faisal spends most of his time collaborating with colleagues across the enterprise to ensure our tooling and data is well managed, robust, and well thought out. Empowering users to solve business problems through effective software and design is his north star. As a side he is a huge Python fan, and enjoys 3D printing.
