Using DataComPy for comparing pandas and spark dataframes
DataComPy is an open-source python software developed by Capital One
DataComPy is an open source project by Capital One developed to compare Pandas and Spark dataframes. It can be used as a replacement for SAS' PROC COMPARE or as an alternative to Pandas.DataFrame.equals(Pandas.DataFrame
, providing the additional functionality of printing out stats and letting users adjust for match accuracy.
Building the DataComPy open source project
DataComPy was one of several software programs developed for internal use within Capital One. The purpose of building these features is to produce customer management risk models, where data validation and comparison are critical to ensuring integrity and confidence.
Specifically, as we were using this in models with numerous data sources, we needed to understand any data lineage nuances such as data transformations that had been performed along the way. Hence, building a similar capability to PROC COMPARE was essential to our use of Python.
The tool was completed for internal use in 2017 and after one year of use we started the process of open sourcing DataComPy. Since DataComPy's official release to the open-source community, it has almost 180 stars, 60 forks, and 14 distinct contributors. it has 268 stars, 91 forks, and 18 distinct contributors.
How DataComPy helps data scientists, analysts, and engineers with model builds
DataComPy’s goal is to provide a human-readable output describing differences between two dataframes in Pandas and Spark. It provides descriptive reporting at the column and row level outlining where columns and rows are identical, and where there may be differences. It tries to remain flexible by allowing users to provide tolerances of their matching criteria, custom joining, and several helper functions to enable users to easily compare intersections and unique items from a row and column perspective.
Data scientists, analysts, and engineers will find DataComPy a helpful tool in examining data for model builds. The tool is also helpful for ETL processing and general data comparison between various systems where transformations are being made.
Comparing data between sources can be finicky and cumbersome. We hope DataComPy will help ease the burden on users so they can focus on solving critical business problems.
DataComPy and Pandas overview
Comparing dataframes
With DataComPy and Pandas you can easily compare two dataframes using either joining columns or on an existing index.
compare = datacompy.Compare(
df1,
df2,
join_columns='acct_id', #You can also specify a list of columns
abs_tol=0.0001,
rel_tol=0,
df1_name='original',
df2_name='new')
# OR
compare = datacompy.Compare(df1, df2, join_columns=['acct_id', 'name'])
# OR
compare = datacompy.Compare(df1, df2, on_index=True)
Reporting summary
The report() method can then be used to provide a human readable summary of the differences.
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
DataFrame Columns Rows
0 original 5 6
1 new 4 5
Column Summary
--------------
Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0
Row Summary
-----------
Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0
Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0
Column Comparison
-----------------
Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 7
Columns with Unequal Values or Types
------------------------------------
Column original dtype new dtype # Unequal Max Diff # Null Diff
0 dollar_amt float64 float64 1 0.0500 0
1 float_fld float64 float64 4 0.0005 3
2 name object object 2 0.0000 0
Sample Rows with Unequal Values
-------------------------------
acct_id dollar_amt (original) dollar_amt (new)
0 10000001234 123.45 123.4
acct_id float_fld (original) float_fld (new)
0 10000001234 14530.1555 14530.155
5 10000001238 NaN 111.000
2 10000001236 NaN 1.000
1 10000001235 1.0000 NaN
acct_id name (original) name (new)
0 10000001234 George Maharis George Michael Bluth
3 10000001237 Bob Loblaw Robert Loblaw
Sample Rows Only in original (First 10 Columns)
-----------------------------------------------
acct_id dollar_amt name float_fld date_fld
4 10000001238 1.05 Lucille Bluth NaN 2017-01-01
Quick overview of DataComPy and Spark
Comparing Spark dataframes using DataComPy is very similar to that in Pandas. Compare() is replaced with SparkCompare() and some additional useful features.
comparison = datacompy.SparkCompare(spark, base_df, compare_df,
join_columns = [('acct_id', 'ACCOUNT_IDENTIFIER'), ('acct_sfx_num', 'SUFFIX_NUMBER')],
column_mapping = [('clsd_reas_cd', 'AM00_STATC_CLOSED'),
('open_dt', 'AM00_DATE_ACCOUNT_OPEN'),
('tbal_cd', 'AM0B_FC_TBAL')],
known_differences= [
{'name': 'Left-padded, four-digit numeric code',
'types': ['tinyint', 'smallint', 'int', 'bigint', 'float', 'double', 'decimal'],
'transformation': "lpad(cast({input} AS bigint), 4, '0')"},
{'name': 'Null to *2',
'types': ['string'],
'transformation': "case when {input} is null then '*2' else {input} end"},
{'name': 'Julian date -> date',
'types': ['bigint'],
'transformation': "to_date(cast(unix_timestamp(cast({input} AS string), 'yyyyDDD') AS timestamp))"}
])
Column mapping
You can provide column mappings if the names differ between the datasets and also call out known differences and apply transformations on the compare side to help cluster these expected differences in the report output.
The corresponding output would look like:
****** Column Summary ******
Number of columns in common with matching schemas: 3
Number of columns in common with schema differences: 2
Number of columns in base but not compare: 0
Number of columns in compare but not base: 0
****** Schema Differences ******
Base Column Name Compare Column Name Base Dtype Compare Dtype
---------------- ---------------------- ------------- -------------
open_dt AM00_DATE_ACCOUNT_OPEN date bigint
tbal_cd AM0B_FC_TBAL string double
****** Row Summary ******
Number of rows in common: 5
Number of rows in base but not compare: 0
Number of rows in compare but not base: 0
Number of duplicate rows found in base: 0
Number of duplicate rows found in compare: 0
****** Row Comparison ******
Number of rows with some columns unequal: 5
Number of rows with all columns equal: 0
****** Column Comparison ******
Number of columns compared with unexpected differences in some values: 1
Number of columns compared with all values equal but known differences found: 2
Number of columns compared with all values completely equal: 0
****** Columns with Unequal Values ******
Base Column Name Compare Column Name Base Dtype Compare Dtype # Matches # Known Diffs # Mismatches
---------------- ------------------- ------------- ------------- --------- ------------- ------------
clsd_reas_cd AM00_STATC_CLOSED string string 2 2 1
open_dt AM00_DATE_ACCOUNT_OPEN date bigint 0 5 0
tbal_cd AM0B_FC_TBAL string double 0 5 0
The DataComPy roadmap
Our roadmap currently includes some refactoring of the Pandas and Spark codebase to help unify the experience, as well as sourcing more user feedback to help enhance the current capabilities. The full DataComPy roadmap can be seen here.
We highly encourage new contributors and feedback is always welcome by opening a new issue! Learn more about Capital One’s “open source first” philosophy and how we actively use, create, and contribute to open source software in our work.