Detecting sensitive information in data with Data Profiler
Simplifying detection with just a few lines of code using Data Profiler—an open source project from Capital One
Companies rely on customer data to improve the quality of their products and their customer service. This makes it essential to protect private customer information such as Personally Identifiable Information (PII) and Non-Public Personal Information (NPI). Such information may include, but is not limited to, customer names, physical addresses, IP addresses, bank account numbers, and credit card numbers. This sensitive information lives in many data sources, both unstructured (e.g., phone transcripts from customers) and structured (e.g., tabular transaction data). To detect sensitive information across these sources, my team at Capital One has developed a quick and easy-to-use open source library, Data Profiler, that reads data from different file types and identifies the sensitive information within it.
Let’s walk through several examples of how we use the Data Profiler library on some public datasets.
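If you want to follow along, Data Profiler is available on PyPI (the package is named DataProfiler) and can be installed with pip:

pip install DataProfiler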
Example of prediction on structured data using Data Profiler
First, let us look at a public, tabular dataset on internet traffic using the data reader class of the Data Profiler.
import dataprofiler as dp
import json
import pandas as pd
# load the dataset; dp.Data infers the file type automatically
data = dp.Data("data/structured/aws_honeypot_marx_geo.csv")

# the parsed data is available as a pandas DataFrame
df_data = data.data
df_data.head()
The returned data is a Pandas dataframe, which contains sixteen columns, each of which has a different data label. To predict these data labels, we will use the data labeler from Data Profiler.
Note that, besides the data labeler, Data Profiler also identifies data types and computes statistics for each column. For the scope of this post, we disable those options and keep only the data labeling functionality.
# set option to run only data labeler
profile_options = dp.ProfilerOptions()
profile_options.set({"text.is_enabled": False,
                     "int.is_enabled": False,
                     "float.is_enabled": False,
                     "order.is_enabled": False,
                     "category.is_enabled": False,
                     "datetime.is_enabled": False})

profile = dp.Profiler(data, profiler_options=profile_options)
That’s it! After running the Profiler, we get the results via profile.report(). By default, the report contains a lot of detailed information about the whole dataset. For this example, we only need the predicted label for each column; the script below extracts the predictions at the column level.
# get the predictions from Data Profiler
results = profile.report()
columns, predictions = [], []
for col in results['data_stats']:
    columns.append(col)
    predictions.append(results['data_stats'][col]['data_label'])

df_results = pd.DataFrame({'Column': columns, 'Prediction': predictions})
print(df_results)
The results show that Data Profiler is able to detect sensitive information such as datetime, ipv4, or address. It predicts None for the empty column ‘owner’ and BACKGROUND for columns with miscellaneous information. With just a few lines of code, we have identified sensitive information in our dataset, which can then help companies protect their data efficiently, for example by applying different tokenization techniques to different types of sensitive information, as sketched below.
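For instance, the column-level predictions above can drive a simple masking pass before the data is shared downstream. The snippet below is a minimal sketch and not part of the Data Profiler API; the set of labels treated as sensitive and the placeholder format are assumptions for illustration.

# illustrative sketch only (not part of Data Profiler): mask columns whose
# predicted label is treated as sensitive; the label set is an assumption
sensitive_labels = {'ipv4', 'address', 'datetime'}

df_masked = df_data.copy()
for _, row in df_results.iterrows():
    if row['Prediction'] in sensitive_labels:
        # replace every value in the sensitive column with a placeholder token
        df_masked[row['Column']] = '<REDACTED:{}>'.format(row['Prediction'])

print(df_masked.head())

In practice, a tokenization service would replace the placeholder with reversible or format-preserving tokens, but the detection step that feeds it is the same.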
Example of prediction on unstructured data using Data Profiler
Besides structured data, Data Profiler can also detect sensitive information in unstructured data. Here we’ll use a sample email from the Enron email dataset for this demo. As above, we start by investigating the content of the given email sample.
# load data
data = dp.Data("data/emails/enron-sample")
print(data.data[0])
Message-ID: <11111111.1111111111111.JavaMail.evans@thyme>
Date: Fri, 10 Aug 2005 11:31:37 -0700 (PDT)
From: w..smith@company.com
To: john.smith@company.com
Subject: RE: ABC
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Smith, Mary W. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=SSMITH>
X-To: Smith, John </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSMITH>
X-cc:
X-bcc:
X-Folder: \SSMITH (Non-Privileged)\Sent Items
X-Origin: Smith-S
X-FileName: SSMITH (Non-Privileged).pst

All I ever saw was the e-mail from the office.

Mary

-----Original Message-----
From: Smith, John
Sent: Friday, August 10, 2005 13:07 PM
To: Smith, Mary W.
Subject: ABC

Have you heard any more regarding the ABC sale?  I guess that means that it's no big deal here, but you think they would have send something.

John Smith
123-456-7890
Here we have an email with a header, the main content, the original message, and a signature part at the end. With this unstructured text, we directly use the data labeler to predict the location of sensitive information therein.
# load the pre-trained data labeler for unstructured text
data_labeler = dp.DataLabeler(labeler_type='unstructured')

# predict an entity label for each character of the email
predictions = data_labeler.predict(data)
print(predictions['pred'])
[array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
20., 20., 21., 21., 6., 6., 6., 6., 7., 7., 7., 7., 7.,
...
15., 15., 15., 15., 15., 1., 16., 16., 16., 16., 16., 16., 16.,
16., 16., 16., 16., 16., 1.])]
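Each integer in this output is the index of an entity label for the corresponding character of the email. As a small sketch, assuming the labeler exposes a label_mapping attribute (a dict of entity name to integer index), the codes can be translated back into label names:

# sketch: translate character-level integer codes back into entity names,
# assuming data_labeler.label_mapping maps label name -> integer index
index_to_label = {v: k for k, v in data_labeler.label_mapping.items()}
char_codes = predictions['pred'][0]
char_labels = [index_to_label[int(code)] for code in char_codes]

# pair the first few characters of the email with their predicted labels
print(list(zip(data.data[0][:20], char_labels[:20])))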
Note that the unstructured variant of the data labeler works on raw text and, by default, predicts results at the character level. It also provides alternative representations, chosen by the user, for better interpretability. For example, the prediction results can be given at the word level following the standard NER (Named Entity Recognition) format, e.g., as used by spaCy. All we need to do is add a few options to the postprocessor of the data labeler.
# convert predictions to word format and NER format
# (start position, end position, label)
data_labeler.set_params(
    {'postprocessor': {'output_format': 'ner', 'use_word_level_argmax': True}}
)
predictions = data_labeler.predict(data)

# display results
for pred in predictions['pred'][0]:
    print('{}: {}'.format(data.data[0][pred[0]: pred[1]], pred[2]))
    print('--------------------------------------------------------')
evans@thyme>: EMAIL_ADDRESS
--------------------------------------------------------
Fri, 10: DATE
--------------------------------------------------------
Aug 2005 11: DATETIME
--------------------------------------------------------
smith@company.com: EMAIL_ADDRESS
--------------------------------------------------------
john.smith@company.com: EMAIL_ADDRESS
--------------------------------------------------------
7bit: QUANTITY
--------------------------------------------------------
Smith, Mary W: PERSON
--------------------------------------------------------
</O=ENRON/OU=NA/CN=RECIPIENTS/CN=SSMITH>: HASH_OR_KEY
--------------------------------------------------------
Smith, John: PERSON
--------------------------------------------------------
</O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSMITH>: HASH_OR_KEY
--------------------------------------------------------
-----Original: QUANTITY
--------------------------------------------------------
Smith, John: PERSON
--------------------------------------------------------
Friday, August 10, 2005 13: DATETIME
--------------------------------------------------------
Smith, Mary W.: PERSON
--------------------------------------------------------
John Smith: PERSON
--------------------------------------------------------
123-456-7890: PHONE_NUMBER
--------------------------------------------------------
Here Data Profiler identifies sensitive information such as datetimes, email addresses, person names, and a phone number within this sample email.
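Because the NER-format predictions carry character offsets, they can also drive a simple redaction pass over the raw text. The helper below is a minimal sketch and not part of the Data Profiler API; the set of labels to redact is an assumption for illustration.

# minimal sketch (not part of Data Profiler): blank out selected entity spans
labels_to_redact = {'EMAIL_ADDRESS', 'PHONE_NUMBER', 'PERSON'}

def redact(text, spans):
    # process spans from right to left so earlier character offsets stay valid
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        if label in labels_to_redact:
            text = text[:start] + '[' + label + ']' + text[end:]
    return text

print(redact(data.data[0], predictions['pred'][0]))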
Conclusion
The Data Profiler open source library is a great solution for companies looking to quickly detect sensitive information in their data lakes. It can scan both structured and unstructured data across different file types. We encourage readers to visit Data Profiler on GitHub for more information on the project and details on how to contribute.
References
[1] GitHub repo: https://github.com/capitalone/DataProfiler
[2] Sensitive Data Detection with High-throughput Neural Network Models for Financial Institutions: https://arxiv.org/pdf/2012.09597.pdf