Accelerating AI with great data
Discover how scalable data management and governance help Capital One harness AI for innovation and impactful outcomes.
Presented at AWS re:Invent 2024 by Marty Andolino, VP, Software Engineering and Kajal Wood, Senior Director, Software Engineering at Capital One.
AI has evolved from a futuristic concept to a driving force of innovation. But at the heart of AI lies the same thing that's powered industry for the last several decades: data.
At AWS re:Invent 2024, we presented how we think about producing high-quality data that is well governed and easy to find, capture, understand and use. That data ultimately powers our ability to accelerate AI use cases that help solve challenging customer problems and drive real business value.
The symbiotic relationship between AI and data
AI and data share a symbiotic relationship. On the one hand, data is the core of AI, enabling it to learn, adapt and make intelligent decisions. On the other hand, AI helps unlock the value of data by identifying patterns, finding insights and creating better customer experiences through deep interactions.
By collecting better and more relevant data from customer experiences, we can train our AI and machine learning (ML) models more effectively. These models help enhance our customer experiences and increase engagement, which in turn generates more data we can use in the future.
Navigating the complex world of data
The amount of data in the world is growing at an accelerating rate; by the end of 2024, it is expected to reach around 149 zettabytes. But it isn’t just the amount of data that’s increasing; it’s the complexity of that data. Many businesses face the dual challenge of exponentially growing datasets and a significant shift toward unstructured data. On top of that, customers today expect real-time experiences, and those experiences require extremely low-latency, reliable data ecosystems.
As a result, overcoming challenges such as data integrity, real-time data access and the underutilization of available data requires strong data management. That means:
- implementing comprehensive data governance frameworks,
- investing in highly scalable infrastructure, and
- developing new approaches to data integration and data analytics.
The end goal is not just to manage the rapidly increasing volume, variety and velocity of data, but also to find meaningful insights for business value and enhance customer experiences.
The data lifecycle: A holistic approach
The data lifecycle is also becoming increasingly complex, and various stages of the data lifecycle have unique challenges. From creation and storage to usage and archival, each stage requires a different strategy. To ensure data quality and accessibility, we emphasize the importance of proper data registration, inventory and control. This holistic approach is essential for building a strong data foundation.
Keeping data well governed across the lifecycle gives data users access to the most accurate, relevant and timely data available. This both improves the performance of our models and builds trust and transparency into AI-driven decisions.
Key principles for great data
As we know, the AI revolution is further raising the stakes on how companies manage and use data. Based on our experience, we’ve learned that great data can be produced and consumed effectively by applying three key principles:
- Self-service data ecosystem: Empower data producers and consumers to register, discover and manage data through a common self-service data portal and well-defined process.
- Automation: Automate controls, policies and procedures to streamline the data lifecycle and improve security and governance.
- Scale: Leverage platforms, tools and mechanisms to ensure data consistency and scalability of data practices.
These principles guide how we think about data as a valuable asset and ensure it is managed appropriately. Building a self-service data ecosystem empowers data users across the organization to efficiently access and utilize data. Automation plays a key role in streamlining data processes, reducing manual effort and minimizing the risk of errors. And by keeping a focus on scalability, our data infrastructure can handle the rapidly increasing volume and complexity of data.
The data producer experience
Data producers are core to the data ecosystem. To ensure a seamless experience for our data producers, we provide them with a self-service data portal that enables easy onboarding of data and focuses on intuitiveness, intelligence, transparency and efficiency.
The data onboarding process involves several steps:
- Registering the data and defining its metadata
- Setting privacy and security indicators to ensure proper data protection and governance
- Designing and approving a structured schema for the dataset
- Provisioning access to storage environments and managing credentials
This process ensures data producers can easily onboard new datasets, reducing the time and effort required to make data available for consumption. By automating many of the steps involved in data onboarding, we’re able to ensure consistency, reduce errors and free up data producers to focus on more strategic tasks.
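To make these steps concrete, here is a minimal sketch of what a programmatic registration request might look like. The portal endpoint, payload fields and overall API shape are hypothetical illustrations of the onboarding flow described above, not Capital One's actual self-service portal.

```python
# A hypothetical dataset registration payload mirroring the onboarding steps:
# metadata, privacy/security indicators, schema and storage provisioning.
import json
import urllib.request

registration = {
    "dataset_name": "card_transactions_daily",       # hypothetical dataset
    "owner": "payments-data-team",
    "description": "Daily card transaction summaries for analytics",
    # Privacy and security indicators drive downstream protection controls.
    "privacy_level": "confidential",
    "contains_pii": True,
    # A structured schema is defined and approved before any data lands.
    "schema": [
        {"name": "transaction_id", "type": "string", "nullable": False},
        {"name": "amount", "type": "decimal(12,2)", "nullable": False},
        {"name": "posted_date", "type": "date", "nullable": False},
    ],
    # Storage provisioning request; credentials are issued separately.
    "storage": {"zone": "raw", "format": "parquet"},
}

request = urllib.request.Request(
    "https://data-portal.example.com/api/v1/datasets",  # hypothetical endpoint
    data=json.dumps(registration).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(request)  # submit once credentials are configured
```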
The control plane: Configuring and monitoring data
At the core of our data pipeline lies the control plane. When a dataset is registered through the self-service portal, we populate the control plane with the characteristics of that registered data. This centralized system holds configuration information for each dataset, provisions access to data, manages data quality and ensures proper orchestration of the data pipeline.
The control plane coordinates and manages the flow of data throughout the organization, ensuring it is processed efficiently and remains well governed. By centralizing these functions, we can maintain a high level of control and visibility over our data.
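As a simplified illustration, a control plane record might look something like the sketch below. The fields and the in-memory registry are hypothetical; they only mirror the responsibilities described above: configuration, access, quality and orchestration.

```python
# A simplified, hypothetical model of a control plane record and registry.
from dataclasses import dataclass, field

@dataclass
class DatasetConfig:
    dataset_id: str
    owner: str
    privacy_level: str                    # drives access provisioning
    quality_checks: list[str] = field(default_factory=list)
    schedule: str = "daily"               # drives pipeline orchestration
    storage_path: str = ""

class ControlPlane:
    """Central registry the pipeline consults before moving any data."""

    def __init__(self) -> None:
        self._registry: dict[str, DatasetConfig] = {}

    def register(self, config: DatasetConfig) -> None:
        self._registry[config.dataset_id] = config

    def config_for(self, dataset_id: str) -> DatasetConfig:
        return self._registry[dataset_id]

# Registering a dataset populates the control plane; downstream jobs then
# query it for quality rules, access policy and scheduling.
control_plane = ControlPlane()
control_plane.register(DatasetConfig(
    dataset_id="card_transactions_daily",
    owner="payments-data-team",
    privacy_level="confidential",
    quality_checks=["not_null:transaction_id", "non_negative:amount"],
    storage_path="s3://example-lake/raw/card_transactions_daily/",
))
```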
Embracing serverless: Building and deploying quickly
We also leverage managed service capabilities from AWS that empower our data engineers to focus on creating value rather than managing infrastructure. We use a variety of AWS services, such as S3, Route 53, Lambda and Step Functions, to build a resilient and scalable data pipeline.
Our serverless-first approach enables us to build and deploy data pipelines quickly and efficiently. Managing and maintaining infrastructure is expensive, but relying on managed services allows us to avoid this overhead and focus on innovation and delivering value to our customers.
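As a rough sketch of what one serverless step in such a pipeline could look like, the Lambda handler below reacts to a new object landing in S3 and starts a Step Functions execution for downstream validation and loading. The state machine ARN, environment variable and event wiring are assumptions for illustration, not our production setup.

```python
# Hypothetical Lambda handler: triggered by an S3 event, it starts one
# Step Functions execution per new object so the workflow can validate
# and load the data.
import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    """Start one pipeline execution per object reported in the S3 event."""
    records = event.get("Records", [])
    for record in records:
        payload = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        }
        sfn.start_execution(
            stateMachineArn=os.environ["PIPELINE_STATE_MACHINE_ARN"],  # assumed env var
            input=json.dumps(payload),
        )
    return {"status": "started", "records": len(records)}
```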
Automation and scale: The cornerstones of an efficient data ecosystem
We’ve established self-service capabilities for our data producers that populate the control plane with all the dataset configurations. However, leaving every team to implement against those configurations on its own does not always produce consistent results. Driving automation and scale requires a different approach.
We need common enforcement mechanisms to automate our standards and scale them. To address this, we use two approaches:
- Central platform: A centralized platform for publishing data with enforced governance, data quality checks and support for different data stores.
- Federated data model: Enabling a data team to have complete control of the Spark compute infrastructure in the data plane while maintaining SDK-enforced governance.
These two approaches provide a balance between centralized control and decentralized flexibility. The central platform ensures a level of consistency and governance across the organization, while the federated model allows teams to tailor their data pipelines to their specific needs. This approach is key for managing many different types of data and use cases within a large enterprise. Regardless of the implementation, the data governance capabilities are applied consistently across both centralized and federated approaches.
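To illustrate the federated model, the sketch below shows how an SDK-enforced governance layer might wrap a Spark write: the team owns its compute, but publishing flows through a shared function that applies the same checks everywhere. The function name, required metadata fields and quality checks are hypothetical, not Capital One's actual SDK.

```python
# A minimal sketch of SDK-enforced governance around a Spark write.
from pyspark.sql import DataFrame, SparkSession

REQUIRED_METADATA = {"dataset_id", "owner", "privacy_level", "retention_days"}

def governed_publish(df: DataFrame, path: str, metadata: dict) -> None:
    """Refuse to publish unless registration metadata and basic checks pass."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"Dataset registration incomplete, missing: {sorted(missing)}")
    if df.rdd.isEmpty():
        raise ValueError("Refusing to publish an empty dataset")
    # The team controls the compute; the shared library controls the checks.
    df.write.mode("append").parquet(path)

if __name__ == "__main__":
    spark = (SparkSession.builder
             .master("local[1]")
             .appName("governed-publish-demo")
             .getOrCreate())
    demo = spark.createDataFrame([(1, 42.50), (2, 13.99)], ["transaction_id", "amount"])
    governed_publish(demo, "/tmp/demo_dataset", {
        "dataset_id": "demo.transactions",
        "owner": "demo-team",
        "privacy_level": "internal",
        "retention_days": 365,
    })
```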
The data consumer experience
Now let’s take a look at the data consumer. Data consumers cannot do their jobs if our data is not high quality, not meaningful or simply not discoverable.
At Capital One, our lake strategy uses a single storage solution, making consumption easy by allowing any compute engine to point to the storage location. The lake is also organized into zones that support diverse use cases, each of which might have varied access, storage and retention policies.
These capabilities enable consumers to experiment with data for diverse use cases while maintaining governance. For example, a data scientist can use self-service data sandboxes to rapidly build and train models, while an ML engineer can deploy low-latency data stores to power real-time applications.
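As a simple illustration of zone-based consumption against a single storage location, the sketch below resolves governed zone paths that any compatible compute engine could point at. The bucket name, zone definitions and retention values are hypothetical.

```python
# Hypothetical zone layout over a single lake storage location.
LAKE_ROOT = "s3a://example-lake"

ZONES = {
    "raw":     {"retention_days": 30,  "access": "producers-only"},
    "curated": {"retention_days": 365, "access": "broad-analytics"},
    "sandbox": {"retention_days": 7,   "access": "experimentation"},
}

def zone_path(zone: str, dataset_id: str) -> str:
    """Resolve a dataset's path inside a governed zone of the lake."""
    if zone not in ZONES:
        raise ValueError(f"Unknown zone: {zone}")
    return f"{LAKE_ROOT}/{zone}/{dataset_id}"

# Because storage is a single location, any compute engine can point at the
# same resolved path, for example from PySpark (requires lake credentials):
#   spark.read.parquet(zone_path("curated", "card_transactions_daily"))
print(zone_path("sandbox", "card_transactions_daily"))
```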
Key takeaways: Building a trustworthy data ecosystem
At Capital One, our approach to building a trustworthy data ecosystem that powers AI can be summarized in these key takeaways:
- Streamlined experiences: Providing a consistent and user-friendly experience for both data producers and data consumers.
- Mechanisms for enforcement: Building for automation and scale to ensure data quality and governance through centralized and federated data models, pipelines and data lifecycle policies.
- Rapid experimentation: Enabling fast and secure access to data for experimentation while maintaining governance.
- Unwavering trustworthiness: Ability to ensure data is well-governed and easy to find, use and consume.
Learn more about Capital One Tech and explore career opportunities
New to tech at Capital One? We're building innovative solutions in-house and transforming the financial industry:
- Explore open tech jobs and join our world-class team in changing banking for good.
- See how we’re building and running serverless applications at a massive scale.
- Read more from our technologists on our tech blog.
---
This blog was authored by Kajal Wood and Marty Andolino.
Kajal Wood is a Sr. Director of Enterprise Data Technology at Capital One. In her role, Kajal leads a team of engineers responsible for building and maintaining data storage platforms, such as the lake and a suite of capabilities designed to standardize and protect company data. She is also responsible for data consumption tools within the company, including Databricks, Snowflake, and multiple BI tools. Kajal’s experience spans designing and implementing scalable data architecture and leading cross-functional teams responsible for building end-to-end data solutions across ingestion, storage, and consumption.
Marty Andolino is a VP of Engineering at Capital One. In his role, Marty leads a team responsible for data pipelines, data governance services, and external data sharing. Having been with Capital One for more than nine years, he has held various tech roles across retail, marketing, fraud, data, decisions, and architecture. He is passionate about building a positive customer experience, innovative technology solutions, and mentoring. Marty went to Syracuse University (go Orange!), and in his free time, Marty enjoys watching sports and movies, as well as spending time outside and with his family.