Federating data management to scale in the cloud
In today’s complex and diverse data landscape, the ability to scale a well-managed cloud data ecosystem becomes crucial. But traditional approaches to data management that concentrate data responsibilities within a single central data team can create bottlenecks that slow data consumption and thus action and innovation. Enterprises today require a new model that makes high-quality data accessible quickly to various stakeholders across an organization.
More businesses are seeing the value of moving to a decentralized data management approach where organizations distribute data responsibilities along business lines and produce data sets with greater accuracy and speed. A key aspect of the approach is federated data management, or the practice of empowering data stakeholders in various lines of business to manage their own data. We will examine what this looks like and why it’s especially important for businesses looking to take advantage of the speed and scalability possible in the cloud.
What is federated data management?
Federated data management shifts ownership of data from a centralized team that historically was responsible for all data management to the domain teams that produce the data. In other words, an organization gives ownership of the data to the team that’s most familiar with it. In most organizations, these domains, or groups of people organized around a common business endeavor, lie within lines of business.
Federated vs. centralized data management
Decentralizing data management is a reaction to the challenges organizations can encounter today with a traditional centralized approach. When data sets were smaller and more predictable, a single data team was able to manage data for the entire organization. The team received data requests and, whether they came from the marketing or human resources departments, the data team was responsible for processing and preparing all data for consumption.
However, many companies are moving to the cloud, where the number and variety of data sets increase exponentially and a single team often does not have the time or domain knowledge to manage all the data sets accumulating across various lines of business. A central team can become a bottleneck to the business and slow down innovation while trying to answer this growing backlog of data needs and requests. Additionally, a central team can create a disconnect between the data producers that know the data best and the data engineers preparing the data for analysis, which can affect data accuracy and quality.
When you assign data management responsibility to the data owners and allow them to publish data, the organization is able to scale and gain access to the data much faster. For federated data management to work, however, each domain must follow a centrally determined policy for governance, enforcement and monitoring.
Components of federated data management
Local ownership of data
A federated approach moves an organization to a model where the people who know the data best own its management, processing and storage. By arranging data responsibilities into discrete units, each with its own data products, data federation gives organizations greater access to data and the ability to scale quickly.
In a federated approach, each domain is responsible for the entire data pipeline. The people closest to and most familiar with the data take end-to-end ownership of the data, including the source systems, pipelines and data products. They also work independently, scaling their own processes without affecting other teams or domains. There must also be a high degree of interoperability in shared data sets as consumers use data from multiple domains to get their jobs done.
Establishing data ownership is also crucial to managing and governing data appropriately. If you ask your stakeholders “Who owns the data?” you will likely get multiple answers, including the data producer, the data consumer, the owner of the system in which the data resides, or the data warehouse administrator. Federating data ownership so that responsibilities lie clearly with the lines of business producing the data establishes accountability and transparency across your business.
Central policies
But decentralizing alone would quickly lead to data inconsistencies, inefficiencies and deep silos. Each domain must also ground itself in a set of enterprise-wide standards for data that determines how to categorize, manage, secure and access the data, which is where centralized data governance comes in. A balance between decentralized data ownership and centralized data governance ensures consistency and high data quality across the organization while allowing the business to scale and move rapidly.
Self-service experiences
Key to empowering the different lines of business to manage their own datasets is providing them with flexible, self-service experiences. This is where a central platform can help a business provide the data management tools each line of business needs while orchestrating central governance controls behind the scenes.
Federated data management is an important component of data mesh, which is an architectural concept of data management that treats data as a product and decentralizes data ownership across different lines of business. These domains then deliver data products that the rest of the organization can consume. Data mesh is a framework that allows enterprises to design modern data architectures and scale a well-managed cloud ecosystem. A main principle behind data mesh is that high-quality data can be accessible to anyone across an organization, sometimes referred to as the democratization of data.
Importance of data federation in the cloud
Data federation is a useful concept for any organization trying to derive insights from large volumes of data. But it becomes especially important in the complex, fast-moving cloud environment, where there are almost no limits on data volume, storage and scalability.
Data federation leads to many important benefits for an organization looking to scale in the cloud, such as:
- Accuracy: Domain owners know their data best and can ensure the accuracy of important data elements such as the meaning of metadata fields, access assignments and sensitivity classifications.
- Accountability: Each domain is responsible for end-to-end delivery of the data product. This ensures the data is available and meets high standards of quality. Organizationally it is clear who is responsible for each data set, which encourages greater accountability from domain owners to produce better data.
- Scalability: The independence yet interoperability of the discrete data domains gives organizations a foundation for scaling across the business.
- Improved discovery and trust: As domain teams publish data that is trustworthy, transparent and accurate for data consumers, they build confidence across the organization in the quality of data and encourage data discovery efforts.
- Consistent governance: Because domains operate within centrally defined policies and standards for governance, there is greater consistency across data sets and an assurance that consumers are working with data that meets enterprise-wide standards.
- Faster innovation: With teams able to serve their own needs at their own pace, without going through a central data team to fulfill requests, businesses as a whole are able to move faster in taking advantage of new data insights that speed innovation.
How we federated data at Capital One
Early on in our data journey, we recognized the benefits of moving our data workloads to the cloud. But we also realized we could no longer operate with a data management approach organized around a central data team. The influx of data coming in from multiple sources created a bottleneck for our data engineering team. They began experiencing an increase in data complexity, inefficiencies such as duplicate requests and greater demand for faster data analysis from the business.
We looked to distribute data responsibilities to our lines of business as a way for stakeholders to independently move forward in deriving the value they need from data while no longer depending on a central team’s backlog. At the heart of our approach was treating data as a product, which meant we applied product thinking to data sets. Each line of business produced a data set with the intention of providing immediate value to the end customers, our data scientists and analysts.
Through the shifts we made in the culture and organization of our data management, we experienced incredible improvements in data access and scalability. For example, publishing a data set could now take a few hours rather than many days. We also eliminated many manual processes through the centralized tooling we created, saving 55,000 hours of manual work.
Read more about how we evolved our data management: Data management: A modern, integrated approach
Let’s look at how this worked in practice and how we made federated data management possible at Capital One.
Discrete units of data responsibilities
First, we broke our lines of business into discrete organizations and units of data responsibilities. We applied differing levels of hierarchy to each line of business, with smaller organizations requiring only one level while larger organizations were broken into three or four levels. We then assigned responsibilities for each line of business by business unit, with the same set of roles existing within each organization.
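As a rough illustration of such a hierarchy, the sketch below models each unit as a node with an optional parent and a set of role assignments. The role names (data_owner, data_steward), the unit names and the rule that a unit falls back to its parent when a role is not assigned locally are assumptions made for this example, not a description of the actual model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OrgUnit:
    """One unit of data responsibility inside a line of business."""
    name: str
    parent: Optional["OrgUnit"] = None
    # Maps a role name to the person holding it (role names are hypothetical).
    roles: dict = field(default_factory=dict)

    def path(self) -> list:
        """Return unit names from the top of the line of business down to this unit."""
        names = []
        unit = self
        while unit is not None:
            names.append(unit.name)
            unit = unit.parent
        return list(reversed(names))

    def resolve_role(self, role: str) -> Optional[str]:
        """Find who holds a role, checking this unit first, then its ancestors."""
        unit = self
        while unit is not None:
            if role in unit.roles:
                return unit.roles[role]
            unit = unit.parent
        return None


# A small organization needs only one level.
small = OrgUnit("Small LOB", roles={"data_owner": "alice"})

# A larger organization is broken into several levels.
large = OrgUnit("Large LOB", roles={"data_owner": "bob"})
division = OrgUnit("Division", parent=large)
team = OrgUnit("Team", parent=division, roles={"data_steward": "carol"})
```

Because every unit uses the same role names, the same lookup works whether a line of business has one level or four.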
Common enterprise standards for data
At Capital One, we have our own enterprise data governance tower that works with regulators and internal stakeholders to define the organization’s standards. We needed to know where our data resides, which data is sensitive and who’s responsible for each data set. We developed common enterprise standard definitions for metadata curation, data quality for shared data sets and entitlement patterns for data based on sensitivity. Lines of business then followed the standards of the entire organization.
At the same time, there were separate rules within lines of business, and this is where hierarchies became important. We knew not all data is created equal, so we addressed important nuances by enacting a sloped governance approach to shared data sets. For example, the most rigorous metadata curation standards might be reserved for data used for regulatory reports, while less sensitive data in a user sandbox requires only a subset of that metadata curation.
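One way to picture sloped governance is as a table of required metadata that grows with the sensitivity of the data's use. The following is a minimal sketch under that assumption; the tier names and metadata field names are hypothetical and chosen only for illustration.

```python
# Hypothetical governance tiers and the metadata fields each one requires.
# Stricter tiers require a superset of the fields of looser tiers.
REQUIRED_METADATA = {
    "regulatory": {"owner", "description", "sensitivity", "lineage", "retention"},
    "shared": {"owner", "description", "sensitivity"},
    "sandbox": {"owner"},
}


def missing_metadata(tier: str, metadata: dict) -> set:
    """Return the required fields that are absent or empty for the given tier."""
    required = REQUIRED_METADATA[tier]
    provided = {name for name, value in metadata.items() if value}
    return required - provided


# A sandbox data set with only an owner passes its tier's check,
# while the same metadata falls short for a regulatory data set.
sandbox_gaps = missing_metadata("sandbox", {"owner": "card-team"})
regulatory_gaps = missing_metadata("regulatory", {"owner": "card-team"})
```

Keeping the requirements in one centrally owned table lets each line of business run the same check while the bar still varies by tier.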
Usability layer for self-service, user-specific experiences
Next, we knew we needed a centralized tool to comply with the enterprise standards we described while also enabling self-service data management. Technically, this meant building a usability layer for the different data experiences based on the personas that most commonly interacted with data at Capital One. This layer was oriented around the job that needed to be accomplished such as publishing a new data product or protecting sensitive data. We built experiences for the data product owner or producer, data consumer, risk manager and business data platform owner. Data users were now able to go to one place for their self-service tools, accomplish their jobs with automated workflows and feel confident they were adhering to necessary governance and controls.
For example, the data publisher could use the centralized tooling to register the data, assign classifications, assign entitlements, then publish the data. The data governance activities occur in the background, where the complexities of the processes remain hidden. Predefined rules are incorporated into the workflow, such as whether credit card numbers must be encrypted, so that a data publisher can remain compliant with governance policies. Consumers can request access to a data set, and the centralized tool will automatically grant or prevent access in accordance with centrally set rules, without weeks of wait time.
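The workflow described above can be sketched in a few lines. This is a simplified illustration, not the actual tooling: the Dataset fields, the MUST_ENCRYPT rule set, the role names and the function names are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A registered data set and the governance details its publisher supplies."""
    name: str
    owner: str
    # Maps column name to a classification label, e.g. "credit_card_number".
    classifications: dict = field(default_factory=dict)
    encrypted_columns: set = field(default_factory=set)
    # Roles entitled to read the data set (role names are hypothetical).
    entitled_roles: set = field(default_factory=set)
    published: bool = False


# Hypothetical central rule: classifications that must be stored encrypted.
MUST_ENCRYPT = {"credit_card_number"}


def policy_violations(ds: Dataset) -> list:
    """List every column that breaks the central encryption rule."""
    violations = []
    for column, label in sorted(ds.classifications.items()):
        if label in MUST_ENCRYPT and column not in ds.encrypted_columns:
            violations.append(f"{column}: {label} must be encrypted")
    return violations


def publish(ds: Dataset) -> list:
    """Publish only when no central rule is violated; return any violations."""
    violations = policy_violations(ds)
    if not violations:
        ds.published = True
    return violations


def can_access(ds: Dataset, role: str) -> bool:
    """Decide an access request from the entitlements the publisher assigned."""
    return ds.published and role in ds.entitled_roles


# Register, classify and entitle a data set, then try to publish it.
payments = Dataset(name="payments", owner="card-team")
payments.classifications["card_no"] = "credit_card_number"
payments.entitled_roles.add("analyst")

first_attempt = publish(payments)          # blocked: card_no is not encrypted
payments.encrypted_columns.add("card_no")
second_attempt = publish(payments)         # succeeds once the rule is satisfied
```

The publisher only supplies classifications and entitlements; the rule check and the access decision run from centrally defined data, which is the point of hiding governance behind the workflow.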
Data federation best practices
Transitioning to federated data ownership may not always be seamless. However, we believe the following will help many organizations in establishing well-defined and efficient federated data policies and processes.
- Get executive-level buy-in: One of the biggest challenges you may face will be gaining internal buy-in for a new federated model. This is more likely if you’re a large organization that has historically followed the centralized approach to data management. Gaining buy-in from the top, such as your CIO, can go a long way in influencing all stakeholders to adopt the new model and move things along more quickly.
- Be flexible: Accommodate and be respectful of the various ingestion patterns of the different lines of business. Each tool you build must incorporate the different lines of business and the unique ways they produce and access data. If you don’t, data users will find another way of getting their data that’s outside of your control.
- Build trust: If your organization historically relied on a central team to own all things data, that team may have a hard time trusting that the different lines of business will be able to keep important policies and standards in place. This is where centralized tooling becomes so important. Enable lines of business to manage and publish data in a way that supports central policies and governance. With all the controls in place, the data team can trust that all lines of business are following appropriate guidelines and policies.
Find the right balance
Modern organizations have much to gain in empowering lines of business with the autonomy to manage and produce their own data, but within centrally defined data standards.
Striking the right balance between federated data ownership and centralized tooling and policy will go far in helping your business operate successfully in modern cloud systems.