Developing Consumer Data Products at Petabyte Scale


D&A Strategy

Modern Data Platform

We unified every data consumer team around a new cloud architecture and a single set of data standards. The new framework gave the company confidence to democratize data product development and unstuck the product development backlog.

Scott Reedy, Axis Group

Meeting the Challenge

How does a data product company democratize data product development and unify its data strategy in a cloud-first world?

It's hard enough to design a data platform for a company that serves a few thousand customers from a dozen data sources. But how do you architect a solution that effectively manages 30 million data producers that generate hundreds of millions of data points each day? Data products are hard to build, maintain, and scale. When you're using Big Data as an input to your product, the challenges can seem insurmountable.

But it doesn't have to be that way.

This case study describes how Axis Group helped a client build a scalable data platform capable of handling the intricate requirements of its data products, and how we did it in a way that ensured sustainability and ease of maintenance.

Our client, a fast-growing media measurement and analytics company, captures granular behavioral data from millions of TV viewers. It taps into data from DVRs, set-top boxes, and smart TVs to help its customers analyze viewing activity for programming and advertising at second-by-second granularity. This adds up to more data in a single day than many companies generate in a year.

Further complicating matters, our client sells its data to external parties; because data is the product, data quality and robustness are essential to remain competitive.

Once your data reaches the petabyte scale, every aspect of your data operations and architecture becomes substantially more challenging to manage, calling for specialized architectures, processes, tooling, and expertise. Data platforms of this scale require a complex web of components to handle ingestion, storage, and processing. It's also important to understand that while these systems require a modern approach, traditional data management practices are still required to maintain appropriate performance, security, and compliance.

Our client was quickly running up against the natural limitations of its existing data platform. While the company had deep analytics capabilities and delivered a market-leading data product to its customers, it was rapidly outgrowing the legacy architecture it had built back when "modern data architecture" meant an on-premises Hadoop cluster.

In short, they were unable to move at the speed demanded by their internal users, their customers, and the broader market.

One key struggle was a lack of standardization in their tooling and processes across the enterprise: as their data ecosystem grew, each team developed idiosyncratic solutions to whatever problem they faced. This resulted in a jumble of data pipelines, using a suite of different orchestration tools, each feeding a number of inconsistent data management systems.

The result was predictable: non-standard processes and competing standards caused lock-in and stalled platform growth. Each system depended on its original creator for ongoing support and maintenance, and once that author had departed, no one else had the knowledge to modify or maintain it. Enterprise-wide development stalled, and the backlog of desired capabilities grew longer each year.

It was time to bring the teams together under one banner—a unified data platform strategy, with universal standards and scalable frameworks that could be readily understood and adopted. Migrating such a large amount of data and transitioning to a new platform is a daunting task that requires significant planning, coordination, and resources. Complicating matters further was the fact that this wasn't a fresh ecosystem; it called for a seamless data platform migration while daily operations continued.

Our client was ready to migrate its legacy architecture to a modern cloud platform and scale its systems and processes, but it wasn't sure how to get there.

That's when they called Axis Group.




Our Solution

With a cloud-first architecture and updated data models, Axis democratized the development of complex data products.

The heart of Axis's approach was to develop a universal, foundational, cloud-based data ecosystem, supported by standards, practices, and frameworks that every team could readily adopt to get the data they needed.

In the modern era, data platforms have moved past Hadoop; they are cloud-native, built anticipating near-infinite scalability and aligning a team towards innovation instead of maintenance. A cloud architecture becomes especially important when the data feeds a customer-facing data product, since it requires speed and flexibility to support your customers' needs as they evolve.

After assessing the client's market strategy, their existing infrastructure, and how their teams really work, Axis developed a right-sized plan for the future.

After comparing our client's core needs and capabilities to the data platform landscape, Axis recommended a custom, standardized framework centered on Databricks. Databricks offers flexible infrastructure for running data pipelines, elastic resources for storage and compute, and a rich set of features that allow easy integration with other data tools.

We started from the ground up by developing a custom framework to help the team better standardize, modularize, and extend their code. This laid a solid foundation to build upon, and let the company focus more on business logic and less on plumbing. We then developed and facilitated a plan to migrate their core ETL code into the framework. Along the way, we introduced a modern testing methodology to accelerate development and improve quality.
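As a rough illustration of what "standardize, modularize, and extend" can look like in code, here is a minimal Python sketch of a step registry. The case study does not describe the actual framework, so every name here (STEPS, step, run_pipeline, and the sample record fields) is hypothetical, and a real implementation on Databricks would operate on Spark DataFrames rather than Python lists.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]

# Hypothetical shared registry: every team registers its steps by name,
# so pipelines are assembled from a common catalog rather than ad-hoc scripts.
STEPS: Dict[str, Step] = {}


def step(name: str) -> Callable[[Step], Step]:
    """Decorator that adds a transformation to the shared registry."""
    def decorator(fn: Step) -> Step:
        STEPS[name] = fn
        return fn
    return decorator


@step("drop_empty_device")
def drop_empty_device(records: List[Record]) -> List[Record]:
    # Business rule (assumed for illustration): discard rows with no device id.
    return [r for r in records if r.get("device_id")]


@step("add_duration")
def add_duration(records: List[Record]) -> List[Record]:
    # Derive viewing duration in seconds from assumed start/end fields.
    return [{**r, "duration_s": r["end_s"] - r["start_s"]} for r in records]


def run_pipeline(step_names: List[str], records: List[Record]) -> List[Record]:
    """Run the named steps in order; the 'plumbing' lives here, once."""
    for name in step_names:
        records = STEPS[name](records)
    return list(records)
```

Because each step is a plain function, it can be unit tested on a handful of records without a cluster, which is one way a framework like this can support the kind of testing methodology mentioned above.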

With so many disparate approaches to orchestration and data quality in place across the organization, the next step was to rationalize and consolidate them for unified management; we opted for AWS-managed service offerings to automate away much of the overhead normally associated with those systems and processes. On the road to production, Axis also incorporated coding best practices and standards including developer workflows, sharable libraries, modern CI/CD practices, and Infrastructure-as-Code for infrastructure standardization.

One reason Axis recommended Databricks in this case was that it allows each team to work the way that makes the most sense to them. The client's Data Engineers could use Databricks for infrastructure automation when running their Spark jobs, but stick with a traditional software development workflow to produce and deploy their code. Data Scientists, by contrast, chose to use Databricks to develop and train models, relying on Jupyter notebooks to produce their team's code. And Data Analysts could use Databricks to run ad-hoc queries and create reports.

Since Databricks supports all of these patterns out of the box, and lets people quickly transition to new tools and workflows, Axis helped the client embrace this capability, and enabled the organization to build meaningful bridges across their data teams and promote collaboration under one development banner.

When designing a modern data platform, it is not uncommon for engineers to overlook foundational data management principles. Here, the engineering group tended to focus on advanced features at the expense of fundamental concepts such as data modeling. The Axis team, with their extensive expertise in Data Management, collaborated with stakeholders to fill in the gaps. Our solution involved developing an optimized data model that facilitated growth and scalability while maintaining data quality. This resulted in a significant reduction of redundant data and improved the overall performance of the platform.

One area of great interest to our client was how to manage its reference data—dimensional information used to enrich their facts. While reference data management (RDM) is often treated as an afterthought, at petabyte scale it becomes its own unique engineering challenge beyond the capabilities of traditional RDM tools: how to maintain accuracy and consistency while still processing timely changes across billions of records.

Here, our client's needs were even more complex. They needed to run "what-if" scenarios to examine the broader impact of modifying one or more dimensions on billions of records before deciding whether to finalize the change, and they needed to track slowly changing dimensions and the lineage of any committed modifications. Under the legacy framework, even minor changes to reference data cascaded across a host of different systems and required the assistance of a senior engineer, even though the same update would be a simple change in any traditional operational data store.
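To make these two requirements concrete, here is a minimal Python sketch of a slowly changing dimension (Type 2 style, where each change closes the current row and opens a new one) together with a "what-if" check that counts how many fact records would resolve differently under a proposed change. It is an illustration only: the source does not describe how the client's solution works internally, the names DimRow, apply_change, lookup, and what_if_impact are hypothetical, and at billions of records the comparison would run as a distributed join rather than a Python loop.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DimRow:
    key: str                          # business key, e.g. a station id
    value: str                        # the tracked attribute
    valid_from: date
    valid_to: Optional[date] = None   # None marks the current version
    changed_by: str = "unknown"       # simple lineage: who made the change


def apply_change(rows: List[DimRow], key: str, new_value: str,
                 effective: date, changed_by: str) -> List[DimRow]:
    """Return a new history with the change applied; the input is not mutated."""
    out: List[DimRow] = []
    for row in rows:
        if row.key == key and row.valid_to is None:
            if row.value == new_value:
                return list(rows)     # nothing to change
            out.append(replace(row, valid_to=effective))  # close old version
        else:
            out.append(row)
    out.append(DimRow(key, new_value, effective, None, changed_by))
    return out


def lookup(rows: List[DimRow], key: str, on: date) -> Optional[str]:
    """Resolve the attribute value that was valid for `key` on a given date."""
    for row in rows:
        if (row.key == key and row.valid_from <= on
                and (row.valid_to is None or on < row.valid_to)):
            return row.value
    return None


def what_if_impact(rows: List[DimRow], facts: List[Dict], key: str,
                   new_value: str, effective: date) -> int:
    """Count facts whose enrichment would differ if the change were committed."""
    proposed = apply_change(rows, key, new_value, effective, "what-if")
    return sum(
        1 for f in facts
        if lookup(rows, f["key"], f["on"]) != lookup(proposed, f["key"], f["on"])
    )
```

The design choice worth noting is that apply_change returns a new history instead of editing in place, so the same function serves both the dry run and the real commit, and the closed rows preserve the prior state of each dimension.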

To meet the company's requirements, the Axis team developed a custom RDM solution using Python, Airflow, and MariaDB that lets even junior analysts update reference data and examine its impact at scale, all using everyday tools like MS Excel.

Together, the changes to architecture, data models and tooling—along with Axis's mentorship and enablement—democratized access to the development of data products and created an entirely new way to do business. The data teams are now unified under one set of data and code standards—"one true framework to rule them all"—and Axis was able to help the company prioritize and unstick the development backlog with a workable roadmap for the future.