Why do we need Data Products?

The true reason behind any data-related activity is to enhance business outcomes. The objective of leveraging data was and will be to enrich the business experience and returns. This direct tie between data and business, even though obvious, was lost in translation for a long time as we, the data community, dived into the tactics and forgot the end objective.

As data teams got caught up in defining, building, and maintaining the process (data infrastructures, pipelines, and architectures), that work progressively ate into the time they could spend on the core data and data applications. What this meant was:

  • Limited data applications to power customer-facing endpoints or business decisions.
  • Faulty and untrustworthy data as a result of increasing debt in the data infra layers and never enough data engineers to manage it.
  • Long and loopy path from data to insights that led to the loss of several business opportunities.
  • High time to ROI and low ROI of data teams due to complex builds and fixer-uppers that hogged significant resources and ate into whatever ROI was generated in the first place.

Over 90% of the world’s data was generated (captured) in just the last couple of years and is stored across expensive storage such as data lakes, warehouses, or lakehouses. The result is hauntingly similar to a basement of dusty files: rich in information that no one can actually operationalize. Another name for it is a data swamp.

Such data mismanagement results from data stacks that silo, duplicate, fragment, lock in, and misgovern data. To battle this chaos, we need to establish the data product ideology and implement it through a unified data architecture that pushes back against the data disruption caused by prevalent stacks and frees up the organization’s resources to focus on building the real deal: the data product.

TLDR: Data Products establish a direct connection between data investments and tangible business outcomes, a bridge that was starkly missing from the data landscape before.

What is a Data Product?

A data product is a reliable unit of data or a container of data that enables a direct and seamless impact on business decisions and outcomes at the time of opportunity.

There are five aspects to the above definition:

Simplicity or Seamless Impact

One of the challenges businesses face is wading through the complexities of data to mine insights and patterns that they can only half-heartedly rely on. A data product aims to lower this barrier, since it is born out of decoupled physical and logical layers where control of the data narrative lies with the business. From the perspective of product thinking, a data product is an enhanced user experience of data.

Data Unit or a Container

The most fundamental physical or logical data unit that can independently add value to the end user. While physical units are directly consumed, logical units are used as on-the-fly channels for materializing the physical units, powering intelligent data movement and saving movement expenses in the process.

Direct Business Impact

The ideal objective of any data stack, data team, or data initiative should be to create valuable data that actually uplifts business objectives. While this end vision is lost under the task of maintaining heaps of complex subsystems, the data product brings back the target to the forefront and solves it right-to-left.

Reliable Unit

Nobody wants to use data they cannot trust, and even when they do so for lack of options, there is the inevitable burden of engaging high-cost dedicated teams and resources to look after the lineage of the unstable business use case.

Operationalized at the time of opportunity

The most critical factor in business is time, especially due to the rate at which business moves today. Infra infiltrated with leaky pipelines needing constant fixes steals away time from actually producing business-relevant insights. A data product, therefore, is data that is available on demand with quality and governance enforced on the fly.

These five factors are key to businesses and embody the outcome of a well-developed product mindset where data itself is treated as a product by the organization. Contrary to common practice, data must be developed and managed as a product to extract optimal business value.

Interestingly, data product materialization is more of a people and process problem than a technology problem. There can be numerous potential ways to power a data product, as we have seen pop up over the last few years, but the key is to implement an architecture that doubles as a data culture enabler.

Otherwise, as is common in the data ecosystem, everything falls apart into yet another unmanageable asset that the data team is compelled to look after due to the high investments poured in by the organization. Establishing the product mindset is the largest barrier to enabling and maintaining a data product so that it doesn’t become another feared high-entropy design approach.

A data product adds value to the user consistently and reliably and, at every point in time, embodies the following features:

Discoverable

Data discoverability is perhaps the most important feature behind data operationalization. If you cannot find the data, you are blocked at the very first stage. Discoverability implies reusability, which is why the data product approach encourages uniqueness and discourages duplication. Duplicating data or data product pipelines is a step toward yet another data swamp. Discoverability is powered by a global metadata manager that sits in a central control plane and has visibility over the entire data ecosystem across distributed user planes.
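To make the idea concrete, here is a minimal sketch of a metadata registry that supports keyword search and refuses duplicate registrations. It is an illustration only: the names ProductMetadata, MetadataRegistry, and the address strings are hypothetical, and a real metadata manager in a control plane would be a far richer service.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProductMetadata:
    """Hypothetical metadata record for one data product."""
    address: str          # unique identifier within the organization
    domain: str
    description: str
    tags: frozenset = field(default_factory=frozenset)


class MetadataRegistry:
    """Toy stand-in for a central metadata manager (illustrative only)."""

    def __init__(self):
        self._products = {}

    def register(self, meta):
        # Refuse duplicates: an existing product should be reused, not re-created.
        if meta.address in self._products:
            raise ValueError(f"already registered: {meta.address}")
        self._products[meta.address] = meta

    def search(self, keyword):
        # Case-insensitive match against the description and the tags.
        kw = keyword.lower()
        return [
            m for m in self._products.values()
            if kw in m.description.lower() or kw in {t.lower() for t in m.tags}
        ]
```

Rejecting a second registration under the same address is one simple way to nudge teams toward reuse instead of duplication.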

Secure

Data products, being on-demand value providers, are inherently embedded with security protocols that activate on the fly based on the demand center. They are required to be governed centrally (pseudo-federated) with fundamental access and data policies that are enforced at the asset, row, and column levels. The uniqueness of data products makes the security component even more important, since the data is addressable by multiple endpoints and must adhere to the set standards and compliance requirements on every materialization channel.
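As a rough sketch of what asset-, row-, and column-level enforcement could look like, the snippet below applies a policy to rows at read time. The AccessPolicy class, the apply_policy function, and the "***" mask are assumptions made for illustration; they do not describe any particular platform's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AccessPolicy:
    """Hypothetical policy for one role; field names are illustrative."""
    allowed_assets: set
    row_filter: Optional[Callable[[dict], bool]] = None  # None means no row filter
    masked_columns: set = field(default_factory=set)


def apply_policy(policy, asset, rows):
    """Return the rows of `asset` that the policy permits, with masking applied."""
    # Asset level: deny access to the whole asset.
    if asset not in policy.allowed_assets:
        raise PermissionError(f"access to {asset!r} denied")
    result = []
    for row in rows:
        # Row level: drop rows the role may not see.
        if policy.row_filter is not None and not policy.row_filter(row):
            continue
        # Column level: mask sensitive values instead of removing the column.
        result.append({
            col: "***" if col in policy.masked_columns else value
            for col, value in row.items()
        })
    return result
```

Because the policy is evaluated when the data is read, the same rules apply no matter which channel requests the asset.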

Addressable

Addressable data is available as a standard asset across cross-functional, regional, and multi-cloud environments. Being addressable means having a common and unique address that adheres to the organization’s or domain’s convention and is accessible by code and even low-code/no-code channels.
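For illustration, suppose an organization adopted the convention dataproduct://<domain>/<name>:v<version>. This scheme is purely hypothetical and not taken from any standard; the sketch below only shows how a shared convention lets any tool validate and resolve an address the same way.

```python
import re
from typing import NamedTuple

# Hypothetical convention: dataproduct://<domain>/<name>:v<version>
_ADDRESS = re.compile(
    r"^dataproduct://(?P<domain>[a-z][a-z0-9_]*)/"
    r"(?P<name>[a-z][a-z0-9_]*):v(?P<version>\d+)$"
)


class ProductAddress(NamedTuple):
    domain: str
    name: str
    version: int


def parse_address(address):
    """Split an address into its parts, or raise ValueError if it breaks the convention."""
    match = _ADDRESS.match(address)
    if match is None:
        raise ValueError(f"not a valid data product address: {address!r}")
    return ProductAddress(match["domain"], match["name"], int(match["version"]))
```

For example, parse_address("dataproduct://sales/orders:v2") yields the domain "sales", the name "orders", and the version 2, while a raw storage path such as "s3://bucket/orders.csv" is rejected.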

Trustworthy

Trustworthy data meets quality expectations, complies with required standards, and carries end-to-end lineage and provenance details. Businesses should be able to rely on the data they demand meeting their expectations. This is possible declaratively, where a dedicated system manages the quality and governance expectations while business teams focus completely on operationalizing the served data.
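A small sketch of the declarative idea: expectations are stated as data, and a generic checker enforces them. The column names, the EXPECTATIONS list, and the validate function are hypothetical and stand in for whatever dedicated system manages quality in practice.

```python
# Declarative expectations: (column, description, check). All names are illustrative.
EXPECTATIONS = [
    ("order_id", "not null", lambda v: v is not None),
    ("amount", "non-negative", lambda v: v is not None and v >= 0),
]


def validate(rows, expectations):
    """Return one (row_index, column, description) tuple per failed expectation."""
    failures = []
    for i, row in enumerate(rows):
        for column, description, check in expectations:
            # A missing column is treated the same as a null value.
            if not check(row.get(column)):
                failures.append((i, column, description))
    return failures
```

The business side only states what it expects; how and when the checks run is left to the system that serves the data.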

Natively Accessible

Natively accessible data is language-, format-, personnel-, and system-agnostic. Enablers of this data product attribute could include domain-specific languages (DSLs) to navigate the low-level subsystems, the capability of the system to understand a wide range of programming languages and SQL dialects, and the ability to adopt new ones through native transpilers.

Interoperable

Data generated in or sourced from different systems should be able to work together without conflicts. This is possible through data APIs or logical constructs and loosely coupled yet tightly integrated components that sit on top of unique and addressable data to ensure visibility across the required use cases with end-to-end on-the-fly governance. Interoperability also requires universal semantics that are manageable through a central control plane.

Valuable on its own

Data is valuable on its own if it acts as a complete and independent entity that directly impacts business decisions. This requires the data to adhere to universal semantics, business quality assertions, and use-case-specific governance standards.

Examples of Data Products

Now that we have a better understanding of what data products are and why they are necessary for the data ecosystem, let’s look at a few examples to get a clearer picture:

  • A spreadsheet in an S3 bucket
  • A table view knitted from across heterogeneous sources
  • A report generated from an analytics dashboard
  • A metrics layer on top of a data model
  • Features in an ML feature store
  • A database of dynamic rules in a self-driving car
  • An encrypted or masked dataview displayed through a SQL interface

The list is not exhaustive. In short, any data or logical data construct that embodies the attributes discussed above can be declared and used as a data product.

How is a Data Product materialized?

Data products are the outcome of well-run DataOps and agile methodologies. While there are multiple ways to generate data products, maintaining them is the key challenge, and the primary barrier is the data culture problem.

A unified architectural approach that decouples a central control plane and user data planes with domain-based segregation strikes the right tradeoff between technology enhancement and cultural enablement. Such an architecture also inherently decouples logical and physical data layers to restore ownership of data products to the business.

To deep-dive into each component from the above architecture, feel free to refer to the Data Mesh Implementation of a Data Operating System.

Unified Architecture with decoupled control and user planes, domain segregation, and decoupled logical and physical layers.