“Data is the new oil” has become a cliché, but most people miss the fact that when Clive Humby coined the phrase in 2006 he was also talking about the process of refining data so that it is usable and useful. Just as crude oil is refined into petrol, kerosene and other distillates, so many different data products draw from the same underlying raw data. The Data Mesh offers us a framework for the federated development of data products by supporting the decomposition of the data platform into domain-oriented projects that can reduce time-to-market and complexity. But because “data love data,” businesses must appreciate the value of the resulting fractional distillates - not merely as stand-alone products, but as the potential fuel for as-yet-unimagined cross-domain and enterprise-wide data analytics products.
Setting the right boundaries
Decomposition of a large and complex problem space into a collection of smaller models is central to the domain-driven design (DDD) inherent in the Data Mesh concept. But it also requires that we define explicit boundaries and interrelationships between domains, implying at least some degree of management and co-ordination between the various development teams. Striking the right balance between more and less independence is both art and science. Setting the optimum ‘bounded context’ and establishing lightweight governance models are critical to the success of data mesh architectures, and I’ll return to them in a separate blog.
Consider a (grossly) simplified Retail Finance scenario. The Mortgage team may have a legitimate and urgent requirement to create a new data product to understand the impact of the COVID-19 pandemic on the demand for larger suburban properties. At a minimum, this new data product will probably require the roll-up of mortgage product applications by a new and different geographical hierarchy from that used by the rest of the organisation. A domain-aligned development process and schema make this possible without lengthy discussion and negotiation across the rest of the organisation.
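A minimal sketch of that kind of domain-local roll-up is shown below. The table and column names (application_id, postcode, bedrooms, loan_amount) and the urban/suburb mapping are hypothetical illustrations, not the Mortgage domain's actual schema; the point is simply that the domain can define and use its own geographical hierarchy without negotiating a corporate standard first.

```python
import pandas as pd

# Hypothetical raw mortgage applications as the Mortgage domain might hold them.
applications = pd.DataFrame({
    "application_id": [1, 2, 3, 4, 5],
    "postcode":       ["SW1A", "KT1", "GU22", "M1", "WA14"],
    "bedrooms":       [2, 4, 5, 1, 4],
    "loan_amount":    [450_000, 620_000, 710_000, 180_000, 540_000],
})

# A new, domain-specific geographical hierarchy (urban core vs. suburb),
# deliberately different from the corporate-standard region hierarchy.
suburban_map = {
    "SW1A": "urban core", "M1": "urban core",
    "KT1": "suburb", "GU22": "suburb", "WA14": "suburb",
}
applications["geo_band"] = applications["postcode"].map(suburban_map)
applications["large_property"] = applications["bedrooms"] >= 4

# Roll up demand by the new hierarchy - no cross-organisation negotiation needed.
demand = (
    applications
    .groupby(["geo_band", "large_property"])
    .agg(applications_count=("application_id", "count"),
         total_loan_demand=("loan_amount", "sum"))
    .reset_index()
)
print(demand)
```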
By contrast, when the demand for mortgages on large, suburban properties is converted into actual loans that are advanced to customers, the data that result will underpin many other data and analytic products that support both tactical and strategic decision-making. These will require data to be combined across functional and organisational lines - and the resulting data products are likely to be widely shared and re-used across the organisation.
Even simple, stylised examples like these highlight that large and complex organisations need to create a wide variety of data products with a wide variety of functional and non-functional requirements. Much of the underlying data may be the same, but the way we process them may be very different in each case. Because re-use is the ultimate “eliminate unnecessary work” play, we should be looking to create and re-use pre-processed ‘distillates’ as widely as possible.
Three deployment strategies for Data Mesh
If re-use is ultimately about more than merely avoiding the creation of 99 different versions of a nearly identical report, and is instead about creating layered data architectures that enable high-value data elements to be discovered and re-used, then interoperability will often be a key consideration. But how can this be accomplished in a Data Mesh?
Federating the development of complex data products need not imply the federation of their deployment. In fact, we see three different strategies for enterprise deployment aligned to the Data Mesh concept:
- Co-location
- Connection
- Isolation
These choices are not mutually exclusive, and many real-world implementations use a combination of all three.
The co-located approach to deployment places domains, built in parallel by different teams, on the same platform. This does not guarantee interoperability, which must still be designed-in through the definition of a minimum set of explicit interrelationships. But deploying on a single platform can have very important performance and scalability advantages, especially for cross-domain workloads. Specifically, co-location can eliminate large-scale data movement across relatively slow WAN, LAN and storage networks and permit improved query optimisation. This leads to enhanced query throughput and reduced total-cost-of-ownership.
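The sketch below illustrates the co-located pattern, using an in-memory SQLite database purely as a stand-in for a single shared analytics platform. The two domain tables and all column names are hypothetical. What it shows is the essential property of co-location: both domain schemas live on the same engine, so a cross-domain question becomes a single optimised join rather than a bulk transfer of data between systems.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for one shared platform
cur = con.cursor()

# Mortgage domain: loans advanced to customers (hypothetical schema).
cur.execute("CREATE TABLE mortgage_loans (loan_id INTEGER, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO mortgage_loans VALUES (?, ?, ?)",
                [(1, 101, 450000.0), (2, 102, 620000.0), (3, 103, 180000.0)])

# Customer domain: customer master data, owned by a different team (hypothetical schema).
cur.execute("CREATE TABLE customer_profiles (customer_id INTEGER, segment TEXT)")
cur.executemany("INSERT INTO customer_profiles VALUES (?, ?)",
                [(101, "first-time buyer"), (102, "home mover"), (103, "buy-to-let")])

# Cross-domain analytics: one query, one optimiser, no WAN/LAN data movement.
for row in cur.execute("""
    SELECT c.segment, COUNT(*) AS loans, SUM(m.amount) AS total_advanced
    FROM mortgage_loans m
    JOIN customer_profiles c ON c.customer_id = m.customer_id
    GROUP BY c.segment
"""):
    print(row)

con.close()
```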
In some cases, it does not make sense to co-locate all domains on a single platform or under a single database instance. For example, data sovereignty laws often demand that PII be persisted within the relevant jurisdiction. For a multi-national company this means that there will be multiple schemas deployed across different geographies – even if the database technology used is the same. The connected scenario uses virtualisation and federation technologies to enable cross-domain analytics. But, as with the co-located scenario, the hard yards of reconciling and aligning data so that they can be reliably combined and compared still have to be run.
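The following sketch gives a flavour of the connected pattern and of those "hard yards". It assumes two hypothetical jurisdictional deployments, each exposing only an aggregate view (so no row-level PII leaves its region), with a thin layer on top that aligns names, units and currency before combining them. The field names, currencies and conversion rate are illustrative assumptions, not a prescribed design.

```python
import pandas as pd

# EU deployment exposes an aggregate view (no row-level PII leaves the region).
eu_aggregate = pd.DataFrame({
    "product": ["mortgage", "personal_loan"],
    "total_advanced_eur": [12_500_000, 3_100_000],
})

# US deployment exposes a similar aggregate, but with its own naming and currency.
us_aggregate = pd.DataFrame({
    "product_code": ["mortgage", "personal_loan"],
    "total_advanced_usd": [9_800_000, 2_400_000],
})

# The "hard yards": reconcile names and currency before the data can be combined.
EUR_PER_USD = 0.92  # hypothetical reference rate agreed for reporting
us_aligned = us_aggregate.rename(columns={"product_code": "product"})
us_aligned["total_advanced_eur"] = us_aligned.pop("total_advanced_usd") * EUR_PER_USD

global_view = (
    pd.concat([eu_aggregate.assign(region="EU"), us_aligned.assign(region="US")])
      .groupby("product", as_index=False)["total_advanced_eur"].sum()
)
print(global_view)
```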
Isolated domains are often proposed based on the need for extreme security (e.g., for HR data). The data product is completely self-contained within a single domain, and the schemas used are typically narrow in scope, serving operational reporting requirements rather than enterprise analytics. However, more often than not, the real reason for isolation has to do with politics and the desire for organisational independence. Truly useful data almost always becomes more valuable when combined with other data.
Connected Cloud Data Warehouse
Agile, incremental, and business-led approaches to data and analytic product development matter because they help organisations to focus on delivering high-quality data products that solve a real-world business problem. Federated development under the data mesh concept can play a key role by enabling the decomposition of complex problems so that they can be worked on by multiple, business-focused teams in parallel. But successful deployment at-scale requires us also to design for interoperability, discovery and re-use.
Creating discoverable data services - so that data products designed for re-use can be accessed and deployed beyond specific domains - is critical to avoiding the constant re-invention of similar, overlapping data products. By balancing federated development with the requirement for interoperability and the ability to deploy at scale, the connected cloud data warehouse is fundamental to Data Mesh implementation in large and complex organisations and delivers the best of both worlds: agility in implementation and high performance in execution.
To learn more, check out the article from Stephen Brobst (CTO, Teradata) and Ron Tolido (EVP, CTO and Chief Innovation Officer, Insights and Data, Capgemini) titled An Engineering Approach to Data Mesh.