Data Mesh and Data Virtualization are not the Same Thing
The Data Mesh approach to enterprise data architecture has many benefits, but there is a widespread misunderstanding that will significantly limit those benefits for anyone who holds it.
Data Mesh advocates a “divide and conquer” approach to enterprise data. The idea is to split up the work so that data products within each “domain” (such as credit card, mortgage, and retail banking within a financial services firm) can be deployed as independently as possible by minimizing the amount of coordination required across domains. This approach enables faster delivery through parallelization of work, improved quality by allowing experts within each domain to have greater control of their data products, and reduced complexity by exposing data through easy-to-access, standard mechanisms.
Since data virtualization technology (such as Teradata QueryGrid, Presto, and others) is designed to provide seamless access across semi-independent data products, it clearly has a crucial role to play in the Data Mesh approach.
But Data Mesh and data virtualization technologies are not the same thing.
Treating data virtualization as if it is Data Mesh misses the opportunity for a more nuanced approach that is essential in any large enterprise. This narrow view could result in redundant and overlapping data and the inability to semantically link data across domains for initiatives that require this linkage, such as omni-channel customer experience or supply chain optimization. Another result could be sub-optimal performance of applications and analytics that join high-volume and complex data structures. Data virtualization technologies simply can’t bend the laws of physics, no matter how much they have matured.
To gain the full value of Data Mesh, we recommend three patterns for organizing domains, all of which are useful within any large organization, and only one of which involves virtualization technology:
Co-located and connected. In this pattern, database structures are semantically related (customer data meaningfully links to sales data, which links to product data, and so on) and reside in the same database instance. The schemas associated with the various objects are still developed semi-independently by the teams within each domain, but the objects themselves contribute to a single data store, like multiple retail outlets in the same shopping mall. Storing separate (but related) schemas in the same database instance allows for good performance across a wide variety of demanding, business-critical applications accessing high-volume, complex, frequently joined data. In fact, there is no other way to meet performance requirements in this situation.
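To make this concrete, here is a minimal sketch of the co-located pattern using SQLite as a stand-in for a shared enterprise platform. The domain names, schema names, and tables are hypothetical, chosen only to illustrate two domain teams contributing independently developed schemas to one database instance:

```python
import sqlite3

# One database instance (SQLite emulates named schemas via ATTACH).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS credit_card")
conn.execute("ATTACH DATABASE ':memory:' AS retail_banking")

# Each domain team develops its own schema semi-independently ...
conn.execute("CREATE TABLE credit_card.transactions (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE retail_banking.customers (customer_id INTEGER, name TEXT)")

conn.execute("INSERT INTO retail_banking.customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO credit_card.transactions VALUES (1, 42.50)")

# ... but because the schemas are co-located and semantically linked on a
# shared customer key, a cross-domain join runs inside a single engine.
rows = conn.execute("""
    SELECT c.name, SUM(t.amount)
    FROM retail_banking.customers AS c
    JOIN credit_card.transactions AS t ON t.customer_id = c.customer_id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('Ada', 42.5)]
```

Because the join executes within one optimizer and one storage engine, no data has to move between platforms at query time, which is what makes this pattern suitable for high-volume, frequently joined data.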
Distributed and connected. This pattern uses virtualization technology to link data that resides on separate platforms (separate clouds, separate clusters, separate database software, etc.). But the term “connected” means more than just providing physical access to multiple data products through a common interface. It also means ensuring that the data makes sense when it’s joined; that is, the data should be semantically linked, as in the previous pattern. That doesn’t happen automatically through virtualization: it takes planning and coordination across domains. This pattern is useful when co-locating data physically is impractical or infeasible for regulatory or other reasons, or when co-location would provide little performance benefit given the nature of the data and how it’s accessed.
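By way of contrast, here is a minimal, purely illustrative sketch of the distributed pattern. Two separate SQLite connections stand in for data products on separate platforms, and a few lines of Python stand in for the federation layer (in a real deployment that layer would be a virtualization technology such as QueryGrid or Presto). The sketch assumes the two domains have already agreed that customer_id identifies the same real-world customer; without that semantic agreement, the join would be meaningless no matter how good the connectivity:

```python
import sqlite3

mortgage = sqlite3.connect(":memory:")  # e.g., lives on platform A
retail = sqlite3.connect(":memory:")    # e.g., lives on platform B

mortgage.execute("CREATE TABLE loans (customer_id INTEGER, balance REAL)")
retail.execute("CREATE TABLE customers (customer_id INTEGER, segment TEXT)")
mortgage.execute("INSERT INTO loans VALUES (1, 250000.0)")
retail.execute("INSERT INTO customers VALUES (1, 'premium')")

# The "federation layer": rows are pulled from each platform and joined
# centrally on the conformed key the domains agreed on (customer_id).
loans = dict(mortgage.execute("SELECT customer_id, balance FROM loans"))
for cust_id, segment in retail.execute("SELECT customer_id, segment FROM customers"):
    if cust_id in loans:
        print(cust_id, segment, loans[cust_id])  # 1 premium 250000.0
```

Notice that the rows are shipped out of both engines and joined in the middle. That data movement is exactly why this pattern can underperform co-location for high-volume, frequently joined data, no matter how mature the virtualization technology.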
Isolated. This pattern is useful for domains or groups of domains that rarely, if ever, need to be linked to other domains. For example, data within independent business units (no cross-marketing, no shared customer lists, no common supply chain elements, etc.) of a larger holding company might fit this pattern, although domains within such business units may still leverage the previous two patterns. Isolated domains may also (unfortunately) arise when the political hurdles of integrating a domain into the wider ecosystem present an insurmountable challenge, even if the business benefits outweigh the coordination effort.
As you can see, care and thought are required when planning how data products should be architected when applying the Data Mesh philosophy. Thinking only in terms of linking data through virtualization is short-sighted at best and diminishes the essential value that virtualization does provide. Applying all three patterns appropriately, driven by the needs of domain-specific and cross-domain business initiatives, enables all the benefits of distributed development while simultaneously accelerating (not inhibiting) deployment of rational, integrated, and trusted data across the enterprise.
Please see this white paper for a deeper dive into Teradata’s viewpoint on Data Mesh.