A Technical History
Tracing 35 years of metadata management—from the open repository dreams of the 1990s, through the proprietary enclosure of the 2000s-2010s, to the re-emergence of open standards in modern lakehouse architectures.
The 1990s represent the high-water mark of architectural vision in data management. Bill Inmon's Corporate Information Factory (CIF), first published in 1990, described a complete metadata-driven enterprise architecture that wouldn't be matched for three decades.
CASE tools like Oracle Designer, Rochade, and ADW stored models in open repositories—accessible databases you could query with SQL. Code generation wasn't a luxury; it was the norm. The semantic layer existed in Business Objects Universes and Cognos metacubes. Everything was connected through metadata.
The 1990s understood that metadata is the product. Data models weren't documentation—they were active assets that drove generation, validation, and business abstraction. The repository wasn't just storage; it was the source of truth that everything else derived from.
```
┌─────────────────────────────────────────────────────────────────┐
│                  CORPORATE INFORMATION FACTORY                  │
│                          (Inmon, 1990)                          │
└─────────────────────────────────────────────────────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   OLTP/ODS    │ ───▶ │      EDW      │ ───▶ │  Data Marts   │
│ (Operational) │      │ (3NF Atomic)  │      │ (Dimensional) │
└───────────────┘      └───────────────┘      └───────────────┘
        │                       │                       │
        └───────────────────────┼───────────────────────┘
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                   METADATA REPOSITORY (OPEN)                    │
│  ┌─────────────┬─────────────┬─────────────┬─────────────┐      │
│  │  Technical  │  Business   │ Operational │  Semantic   │      │
│  │  Metadata   │  Glossary   │  Metadata   │   Layer     │      │
│  └─────────────┴─────────────┴─────────────┴─────────────┘      │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
        CASE TOOLS → CODE GENERATION → SEMANTIC LAYER → BI
```
Inmon's vision: normalized atomic EDW → subject-area marts. The architecture presumed metadata would drive everything.
Oracle Designer, Rochade, ADW—stored models in queryable databases. You could SELECT from your data model.
Business Objects Universes mapped technical to business terms. Users queried concepts, not tables.
Models → DDL → Forms → Reports. Change the model, regenerate everything. Metadata-driven was the default.
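The "you could SELECT from your data model" point can be made concrete. Below is a hedged sketch using SQLite as a stand-in for a 1990s CASE repository; the `model_entities`/`model_attributes` schema is invented for illustration, not Oracle Designer's actual repository layout.

```python
import sqlite3

# Illustrative stand-in for a CASE repository: the data model itself
# lives in ordinary queryable tables (schema is hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE model_entities (entity_name TEXT, subject_area TEXT);
    CREATE TABLE model_attributes (entity_name TEXT, attr_name TEXT, datatype TEXT);
    INSERT INTO model_entities VALUES ('CUSTOMER', 'Party'), ('ORDERS', 'Sales');
    INSERT INTO model_attributes VALUES
        ('CUSTOMER', 'CUSTOMER_ID', 'NUMBER'),
        ('CUSTOMER', 'NAME', 'VARCHAR2'),
        ('ORDERS', 'ORDER_ID', 'NUMBER');
""")

# "SELECT from your data model": ask the repository which entities
# exist and how many attributes each carries.
rows = conn.execute("""
    SELECT e.entity_name, COUNT(a.attr_name)
    FROM model_entities e
    LEFT JOIN model_attributes a ON a.entity_name = e.entity_name
    GROUP BY e.entity_name
    ORDER BY e.entity_name
""").fetchall()
print(rows)  # [('CUSTOMER', 2), ('ORDERS', 1)]
```

Because the model is data, a generator can read the same rows to emit DDL, forms, or reports, which is exactly what made "change the model, regenerate everything" practical.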
MBI emerged from this environment—a system that read application data dictionaries and used template resolution to generate complete ETL/DW infrastructure. What CASE tools did for application code, MBI did for data pipelines. Both treated build specifications as data, not code.
The Universal Framework (1999) embodied the 1990s insight: everything at the boundary is data entry; templates are message bodies; resolution is just pattern substitution over namespaces.
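That insight—templates are message bodies, resolution is pattern substitution over namespaces—can be sketched in a few lines. This is not MBI's actual implementation; the placeholder names and SQL are invented for illustration.

```python
import string

# A template is just a message body with placeholders.
template = string.Template(
    "INSERT INTO ${target_schema}.${target_table}\n"
    "SELECT * FROM ${source_schema}.${source_table};"
)

# The "namespace": discovered metadata recorded as plain data, not code.
namespace = {
    "source_schema": "staging",
    "source_table": "customer",
    "target_schema": "dwh",
    "target_table": "hub_customer",
}

# Resolution is pure substitution of namespace values into the pattern.
sql = template.substitute(namespace)
print(sql)
```

The same template replayed over a different namespace yields a different pipeline step, which is why the build specification can live entirely in data.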
The 2000s witnessed the systematic enclosure of open metadata. Vendors acquired CASE tools and either killed them or locked them behind proprietary formats. ETL tools like Informatica and DataStage grew powerful but siloed—their repositories couldn't talk to each other.
The OMG's Common Warehouse Metamodel (CWM), released in 2001, represented the last serious attempt at open metadata interchange. It was technically sound—based on MOF/XMI/UML—but arrived just as vendors were moving in the opposite direction. CWM compliance became a checkbox, not a practice.
Oracle bought Hyperion and Siebel. IBM bought Cognos and SPSS. SAP bought Business Objects. Each acquisition closed an open ecosystem. The integration that metadata promised became the lock-in that vendors delivered.
```
┌─────────────────────────────────────────────────────────────────┐
│                 THE FRAGMENTED 2000s LANDSCAPE                  │
└─────────────────────────────────────────────────────────────────┘

┌─────────────┐       ┌─────────────┐       ┌─────────────┐
│ Informatica │       │  DataStage  │       │    SSIS     │
│ Repository  │       │ Repository  │       │ Repository  │
└──────┬──────┘       └──────┬──────┘       └──────┬──────┘
       │                     │                     │
       X ─────────────────── X ─────────────────── X
                   (No real interchange)
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│      CWM (2001) - Technically Sound, Practically Ignored        │
│   MOF + XMI + UML = Open Standard → Vendor Checkbox, Not Used   │
└─────────────────────────────────────────────────────────────────┘

Meanwhile:
┌─────────────────────────────────────────────────────────────────┐
│  Oracle buys Hyperion, Siebel   •   IBM buys Cognos, SPSS       │
│  SAP buys Business Objects  •  Open repositories → closed stacks│
└─────────────────────────────────────────────────────────────────┘
```
The Common Warehouse Metamodel deserves careful examination. Released in 2001, it defined standard interfaces for relational, multidimensional, OLAP, data mining, and transformation metadata. Based on UML for modeling, MOF for metamodeling, and XMI for interchange.
Co-submitters included IBM, Oracle, NCR, Unisys, Hyperion—all the major players. The spec was technically complete. But compliance didn't guarantee interoperability. Vendors implemented subsets, extended in incompatible ways, or simply claimed compliance without real support.
By 2010, CWM was effectively dead—not because it was wrong, but because vendor incentives pointed toward lock-in, not interchange.
| CWM Component | What It Defined | Why It Failed |
|---|---|---|
| Relational Model | Tables, columns, constraints, keys | Vendors extended incompatibly |
| Transformation Model | ETL job definitions, mappings | Too abstract for vendor specifics |
| OLAP Model | Cubes, dimensions, measures | BI vendors refused adoption |
| XMI Interchange | XML-based metadata exchange | Complex, verbose, poorly tooled |
While the industry fragmented, MBI exploited the one thing that remained constant: information_schema and vendor export formats were text. Informatica XML, DataStage DSX, Ab Initio graphs—all reducible to template patterns over namespaces.
The 2008 MBI/MKB breakthrough: Oracle PL/SQL reading data dictionaries, generating complete Data Vault warehouses. What CWM tried to standardize, MBI achieved through pattern recognition.
The API read source metadata → generated Hubs (from PKs), Links (from FKs), Satellites (from columns). No modeling tools. No consultants. Just pattern recognition over information schemas.
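The generation rules just described can be sketched directly. This is a hedged illustration of the pattern, not MBI's actual PL/SQL: Hubs derive from primary keys, Links from foreign keys, Satellites from the remaining descriptive columns. The table names and metadata shape are invented.

```python
# Source metadata as it might be read from information_schema
# (hypothetical tables for illustration).
source_metadata = {
    "CUSTOMER": {"pk": ["CUSTOMER_ID"], "fks": {},
                 "columns": ["CUSTOMER_ID", "NAME", "SEGMENT"]},
    "ORDERS":   {"pk": ["ORDER_ID"],
                 "fks": {"CUSTOMER": ["CUSTOMER_ID"]},
                 "columns": ["ORDER_ID", "CUSTOMER_ID", "AMOUNT"]},
}

def generate_data_vault(meta):
    objects = []
    for table, m in meta.items():
        objects.append(f"HUB_{table}({', '.join(m['pk'])})")   # Hub from PK
        for ref_table in m["fks"]:
            objects.append(f"LNK_{table}_{ref_table}")          # Link from FK
        # Satellite gets every column that is not part of a key.
        key_cols = set(m["pk"]) | {c for cols in m["fks"].values() for c in cols}
        sat_cols = [c for c in m["columns"] if c not in key_cols]
        if sat_cols:
            objects.append(f"SAT_{table}({', '.join(sat_cols)})")
    return objects

print(generate_data_vault(source_metadata))
```

Nothing here requires a modeling tool: the entire target model falls out of pattern recognition over keys and columns that the source system already declares.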
James Dixon of Pentaho coined "Data Lake" in 2010. The promise: dump everything, figure it out later. Schema-on-read instead of schema-on-write. The data warehouse was declared obsolete. Hadoop would solve everything.
The reality was different. Without enforced metadata, lakes became swamps. Without semantic layers, everyone built their own definitions. Without lineage, nobody knew where data came from. The industry spent a decade unlearning everything the 1990s knew.
The Data Lake movement rejected schema enforcement as "too rigid" without understanding why schemas existed. They weren't bureaucracy—they were contracts that enabled automation. Without them, every consumer had to rediscover structure, rebuild definitions, re-implement lineage.
```
┌─────────────────────────────────────────────────────────────────┐
│                       THE DATA LAKE ERA                         │
│                    “Store Now, Think Later”                     │
└─────────────────────────────────────────────────────────────────┘

   Sources                     Lake                   Consumers
┌───────────┐            ┌───────────┐            ┌───────────┐
│   OLTP    │ ─────────▶ │           │            │   Spark   │
└───────────┘            │           │            └───────────┘
┌───────────┐            │   HDFS    │            ┌───────────┐
│   Logs    │ ─────────▶ │    S3     │ ─────────▶ │   Hive    │
└───────────┘            │   ADLS    │            └───────────┘
┌───────────┐            │           │            ┌───────────┐
│   APIs    │ ─────────▶ │           │            │  Presto   │
└───────────┘            └───────────┘            └───────────┘
```

What's Missing:

```
┌────────────────────────────────────────────────────────────┐
│  ❌ No enforced schema        ❌ No semantic layer          │
│  ❌ No lineage                ❌ No business glossary       │
│  ❌ No ACID transactions      ❌ No metadata governance     │
└────────────────────────────────────────────────────────────┘
```

Result: DATA SWAMP
Distributed file system optimized for large files. No metadata layer, no schema enforcement, no ACID.
Retrofitted metadata—the industry's first admission that "schema-on-read" was insufficient.
S3, ADLS, GCS—cheap, infinite storage. But objects aren't tables. The impedance mismatch begins.
Better than MapReduce, but still required manual metadata management. DataFrames helped, but governance absent.
As data moved to lakes, the semantic layer fragmented into BI tools. Each tool—Tableau, Looker, Power BI—built its own layer. Definitions diverged. "Revenue" meant different things in different dashboards.
Looker's LookML represented an interesting counter-trend: semantic definitions as code. But it was proprietary to Looker, creating another silo rather than solving the problem.
The 1990s had Business Objects Universes as a shared semantic layer. The 2010s had every team defining their own metrics in their own tool. Progress in reverse.
The 2010s were frustrating for anyone who understood metadata. Organizations adopted "data lakes" while MBI clients were running self-building warehouses from recorded patterns. The industry chose manual notebook development over automated generation.
But one thing remained true: file ingestion patterns are finite. Schema detection is automatable. The metadata was always there—in information_schema, in Parquet headers, in Spark's inferred types. The tools just weren't reading it.
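The "schema detection is automatable" claim is easy to demonstrate. Below is a minimal sketch of type inference over a CSV sample; the column names and the int/float/bool ladder are illustrative, not any particular tool's algorithm.

```python
import csv
import io

# A sample of raw lake data whose schema was never declared.
sample = io.StringIO("order_id,amount,shipped\n1001,19.99,true\n1002,5.00,false\n")

def infer_type(values):
    # Try progressively looser numeric types; fall back to string.
    for name, cast in (("int", int), ("float", float)):
        try:
            [cast(v) for v in values]
            return name
        except ValueError:
            pass
    if all(v in ("true", "false") for v in values):
        return "bool"
    return "string"

reader = csv.reader(sample)
header = next(reader)
columns = list(zip(*reader))  # transpose rows into per-column value tuples
schema = {name: infer_type(col) for name, col in zip(header, columns)}
print(schema)  # {'order_id': 'int', 'amount': 'float', 'shipped': 'bool'}
```

Parquet makes this even easier, since the types sit in the file footer and never need to be inferred at all.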
Organizations were using Databricks (impressive engines!) like glorified Excel spreadsheets—manually configuring clusters, hand-writing notebooks, rebuilding the same patterns repeatedly.
By 2020, the industry had learned its lesson. The "Lakehouse" pattern emerged: lake storage + warehouse semantics. Delta Lake, Apache Iceberg, and Hudi added ACID transactions, schema evolution, and time travel to object storage.
Data catalogs (Alation, Collibra, Atlan) reintroduced the metadata repository concept. dbt brought transformations-as-code and lineage tracking. The semantic layer returned with Cube, MetricFlow, and eventually dbt's acquisition of Transform.
Lakehouse = Lake storage economics + Warehouse semantics. Open table formats brought back what the 2010s forgot: schemas matter, ACID matters, lineage matters. The industry was rediscovering 1990s truths with 2020s infrastructure.
```
┌─────────────────────────────────────────────────────────────────┐
│                     LAKEHOUSE ARCHITECTURE                      │
│                     “Best of Both Worlds”                       │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                   SEMANTIC LAYER (Returning)                    │
│  ┌─────────────┬─────────────┬─────────────────────────┐        │
│  │   Metrics   │  Entities   │    Business Glossary    │        │
│  │   (Cube)    │ (MetricFlow)│       (Catalog)         │        │
│  └─────────────┴─────────────┴─────────────────────────┘        │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                          DATA CATALOG                           │
│  ┌──────────────┬────────────┬─────────────┬─────────────┐      │
│  │   Lineage    │  Quality   │  Ownership  │   Access    │      │
│  │ (OpenLineage)│ (Great Ex.)│ (Policies)  │   (RBAC)    │      │
│  └──────────────┴────────────┴─────────────┴─────────────┘      │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│            OPEN TABLE FORMATS (ACID on Object Store)            │
│       ┌──────────────┬──────────────┬──────────────┐            │
│       │  Delta Lake  │   Iceberg    │     Hudi     │            │
│       │ (Databricks) │  (Netflix)   │    (Uber)    │            │
│       └──────────────┴──────────────┴──────────────┘            │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                  OBJECT STORAGE (S3/ADLS/GCS)                   │
│                   Parquet Files + Metadata                      │
└─────────────────────────────────────────────────────────────────┘
```
Open table formats adding ACID, schema evolution, time travel to object storage. Metadata files track table state.
Transformations as code with version control. Lineage from refs. The return of metadata-driven pipelines.
Open standard for lineage collection. The spiritual successor to CWM—but focused on operational metadata.
Alation, Collibra, Atlan, DataHub—the metadata repository returns, now called "catalog."
| Feature | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| Metadata Format | Transaction log (JSON) | Manifest files (Avro) | Timeline + metadata |
| Schema Evolution | ✓ | ✓ (Most complete) | ✓ |
| Time Travel | ✓ | ✓ | ✓ |
| Partition Evolution | Limited | ✓ (Hidden partitioning) | Limited |
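The "metadata files track table state" idea from the table above can be illustrated with Delta Lake's design: each commit appends JSON actions to `_delta_log/`, and replaying the log yields the current set of data files. The entries below are simplified for illustration, though the `add`/`remove` action names follow the Delta transaction log protocol.

```python
import json

# Simplified commit entries as they might appear in _delta_log/ (one
# JSON action per line; real entries carry many more fields).
log_entries = [
    '{"add": {"path": "part-0001.parquet", "size": 1024}}',
    '{"add": {"path": "part-0002.parquet", "size": 2048}}',
    '{"remove": {"path": "part-0001.parquet"}}',
]

def current_files(entries):
    # Replaying the log in order reconstructs table state - the same
    # mechanism that makes time travel possible (stop replaying early).
    files = set()
    for line in entries:
        action = json.loads(line)
        if "add" in action:
            files.add(action["add"]["path"])
        elif "remove" in action:
            files.discard(action["remove"]["path"])
    return files

print(current_files(log_entries))  # {'part-0002.parquet'}
```

The table format is, in other words, nothing but metadata driving storage semantics.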
The Lakehouse era validates every MBI principle. Open table formats are just metadata files driving storage semantics—the same pattern as MBI's control tables driving ETL generation. dbt's ref() function is namespace resolution. Jinja templates are recorded patterns.
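The "ref() is namespace resolution" claim can be made concrete. This is a hedged mini-model, not dbt's real implementation: a template references models by logical name, and resolution both substitutes the physical relation and records the lineage edge.

```python
import re

# The namespace: logical model names mapped to physical relations
# (names are invented for illustration).
models = {
    "stg_orders": "analytics.staging.stg_orders",
}

def render(model_name, sql_template, registry):
    lineage = []
    def resolve(match):
        ref_name = match.group(1)
        lineage.append((ref_name, model_name))  # recorded dependency edge
        return registry[ref_name]               # namespace lookup
    # Substitute every {{ ref('...') }} placeholder in one pass.
    sql = re.sub(r"\{\{\s*ref\('([^']+)'\)\s*\}\}", resolve, sql_template)
    return sql, lineage

sql, lineage = render("fct_revenue", "SELECT * FROM {{ ref('stg_orders') }}", models)
print(sql)      # SELECT * FROM analytics.staging.stg_orders
print(lineage)  # [('stg_orders', 'fct_revenue')]
```

One substitution pass yields both the compiled SQL and the dependency graph, which is why lineage in such systems comes for free rather than being bolted on.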
But there's a crucial gap: the engines are superb, the tooling is still manual. Organizations click through UIs and write notebooks instead of generating from metadata. The automation opportunity remains unseized.
2024-2025 marks a tectonic shift. Databricks open-sourced Unity Catalog. Snowflake launched Polaris (to be donated to Apache). The Iceberg REST Catalog API became the de facto standard for metadata interchange. Format wars are ending; interoperability is winning.
The semantic layer has returned with dbt's acquisition of Transform and the MetricFlow standard. Gartner now calls semantic technology "non-negotiable for AI success." We're approaching something that looks remarkably like Inmon's 1990 vision with better infrastructure.
The Iceberg REST Catalog API has become the "USB port" for metadata. What CWM should have been—but pragmatic (REST/JSON) rather than baroque (XMI/CORBA). Organizations can write tables with Databricks, read them from Snowflake, govern them with Unity Catalog, and query them from Trino.
```
┌─────────────────────────────────────────────────────────────────┐
│                2025+ OPEN CATALOG ARCHITECTURE                  │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                     AI / LLM APPLICATIONS                       │
│            (Semantic layer critical for AI accuracy)            │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    HEADLESS SEMANTIC LAYER                      │
│  ┌──────────────┬─────────────┬─────────────────────────┐       │
│  │ dbt Semantic │    Cube     │        AtScale          │       │
│  │    Layer     │    (OSS)    │         (SML)           │       │
│  └──────────────┴─────────────┴─────────────────────────┘       │
│           ↑ MetricFlow / SML Open Standards ↑                   │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                     UNIFIED CATALOG LAYER                       │
│      ┌───────────────┬──────────────┬──────────────┐            │
│      │ Unity Catalog │   Polaris    │    Nessie    │            │
│      │     (OSS)     │   (Apache)   │   (Dremio)   │            │
│      └───────────────┴──────────────┴──────────────┘            │
│           ↑ Iceberg REST Catalog API Standard ↑                 │
└─────────────────────────────────────────────────────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│  Delta Lake   │      │    Iceberg    │      │     Hudi      │
│  (UniForm)    │ ◄──▶ │  (Standard)   │ ◄──▶ │               │
└───────────────┘      └───────────────┘      └───────────────┘
        │                       │                       │
        └───────────────────────┼───────────────────────┘
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                   OBJECT STORAGE (Universal)                    │
│                 Parquet + Metadata Everywhere                   │
└─────────────────────────────────────────────────────────────────┘
```
Open-sourced by Databricks in June 2024. Universal interface for Delta/Iceberg/Hudi via REST APIs. Supports the HMS interface.
De facto standard API. Clients don't manage metadata files—catalog handles state. Enables credential vending.
MetricFlow, acquired with Transform in 2023. Metrics defined once, queried anywhere. The headless BI vision realized.
Write once, read from any format. Delta UniForm generates Iceberg metadata automatically. Format agnosticism arriving.
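The "USB port" nature of the Iceberg REST Catalog API comes from its small, predictable resource layout. The sketch below builds the spec's resource paths without making any network calls; the base URL and prefix are made up, and the percent-encoded unit separator (`%1F`) between namespace levels follows the published REST catalog OpenAPI spec.

```python
from urllib.parse import quote

class IcebergRestPaths:
    """Builds resource paths per the Iceberg REST Catalog API layout."""

    def __init__(self, base_url, prefix):
        # The prefix is a catalog-chosen routing segment (e.g. a warehouse id).
        self.root = f"{base_url}/v1/{quote(prefix, safe='')}"

    def namespaces(self):
        return f"{self.root}/namespaces"

    def table(self, namespace, table):
        # Multi-level namespaces are joined with the unit separator,
        # which percent-encodes to %1F.
        ns = quote("\x1f".join(namespace), safe="")
        return f"{self.root}/namespaces/{ns}/tables/{quote(table, safe='')}"

paths = IcebergRestPaths("https://catalog.example.com", "prod")
print(paths.table(["sales", "eu"], "orders"))
# https://catalog.example.com/v1/prod/namespaces/sales%1Feu/tables/orders
```

Because every engine speaks to the same handful of endpoints, the client never touches metadata files directly: the catalog owns table state, which is what makes cross-engine reads and credential vending possible.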
Large Language Models require semantic context to avoid hallucination. Without a semantic layer defining what metrics mean, AI systems generate plausible but wrong answers. This is driving rapid semantic layer adoption—the same need that drove Business Objects adoption in the 1990s, but with higher stakes.
Gartner now identifies semantic technology as "non-negotiable for AI success." The semantic layer enforces guardrails, ensuring AI systems query only approved, governed, and contextualized metrics.
The dbt MCP Server and dozens of integrations let you push governed, consistent definitions to embedded analytics, notebooks, spreadsheets, AI systems, and more. Every decision and experience built on trusted data.
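The guardrail role described above can be sketched in miniature: an AI assistant is only allowed to compile metrics that exist in the governed registry, and anything else is refused rather than improvised. The metric name and SQL below are invented for illustration.

```python
# The governed registry: one approved definition per metric.
governed_metrics = {
    "revenue": "SELECT SUM(amount) FROM fct_orders WHERE status = 'complete'",
}

def compile_metric(name):
    # Refuse anything outside the semantic layer instead of letting a
    # language model generate plausible-but-wrong SQL on its own.
    if name not in governed_metrics:
        raise KeyError(f"'{name}' is not a governed metric")
    return governed_metrics[name]

print(compile_metric("revenue"))
```

The point is the failure mode: an ungoverned system hallucinates a definition, while a semantic layer returns either the approved one or an explicit error.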
The 2025 landscape validates MBI's core insight: everything reducible to metadata should be driven by metadata. Unity Catalog's managed tables use AI-driven optimization (automatic clustering, file compaction)—deterministic operations from recorded patterns, exactly like MBI's template resolution.
The gap that remains: even with superb catalogs and semantic layers, organizations still manually configure pipelines. The framework pattern—recorded expert knowledge replayed over discovered namespaces—remains the industry's blind spot.
The future isn't more tools. It's recognizing that the tools themselves should be generated from the same metadata they manage.
35 years of metadata management: from vision to fragmentation to rediscovery
| Concept | 1990s Term | 2025 Term | What Changed |
|---|---|---|---|
| Metadata Repository | CASE Repository | Data Catalog | Same concept, cloud-native, better UI |
| Business Abstraction | Semantic Layer / Universe | Headless BI / Metrics Layer | Same concept, API-first, decoupled from BI tool |
| Metadata Interchange | CWM / XMI | Iceberg REST / OpenLineage | Same goal, REST/JSON instead of CORBA/XML |
| Code Generation | CASE Tools | dbt / Terraform / IaC | Same pattern, different domains |
| Schema on Write | Data Warehouse (default) | Open Table Formats | Rediscovered after Lake failures |
| Lineage Tracking | CWM Transformation Model | OpenLineage | Same need, operational focus, open standard |
The data industry didn't advance from CIF to Lakehouse—it regressed and recovered. We spent two decades fragmenting what was integrated, enclosing what was open, and manually building what should have been generated. The 2020s "innovations" are largely rediscoveries of 1990s patterns with better infrastructure.
What remains unrealized is the deeper insight: if metadata drives catalogs, table formats, semantic layers, and lineage—it should also drive pipeline generation. The frameworks that treat build specifications as data, that record expert patterns and replay them over discovered namespaces, remain the industry's blind spot.
The future isn't more tools. It's recognizing that the tools themselves should be generated from the same metadata they manage.