A Technical History

The Arc of Metadata: From CIF to Lakehouse Catalogs

Tracing 35 years of metadata management—from the open repository dreams of the 1990s, through the proprietary enclosure of the 2000s-2010s, to the re-emergence of open standards in modern lakehouse architectures.
90s: Open Repositories → 00s: Proprietary Enclosure → 10s: Lake Deluge → 20s: Lakehouse Synthesis → 25+: Open Catalogs
90s
Generation 1

The Golden Age of Open Metadata

CIF · CASE Tools · Open Repositories · Semantic Layers

The 1990s represent the high-water mark of architectural vision in data management. Bill Inmon's Corporate Information Factory (CIF), first published in 1990, described a complete metadata-driven enterprise architecture that wouldn't be matched for three decades.

CASE tools like Oracle Designer, Rochade, and ADW stored models in open repositories—accessible databases you could query with SQL. Code generation wasn't a luxury; it was the norm. The semantic layer existed in Business Objects Universes and Cognos metacubes. Everything was connected through metadata.

Key Insight

The 1990s understood that metadata is the product. Data models weren't documentation—they were active assets that drove generation, validation, and business abstraction. The repository wasn't just storage; it was the source of truth that everything else derived from.

```

┌─────────────────────────────────────────────────────────────────┐
│                  CORPORATE INFORMATION FACTORY                  │
│                          (Inmon, 1990)                          │
└─────────────────────────────────────────────────────────────────┘
                                │
    ┌───────────────────────────┼───────────────────────────┐
    ▼                           ▼                           ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   OLTP/ODS    │ ───▶ │      EDW      │ ───▶ │  Data Marts   │
│ (Operational) │      │ (3NF Atomic)  │      │ (Dimensional) │
└───────────────┘      └───────────────┘      └───────────────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                   METADATA REPOSITORY (OPEN)                    │
│    ┌─────────────┬─────────────┬─────────────┬─────────────┐    │
│    │  Technical  │   Business  │ Operational │  Semantic   │    │
│    │  Metadata   │   Glossary  │   Metadata  │    Layer    │    │
│    └─────────────┴─────────────┴─────────────┴─────────────┘    │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
        CASE TOOLS → CODE GENERATION → SEMANTIC LAYER → BI
```

Corporate Information Factory

Inmon's vision: normalized atomic EDW → subject-area marts. The architecture presumed metadata would drive everything.

Inmon · 3NF · Subject-Oriented

The CIF defined a complete enterprise data architecture: ODS for operational reporting, atomic EDW for enterprise integration, and dimensional marts for departmental analytics. Each layer had defined metadata flows to the next.

CASE Tool Repositories

Oracle Designer, Rochade, ADW—stored models in queryable databases. You could SELECT from your data model.

Oracle Designer · CDIF · Repository API

CASE repositories were open relational databases. Models weren't locked in proprietary formats—they were tables you could query, join, and build automation on. This enabled code generation at scale.
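A toy sketch of what that openness meant in practice: when the repository is just relational tables, the model itself is queryable. The schema below (`repo_entities`, `repo_attributes`) is invented for illustration, not any real CASE tool's layout.

```python
# A miniature "open repository": the data model lives in ordinary tables,
# so you can literally SELECT from your data model.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE repo_entities (entity_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE repo_attributes (
        entity_id INTEGER, name TEXT, datatype TEXT, is_pk INTEGER
    );
    INSERT INTO repo_entities VALUES (1, 'CUSTOMER'), (2, 'ORDERS');
    INSERT INTO repo_attributes VALUES
        (1, 'CUSTOMER_ID', 'NUMBER', 1),
        (1, 'NAME', 'VARCHAR2(100)', 0),
        (2, 'ORDER_ID', 'NUMBER', 1),
        (2, 'CUSTOMER_ID', 'NUMBER', 0);
""")

# "SELECT from your data model": which entities declare a primary key?
rows = conn.execute("""
    SELECT e.name, a.name
    FROM repo_entities e JOIN repo_attributes a USING (entity_id)
    WHERE a.is_pk = 1
    ORDER BY e.name
""").fetchall()
print(rows)  # [('CUSTOMER', 'CUSTOMER_ID'), ('ORDERS', 'ORDER_ID')]
```

Because the model is data, automation (DDL generation, validation, impact analysis) is just more queries over these tables.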

Semantic Layer

Business Objects Universes mapped technical to business terms. Users queried concepts, not tables.

Business Objects · Cognos · Metacube

The semantic layer provided business abstraction: "Revenue" instead of SUM(order_lines.amount). Users could drag-and-drop business concepts without knowing SQL or table structures. This was "no-code BI" 30 years ago.
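A minimal sketch of that abstraction, assuming an invented mapping table: business objects resolve to SQL fragments, so the query is assembled from concepts rather than written by hand.

```python
# Universe-style semantic layer in miniature: users pick "Revenue",
# never SUM(order_lines.amount). All names here are illustrative.
SEMANTIC_LAYER = {
    "Revenue":       "SUM(order_lines.amount)",
    "Order Count":   "COUNT(DISTINCT orders.order_id)",
    "Customer Name": "customers.name",
}

def build_query(measures, dimensions, from_clause):
    """Assemble SQL from business objects instead of hand-written SQL."""
    select = [f'{SEMANTIC_LAYER[obj]} AS "{obj}"'
              for obj in dimensions + measures]
    group_by = [SEMANTIC_LAYER[d] for d in dimensions]
    return (f"SELECT {', '.join(select)} FROM {from_clause}"
            f" GROUP BY {', '.join(group_by)}")

sql = build_query(
    measures=["Revenue"],
    dimensions=["Customer Name"],
    from_clause="customers JOIN orders USING (customer_id) "
                "JOIN order_lines USING (order_id)",
)
print(sql)
```

The point is the separation: the mapping is governed once, centrally, and every generated query inherits the same definition of "Revenue".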

Code Generation

Models → DDL → Forms → Reports. Change the model, regenerate everything. Metadata-driven was the default.

Forms Gen · Reports Gen · DDL Gen

Code generation was mainstream. Define entities in the model, generate Oracle Forms, Reports, and database DDL. When requirements changed, update the model and regenerate. This is what "infrastructure as code" looked like in 1995.
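The loop described above fits in a few lines: the model is data, DDL is generated from it, and "update the model and regenerate" is the whole workflow. The model structure below is illustrative.

```python
# Metadata-driven DDL generation: change MODEL, regenerate everything.
MODEL = {
    "CUSTOMER": [
        ("CUSTOMER_ID", "NUMBER", True),    # (name, datatype, is_pk)
        ("NAME", "VARCHAR2(100)", False),
    ],
}

def generate_ddl(model):
    """Emit CREATE TABLE statements from the model metadata."""
    statements = []
    for entity, attrs in model.items():
        cols = [f"  {name} {dtype}" for name, dtype, _ in attrs]
        pk = [name for name, _, is_pk in attrs if is_pk]
        cols.append(f"  PRIMARY KEY ({', '.join(pk)})")
        statements.append(
            f"CREATE TABLE {entity} (\n" + ",\n".join(cols) + "\n);")
    return "\n".join(statements)

print(generate_ddl(MODEL))
```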

MBI Emerged From This World

MBI emerged from this environment—a system that read application data dictionaries and used template resolution to generate complete ETL/DW infrastructure. What CASE tools did for application code, MBI did for data pipelines. Both treated build specifications as data, not code.

The Universal Framework (1999) embodied the 1990s insight: everything at the boundary is data entry; templates are message bodies; resolution is just pattern substitution over namespaces.
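That resolution idea can be sketched in a dozen lines. This assumes a flat namespace of string values, where values may themselves be templates; it illustrates the pattern only, not the actual Universal Framework implementation.

```python
# Resolution as pattern substitution over a namespace: a template is a
# message body, and {key} references resolve recursively.
import re

def resolve(template, namespace):
    """Replace every {key} with its value; values may themselves be templates."""
    def sub(match):
        return resolve(namespace[match.group(1)], namespace)
    return re.sub(r"\{(\w+)\}", sub, template)

namespace = {
    "schema": "stage",
    "table":  "customer",
    "target": "{schema}.{table}",   # a value that is itself a template
}
print(resolve("INSERT INTO {target} SELECT * FROM src_{table};", namespace))
# → INSERT INTO stage.customer SELECT * FROM src_customer;
```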

00s
Generation 2

The Enclosure & Fragmentation

Proprietary ETL · Vendor Lock-in · CWM Stillborn · Knowledge Dispersal

The 2000s witnessed the systematic enclosure of open metadata. Vendors acquired CASE tools and either killed them or locked them behind proprietary formats. ETL tools like Informatica and DataStage grew powerful but siloed—their repositories couldn't talk to each other.

The OMG's Common Warehouse Metamodel (CWM), released in 2001, represented the last serious attempt at open metadata interchange. It was technically sound—based on MOF/XMI/UML—but arrived just as vendors were moving in the opposite direction. CWM compliance became a checkbox, not a practice.

The Enclosure Pattern

Oracle bought Hyperion and Siebel. IBM bought Cognos and SPSS. SAP bought Business Objects. Each acquisition closed an open ecosystem. The integration that metadata promised became the lock-in that vendors delivered.

```

┌─────────────────────────────────────────────────────────────────┐
│                  THE FRAGMENTED 2000s LANDSCAPE                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Informatica │     │  DataStage  │     │    SSIS     │
│ Repository  │     │ Repository  │     │ Repository  │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       X ───────────────── X ───────────────── X
              (No real interchange)
       ▼                                       ▼
┌─────────────────────────────────────────────────────────────────┐
│        CWM (2001) - Technically Sound, Practically Ignored      │
│   MOF + XMI + UML = Open Standard → Vendor Checkbox, Not Used   │
└─────────────────────────────────────────────────────────────────┘

Meanwhile:
┌────────────────────────────────────────────────────────────────┐
│  Oracle buys Hyperion, Siebel • IBM buys Cognos, SPSS          │
│  SAP buys Business Objects • Open repositories → closed stacks │
└────────────────────────────────────────────────────────────────┘
```

What Was Lost

  • Open Repository Access — CASE tools could query their own models; ETL tools couldn't
  • Cross-Tool Metadata Flow — No practical way to share lineage between vendors
  • Code Generation Culture — Manual ETL development became "best practice"
  • Semantic Layer Portability — Business Objects/Cognos became isolated islands
  • Metadata-Driven Architecture — Hard-coded pipelines replaced generated ones

The Common Warehouse Metamodel deserves careful examination. Released in 2001, it defined standard interfaces for relational, multidimensional, OLAP, data mining, and transformation metadata, based on UML for modeling, MOF for metamodeling, and XMI for interchange.

Co-submitters included IBM, Oracle, NCR, Unisys, Hyperion—all the major players. The spec was technically complete. But compliance didn't guarantee interoperability. Vendors implemented subsets, extended in incompatible ways, or simply claimed compliance without real support.

By 2010, CWM was effectively dead—not because it was wrong, but because vendor incentives pointed toward lock-in, not interchange.

| CWM Component | What It Defined | Why It Failed |
|---|---|---|
| Relational Model | Tables, columns, constraints, keys | Vendors extended incompatibly |
| Transformation Model | ETL job definitions, mappings | Too abstract for vendor specifics |
| OLAP Model | Cubes, dimensions, measures | BI vendors refused adoption |
| XMI Interchange | XML-based metadata exchange | Complex, verbose, poorly tooled |

MBI Exploited What Remained Constant

While the industry fragmented, MBI exploited the one thing that remained constant: information_schema and vendor export formats were text. Informatica XML, DataStage DSX, Ab Initio graphs—all reducible to template patterns over namespaces.

The 2008 MBI/MKB breakthrough: Oracle PL/SQL reading data dictionaries, generating complete Data Vault warehouses. What CWM tried to standardize, MBI achieved through pattern recognition.

The API read source metadata → generated Hubs (from PKs), Links (from FKs), Satellites (from columns). No modeling tools. No consultants. Just pattern recognition over information schemas.
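The derivation rules are mechanical enough to sketch directly. This toy version assumes a simplified metadata shape (pk, fks, cols per table) rather than a real information_schema, and only derives object names; it illustrates the pattern, not the actual MBI/MKB code.

```python
# Data Vault derivation from PK/FK metadata:
#   Hubs from PKs, Links from FKs, Satellites from the remaining columns.
TABLES = {
    "CUSTOMER": {"pk": ["CUSTOMER_ID"], "fks": {},
                 "cols": ["CUSTOMER_ID", "NAME", "SEGMENT"]},
    "ORDERS":   {"pk": ["ORDER_ID"], "fks": {"CUSTOMER_ID": "CUSTOMER"},
                 "cols": ["ORDER_ID", "CUSTOMER_ID", "AMOUNT"]},
}

def derive_vault(tables):
    hubs = [f"HUB_{t}" for t in tables]
    links = [f"LNK_{t}_{ref}" for t, m in tables.items()
             for ref in m["fks"].values()]
    sats = {f"SAT_{t}": [c for c in m["cols"]
                         if c not in m["pk"] and c not in m["fks"]]
            for t, m in tables.items()}
    return hubs, links, sats

hubs, links, sats = derive_vault(TABLES)
print(hubs)   # ['HUB_CUSTOMER', 'HUB_ORDERS']
print(links)  # ['LNK_ORDERS_CUSTOMER']
print(sats)   # {'SAT_CUSTOMER': ['NAME', 'SEGMENT'], 'SAT_ORDERS': ['AMOUNT']}
```

Everything downstream (DDL, loaders, lineage) can be generated from these derived names plus templates, which is the whole point of the approach.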

10s
Generation 3

The Data Lake Deluge

Hadoop · Schema-on-Read · Data Swamp · Semantic Collapse

James Dixon of Pentaho coined "Data Lake" in 2010. The promise: dump everything, figure it out later. Schema-on-read instead of schema-on-write. The data warehouse was declared obsolete. Hadoop would solve everything.

The reality was different. Without enforced metadata, lakes became swamps. Without semantic layers, everyone built their own definitions. Without lineage, nobody knew where data came from. The industry spent a decade unlearning everything the 1990s knew.

The Core Problem

The Data Lake movement rejected schema enforcement as "too rigid" without understanding why schemas existed. They weren't bureaucracy—they were contracts that enabled automation. Without them, every consumer had to rediscover structure, rebuild definitions, re-implement lineage.

```

┌─────────────────────────────────────────────────────────────────┐
│                    THE DATA LAKE ERA                             │
│                 “Store Now, Think Later”                         │
└─────────────────────────────────────────────────────────────────┘

        Sources                    Lake                    Consumers
    ┌───────────┐              ┌───────────┐              ┌───────────┐
    │   OLTP    │ ──────────▶ │           │              │  Spark    │
    └───────────┘              │           │              └───────────┘
    ┌───────────┐              │   HDFS    │              ┌───────────┐
    │   Logs    │ ──────────▶ │    S3     │ ──────────▶ │   Hive    │
    └───────────┘              │   ADLS    │              └───────────┘
    ┌───────────┐              │           │              ┌───────────┐
    │   APIs    │ ──────────▶ │           │              │  Presto   │
    └───────────┘              └───────────┘              └───────────┘

```
                    What's Missing:
```

    ┌────────────────────────────────────────────────────────────┐
    │  ❌ No enforced schema        ❌ No semantic layer         │
    │  ❌ No lineage                ❌ No business glossary      │
    │  ❌ No ACID transactions      ❌ No metadata governance    │
    └────────────────────────────────────────────────────────────┘

```
                    Result: DATA SWAMP

Hadoop/HDFS

Distributed file system optimized for large files. No metadata layer, no schema enforcement, no ACID.

MapReduce · YARN · Batch

Hadoop was designed for batch processing of log files at internet scale. It was never designed for analytical queries, schema management, or ACID transactions. The industry tried to use it for everything anyway.

Hive Metastore

Retrofitted metadata—the industry's first admission that "schema-on-read" was insufficient.

HMS · Thrift · Partitions

Hive Metastore (HMS) was the industry admitting defeat on "schema-on-read." You need somewhere to store table definitions, partition locations, and column types. HMS became the de facto standard—and remains so in many catalogs today.
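What HMS fundamentally holds can be sketched as a small in-memory store. The real metastore is a Thrift service backed by an RDBMS; this only shows the shape of the metadata, and the names and paths are invented.

```python
# Toy metastore: table definitions plus partition locations, keyed by name.
metastore = {}

def create_table(name, columns, location):
    metastore[name] = {"columns": columns, "location": location,
                       "partitions": {}}

def add_partition(table, spec, location):
    metastore[table]["partitions"][spec] = location

create_table("events",
             {"user_id": "bigint", "ts": "timestamp", "dt": "string"},
             "s3://lake/events/")
add_partition("events", "dt=2024-01-01", "s3://lake/events/dt=2024-01-01/")

# Query planning starts here: where does partition dt=2024-01-01 live?
print(metastore["events"]["partitions"]["dt=2024-01-01"])
# → s3://lake/events/dt=2024-01-01/
```

Without this lookup, every engine would have to list and sniff raw files, which is exactly what "schema-on-read" turned out to mean in practice.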

Cloud Object Storage

S3, ADLS, GCS—cheap, infinite storage. But objects aren't tables. The impedance mismatch begins.

S3 · ADLS · GCS

Cloud object storage offered unlimited capacity at low cost. But objects aren't rows. You can't UPDATE an object. This fundamental impedance mismatch drove the need for table formats that would emerge in the 2020s.

Spark

Better than MapReduce, but still required manual metadata management. DataFrames helped, but governance absent.

RDDs · DataFrames · SQL

Spark dramatically improved the lake experience with in-memory processing and DataFrame APIs. But it couldn't solve the fundamental metadata problem—that was left to each organization to figure out themselves.

As data moved to lakes, the semantic layer fragmented into BI tools. Each tool—Tableau, Looker, Power BI—built its own layer. Definitions diverged. "Revenue" meant different things in different dashboards.

Looker's LookML represented an interesting counter-trend: semantic definitions as code. But it was proprietary to Looker, creating another silo rather than solving the problem.

The 1990s had Business Objects Universes as a shared semantic layer. The 2010s had every team defining their own metrics in their own tool. Progress in reverse.

Semantic Layer Fragmentation

  • Tableau — Each workbook defined its own calculations
  • Looker — LookML was powerful but Looker-only
  • Power BI — DAX measures duplicated across reports
  • Result — Five people, five reports, five different "Revenue" numbers

The Automation Opportunity Unseized

The 2010s were frustrating for anyone who understood metadata. Organizations adopted "data lakes" while MBI clients were running self-building warehouses from recorded patterns. The industry chose manual notebook development over automated generation.

But one thing remained true: file ingestion patterns are finite. Schema detection is automatable. The metadata was always there—in information_schema, in Parquet headers, in Spark's inferred types. The tools just weren't reading it.
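The claim that schema detection is automatable is easy to demonstrate with a deliberately minimal type-inference pass in the spirit of Spark's schema inference (the widening rules here are simplified assumptions):

```python
# Infer column types from sample records: any string forces string,
# otherwise double beats bigint (a simplified widening rule).
def infer_type(values):
    types = set()
    for v in values:
        if v is None:
            continue
        try:
            int(v); types.add("bigint")
        except (TypeError, ValueError):
            try:
                float(v); types.add("double")
            except (TypeError, ValueError):
                types.add("string")
    for t in ("string", "double", "bigint"):
        if t in types:
            return t
    return "string"

rows = [{"id": "1", "amount": "9.5", "city": "Oslo"},
        {"id": "2", "amount": "3",   "city": None}]
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)  # {'id': 'bigint', 'amount': 'double', 'city': 'string'}
```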

Organizations were using Databricks (impressive engines!) like glorified Excel spreadsheets—manually configuring clusters, hand-writing notebooks, rebuilding the same patterns repeatedly.

20s
Generation 4

The Lakehouse Synthesis

Open Table Formats · Data Catalogs · dbt · Return of Structure

By 2020, the industry had learned its lesson. The "Lakehouse" pattern emerged: lake storage + warehouse semantics. Delta Lake, Apache Iceberg, and Hudi added ACID transactions, schema evolution, and time travel to object storage.

Data catalogs (Alation, Collibra, Atlan) reintroduced the metadata repository concept. dbt brought transformations-as-code and lineage tracking. The semantic layer returned with Cube, MetricFlow, and eventually dbt's acquisition of Transform.

The Synthesis

Lakehouse = Lake storage economics + Warehouse semantics. Open table formats brought back what the 2010s forgot: schemas matter, ACID matters, lineage matters. The industry was rediscovering 1990s truths with 2020s infrastructure.

```

┌─────────────────────────────────────────────────────────────────┐
│                     LAKEHOUSE ARCHITECTURE                      │
│                    “Best of Both Worlds”                        │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                   SEMANTIC LAYER (Returning)                    │
│     ┌─────────────┬─────────────┬─────────────────────────┐     │
│     │   Metrics   │   Entities  │    Business Glossary    │     │
│     │   (Cube)    │ (MetricFlow)│        (Catalog)        │     │
│     └─────────────┴─────────────┴─────────────────────────┘     │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                          DATA CATALOG                           │
│   ┌──────────────┬─────────────┬─────────────┬─────────────┐    │
│   │   Lineage    │   Quality   │  Ownership  │   Access    │    │
│   │(OpenLineage) │ (Great Ex.) │ (Policies)  │   (RBAC)    │    │
│   └──────────────┴─────────────┴─────────────┴─────────────┘    │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│            OPEN TABLE FORMATS (ACID on Object Store)            │
│         ┌──────────────┬──────────────┬──────────────┐          │
│         │  Delta Lake  │   Iceberg    │     Hudi     │          │
│         │ (Databricks) │   (Netflix)  │    (Uber)    │          │
│         └──────────────┴──────────────┴──────────────┘          │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                  OBJECT STORAGE (S3/ADLS/GCS)                   │
│                   Parquet Files + Metadata                      │
└─────────────────────────────────────────────────────────────────┘
```

Delta Lake / Iceberg / Hudi

Open table formats adding ACID, schema evolution, time travel to object storage. Metadata files track table state.

Parquet · ACID · Time Travel

All three formats solve the same problem: ACID on object storage. They do this through metadata—manifest files tracking which Parquet files constitute the current table state. This is metadata management—the same concept the 1990s understood.
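The shared mechanism can be sketched as a toy log of add/remove-file actions. This is closest to Delta's JSON transaction log and heavily simplified; real formats add checkpoints, statistics, and concurrency control.

```python
# A table as a metadata log: each commit adds/removes data files, and any
# version of the table can be rebuilt by replaying the log.
log = [
    [{"add": "part-000.parquet"}],              # version 0
    [{"add": "part-001.parquet"}],              # version 1
    [{"remove": "part-000.parquet"},
     {"add": "part-002.parquet"}],              # version 2 (compaction)
]

def snapshot(version):
    """Replay the log up to `version` — this is time travel."""
    files = set()
    for commit in log[:version + 1]:
        for action in commit:
            if "add" in action:
                files.add(action["add"])
            if "remove" in action:
                files.discard(action["remove"])
    return sorted(files)

print(snapshot(1))  # ['part-000.parquet', 'part-001.parquet']
print(snapshot(2))  # ['part-001.parquet', 'part-002.parquet']
```

ACID on object storage falls out of this design: readers see whichever log version they started from, and a commit is a single atomic append to the log.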

dbt (Data Build Tool)

Transformations as code with version control. Lineage from refs. The return of metadata-driven pipelines.

SQL · Jinja · DAG

dbt's ref() function is namespace resolution. Jinja templates are recorded patterns. The DAG is derived from metadata. dbt rediscovered what MBI knew: transformations should be generated from declared relationships, not hand-coded.
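Both claims are easy to make concrete. Below is a minimal ref() resolver with invented model SQL: the dependency DAG falls out of the refs alone, and compilation is namespace substitution (the topological sort uses Python's stdlib graphlib; this is a sketch of the idea, not dbt's implementation).

```python
# ref() as namespace resolution: extract refs, derive the DAG, substitute.
import re
from graphlib import TopologicalSorter  # Python 3.9+

models = {
    "stg_orders":    "SELECT * FROM raw.orders",
    "stg_customers": "SELECT * FROM raw.customers",
    "fct_revenue":   "SELECT * FROM {{ ref('stg_orders') }} "
                     "JOIN {{ ref('stg_customers') }} USING (customer_id)",
}

REF = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")

# The DAG is derived from metadata, never declared by hand:
graph = {name: set(REF.findall(sql)) for name, sql in models.items()}
order = list(TopologicalSorter(graph).static_order())
print(order)  # staging models first, fct_revenue last

# Compilation: each ref() resolves to a relation in the target namespace.
compiled = REF.sub(lambda m: "analytics." + m.group(1),
                   models["fct_revenue"])
print(compiled)
```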

OpenLineage

Open standard for lineage collection. The spiritual successor to CWM—but focused on operational metadata.

LF AI · Airflow · Spark

OpenLineage is what CWM should have been: pragmatic, operational, and REST-based. It defines jobs, runs, and datasets with extensible facets. Since 2020, it's become the industry standard for data lineage—proving the 1990s vision was right.
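A run event in roughly the OpenLineage shape, sketched as plain JSON. This is simplified against the published spec (facets, producer, and schemaURL are omitted), and the namespace/name values are invented; the endpoint in the comment is Marquez's, named as one example backend.

```python
# Minimal OpenLineage-style run event: a job, a run, input and output
# datasets. Real events carry facets and required producer/schemaURL fields.
import json
import uuid
from datetime import datetime, timezone

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "daily_revenue"},
    "inputs":  [{"namespace": "warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "marts.fct_revenue"}],
}

# An integration would POST this JSON to a lineage backend
# (Marquez, for example, accepts events at /api/v1/lineage).
print(json.dumps(event, indent=2))
```

The contrast with CWM is visible in the payload itself: plain JSON over REST, extensible by adding facets, instead of XMI documents over a full metamodel stack.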

Data Catalogs

Alation, Collibra, Atlan, DataHub—the metadata repository returns, now called "catalog."

Discovery · Governance · Lineage

Data catalogs are 1990s metadata repositories with better UIs and cloud-native architecture. They provide the same function: central store for technical metadata, business glossary, lineage, and access control. Same wine, new bottle.

What Was Rediscovered

  • Schema Enforcement — Open table formats brought back contracts
  • Metadata Repositories — Data catalogs = 1990s repository concept
  • Lineage Tracking — OpenLineage standardizing what CWM attempted
  • Semantic Layers — Headless BI, metrics layers = Business Objects Universes
  • Code Generation — dbt macros = template-based generation

| Feature | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| Metadata Format | Transaction log (JSON) | Manifest files (Avro) | Timeline + metadata |
| Schema Evolution | ✓ | ✓ (Most complete) | ✓ |
| Time Travel | ✓ | ✓ | ✓ |
| Partition Evolution | Limited | ✓ (Hidden partitioning) | Limited |

Validation of Every MBI Principle

The Lakehouse era validates every MBI principle. Open table formats are just metadata files driving storage semantics—the same pattern as MBI's control tables driving ETL generation. dbt's ref() function is namespace resolution. Jinja templates are recorded patterns.

But there's a crucial gap: the engines are superb, the tooling is still manual. Organizations click through UIs and write notebooks instead of generating from metadata. The automation opportunity remains unseized.

25+
Generation 5

The Open Catalog Convergence

Unity Catalog OSS · Iceberg REST API · Polaris · AI-Ready Metadata

2024-2025 marks a tectonic shift. Databricks open-sourced Unity Catalog. Snowflake launched Polaris (to be donated to Apache). The Iceberg REST Catalog API became the de facto standard for metadata interchange. Format wars are ending; interoperability is winning.

The semantic layer has returned with dbt's acquisition of Transform and the MetricFlow standard. Gartner now calls semantic technology "non-negotiable for AI success." We're approaching something that looks remarkably like Inmon's 1990 vision with better infrastructure.

The Convergence

The Iceberg REST Catalog API has become the "USB port" for metadata. What CWM should have been—but pragmatic (REST/JSON) rather than baroque (XMI/CORBA). Organizations can write tables with Databricks, read them from Snowflake, govern them with Unity Catalog, and query them from Trino.

```

┌─────────────────────────────────────────────────────────────────┐
│                 2025+ OPEN CATALOG ARCHITECTURE                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                     AI / LLM APPLICATIONS                       │
│          (Semantic layer critical for AI accuracy)              │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    HEADLESS SEMANTIC LAYER                      │
│    ┌──────────────┬─────────────┬─────────────────────────┐     │
│    │ dbt Semantic │    Cube     │        AtScale          │     │
│    │    Layer     │    (OSS)    │         (SML)           │     │
│    └──────────────┴─────────────┴─────────────────────────┘     │
│              ↑ MetricFlow / SML Open Standards ↑                │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                     UNIFIED CATALOG LAYER                       │
│        ┌───────────────┬──────────────┬──────────────┐          │
│        │ Unity Catalog │   Polaris    │    Nessie    │          │
│        │     (OSS)     │   (Apache)   │   (Dremio)   │          │
│        └───────────────┴──────────────┴──────────────┘          │
│             ↑ Iceberg REST Catalog API Standard ↑               │
└─────────────────────────────────────────────────────────────────┘
                                │
    ┌───────────────────────────┼───────────────────────────┐
    ▼                           ▼                           ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│  Delta Lake   │      │    Iceberg    │      │     Hudi      │
│   (UniForm)   │ ◄──▶ │  (Standard)   │ ◄──▶ │               │
└───────────────┘      └───────────────┘      └───────────────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                   OBJECT STORAGE (Universal)                    │
│                  Parquet + Metadata Everywhere                  │
└─────────────────────────────────────────────────────────────────┘
```

Unity Catalog (OSS)

Databricks open-sourced June 2024. Universal interface for Delta/Iceberg/Hudi via REST APIs. Supports HMS interface.

Apache License · REST · Multi-Format

Unity Catalog OSS is the "USB port" for data. It supports Delta, Iceberg, and Hudi tables through open APIs. Managed tables use AI-driven optimization—automatic clustering, file compaction, intelligent statistics collection. Same patterns as MBI, now productized.

Iceberg REST Catalog

De facto standard API. Clients don't manage metadata files—catalog handles state. Enables credential vending.

Open Spec · Flink · Trino

The REST Catalog is what CWM should have been: pragmatic, operational, and widely adopted. Any Iceberg client can talk to any REST-compliant catalog. This is interoperability through open metadata APIs—exactly what the 1990s envisioned.

dbt Semantic Layer

MetricFlow acquired 2023. Metrics defined once, queried anywhere. The headless BI vision realized.

MetricFlow · YAML · API-First

dbt's Semantic Layer is Business Objects Universes for the modern era: metrics and entities no longer locked into a single BI tool, accessible by all downstream tools. Define once in YAML, query from anywhere via API. Same concept, better architecture.
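The define-once idea in miniature: the metric spec below is shown as the dict a YAML file would parse into; its fields are invented for illustration and do not follow MetricFlow's actual schema.

```python
# A metric defined once, compiled to SQL on request. Governance lives in
# the spec: only declared dimensions may be used in a query.
metric = {
    "name": "revenue",
    "expr": "SUM(amount)",
    "model": "fct_order_lines",
    "dimensions": ["order_date", "region"],
}

def query_metric(metric, group_by):
    assert set(group_by) <= set(metric["dimensions"]), "ungoverned dimension"
    dims = ", ".join(group_by)
    return (f"SELECT {dims}, {metric['expr']} AS {metric['name']} "
            f"FROM {metric['model']} GROUP BY {dims}")

print(query_metric(metric, ["region"]))
# → SELECT region, SUM(amount) AS revenue FROM fct_order_lines GROUP BY region
```

Every downstream tool that calls this compiler gets the same "revenue", which is the whole argument for a headless semantic layer.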

UniForm / XTable

Write once, read from any format. Delta UniForm generates Iceberg metadata automatically. Format agnosticism arriving.

Interop · Automatic · Multi-Engine

UniForm and XTable bridge the format wars. Write in Delta, automatically generate Iceberg metadata. Read from Snowflake, BigQuery, or Trino without conversion. The format wars are ending through metadata translation layers.

The AI Forcing Function

Large Language Models require semantic context to avoid hallucination. Without a semantic layer defining what metrics mean, AI systems generate plausible but wrong answers. This is driving rapid semantic layer adoption—the same need that drove Business Objects adoption in the 1990s, but with higher stakes.

Gartner now identifies semantic technology as "non-negotiable for AI success." The semantic layer enforces guardrails, ensuring AI systems query only approved, governed, and contextualized metrics.

The dbt MCP Server and dozens of integrations let you push governed, consistent definitions to embedded analytics, notebooks, spreadsheets, AI systems, and more, so every decision and experience is built on trusted data.

The Vision Validated, The Gap Remains

The 2025 landscape validates MBI's core insight: everything reducible to metadata should be driven by metadata. Unity Catalog's managed tables use AI-driven optimization (automatic clustering, file compaction)—deterministic operations from recorded patterns, exactly like MBI's template resolution.

The gap that remains: even with superb catalogs and semantic layers, organizations still manually configure pipelines. The framework pattern—recorded expert knowledge replayed over discovered namespaces—remains the industry's blind spot.


The Full Circle

35 years of metadata management: from vision to fragmentation to rediscovery

| Concept | 1990s Term | 2025 Term | What Changed |
|---|---|---|---|
| Metadata Repository | CASE Repository | Data Catalog | Same concept, cloud-native, better UI |
| Business Abstraction | Semantic Layer / Universe | Headless BI / Metrics Layer | Same concept, API-first, decoupled from BI tool |
| Metadata Interchange | CWM / XMI | Iceberg REST / OpenLineage | Same goal, REST/JSON instead of CORBA/XML |
| Code Generation | CASE Tools | dbt / Terraform / IaC | Same pattern, different domains |
| Schema on Write | Data Warehouse (default) | Open Table Formats | Rediscovered after Lake failures |
| Lineage Tracking | CWM Transformation Model | OpenLineage | Same need, operational focus, open standard |

The Lesson

The data industry didn't advance from CIF to Lakehouse—it regressed and recovered. We spent two decades fragmenting what was integrated, enclosing what was open, and manually building what should have been generated. The 2020s "innovations" are largely rediscoveries of 1990s patterns with better infrastructure.

What remains unrealized is the deeper insight: if metadata drives catalogs, table formats, semantic layers, and lineage—it should also drive pipeline generation. The frameworks that treat build specifications as data, that record expert patterns and replay them over discovered namespaces, remain the industry's blind spot.

The future isn't more tools. It's recognizing that the tools themselves should be generated from the same metadata they manage.
