Key takeaways
  • A data lake stores raw data untransformed and resolves meaning at query time (schema-on-read); a warehouse imposes structure on write. They are different tools, not substitutes.
  • What turns a data lake into a data swamp is almost never the technology. It is missing governance: no catalog, no ownership, no lineage, no access model.
  • GDPR requirements (deletion, residency, pseudonymization) belong in the architecture from day one. Retrofitting Article 17 onto an append-only lake is painful and expensive.
  • For most enterprises that need both reporting and raw-data processing, a lakehouse is the pragmatic answer. Build incrementally around one real use case, not everything at once.

A data lake is not meant to be a data graveyard. In practice many become one anyway: badly structured, ungoverned, full of data nobody understands or can find. This article covers what a sound data lake architecture actually looks like, when it is justified, and the failure patterns that quietly turn storage into a liability.

What a data lake is, and what it is not

A data lake is a central repository that ingests structured, semi-structured and unstructured data in any format. Data lands raw, untransformed, and schema and meaning are resolved at read time. This is schema-on-read, the defining property.

That puts it in a different category from a data warehouse:

  • Data warehouse: schema-on-write, relational structure, optimized for SQL and reporting. Data is cleaned, transformed and modeled before it is loaded.
  • Data lake: schema-on-read, heterogeneous formats (JSON, Parquet, CSV, Avro, images, logs), optimized for exploration, machine learning and batch processing.
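
The difference is easiest to see side by side. The following is a minimal sketch, not a real engine: the event fields (`user_id`, `amount`, `coupon`) and both helper functions are hypothetical and exist only for illustration.

```python
import json

# Raw events exactly as a hypothetical source delivered them (JSON lines).
RAW_EVENTS = [
    '{"user_id": "u1", "amount": "19.90", "ts": "2024-05-01T10:00:00"}',
    '{"user_id": "u2", "amount": "5", "ts": "2024-05-01T10:05:00", "coupon": "X1"}',
]


def write_with_schema(line: str) -> dict:
    """Schema-on-write: validate and cast before storing; unmodeled fields are dropped."""
    record = json.loads(line)
    return {"user_id": str(record["user_id"]), "amount": float(record["amount"])}


def read_with_schema(lines: list[str], fields: dict) -> list[dict]:
    """Schema-on-read: raw lines stay untouched; each reader projects and casts what it needs."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            name: cast(record[name]) if name in record else None
            for name, cast in fields.items()
        })
    return rows


# Warehouse-style load: the "coupon" field is gone for good.
warehouse_rows = [write_with_schema(line) for line in RAW_EVENTS]

# Lake-style read: a later consumer can still ask for a field nobody modeled up front.
coupon_view = read_with_schema(RAW_EVENTS, {"user_id": str, "coupon": str})
```

The warehouse path gives every consumer the same typed rows but has already discarded `coupon`. The lake path keeps everything and pushes the interpretation work, and the risk of getting it wrong, onto each reader.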

A data lake is not a cheaper warehouse and not a replacement for one. It is a different instrument for different requirements, and teams that miss this distinction tend to build both badly.

When a data lake is actually justified

A data lake earns its complexity when at least one of these holds:

  • Data volume exceeds what a structured warehouse can store economically: logs, sensor telemetry, clickstreams, documents.
  • Requirements are not yet defined. Data scientists need to explore raw data before a reporting use case or a model is even specified.
  • Machine learning pipelines need unprocessed history for training and feature engineering, not a pre-aggregated view someone decided was useful years ago.
  • Data arrives from many heterogeneous sources and needs to land without heavy upfront ETL.

If none of these is true and the organization mostly needs structured reporting and ad-hoc SQL, build a warehouse. Standing up a data lake architecture to serve four dashboards manufactures operational burden you did not need.

The reference architecture: zones

A production-grade data lake architecture is organized into zones. Each zone has a contract: what enters it, in what state, and who can read it. The common structure:

Landing zone (raw zone). Raw data lands exactly as the source delivered it. This zone is append-only and immutable. For audits and incident analysis, the ability to go back to the exact source bytes is non-negotiable. The moment you "clean as you ingest," you lose your ground truth.
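
As a rough illustration of the append-only contract, the sketch below lands a payload under a name derived from its own content and refuses to overwrite anything. It uses a local directory to stay self-contained; the `landing/<source>/ingest_date=...` layout and the function names are assumptions, not a standard.

```python
import hashlib
from datetime import date
from pathlib import Path


def land_raw(root: Path, source: str, payload: bytes, received: date) -> Path:
    """Store the exact source bytes under a content-derived name; never overwrite."""
    digest = hashlib.sha256(payload).hexdigest()
    target_dir = root / "landing" / source / f"ingest_date={received.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{digest}.raw"
    try:
        # Mode "xb" fails if the file already exists, so a landed object is never replaced.
        with open(target, "xb") as handle:
            handle.write(payload)
    except FileExistsError:
        pass  # The same bytes were already landed; re-delivery is a no-op.
    return target


def verify(path: Path) -> bool:
    """Check that the stored bytes still match the digest in the file name."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == path.stem
```

On an object store the same guarantee would typically come from bucket versioning or write-once retention settings, not from file modes. The point of the sketch is the contract: ingestion adds objects and can prove later that they are unchanged.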

Cleansed zone (curated zone). Data is normalized here: schema alignment, deduplication, encoding fixes, obvious quality issues filtered out. Format is typically Parquet or Avro, partitioned by date or source system.
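
A cleansing step along these lines might look like the sketch below. The record fields (`event_id`, `email`, `ts`) and the `event_date=` partition naming are hypothetical, and writing the result as Parquet is left out to keep the example dependency-free.

```python
from collections import defaultdict
from datetime import datetime


def cleanse(records: list[dict]) -> dict[str, list[dict]]:
    """Normalize records, deduplicate on event_id, and group them by date partition."""
    seen: set[str] = set()
    partitions: dict[str, list[dict]] = defaultdict(list)
    for raw in records:
        event_id = str(raw.get("event_id", "")).strip()
        if not event_id or event_id in seen:
            continue  # Drop records without a key and repeated deliveries of the same key.
        seen.add(event_id)
        ts = datetime.fromisoformat(raw["ts"])
        clean = {
            "event_id": event_id,
            "email": raw.get("email", "").strip().lower(),  # Normalize case and whitespace.
            "ts": ts.isoformat(),
        }
        partitions[f"event_date={ts.date().isoformat()}"].append(clean)
    return dict(partitions)
```

Because the landing zone still holds the original bytes, any rule in this step can be changed later and the cleansed zone rebuilt from scratch.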

Enriched zone (conformed zone). Data from different sources is joined, enriched and mapped into a single semantic model, so a "customer" means one thing across every source. This layer is the foundation for analytics and ML.

Serving zone. Aggregated, use-case-specific datasets for dashboards, APIs and ML inference. Often a data mart or a feature store.

Ingestion runs through batch or streaming pipelines (Apache Kafka, Apache Flink, Amazon Kinesis or equivalents). A data catalog (AWS Glue Data Catalog, Apache Atlas, Collibra) is not an optional add-on. It is the structural precondition for anyone finding anything. The medallion framing some teams use (bronze, silver, gold) is the same idea in three layers: bronze is the raw zone, silver covers cleansed and conformed data, and gold is the serving layer.

Governance is the real differentiator

The most common reason a data lake degrades into a data swamp is not the wrong technology. It is the absence of governance. Storage is cheap; making that storage trustworthy and discoverable is the hard part, and it is where most projects underinvest. Concretely, the failures look like this:

  • No data catalog. Nobody knows what is in the lake, who put it there, or what it means. Data gets duplicated, or sits with no consumer because nobody can find it.
  • No ownership model. Every team writes into the lake and nobody owns quality. The raw zone fills with fragments no downstream process can use.
  • No access model. Either everyone can see everything (a compliance problem waiting to happen) or nobody can reach the data they need. Both block adoption.
  • No data lineage. When a model produces wrong predictions, nobody can trace which source data fed the training run. You are debugging blind.
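
The four gaps above map onto surprisingly little metadata. The following sketch shows a toy in-memory catalog that refuses datasets without an owner, records upstream dependencies, and answers access questions. It is an illustration of the idea only; the class and field names are invented here and do not mirror the API of AWS Glue, Apache Atlas or Collibra.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    zone: str
    owner: str
    description: str
    upstream: tuple[str, ...] = ()
    allowed_roles: frozenset[str] = frozenset()


class Catalog:
    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        """Reject datasets with no owner or with upstream sources the catalog does not know."""
        if not entry.owner.strip():
            raise ValueError(f"{entry.name}: dataset has no owner")
        missing = [u for u in entry.upstream if u not in self._entries]
        if missing:
            raise ValueError(f"{entry.name}: unknown upstream {missing}")
        self._entries[entry.name] = entry

    def lineage(self, name: str) -> list[str]:
        """Return every transitive upstream dataset, nearest first."""
        result: list[str] = []
        queue = list(self._entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current not in result:
                result.append(current)
                queue.extend(self._entries[current].upstream)
        return result

    def can_read(self, name: str, role: str) -> bool:
        return role in self._entries[name].allowed_roles
```

With entries registered for a raw, a cleansed and a feature dataset, `lineage("feat.basket")` would walk back to the raw source, which is exactly the question you need answered when a model misbehaves.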

Data governance is not a bureaucracy exercise. It is the difference between a reliable asset and an expensive pile, and it is what lets a data lake architecture scale past its first three users.

GDPR by design, not as an afterthought

In a European context there is a regulatory layer you have to design in from the start, because retrofitting it is far harder than building it in. The critical points:

  • Personal data does not belong unencrypted in the raw zone. Pseudonymize or encrypt at ingest, or hold personal data in a separate zone with stricter access control. The append-only raw zone makes this discipline easy to skip and expensive to fix later.
  • Right to erasure (Article 17). Deletion in an append-only, immutable lake is not trivial. Workable patterns are cryptographic shredding (encrypt each data subject's records with a dedicated key and destroy that key on an erasure request) or keeping identifying attributes in a separate, deletable lookup table that the rest of the lake references by token.
  • Data residency. GDPR does not forbid storage outside the EU, but transfers out of the EEA need their own legal basis, so most deployments keep personal data in EU regions (Frankfurt, Amsterdam, Dublin, Paris). The region is an architectural decision, not something you reconcile after the data is already in the wrong place.
  • Records of processing. Every data flow into the lake should appear in your records of processing activities. A data catalog supports this but does not replace it.
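
The lookup-table pattern for erasure can be sketched as follows. This is a simplified in-memory illustration under stated assumptions: `IdentityVault`, `pseudonymize` and the event fields are hypothetical names, and a real vault would be a separately secured, durable store.

```python
import secrets
from typing import Optional


class IdentityVault:
    """Deletable lookup table: the only place where a token and an identity meet."""

    def __init__(self) -> None:
        self._by_token: dict[str, dict] = {}
        self._by_subject: dict[str, str] = {}

    def tokenize(self, subject_id: str, attributes: dict) -> str:
        """Return a stable random token for a data subject, creating it on first sight."""
        if subject_id not in self._by_subject:
            token = secrets.token_hex(16)
            self._by_subject[subject_id] = token
            self._by_token[token] = dict(attributes)
        return self._by_subject[subject_id]

    def resolve(self, token: str) -> Optional[dict]:
        return self._by_token.get(token)

    def erase(self, subject_id: str) -> bool:
        """Delete the mapping; immutable lake records keep only a token that resolves to nothing."""
        token = self._by_subject.pop(subject_id, None)
        if token is None:
            return False
        del self._by_token[token]
        return True


def pseudonymize(event: dict, vault: IdentityVault) -> dict:
    """Replace identifying fields with a token before the event reaches the raw zone."""
    token = vault.tokenize(event["customer_id"], {"email": event["email"]})
    return {"customer_token": token, "amount": event["amount"]}
```

The token is random, not a hash of the identifier, so it cannot be recomputed from the customer ID once the vault entry is gone. Whether the remaining tokenized events count as anonymous after erasure depends on what else they contain, which this sketch does not settle.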

Design these in, and GDPR is a property of the system. Bolt them on, and each becomes a migration project under pressure.

The lakehouse: a pragmatic middle model

In many projects the lakehouse pattern has become the sensible default. It combines the storage flexibility of a data lake with transactional consistency and SQL-capable layers on top, implemented with frameworks like Apache Iceberg, Delta Lake or Apache Hudi over an object store (S3, Azure Data Lake Storage, GCS).

The lakehouse is not a marketing label. It is an architectural choice that makes sense when an organization genuinely needs both structured reporting and raw-data processing, and wants to consolidate instead of running a lake and a warehouse side by side. The tradeoffs are real: more management complexity, higher demands on the platform team. But for the common case it removes a whole class of data-copying and synchronization problems.
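
What "transactional consistency over an object store" means can be shown with a toy model. In broad strokes, table formats keep data files immutable and track which files make up the table in a series of snapshots, with writers committing optimistically. The class below is only a sketch of that idea; it is not the on-disk format or the API of Apache Iceberg, Delta Lake or Apache Hudi, and all names are invented.

```python
from typing import Iterable, Optional


class SnapshotTable:
    """Toy table: immutable data files plus an ordered list of snapshots."""

    def __init__(self) -> None:
        # Snapshot 0 is the empty table.
        self._snapshots: list[frozenset[str]] = [frozenset()]

    @property
    def version(self) -> int:
        return len(self._snapshots) - 1

    def files(self, version: Optional[int] = None) -> frozenset[str]:
        """Readers see one complete snapshot, either the latest or a historical one."""
        return self._snapshots[self.version if version is None else version]

    def commit(self, base_version: int,
               add: Iterable[str] = (), remove: Iterable[str] = ()) -> int:
        """Optimistic commit: succeeds only if nobody committed since base_version."""
        if base_version != self.version:
            raise RuntimeError("concurrent commit detected; retry on the new snapshot")
        current = self._snapshots[-1]
        self._snapshots.append((current - frozenset(remove)) | frozenset(add))
        return self.version
```

A reader holding an older version keeps a consistent view while a writer replaces files, and a writer working from a stale version is rejected instead of silently overwriting someone else's change.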

The failure patterns that create a swamp

From the field, the patterns that fail again and again:

1. Everything in, nothing out. Ingestion is rarely the bottleneck; consumption is. Without a serving layer and defined use cases, data lands and stays. A full lake nobody queries is not a success; it is sunk cost.

2. No schema versioning. Source systems change their schema. If nobody tracks those changes and the downstream consumer is caught off guard, the pipeline breaks silently and you find out from a wrong dashboard, not an alert.

3. No data quality monitoring. Silent errors accumulate over months. By the time a model trains on the data, the training set is already corrupted, and the model fails in ways nobody can attribute.

4. Cost blowups on bad partitioning. Object storage is cheap; compute is not. Engines that bill by data scanned (Athena, BigQuery on-demand or similar) charge for every byte an ad-hoc query reads, so an unfiltered query over poorly partitioned raw data pays for the whole dataset each time it runs. Partitioning and clustering strategy are not optional; they are part of the architecture.

5. Too much, too early. A lake meant to do everything (streaming, batch, ML, reporting, operational analytics) before a single use case is live never ships. Start with one concrete use case, prove it, then expand. The illusion that storage alone is a data strategy is what kills these programs.
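
Patterns 2 to 4 are the ones most open to automated guards. The sketch below shows one small check for each: schema drift against a registered schema, null rates per column, and the bytes a query would scan with and without a partition filter. The function names, the type-name schema representation and the `event_date=` partition keys are assumptions made for this example.

```python
from typing import Optional


def diff_schema(expected: dict[str, str], record: dict) -> dict[str, list[str]]:
    """Pattern 2: compare one incoming record with the registered field -> type-name mapping."""
    observed = {name: type(value).__name__ for name, value in record.items()}
    shared = expected.keys() & observed.keys()
    return {
        "missing": sorted(set(expected) - set(observed)),
        "added": sorted(set(observed) - set(expected)),
        "retyped": sorted(n for n in shared if expected[n] != observed[n]),
    }


def null_rates(rows: list[dict], columns: list[str]) -> dict[str, float]:
    """Pattern 3: share of rows where a column is absent or None (1.0 for an empty batch)."""
    total = len(rows)
    return {
        column: (sum(1 for row in rows if row.get(column) is None) / total if total else 1.0)
        for column in columns
    }


def scanned_bytes(partition_sizes: dict[str, int],
                  wanted: Optional[list[str]] = None) -> int:
    """Pattern 4: bytes read by a full scan, or only by the partitions a filter selects."""
    if wanted is None:
        return sum(partition_sizes.values())
    return sum(partition_sizes.get(partition, 0) for partition in wanted)
```

Wired into a pipeline, the first two would raise an alert when a field disappears, changes type, or a null rate crosses a threshold agreed with the dataset owner. The third makes the cost argument concrete: with daily partitions of equal size, a one-week filter over a year of data reads roughly 2% of what a full scan reads.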

Talk through your data lake architecture

DNA Solutions helps European enterprises design and build data platforms, from the first zoning decision to production-grade governance: catalog, ownership, lineage and GDPR-compliant data handling. Whether you are choosing between a data lake, a warehouse and a lakehouse, or rescuing a lake that has already turned into a swamp, we sequence it around the use cases that actually matter. Talk to us.
