On Data Lakes and Swamp Shacks

Your data lake probably looks like this and your data lakehouse will probably look like this.

Realism

When building out data lake infrastructure, know the rate at which you currently need to access that data, as a ratio of “people who ask important questions that require icky and difficult queries/multi-data-source wrangling in order to answer” per unit time. Do not confuse that with the rate at which it would be super cool if you could access that data if you had a better data-{base,warehouse,lake,lakehouse,swamp,space-station}.

Somewhere in between those two rates is the rate at which you actually would access the data if it were just several million dollars a month easier to query.

The “actually would” rate is a lot closer to the current rate (and further from the it-would-be-super-cool-if rate) than most people want to admit.

You. I am talking about you. You are “most people”.

Growth

  1. If your “data-lake-shaped questions that currently need answering” number is denominated in single digits per month, just write the hideous cron job (first sketch after this list). You know, the one that dumps massive, slow query results from disparate “wouldn’t it be nice if we had a data warehouse instead of these” sources into memory and answers specific questions using that memory.
  2. When the number of questions-that-need-answering hits low double digits, make the cron jobs (because there are probably more than one by now) read from replicas, and make them shovel the temporary data into SQLite, or S3, or something cheaper/less crashy than memory (second sketch below).
  3. Medium double digits? The massive read queries are probably hurting more than a bit. The cron jobs can provision database replicas on demand, query the dedicated replicas, and shut them down (third sketch below). If you detach the replicas once they’re caught up, you can even use the replicas instead of SQLite/S3 for temp space.
  4. High double digits and beyond? Sure, pay somebody and make a proper data swamp palace. “Paying somebody” might mean buying a whole solution, buying a layer of glue, or staffing out a team to build things in-house. Same difference: it all costs money.
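
For the skeptics: the step-1 cron job really is this small. A minimal sketch, assuming two Postgres sources and a made-up question (revenue per signup cohort); every DSN, table, and column name here is hypothetical:

```python
# cron: 0 6 * * *  -- the hideous daily job; run it from any box that reaches both DBs
import psycopg2  # assumption: both sources happen to be Postgres


def fetch_all(dsn, query):
    """Dump a massive, slow query result straight into memory."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        return cur.fetchall()


# Two disparate "wouldn't it be nice if we had a data warehouse" sources.
signups = fetch_all("dbname=users", "SELECT user_id, cohort FROM signups")
orders = fetch_all("dbname=billing", "SELECT user_id, amount_cents FROM orders")

# Answer the specific question in memory: revenue per signup cohort.
cohort_by_user = dict(signups)
revenue = {}
for user_id, amount in orders:
    cohort = cohort_by_user.get(user_id, "unknown")
    revenue[cohort] = revenue.get(cohort, 0) + amount

for cohort, cents in sorted(revenue.items()):
    print(f"{cohort}\t{cents / 100:.2f}")
```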
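
Step 2, same spirit: point the job at read replicas and shovel the intermediate data into SQLite instead of holding it in memory. The replica DSNs and schema are, again, invented:

```python
import sqlite3

import psycopg2  # assumption: the sources are Postgres, each with a read replica


def shovel(replica_dsn, query, db, table, n_cols):
    """Stream a big query result from a replica into a local SQLite table."""
    placeholders = ",".join("?" * n_cols)
    # A named cursor is server-side: rows arrive in batches, not one giant fetchall.
    with psycopg2.connect(replica_dsn) as conn, conn.cursor("shovel") as cur:
        cur.execute(query)
        while rows := cur.fetchmany(10_000):
            db.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    db.commit()


db = sqlite3.connect("/tmp/nightly-scratch.db")  # cheaper and less crashy than memory
db.executescript("""
    DROP TABLE IF EXISTS signups;
    DROP TABLE IF EXISTS orders;
    CREATE TABLE signups (user_id INTEGER, cohort TEXT);
    CREATE TABLE orders (user_id INTEGER, amount_cents INTEGER);
""")

# Hypothetical replica DSNs; the point is that the primaries never feel a thing.
shovel("host=users-replica dbname=users",
       "SELECT user_id, cohort FROM signups", db, "signups", 2)
shovel("host=billing-replica dbname=billing",
       "SELECT user_id, amount_cents FROM orders", db, "orders", 2)

# The question is now plain SQL against local, disposable temp space.
for cohort, cents in db.execute(
    "SELECT s.cohort, SUM(o.amount_cents) FROM orders o "
    "JOIN signups s USING (user_id) GROUP BY s.cohort"
):
    print(cohort, (cents or 0) / 100)
```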
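
And step 3, sketched against AWS RDS purely as an example provider (every identifier is made up): provision a throwaway replica, wait for it, query it, shut it down.

```python
import boto3  # assumption: the databases live in AWS RDS

rds = boto3.client("rds")

# Spin up a dedicated replica so the massive reads hurt nobody else.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="nightly-report-replica",  # hypothetical name
    SourceDBInstanceIdentifier="users-primary",     # hypothetical source instance
    DBInstanceClass="db.r6g.large",
)
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="nightly-report-replica",
)

# ... point the step-1/step-2 code at the replica's endpoint and run it ...

# The detach trick: promote the caught-up replica and use it as temp space
# instead of SQLite/S3.
# rds.promote_read_replica(DBInstanceIdentifier="nightly-report-replica")

# Either way, stop the meter when the job is done.
rds.delete_db_instance(
    DBInstanceIdentifier="nightly-report-replica",
    SkipFinalSnapshot=True,
)
```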

Many large organizations never need to go beyond step 3. Most large organizations do anyway. Money, meet dumpster fire.

Implementations of steps 1 through 3 are not large sunk costs. Not even for your team of two. Better people than you and I write such systems every day. In Bash. And they work great. It just feels bad, and BigQuery’s marketing team is very good at capitalizing on bad feelings, so people spend immense amounts of money they don’t have on solutions they don’t need.

Confusion

Data lakes nominally exist to solve data size and ad-hoc queryability problems. However, many organizations use them to solve data ownership and prioritization problems instead.

An example: data needed by team X is already within the company, but it’s locked behind service Y/database Z/persnickety team W. Copying the data into a queryable-by-everyone data lake and telling team X to use that solves this problem, but solves it in a maximally inaccurate, slow, and expensive way.

Yes, yes, I’m sure the political/technical challenges of letting one team access another team’s data are quite real. And yes, I’m sure that it’s unthinkable to allow suspicious outsiders (other software professionals working for the same company as the data owners) to just reach in and do whatever they like (query the data just like the data owners do). Perish the thought.

So, off to the lake it is. And besides, you already had the lake for the business analyst/data science/AI training folks. So why not use it for a product purpose as well? Marginal cost of zero and all that.

It’s slow, you say? Not to worry, any data lake provider worth their salt makes it extremely easy to scale query compute/performance. Funny how that works.

Predictable Outcome

Fast forward a few years. A roughly OLTP-shaped business is now spending a significant fraction of its opex on OLAP-shaped solutions.

The people for whom this solution was originally built use it infrequently and need a tiny fraction of the scale you’ve provisioned.

The people you didn’t originally build for, on the other hand, use it quite a lot: the lake now drives parts of the product. In many cases, lake query performance and cost necessitate secondary databases downstream of the data lake, which serve as a sort of cache/denormalization layer. This brings the data-duplication-at-rest multiplier to at least 3 (original source of truth -> lake -> new DBs). The actual number is probably higher, thanks to the ETL/message broker/temporary storage systems that feed data into and out of the lake (and are you sure you put lifecycle policies on all those temporary S3 buckets?). Even with carefully chosen tools, the complexity/human-maintenance cost of this approach is high. If the data is big or high-velocity, the operational cost of the datastores involved can be immense.
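
About that parenthetical: a lifecycle policy on a temporary bucket is a dozen lines, and forgetting one is a permanent line item. A sketch with boto3, bucket name and prefix invented:

```python
import boto3

s3 = boto3.client("s3")

# Expire temporary ETL objects after a week, and clean up abandoned multipart
# uploads, so the "temporary" storage feeding the lake stays temporary.
s3.put_bucket_lifecycle_configuration(
    Bucket="etl-scratch",  # hypothetical temp bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-temp-objects",
                "Filter": {"Prefix": "tmp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```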

In other words, the people you incidentally built the data lake for now completely depend on it as a source of truth. This dependence is technical (the data has grown too large to query any other way, and the art of cron job authoring has long been forgotten) and political (solving data problems by getting other people at the company to do things is, similarly, a forgotten art). Someone tells you a “delta architecture” might solve these problems, but you’ll have to hire five experts to build one. Someone else tells you that “event sourcing” will solve these problems and make you coffee, too, but you’ll have to rewrite everything and perform a human sacrifice every time you need to mutate data in a transaction. They assure you that database transactions are esoteric and rare occurrences, so it will probably not be an issue.

Now ask yourself: as sticky as the political problems were, as much as the thought of letting non-owning teams query outside of their domains scared you … was it really worth it?

By the time you become aware of the answer, it will be much too late.