The Problem with Data Lakes


Data lakes are a really nice idea because one of the problems with storing data is that you tend to store it E-V-E-R-Y-W-H-E-R-E. With a data lake, you throw all of your data into a centralized location for different uses. That centralization at least makes the data easier to find within the company, so it is understandable that you might want to establish one great, big area for your data to reside. If the company has 14 domains, a data lake avoids having the data locked away in 14 separate silos, where it is much harder to find.

Data lakes will typically include structured, semi-structured and unstructured data. For those not familiar with these types of data:

  • structured data is most typically in a tabular format. Think standardized data in rows and columns.
  • semi-structured data is not tabular but can be put into a standard format. Think JSON, which is a great format for moving data around.
  • unstructured data is content that is raw and unformatted. Think about the data you find in emails, images, audio files, social media, etc.
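To make the distinction concrete, here is a minimal sketch (using only Python's standard library, with made-up field names) of how a semi-structured JSON record might be flattened into a single tabular-style row, the kind of transformation a data lake consumer often has to do before reporting:

```python
import json

# A hypothetical semi-structured record, as it might arrive from an API feed.
raw = '{"user": "alice", "tags": ["analytics", "reporting"], "profile": {"team": "finance"}}'

record = json.loads(raw)  # parse the JSON text into nested Python objects

# Flatten the nested structure into one flat row (structured, column-like data).
row = {
    "user": record["user"],
    "team": record["profile"]["team"],
    "tags": ",".join(record["tags"]),
}

print(row)  # {'user': 'alice', 'team': 'finance', 'tags': 'analytics,reporting'}
```

The nesting and the list inside the record are what make it "semi-structured": the data has a predictable shape, but it does not fit rows and columns until you decide how to flatten it.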


Theoretically, data lakes are useful for various areas in the company to consume for reporting, analytics, etc. All sounds great, right? Well, one of the main challenges with data lakes is that if they are not governed well, the data can become stale, siloed with no common meaning, and ultimately hard to trust for reliable reporting and other uses. So, garbage in and garbage out can become the norm as the data lake gets crowded with junk.

In my experience, a data lake is good for smaller companies with a limited number of domains and data sources. For large companies with data all over the place (e.g., on-premises and in the cloud) and many data sources (e.g., Teradata, Druid, Hive, Oracle, AWS, BigQuery, Snowflake, etc.), a data lake will just become an unmanageable dumping ground and a nightmare to contend with.

