The Problem with Data Lakes


Data lakes are a really nice idea because one of the problems with storing data is that you tend to store it E-V-E-R-Y-W-H-E-R-E. With a data lake, you throw all of your data into a centralized location for different uses. That centralization at least makes the data easier to find within the company, so it is understandable that you might want to establish one great, big area for your data to reside. If the company has 14 domains, a data lake avoids having the data locked away in 14 separate silos, where it is much harder to find.

Data lakes will typically include structured, semi-structured and unstructured data. For those not familiar with these types of data:

  • structured data is most typically in a tabular format. Think standardized data in rows and columns.
  • semi-structured data is not tabular but can be put into a standard format. Think JSON, which is a great format for moving data around.
  • unstructured data is content that is raw and unformatted. Think about the data you find in emails, images, audio files, social media, etc.
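To make the distinction concrete, here is a minimal sketch (using only Python's standard library, with made-up field names) of how a semi-structured JSON record might be flattened into a single tabular-style row, the kind of transformation a data lake consumer often has to do before reporting:

```python
import json

# A hypothetical semi-structured record, as it might arrive from an API feed.
raw = '{"user": "alice", "tags": ["analytics", "reporting"], "profile": {"team": "finance"}}'

record = json.loads(raw)  # parse the JSON text into nested Python objects

# Flatten the nested structure into one flat row (structured, column-like data).
row = {
    "user": record["user"],
    "team": record["profile"]["team"],
    "tags": ",".join(record["tags"]),
}

print(row)  # {'user': 'alice', 'team': 'finance', 'tags': 'analytics,reporting'}
```

The nesting and the list inside the record are what make it "semi-structured": the data has a predictable shape, but it does not fit rows and columns until you decide how to flatten it.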


Theoretically, data lakes are useful for various areas in the company to consume for reporting, analytics, etc. All sounds great, right? Well, one of the main challenges with data lakes is that if they are not governed well, the data can become stale, siloed with no common meaning, and ultimately hard to trust for reliable reporting and other uses. So, garbage in and garbage out can become the norm as the data lake gets crowded with junk.

In my experience, a data lake is good for smaller companies with a limited number of domains and data sources. For large companies with data all over the place (e.g., on-premises and in the cloud) and many data sources (e.g., Teradata, Druid, Hive, Oracle, AWS, BigQuery, Snowflake, etc.), a data lake will just become an unmanageable dumping ground and a nightmare to contend with.

