Fundamentals of building a data lake


If you're here, then you've probably heard about data lakes and are curious to understand what the term means and how to build one. Well, the good news is that you might have already built a data lake without being aware of it. For example, if you've built a log storage solution where you move your server or application logs to a central location, then you've already taken the first step in building a data lake. Let's talk about that in detail and understand the data flow.
A data lake is a central repository for all your structured and unstructured data, which different applications can consume later on. Ingesting server or application logs is one example of storing unstructured data in a central location. However, a data lake is not just about storing data; it also involves cleansing the data and transforming it into separate catalogs based on the schema that you define. The graphic below shows the very high-level but essential building blocks of a data lake.




Let's deconstruct the above graphic and understand what each block means.
Data Producers
Data producers are the resources that generate data continuously. Those resources could be as common as servers and mobile apps, or as exciting as your coffee machine, your kid's scooter, or even a device tracking how many times your heart beats per second. The data being generated by millions of devices nowadays is enormous, and organizations find it hard to handle such large amounts of data. That's why they need to build data lakes.
Data Ingestion
Now that you've identified your data producers, you need a mechanism or a process to capture that data and store it in the central storage repository. This process is called data ingestion, and there are several ways to do it depending on your use case.
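For example, on AWS a common ingestion path is to push records into Amazon Kinesis Data Firehose, which buffers them and delivers them to a destination such as S3. Here's a minimal sketch using boto3; the stream name and log fields are hypothetical placeholders, and the delivery stream is assumed to already exist:

```python
import json
import boto3

# Firehose buffers incoming records and delivers them to a
# configured destination such as an S3 bucket (our raw zone).
firehose = boto3.client("firehose", region_name="us-east-1")

log_event = {"app": "checkout", "level": "ERROR", "msg": "payment timeout"}

# Each record must be bytes; a trailing newline keeps the delivered
# file line-delimited and easy to parse later.
firehose.put_record(
    DeliveryStreamName="app-logs-stream",  # hypothetical stream name
    Record={"Data": (json.dumps(log_event) + "\n").encode("utf-8")},
)
```

Other ingestion options include batch copies with AWS DataSync or the AWS CLI, and database replication with AWS DMS; which one fits depends on whether your data arrives as a stream or in bulk.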
Raw Data or Bronze Zone
This is the first block of the data lake. The data ingestion process dumps the data in mostly raw format into what is called the Bronze Zone. It is similar to dumping your server or application logs into some centralized location. Some applications only need to access raw data, meaning they don't need any additional transformation of the data structure. However, if you're looking for more meaningful results without scanning petabytes of data, then you need to break that data down and do some transformation.
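As a concrete sketch, raw data in the bronze zone is often laid out under date-partitioned S3 prefixes so that later jobs can process it incrementally. The bucket name, prefix layout, and payload below are hypothetical:

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
now = datetime.now(timezone.utc)

# Date-partitioned key: bronze/<source>/year=YYYY/month=MM/day=DD/...
# Partitioning the raw zone by date lets downstream jobs pick up
# only the slice of data they need.
key = (
    f"bronze/app-logs/year={now:%Y}/month={now:%m}/day={now:%d}/"
    f"logs-{now:%H%M%S}.json"
)

s3.put_object(
    Bucket="my-data-lake",  # hypothetical bucket name
    Key=key,
    Body=b'{"app": "checkout", "level": "ERROR", "msg": "payment timeout"}\n',
)
```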
Staging Data or Silver Zone
This block represents transformed data, meaning you've gone one step further and transformed the raw data based on your schema definition. This process is very helpful if you're building training data sets, because it essentially cleanses the data so that your training model runs against meaningful records rather than the entire raw data set.
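Here's a minimal sketch of such a transformation, assuming the raw zone holds newline-delimited JSON logs like the ones above and that pandas has S3 and Parquet support available (s3fs and pyarrow installed); all paths are hypothetical:

```python
import pandas as pd

# Read the newline-delimited JSON that ingestion dumped into the
# bronze zone (hypothetical example path).
raw = pd.read_json(
    "s3://my-data-lake/bronze/app-logs/year=2024/month=01/day=15/logs.json",
    lines=True,
)

# Enforce a schema: keep only the columns we care about, coerce the
# timestamp to a real datetime, and drop rows that fail to parse.
silver = (
    raw[["timestamp", "app", "level", "msg"]]
    .assign(timestamp=lambda df: pd.to_datetime(df["timestamp"], errors="coerce"))
    .dropna(subset=["timestamp"])
)

# Columnar, typed Parquet is far cheaper to scan than raw JSON.
silver.to_parquet(
    "s3://my-data-lake/silver/app-logs/year=2024/month=01/day=15/logs.parquet"
)
```

At larger scale the same idea is typically expressed as an AWS Glue or Spark job, but the shape of the work is identical: read raw, enforce a schema, write a clean columnar copy.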
Processed Data or Gold Zone
Once you've transformed the data, you may need to create multiple separate catalogs out of it. This helps in mapping applications to a specific catalog to execute a very well-defined process. Data stored in this zone is dedicated to specific use cases and is not meant for generic purposes.
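For instance, a gold-zone table might be a narrow aggregate built for one dashboard and nothing else. A sketch continuing the pandas example above, again with hypothetical paths:

```python
import pandas as pd

silver = pd.read_parquet(
    "s3://my-data-lake/silver/app-logs/year=2024/month=01/day=15/logs.parquet"
)

# A narrow, use-case-specific table: error counts per application,
# built solely for an operations dashboard.
errors_per_app = (
    silver[silver["level"] == "ERROR"]
    .groupby("app")
    .size()
    .reset_index(name="error_count")
)

errors_per_app.to_parquet(
    "s3://my-data-lake/gold/ops-dashboard/error-counts.parquet"
)
```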
Data Visualization
In this phase, detailed insights are derived from the data sets and visualized using various analytical tools. This is the phase where you discover the value of your data and how you can use the results, for example to train your ML models to make predictions.
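As one small illustration, the gold-zone table from the previous step could be charted directly with matplotlib; in practice you'd more likely point a BI tool at it, and the path below is hypothetical:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Load the gold-zone table built in the previous step.
gold = pd.read_parquet(
    "s3://my-data-lake/gold/ops-dashboard/error-counts.parquet"
)

# A simple bar chart of errors per application, exactly the kind of
# insight that can later feed alerting or an ML pipeline.
gold.set_index("app")["error_count"].plot(kind="bar")
plt.ylabel("Error count")
plt.title("Errors per application")
plt.tight_layout()
plt.show()
```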


Now that we have a basic understanding of the data lake blocks, let's fit several AWS services into each of those blocks. The graphic below helps in understanding which AWS service fits best at each stage of data lake formation.



The above graphic helps in understanding a data lake setup with respect to AWS services. Now that we have covered the basic building blocks of a data lake, I will explain how to build a basic data lake on AWS in some of my upcoming articles.
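As a small preview of how those services connect, AWS Glue can crawl the raw zone in S3 and register table definitions in the Glue Data Catalog, which lets query engines like Athena work against the lake. A minimal boto3 sketch, assuming a pre-created IAM role and using hypothetical names throughout:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# A crawler scans the S3 prefix, infers the schema, and registers a
# table in the Glue Data Catalog for query engines such as Athena.
glue.create_crawler(
    Name="bronze-app-logs-crawler",  # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="datalake_bronze",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/bronze/app-logs/"}]},
)
glue.start_crawler(Name="bronze-app-logs-crawler")
```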
Happy Learning!