In this article, I'll explain the basic toolset required to write standard data analysis programs in a containerized environment using Docker. As always, my approach is to make your programs portable and platform independent.
Let's first look briefly at what I mean by toolset and what I'm going to package in the Docker container.
PySpark - PySpark is the Python API for Apache Spark. Because you drive Spark from Python, you can combine it with other Python modules in the same program, which makes you a more efficient data analyst.
Apache Spark - A very popular framework for processing big data. The Spark project has advertised speedups of up to 100x over Hadoop MapReduce for in-memory workloads, so treat that figure as a best case rather than a general rule.
Jupyter Notebook - An open source web application, widely used by data analysts and engineers, for writing code alongside mathematical equations, data visualizations, and explanatory text.
NumPy - A Python library for working with multi-dimensional arrays and matrices, together with high-level mathematical functions that operate on them.
Now let's dig into the technical details and see how to set up a local environment that supports PySpark, Jupyter Notebook, and NumPy. Here are the step-by-step instructions:
Create a new folder on your system, e.g. c:\code\pyspark-jupyter, or any other name you prefer.
Create a file in that folder and call it docker-compose.yaml with the content given below:
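A minimal sketch of such a file is shown here. It covers what this setup needs: an official Jupyter image, a mapped folder, and port 8888. The image name jupyter/pyspark-notebook (a Jupyter Docker Stacks image that bundles Spark and PySpark), the service name, and the container path /home/jovyan/work are assumptions rather than details confirmed by this article, so adjust them to your setup.

```yaml
# docker-compose.yaml (sketch; image, service name and container path are assumptions)
version: "3"   # newer Compose releases treat this key as obsolete and only warn about it
services:
  pyspark-jupyter:                    # hypothetical service name
    image: jupyter/pyspark-notebook   # assumed image: Jupyter with Spark and PySpark
    ports:
      - "8888:8888"                   # host port 8888 -> container port 8888
    volumes:
      - ./:/home/jovyan/work          # map the current folder into the container
```

Using a relative path (./) for the volume keeps the file portable: the same file works wherever the folder is copied, as long as the command is run from that folder.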
In the above file, I'm pulling an official Jupyter Docker image, mapping the local folder to a folder inside the container, and publishing container port 8888 on host port 8888. Simple, isn't it?
Now run this file from the same folder using the command docker-compose up. Once the container has started, the output includes a URL for the Jupyter server with an access token.
Copy the URL http://127.0.0.1:8888/?token=YOUR_TOKEN and open it in the browser of your choice.
You'll see an instance of Jupyter Notebook running in a container. As you might have noticed, your local folder is mapped inside the container.
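Before writing any analysis code, it can be worth confirming from a notebook cell that the toolset is actually importable inside the container. The sketch below uses only the Python standard library. The helper name missing_packages is hypothetical, and pyspark and numpy are assumed to be the import names of the tools discussed here.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both packages are available in the running kernel.
print(missing_packages(["pyspark", "numpy"]))
```

If the list is not empty, the image in use probably does not bundle that package, and a different image (or an extra install step) is needed.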
Now let's create our first notebook and work with PySpark. This is just a brief introduction, as I'll be writing separate articles about PySpark and NumPy in detail.
PySpark Demo using Docker
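A first PySpark cell might look like the following sketch. It starts Spark in local mode inside the container, builds a small DataFrame, and filters it. The sample rows, the function name run_pyspark_demo, and the application name are invented for illustration and are not taken from the original notebook.

```python
# Hypothetical sample data: (name, age) pairs invented for this demo.
SAMPLE_ROWS = [("Alice", 34), ("Bob", 45), ("Cara", 29)]


def run_pyspark_demo(rows, min_age):
    """Build a small DataFrame and return the names of people older than min_age."""
    # Imported inside the function so that the cell defining it still loads
    # in an environment where PySpark is not installed.
    from pyspark.sql import SparkSession

    # local[*] runs Spark inside the notebook container on all available cores.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("pyspark-docker-demo")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, ["name", "age"])
        df.show()  # print the DataFrame as a text table
        older = df.filter(df.age > min_age).orderBy("age")
        return [row["name"] for row in older.collect()]
    finally:
        spark.stop()  # release the local Spark resources
```

Calling run_pyspark_demo(SAMPLE_ROWS, 30) in the next cell prints the table and returns the names of the two people older than 30, ordered by age.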
The second notebook gives a brief introduction to NumPy.
NumPy Demo using Docker
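A few typical first steps with NumPy can be sketched as follows. The array values and variable names are made up for illustration and are not taken from the original notebook.

```python
import numpy as np

# A 2 x 3 matrix built from nested Python lists (illustrative values).
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix.shape)       # dimensions of the array as (rows, columns)
print(matrix.T)           # transpose: rows become columns
print(matrix * 2)         # element-wise multiplication by a scalar

column_sums = matrix.sum(axis=0)   # sum down each column
column_means = matrix.mean(axis=0) # mean of each column
gram = matrix @ matrix.T           # matrix product, giving a 2 x 2 result

print(column_sums)
print(column_means)
print(gram)
```

Operations such as sum and mean take an axis argument: axis=0 works down the columns and axis=1 works across the rows.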
I hope this helps you start your data analysis journey and use Docker to make your programs portable.