If you're just getting started on your data scientist/analyst journey and want to explore some of the basic libraries first, then this article is for you. I am going to cover the Pandas library and demonstrate some of its basic but very useful functions.
Pandas is a Python library that is mostly used for loading data into data frames and analysing it. You can think of Pandas as an Excel spreadsheet; one such spreadsheet is called a data frame in Pandas.
Basically, when you read any kind of data in Pandas it constructs a data frame, and you then use several functions to act on that data frame. Let's understand it better by doing. As always, I'm using Docker to run Jupyter notebooks with Pandas functionality. If you want to know how to set up Docker with Jupyter, please check out my article.
Importing Pandas and reading data
Before doing anything, you have to import Pandas in your notebook. Once that is done, you can load data and construct a data frame. You can load data either from a local file or by pointing to a web URL.
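As a minimal sketch of this step, the snippet below imports Pandas and loads a small CSV file. The file name sample.csv and its contents are made up for illustration (they are not the data set used in this article), and the file is written to a temporary folder only so that the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Made-up sample data without a header row (an assumption for illustration).
csv_text = "1,apple,0.5\n2,banana,0.25\n3,cherry,3.0\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical local file; in practice this would be your own path.
    path = os.path.join(tmp_dir, "sample.csv")
    with open(path, "w") as f:
        f.write(csv_text)

    # read_csv builds a data frame from the file.
    # A web address can be passed in place of the path in the same way.
    df = pd.read_csv(path)

# By default the first row of the file is treated as the header.
print(df.head())
print(list(df.columns))
```

Note that only two data rows remain here, because the first line of the file was consumed as the header.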
In my first attempt, I imported Pandas in my notebook, loaded a data set from my local system, constructed a data frame and displayed the first five records. But I hit an issue right away: my data set has no header, so Pandas used the first row of my data as the header of the data frame. That's not what I'm looking for.
Let's correct that problem by adding the parameter header=None to the read call, which tells Pandas that the file has no header row.
Now there is a second problem to address: I need to assign a header to my data set so that I can understand the meaning of each value. I created a list of headers and assigned it to the data frame through the df.columns attribute. Now when you list the contents of the data frame, you can see the header at the top.
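The two fixes can be sketched as follows. The data and the column names id, fruit and price are hypothetical stand-ins, and io.StringIO is used in place of a real file only to keep the example self-contained.

```python
import io

import pandas as pd

# Made-up CSV content with no header row (an assumption for illustration).
csv_text = "1,apple,0.5\n2,banana,0.25\n3,cherry,3.0\n"

# header=None keeps the first row as data; columns get integer labels 0, 1, 2.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Hypothetical header names, assigned through the df.columns attribute.
headers = ["id", "fruit", "price"]
df.columns = headers

print(df)
```

The list of headers must have exactly as many entries as the data frame has columns, otherwise Pandas raises an error.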
Now let's go through some of the common functions we can use with our data frame.
df.head() and df.tail()
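df.head() returns the first rows of a data frame and df.tail() returns the last ones; both default to five rows and accept a different count. Here is a small sketch on a made-up data frame (the column names are assumptions for illustration).

```python
import pandas as pd

# Made-up data frame with ten rows.
df = pd.DataFrame({
    "id": range(1, 11),
    "square": [n * n for n in range(1, 11)],
})

first_five = df.head()    # first 5 rows by default
last_three = df.tail(3)   # last 3 rows

print(first_five)
print(last_three)
```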
df.dtypes and df.describe()
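df.dtypes lists the data type of every column, and df.describe() returns summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns. A short sketch, again on made-up data:

```python
import pandas as pd

# Made-up data frame with one text column and two numeric columns.
df = pd.DataFrame({
    "fruit": ["apple", "banana", "cherry"],
    "price": [0.5, 0.25, 3.0],
    "stock": [10, 20, 30],
})

# dtypes is an attribute, so it is used without parentheses.
print(df.dtypes)

# describe() skips the text column by default.
summary = df.describe()
print(summary)
```

Checking dtypes early is a good habit, since a numeric column that was read as text will be left out of describe().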
Working with Missing Data
It is possible that some of the values in your data frame are shown as NaN. NaN is shorthand for Not a Number and it represents missing data. Depending on the data set, missing data can also appear as a ?, a 0 (zero) or simply an empty value.
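Before handling missing data it helps to bring every placeholder to NaN and count what is missing. The sketch below assumes a made-up data frame in which a ? marks a missing price; your own data set may use a different marker.

```python
import numpy as np
import pandas as pd

# Made-up data: "?" and NaN both stand for missing values here.
df = pd.DataFrame({
    "price": ["0.5", "?", "3.0"],
    "stock": [10, np.nan, 30],
})

# Turn the "?" placeholder into a real NaN.
df = df.replace("?", np.nan)

# The price column was read as text because of the "?"; convert it to numbers.
df["price"] = df["price"].astype(float)

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```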
There are several approaches to tackling missing data. Let's take a look at how missing data can be replaced or dropped.
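Replace Missing Values
A missing value can be replaced with a substitute, for example the mean of its column, using df.replace() or df.fillna().
Drop Missing Values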
With df.dropna(), data can be dropped along rows (axis=0, which is also the default) or along columns (axis=1), as set by the axis parameter. A couple of examples demonstrate it: first dropping rows that contain missing values, and then dropping columns.
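Both approaches, replacing and dropping, can be sketched on one small made-up data frame. Filling with the column mean through fillna() is only one possible replacement strategy and is an assumption here, not necessarily what suits your data.

```python
import numpy as np
import pandas as pd

# Made-up data frame with one missing price and one missing stock value.
df = pd.DataFrame({
    "fruit": ["apple", "banana", "cherry"],
    "price": [0.5, np.nan, 3.0],
    "stock": [10.0, 20.0, np.nan],
})

# Replace: fill the missing price with the mean of the known prices.
mean_price = df["price"].mean()  # mean() ignores NaN by default
filled = df.copy()
filled["price"] = filled["price"].fillna(mean_price)

# Drop rows (axis=0, the default) that contain any missing value.
rows_dropped = df.dropna(axis=0)

# Drop columns (axis=1) that contain any missing value.
cols_dropped = df.dropna(axis=1)

print(filled)
print(rows_dropped)
print(cols_dropped)
```

dropna() returns a new data frame and leaves df untouched, so the result has to be assigned to a variable to keep it.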
I believe this is enough to get started with. I will cover some more functions in upcoming articles, as this one is getting very long.
As always, Happy Learning!