Pandas: The Gateway to Data Exploration and Visualization
A beginner's guide to the workings of Pandas, an open-source data structures and data analysis library for Python
What is Pandas?
Pandas may share its name with the dearly beloved panda, but the name is actually derived from "panel data". It is a popular open-source Python library for data manipulation and analysis, offering data structures and functions for working with numerical data.
The library includes a DataFrame object for multivariate data manipulation and a Series object for univariate data manipulation, both with integrated indexing. It provides vectorized methods for data manipulation, functions for merging, joining, reshaping, and pivoting data sets, and, most importantly, tools for reading and writing data in different file formats.
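As a rough illustration of the two objects and of vectorization, the sketch below builds a Series and a DataFrame by hand and derives a BMI column in a single expression. The column names and values are made up for this example and are not taken from the data set used later.

```python
import pandas as pd

# Hypothetical values, used only for illustration
heights_cm = pd.Series([152.0, 161.5, 165.0], name="Height (cm)")  # univariate data

frame = pd.DataFrame({  # multivariate data
    "Weight (kg)": [44.6, 65.0, 68.8],
    "Height (cm)": [152.0, 161.5, 165.0],
})

# Vectorized arithmetic: the expression is applied to whole columns at once,
# so no explicit Python loop is needed
frame["BMI"] = frame["Weight (kg)"] / (frame["Height (cm)"] / 100) ** 2
print(frame.round(1))
```

Both objects carry an index (0, 1, 2 here), which is what Pandas uses to align rows when columns are combined.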
So, let's dive into the workings of Pandas. To install Pandas, follow the official documentation.
In your Google Colab or Jupyter Notebook, import Pandas and assign it an alias for easy reference throughout the notebook. The most common alias is pd:
import pandas as pd
Data exploration is typically the first step of data analysis: you explore and visualize the data to uncover early insights or to identify areas and patterns worth digging into. Using interactive dashboards and point-and-click data exploration, users can better understand the bigger picture and get insights faster.
The beauty of Pandas is that it allows you to work with data stored in different file formats. As a data analyst, you need to be flexible and ready to work with all sorts of file formats thrown at you. However, the most common file format is CSV.
Reading a CSV
data = pd.read_csv('csv_file_name.csv')
data is a variable that will store the file we are reading
pd is the alias for Pandas
read_csv is a Pandas function for reading the CSV file
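To try read_csv without a file on disk, here is a minimal sketch that reads CSV text held in memory. The column names echo ones used later in this article, but the three rows are made up for illustration.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for a file on disk (made-up values)
csv_text = "Sl. No,Age (yrs),BMI\n1,28,19.3\n2,36,24.9\n3,33,25.3\n"

# read_csv accepts a file path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # header names taken from the first line
```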
Reading a Spreadsheet
# Read the first sheet of a spreadsheet (the default)
data = pd.read_excel('spreadsheet_file_name.xlsx')
# Read a specific sheet in the spreadsheet
data = pd.read_excel('spreadsheet_file_name.xlsx', sheet_name='sheet_name')
To explore data stored in HTML tables, we first need to ensure that the BeautifulSoup package is installed using the following command:
pip install BeautifulSoup4
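Then run the following code to read data from the HTML file while importing BeautifulSoup:
from bs4 import BeautifulSoup
# read_html returns a list of DataFrames, one per table found on the page
tables = pd.read_html('url_to_html_file')
data = tables[0]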
For this article, we are going to use the "PCOS" data set. This is non-personal data for Polycystic ovary syndrome obtained from Kaggle. The data set is a CSV file.
We will first check the information about the data set to see how many entries it contains. This also helps you know whether the data set has null values that need to be cleaned up.
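A minimal sketch of that check, assuming a small hand-made DataFrame in place of the real file: info() reports the number of entries and the non-null count for each column, and isnull().sum() counts the missing values per column. The values are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing BMI value
sample = pd.DataFrame({
    "Age (yrs)": [28, 36, 33, 37],
    "BMI": [19.3, np.nan, 25.3, 29.7],
})

# Prints the entry count, the non-null count per column, and the dtypes
sample.info()

# Number of null values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

A column whose non-null count is lower than the number of entries is one that needs cleaning.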
When you notice null values, the best approach is to either drop them or fill them with data. Dealing with null values helps you get accurate insights from the data during analysis and visualization.
# Use the any() function to check for null values by column
pd.isnull(data).any()
# Return a copy with every row that contains a null value dropped
# (pass axis=1 to drop columns instead)
data.dropna()
# Fill the null values in BMI with the average BMI and store the result
data['BMI'] = data['BMI'].fillna(value=data['BMI'].mean())
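The difference between the two options is easier to see on a small example. The sketch below uses made-up values; note that dropna() and fillna() return new objects and leave the original untouched unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing BMI value
sample = pd.DataFrame({
    "Age (yrs)": [28, 36, 33, 37],
    "BMI": [20.0, np.nan, 26.0, 29.0],
})

dropped_rows = sample.dropna()        # removes rows that contain a null
dropped_cols = sample.dropna(axis=1)  # removes columns that contain a null

# Fill only the BMI column with its mean (mean() skips nulls by default)
filled = sample.fillna({"BMI": sample["BMI"].mean()})

print(len(dropped_rows), list(dropped_cols.columns))
print(filled)
```

Dropping loses a whole row of otherwise valid data, while filling keeps the row but replaces the gap with an estimate, so the right choice depends on how many values are missing.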
After fixing the null values, we can now go ahead and take a sneak peek at our data. The head() function displays the first five (5) entries in the data set, while the tail() function displays the last five (5) entries.
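As a quick sketch, both functions also accept the number of rows to return. The ten-row DataFrame here is made up for illustration.

```python
import pandas as pd

# Ten made-up entries numbered 1 to 10
sample = pd.DataFrame({"Sl. No": range(1, 11)})

first_five = sample.head()   # default is 5 rows
last_three = sample.tail(3)  # pass a number to change how many rows are shown

print(first_five)
print(last_three)
```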
Exploring the Data
The GroupBy function
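Looking at our data, we can keep only the entries with PCOS, group them by age, and count the cases for each age to see which age is affected most.
# Assumption: the PCOS flag column is named 'PCOS (Y/N)' with 1 meaning yes; adjust it to match your file
pcosCases = data[data['PCOS (Y/N)'] == 1]
# size() counts the rows in each age group; to_frame() names the resulting column
groupbyPcos = pcosCases.groupby('Age (yrs)').size().to_frame('Number of PCOS Cases')
groupbyPcos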
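To make the grouping step concrete, here is a small self-contained sketch. The "PCOS" flag column and all of the values are hypothetical and stand in for the real data set.

```python
import pandas as pd

# Made-up rows; "PCOS" is a hypothetical 0/1 flag column (1 = has PCOS)
sample = pd.DataFrame({
    "Age (yrs)": [28, 28, 33, 33, 33, 36],
    "PCOS": [1, 0, 1, 1, 0, 1],
})

cases_by_age = (
    sample[sample["PCOS"] == 1]  # keep only the PCOS cases
    .groupby("Age (yrs)")        # one group per distinct age
    .size()                      # number of rows in each group
    .to_frame("Number of PCOS Cases")
)

# idxmax() returns the index label (the age) with the highest count
most_affected_age = cases_by_age["Number of PCOS Cases"].idxmax()
print(cases_by_age)
print(most_affected_age)
```

Using size() gives a single count per group, whereas count() returns one non-null count for every remaining column.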
The sum() function
The data set we are dealing with has data on age, height, BMI, etc., and we can sum up this data for each group:
# numeric_only=True skips any text columns that cannot be summed
groupbyPcos = data.groupby('BMI').sum(numeric_only=True)
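Because BMI is a continuous value, grouping on it directly tends to produce almost one group per row. One possible refinement, sketched below on made-up data, is to bin BMI into bands with pd.cut before summing. The band edges and labels are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Made-up values for illustration
sample = pd.DataFrame({
    "BMI": [17.5, 22.0, 24.0, 27.0, 31.0],
    "Weight (Kg)": [45.0, 58.0, 62.0, 70.0, 85.0],
})

# Assumed band edges; each interval includes its right edge
sample["BMI band"] = pd.cut(
    sample["BMI"],
    bins=[0, 18.5, 25, 30, 100],
    labels=["under", "normal", "over", "obese"],
)

# Sum the weights within each band; observed=True keeps only bands that occur
weight_by_band = sample.groupby("BMI band", observed=True)["Weight (Kg)"].sum()
print(weight_by_band)
```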
Data visualization allows users to explore and analyze data quickly and easily. It is a good way to get visual insights into patterns that you are likely to have in your model.
A line plot is a basic plot that shows the trend of data over time or another ordered variable. In Pandas, you can create a line plot using the plot() function. To visualize a simple line plot of the data we have, we run the following code:
# Line plot of BMI against Sl. No
data.plot(x="Sl. No", y="BMI", figsize=(12, 4))
A bar plot is a plot that compares values across different categories. In Pandas, you can create a bar plot using the plot.bar() function. To visualize a bar plot of the data, we run the following code:
data.plot.bar(x="Sl. No", y="BMI", figsize=(12, 4))
A scatter plot is a plot that shows the relationship between two variables. In Pandas, you can create a scatter plot using the plot.scatter() function. To visualize a scatter plot of the data, we run the following code:
data.plot.scatter(x="Sl. No", y="BMI", figsize=(12, 4))
A box plot is a plot that shows the distribution of data using quartiles. In Pandas, you can create a box plot using the plot.box() function. To visualize a box plot of the data, we run the following code:
# A box plot summarizes a single column, so no x-axis column is needed
data["BMI"].plot.box(figsize=(12, 4))
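The quartiles that a box plot draws can also be computed directly, which helps when reading the chart. This is a minimal sketch on five made-up BMI values.

```python
import pandas as pd

# Made-up BMI values for illustration
bmi = pd.Series([19.0, 21.0, 23.0, 25.0, 27.0], name="BMI")

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = bmi.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range, the height of the box

print(q1, median, q3, iqr)
```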
In conclusion, Pandas is a very powerful library used to manipulate data and visualize it for analysis. It offers a wide range of functions to enable a user to play around with their data and make sense of it.
PS: Access the dataset used here. (License: MIT License)