PandasAI: The Generative AI Library

PandasAI is a Python library that adds generative artificial intelligence capabilities to Pandas.

Introduction

What is PandasAI?

PandasAI is an advanced library built on top of the popular Pandas library, designed to provide enhanced functionality for data manipulation, analysis, and AI-driven tasks. With PandasAI, you can efficiently handle large datasets, perform complex operations, and leverage artificial intelligence techniques seamlessly. In this article, we will explore the key features of PandasAI with practical examples and code snippets.

Read more about Pandas here.

Key Features of PandasAI

PandasAI extends the functionality of Pandas with additional features. Some of the key features are:

Feature Engineering: PandasAI offers a wide range of feature engineering techniques such as one-hot encoding, binning, scaling, and generating new features.

AI-driven Operations: PandasAI integrates with popular AI libraries like scikit-learn and TensorFlow, so machine learning and deep learning algorithms can be applied to Pandas DataFrames.

Exploratory Data Analysis (EDA): It provides various statistical and visualization tools for EDA, including descriptive statistics, correlation analysis, and interactive visualizations.

Time Series Analysis: PandasAI includes powerful tools for handling time series data, such as resampling, lagging, rolling computations, and date-based operations.

Installation of PandasAI

PandasAI is installed with the pip package manager. Run the following command:

pip install pandasai

If you wish to use PandasAI in a Google Colab Notebook like I am doing, you need to run the following commands to install PandasAI and other necessary modules:

!pip install pandasai
!pip install langchain

After installing PandasAI, import it alongside Pandas. The imports below match the early releases of the library, which expose a PandasAI class; later releases changed the public API, so pin a compatible version if these imports fail.

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

To use OpenAI's models, we first need to generate an OpenAI API key here. Also, make sure you have a paid account with access to OpenAI's Large Language Models (LLMs), which are priced per 1,000 tokens.

Functionality of PandasAI

Before exploring what PandasAI can do, we first create an OpenAI LLM object with our API key and then pass it to a new PandasAI instance.

# Loading the API token to OpenAI environment
env = OpenAI(api_token='OpenAI API Key')

# Initializing an instance of PandasAI with OpenAI environment
pandasAi = PandasAI(env)
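Hardcoding the key in a notebook makes it easy to leak. A minimal sketch of one alternative, assuming the key is stored in an environment variable (the variable name OPENAI_API_KEY and the helper load_api_key are illustrative choices, not part of PandasAI):

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    # The variable name is an assumption; use whatever name you exported.
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage with the setup above (requires pandasai and a valid key):
# env = OpenAI(api_token=load_api_key())
```

The helper fails early with a clear message when the variable is missing, instead of sending an empty token to the API.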

Data Exploration

In this article, we are going to use the employee dataset containing information about employees, including their names, ages, salaries, and departments. However, this dataset has missing values in some of the columns. Below is a screenshot of the dataset:
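The snippets that follow use a DataFrame called data, but its creation is not shown. A minimal sketch of a stand-in, assuming columns named Name, Age, Salary, and Department (apart from Marvin's salary, which is quoted later in the article, every name and value here is invented for illustration and is not the article's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the employee dataset; the rows are invented.
data = pd.DataFrame(
    {
        "Name": ["Alice", "Brian", "Chen", "Marvin", "Emeka", "Farah"],
        "Age": [29, 41, np.nan, 35, 52, 24],
        "Salary": [55000, np.nan, 72000, 2000000, np.nan, 48000],
        "Department": ["Sales", "IT", "Sales", "Finance", None, "Sales"],
    }
)

# Count missing values per column to see where the gaps are.
missing_per_column = data.isna().sum()
print(missing_per_column)
```

If the dataset lives in a CSV file, pd.read_csv with that file's path would produce the same kind of DataFrame.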

To explore our data, we prompt pandasAi by passing in our DataFrame and the question we wish to ask. In the code snippet below, we ask PandasAI which employees have null values in the dataset.

question = "Which employees have null values?"
pandasAi.run(data, prompt=question)

Output:

To check whether PandasAI is correct with the output given from our dataset, we can run a query for null values using Pandas itself with the isna() function.

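# Select the rows that contain at least one null value.
# Store them in a new variable so the original data is not overwritten.
null_rows = data[data.isna().any(axis=1)]
# Print the rows with the null values
print(null_rows)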

Output:

From the above output, we can see that both Pandas and PandasAI return the same rows that have null values in our dataset.

Next, we can prompt pandasAi to tell us which employee earns the most in the dataset. We will run the following code:

question = "Who earns more than the others?"
pandasAi.run(data, prompt=question)

Output:

With a salary of 2000000, Marvin is the highest-paid employee. And we can see from the output above that PandasAI is correct.
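The same answer can be cross-checked with plain Pandas. A minimal sketch, assuming columns named Name and Salary and a small invented table (only Marvin's salary comes from the article):

```python
import pandas as pd

# Invented sample rows; the missing salary becomes NaN.
data = pd.DataFrame(
    {
        "Name": ["Marvin", "Alice", "Brian"],
        "Salary": [2000000, 55000, None],
    }
)

# idxmax skips missing values and returns the index label of the largest salary.
top_row = data.loc[data["Salary"].idxmax()]
print(top_row["Name"], top_row["Salary"])
```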

Up next, we can ask PandasAI to fill in the null values for the employees without salaries. To do this, we are going to run the following code:

question = "Fill in only the null values to 5 figure salaries"
pandasAi.run(data, prompt=question)

Output:

From the above output, we can see that PandasAI filled in a 5-figure salary for each employee who previously had a null value. The salary is the same for each of them, but you can always adjust the prompt to randomize the numbers.
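For comparison, both behaviours (one uniform value, or a random value per employee) can be written directly in Pandas. A minimal sketch, assuming a Salary column and invented sample rows; the fill value 50000 and the random seed are arbitrary choices:

```python
import numpy as np
import pandas as pd

# Invented sample rows with two missing salaries.
data = pd.DataFrame(
    {
        "Name": ["Alice", "Brian", "Chen", "Emeka"],
        "Salary": [55000.0, np.nan, 72000.0, np.nan],
    }
)
missing = data["Salary"].isna()

# Option 1: one uniform five-figure value for every missing salary.
uniform = data.assign(Salary=data["Salary"].fillna(50000))

# Option 2: a different random five-figure value per missing salary.
# integers() excludes the upper bound, so values fall in 10000..99999.
rng = np.random.default_rng(seed=0)
randomized = data.copy()
randomized.loc[missing, "Salary"] = rng.integers(10000, 100000, size=int(missing.sum()))

print(uniform)
print(randomized)
```

Both options leave the existing salaries untouched and return new DataFrames rather than modifying data in place.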

In the next prompt, we can ask PandasAI to tell us how many employees are in different departments. This is crucial if one wishes to analyze employee data to know how employees are distributed across departments. In the code below, we prompt for the number of employees in the Sales department:

question = "How many employees are in the Sales department?"
pandasAi.run(data, prompt=question)

Output:

Indeed there are 5 employees in the Sales department in our dataset.
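A count like this is easy to confirm with Pandas. A minimal sketch, assuming a Department column; the rows below are invented, so the count differs from the article's dataset:

```python
import pandas as pd

# Invented sample rows for illustration.
data = pd.DataFrame(
    {
        "Name": ["Alice", "Brian", "Chen", "Dina", "Farah"],
        "Department": ["Sales", "IT", "Sales", "Finance", "Sales"],
    }
)

# Number of employees in one department.
sales_count = (data["Department"] == "Sales").sum()

# Number of employees in every department at once.
per_department = data["Department"].value_counts()

print(sales_count)
print(per_department)
```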

There is much more a data analyst can prompt PandasAI to do in terms of data exploration. As you can see, we do not write traditional Pandas code and queries to analyze our dataset. Instead, we supply English-language prompts, and the library uses OpenAI's LLM capabilities to return the outputs.

Data Visualization

As with Pandas, PandasAI can also be used to visualize data with simple prompts supplied to pandasAi. Below, we prompt PandasAI to plot visual charts of our data.

question = "Plot a barplot of all employees and their salaries"
pandasAi.run(data, prompt=question)

Output:

We can also plot the employees' salaries grouped by department to see which departments pay more in total than others. To do that, we adjust the prompt as seen below:

question = "Plot a barplot of all employees and their salaries grouped in their departments"
pandasAi.run(data, prompt=question)

Output:
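For comparison, a similar grouped chart can be produced directly with Pandas and matplotlib. A minimal sketch, assuming matplotlib is installed, columns named Department and Salary, and invented sample rows; summing salaries per department is one possible reading of the prompt, not necessarily what PandasAI generates:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import pandas as pd

# Invented sample rows for illustration.
data = pd.DataFrame(
    {
        "Name": ["Alice", "Brian", "Chen", "Dina"],
        "Department": ["Sales", "IT", "Sales", "Finance"],
        "Salary": [55000, 90000, 72000, 150000],
    }
)

# Total salary per department, largest first.
totals = data.groupby("Department")["Salary"].sum().sort_values(ascending=False)

# One bar per department.
ax = totals.plot(kind="bar")
ax.set_xlabel("Department")
ax.set_ylabel("Total salary")
ax.set_title("Total salary by department")
```

In a notebook the figure renders inline, so the backend line can be dropped there.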

Next, we can plot a boxplot of the employees' salaries and ages with the following prompt:

question = "Plot a boxplot out of the employee salary and age"
pandasAi.run(data, prompt=question)

Output:

We can also show the relationship between the employees' salaries and their ages. To do that, we run the following prompt:

question = "Plot a scatter graph from the employee data"
pandasAi.run(data, prompt=question)

Output:

Conclusion

PandasAI extends the capabilities of Pandas by providing advanced data manipulation, analysis, and AI-driven operations. In this article, we covered its key features and use cases, with examples and code snippets to illustrate how PandasAI works. By leveraging PandasAI, you can streamline your data preprocessing pipeline and integrate AI techniques into your Pandas workflows.

Resources

Google Colab Notebook

Dataset

PandasAI Repository

PandasAI Docs