Pandas Tutorials for Beginners

Pandas is an open-source data manipulation and analysis library for Python. It provides data structures and tools for working efficiently with structured data. In this tutorial, you'll learn the basics of Pandas and how to use its core functionality for data manipulation, cleaning, and analysis.

1. Introduction to Pandas

Pandas is built on top of the NumPy library and provides data structures that are essential for data analysis in Python. The two primary data structures in Pandas are Series and DataFrame.

Series: A one-dimensional labeled array that can hold data of any type.

DataFrame: A two-dimensional labeled table whose columns can each hold a different data type.

2. Installation

You can install Pandas using pip, the standard Python package installer:

pip install pandas

3. Pandas Data Structures

Series

A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, etc.). It's similar to a column in a spreadsheet or a one-dimensional NumPy array.

import pandas as pd

data = [10, 20, 30, 40, 50]
series = pd.Series(data, name="MySeries")
print(series)
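
Because a Series is labeled, its values can be looked up by label as well as by position. The following is a minimal sketch of that idea; the labels "a", "b", "c" and the variable name prices are made up for illustration.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration.
prices = pd.Series([10, 20, 30], index=["a", "b", "c"], name="MySeries")

print(prices["b"])     # access by label
print(prices.iloc[0])  # access by position
print(prices.dtype)    # data type of the stored values
```

If no index is passed, as in the example above, Pandas assigns integer labels starting at 0.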

DataFrame

A DataFrame is a two-dimensional table of data with rows and columns. It's the most commonly used Pandas object and is often used to represent tabular data.

data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)
print(df)
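
After creating a DataFrame, it is often useful to check its size and the data type of each column. A short sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
    "City": ["New York", "San Francisco", "Los Angeles"],
})

print(df.shape)          # (number of rows, number of columns)
print(df.dtypes)         # one data type per column
print(list(df.columns))  # column labels
```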

4. Loading Data

Pandas provides various functions to load data from different sources:

CSV Files

csv_data = pd.read_csv('data.csv')
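
To try read_csv without a file on disk, the CSV text can be wrapped in an in-memory buffer. This is only a sketch: the CSV content below is invented and stands in for whatever 'data.csv' would contain.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'.
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,San Francisco\n"

# read_csv accepts a file path or any file-like object.
csv_data = pd.read_csv(io.StringIO(csv_text))
print(csv_data.head())
```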

Excel Files

# Reading .xlsx files requires the optional openpyxl package (pip install openpyxl)
excel_data = pd.read_excel('data.xlsx', sheet_name='Sheet1')

SQL Databases

import sqlite3

conn = sqlite3.connect('database.db')
sql_query = "SELECT * FROM table_name"
sql_data = pd.read_sql(sql_query, conn)
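
The snippet above assumes that 'database.db' and its table already exist. As a self-contained sketch, the same call can be tried against an in-memory SQLite database; the table name people and its rows are invented for illustration.

```python
import sqlite3

import pandas as pd

# In-memory database with a hypothetical 'people' table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Alice", 25), ("Bob", 30)],
)
conn.commit()

# Run the query and load the result set into a DataFrame.
sql_data = pd.read_sql("SELECT * FROM people", conn)
conn.close()
print(sql_data)
```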

5. Data Manipulation

Selecting and Indexing

Selecting a column

ages = df['Age']

Selecting multiple columns

subset = df[['Name', 'City']]

Selecting a row by index label

row = df.loc[0]
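
Note that loc selects by index label, while iloc selects by integer position. With the default index the two coincide, so the difference only shows up with custom labels. A small sketch, using made-up labels "x", "y", "z":

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 22]},
    index=["x", "y", "z"],  # hypothetical labels for illustration
)

by_label = df.loc["y"]    # row whose index label is "y"
by_position = df.iloc[0]  # first row, whatever its label is

print(by_label["Name"])
print(by_position["Name"])
```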

Filtering Data

Filtering based on a condition

young_people = df[df['Age'] < 30]
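
Several conditions can be combined with & (and) and | (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A sketch using the sample data from earlier:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
    "City": ["New York", "San Francisco", "Los Angeles"],
})

# Keep rows where both conditions are true.
young_outside_ny = df[(df["Age"] < 30) & (df["City"] != "New York")]
print(young_outside_ny)
```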

Adding and Removing Columns

Adding a new column

df['Gender'] = ['Female', 'Male', 'Male']

Removing a column

df.drop('Gender', axis=1, inplace=True)

Handling Missing Data

Checking for missing values

missing_values = df.isnull().sum()

Dropping rows with missing values

df.dropna(inplace=True)

Filling missing values

df['Age'] = df['Age'].fillna(df['Age'].mean())
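
The sample DataFrame used so far has no missing values, so the calls above leave it unchanged. The following sketch uses invented data with one missing age to show what each step does. It returns new objects instead of modifying df in place, which is an illustrative choice rather than a requirement.

```python
import pandas as pd

# Hypothetical data with one missing age (None becomes NaN).
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25.0, None, 22.0],
})

missing_values = df.isnull().sum()  # count of missing values per column
dropped = df.dropna()               # rows containing NaN are removed
filled = df.assign(Age=df["Age"].fillna(df["Age"].mean()))  # NaN replaced by the mean

print(missing_values)
print(dropped)
print(filled)
```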

Sorting and Ranking

Sorting by a column

sorted_df = df.sort_values(by='Age')

Ranking

df['Age_Rank'] = df['Age'].rank(ascending=False)
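
As a worked sketch on the sample ages, sorting orders the rows while ranking assigns each value its position in the ordering. With ascending=False the largest value gets rank 1; tied values share the average of their ranks by default.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
})

sorted_df = df.sort_values(by="Age")   # youngest first
ranks = df["Age"].rank(ascending=False)  # oldest gets rank 1

print(sorted_df["Name"].tolist())
print(ranks.tolist())
```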

6. Data Analysis

Descriptive Statistics

Summary statistics of the dataframe

summary_stats = df.describe()

Correlation matrix

# numeric_only=True skips text columns such as 'Name' and 'City'
correlation_matrix = df.corr(numeric_only=True)

Grouping and Aggregation

Grouping by a column and calculating mean

grouped = df.groupby('City')['Age'].mean()
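
Grouping is more informative when a group contains several rows. The sketch below uses invented data with a repeated city and computes more than one aggregate at once with agg.

```python
import pandas as pd

# Hypothetical data in which one city appears twice.
df = pd.DataFrame({
    "City": ["New York", "New York", "Los Angeles"],
    "Age": [25, 35, 22],
})

# One row per city, one column per aggregate.
summary = df.groupby("City")["Age"].agg(["mean", "count"])
print(summary)
```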

Pivot Tables

# Requires a 'Gender' column; it was dropped earlier, so add it back before running this
pivot_table = df.pivot_table(index='City', columns='Gender', values='Age', aggfunc='mean')
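
A pivot table places one column's values along the rows, another's along the columns, and aggregates a third in the cells. A self-contained sketch with invented data:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({
    "City": ["New York", "New York", "Los Angeles", "Los Angeles"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Age": [25, 35, 22, 28],
})

# Rows are cities, columns are genders, cells hold the mean age.
pivot = df.pivot_table(index="City", columns="Gender", values="Age", aggfunc="mean")
print(pivot)
```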

7. Data Visualization

Pandas works well with data visualization libraries such as Matplotlib and Seaborn, which accept DataFrames directly.

import matplotlib.pyplot as plt
import seaborn as sns

# Plotting a bar chart
sns.barplot(x='City', y='Age', data=df)
plt.show()
