Before reading this post, make sure you have a good understanding of Python basics - variables, data types, loops, functions, and lists. You should also have basic knowledge of NumPy, since Pandas is built on top of it. If you have not covered NumPy yet, go through the NumPy tutorial post first and then come back here.
This tutorial works on both VS Code and Google Colab. NumPy and Pandas must be installed on your system before running any code in this post.
Pandas is one of the most widely used Python libraries in data science and data analysis. Whether you are cleaning data, analysing trends, preparing datasets for machine learning, or generating reports - Pandas is the tool that gets the job done.
In this post, we are going to cover Pandas from scratch - creating Series and DataFrames, loading real data, inspecting it, filtering, sorting, grouping, merging, and exporting - all with real code and real output. By the end of this post you will have a solid understanding of how Pandas works and how to use it in real projects.
What is Pandas?
Pandas gets its name from "panel data", an econometrics term for multi-dimensional datasets. It is an open-source Python library built for data manipulation and analysis. It was originally developed for financial data analysis and was later adopted across the entire data science community.
The simplest way to understand Pandas is to think of it as Excel inside Python - but far more powerful. You can handle millions of rows, automate repetitive tasks, connect to databases, read from CSV and Excel files, and perform complex transformations - all in a few lines of code.
Installing and Importing Pandas
If you are using VS Code, open your terminal and run:
pip install pandas
If you are using Google Colab, Pandas is already available. No installation is needed.
At the top of every Python file or notebook, always import Pandas and NumPy like this:
import pandas as pd
import numpy as np
The alias pd is the universal convention for Pandas. Use it in every project.
Pandas Series
A Series is a one-dimensional labelled array. Think of it as a single column from a spreadsheet - with an index on the left and values on the right.
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
print(s)
Output:
0    10
1    20
2    30
3    40
4    50
dtype: int64
You can create a Series with custom labels instead of default numeric index:
scores = pd.Series([85, 90, 78, 95], index=['Maths', 'Science', 'English', 'History'])
print(scores)
Output:
Maths      85
Science    90
English    78
History    95
dtype: int64
Accessing values from a Series:
print(scores['Science'])      # Access by label
print(scores[scores > 85])    # Boolean filtering
Output:
90
Science    90
History    95
dtype: int64
Creating a Series from a dictionary:
data = {'Monday': 200, 'Tuesday': 350, 'Wednesday': 300}
sales = pd.Series(data)
print(sales)
Output:
Monday       200
Tuesday      350
Wednesday    300
dtype: int64
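Because a Series is backed by a NumPy array, arithmetic applies to every element at once and common statistics come built in. A short supplementary sketch (the 1.18 multiplier is just an illustrative number, not part of the original example):

```python
import pandas as pd

sales = pd.Series({'Monday': 200, 'Tuesday': 350, 'Wednesday': 300})

# Vectorised arithmetic: no loop needed, every element is scaled at once
scaled = sales * 1.18   # hypothetical 18% markup, purely for illustration
print(scaled)

# Built-in aggregations
print(sales.sum())    # 850
print(sales.mean())
```

This is the same vectorisation you saw in NumPy, carried over with labels attached.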
Pandas DataFrame
A DataFrame is a two-dimensional table with rows and columns. This is the main data structure you will use in almost every Pandas project. Think of it as a full spreadsheet or a database table.
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [25, 30, 35, 28, 22],
'City': ['Mumbai', 'Delhi', 'Chennai', 'Pune', 'Hyderabad'],
'Salary': [50000, 70000, 65000, 80000, 45000]
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age       City  Salary
0    Alice   25     Mumbai   50000
1      Bob   30      Delhi   70000
2  Charlie   35    Chennai   65000
3    Diana   28       Pune   80000
4      Eve   22  Hyderabad   45000
Accessing a single column:
print(df['Name'])
Output:
0      Alice
1        Bob
2    Charlie
3      Diana
4        Eve
Name: Name, dtype: object
Accessing multiple columns:
print(df[['Name', 'Salary']])
Output:
      Name  Salary
0    Alice   50000
1      Bob   70000
2  Charlie   65000
3    Diana   80000
4      Eve   45000
Loading Data from Files
In real projects you will load data from external files rather than creating DataFrames manually.
# Load from CSV
df = pd.read_csv('sales_data.csv')
# Load from Excel
df = pd.read_excel('sales_data.xlsx')
# Load from JSON
df = pd.read_json('data.json')
The read_csv() function is the most commonly used. One line is all it takes to load an entire file into a DataFrame.
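Since sales_data.csv is not bundled with this post, here is a self-contained way to try read_csv(): it accepts any file-like object, so an in-memory io.StringIO can stand in for a real file (the column names below are just illustrative):

```python
import io
import pandas as pd

# Simulate a CSV file in memory so the example runs without a real file
csv_text = "Name,Age,City\nAlice,25,Mumbai\nBob,30,Delhi\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (2, 3)
```

Swap the StringIO object for a file path like 'sales_data.csv' and the rest of the code stays identical.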
Inspecting Your Data
After loading a dataset, the first step is always to inspect it before making any changes.
df.head()      # First 5 rows
df.tail()      # Last 5 rows
df.shape       # (rows, columns)
df.columns     # List of column names
df.dtypes      # Data type of each column
df.info()      # Full summary
df.describe()  # Statistical summary
Running df.info() on the sample dataset:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Age     5 non-null      int64
 2   City    5 non-null      object
 3   Salary  5 non-null      int64
dtypes: int64(2), object(2)
Running df.describe():
df.describe()
Output:
             Age        Salary
count   5.000000      5.000000
mean   28.000000  62000.000000
std     4.949747  14404.860291
min    22.000000  45000.000000
25%    25.000000  50000.000000
50%    28.000000  65000.000000
75%    30.000000  70000.000000
max    35.000000  80000.000000
Indexing with loc and iloc
Pandas provides two ways to access specific rows and columns.
iloc - access by position (row number, column number):
print(df.iloc[0])        # First row
print(df.iloc[2, 3])     # Row 2, Column 3
print(df.iloc[0:3, :])   # First 3 rows, all columns
Output:
Name       Alice
Age           25
City      Mumbai
Salary     50000
Name: 0, dtype: object
65000
      Name  Age     City  Salary
0    Alice   25   Mumbai   50000
1      Bob   30    Delhi   70000
2  Charlie   35  Chennai   65000
loc - access by label (row label and column name):
print(df.loc[0, 'Name'])
print(df.loc[1:3, ['Name', 'Salary']])
Output:
Alice
      Name  Salary
1      Bob   70000
2  Charlie   65000
3    Diana   80000
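The difference between loc and iloc matters most when the index is not the default 0 to n-1. A small sketch with a hypothetical emp frame whose rows are labelled by name:

```python
import pandas as pd

# A DataFrame with string row labels instead of the default 0, 1, 2
emp = pd.DataFrame(
    {'Salary': [50000, 70000, 65000]},
    index=['Alice', 'Bob', 'Charlie']
)

print(emp.loc['Bob', 'Salary'])    # label-based lookup
print(emp.iloc[1, 0])              # same cell, found by position
```

With the default numeric index the two look interchangeable; with labelled rows only loc accepts the labels.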
Filtering Data
Rows can be filtered based on conditions without writing any loop.
# Salary greater than 60000
high_earners = df[df['Salary'] > 60000]
print(high_earners)
Output:
      Name  Age     City  Salary
1      Bob   30    Delhi   70000
2  Charlie   35  Chennai   65000
3    Diana   28     Pune   80000
Multiple conditions using & for AND and | for OR:
# Salary > 60000 AND Age < 32
filtered = df[(df['Salary'] > 60000) & (df['Age'] < 32)]
print(filtered)
Output:
    Name  Age   City  Salary
1    Bob   30  Delhi   70000
3  Diana   28   Pune   80000
Using isin() to filter by a list of values:
selected = df[df['City'].isin(['Mumbai', 'Pune'])]
print(selected)
Output:
    Name  Age    City  Salary
0  Alice   25  Mumbai   50000
3  Diana   28    Pune   80000
Using str.contains() for text matching:
result = df[df['Name'].str.contains('a', case=False)]
print(result)
Output:
      Name  Age     City  Salary
0    Alice   25   Mumbai   50000
2  Charlie   35  Chennai   65000
3    Diana   28     Pune   80000
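The same conditions can also be written with query(), which takes the expression as a string and is often easier to read for long filters. A supplementary sketch on a rebuilt copy of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 22],
    'Salary': [50000, 70000, 65000, 80000, 45000]
})

# Equivalent to df[(df['Salary'] > 60000) & (df['Age'] < 32)]
filtered = df.query('Salary > 60000 and Age < 32')
print(filtered['Name'].tolist())   # ['Bob', 'Diana']
```

Both styles produce the same result; query() just trades brackets for a string expression.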
Adding, Renaming and Dropping Columns
Adding new columns to a DataFrame:
df['Tax'] = df['Salary'] * 0.10
df['Net Salary'] = df['Salary'] - df['Tax']
print(df[['Name', 'Salary', 'Tax', 'Net Salary']])
Output:
      Name  Salary     Tax  Net Salary
0    Alice   50000  5000.0     45000.0
1      Bob   70000  7000.0     63000.0
2  Charlie   65000  6500.0     58500.0
3    Diana   80000  8000.0     72000.0
4      Eve   45000  4500.0     40500.0
Renaming columns:
df_renamed = df.rename(columns={'Name': 'Employee', 'Salary': 'Gross Salary'})
print(df_renamed.columns)
Without inplace=True, rename() returns a new DataFrame and leaves df untouched, so the later examples can keep using the original column names.
Dropping a column:
df.drop(columns=['Tax'], inplace=True)
Dropping a row by index:
df.drop(index=2, inplace=True)
Sorting Data
# Sort by Salary ascending
df_sorted = df.sort_values('Salary')
print(df_sorted)
# Sort by Salary descending
df_sorted_desc = df.sort_values('Salary', ascending=False)
print(df_sorted_desc)
# Sort by multiple columns
df_multi = df.sort_values(['City', 'Salary'], ascending=[True, False])
print(df_multi)
# Reset index after sorting
df_sorted = df.sort_values('Salary').reset_index(drop=True)
print(df_sorted)
GroupBy Operations
GroupBy allows you to split data into groups, apply a function to each group, and combine the results.
data = {
'Department': ['HR', 'Tech', 'HR', 'Tech', 'Finance', 'Finance', 'Tech'],
'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
'Salary': [50000, 90000, 55000, 95000, 70000, 72000, 85000],
'Experience': [3, 7, 4, 9, 5, 6, 8]
}
df3 = pd.DataFrame(data)
# Average salary by department
print(df3.groupby('Department')['Salary'].mean())
Output:
Department
Finance    71000.0
HR         52500.0
Tech       90000.0
Name: Salary, dtype: float64
Multiple aggregations at once using agg():
result = df3.groupby('Department').agg(
Avg_Salary=('Salary', 'mean'),
Max_Salary=('Salary', 'max'),
Total_Employees=('Employee', 'count')
)
print(result)
Output:
            Avg_Salary  Max_Salary  Total_Employees
Department
Finance        71000.0       72000                2
HR             52500.0       55000                2
Tech           90000.0       95000                3
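agg() collapses each group down to one summary row. When you instead want the group result broadcast back onto every original row, transform() does that. A supplementary sketch on a small hypothetical staff frame (separate from df3 above):

```python
import pandas as pd

staff = pd.DataFrame({
    'Department': ['HR', 'Tech', 'HR', 'Tech'],
    'Salary': [50000, 90000, 55000, 95000]
})

# transform() returns one value per original row, aligned with the index,
# so it can be assigned straight back as a new column
staff['Dept_Avg'] = staff.groupby('Department')['Salary'].transform('mean')
print(staff)
```

This pattern is handy for questions like "how does each employee compare to their department average".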
Merging and Joining DataFrames
In real projects data is often spread across multiple tables. Pandas provides merge() to combine them - similar to SQL joins.
employees = pd.DataFrame({
'EmpID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'DeptID': [101, 102, 101, 103]
})
departments = pd.DataFrame({
'DeptID': [101, 102, 103],
'DeptName': ['HR', 'Tech', 'Finance']
})
# Inner join
merged = pd.merge(employees, departments, on='DeptID', how='inner')
print(merged)
Output:
   EmpID     Name  DeptID DeptName
0      1    Alice     101       HR
1      3  Charlie     101       HR
2      2      Bob     102     Tech
3      4    Diana     103  Finance
Other join types:
pd.merge(employees, departments, on='DeptID', how='left')    # Left join
pd.merge(employees, departments, on='DeptID', how='right')   # Right join
pd.merge(employees, departments, on='DeptID', how='outer')   # Outer join
Stacking DataFrames vertically using concat():
df_new = pd.DataFrame({
'EmpID': [5, 6],
'Name': ['Eve', 'Frank'],
'DeptID': [102, 101]
})
df_combined = pd.concat([employees, df_new], ignore_index=True)
print(df_combined)
Output:
   EmpID     Name  DeptID
0      1    Alice     101
1      2      Bob     102
2      3  Charlie     101
3      4    Diana     103
4      5      Eve     102
5      6    Frank     101
Apply Function
The apply() function lets you run any custom logic on a column, value by value.
def salary_category(sal):
if sal < 55000:
return 'Low'
elif sal < 75000:
return 'Medium'
else:
return 'High'
df3['Salary_Category'] = df3['Salary'].apply(salary_category)
print(df3[['Employee', 'Salary', 'Salary_Category']])
Output:
  Employee  Salary Salary_Category
0    Alice   50000             Low
1      Bob   90000            High
2  Charlie   55000          Medium
3    Diana   95000            High
4      Eve   70000          Medium
5    Frank   72000          Medium
6    Grace   85000            High
Using a lambda function for quick one-line operations:
df3['Salary_in_Lakhs'] = df3['Salary'].apply(lambda x: round(x / 100000, 2))
print(df3[['Employee', 'Salary', 'Salary_in_Lakhs']])
Output:
  Employee  Salary  Salary_in_Lakhs
0    Alice   50000             0.50
1      Bob   90000             0.90
2  Charlie   55000             0.55
3    Diana   95000             0.95
4      Eve   70000             0.70
5    Frank   72000             0.72
6    Grace   85000             0.85
Pivot Tables
data = {
'Region': ['North', 'South', 'North', 'East', 'South', 'East'],
'Product': ['A', 'B', 'A', 'B', 'A', 'A'],
'Sales': [200, 150, 300, 250, 180, 220],
'Quantity': [20, 15, 30, 25, 18, 22]
}
df4 = pd.DataFrame(data)
pivot = pd.pivot_table(df4,
values='Sales',
index='Region',
columns='Product',
aggfunc='sum',
fill_value=0)
print(pivot)
Output:
Product    A    B
Region
East     220  250
North    500    0
South    180  150
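pivot_table() has a few more useful options; for example, margins=True appends an 'All' row and column holding the grand totals. A sketch that rebuilds the sales data above under a different name so it runs on its own:

```python
import pandas as pd

sales_tbl = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'East', 'South', 'East'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'A'],
    'Sales': [200, 150, 300, 250, 180, 220],
})

# margins=True adds an 'All' row and column with the grand totals
pivot = pd.pivot_table(sales_tbl, values='Sales', index='Region',
                       columns='Product', aggfunc='sum',
                       fill_value=0, margins=True)
print(pivot)
```

The 'All'/'All' cell is the total of every sale, which is a quick sanity check that nothing was dropped.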
Exporting Data
# Export to CSV
df.to_csv('output.csv', index=False)
# Export to Excel
df.to_excel('output.xlsx', index=False)
# Export to JSON
df.to_json('output.json', orient='records')
Setting index=False prevents the row index from being written into the output file. This is almost always the preferred option.
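To confirm nothing is lost in the round trip, you can write a frame out and read it straight back. This sketch uses a temporary directory so it does not clutter your project folder (the frame itself is a small hypothetical example):

```python
import os
import tempfile
import pandas as pd

df_out = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 70000]})

# Write to a temporary file, then read it back to verify the round trip
path = os.path.join(tempfile.mkdtemp(), 'output.csv')
df_out.to_csv(path, index=False)

df_back = pd.read_csv(path)
print(df_back.equals(df_out))   # True: no stray index column came back
```

If you omit index=False, the read-back frame gains an extra 'Unnamed: 0' column and the comparison fails.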
Quick Command Reference
- pd.Series() - Create a Series
- pd.DataFrame() - Create a DataFrame
- pd.read_csv() - Load CSV file
- df.head() - First 5 rows
- df.tail() - Last 5 rows
- df.info() - Column info and data types
- df.describe() - Statistical summary
- df.shape - Dimensions of DataFrame
- df.dtypes - Data type of each column
- df.iloc[] - Access by position
- df.loc[] - Access by label
- df[condition] - Filter rows
- df.isin() - Filter by list of values
- df.sort_values() - Sort rows
- df.groupby() - Group and aggregate
- df.agg() - Multiple aggregations
- pd.merge() - Join DataFrames
- pd.concat() - Stack DataFrames
- df.apply() - Apply custom function
- pd.pivot_table() - Create pivot table
- df.to_csv() - Export to CSV
- df.to_excel() - Export to Excel
- df.rename() - Rename columns
- df.drop() - Drop rows or columns
- df.reset_index() - Reset row index
Key Takeaways
- Pandas is built on NumPy - understanding NumPy first makes Pandas much easier to learn
- Series is a single column, DataFrame is a full table - these are the two core data structures
- Always inspect your data with head(), info(), and describe() before making any changes
- Use loc for label-based access and iloc for position-based access
- GroupBy and merge are the most powerful tools for real-world data analysis
- Use apply() to run any custom logic across rows or columns without writing a loop
All the code examples in this post are also covered in detail in the YouTube video. The link is available at the top of this page. If you have any questions, drop them in the comments section below.
