Python Pandas Tutorial: Basics to Advanced Guide

Before reading this post, make sure you have a good understanding of Python basics - variables, data types, loops, functions, and lists. You should also have basic knowledge of NumPy, since Pandas is built on top of it. If you have not covered NumPy yet, go through the NumPy tutorial post first and then come back here.

This tutorial works in both VS Code and Google Colab. If you are working locally, make sure NumPy and Pandas are installed before running any code in this post.

Pandas is one of the most widely used Python libraries in data science and data analysis. Whether you are cleaning data, analysing trends, preparing datasets for machine learning, or generating reports - Pandas is the tool that gets the job done.

In this post, we are going to cover Pandas from scratch - creating Series and DataFrames, loading real data, inspecting it, filtering, sorting, grouping, merging, and exporting - all with real code and real output. By the end of this post you will have a solid understanding of how Pandas works and how to use it in real projects.

What is Pandas?

The name Pandas comes from "panel data", an econometrics term for multi-dimensional structured datasets. It is an open-source Python library built for data manipulation and analysis. It was originally developed for financial data analysis and was later adopted across the entire data science community.

The simplest way to understand Pandas is to think of it as Excel inside Python - but far more powerful. You can handle millions of rows, automate repetitive tasks, connect to databases, read from CSV and Excel files, and perform complex transformations - all in a few lines of code.

Installing and Importing Pandas

If you are using VS Code, open your terminal and run:

pip install pandas

If you are using Google Colab, Pandas is already available. No installation is needed.

At the top of every Python file or notebook, always import Pandas and NumPy like this:

import pandas as pd
import numpy as np

The alias pd is the universal convention for Pandas. Use it in every project.

Pandas Series

A Series is a one-dimensional labelled array. Think of it as a single column from a spreadsheet - with an index on the left and values on the right.

import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
print(s)

Output:

0    10
1    20
2    30
3    40
4    50
dtype: int64

You can create a Series with custom labels instead of default numeric index:

scores = pd.Series([85, 90, 78, 95], index=['Maths', 'Science', 'English', 'History'])
print(scores)

Output:

Maths      85
Science    90
English    78
History    95
dtype: int64

Accessing values from a Series:

print(scores['Science'])    # Access by label
print(scores[scores > 85])  # Boolean filtering

Output:

90

Science    90
History    95
dtype: int64
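Because a Series is built on a NumPy array, arithmetic is vectorized - an operation applies to every value at once, with no loop. A quick sketch using the same scores Series:

```python
import pandas as pd

scores = pd.Series([85, 90, 78, 95],
                   index=['Maths', 'Science', 'English', 'History'])

# Add 5 bonus marks to every subject in a single operation
boosted = scores + 5
print(boosted['Maths'])      # 90

# Total computed across all values without any loop
print(int(scores.sum()))     # 348
```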

Creating a Series from a dictionary:

data = {'Monday': 200, 'Tuesday': 350, 'Wednesday': 300}
sales = pd.Series(data)
print(sales)

Output:

Monday       200
Tuesday      350
Wednesday    300
dtype: int64

Pandas DataFrame

A DataFrame is a two-dimensional table with rows and columns. This is the main data structure you will use in almost every Pandas project. Think of it as a full spreadsheet or a database table.

data = {
    'Name':   ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age':    [25, 30, 35, 28, 22],
    'City':   ['Mumbai', 'Delhi', 'Chennai', 'Pune', 'Hyderabad'],
    'Salary': [50000, 70000, 65000, 80000, 45000]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age       City  Salary
0    Alice   25     Mumbai   50000
1      Bob   30      Delhi   70000
2  Charlie   35    Chennai   65000
3    Diana   28       Pune   80000
4      Eve   22  Hyderabad   45000

Accessing a single column:

print(df['Name'])

Output:

0      Alice
1        Bob
2    Charlie
3      Diana
4        Eve
Name: Name, dtype: object

Accessing multiple columns:

print(df[['Name', 'Salary']])

Output:

      Name  Salary
0    Alice   50000
1      Bob   70000
2  Charlie   65000
3    Diana   80000
4      Eve   45000

Loading Data from Files

In real projects you will load data from external files rather than creating DataFrames manually.

# Load from CSV
df = pd.read_csv('sales_data.csv')

# Load from Excel
df = pd.read_excel('sales_data.xlsx')

# Load from JSON
df = pd.read_json('data.json')

The read_csv() function is the most commonly used. One line is all it takes to load an entire file into a DataFrame.
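read_csv() also accepts optional parameters that control what gets loaded. A minimal sketch - here io.StringIO stands in for a real file on disk, and the column names are made up for illustration:

```python
import io
import pandas as pd

# An in-memory string acting as a stand-in for a CSV file on disk
csv_text = "name,city,salary\nAlice,Mumbai,50000\nBob,Delhi,70000\nCharlie,Chennai,65000\n"

# usecols loads only the selected columns; nrows limits how many rows are read
df = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'salary'], nrows=2)
print(df)
```

Both parameters are useful when exploring a large file - you can peek at a few rows and columns without loading everything into memory.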

Inspecting Your Data

After loading a dataset, the first step is always to inspect it before making any changes.

df.head()       # First 5 rows
df.tail()       # Last 5 rows
df.shape        # (rows, columns)
df.columns      # List of column names
df.dtypes       # Data type of each column
df.info()       # Full summary
df.describe()   # Statistical summary

Running df.info() on the sample dataset:

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Age     5 non-null      int64
 2   City    5 non-null      object
 3   Salary  5 non-null      int64
dtypes: int64(2), object(2)

Running df.describe():

df.describe()

Output:

             Age        Salary
count   5.000000      5.000000
mean   28.000000  62000.000000
std     4.949747  14404.860291
min    22.000000  45000.000000
25%    25.000000  50000.000000
50%    28.000000  65000.000000
75%    30.000000  70000.000000
max    35.000000  80000.000000

Indexing with loc and iloc

Pandas provides two ways to access specific rows and columns.

iloc - access by position (row number, column number):

print(df.iloc[0])           # First row
print(df.iloc[2, 3])        # Row at position 2, column at position 3 (Salary)
print(df.iloc[0:3, :])      # First 3 rows, all columns

Output:

Name      Alice
Age          25
City     Mumbai
Salary    50000
Name: 0, dtype: object

65000

      Name  Age     City  Salary
0    Alice   25   Mumbai   50000
1      Bob   30    Delhi   70000
2  Charlie   35  Chennai   65000

loc - access by label (row label and column name). Note that unlike iloc, a loc slice includes the end label - df.loc[1:3] returns rows 1, 2 and 3:

print(df.loc[0, 'Name'])
print(df.loc[1:3, ['Name', 'Salary']])

Output:

Alice

      Name  Salary
1      Bob   70000
2  Charlie   65000
3    Diana   80000

Filtering Data

Rows can be filtered based on conditions without writing any loop.

# Salary greater than 60000
high_earners = df[df['Salary'] > 60000]
print(high_earners)

Output:

      Name  Age     City  Salary
1      Bob   30    Delhi   70000
2  Charlie   35  Chennai   65000
3    Diana   28     Pune   80000

Multiple conditions using & for AND and | for OR. Each condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators:

# Salary > 60000 AND Age < 32
filtered = df[(df['Salary'] > 60000) & (df['Age'] < 32)]
print(filtered)

Output:

    Name  Age   City  Salary
1    Bob   30  Delhi   70000
3  Diana   28   Pune   80000

Using isin() to filter by a list of values:

selected = df[df['City'].isin(['Mumbai', 'Pune'])]
print(selected)

Output:

    Name  Age    City  Salary
0  Alice   25  Mumbai   50000
3  Diana   28    Pune   80000

Using str.contains() for text matching:

result = df[df['Name'].str.contains('a', case=False)]
print(result)

Output:

      Name  Age     City  Salary
0    Alice   25   Mumbai   50000
2  Charlie   35  Chennai   65000
3    Diana   28     Pune   80000

Adding, Renaming and Dropping Columns

Adding new columns to a DataFrame:

df['Tax'] = df['Salary'] * 0.10
df['Net Salary'] = df['Salary'] - df['Tax']
print(df[['Name', 'Salary', 'Tax', 'Net Salary']])

Output:

      Name  Salary     Tax  Net Salary
0    Alice   50000  5000.0     45000.0
1      Bob   70000  7000.0     63000.0
2  Charlie   65000  6500.0     58500.0
3    Diana   80000  8000.0     72000.0
4      Eve   45000  4500.0     40500.0

Renaming columns. Assigning the result to a new variable keeps the original DataFrame unchanged (passing inplace=True instead would modify df directly):

df_renamed = df.rename(columns={'Name': 'Employee', 'Salary': 'Gross Salary'})
print(df_renamed.columns)

Dropping a column:

df_no_tax = df.drop(columns=['Tax'])

Dropping a row by index:

df_no_row2 = df.drop(index=2)

Sorting Data

# Sort by Salary ascending
df_sorted = df.sort_values('Salary')
print(df_sorted)

# Sort by Salary descending
df_sorted_desc = df.sort_values('Salary', ascending=False)
print(df_sorted_desc)

# Sort by multiple columns
df_multi = df.sort_values(['City', 'Salary'], ascending=[True, False])
print(df_multi)

# Reset index after sorting
df_sorted = df.sort_values('Salary').reset_index(drop=True)
print(df_sorted)
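When you only need the top few rows by a column, nlargest() (and its counterpart nsmallest()) is a convenient shortcut for sorting and slicing. A small self-contained sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name':   ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Salary': [50000, 70000, 65000, 80000, 45000]
})

# Equivalent to df.sort_values('Salary', ascending=False).head(2)
top2 = df.nlargest(2, 'Salary')
print(top2['Name'].tolist())    # ['Diana', 'Bob']
```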

GroupBy Operations

GroupBy allows you to split data into groups, apply a function to each group, and combine the results.

data = {
    'Department': ['HR', 'Tech', 'HR', 'Tech', 'Finance', 'Finance', 'Tech'],
    'Employee':   ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'Salary':     [50000, 90000, 55000, 95000, 70000, 72000, 85000],
    'Experience': [3, 7, 4, 9, 5, 6, 8]
}

df3 = pd.DataFrame(data)

# Average salary by department
print(df3.groupby('Department')['Salary'].mean())

Output:

Department
Finance    71000.0
HR         52500.0
Tech       90000.0
Name: Salary, dtype: float64

Multiple aggregations at once using agg():

result = df3.groupby('Department').agg(
    Avg_Salary=('Salary', 'mean'),
    Max_Salary=('Salary', 'max'),
    Total_Employees=('Employee', 'count')
)
print(result)

Output:

            Avg_Salary  Max_Salary  Total_Employees
Department
Finance      71000.0       72000             2
HR           52500.0       55000             2
Tech         90000.0       95000             3

Merging and Joining DataFrames

In real projects data is often spread across multiple tables. Pandas provides merge() to combine them - similar to SQL joins.

employees = pd.DataFrame({
    'EmpID':  [1, 2, 3, 4],
    'Name':   ['Alice', 'Bob', 'Charlie', 'Diana'],
    'DeptID': [101, 102, 101, 103]
})

departments = pd.DataFrame({
    'DeptID':   [101, 102, 103],
    'DeptName': ['HR', 'Tech', 'Finance']
})

# Inner join
merged = pd.merge(employees, departments, on='DeptID', how='inner')
print(merged)

Output:

   EmpID     Name  DeptID DeptName
0      1    Alice     101       HR
1      2      Bob     102     Tech
2      3  Charlie     101       HR
3      4    Diana     103  Finance

Other join types:

pd.merge(employees, departments, on='DeptID', how='left')   # Left join
pd.merge(employees, departments, on='DeptID', how='right')  # Right join
pd.merge(employees, departments, on='DeptID', how='outer')  # Outer join
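The difference between the join types only shows up when some keys do not match. A small sketch with a hypothetical extra department (DeptID 104, made up for illustration) that has no employees yet:

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID':  [1, 2],
    'Name':   ['Alice', 'Bob'],
    'DeptID': [101, 102]
})

departments = pd.DataFrame({
    'DeptID':   [101, 102, 104],          # 104 is hypothetical - no employees
    'DeptName': ['HR', 'Tech', 'Legal']
})

# Left join keeps every employee row; outer join also keeps unmatched departments
left  = pd.merge(employees, departments, on='DeptID', how='left')
outer = pd.merge(employees, departments, on='DeptID', how='outer')
print(len(left))     # 2 - one row per employee
print(len(outer))    # 3 - the Legal row appears with NaN in the employee columns
```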

Stacking DataFrames vertically using concat():

df_new = pd.DataFrame({
    'EmpID': [5, 6],
    'Name':  ['Eve', 'Frank'],
    'DeptID': [102, 101]
})

df_combined = pd.concat([employees, df_new], ignore_index=True)
print(df_combined)

Output:

   EmpID     Name  DeptID
0      1    Alice     101
1      2      Bob     102
2      3  Charlie     101
3      4    Diana     103
4      5      Eve     102
5      6    Frank     101

Apply Function

The apply() function lets you run any custom logic on a column, one value at a time, without writing an explicit loop.

def salary_category(sal):
    if sal < 55000:
        return 'Low'
    elif sal < 75000:
        return 'Medium'
    else:
        return 'High'

df3['Salary_Category'] = df3['Salary'].apply(salary_category)
print(df3[['Employee', 'Salary', 'Salary_Category']])

Output:

  Employee  Salary Salary_Category
0    Alice   50000             Low
1      Bob   90000            High
2  Charlie   55000          Medium
3    Diana   95000            High
4      Eve   70000          Medium
5    Frank   72000          Medium
6    Grace   85000            High

Using a lambda function for quick one-line operations:

df3['Salary_in_Lakhs'] = df3['Salary'].apply(lambda x: round(x / 100000, 2))
print(df3[['Employee', 'Salary', 'Salary_in_Lakhs']])

Output:

  Employee  Salary  Salary_in_Lakhs
0    Alice   50000             0.50
1      Bob   90000             0.90
2  Charlie   55000             0.55
3    Diana   95000             0.95
4      Eve   70000             0.70
5    Frank   72000             0.72
6    Grace   85000             0.85

Pivot Tables

data = {
    'Region':   ['North', 'South', 'North', 'East', 'South', 'East'],
    'Product':  ['A', 'B', 'A', 'B', 'A', 'A'],
    'Sales':    [200, 150, 300, 250, 180, 220],
    'Quantity': [20, 15, 30, 25, 18, 22]
}

df4 = pd.DataFrame(data)

pivot = pd.pivot_table(df4,
                       values='Sales',
                       index='Region',
                       columns='Product',
                       aggfunc='sum',
                       fill_value=0)
print(pivot)

Output:

Product    A    B
Region
East     220  250
North    500    0
South    180  150

Exporting Data

# Export to CSV
df.to_csv('output.csv', index=False)

# Export to Excel
df.to_excel('output.xlsx', index=False)

# Export to JSON
df.to_json('output.json', orient='records')

Setting index=False prevents the row index from being written into the output file. This is almost always the preferred option.
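A quick round-trip check - writing a DataFrame out and reading it straight back - is a good way to confirm nothing was lost. A sketch using a temporary file so it runs anywhere:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 70000]})

# Write to a temporary CSV without the index, then read it back
path = os.path.join(tempfile.mkdtemp(), 'output.csv')
df.to_csv(path, index=False)
restored = pd.read_csv(path)

# The round trip preserves both the columns and the values
print(restored.equals(df))    # True
```

If index=False were omitted, the index would come back as an extra unnamed column and the two DataFrames would no longer match.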

Quick Command Reference

  • pd.Series() - Create a Series
  • pd.DataFrame() - Create a DataFrame
  • pd.read_csv() - Load CSV file
  • df.head() - First 5 rows
  • df.tail() - Last 5 rows
  • df.info() - Column info and data types
  • df.describe() - Statistical summary
  • df.shape - Dimensions of DataFrame
  • df.dtypes - Data type of each column
  • df.iloc[] - Access by position
  • df.loc[] - Access by label
  • df[condition] - Filter rows
  • df.isin() - Filter by list of values
  • df.sort_values() - Sort rows
  • df.groupby() - Group and aggregate
  • df.agg() - Multiple aggregations
  • pd.merge() - Join DataFrames
  • pd.concat() - Stack DataFrames
  • df.apply() - Apply custom function
  • pd.pivot_table() - Create pivot table
  • df.to_csv() - Export to CSV
  • df.to_excel() - Export to Excel
  • df.rename() - Rename columns
  • df.drop() - Drop rows or columns
  • df.reset_index() - Reset row index

Key Takeaways

  • Pandas is built on NumPy - understanding NumPy first makes Pandas much easier to learn
  • Series is a single column, DataFrame is a full table - these are the two core data structures
  • Always inspect your data with head(), info(), and describe() before making any changes
  • Use loc for label-based access and iloc for position-based access
  • GroupBy and merge are the most powerful tools for real-world data analysis
  • Use apply() to run any custom logic across rows or columns without writing a loop

All the code examples in this post are also covered in detail in the YouTube video. The link is available at the top of this page. If you have any questions, drop them in the comments section below.
