Python Pandas Data Cleaning & Null Values Guide


Every data science project starts with one unavoidable step - cleaning the data.

Most beginners want to jump straight into analysis, building charts, or training machine learning models. But here is the reality: if the data going in is broken, the results coming out will be wrong. And the frustrating part is - the code will not always throw an error to warn you. It will just give you a number that looks correct but is not.

That is what makes dirty data dangerous.

In this post, we are going to work through the full data cleaning process using Python and Pandas - step by step, with real code and real output. By the end, you will know how to find missing values, remove duplicates, fix data types, and fill nulls using the right strategy for each column type.

What Is Dirty Data?

Dirty data is any dataset that contains errors, inconsistencies, or missing information that can affect the accuracy of your analysis.

The three most common problems in real-world datasets are:

  • Missing values - cells that are empty or contain NaN
  • Duplicate rows - the same record appearing more than once
  • Wrong data types - a numeric column stored as text, for example

None of these are rare. In production datasets - whether they come from databases, forms, APIs, or CSV exports - all three problems tend to show up together.

Setting Up the Environment

This tutorial works in both VS Code and Google Colab.

  • VS Code: Create a new file data_cleaning.py and run it from the integrated terminal.
  • Google Colab: Open colab.research.google.com, create a new notebook, and run each cell block by block.

Start every session with these two imports:

Code
import pandas as pd
import numpy as np

Step 1 - Load the Data and Spot the Problems

Before fixing anything, the first job is to see what you are working with. Below is a sample student dataset with intentional problems built in:

Code
import pandas as pd
import numpy as np

data = {
    'Name':  ['Alice', 'Bob', 'Charlie', 'Bob', 'Diana', 'Eve'],
    'Age':   [22.0, None, 21.0, None, 23.0, 25.0],
    'Score': [88.0, 92.0, None, 92.0, 79.0, None],
    'Grade': ['A', 'B', None, 'B', 'C', None]
}

df = pd.DataFrame(data)
print(df)
Output
      Name   Age  Score Grade
0    Alice  22.0   88.0     A
1      Bob   NaN   92.0     B
2  Charlie  21.0    NaN  None
3      Bob   NaN   92.0     B
4    Diana  23.0   79.0     C
5      Eve  25.0    NaN  None

Problems visible at a glance:

  • Row 1 and Row 3 are exact duplicates - Bob appears twice
  • Age is missing in rows 1 and 3
  • Score is missing in rows 2 and 5
  • Grade is missing in rows 2 and 5

Step 2 - Find All Null Values

Before deciding how to handle missing values, you need to know exactly where they are and how many there are.

Code
print(df.isnull())
Output
    Name    Age  Score  Grade
0  False  False  False  False
1  False   True  False  False
2  False  False   True   True
3  False   True  False  False
4  False  False  False  False
5  False  False   True   True

Use .sum() to get a count per column:

Code
print(df.isnull().sum())
Output
Name     0
Age      2
Score    2
Grade    2
dtype: int64

For large datasets, check the percentage of missing values:

Code
print(df.isnull().sum() / len(df) * 100)
Output
Name      0.00
Age      33.33
Score    33.33
Grade    33.33
dtype: float64

A third of the values in Age, Score, and Grade are missing. That is far too much to ignore.

Step 3 - Remove Duplicate Rows

Duplicates quietly inflate your numbers. If Bob appears twice, his score gets counted twice in every average or aggregation - without a single warning.

Code
# Count duplicates
print(df.duplicated().sum())
Output
1
Code
# See which row is duplicate
print(df[df.duplicated()])
Output
  Name  Age  Score Grade
3  Bob  NaN   92.0     B
Code
# Remove duplicates
df.drop_duplicates(inplace=True)
print(df.shape)
Output
(5, 4)

Additional options for drop_duplicates() - each line is an alternative, not a sequence to run:

Code
df.drop_duplicates(keep='last', inplace=True)          # Keep last occurrence
df.drop_duplicates(keep=False, inplace=True)           # Remove all copies
df.drop_duplicates(subset=['Name'], inplace=True)      # Check by column only
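The subset option matters when rows differ in some columns but represent the same record. A quick sketch (using made-up sample data) showing the difference between a full-row check and a column-level check:

```python
import pandas as pd

df = pd.DataFrame({
    'Name':  ['Alice', 'Bob', 'Bob'],
    'Score': [88.0, 92.0, 95.0]
})

# Bob's two rows differ in Score, so a full-row check finds no duplicates
print(df.duplicated().sum())   # 0

# Checking by Name alone treats them as duplicates and keeps the first
deduped = df.drop_duplicates(subset=['Name'])
print(deduped)
```

Which check is right depends on your data model - use subset only when one column (or combination) is meant to uniquely identify a record.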

Step 4 - Drop or Fill?

Once duplicates are handled, you need to deal with null values. You have two options - drop or fill.

Situation                                  Recommended Action
Very few rows have nulls, large dataset    Drop rows
Nulls are common or rows are valuable      Fill with mean / median / mode
Entire column is mostly null               Drop the column
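The table above can be sketched as a small helper. The function name and threshold values here are illustrative assumptions, not fixed rules - tune them to your dataset:

```python
import pandas as pd

def null_strategy(series, row_threshold=0.05, col_threshold=0.5):
    """Suggest an action for one column based on its null ratio.

    Thresholds are illustrative defaults, not hard rules.
    """
    ratio = series.isnull().mean()
    if ratio == 0:
        return 'nothing to do'
    if ratio > col_threshold:
        return 'drop the column'
    if ratio < row_threshold:
        return 'drop rows'
    if pd.api.types.is_numeric_dtype(series):
        return 'fill with mean/median'
    return 'fill with mode'

df = pd.DataFrame({
    'Age':   [22.0, None, 21.0, 23.0, 25.0, 24.0],
    'Notes': [None, None, None, None, 'x', None]
})
print(null_strategy(df['Age']))    # fill with mean/median (1/6 missing)
print(null_strategy(df['Notes']))  # drop the column (5/6 missing)
```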

Step 5 - Dropping Nulls

Code
# Drop any row with at least one null
df_dropped = df.dropna()
print(df_dropped)
Output
    Name   Age  Score Grade
0  Alice  22.0   88.0     A
4  Diana  23.0   79.0     C
Code
df.dropna(axis=1)        # Drop columns with any null
df.dropna(thresh=3)      # Keep rows with at least 3 non-null values
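Often only certain columns actually matter. The subset parameter restricts the null check to those columns - a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name':  ['Alice', 'Bob', 'Charlie'],
    'Age':   [22.0, None, 21.0],
    'Score': [88.0, 92.0, None]
})

# Drop rows only where Score is null; Bob's missing Age is ignored
df_scored = df.dropna(subset=['Score'])
print(df_scored)   # keeps Alice and Bob
```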

Step 6 - Filling Nulls in Numeric Columns

For numeric columns, fill with mean when there are no significant outliers, or median when outliers exist.

Code
# Fill Age with mean (assignment form; column-level inplace=True is deprecated)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df['Age'])
Output
0    22.00
1    22.75
2    21.00
4    23.00
5    25.00
Name: Age, dtype: float64
Code
# Fill Score with median
df['Score'] = df['Score'].fillna(df['Score'].median())
print(df['Score'])
Output
0    88.0
1    92.0
2    88.0
4    79.0
5    88.0
Name: Score, dtype: float64
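To see why median is the safer choice when outliers exist, compare both on a small made-up series containing one extreme value:

```python
import pandas as pd

# A single outlier drags the mean far from the typical value
scores = pd.Series([70.0, 72.0, 75.0, 74.0, 500.0])

print(scores.mean())    # 158.2 - distorted by the outlier
print(scores.median())  # 74.0  - still representative
```

Filling nulls with 158.2 would inject values unlike anything typical in the column; 74.0 stays close to the real distribution.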

Step 7 - Filling Nulls in Categorical Columns

For categorical columns like Grade or City - filling with mean makes no sense. Use the mode - the most frequently occurring value.

Code
# Fill Grade with mode
df['Grade'] = df['Grade'].fillna(df['Grade'].mode()[0])
print(df['Grade'])
Output
0    A
1    B
2    B
4    C
5    B
Name: Grade, dtype: object
💡 Note: mode()[0] is needed because .mode() returns a Series - there can be ties for the most frequent value. The [0] extracts the first of them.
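A quick demonstration of that tie behaviour, on a made-up series where two grades are equally frequent:

```python
import pandas as pd

# 'A' and 'B' both appear twice, so .mode() returns both (sorted)
grades = pd.Series(['A', 'A', 'B', 'B', 'C'])
print(grades.mode())      # a Series with two entries, not a single value
print(grades.mode()[0])   # 'A' - the first entry
```

When ties are possible and the choice matters, decide explicitly rather than silently taking the first entry.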

Step 8 - Checking and Fixing Data Types

A very common issue with CSV files is that numeric columns get loaded as strings when the column contains mixed content.

Code
data2 = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Watch'],
    'Price':   ['999', '499', 'NaN', '199'],
    'Stock':   [50, 30, 20, 15]
}

df2 = pd.DataFrame(data2)
print(df2.dtypes)
Output
Product    object
Price      object
Stock       int64
dtype: object
Code
# Convert Price to numeric
df2['Price'] = pd.to_numeric(df2['Price'], errors='coerce')
print(df2.dtypes)
print(df2)
Output
Product     object
Price      float64
Stock        int64
dtype: object
  Product  Price  Stock
0  Laptop  999.0     50
1   Phone  499.0     30
2  Tablet    NaN     20
3   Watch  199.0     15
Code
df2['Stock']   = df2['Stock'].astype(int)     # To integer
df2['Product'] = df2['Product'].astype(str)   # To string
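One subtlety: coercion produced float64 above because the classic NaN only exists for floats. If a column should stay whole numbers despite missing values, pandas' nullable Int64 dtype (capital I) can hold both - a short sketch:

```python
import pandas as pd

prices = pd.Series(['999', '499', 'NaN', '199'])

# errors='coerce' turns unparseable entries into NaN, which forces float64
as_float = pd.to_numeric(prices, errors='coerce')
print(as_float.dtype)   # float64

# The nullable integer dtype keeps whole numbers alongside missing values
as_int = as_float.astype('Int64')
print(as_int.dtype)     # Int64
print(as_int)
```

Plain astype(int) would fail here because NaN cannot be represented as a regular int.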

Step 9 - Complete End-to-End Cleaning Workflow

Below is the full seven-step workflow in a single block. This is the process to follow every time a new dataset lands on your desk.

Code
import pandas as pd
import numpy as np

# Step 1 - Load
df = pd.read_csv("students.csv")
print("Raw Data:")
print(df.head())

# Step 2 - Inspect
print("\nShape:", df.shape)
print(df.info())

# Step 3 - Check nulls
print("\nNull Counts:")
print(df.isnull().sum())
print("\nNull Percentage:")
print(df.isnull().sum() / len(df) * 100)

# Step 4 - Remove duplicates
print("\nDuplicates:", df.duplicated().sum())
df.drop_duplicates(inplace=True)

# Step 5 - Fix data types
df['Age']   = pd.to_numeric(df['Age'],   errors='coerce')
df['Score'] = pd.to_numeric(df['Score'], errors='coerce')

# Step 6 - Fill nulls by column type
df['Age']   = df['Age'].fillna(df['Age'].mean())         # Numeric → mean
df['Score'] = df['Score'].fillna(df['Score'].median())   # Numeric → median
df['Grade'] = df['Grade'].fillna(df['Grade'].mode()[0])  # Categorical → mode

# Step 7 - Confirm clean
print("\nNull counts after cleaning:")
print(df.isnull().sum())
print("\nClean Dataset:")
print(df)
Output
Null Counts:
Name     0
Age      2
Score    1
Grade    1
dtype: int64

Duplicates: 1

Null counts after cleaning:
Name     0
Age      0
Score    0
Grade    0
dtype: int64

Clean Dataset:
      Name    Age  Score Grade
0    Alice  22.00   88.0     A
1      Bob  22.00   88.5     B
2  Charlie  21.00   88.5     A
4    Diana  23.00   79.0     C
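If you clean new datasets regularly, the workflow can be wrapped into a reusable function. This is only a sketch - the function name is made up here, and the column lists are assumptions you would adapt per dataset:

```python
import pandas as pd

def clean(df, numeric_cols, categorical_cols):
    """Apply the workflow above: dedupe, coerce types, fill nulls.

    Column lists are passed in explicitly - adapt them per dataset.
    """
    df = df.drop_duplicates().copy()
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')
        df[col] = df[col].fillna(df[col].median())
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode()[0])
    return df

raw = pd.DataFrame({
    'Name':  ['Alice', 'Bob', 'Bob'],
    'Score': [88.0, None, None],
    'Grade': ['A', 'B', 'B']
})
cleaned = clean(raw, numeric_cols=['Score'], categorical_cols=['Grade'])
print(cleaned)
```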

Quick Command Reference

Task                         Command
Check nulls per cell         df.isnull()
Count nulls per column       df.isnull().sum()
Null percentage              df.isnull().sum() / len(df) * 100
Drop rows with nulls         df.dropna()
Drop columns with nulls      df.dropna(axis=1)
Fill with fixed value        df.fillna(0)
Fill numeric with mean       df['col'].fillna(df['col'].mean())
Fill numeric with median     df['col'].fillna(df['col'].median())
Fill categorical with mode   df['col'].fillna(df['col'].mode()[0])
Count duplicates             df.duplicated().sum()
Remove duplicates            df.drop_duplicates()
Check data types             df.dtypes
Convert to numeric           pd.to_numeric(df['col'], errors='coerce')
Convert type manually        df['col'].astype(int)
