Every data science project starts with one unavoidable step - cleaning the data.
Most beginners want to jump straight into analysis, building charts, or training machine learning models. But here is the reality: if the data going in is broken, the results coming out will be wrong. And the frustrating part is - the code will not always throw an error to warn you. It will just give you a number that looks correct but is not.
That is what makes dirty data dangerous.
In this post, we are going to work through the full data cleaning process using Python and Pandas - step by step, with real code and real output. By the end, you will know how to find missing values, remove duplicates, fix data types, and fill nulls using the right strategy for each column type.
What Is Dirty Data?
Dirty data is any dataset that contains errors, inconsistencies, or missing information that can affect the accuracy of your analysis.
The three most common problems in real-world datasets are:
- Missing values - cells that are empty or contain NaN
- Duplicate rows - the same record appearing more than once
- Wrong data types - a numeric column stored as text, for example
None of these are rare. In production datasets - whether they come from databases, forms, APIs, or CSV exports - all three problems tend to show up together.
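To make the "number that looks correct but is not" problem concrete, here is a minimal sketch (with made-up names and scores) showing how a single duplicate row shifts an average without raising any error:

```python
import pandas as pd

# A duplicate row silently inflates the average - no warning is raised
scores = pd.DataFrame({'Name': ['Alice', 'Bob', 'Bob'],
                       'Score': [80.0, 90.0, 90.0]})
print(scores['Score'].mean())                    # about 86.67 - Bob counted twice
print(scores.drop_duplicates()['Score'].mean())  # 85.0 - the true average
```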
Setting Up the Environment
This tutorial works in both VS Code and Google Colab.
- VS Code: Create a new file `data_cleaning.py` and run it from the integrated terminal.
- Google Colab: Open colab.research.google.com, create a new notebook, and run each cell block by block.
Start every session with these two imports:
```python
import pandas as pd
import numpy as np
```
Step 1 - Load the Data and Spot the Problems
Before fixing anything, the first job is to see what you are working with. Below is a sample student dataset with intentional problems built in:
```python
import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Diana', 'Eve'],
    'Age': [22.0, None, 21.0, None, 23.0, 25.0],
    'Score': [88.0, 92.0, None, 92.0, 79.0, None],
    'Grade': ['A', 'B', None, 'B', 'C', None]
}
df = pd.DataFrame(data)
print(df)
```
Problems visible at a glance:
- Row 1 and Row 3 are exact duplicates - Bob appears twice
- Age is missing in rows 1 and 3
- Score is missing in rows 2 and 5
- Grade is missing in rows 2 and 5
Step 2 - Find All Null Values
Before deciding how to handle missing values, you need to know exactly where they are and how many there are.
```python
print(df.isnull())
```
Use .sum() to get a count per column:
```python
print(df.isnull().sum())
```
For large datasets, check the percentage of missing values:
```python
print(df.isnull().sum() / len(df) * 100)
```
33% of the values in Age, Score, and Grade are missing. That is too much to ignore.
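One optional convenience, shown here as a sketch against the same sample dataset, is to combine the count and the percentage into a single summary DataFrame so both checks appear side by side:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Diana', 'Eve'],
    'Age': [22.0, None, 21.0, None, 23.0, 25.0],
    'Score': [88.0, 92.0, None, 92.0, 79.0, None],
    'Grade': ['A', 'B', None, 'B', 'C', None]
}
df = pd.DataFrame(data)

# One column for the raw count, one for the percentage, rounded for readability
null_summary = pd.DataFrame({
    'null_count': df.isnull().sum(),
    'null_pct': (df.isnull().sum() / len(df) * 100).round(1),
})
print(null_summary)
```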
Step 3 - Remove Duplicate Rows
Duplicates quietly inflate your numbers. If Bob appears twice, his score gets counted twice in every average or aggregation - without a single warning.
```python
# Count duplicates
print(df.duplicated().sum())

# See which row is the duplicate
print(df[df.duplicated()])

# Remove duplicates
df.drop_duplicates(inplace=True)
print(df.shape)
```
Additional options for drop_duplicates():
```python
df.drop_duplicates(keep='last', inplace=True)      # Keep last occurrence
df.drop_duplicates(keep=False, inplace=True)       # Remove all copies
df.drop_duplicates(subset=['Name'], inplace=True)  # Check by column only
```
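To see what each option actually does, here is a small illustration on a toy frame (the names and scores are made up for this sketch):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Eve'],
    'Score': [88.0, 92.0, 92.0, 79.0],
})

print(len(df.drop_duplicates(keep='first')))     # 3 - second Bob row dropped
print(len(df.drop_duplicates(keep=False)))       # 2 - both Bob rows dropped
print(len(df.drop_duplicates(subset=['Name'])))  # 3 - compares Name column only
```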
Step 4 - Drop or Fill?
Once duplicates are handled, you need to deal with null values. You have two options - drop or fill.
| Situation | Recommended Action |
|---|---|
| Very few rows have nulls, large dataset | Drop rows |
| Nulls are common or rows are valuable | Fill with mean / median / mode |
| Entire column is mostly null | Drop the column |
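The table above can be turned into a simple decision helper. The function below is purely illustrative - `suggest_null_strategy` and its threshold values are assumptions for this sketch, not a standard pandas utility:

```python
import pandas as pd

def suggest_null_strategy(df, row_threshold=5.0, col_threshold=60.0):
    """Hypothetical helper: suggest an action per column from its null percentage.

    Thresholds are illustrative assumptions, not fixed rules.
    """
    pct = df.isnull().sum() / len(df) * 100
    suggestions = {}
    for col, p in pct.items():
        if p == 0:
            suggestions[col] = 'no action'
        elif p > col_threshold:
            suggestions[col] = 'drop column'       # mostly null
        elif p < row_threshold:
            suggestions[col] = 'drop rows'         # only a handful affected
        else:
            suggestions[col] = 'fill (mean/median/mode)'
    return suggestions

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [None, None, None, 8],
                   'C': [1, 2, 3, 4]})
print(suggest_null_strategy(df))
```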
Step 5 - Dropping Nulls
```python
# Drop any row with at least one null
df_dropped = df.dropna()
print(df_dropped)

df.dropna(axis=1)   # Drop columns with any null
df.dropna(thresh=3) # Keep rows with at least 3 non-null values
```
Step 6 - Filling Nulls in Numeric Columns
For numeric columns, fill with mean when there are no significant outliers, or median when outliers exist.
```python
# Fill Age with mean (assignment avoids pandas' deprecated chained inplace fillna)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df['Age'])

# Fill Score with median
df['Score'] = df['Score'].fillna(df['Score'].median())
print(df['Score'])
```
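Why does the outlier rule matter? A quick illustration with a made-up score series shows how a single extreme value distorts the mean while leaving the median stable:

```python
import pandas as pd

# One outlier pulls the mean far from the typical value; the median barely moves
s = pd.Series([70.0, 75.0, 80.0, 85.0, 900.0])
print(s.mean())    # 242.0 - distorted by the outlier
print(s.median())  # 80.0 - robust to the outlier
```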
Step 7 - Filling Nulls in Categorical Columns
For categorical columns like Grade or City - filling with mean makes no sense. Use the mode - the most frequently occurring value.
```python
# Fill Grade with mode (assignment avoids pandas' deprecated chained inplace fillna)
df['Grade'] = df['Grade'].fillna(df['Grade'].mode()[0])
print(df['Grade'])
```
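To see why the `[0]` is needed, inspect what `.mode()` returns on a small sample series (note that when several values tie for most frequent, `.mode()` returns all of them and `[0]` simply keeps the first):

```python
import pandas as pd

grades = pd.Series(['A', 'B', None, 'B', 'C', None])
print(grades.mode())     # a Series with one row here: 0    B
print(grades.mode()[0])  # 'B' - the scalar actually used for filling
```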
mode()[0] is used because .mode() returns a Series. The [0] extracts the first - most frequent - value from it.

Step 8 - Checking and Fixing Data Types
A very common issue with CSV files is that numeric columns get loaded as strings when the column contains mixed content.
```python
data2 = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Watch'],
    'Price': ['999', '499', 'NaN', '199'],
    'Stock': [50, 30, 20, 15]
}
df2 = pd.DataFrame(data2)
print(df2.dtypes)
```
```python
# Convert Price to numeric
df2['Price'] = pd.to_numeric(df2['Price'], errors='coerce')
print(df2.dtypes)
print(df2)
```
```python
df2['Stock'] = df2['Stock'].astype(int)      # To integer
df2['Product'] = df2['Product'].astype(str)  # To string
```
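One caveat worth knowing: `astype(int)` raises an error if the column still contains NaN. A sketch using pandas' nullable `'Int64'` dtype (note the capital I) keeps missing values alongside integers:

```python
import pandas as pd

prices = pd.Series(['999', '499', 'NaN', '199'])
numeric = pd.to_numeric(prices, errors='coerce')  # float64 with one NaN

# numeric.astype(int) would fail here because of the NaN;
# the nullable Int64 dtype stores integers plus a proper missing marker (<NA>)
print(numeric.astype('Int64'))
```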
Step 9 - Complete End-to-End Cleaning Workflow
Below is the full seven-step workflow in a single block. This is the process to follow every time a new dataset lands on your desk.
```python
import pandas as pd
import numpy as np

# Step 1 - Load
df = pd.read_csv("students.csv")
print("Raw Data:")
print(df.head())

# Step 2 - Inspect
print("\nShape:", df.shape)
print(df.info())

# Step 3 - Check nulls
print("\nNull Counts:")
print(df.isnull().sum())
print("\nNull Percentage:")
print(df.isnull().sum() / len(df) * 100)

# Step 4 - Remove duplicates
print("\nDuplicates:", df.duplicated().sum())
df.drop_duplicates(inplace=True)

# Step 5 - Fix data types
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Score'] = pd.to_numeric(df['Score'], errors='coerce')

# Step 6 - Fill nulls by column type (assignment avoids deprecated inplace fillna)
df['Age'] = df['Age'].fillna(df['Age'].mean())          # Numeric → mean
df['Score'] = df['Score'].fillna(df['Score'].median())  # Numeric → median
df['Grade'] = df['Grade'].fillna(df['Grade'].mode()[0]) # Categorical → mode

# Step 7 - Confirm clean
print("\nNull counts after cleaning:")
print(df.isnull().sum())
print("\nClean Dataset:")
print(df)
```
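As a final safeguard, one possible pattern (a sketch, not a required step) is to assert the cleaning invariants directly, so a regression fails loudly instead of silently. The block below reuses the sample student data from Step 1:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Diana', 'Eve'],
    'Age': [22.0, None, 21.0, None, 23.0, 25.0],
    'Score': [88.0, 92.0, None, 92.0, 79.0, None],
    'Grade': ['A', 'B', None, 'B', 'C', None]
}
df = pd.DataFrame(data)
df = df.drop_duplicates()
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Score'] = df['Score'].fillna(df['Score'].median())
df['Grade'] = df['Grade'].fillna(df['Grade'].mode()[0])

# Invariants that must hold after cleaning - fail loudly instead of silently
assert df.isnull().sum().sum() == 0, "nulls remain"
assert df.duplicated().sum() == 0, "duplicates remain"
print("All checks passed")
```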
Quick Command Reference
| Task | Command |
|---|---|
| Check nulls per cell | df.isnull() |
| Count nulls per column | df.isnull().sum() |
| Null percentage | df.isnull().sum() / len(df) * 100 |
| Drop rows with nulls | df.dropna() |
| Drop columns with nulls | df.dropna(axis=1) |
| Fill with fixed value | df.fillna(0) |
| Fill numeric with mean | df['col'].fillna(df['col'].mean()) |
| Fill numeric with median | df['col'].fillna(df['col'].median()) |
| Fill categorical with mode | df['col'].fillna(df['col'].mode()[0]) |
| Count duplicates | df.duplicated().sum() |
| Remove duplicates | df.drop_duplicates() |
| Check data types | df.dtypes |
| Convert to numeric | pd.to_numeric(df['col'], errors='coerce') |
| Convert type manually | df['col'].astype(int) |
