Every data science project starts with one unavoidable step - cleaning the data.
Most beginners want to jump straight into analysis, building charts, or training machine learning models. But here is the reality: if the data going in is broken, the results coming out will be wrong. And the frustrating part is - the code will not always throw an error to warn you. It will just give you a number that looks correct but is not.
That is what makes dirty data dangerous.
In this post, we are going to work through the full data cleaning process using Python and Pandas - step by step, with real code and real output. By the end, you will know how to find missing values, remove duplicates, fix data types, and fill nulls using the right strategy for each column type.
What Is Dirty Data?
Dirty data is any dataset that contains errors, inconsistencies, or missing information that can affect the accuracy of your analysis.
The three most common problems in real-world datasets are:
- Missing values - cells that are empty or contain NaN
- Duplicate rows - the same record appearing more than once
- Wrong data types - a numeric column stored as text, for example
None of these are rare. In production datasets - whether they come from databases, forms, APIs, or CSV exports - all three problems tend to show up together.
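To make the "number that looks correct but is not" problem concrete, here is a minimal sketch (with made-up names and scores) showing how a single duplicate row shifts an average without raising any error:

```python
import pandas as pd

# A duplicate row silently inflates the average - no warning is raised
scores = pd.DataFrame({'Name': ['Alice', 'Bob', 'Bob'],
                       'Score': [80.0, 90.0, 90.0]})
print(scores['Score'].mean())                    # about 86.67 - Bob counted twice
print(scores.drop_duplicates()['Score'].mean())  # 85.0 - the true average
```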
Setting Up the Environment
This tutorial works in both VS Code and Google Colab.
- VS Code: Create a new file `data_cleaning.py` and run it from the integrated terminal.
- Google Colab: Open colab.research.google.com, create a new notebook, and run each cell block by block.
Start every session with these two imports:
```python
import pandas as pd
import numpy as np
```
Step 1 - Load the Data and Spot the Problems
Before fixing anything, the first job is to see what you are working with. Below is a sample student dataset with intentional problems built in:
```python
import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Diana', 'Eve'],
    'Age': [22.0, None, 21.0, None, 23.0, 25.0],
    'Score': [88.0, 92.0, None, 92.0, 79.0, None],
    'Grade': ['A', 'B', None, 'B', 'C', None]
}
df = pd.DataFrame(data)
print(df)
```
Problems visible at a glance:
- Row 1 and Row 3 are exact duplicates - Bob appears twice
- Age is missing in rows 1 and 3
- Score is missing in rows 2 and 5
- Grade is missing in rows 2 and 5
Step 2 - Find All Null Values
Before deciding how to handle missing values, you need to know exactly where they are and how many there are.
```python
print(df.isnull())
```
Use .sum() to get a count per column:
```python
print(df.isnull().sum())
```
For large datasets, check the percentage of missing values:
```python
print(df.isnull().sum() / len(df) * 100)
```
33% of the values in Age, Score, and Grade are missing. That is too much to ignore.
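One optional convenience, shown here as a sketch against the same sample dataset, is to combine the count and the percentage into a single summary DataFrame so both checks appear side by side:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Diana', 'Eve'],
    'Age': [22.0, None, 21.0, None, 23.0, 25.0],
    'Score': [88.0, 92.0, None, 92.0, 79.0, None],
    'Grade': ['A', 'B', None, 'B', 'C', None]
}
df = pd.DataFrame(data)

# One column for the raw count, one for the percentage, rounded for readability
null_summary = pd.DataFrame({
    'null_count': df.isnull().sum(),
    'null_pct': (df.isnull().sum() / len(df) * 100).round(1),
})
print(null_summary)
```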
Step 3 - Remove Duplicate Rows
Duplicates quietly inflate your numbers. If Bob appears twice, his score gets counted twice in every average or aggregation - without a single warning.
```python
# Count duplicates
print(df.duplicated().sum())

# See which row is the duplicate
print(df[df.duplicated()])

# Remove duplicates
df.drop_duplicates(inplace=True)
print(df.shape)
```
Additional options for drop_duplicates():
```python
df.drop_duplicates(keep='last', inplace=True)      # Keep last occurrence
df.drop_duplicates(keep=False, inplace=True)       # Remove all copies
df.drop_duplicates(subset=['Name'], inplace=True)  # Check by column only
```
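To see what each option actually does, here is a small illustration on a toy frame (the names and scores are made up for this sketch):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Eve'],
    'Score': [88.0, 92.0, 92.0, 79.0],
})

print(len(df.drop_duplicates(keep='first')))     # 3 - second Bob row dropped
print(len(df.drop_duplicates(keep=False)))       # 2 - both Bob rows dropped
print(len(df.drop_duplicates(subset=['Name'])))  # 3 - compares Name column only
```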
Step 4 - Drop or Fill?
Once duplicates are handled, you need to deal with null values. You have two options - drop or fill.
| Situation | Recommended Action |
|---|---|
| Very few rows have nulls, large dataset | Drop rows |
| Nulls are common or rows are valuable | Fill with mean / median / mode |
| Entire column is mostly null | Drop the column |
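The table above can be turned into a simple decision helper. The function below is purely illustrative - `suggest_null_strategy` and its threshold values are assumptions for this sketch, not a standard pandas utility:

```python
import pandas as pd

def suggest_null_strategy(df, row_threshold=5.0, col_threshold=60.0):
    """Hypothetical helper: suggest an action per column from its null percentage.

    Thresholds are illustrative assumptions, not fixed rules.
    """
    pct = df.isnull().sum() / len(df) * 100
    suggestions = {}
    for col, p in pct.items():
        if p == 0:
            suggestions[col] = 'no action'
        elif p > col_threshold:
            suggestions[col] = 'drop column'       # mostly null
        elif p < row_threshold:
            suggestions[col] = 'drop rows'         # only a handful affected
        else:
            suggestions[col] = 'fill (mean/median/mode)'
    return suggestions

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [None, None, None, 8],
                   'C': [1, 2, 3, 4]})
print(suggest_null_strategy(df))
```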
Step 5 - Dropping Nulls
```python
# Drop any row with at least one null
df_dropped = df.dropna()
print(df_dropped)

df.dropna(axis=1)   # Drop columns with any null
df.dropna(thresh=3) # Keep rows with at least 3 non-null values
```
Step 6 - Filling Nulls in Numeric Columns
For numeric columns, fill with mean when there are no significant outliers, or median when outliers exist.
```python
# Fill Age with mean (assignment avoids pandas' deprecated chained inplace fillna)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df['Age'])

# Fill Score with median
df['Score'] = df['Score'].fillna(df['Score'].median())
print(df['Score'])
```
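Why does the outlier rule matter? A quick illustration with a made-up score series shows how a single extreme value distorts the mean while leaving the median stable:

```python
import pandas as pd

# One outlier pulls the mean far from the typical value; the median barely moves
s = pd.Series([70.0, 75.0, 80.0, 85.0, 900.0])
print(s.mean())    # 242.0 - distorted by the outlier
print(s.median())  # 80.0 - robust to the outlier
```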
Step 7 - Filling Nulls in Categorical Columns
For categorical columns like Grade or City - filling with mean makes no sense. Use the mode - the most frequently occurring value.
```python
# Fill Grade with mode (assignment avoids pandas' deprecated chained inplace fillna)
df['Grade'] = df['Grade'].fillna(df['Grade'].mode()[0])
print(df['Grade'])
```
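To see why the `[0]` is needed, inspect what `.mode()` returns on a small sample series (note that when several values tie for most frequent, `.mode()` returns all of them and `[0]` simply keeps the first):

```python
import pandas as pd

grades = pd.Series(['A', 'B', None, 'B', 'C', None])
print(grades.mode())     # a Series with one row here: 0    B
print(grades.mode()[0])  # 'B' - the scalar actually used for filling
```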
mode()[0] is used because .mode() returns a Series. The [0] extracts the first - most frequent - value from it.

Step 8 - Checking and Fixing Data Types
A very common issue with CSV files is that numeric columns get loaded as strings when the column contains mixed content.
```python
data2 = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Watch'],
    'Price': ['999', '499', 'NaN', '199'],
    'Stock': [50, 30, 20, 15]
}
df2 = pd.DataFrame(data2)
print(df2.dtypes)
```
```python
# Convert Price to numeric
df2['Price'] = pd.to_numeric(df2['Price'], errors='coerce')
print(df2.dtypes)
print(df2)
```
```python
df2['Stock'] = df2['Stock'].astype(int)      # To integer
df2['Product'] = df2['Product'].astype(str)  # To string
```
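One caveat worth knowing: `astype(int)` raises an error if the column still contains NaN. A sketch using pandas' nullable `'Int64'` dtype (note the capital I) keeps missing values alongside integers:

```python
import pandas as pd

prices = pd.Series(['999', '499', 'NaN', '199'])
numeric = pd.to_numeric(prices, errors='coerce')  # float64 with one NaN

# numeric.astype(int) would fail here because of the NaN;
# the nullable Int64 dtype stores integers plus a proper missing marker (<NA>)
print(numeric.astype('Int64'))
```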
Step 9 - Complete End-to-End Cleaning Workflow
Below is the full seven-step workflow in a single block. This is the process to follow every time a new dataset lands on your desk.
```python
import pandas as pd
import numpy as np

# Step 1 - Load
df = pd.read_csv("students.csv")
print("Raw Data:")
print(df.head())

# Step 2 - Inspect
print("\nShape:", df.shape)
print(df.info())

# Step 3 - Check nulls
print("\nNull Counts:")
print(df.isnull().sum())
print("\nNull Percentage:")
print(df.isnull().sum() / len(df) * 100)

# Step 4 - Remove duplicates
print("\nDuplicates:", df.duplicated().sum())
df.drop_duplicates(inplace=True)

# Step 5 - Fix data types
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Score'] = pd.to_numeric(df['Score'], errors='coerce')

# Step 6 - Fill nulls by column type (assignment avoids deprecated inplace fillna)
df['Age'] = df['Age'].fillna(df['Age'].mean())          # Numeric → mean
df['Score'] = df['Score'].fillna(df['Score'].median())  # Numeric → median
df['Grade'] = df['Grade'].fillna(df['Grade'].mode()[0]) # Categorical → mode

# Step 7 - Confirm clean
print("\nNull counts after cleaning:")
print(df.isnull().sum())
print("\nClean Dataset:")
print(df)
```
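As a final safeguard, one possible pattern (a sketch, not a required step) is to assert the cleaning invariants directly, so a regression fails loudly instead of silently. The block below reuses the sample student data from Step 1:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Diana', 'Eve'],
    'Age': [22.0, None, 21.0, None, 23.0, 25.0],
    'Score': [88.0, 92.0, None, 92.0, 79.0, None],
    'Grade': ['A', 'B', None, 'B', 'C', None]
}
df = pd.DataFrame(data)
df = df.drop_duplicates()
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Score'] = df['Score'].fillna(df['Score'].median())
df['Grade'] = df['Grade'].fillna(df['Grade'].mode()[0])

# Invariants that must hold after cleaning - fail loudly instead of silently
assert df.isnull().sum().sum() == 0, "nulls remain"
assert df.duplicated().sum() == 0, "duplicates remain"
print("All checks passed")
```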
Quick Command Reference
| Task | Command |
|---|---|
| Check nulls per cell | df.isnull() |
| Count nulls per column | df.isnull().sum() |
| Null percentage | df.isnull().sum() / len(df) * 100 |
| Drop rows with nulls | df.dropna() |
| Drop columns with nulls | df.dropna(axis=1) |
| Fill with fixed value | df.fillna(0) |
| Fill numeric with mean | df['col'].fillna(df['col'].mean()) |
| Fill numeric with median | df['col'].fillna(df['col'].median()) |
| Fill categorical with mode | df['col'].fillna(df['col'].mode()[0]) |
| Count duplicates | df.duplicated().sum() |
| Remove duplicates | df.drop_duplicates() |
| Check data types | df.dtypes |
| Convert to numeric | pd.to_numeric(df['col'], errors='coerce') |
| Convert type manually | df['col'].astype(int) |
