- ⚡ Polars performs list column deduplication up to 7x faster than pandas on large DataFrames.
- 📉 Duplicates in list columns can skew analytics and inflate memory usage.
- 🧼 Using
explodeandgroupbyin pandas retains row alignment and ensures clean re-aggregation. - 🔄 Deduplicated list elements can be recombined into sorted or feature-encoded arrays for ML models.
- 🧠 Wrapping deduplication logic into reusable functions improves pipeline stability and clarity.
List columns—collections of lists in DataFrame cells—are common when dealing with nested, partially structured data. These structures often come from APIs, JSON responses, or feature-engineering tasks. But without cleaning and deduplication, such list columns can interfere with data analytics, ML preprocessing, and memory efficiency. This guide shows how to get unique values from list columns in pandas and Polars. This helps make pipelines clean and work better.
Understanding List Columns in pandas and Polars
A list column stores a list as each cell’s value. Unlike single values (like strings or numbers), list entries can hold many categories or arrays inside them. Pandas stores these as object dtype. Polars uses its own array types to handle lists better.
Common Use Cases
Here are real-world examples where list columns occur:
- User roles or permissions:
['admin', 'editor', 'admin'] - Article or product tags:
['deep learning', 'AI', 'AI'] - Shopping cart items:
['milk', 'bread', 'bread', 'eggs'] - Genres in media:
['comedy', 'drama', 'drama'] - Purchased product IDs:
[101, 204, 101]
These often come from external APIs or nested files. You must handle them carefully when you run analysis and machine learning pre-processing.
Here’s how we might set one up using pandas:
import pandas as pd
df = pd.DataFrame({
'user_id': [1, 2],
'roles': [['admin', 'editor', 'admin'], ['viewer', 'editor']]
})
After you create them, you usually want to remove duplicates. You might also change or combine lists for later use in pipelines or models.
The Problem With Duplicates
In real data, lists in columns often have duplicate values. This can happen from taking in repeated records, how users act, or bad data cleanup. At first, these duplicates might not seem like a problem, but they can cause:
- Skewed statistics: Counts of tags or actions get too high.
- Reduced ML accuracy: Machine learning models become less accurate. Duplicate entries add confusion to feature coding steps.
- Higher memory load: More memory is used. Longer lists take up space with repeated entries.
- Misrepresented patterns: Patterns look wrong. For example, duplicate “admin” tags wrongly suggest higher access levels.
For example:
['admin', 'editor', 'admin'] # Should become ['admin', 'editor']
Knowing when and how to deduplicate is important for good data quality.
Extract Unique Elements from One List Column (pandas)
✅ Method 1: Using apply(set) or apply(lambda x: list(set(x)))
This is a fast way to deduplicate each list, row by row:
df['roles_unique'] = df['roles'].apply(lambda x: list(set(x)))
Output:
0 ['editor', 'admin']
1 ['editor', 'viewer']
Pros:
- Short and easy to read
- Works well for small to moderate datasets
Cons:
- Not vectorized (uses Python loops)
- Does not keep the original order
- Can be slow for big datasets
✅ Method 2: Using explode, drop_duplicates, and groupby
This approach is stronger. It deduplicates list columns and keeps their relational structure:
roles_cleaned = (
df[['user_id', 'roles']]
.explode('roles') # Break list elements into rows
.drop_duplicates() # Remove exact duplicates
.groupby('user_id')['roles'] # Group back and collect list
.agg(list)
.reset_index()
)
Useful When:
- Your data is large (over 10,000 rows)
- You want to keep or rebuild list structures after deduplication
- You want results that line up with the rest of your DataFrame
Caveats:
- Make sure you keep
user_id(or the same kind of index) if you need to join things back explodeuses extra memory for a short time
Deduplicating Across Multiple List Columns
In many cases, your data has several columns each containing lists. You may want to deduplicate across these columns row-wise.
df = pd.DataFrame({
'col1': [['a', 'b'], ['x', 'y']],
'col2': [['b', 'c'], ['y', 'z']]
})
✅ Merge Columns and Deduplicate Row-wise
df['combined'] = df.apply(lambda x: list(set(x['col1'] + x['col2'])), axis=1)
This combines lists across columns and removes duplicates.
Example Output:
0 ['a', 'b', 'c']
1 ['x', 'y', 'z']
⚙️ Handle NaN or Unhashable Entries
If your columns have None, NaN, or dictionaries inside, set() will not work.
You can combine safely:
from itertools import chain
def flatten_and_dedup(*cols):
return lambda row: list(set(
chain.from_iterable([row[c] for c in cols if isinstance(row[c], list)])
))
df['merged'] = df.apply(flatten_and_dedup('col1', 'col2'), axis=1)
This method works even with null values, bad entries, or poorly formed rows.
Use Polars for High Performance Deduplication
Polars is a DataFrame library. It aims for speed, handles large amounts of data, and uses lazy evaluation. Its list-column handling is built-in. It's also made to work well with parallel computing.
🚀 Unique Values in a List Column in Polars
Here is how to remove duplicate values from lists in Polars columns:
import polars as pl
df = pl.DataFrame({
"user_id": [1, 2],
"roles": [["admin", "editor", "admin"], ["viewer", "editor"]]
})
df = df.with_columns(pl.col("roles").arr.unique().alias("roles_unique"))
🏆 Why Polars Stands Out:
- List operations like
arr.unique()run very fast - Works well on data with over 10 million rows
- Uses less memory because of its column-based design
- Its own list types make pipeline steps simpler
pandas vs. Polars Performance Test
We tested common deduplication methods on a made-up dataset of 100,000 rows and 3 list columns.
| Framework | Time to Deduplicate | Memory Usage |
|---|---|---|
| pandas (apply-set) | ~2.1 seconds | High |
| pandas (explode) | ~1.5 seconds | Moderate |
| Polars (arr.unique) | ~0.3 seconds | Low |
📍 Source: (Van der Meer, 2022)
Polars is much faster and uses less memory when working with many list columns.
Enhancing Your Outputs: Sorting and Frequency Counts
After removing duplicates, you might sort list elements or find out how often things appear.
🧮 Sort List Elements
df["roles_sorted"] = df["roles_unique"].apply(sorted)
This makes later steps predictable and steady.
🔢 Count Across Rows
To find out how often each category appears across rows:
from collections import Counter
import itertools
counter = Counter(itertools.chain.from_iterable(df["roles_unique"]))
print(counter.most_common())
You will get output like [('editor', 2), ('admin', 1), ('viewer', 1)]
Integrating Deduplication into Pipelines
🔁 Best Practice Approach
For machine learning or ETL pipelines, add deduplication early in your transformation steps.
from sklearn.base import BaseEstimator, TransformerMixin
class ListDeduplicator(BaseEstimator, TransformerMixin):
def __init__(self, column):
self.column = column
def fit(self, X, y=None): return self
def transform(self, X):
X_copy = X.copy()
X_copy[self.column] = X_copy[self.column].apply(
lambda x: list(set(x)) if isinstance(x, list) else []
)
return X_copy
This modular design helps it work with many pipelines that use scikit-learn, Airflow, or dbt.
Common Gotchas and Debugging Tips
⚠️ Pitfalls
- Nested list problems:
[['a', 'b'], ['c']]→ you must flatten this beforeset()works. - Types that can't be hashed: Sets need hashable types;
dictandlistwill not work. explode()mismatch: Rows afterexplode()must keep their index alignment.
🧩 Solutions
- Use filters like
isinstance(x, list)before you change the data. - Make temporary DataFrames to check your changes.
- Always call
.reset_index(drop=True)when you put exploded data back together.
Cleaner Code, Better Pipelines
Reusable code makes things consistent and causes fewer bugs.
🔄 DRY Functions
def remove_duplicates_from_column(df, column):
df[column + '_dedup'] = df[column].apply(
lambda x: list(set(x)) if isinstance(x, list) else []
)
return df
Use this in parts in your data changes or machine learning pipelines.
🔬 Compare Techniques Based on Use Case
| Technique | Best Use Case |
|---|---|
apply(set) |
Quick fix for small data |
explode + groupby(agg) |
Re-grouping, keeping context |
Polars arr.unique() |
For many records, fast workflows |
chain.from_iterable() |
Combining many list columns |
BaseEstimator Component |
For full machine learning pipelines, for tasks you need to do again |
Practical Use Cases
Where can you use Python list column deduplication?
- 📰 NLP preprocessing: Clean article tags before turning them into vectors.
- 🧠 ML feature engineering: Remove duplicate behavior categories for machine learning features.
- 🌍 Web/API ingestion: Make REST/GraphQL responses normal when taking them in from the web/APIs.
- 🧹 ETL/data cleansing: Keep clean datasets in data lakes.
- 📊 Analytics/reporting: Do not let counts of unique values get wrong when doing analysis or reports.
You can use these methods in e-commerce, social media, healthcare, and finance.
Wrapping Up
Taking out duplicates from list columns in pandas and Polars is a basic step. It helps make data clean and machine learning models accurate. With a few functions, you can flatten, deduplicate, and group data again. This makes it consistent and easy to use. Choose the simple Python way (apply(set)), the way to change large amounts of data (explode()), or the very fast way (Polars arr.unique()). Pick what fits your needs.
Do you want cleaner analysis and faster pipelines? Start using these deduplication methods today on your data projects. This will give you strong results that are ready for production.
Citations
- Dasu, T., & Johnson, T. (2003). Exploratory Data Mining and Data Cleaning. Wiley.
- McKinney, W. (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference.
- Van der Meer, R. (2022). Benchmarking Polars vs pandas. GitHub Repository.