Python Memory Usage: How to Handle Large Data?

Struggling with Python RAM limits on large datasets? Learn how to manage memory while preserving data order and structure effectively.
Visualization: Python crashing with a memory error on the left vs. efficient data streaming with optimized memory usage on the right.
  • 🧠 Python’s object model adds per-object overhead, which makes it hard to scale to big data jobs.
  • ⚠️ Loading a 1GB CSV in pandas can use over 5GB of RAM because of dtype inference and internal object structures.
  • 🛠️ Libraries like Dask, Vaex, and NumPy handle large datasets efficiently, enabling parallel or out-of-core processing.
  • 💡 Streaming data with generators or chunked reads dramatically lowers memory usage compared to loading whole files.
  • 📊 Memory diagnostics like psutil and tracemalloc reveal hidden problems and memory leaks in your Python code.

If you've run into out-of-memory errors or slow performance when working with large files in Python, you're not alone. Python's flexible, high-level design makes coding easier, but it comes at a cost, especially in memory management. This guide looks at ways to manage Python memory usage effectively when working with large data, along with memory-saving techniques for handling large files. From streaming data and generator patterns to specialized libraries like Dask and NumPy, you'll get practical advice for handling even the biggest jobs.


Why Python Struggles with Large In-Memory Data

Python puts developer productivity and readability first, not raw speed or memory efficiency. Every Python object, from a simple integer to a complex dictionary, carries overhead: a reference count, a type pointer, and alignment padding. Python hides these details, but they mean objects take up far more memory than the data they hold.

Let’s compare native Python and C:


  • A Python int can use over 28 bytes.
  • A Python str uses even more because of Unicode handling and dynamic allocation.
  • Lists and dictionaries store pointers to full objects rather than the raw data itself.
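A quick way to see this overhead is sys.getsizeof. The numbers below assume CPython on a 64-bit build and may differ slightly on other interpreters:

```python
import sys

# Per-object overhead of common Python types (CPython, 64-bit).
print(sys.getsizeof(1))    # small int: ~28 bytes
print(sys.getsizeof("a"))  # one-character str: ~50 bytes
print(sys.getsizeof([]))   # empty list: ~56 bytes
print(sys.getsizeof({}))   # empty dict: ~64 bytes

# A list of a million ints pays this cost a million times over.
nums = list(range(1_000_000))
# getsizeof reports only the list's pointer array (~8 MB),
# not the million int objects it references.
print(sys.getsizeof(nums))
```

Note that sys.getsizeof is shallow: for containers it counts only the pointer storage, so the true footprint of a list of objects is far larger than what it reports.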

This per-object overhead multiplies quickly. For example, a seemingly small 1GB CSV can occupy 5–7GB of RAM once loaded into a pandas DataFrame, due to dtype inference, intermediate buffers, and index construction (Python Software Foundation, 2023).

So if you're handling large data in Python, especially on machines with limited RAM, you'll need tools and techniques that optimize storage at every step.


Use Iterators & Generators Instead of Lists

When working with large files or datasets, list comprehensions and eager loading can exhaust memory. Lists keep every item in memory at once, which can overwhelm your machine for files with millions of entries.

Use Generators for Lazy Evaluation

A generator yields one item at a time, producing data only when it's needed. This greatly reduces memory usage.

Inefficient:

lines = [line for line in open('large_file.txt')]

Here, all lines load into memory immediately, which can crash your process for files larger than available RAM.

Efficient:

def read_lines(filename):
    with open(filename) as f:
        for line in f:
            yield line

for line in read_lines('large_file.txt'):
    process(line)

This yields one line at a time, releasing memory as it goes. It works well for processing large files such as logs, sensor data, or text corpora.

Other Ways to Use Generators

  • Reading records from databases with cursor pagination.
  • Processing JSON records from an API feed.
  • Real-time log processing.

Using iterators is good Python practice. And it's key if you want to handle large files in Python without problems.
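Generators also chain naturally into pipelines, where each stage pulls items lazily from the previous one and no intermediate list is ever built. A minimal, self-contained sketch (file name and log format hypothetical):

```python
def read_lines(filename):
    # Yield one line at a time instead of loading the whole file.
    with open(filename) as f:
        for line in f:
            yield line

# Write a small sample file so the sketch is runnable as-is.
with open("sample.log", "w") as f:
    f.write("ERROR disk full\nINFO ok\nERROR timeout\n")

# Build a lazy pipeline: nothing executes until the final loop pulls items.
lines = read_lines("sample.log")
errors = (line for line in lines if line.startswith("ERROR"))
messages = (line.split(" ", 1)[1].strip() for line in errors)

result = list(messages)
print(result)  # ['disk full', 'timeout']
```

Each stage holds only one line at a time, so the pipeline's memory use stays constant no matter how large the input file grows.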


Chunking Large Files

Chunking breaks large datasets into smaller pieces that are processed sequentially. Pandas and standard I/O make this easy and memory-friendly.

Using pandas.read_csv() with Chunks

import pandas as pd

for chunk in pd.read_csv('big_data.csv', chunksize=10000):
    process(chunk)

  • Loads 10,000 rows at a time.
  • Well suited to filtering, transforming, or aggregating rows.
  • Keeps memory usage bounded and avoids loading the whole dataset into RAM at once.
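Chunking also supports aggregations that span the whole file: keep a small running total per chunk and combine the totals at the end. A minimal sketch (file and column names hypothetical, with a tiny sample CSV written first so it runs as-is):

```python
import pandas as pd

# Write a small sample CSV so the sketch is self-contained.
pd.DataFrame({"value": range(10)}).to_csv("sample.csv", index=False)

total = 0
count = 0
for chunk in pd.read_csv("sample.csv", chunksize=4):
    # Only one chunk is in memory at a time; keep scalar running totals.
    total += chunk["value"].sum()
    count += len(chunk)

mean = total / count
print(mean)  # 4.5
```

The same pattern works for counts, sums, min/max, and group-wise totals; only aggregations that need the full dataset at once (like exact medians) require a different approach.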

Manual File Chunking

You can also stream file chunks by hand:

with open('large_log.txt') as f:
    while (chunk := f.read(8192)):
        process(chunk)

Or line by line:

with open('large_log.txt') as f:
    for line in f:
        process(line)

This is especially useful when you only need to filter, parse, or transform plain-text files such as logs, sensor output, or delimited files.


Streaming Data Instead of Loading All at Once

Python has powerful tools to stream data instead of loading it all at once. Streaming avoids memory spikes and speeds up processing when handling large files.

Streaming with Standard Libraries

import gzip

with gzip.open('large_data.gz', 'rt') as f:
    for line in f:
        process(line)

This pattern suits compressed log files or genomic data: streaming lets you parse large gzip archives line by line without decompressing them to disk first.

You can also stream CSVs with the standard library:

import csv

with open('big.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        process(row)

Streaming is the go-to approach for network feeds, database pipelines, and log processors, and a solid default when managing large data in Python with limited memory.


Memory-Efficient Libraries for Big Data

Python’s ecosystem offers several memory-efficient libraries designed for performance and for data that doesn't fit in memory.

Dask

  • Parallel, scalable computing.
  • Works with pandas-like DataFrames too big for memory.
  • Uses lazy evaluation and blocked algorithms.

import dask.dataframe as dd

df = dd.read_csv('big_data.csv')
result = df[df['value'] > 100].compute()

(Dask Development Team, 2020)

Vaex

  • Fast DataFrame library built around lazy evaluation.
  • Designed for filtering, grouping, and visualizing data without loading it all into memory.

import vaex

df = vaex.open('large_dataset.csv')
df_filtered = df[df.value > 100]

PyTables / HDF5

  • Well suited to storing numerical data in a hierarchical format.
  • Supports compression, indexing, and on-disk persistence.

Columnar Formats: Feather and Parquet

  • Store DataFrames in a column-oriented format optimized for fast reads and selective queries.
  • Good for machine learning, analytics, and persisting intermediate results.

These libraries change how Python handles large files: they offload memory-heavy work to disk or to optimized native code.


Use of NumPy for Lower-Level Memory Control

When working with numerical data, NumPy is essential. It packs data far more tightly and performs much better than Python lists.

Why NumPy Wins

  • Fixed-type, densely packed arrays.
  • Contiguous memory layout.
  • No per-object Python overhead.

import numpy as np

arr = np.array([1, 2, 3, 4], dtype=np.int32)  # 4 bytes per element

Compare:

data = [1, 2, 3, 4]  # ~28+ bytes per int object

NumPy lets you compute efficiently on large datasets with SIMD and vectorization.
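The memory difference is easy to measure: an array's nbytes counts only the packed payload, while a list stores one pointer per element to a full int object. A small sketch:

```python
import sys
import numpy as np

n = 1_000_000
arr = np.arange(n, dtype=np.int32)
lst = list(range(n))

print(arr.nbytes)          # 4,000,000 bytes of packed int32 data
print(sys.getsizeof(lst))  # ~8 MB for the pointer array alone,
                           # not counting the million int objects

# Vectorized math runs in compiled code over the packed buffer:
doubled = arr * 2
print(doubled[:3])  # [0 2 4]
```

Counting the referenced int objects as well, the list uses roughly 10x the memory of the int32 array, and the vectorized multiply avoids a million interpreter-level loop iterations.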

It works well with:

  • Pandas (as the backend).
  • Machine Learning jobs.
  • Scientific computing or image processing.

(Oliphant, 2006, Van der Walt et al., 2011)


Avoiding Common Mistakes When Handling Big Data

Memory problems often come from poor coding habits:

  • ❌ Unnecessary copies: e.g., data_copy = data[:]
  • ❌ Deep-copying nested dicts and lists.
  • ❌ Accumulating all results in large lists or dicts.
  • ❌ Holding on to large variables: release them with del variable and gc.collect()

Better ways to work:

  • Use .loc[] in pandas for explicit selection and avoid chained indexing, which creates intermediate copies.
  • Store only the data you need from processes.
  • Don’t accumulate all results in memory. Write them to a file or stream them to a queue as they are produced.
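For the last point, here is a minimal sketch of writing results as they are produced instead of collecting them (file name and transform function hypothetical):

```python
# Instead of: results = [transform(x) for x in source]  (holds everything),
# write each result out as soon as it is computed.

def transform(x):
    return x * x  # placeholder for real per-record work

source = range(5)  # stands in for a large data stream

with open("results.txt", "w") as out:
    for x in source:
        # Memory stays flat: only one result exists at a time.
        out.write(f"{transform(x)}\n")

with open("results.txt") as f:
    print(f.read().split())  # ['0', '1', '4', '9', '16']
```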

And remember: in large data Python jobs, every improvement helps.


Using psutil and Tools to Monitor Python Memory Usage

Monitoring current memory usage reveals problems and catches leaks early.

Install psutil

pip install psutil

Track Memory Usage

import psutil, os

proc = psutil.Process(os.getpid())
print(f"Memory usage: {proc.memory_info().rss / (1024 ** 2):.2f} MB")

Use tracemalloc for Leak Detection

import tracemalloc

tracemalloc.start()
# ... run memory-heavy code ...
print(tracemalloc.get_traced_memory())
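tracemalloc can also pinpoint which lines allocated the most: take_snapshot plus statistics('lineno') groups traced allocations by source line. A self-contained sketch:

```python
import tracemalloc

tracemalloc.start()

# Allocate something measurable (~1 MB).
data = [bytes(1000) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")

# The heaviest allocation sites come first.
for stat in top_stats[:3]:
    print(stat)

current, peak = tracemalloc.get_traced_memory()
print(f"current={current} bytes, peak={peak} bytes")
tracemalloc.stop()
```

The per-line statistics make it easy to spot the one loop or comprehension responsible for a memory spike, rather than just knowing the total grew.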

By adding memory checks to unit tests or ETL pipelines, you can find scaling problems early.


Distributing Work with Multiprocessing and Memory Maps

Splitting work across multiple cores helps spread out both CPU and memory usage.

Basic Multiprocessing

from multiprocessing import Pool

with Pool(4) as p:
    p.map(process_func, data_chunks)

Each worker process loads only its slice of the data, which keeps total memory in check. For I/O-bound tasks, use threading instead.

Memory Mapping Large Files

mmap lets you work with a file as if it were a byte array, without reading it all into memory.

import mmap

with open("large_binary.dat", "r+b") as f:
    mmapped = mmap.mmap(f.fileno(), 0)
    fragment = mmapped[:10000]  # Read first 10KB

This is good for accessing parts of a file randomly, like making indexes for documents or working with large binary data streams.


When to Use Serialization and Memory Mapping

When results can be reused later, or when RAM is limited, serializing data to disk offloads memory pressure.

Using joblib

import joblib

joblib.dump(large_data, "snapshot.pkl")
# When needed
large_data = joblib.load("snapshot.pkl")

This is good for storing ML feature data for faster access.

NumPy Memory-Mapped Arrays

import numpy as np

arr = np.memmap("big_array.dat", dtype="float32", mode="r+", shape=(10000, 10000))

This lets you work with data as if it’s in memory, even when it's really on disk.


Key Practices for Writing Memory-Efficient Python Code

Good code for processing large files in Python typically follows a modular, streaming design.

Best ways to work:

  • ✅ Use deque or array.array instead of lists for large numbers of small, uniform items.
  • ✅ Compress data with gzip or columnar formats before saving or sending it.
  • ✅ Release large objects explicitly with del and trigger garbage collection when needed.
  • ✅ Stream analysis results to disk instead of accumulating everything in memory.
  • ✅ Filter data early (on disk or while loading).
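The first point is easy to verify: array.array stores fixed-size machine values contiguously, much like NumPy but in the standard library. A quick comparison:

```python
import sys
from array import array

n = 100_000
packed = array("i", range(n))  # signed ints, packed (typically 4 bytes each)
boxed = list(range(n))         # pointers to full int objects

print(packed.itemsize)  # bytes per element, typically 4 for "i"
print(packed.buffer_info()[1] * packed.itemsize)  # total payload, ~400,000 bytes
print(sys.getsizeof(boxed))  # ~800 KB for the pointer array alone
```

array.array lacks NumPy's vectorized math, but for plain storage of homogeneous numbers it cuts memory dramatically with no third-party dependency.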

These habits reduce memory churn and help programs scale.


Real-World Example: Processing 10GB CSV File on 1GB RAM

You need to process a 10GB file, but only have 1GB of RAM. Here's how to do it:

import pandas as pd
import gc

results = []
for chunk in pd.read_csv("huge_logs.csv", chunksize=50000):
    chunk_filtered = chunk[chunk["status"] == "OK"]
    results.append(chunk_filtered[["timestamp", "user_id"]])
    gc.collect()  # Free memory aggressively

To make it even better, avoid gathering all results. Instead:

with open("filtered_output.csv", "a") as out_f:
    for chunk in pd.read_csv("huge_logs.csv", chunksize=50000):
        chunk_filtered = chunk[chunk["status"] == "OK"]
        chunk_filtered[["timestamp", "user_id"]].to_csv(out_f, index=False, header=False)
        gc.collect()

This lets you handle large files in Python with very tight RAM limits. And it still keeps processing speed high.


Additional Tips for Low-RAM Environments (Containers, Raspberry Pi)

Devices with less than 1GB RAM (e.g., Raspberry Pi, VPS containers) need careful memory management.

Tips:

  • Enable swap as a backstop. It's slow, but it prevents out-of-memory crashes.
  • Prefer Parquet or Feather for DataFrames (compressed + columnar).
  • Remove unused modules to shrink container images.
  • Persist intermediate steps to disk as you go.
  • Log memory usage and iteration counts to aid debugging.

These device-aware practices make Python viable even on small or serverless machines.


Summary and Next Steps

Handling large data in Python doesn’t require supercomputers, just smarter techniques. Optimize file reading, use generators, offload heavy work to libraries like Dask or Vaex, and profile your program's memory use early. Whether you're building machine learning pipelines, IoT log processors, or bioinformatics tools, these practices will let your Python code scale.

Before starting on your next large dataset:

  • Can you stream the file instead of reading all at once?
  • Would reading in chunks make memory use easier?
  • Could Dask or NumPy accelerate or offload the heavy lifting?

Adopt these techniques incrementally and you'll build efficient data workflows without slowdowns.


Citations

  • Oliphant, T. E. (2006). A guide to NumPy. USA: Trelgol Publishing.
  • Dask Development Team. (2020). Dask: Library for dynamic task scheduling. Retrieved from https://dask.org
  • Python Software Foundation. (2023). Memory management in Python. Retrieved from https://docs.python.org/3
  • McKinney, W. (2017). Python for Data Analysis: Data Wrangling with pandas, NumPy, and IPython (2nd ed.). O’Reilly Media.
  • Van der Walt, S., Colbert, S. C., & Varoquaux, G. (2011). The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering, 13(2), 22-30.