Why does changing values in a pandas column take more time than changing values in a list? In my project, I usually take a DataFrame column out as a list:
my_list = my_datafram['column a']
and after changing the data, I save it back to the DataFrame column:
my_datafram['column a'] = my_list
So it's not a problem, but I am curious to know what is behind it. I ran a quick test:
from time import perf_counter
import pandas as pd
example_list = [0 for _ in range(1000000)]
start = perf_counter()
for i in range(1000000):
    example_list[i] = i
end = perf_counter()
print(f"Speed of list: {end-start}")
example_dataframe = pd.DataFrame([0 for _ in range(1000000)])
start = perf_counter()
for i in range(1000000):
    example_dataframe.loc[i, 0] = i
end = perf_counter()
print(f"Speed of dataframe: {end-start}")
and the result was:
Speed of list: 0.034354630000052566
Speed of dataframe: 21.46266417700008
>Solution :
The difference in performance between modifying values in a list and a DataFrame column is primarily due to the underlying data structures and the overhead associated with DataFrame operations.
When you modify a value in a list, you replace one entry in a contiguous array of object references. This operation is efficient because CPython implements lists as dynamic arrays, so indexing and assignment by position take constant time with very little work per call.
On the other hand, when you modify values in a DataFrame column, you are operating on a more complex data structure. DataFrame columns are internally represented as Series objects, which are essentially one-dimensional arrays with additional metadata and index labels. When you modify a value in a DataFrame column, Pandas needs to perform additional checks and operations to maintain the integrity of the DataFrame structure, such as alignment with the index labels and potential resizing of underlying arrays. These additional operations introduce overhead compared to modifying a simple list.
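To make the "array plus metadata" point concrete, here is a minimal sketch. The DataFrame, the column name "a", and the labels "x", "y", "z" are made up for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical three-row frame with string index labels.
df = pd.DataFrame({"a": [10, 20, 30]}, index=["x", "y", "z"])

# Selecting a column gives back a Series, not a plain list.
col = df["a"]
kind = type(col).__name__      # "Series"
labels = list(col.index)       # the index labels travel with the values

# An assignment is resolved through the labels ("y", "a"),
# not through a raw position as in a list.
df.loc["y", "a"] = 99
values_after = df["a"].tolist()

print(kind, labels, values_after)
```

Every write has to go through that label lookup, which is part of the extra work a list never does.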
Additionally, accessing and updating individual elements in a DataFrame using loc involves more overhead than directly accessing elements in a list due to the indexing and alignment mechanisms used by Pandas.
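If you want to see how much of the cost comes from the accessor, a rough sketch like the following compares three ways of writing the same values: loc in a loop, the scalar accessor at in a loop, and a single whole-column assignment. The helper timed and the size n are my own choices, and the timings will vary with your machine and pandas version, so treat it as an experiment rather than a benchmark.

```python
from time import perf_counter

import numpy as np
import pandas as pd

n = 10_000  # kept small so the sketch runs quickly


def timed(fn):
    """Run fn once and return the elapsed wall-clock seconds."""
    start = perf_counter()
    fn()
    return perf_counter() - start


# Three identical single-column frames, one per approach.
df_loc = pd.DataFrame({0: np.zeros(n, dtype="int64")})
df_at = df_loc.copy()
df_vec = df_loc.copy()


def fill_loc():
    # General-purpose label indexer, called once per element.
    for i in range(n):
        df_loc.loc[i, 0] = i


def fill_at():
    # Scalar-only accessor, still called once per element.
    for i in range(n):
        df_at.at[i, 0] = i


def fill_vectorized():
    # One assignment for the whole column.
    df_vec[0] = np.arange(n, dtype="int64")


print(f"loc loop:   {timed(fill_loc):.4f} s")
print(f"at loop:    {timed(fill_at):.4f} s")
print(f"one assign: {timed(fill_vectorized):.4f} s")
```

All three leave the column holding 0 to n-1; only the number of pandas calls differs.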
In your specific test case, modifying values in the list is significantly faster than modifying values in the DataFrame column because of these factors: the loop pays the pandas indexing overhead one million times. If you need to change values one element at a time in a Python loop, working with a native structure like a list and assigning the result back to the column once may offer better performance. When the change can be expressed on the whole column at once, a single vectorized assignment avoids the per-element overhead entirely. DataFrame operations also provide built-in functionality for data manipulation, aggregation, and analysis, which can be more convenient and expressive for many tasks despite the associated overhead.
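For reference, the list round trip from the question could look like the sketch below. It assumes the column is converted with tolist() first (plain df['column a'] returns a Series, not a list); the tiny frame and the squaring step are only placeholders.

```python
import pandas as pd

# Placeholder frame; "column a" mirrors the name used in the question.
df = pd.DataFrame({"column a": [0] * 5})

# 1. Pull the column out as a real Python list.
values = df["column a"].tolist()

# 2. Do the element-by-element work on the list.
for i in range(len(values)):
    values[i] = i * i

# 3. Write everything back with a single column assignment.
df["column a"] = values
result = df["column a"].tolist()

print(result)
```

This way pandas is touched twice in total, instead of once per element.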