Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Resize a numpy array of lists so that so that the lists all have the same length and dtype of numpy array can be inferred correctly

I currently have the following dataframe

data = {'col_a': [['a', 'b'], ['a', 'b', 'c'], ['a'], ['a', 'b', 'c', 'd'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']],
        'col_b':[[1, 3], [1, 0, 0], [4], [1, 1, 2, 0], [0, 0, 5], [3, 1, 2, 5]]}
df= pd.DataFrame(data)

Suppose I work with col_a, I want to resize the lists in col_a in a vectorized manner so that the length of all the sub lists = max length of largest list and I want to fill the empty values with 'None' in the case of col_a. I want the final output to look as follows

                   col_a               col_b
0     [a, b, None, None]    [1, 3, nan, nan]
1        [a, b, c, None]      [1, 0, 0, nan]
2  [a, None, None, None]  [4, nan, nan, nan]
3           [a, b, c, d]        [1, 1, 2, 0]
4        [a, b, c, None]      [0, 0, 5, nan]
5           [a, b, c, d]        [3, 1, 2, 5]

So far I have done the following

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

# Convert the column to a NumPy array with object dtype
col_np = df['col_a'].to_numpy()

# Find the maximum length of the lists using NumPy operations
max_length = np.max(np.frompyfunc(len, 1, 1)(col_np))

# Create a mask for padding
mask = np.arange(max_length) < np.frompyfunc(len, 1, 1)(col_np)[:, None]

# Pad the lists with None where necessary
result = np.where(mask, col_np, 'None')

This results in the following error
ValueError: operands could not be broadcast together with shapes (6,4) (6,) ()

I feel like I’m close but there’s something that I’m missing here. Please note that only vectorized solutions will be marked as the answer.

>Solution :

Only vectorized solutions will be marked as the answer. -> that’s too bad because no (true) vectorized approach is possible with an array of lists. To this extent, np.frompyfunc is certainly not truly vectorized.

If by "vectorized" you mean without explicit python loop, you could use:

df['out_a'] = pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())

An alternative with an explicit loop would be:

size = df['col_a'].str.len().max()

df['out_a'] = [l+[None]*(size-len(l)) for l in df['col_a']]

Output:

          col_a         col_b                  out_a
0        [a, b]        [1, 3]     [a, b, None, None]
1     [a, b, c]     [1, 0, 0]        [a, b, c, None]
2           [a]           [4]  [a, None, None, None]
3  [a, b, c, d]  [1, 1, 2, 0]           [a, b, c, d]
4     [a, b, c]     [0, 0, 5]        [a, b, c, None]
5  [a, b, c, d]  [3, 1, 2, 5]           [a, b, c, d]

timings

For small lists the "vectorized" and loop solution have very similar timings.

Here for lists with 1 to 10 items:

enter image description here

However, when the size of lists increases, the python loop become more efficient.

For list with 0 to 50 items:

enter image description here

0 to 200 items:

enter image description here

0 to 2000 items:

enter image description here

Code used for the timings:

import pandas as pd
import perfplot
import numpy as np

def pandas_vectorized(df):
    df['out_a'] = pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())
    
def python_loop(df):
    size = df['col_a'].str.len().max()
    df['out_a'] = [l+[None]*(size-len(l)) for l in df['col_a']]

MAX_LIST_SIZE = 2000
    
perfplot.show(
    setup=lambda n: pd.DataFrame({'col_a': [['x']*n for n in np.random.randint(0, MAX_LIST_SIZE, size=n)]}),
    kernels=[pandas_vectorized, python_loop],
    n_range=[2**k for k in range(1, 18)],  # upper bound was 22 for small lists
    xlabel="len(df)",
    equality_check=None,
    max_time=10,
)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading