Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Speeding up loop for reorganizing pandas DataFrame into numpy array using slicing throws exception – what am I missing?

I have a pandas DataFrame like so:

raw_data = DataFrame({
    'date_idx': [0, 1, 2, 0, 1, 2],
    'element_idx': [0, 0, 0, 1, 1, 1],
    'a': [10, 20, 30, 40, 50, 60],
    'b': [11, 21, 31, 41, 51, 61],
    'c': [12, 22, 32, 42, 52, 62],
})

I call the columns other than date_idx and element_idx "inputs". I want to reorganize it into a 3d numpy array by date_idx -> input_idx -> element_idx, so that the result is like so:

[[[10. 40.]
  [11. 41.]
  [12. 42.]]

 [[20. 50.]
  [21. 51.]
  [22. 52.]]

 [[30. 60.]
  [31. 61.]
  [32. 62.]]]

I did it with two for loops, and it works well:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

date_idx = [0, 1, 2, 0, 1, 2]
element_idx = [0, 0, 0, 1, 1, 1]
raw_data = DataFrame({
    'date_idx': date_idx,
    'element_idx': element_idx,
    'a': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    'b': [11.0, 21.0, 31.0, 41.0, 51.0, 61.0],
    'c': [12.0, 22.0, 32.0, 42.0, 52.0, 62.0],
})

inputs = ['a', 'b', 'c']

unique_dates = set(date_idx)
unique_elements = set(element_idx)
data = np.zeros(shape=(len(unique_dates), len(inputs), len(unique_elements)), dtype=np.float64)

for i in range(len(raw_data)):
    row = raw_data.iloc[i]
    date_idx = int(row['date_idx'])
    element_idx = int(row['element_idx'])

    for input_idx in range(len(inputs)):
        data[date_idx][input_idx][element_idx] = float(row[inputs[input_idx]])

print(data)

However, this is very slow. I have millions of entries for the date_idx array, and dozens for both inputs and element_idx. It takes 7 hours on my machine for this to complete with my real data set.

I have a feeling this could be done with slicing, no loops, but my attempts always fail – I’m missing something.

For example, I tried to eliminate the inner loop with:

for i in range(len(raw_data)):
    row = raw_data.iloc[i]
    date_idx = int(row['date_idx'])
    element_idx = int(row['element_idx'])

    data[date_idx][:][element_idx] = list(dict(row[inputs]).values())

And it fails with:

Traceback (most recent call last):
  File "/home/stark/Work/mmr6/test2.py", line 84, in <module>
    data[date_idx][:][element_idx] = list(dict(row[inputs]).values())
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
ValueError: could not broadcast input array from shape (3,) into shape (2,)

My question is, can slicing and / or fast technique be used to reorganize this DataFrame in that fashion on the plain numpy array, or do I really need the loops here?

>Solution :

I think you’re searching for pivoting the dataframe and then convert it to numpy array:

num_unique_elements = raw_data['element_idx'].nunique()
num_unique_inputs = 3 # a, b, c

df = pd.pivot(raw_data, index='date_idx', columns='element_idx')
df = df.stack(level=0)
print(df.to_numpy().reshape(-1, num_unique_inputs, num_unique_elements))

Prints:

[[[10 40]
  [11 41]
  [12 42]]

 [[20 50]
  [21 51]
  [22 52]]

 [[30 60]
  [31 61]
  [32 62]]]

Steps:

df = pd.pivot(raw_data, index='date_idx', columns='element_idx')
print(df)

              a       b       c    
element_idx   0   1   0   1   0   1
date_idx                           
0            10  40  11  41  12  42
1            20  50  21  51  22  52
2            30  60  31  61  32  62

Then reshape it using .stack()

df = df.stack(level=0)
print(df)

element_idx   0   1
date_idx           
0        a   10  40
         b   11  41
         c   12  42
1        a   20  50
         b   21  51
         c   22  52
2        a   30  60
         b   31  61
         c   32  62

Then convert it to numpy array:

print(df.to_numpy().reshape(-1, num_unique_inputs, num_unique_elements))

[[[10 40]
  [11 41]
  [12 42]]

 [[20 50]
  [21 51]
  [22 52]]

 [[30 60]
  [31 61]
  [32 62]]]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading