Speed up pandas iteration with multiple Excel sheets

I have an Excel file with 1000 sheets, and each sheet contains a dataframe. To feed my model with these data, I try to convert them to 1000 batches of tensors. Here's my code:

import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.read_excel('file.xlsx', sheet_name=None)
file_names = list(df.keys())

columns = ['A','B','C']
features = []
labels = []
for n in file_names:
  # re-opens and re-parses the whole workbook for every single sheet
  df = pd.read_excel('file.xlsx', sheet_name=n)
  features.append(df[columns].to_numpy())
  labels.append(df['D'].to_numpy())
  
Y = tf.convert_to_tensor(np.stack(labels), dtype=tf.float32)
X = tf.convert_to_tensor(np.stack(features), dtype=tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((X, Y))

My code works fine, but it takes over an hour to iterate over all the sheets.
I will have more than 1000 batches of data in the future, so splitting the workbook into several thousand CSV files does not seem like a good idea either.
How can I speed up the process?


Solution:

You could read file.xlsx just once: calling pd.read_excel with sheet_name=None parses all worksheets into a dictionary of dataframes, and the loop can then pull each sheet from that dictionary instead of re-reading the file:

import tensorflow as tf
import pandas as pd
import numpy as np
from random import sample

### Create data (the context manager saves and closes the file;
### ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter('file.xlsx', engine='xlsxwriter') as writer:
  for i in range(1000):
    df = pd.DataFrame({'A': [1, i, 1, 2, 9], 'B': [3, 4, i, 1, 4], 'C': [3, 4, 3, i, 4], 'D': [1, 2, 6, 1, 4], 'E': [0, 1, 1, 0, 1]})
    df.to_excel(writer, sheet_name='Sheet' + str(i))

# Parse the workbook once: sheet_name=None returns a {sheet name: DataFrame} dict
df = pd.read_excel('file.xlsx', sheet_name=None)
file_names = list(df.keys())

columns = ['A','B','C']
features = []
labels = []
for n in file_names:
  temp_df = df[n]  # look up the sheet in memory instead of re-reading the file
  features.append(temp_df[columns].to_numpy())
  labels.append(temp_df['D'].to_numpy())
  
Y = tf.convert_to_tensor(np.stack(labels), dtype=tf.float32)
X = tf.convert_to_tensor(np.stack(features), dtype=tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((X, Y))
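
If even one full parse per run is too slow, a further option (not part of the original answer) is to cache the stacked arrays to disk after the first pass, so the Excel parsing only ever happens once. A minimal sketch, assuming the features and labels lists from the block above; the file_cache.npz path is just a placeholder:

import os

CACHE = 'file_cache.npz'  # hypothetical cache path

if os.path.exists(CACHE):
    cached = np.load(CACHE)             # later runs: load pre-stacked arrays, skip Excel parsing
    X_np, Y_np = cached['X'], cached['Y']
else:
    X_np, Y_np = np.stack(features), np.stack(labels)
    np.savez(CACHE, X=X_np, Y=Y_np)     # first run: persist the parsed arrays

X = tf.convert_to_tensor(X_np, dtype=tf.float32)
Y = tf.convert_to_tensor(Y_np, dtype=tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((X, Y))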

Furthermore, you could create your own custom data generator that draws random sheets from the already-loaded dictionary; tf.data then builds batches lazily, which should also speed things up:

df = pd.read_excel('file.xlsx', sheet_name=None)
file_names = list(df.keys())
columns = ['A','B','C']

def generator_function(samples=64):
    def generator():
        # draw `samples` random sheets from the in-memory dictionary
        for n in sample(file_names, samples):
            temp_df = df[n]
            x = temp_df[columns].to_numpy(dtype=np.float32)  # cast to match the declared spec
            y = temp_df['D'].to_numpy(dtype=np.int32)
            yield x, y
    return generator

gen = generator_function()
# output_signature replaces the deprecated output_types/output_shapes arguments
dataset = tf.data.Dataset.from_generator(
    generator=gen,
    output_signature=(
        tf.TensorSpec(shape=(5, 3), dtype=tf.float32),  # 5 rows x 3 feature columns per sheet
        tf.TensorSpec(shape=(5,), dtype=tf.int32),      # 5 labels per sheet; (5,) is a 1-D shape
    ))
batch_size = 16
dataset = dataset.batch(batch_size, drop_remainder=True)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
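
As a quick sanity check (not in the original answer), you can pull one batch to confirm the shapes before wiring the dataset into training; with 64 sampled sheets and batch_size=16, each pass yields four batches:

for x_batch, y_batch in dataset.take(1):
    print(x_batch.shape, y_batch.shape)  # expected: (16, 5, 3) (16, 5)

# The batched dataset can then be passed straight to Keras, e.g.
# model.fit(dataset, epochs=10), rebuilding or repeating it per epoch as needed.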