Why does training time not decrease when training a Keras model after increasing the batch size beyond a certain amount?

I am currently training an NLP model in Keras with TF 2.8, where I am experimenting with adding GRU and LSTM layers. When training the model, I used different batch sizes to see the impact they had on the accuracy and the overall training time.

What I noticed was that once the batch size exceeds a certain value, the training time no longer decreases; beyond that point it stays the same.

I started with a batch size of 2 and gradually increased it up to 4096, trying multiples of two, yet beyond 512 the training time remained the same.

>Solution :

It’s often wrongly claimed that batch learning is as fast as or faster than on-line training. In fact, batch learning changes the weights only once the complete set of data (the batch) has been presented to the network, so the weight update frequency is rather low. This explains why the processing speed in your measurements behaves the way you observed.
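To make the update-frequency point concrete, here is a minimal sketch of how the number of weight updates per epoch shrinks as the batch size grows. The dataset size `N_SAMPLES` is a hypothetical value chosen for illustration; the question does not state the real one.

```python
import math


def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of weight updates (training steps) in one pass over the data."""
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(n_samples / batch_size)


N_SAMPLES = 100_000  # hypothetical dataset size, not taken from the question

for batch_size in (2, 32, 512, 4096):
    print(batch_size, updates_per_epoch(N_SAMPLES, batch_size))
```

Each doubling of the batch size roughly halves the number of updates, so an epoch only gets faster if a single step does not become proportionally slower.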

Even though it is a matrix operation, each row-column multiplication might be happening on one GPU core, so a full matrix multiplication is divided across as many cores as possible. For one matrix multiplication, each GPU core takes some time, and when you add more samples that time increases because there are more rows. If at a batch size of 4 your GPU is already at full performance capacity, i.e. all cores are running, then increasing the batch size is not going to give any advantage. The added data just sits in GPU memory and is processed when a GPU core becomes free from its previous operation.
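The saturation argument can be illustrated with a deliberately simplified toy model. This is not a measurement of any real GPU: `capacity` (how many samples the device can process at once), `wave_time`, and `overhead` are made-up parameters. The capacity of 512 is chosen only to mirror the plateau reported in the question.

```python
import math


def estimated_epoch_time(n_samples, batch_size, capacity=512,
                         wave_time=1.0, overhead=0.2):
    """Toy estimate of epoch time in arbitrary units (all parameters hypothetical)."""
    steps = math.ceil(n_samples / batch_size)
    # A batch larger than the device capacity is processed in several "waves".
    waves = math.ceil(batch_size / capacity)
    # Every step pays a fixed overhead plus one wave_time per wave.
    return steps * (overhead + waves * wave_time)


N_SAMPLES = 65_536  # hypothetical dataset size

for batch_size in (2, 64, 256, 512, 1024, 4096):
    print(batch_size, round(estimated_epoch_time(N_SAMPLES, batch_size), 1))
```

Under these assumptions the estimated time halves with each doubling of the batch size while the device still has idle capacity, then flattens out: larger batches mean fewer steps, but each step takes proportionally longer.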

To get a better understanding of these training techniques, have a look at the 2003 paper "The general inefficiency of batch training for gradient descent learning". It compares batch and on-line learning.

Also generally, RNN kernels can have O(timesteps) complexity, with batch size having a smaller effect than you might anticipate.
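As a rough illustration of the O(timesteps) point, the sketch below implements a one-unit tanh recurrence in plain Python. It is a simplified stand-in, not the GRU/LSTM kernel Keras actually uses, and the weights `w_x` and `w_h` are arbitrary. What it shows is that the loop over time steps is sequential, because each hidden state depends on the previous one, while the batch dimension only widens the work done inside each step.

```python
import math


def rnn_forward(batch, w_x=0.5, w_h=0.9):
    """Run a scalar tanh RNN over a batch of equal-length sequences.

    Returns the final hidden state per sequence and the number of
    sequential loop iterations that were needed.
    """
    timesteps = len(batch[0])
    hidden = [0.0] * len(batch)  # one hidden value per sequence in the batch
    sequential_steps = 0
    for t in range(timesteps):
        # Step t needs the hidden state from step t - 1, so steps cannot overlap.
        # Within a step, every sequence in the batch is independent.
        hidden = [math.tanh(w_x * seq[t] + w_h * h) for seq, h in zip(batch, hidden)]
        sequential_steps += 1
    return hidden, sequential_steps


small_batch = [[0.1, 0.2, 0.3, 0.4, 0.5]] * 2
large_batch = [[0.1, 0.2, 0.3, 0.4, 0.5]] * 64
print(rnn_forward(small_batch)[1], rnn_forward(large_batch)[1])
```

Both calls need the same number of sequential iterations, one per time step, regardless of how many sequences are in the batch.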
