
Why does post-padding train faster than pre-padding?

I have been doing some NLP categorisation tasks and noticed that my models train much faster with post-padding than with pre-padding, and I was wondering why that is the case.

I am using Google Colab to train these models with the GPU runtime. Here is my preprocessing code:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import OneHotEncoder

PADDING = 'post'

# Tokenising the input strings and padding

tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X)
X_tokenized = tokenizer.texts_to_sequences(X)
X_padded = pad_sequences(X_tokenized, maxlen=80, truncating='post', padding=PADDING)
X_train = np.array(X_padded)

# Encoding output one

y1 = y1.to_numpy().reshape(-1, 1)   # Reshape to an array of features
encoder_1 = OneHotEncoder()         # Instantiate encoder
y1 = encoder_1.fit_transform(y1)    # Fit encoder to output 
y1 = y1.toarray()                   # Make output a numpy array

# Encoding output two
    
y2 = y2.to_numpy().reshape(-1, 1)
encoder_2 = OneHotEncoder()
y2 = encoder_2.fit_transform(y2)
y2 = y2.toarray()
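To make the difference between the two padding modes concrete, here is a minimal pure-Python sketch of what `pad_sequences` does for `padding='post'` versus `padding='pre'` (the `pad` helper below is hypothetical, not the Keras API):

```python
# Hypothetical helper mimicking pad_sequences for a single sequence.
def pad(seq, maxlen, padding='post', value=0):
    seq = seq[:maxlen]                  # truncate to maxlen ('post' truncation)
    pad_len = maxlen - len(seq)
    if padding == 'post':
        return seq + [value] * pad_len  # zeros appended after the real tokens
    return [value] * pad_len + seq      # zeros prepended before the real tokens

print(pad([5, 3, 8], 6, padding='post'))  # [5, 3, 8, 0, 0, 0]
print(pad([5, 3, 8], 6, padding='pre'))   # [0, 0, 0, 5, 3, 8]
```

With `mask_zero=True` in the Embedding layer, those zeros are what the mask is computed from, so their position matters downstream.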

Now to create my model:


# --- MODEL PARAMETERS ---

vocab_size = len(tokenizer.index_word) + 1
y1_size = len(encoder_1.categories_[0])
y2_size = len(encoder_2.categories_[0])

embedding_size = 175
units = 96

# --- MODEL ARCHITECTURE ---

from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(None,))
input_embeddings = Embedding(vocab_size, embedding_size, mask_zero=True)(inputs)

shared_lstm = Bidirectional(LSTM(units, return_sequences=True, 
                                 dropout=0.3))(input_embeddings)

y1_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y1_dense = Dense(y1_size, activation='softmax', name='y1')(y1_lstm)

y2_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y2_dense = Dense(y2_size, activation='softmax', name='y2')(y2_lstm)

split_shared_model = Model(inputs=inputs, outputs=[y1_dense, y2_dense])

Which is then compiled as:

from tensorflow.keras.losses import CategoricalCrossentropy

split_shared_model.compile(
    optimizer='adam', 
    loss=CategoricalCrossentropy(), 
    metrics=['accuracy']
)

The summary of the model is as follows:

__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_4 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_3 (Embedding)        (None, None, 175)    19075       ['input_4[0][0]']                
                                                                                                  
 bidirectional_8 (Bidirectional  (None, None, 192)   208896      ['embedding_3[0][0]']            
 )                                                                                                
                                                                                                  
 bidirectional_9 (Bidirectional  (None, 192)         221952      ['bidirectional_8[0][0]']        
 )                                                                                                
                                                                                                  
 bidirectional_10 (Bidirectiona  (None, 192)         221952      ['bidirectional_8[0][0]']        
 l)                                                                                               
                                                                                                  
 y1 (Dense)                     (None, 912)          176016      ['bidirectional_9[0][0]']        
                                                                                                  
 y2 (Dense)                     (None, 617)          119081      ['bidirectional_10[0][0]']       
                                                                                                  
==================================================================================================
Total params: 966,972
Trainable params: 966,972
Non-trainable params: 0
__________________________________________________________________________________________________

After calling the fit() method the model starts training. Below is an intermediate result with the above settings (post-padding):

Epoch 1/50
 398/2647 [===>..........................] - ETA: 1:28 - loss: 8.7918 - y1_loss: 4.9236 - y2_loss: 3.8682 - y1_accuracy: 0.1495 - y2_accuracy: 0.3204
---------------------------------------------------------------------------

However, if I change PADDING to 'pre' I find that training is much slower!

Epoch 1/50
  90/2647 [>.............................] - ETA: 45:52 - loss: 9.8153 - y1_loss: 5.3961 - y2_loss: 4.4192 - y1_accuracy: 0.1243 - y2_accuracy: 0.2788

Can anyone explain why this is? I think it might have something to do with the Embedding layer and its masking, but I am not sure.

> Solution:

This is related to the underlying LSTM implementation. There are in fact two: a "native TensorFlow" one and a highly optimised cuDNN kernel, which is MUCH faster. However, the latter can only be used under specific conditions (certain parameter settings etc.). You can find the details in the Keras LSTM docs. The main point here is this requirement:

"Inputs, if use masking, are strictly right-padded."

This implies that the pre-padded version cannot use the efficient implementation, which explains the much slower runtime. I don't think there is a reasonable workaround here other than sticking with post-padding.
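The "strictly right-padded" condition can be sketched as a simple check: a padded row qualifies only if all real tokens come first and all padding values come last (a hedged illustration of the condition, not the actual TensorFlow check):

```python
# Returns True if every real token precedes every pad token in the row,
# i.e. the row is right-padded (post-padded).
def is_right_padded(row, pad_value=0):
    seen_pad = False
    for token in row:
        if token == pad_value:
            seen_pad = True
        elif seen_pad:  # a real token after a pad token -> not right-padded
            return False
    return True

print(is_right_padded([5, 3, 8, 0, 0]))  # True  -> eligible for the fast kernel
print(is_right_padded([0, 0, 5, 3, 8]))  # False -> falls back to the slow path
```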

Note that TensorFlow sometimes prints a warning that it had to fall back to the inefficient implementation, but in my experience this warning does not appear reliably. Still, keep an eye out for any additional warning output in the pre-padding case.
