
Why does post-padding train faster than pre-padding?

I have been doing some NLP categorisation tasks and noticed that my models train much faster with post-padding than with pre-padding, and I was wondering why that is the case.

I am using Google Colab to train these models with the GPU runtime. Here is my preprocessing code:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import OneHotEncoder

PADDING = 'post'

# Tokenising the input strings and padding

tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X)
X_tokenized = tokenizer.texts_to_sequences(X)
X_padded = pad_sequences(X_tokenized, maxlen=80, truncating='post', padding=PADDING)
X_train = np.array(X_padded)

# Encoding output one

y1 = y1.to_numpy().reshape(-1, 1)   # Reshape to an array of features
encoder_1 = OneHotEncoder()         # Instantiate encoder
y1 = encoder_1.fit_transform(y1)    # Fit encoder to output 
y1 = y1.toarray()                   # Make output a numpy array

# Encoding output two
    
y2 = y2.to_numpy().reshape(-1, 1)
encoder_2 = OneHotEncoder()
y2 = encoder_2.fit_transform(y2)
y2 = y2.toarray()
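To make the difference between the two padding modes concrete, here is a minimal pure-Python sketch of what `pad_sequences` does for `padding='post'` versus `padding='pre'` (the `pad` helper below is hypothetical, not the Keras API):

```python
# Hypothetical helper mimicking pad_sequences for a single sequence.
def pad(seq, maxlen, padding='post', value=0):
    seq = seq[:maxlen]                  # truncate to maxlen ('post' truncation)
    pad_len = maxlen - len(seq)
    if padding == 'post':
        return seq + [value] * pad_len  # zeros appended after the real tokens
    return [value] * pad_len + seq      # zeros prepended before the real tokens

print(pad([5, 3, 8], 6, padding='post'))  # [5, 3, 8, 0, 0, 0]
print(pad([5, 3, 8], 6, padding='pre'))   # [0, 0, 0, 5, 3, 8]
```

With `mask_zero=True` in the Embedding layer, those zeros are what the mask is computed from, so their position matters downstream.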

Now to create my model:


# --- MODEL PARAMETERS ---

vocab_size = len(tokenizer.index_word) + 1
y1_size = len(encoder_1.categories_[0])
y2_size = len(encoder_2.categories_[0])

embedding_size = 175
units = 96

# --- MODEL ARCHITECTURE ---

from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(None,))
input_embeddings = Embedding(vocab_size, embedding_size, mask_zero=True)(inputs)

shared_lstm = Bidirectional(LSTM(units, return_sequences=True, 
                                 dropout=0.3))(input_embeddings)

y1_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y1_dense = Dense(y1_size, activation='softmax', name='y1')(y1_lstm)

y2_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y2_dense = Dense(y2_size, activation='softmax', name='y2')(y2_lstm)

split_shared_model = Model(inputs=inputs, outputs=[y1_dense, y2_dense])

Which is then compiled as:

from tensorflow.keras.losses import CategoricalCrossentropy

split_shared_model.compile(
    optimizer='adam', 
    loss=CategoricalCrossentropy(), 
    metrics=['accuracy']
)

The summary of the model is as follows:

__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_4 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_3 (Embedding)        (None, None, 175)    19075       ['input_4[0][0]']                
                                                                                                  
 bidirectional_8 (Bidirectional  (None, None, 192)   208896      ['embedding_3[0][0]']            
 )                                                                                                
                                                                                                  
 bidirectional_9 (Bidirectional  (None, 192)         221952      ['bidirectional_8[0][0]']        
 )                                                                                                
                                                                                                  
 bidirectional_10 (Bidirectiona  (None, 192)         221952      ['bidirectional_8[0][0]']        
 l)                                                                                               
                                                                                                  
 y1 (Dense)                     (None, 912)          176016      ['bidirectional_9[0][0]']        
                                                                                                  
 y2 (Dense)                     (None, 617)          119081      ['bidirectional_10[0][0]']       
                                                                                                  
==================================================================================================
Total params: 966,972
Trainable params: 966,972
Non-trainable params: 0
__________________________________________________________________________________________________

After calling the fit() method the model starts training. Below is an intermediate result with the above settings (post-padding):

Epoch 1/50
 398/2647 [===>..........................] - ETA: 1:28 - loss: 8.7918 - y1_loss: 4.9236 - y2_loss: 3.8682 - y1_accuracy: 0.1495 - y2_accuracy: 0.3204
---------------------------------------------------------------------------

However, if I change PADDING to 'pre' I find that training is much slower!

Epoch 1/50
  90/2647 [>.............................] - ETA: 45:52 - loss: 9.8153 - y1_loss: 5.3961 - y2_loss: 4.4192 - y1_accuracy: 0.1243 - y2_accuracy: 0.2788

Can anyone explain why this is? I think it might have something to do with the Embedding layer and its masking, but I am not sure.

> Solution:

This is related to the underlying LSTM implementation. There are in fact two: a "native TensorFlow" one and a highly optimised cuDNN kernel, which is MUCH faster. However, the latter can only be used under specific conditions (certain parameter settings etc.). You can find the details in the Keras LSTM docs. The main point here is this requirement:

"Inputs, if use masking, are strictly right-padded."

This implies that the pre-padded version cannot use the efficient implementation, which explains the much slower runtime. I don't think there is a reasonable workaround here other than sticking with post-padding.
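The "strictly right-padded" condition can be sketched as a simple check: a padded row qualifies only if all real tokens come first and all padding values come last (a hedged illustration of the condition, not the actual TensorFlow check):

```python
# Returns True if every real token precedes every pad token in the row,
# i.e. the row is right-padded (post-padded).
def is_right_padded(row, pad_value=0):
    seen_pad = False
    for token in row:
        if token == pad_value:
            seen_pad = True
        elif seen_pad:  # a real token after a pad token -> not right-padded
            return False
    return True

print(is_right_padded([5, 3, 8, 0, 0]))  # True  -> eligible for the fast kernel
print(is_right_padded([0, 0, 5, 3, 8]))  # False -> falls back to the slow path
```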

Note that TensorFlow sometimes prints a warning that it had to fall back to the inefficient implementation, but in my experience this warning does not appear reliably. Still, keep an eye out for any additional warning output in the pre-padding case.
