Massoud Mazar


GPU assisted Machine Learning: Benchmark

A recent project at work, involving binary classification with a Keras LSTM layer of 1000 nodes, took almost an hour to run and prompted my effort to speed up this type of problem. In my previous post, I explained the hardware and software configuration I'm using for this benchmark. Now I'm going to run the same training exercise with and without a GPU and compare the runtimes.

The Data

The dataset I'm using contains about 1300 cases. For each case, I have 100 data points in the form of a time series, and based on these data points we need to perform a binary classification.
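
The loading code isn't shown here; as a point of reference, this is a minimal sketch of how the input arrays could be shaped, assuming one feature per time step (the random data and the exact shapes are purely illustrative):

import numpy as np

# Hypothetical placeholder data matching the description above:
# ~1300 cases, 100 time steps per case, 1 feature per step, binary labels.
n_cases, n_steps, n_features = 1300, 100, 1
trainX = np.random.rand(n_cases, n_steps, n_features).astype('float32')
trainY = np.random.randint(0, 2, size=(n_cases,)).astype('float32')
print(trainX.shape, trainY.shape)  # (1300, 100, 1) (1300,)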

The Model

For the sake of simplicity, I'm using only one LSTM layer with 1000 nodes and using its output for binary classification:

# assumed imports (not shown in the original post); with tf.keras, use the tensorflow.keras equivalents
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.optimizers import Adam

def largeLayer_model():
    # create model
    model = Sequential()
    model.add(LSTM(1000, return_sequences=False, input_shape=(trainX.shape[1], trainX.shape[2])))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])
    return model

Training is done with 100 epochs and a batch size of 100:

import time

start = time.time()
model = largeLayer_model()
summary = model.fit(trainX, trainY, epochs=100, batch_size=100, verbose=1)
end = time.time()
print("Elapsed {} sec".format(end - start))

With GTX 1070 Ti GPU
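
Since the following runs compare GPU and CPU timings, it's worth confirming first that TensorFlow actually sees the GPU. This quick check is my addition and assumes a TF 2.x install; older versions provide tf.test.is_gpu_available() instead:

import tensorflow as tf

# Should list at least one GPU device if the CUDA/cuDNN setup is correct.
print(tf.config.list_physical_devices('GPU'))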

Since my machine was already configured to use the GPU, the results of the first run show the execution time with GPU acceleration:

Epoch 1/100
938/938 [==============================] - 3s 3ms/step - loss: 0.6386 - acc: 0.6365
Epoch 2/100
938/938 [==============================] - 2s 2ms/step - loss: 0.6292 - acc: 0.6748
Epoch 3/100
938/938 [==============================] - 1s 2ms/step - loss: 0.6215 - acc: 0.6780
.
.
.
Epoch 98/100
938/938 [==============================] - 1s 2ms/step - loss: 0.5791 - acc: 0.7015
Epoch 99/100
938/938 [==============================] - 1s 2ms/step - loss: 0.5719 - acc: 0.6919
Epoch 100/100
938/938 [==============================] - 1s 2ms/step - loss: 0.5733 - acc: 0.6791
Elapsed 150.32436347007751 sec

With Titan RTX GPU

Later on, I upgraded my GPU to a Titan RTX. The following is the exact same test as above, which runs in 34 seconds (vs. 150 seconds with the 1070 Ti):

Epoch 1/100
937/937 [==============================] - 2s 2ms/sample - loss: 0.6498 - accuracy: 0.6617
Epoch 2/100
937/937 [==============================] - 0s 331us/sample - loss: 0.6713 - accuracy: 0.6734
Epoch 3/100
937/937 [==============================] - 0s 332us/sample - loss: 0.6287 - accuracy: 0.6681
.
.
.
Epoch 98/100
937/937 [==============================] - 0s 340us/sample - loss: 0.5920 - accuracy: 0.6969
Epoch 99/100
937/937 [==============================] - 0s 339us/sample - loss: 0.5807 - accuracy: 0.7012
Epoch 100/100
937/937 [==============================] - 0s 339us/sample - loss: 0.5799 - accuracy: 0.7044
Elapsed 34.02235388755798 sec

Titan RTX with Mixed Precision

With TensorFlow 2.4, it is easy to benefit from mixed precision floating point optimization. We just need to update the model to use a mixed precision policy:

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import mixed_precision

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

def largeLayer_model():
    # create model
    model = keras.models.Sequential()
    model.add(layers.LSTM(1000, return_sequences=False, input_shape=(trainX.shape[1], trainX.shape[2])))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.add(layers.Activation('linear', dtype='float32'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=0.001), metrics=['accuracy'])
    return model

Note how an extra linear activation is added at the end to convert the output back to float32. See the TensorFlow Mixed Precision guide for more details.
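
If you want to verify that the policy is actually in effect, you can inspect the layers' dtype settings. This quick check is my addition, assuming TF 2.4's Keras layer properties:

# Assuming `model` was created by largeLayer_model() under the mixed_float16 policy:
# computations run in float16 while the variables (weights) stay in float32.
print(model.layers[0].compute_dtype)   # expected: float16
print(model.layers[0].variable_dtype)  # expected: float32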

Using mixed precision resulted in roughly a 2x speedup: around 16 seconds vs. 34 seconds on the same GPU.

CPU only

To run the same test without GPU assistance, I uninstalled TensorFlow-GPU and installed the CPU-only version:

pip3 uninstall tensorflow-gpu
pip3 install tensorflow
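
As a side note (not something the original test did), you can also hide the GPU from TensorFlow without reinstalling anything by setting an environment variable before TensorFlow is imported:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide all GPUs from TensorFlow

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))  # expected: []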

As expected, execution times are much higher:

Epoch 1/100
938/938 [==============================] - 25s 27ms/step - loss: 0.6370 - acc: 0.6546
Epoch 2/100
938/938 [==============================] - 25s 26ms/step - loss: 0.6141 - acc: 0.6684
Epoch 3/100
938/938 [==============================] - 24s 26ms/step - loss: 0.6063 - acc: 0.6834
.
.
.
Epoch 98/100
938/938 [==============================] - 24s 26ms/step - loss: 0.5743 - acc: 0.6940
Epoch 99/100
938/938 [==============================] - 24s 26ms/step - loss: 0.5627 - acc: 0.7058
Epoch 100/100
938/938 [==============================] - 24s 26ms/step - loss: 0.5703 - acc: 0.7004
Elapsed 2449.41867685318 sec

Conclusion

With this specific software and hardware configuration, GPU-assisted training was about 16x faster than the CPU-only option. The GPU upgrade (a GeForce GTX 1070 Ti plus a 650W power supply) cost me around $550, which was well worth it.

On the other hand, upgrading from the 1070 Ti to a Titan RTX (which costs $2,500) may not be the best bang for your buck, offering only a 4.4x performance improvement over the 1070 Ti. A 1080 Ti (or a comparable RTX GPU) would make more financial sense.
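
For reference, the speedup figures quoted above follow directly from the elapsed times reported in each run:

# Back-of-the-envelope speedups from the elapsed times above.
cpu_only   = 2449.4   # seconds
gtx_1070ti = 150.3    # seconds
titan_rtx  = 34.0     # seconds
titan_fp16 = 16.0     # seconds (approximate)

print(round(cpu_only / gtx_1070ti, 1))    # ~16.3x: 1070 Ti vs. CPU
print(round(gtx_1070ti / titan_rtx, 1))   # ~4.4x: Titan RTX vs. 1070 Ti
print(round(titan_rtx / titan_fp16, 1))   # ~2.1x: mixed precision vs. float32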
