biodan (3) [Avatar] Offline
#1
I've just begun learning how to run Keras/TensorFlow examples, having purchased the hardcopy of the Chollet/Allaire book from Amazon.

After installing it on my Dell XPS laptop with a GTX 960M (640 CUDA cores) running Ubuntu 16.04, I also installed it on an Ubuntu 16.04 desktop (i7-3930K with a GeForce GTX 1080 Ti with 11 GB of GDDR5X, and 64 GB of DDR3 system memory). The execution of 25 epochs of the MNIST demo code takes 0.64 sec on the laptop but twice as long on the desktop with the 1080 Ti. Similarly, the Boston house-prices example takes 2x longer on the 1080 Ti.

Initially I thought the problem was which PCIe slot the 1080 Ti card was placed in, but this is not the reason: after three complete re-installations of Ubuntu, the 2x slower performance on the 1080 Ti has not improved. The 1080 Ti now sits in slot 0 (PCIE16_1); the only other card is my wifi card, which sits in the PCIE16_2 slot.

In googling performance concerns about TensorFlow, I've read that the input pipeline may need to be optimized, or that one may need to compile TensorFlow from source with additional nvcc options. Since I'm running the exact same R-Keras code on both machines, I suspect compiling from source may be needed. Can anyone advise on how to do this? There are no such options in the install_keras() function.

Below I've pasted the first few lines of the output from the R-Keras code; my naive conclusion is that the 1080 Ti card is not running as fast as it could be. Note: nvidia-smi output shows the "Volatile GPU-Util" on the laptop will hit 58%, but the 1080 Ti never runs over 10%. I don't know if this is a true reflection of the load.
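
For what it's worth, GPU load can also be watched programmatically rather than eyeballed; below is a minimal sketch (assuming nvidia-smi is on the PATH) that polls utilization once per second while training runs in another terminal:

import subprocess
import time

# Poll GPU utilization and memory use once per second (Ctrl-C to stop)
while True:
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=utilization.gpu,memory.used',
         '--format=csv,noheader'])
    print(out.decode().strip())
    time.sleep(1)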

Laptop with GTX 960M:
2018-04-16 20:02:06.924282: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-16 20:02:06.992276: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-16 20:02:06.992639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.0975
pciBusID: 0000:01:00.0
totalMemory: 1.96GiB freeMemory: 1.59GiB
2018-04-16 20:02:06.992653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-16 20:02:07.451438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-16 20:02:07.451460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-16 20:02:07.451485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-16 20:02:07.451681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1351 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
60000/60000 [==============================] - 2s 36us/step - loss: 0.2539 - acc: 0.9266
Epoch 2/25
60000/60000 [==============================] - 1s 23us/step - loss: 0.1053 - acc: 0.9692

Desktop with GTX 1080 Ti:
Epoch 1/25
2018-04-16 20:04:48.824036: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-16 20:04:48.824472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:02:00.0
totalMemory: 10.91GiB freeMemory: 10.39GiB
2018-04-16 20:04:48.824493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-16 20:04:49.075027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-16 20:04:49.075071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-16 20:04:49.075081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-16 20:04:49.075346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10058 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
60000/60000 [==============================] - 3s 56us/step - loss: 0.2586 - acc: 0.9253
Epoch 2/25
60000/60000 [==============================] - 3s 45us/step - loss: 0.1344 - acc: 0.9674
256385 (46) [Avatar] Offline
#2
Unfortunately I'm not familiar enough with the vagaries of these cards and their interaction with CUDA / TF to say why this might be the case.

If you do decide to compile from source, you can still use the R interface. Just make sure that you pip install keras into the same Python environment that contains your compiled version of TF. You might additionally want to avail yourself of the use_python family of functions to ensure that the correct version of TF is used; see https://tensorflow.rstudio.com/tensorflow/articles/installation.html#custom-installation
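
As a quick sanity check from the Python side, a short sketch that just prints which TensorFlow build an environment actually imports (run it inside the environment you compiled into):

import tensorflow as tf

# The module path reveals which environment's TF is being loaded
print(tf.__version__)
print(tf.__file__)
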
biodan (3) [Avatar] Offline
#3
After building a new virtual env and installing a freshly compiled TensorFlow from source (using Bazel v0.11) on the desktop (i7-3930K on a Gigabyte X79-UP4), I only gained about 1 second in speed. A larger improvement (3 sec) resulted from customizing the session with this Python code (not sure how to do this in R-Keras):

import tensorflow as tf
import keras
from keras.backend.tensorflow_backend import set_session

# Cap this process at 50% of GPU memory, and let the allocation grow as needed
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
gpu_options.allow_growth = True
gpu_options.force_gpu_compatible = True  # use pinned (page-locked) host memory
sess = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=2048,
                                        gpu_options=gpu_options))
set_session(sess)  # register the session with Keras, or the options never apply
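
To verify the options actually took effect, one quick check (a sketch, not from my benchmark runs) is to confirm that Keras picked up the session:

from keras import backend as K
print(K.get_session() is sess)  # True once set_session(sess) has been called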


I ran the Python code below 4 times for benchmarking:

  • With the default pip-installed TensorFlow 1.7: mean=39.85 sec, sd=0.17 sec

  • With the default pip-installed TensorFlow 1.7 and the customized session: mean=37.05 sec, sd=0.23 sec

  • With the freshly compiled TensorFlow 1.7 and the customized session: mean=36.225 sec, sd=0.2 sec

  • Additionally, I inserted the MSI GeForce GTX 1080 Ti ARMOR 11G into a Gigabyte Z77MX motherboard with an i7-3770 (Ivy Bridge).
    With the pip-installed TensorFlow (but no customized session): mean=34.4 sec, sd=0.32 sec

Both motherboards have Gen 3 PCIe slots, but since the slightly newer CPU of the 2nd desktop and the newish laptop (Skylake i7-6700HQ) are faster, I'm thinking that maybe it's time to upgrade my desktop's motherboard and CPU.

As a reminder, the laptop is still almost 2x faster than either of my desktops, with any version of TensorFlow or session settings (20 sec for the same code).

FYI, here is the TensorFlow/Keras Python code used for benchmarking. (While testing and re-installing Ubuntu, I switched to the Python version of Keras for testing.)

import time
import keras
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

# Flatten to 784-vectors and scale pixel values to [0, 1]
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Time training plus evaluation
t0 = time.time()
network.fit(train_images, train_labels, epochs=15, batch_size=128)
test_loss, test_acc = network.evaluate(test_images, test_labels)
t1 = time.time()
print(t1 - t0)

print('test_acc:', test_acc)
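
For the mean/sd figures above, I ran the script repeatedly; here is a sketch of how the repetition could be scripted in one go, reusing the data prepared above (build_network is an illustrative helper, not part of the original script; the model is rebuilt each pass so every run starts from fresh weights):

import time
import numpy as np
from keras import models, layers

def build_network():
    # Same two-layer network as above, rebuilt so each run starts untrained
    net = models.Sequential()
    net.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
    net.add(layers.Dense(10, activation='softmax'))
    net.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
    return net

times = []
for _ in range(4):  # four runs, as reported above
    net = build_network()
    t0 = time.time()
    net.fit(train_images, train_labels, epochs=15, batch_size=128)
    net.evaluate(test_images, test_labels)
    times.append(time.time() - t0)
print('mean=%.2f sec, sd=%.2f sec' % (np.mean(times), np.std(times)))
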
biodan (3) [Avatar] Offline
#4
Update: after building a new desktop with an i7-8700 (6 cores) on an MSI Z370M gaming motherboard, the 1080 Ti GPU is now 30-40% faster than the laptop's GTX 960M.

However, for logistical reasons, I replaced the MSI 1080 Ti GPU with a similar EVGA GPU card. The MSI card was about 25 mm taller than the EVGA card; the EVGA fit better into a new, smaller PC case (Corsair Air 240). I have not tested the EVGA GPU in the old desktop running the i7-3930K CPU on the Gigabyte UP4 motherboard inside a full-sized case. Both desktops have 64 GB of memory.