Speeding up TFRecords feed into Keras model on CloudML for GPU























I would like to feed TFRecords into my model as fast as possible. Currently, however, my GPU (a single K80 on GCP) sits at 0% load, and training on CloudML is very slow.



I have TFRecords in GCS: train_directory = gs://bucket/train/*.tfrecord (around 100 files, 30 MB to 800 MB each), but for some reason the pipeline cannot feed data into the model fast enough to keep the GPU busy.



Interestingly, loading the data into memory as NumPy arrays and training with fit_generator() is 7x faster. There I can enable multiprocessing and multiple workers.



My current setup parses the TFRecords and builds an infinite tf.data.Dataset. Ideally, the solution would prefetch some batches into memory for the GPU to consume on demand.



import tensorflow as tf

def _parse_func(record):
    """Parses a single serialized TFRecord example."""
    # One fixed-length float feature per name in feature_list
    # (about 300 features, e.g. ['height', 'weights', 'salary']).
    keys_to_features = {
        name: tf.FixedLenFeature([TIME_STEPS], tf.float32)
        for name in feature_list
    }
    parsed = tf.parse_single_example(record, keys_to_features)
    # Reshape each feature to [TIME_STEPS, 1], then concatenate along axis 1
    # to get a [TIME_STEPS, num_features] tensor.
    t = [tf.reshape(parsed[name], [-1, 1]) for name in feature_list]
    numeric_tensor = tf.concat(values=t, axis=1)

    x = dict()
    x['numeric'] = numeric_tensor
    y = ...
    w = ...

    return x, y, w

def input_fn(file_pattern, b=BATCH_SIZE):
    """
    :param file_pattern: GCS file pattern to read from
    :param b: Batch size, defaults to BATCH_SIZE in hparams.py
    :return: An infinitely iterable tf.data.Dataset built from the TFRecords
    """
    files = tf.data.Dataset.list_files(file_pattern=file_pattern)
    # Read several files concurrently and interleave their records.
    d = files.apply(
        tf.data.experimental.parallel_interleave(
            lambda filename: tf.data.TFRecordDataset(filename),
            cycle_length=4,
            block_length=16,
            buffer_output_elements=16,
            prefetch_input_elements=16,
            sloppy=True))
    # Parse and batch in one fused step.
    d = d.apply(tf.data.experimental.map_and_batch(
        map_func=_parse_func, batch_size=b,
        num_parallel_batches=4))
    d = d.cache()
    d = d.repeat()
    d = d.prefetch(1)
    return d
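To make the "prefetch some batches into memory" idea concrete, here is a minimal conceptual sketch using only the Python standard library. It is not how tf.data is implemented; it only illustrates the producer/consumer pattern behind a bounded prefetch buffer. The names `prefetch` and `buffer_size` are chosen for this illustration.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def prefetch(batches, buffer_size=1):
    """Yield items from `batches`, preparing up to `buffer_size` ahead
    in a background thread while the consumer is busy."""
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the buffer is full
        finally:
            buf.put(_SENTINEL)  # always signal completion

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()  # blocks while the buffer is empty
        if item is _SENTINEL:
            return
        yield item


if __name__ == "__main__":
    for batch in prefetch(range(5), buffer_size=2):
        print(batch)
```

With a buffer of one element, as in `d.prefetch(1)` above, the producer can run at most one batch ahead of the consumer, so a slow producer still leaves the consumer waiting.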


Get train data



# get files from GCS bucket and load them into dataset
train_data = input_fn(train_directory, b=BATCH_SIZE)


Fit the model



model.fit(x=train_data.make_one_shot_iterator())


I am running this on CloudML, so transfers between GCS and CloudML should be pretty fast.





CloudML CPU Usage:



As we can see below, CPU usage is at 70% and memory usage does not go above 10%. So what does dataset.cache() do?



[Image: CloudML CPU and memory utilisation chart]
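For context on the `cache()` question: as the tf.data documentation describes it, `cache()` with no filename keeps elements in host memory the first time the dataset is iterated and serves later iterations from that copy, so it does not occupy GPU memory. The behaviour can be sketched in plain Python as follows. This is an illustration of the semantics only, not TensorFlow code, and `CachedSource` is a name made up for this sketch.

```python
class CachedSource:
    """Replays elements from memory after one complete pass over the source."""

    def __init__(self, make_iter):
        self._make_iter = make_iter  # callable returning a fresh iterator
        self._cache = None           # filled only after a full first pass

    def __iter__(self):
        if self._cache is not None:
            # Later passes: serve from memory, never touch the source.
            yield from self._cache
            return
        seen = []
        for item in self._make_iter():
            seen.append(item)
            yield item
        # Reached only if the first pass ran to completion.
        self._cache = seen


if __name__ == "__main__":
    reads = []

    def source():
        for i in range(3):
            reads.append(i)  # record every read from the slow source
            yield i

    cached = CachedSource(source)
    print(list(cached), list(cached), reads)
```

In this sketch the source is read only during the first pass; the second pass is served entirely from the in-memory list.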



GPU metrics in CloudML logs



As seen below, it seems that the GPU is off! GPU memory usage is also at 0 MB. Where is the cache stored?



No processes running on GPU!



[Image: GPU metrics from the CloudML logs]



Edit:



It seems that there are indeed no processes running on the GPU. I tried to place the training on the GPU explicitly:



tf.keras.backend.set_session(tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=True)))

train_data = input_fn(file_pattern=train_directory, b=BATCH_SIZE)

model = create_model()

with tf.device('/gpu:0'):
    model.fit(x=train_data.make_one_shot_iterator(),
              epochs=EPOCHS,
              steps_per_epoch=STEPS_PER_EPOCH,
              validation_data=test_data.make_one_shot_iterator(),
              validation_steps=VALIDATION_STEPS)


but everything still utilises the CPU!
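Before tuning the input pipeline, it can help to confirm whether TensorFlow sees a GPU at all. A minimal sketch, assuming the `device_lib.list_local_devices()` helper from `tensorflow.python.client` is present in the installed build; the function name `visible_gpus` is hypothetical, and the function returns `None` when TensorFlow cannot be imported.

```python
def visible_gpus():
    """Return the names of GPU devices TensorFlow can see.

    Returns None if TensorFlow is not importable in this environment.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [
        device.name
        for device in device_lib.list_local_devices()
        if device.device_type == "GPU"
    ]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed")
    elif not gpus:
        print("TensorFlow sees no GPU; ops will run on the CPU")
    else:
        print("GPUs visible to TensorFlow:", gpus)
```

An empty list means that ops will be placed on the CPU regardless of `tf.device('/gpu:0')` when `allow_soft_placement=True` is set, which would match the symptom described above.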










      python tensorflow keras google-cloud-ml














      edited Nov 12 at 18:32

























      asked Nov 9 at 18:54









      GRS

























          1 Answer













          In my case, I was using a custom setup.py file that installed a CPU-only TensorFlow build.

          I am kicking myself. If you hit the same problem, install tensorflow-gpu instead.
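A quick way to check which TensorFlow distribution actually got installed is to query the package metadata. This is a sketch under stated assumptions: it needs Python 3.8+ for `importlib.metadata`, and the helper name `installed_tensorflow_distributions` is made up here. In the TensorFlow 1.x line, the `tensorflow` pip package was the CPU-only build and `tensorflow-gpu` the GPU-enabled one.

```python
from importlib import metadata


def installed_tensorflow_distributions():
    """Map each TensorFlow pip distribution name to its installed
    version string, or to None if that distribution is absent."""
    versions = {}
    for name in ("tensorflow", "tensorflow-gpu"):
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_tensorflow_distributions().items():
        print(name, "->", version or "not installed")
```

Logging this at the start of a training job shows whether the packages pulled in by a custom setup.py replaced the GPU build that the runtime provides.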



























          • Wait. If you were running in CPU mode, training should have been bottlenecked at the back-prop stage rather than at data I/O. How did you get a 7x speedup by loading the data into memory?
            – Tay2510
            Nov 12 at 19:29






          • The 7x was on the default runtime, v1.10, which is preconfigured for GPU. When I specified TensorFlow 1.12 in setup.py, I got a CPU-only build, which is where the problem occurred. Feeding weights directly from a tf.data.Dataset is only possible in TensorFlow 1.12, which is why I had to install it. Still, GPU usage now peaks at 70%, while CPU usage is down to 30% from the previous 70%, so there is still room for improvement.
            – GRS
            Nov 12 at 20:00










          • If you are using GCP ML, they only support up to TF 1.10 today in their runtime. Which scale tier are you using? You should be using: BASIC_GPU. cloud.google.com/ml-engine/docs/tensorflow/machine-types
            – spicyramen
            Nov 19 at 23:17













          answered Nov 12 at 18:46









          GRS












