Solving the TensorFlow GPU memory leak that degrades training efficiency

Problem description:

TensorFlow runs slower and slower during training, and the problem goes away after a restart.

I was using tensorflow-gpu 1.2 and training on the GPU. When training started, each batch took very little time; as training progressed, each batch took longer and longer, yet after restarting the program everything was back to normal. Why?

Finding the problem:

At first I suspected the batch_size/batch_num handling, and I addressed that with a Python yield data generator to guarantee that exactly batch_size samples are processed in memory each step. Efficiency was still poor, however, so I searched on Google and found the solutions below.
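For reference, a fixed-size yield generator of the kind described above might look like this (a minimal sketch; `features`, `labels`, and `batch_generator` are illustrative names, not the post's actual code):

```python
import numpy as np

def batch_generator(features, labels, batch_size):
    """Yield batches of exactly batch_size samples, forever."""
    n = len(features)
    while True:
        # Reshuffle once per pass so batches differ between epochs.
        idx = np.random.permutation(n)
        # Drop the trailing remainder to guarantee a constant batch size.
        for start in range(0, n - batch_size + 1, batch_size):
            batch_idx = idx[start:start + batch_size]
            yield features[batch_idx], labels[batch_idx]
```

Because every batch has the same shape, the amount of data held in memory per step is fixed at batch_size.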

Solution:

The cause is defining TensorFlow ops inside the runtime session loop. Each iteration then adds new nodes to the graph, producing a memory leak: the program gets slower and slower until it is eventually forced to exit. To check whether your program adds nodes at runtime, call graph.finalize() before the loop to lock the graph; if an error is raised while running, that proves the slowdown comes from dynamically added nodes.
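A minimal sketch of this diagnostic (illustrative only; `tf.Graph.finalize()` exists in both TF 1.x and 2.x, and any op created on a finalized graph raises RuntimeError):

```python
import tensorflow as tf

# Build a small graph, then lock it with finalize().
g = tf.Graph()
with g.as_default():
    x = tf.constant([1.0, 2.0], name="x")
g.finalize()  # from now on, creating any new op in g raises RuntimeError

# Simulate the bug: an op created "per batch" after the graph is locked.
try:
    with g.as_default():
        tf.add(x, 1.0)  # would add a new node to the finalized graph
    grew = False
except RuntimeError:
    grew = True  # the error proves nodes were being added at runtime
```

If your training loop triggers this RuntimeError, some op (like the one_hot below) is being rebuilt every iteration.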

The code before modification is as follows:

def one_hot(labels):
    labels_num = [strnum_convert(i) for i in labels]
    batch_size = tf.size(labels_num)
    labels = tf.expand_dims(labels_num, 1)
    indices = tf.expand_dims(tf.range(0, batch_size, 1), 1)
    concated = tf.concat([indices, labels], 1)
    onehot_labels = tf.sparse_to_dense(concated, tf.stack([batch_size, 8]), 1, 0)
    # all_hot_labels = tf.reshape(onehot_labels, (1, 612))
    return onehot_labels

The modified code is as follows:

def one_hot(labels):
    one_hot_label = np.array([int(i == int(labels)) for i in range(8)])
    ... ...
    return one_hot_label

As you can see, the culprit is the TensorFlow version of the one_hot operation: it is called once per batch, so it adds new graph nodes on every iteration. Rewriting it in NumPy completely solves the efficiency problem.
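A self-contained NumPy version of this idea (a sketch, not the author's exact code, which is partially elided above; it assumes 8 classes and an integer-convertible label):

```python
import numpy as np

def one_hot(label, num_classes=8):
    # Pure NumPy: no TensorFlow ops are created, so nothing is added
    # to the graph per call. num_classes=8 matches the post's setup.
    vec = np.zeros(num_classes, dtype=np.int64)
    vec[int(label)] = 1
    return vec
```

The encoded array can then be fed to the model through feed_dict like any other NumPy input, keeping the graph static.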

Further thoughts:

Method two:

The root cause of the problem above is the GPU memory leak, so we can also take a workaround: every 1000 batches, once the slowdown becomes noticeable, save the parameters, reset the graph with tf.reset_default_graph(), rebuild the model with self.build_model(), and then reload the saved parameters.
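A sketch of that save/reset/rebuild/restore cycle, written against the TF1-compatible API (tf.compat.v1, so it also runs on modern TensorFlow; build_model, CKPT, and the trivial one-variable model are hypothetical stand-ins for the post's self.build_model() and real checkpoint path):

```python
import os
import tempfile
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()  # TF1-style graph mode, as in the post

CKPT = os.path.join(tempfile.gettempdir(), "leak_demo.ckpt")  # hypothetical path

def build_model():
    # Stand-in for the post's self.build_model(): a single 8-element weight.
    return tf1.get_variable("w", shape=[8], initializer=tf1.zeros_initializer())

def train(num_batches, reset_every=1000):
    w = build_model()
    saver = tf1.train.Saver()
    sess = tf1.Session()
    sess.run(tf1.global_variables_initializer())
    for batch in range(1, num_batches + 1):
        # ... run one normal training step here ...
        if batch % reset_every == 0:
            saver.save(sess, CKPT)        # 1. save the current parameters
            sess.close()
            tf1.reset_default_graph()     # 2. throw away the bloated graph
            w = build_model()             # 3. rebuild the model from scratch
            saver = tf1.train.Saver()
            sess = tf1.Session()
            saver.restore(sess, CKPT)     # 4. reload the saved parameters
    value = sess.run(w)
    sess.close()
    return value
```

This does not fix the leak itself, but it caps how large the graph can grow before being discarded.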

Method three:

When we were building a dataset with TensorFlow, we found that the program also runs slower and slower when eval() is called repeatedly: the values produced by eval() are not released and occupy more and more memory. The solution is to free them with the del statement, typically written as:

data = Var.eval()
# save data to file
del data