Tag Archives: tensorflow

Solving the TensorFlow GPU memory leak that makes training slower and slower

Problem Description:

TensorFlow runs slower and slower during training, and recovers after a restart.

I used TensorFlow-GPU 1.2, running on the GPU. When training started, each batch took very little time; as training progressed, each batch took longer and longer, yet after restarting everything was normal again. Why?

Problem diagnosis:

At first I suspected the batch_size and batch_num handling, and switched to a Python generator (yield) so that each step processed exactly batch_size samples. But efficiency was still poor, so I searched on Google and found the solution below.

Solution:

The problem is caused by defining tf ops inside the session at run time. Each iteration then adds new nodes to the graph, leaking memory; the program gets slower and slower and is eventually forced to exit. To check whether your program adds nodes at run time, call graph.finalize() inside the session to lock the graph: if running afterwards raises an error, that proves the slowdown comes from dynamically added nodes.
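For instance, here is a minimal sketch of that finalize() check (hypothetical ops, not the original training code). If any line inside the loop creates a new op, TensorFlow raises "RuntimeError: Graph is finalized and cannot be modified":

import tensorflow as tf

# All ops are defined BEFORE the training loop.
x = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([1]))
loss = tf.reduce_mean(tf.square(x * w))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.graph.finalize()  # lock the graph; adding nodes now raises RuntimeError
    for step in range(100):
        # Any graph-building call here (e.g. tf.expand_dims) would now fail loudly.
        sess.run(train_op, feed_dict={x: [[1.0]]})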

The code before modification is as follows:

def one_hot(labels):
    # Each call builds NEW graph nodes (expand_dims, concat, sparse_to_dense, ...),
    # so calling this once per batch grows the graph without bound.
    labels_num = [strnum_convert(i) for i in labels]
    batch_size = tf.size(labels_num)
    labels = tf.expand_dims(labels_num, 1)
    indices = tf.expand_dims(tf.range(0, batch_size, 1), 1)
    concated = tf.concat([indices, labels], 1)
    onehot_labels = tf.sparse_to_dense(concated, tf.stack([batch_size, 8]), 1, 0)
    # all_hot_labels = tf.reshape(onehot_labels, (1, 612))
    return onehot_labels

The modified code is as follows:

def one_hot(labels):
    # Pure numpy: no graph nodes are created, no matter how often it is called.
    one_hot_label = np.array([int(i == int(labels)) for i in range(8)])
    ...
    return one_hot_label

As you can see, the culprit was the tf version of this one_hot operation; rewriting it in numpy completely solves the efficiency problem.
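As an aside (not part of the original fix), TensorFlow also has a built-in tf.one_hot op. Either version is fine as long as the op is defined once, outside the training loop, rather than per batch. A minimal sketch:

import tensorflow as tf

# Define the one-hot op ONCE, fed through a placeholder, and reuse it every batch.
labels_ph = tf.placeholder(tf.int32, shape=[None])
onehot_op = tf.one_hot(labels_ph, depth=8)

with tf.Session() as sess:
    # Repeated calls add no new nodes to the graph.
    print(sess.run(onehot_op, feed_dict={labels_ph: [0, 3, 7]}))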

Further thoughts:

Method Two:

Since the root cause of the problem above is a GPU memory leak from graph growth, we can also take a workaround: every 1000 batches or so, when the speed has dropped noticeably, save the parameters, reset the graph with tf.reset_default_graph(), rebuild the model with self.build_model(), and then load the saved parameters back.
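A rough sketch of this workaround (hypothetical helper and attribute names, except build_model, which comes from the snippet above):

import tensorflow as tf

def reset_and_reload(model, sess, ckpt_path='ckpt/model.ckpt'):
    # Save parameters, throw away the bloated graph, rebuild, and reload.
    model.saver.save(sess, ckpt_path)
    sess.close()
    tf.reset_default_graph()
    model.build_model()                 # recreate all ops in a fresh, small graph
    model.saver = tf.train.Saver()
    sess = tf.Session()
    model.saver.restore(sess, ckpt_path)
    return sess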

Method Three:

When building a dataset with TensorFlow, I found that the program ran slower and slower whenever I called the eval() function: the values produced by eval() are not freed and keep occupying more memory. The solution is simply to use the del statement, usually written as:

data = Var.eval()
# ... save data to file ...
del data

 

Saving and restoring models with tf.train.Saver() in TensorFlow

Trained model parameters are saved so they can be verified or tested later. The tf.train.Saver() module provides model saving in TensorFlow.

1. Saving

To save a model, first create a Saver object, for example:

saver=tf.train.Saver()

When creating this Saver object, one parameter is used very often: max_to_keep, which sets how many models to keep. The default is 5, i.e. max_to_keep=5 keeps the latest 5 checkpoints. If you want to keep the model from every epoch, set max_to_keep to None or 0, for example:

saver=tf.train.Saver(max_to_keep=0)

However, beyond taking up more disk space this has little practical use, so it is not recommended.

Of course, if you only want to keep the last model, just set max_to_keep to 1, that is:

saver=tf.train.Saver(max_to_keep=1)

After creating the saver object, you can save the trained model, such as:

saver.save(sess,'ckpt/mnist.ckpt',global_step=step)

The first parameter, sess, needs no explanation. The second sets the save path and name, and the third appends the training step count to the model file name:

saver.save(sess, 'my-model', global_step=0)    ==> filename: 'my-model-0'

saver.save(sess, 'my-model', global_step=1000) ==> filename: 'my-model-1000'

2. Examples

import tensorflow as tf
import numpy as np

x = tf.placeholder(tf.float32, shape=[None, 1])
y = 4 * x + 4
w = tf.Variable(tf.random_normal([1], -1, 1))
b = tf.Variable(tf.zeros([1]))
y_predict = w * x + b
loss = tf.reduce_mean(tf.square(y - y_predict))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

isTrain = False          # True: train and save; False: restore and print
train_steps = 100
checkpoint_steps = 50
checkpoint_dir = ''

saver = tf.train.Saver()  # defaults to saving all variables - in this case w and b
x_data = np.reshape(np.random.rand(10).astype(np.float32), (10, 1))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    if isTrain:
        for i in range(train_steps):
            sess.run(train, feed_dict={x: x_data})
            if (i + 1) % checkpoint_steps == 0:
                saver.save(sess, checkpoint_dir + 'model.ckpt', global_step=i + 1)
    else:
        ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)
        print(sess.run(w))
        print(sess.run(b))

3. Recovery

Use the saver.restore() method to recover variables:

saver.restore(sess, ckpt.model_checkpoint_path)

sess: the current session; the previously saved variables are loaded into this session.

ckpt.model_checkpoint_path: the location of the saved model. You do not need to supply the model's file name yourself; tf.train.get_checkpoint_state reads the checkpoint file to find the newest checkpoint and its name.
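Putting it together, a minimal sketch (assuming the checkpoints were saved into 'ckpt/'):

# Read the 'checkpoint' bookkeeping file and restore the newest checkpoint.
ckpt = tf.train.get_checkpoint_state('ckpt/')
if ckpt and ckpt.model_checkpoint_path:
    saver.restore(sess, ckpt.model_checkpoint_path)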

Reprinted from:

[1] https://www.cnblogs.com/denny402/p/6940134.html

[2] https://blog.csdn.net/u011500062/article/details/51728830

[3] https://www.cnblogs.com/chamie/p/8780508.html

Tensorflow gradients TypeError: Fetch argument None has invalid type

During back propagation, a neural network computes the partial derivative of the loss with respect to each learnable parameter. That value is the gradient, which is multiplied by the learning rate to update the parameter. In TensorFlow this is done with the gradients function.

The function prototype and parameters from the official documentation are as follows:

tf.gradients(
    ys,
    xs,
    grad_ys=None,
    name='gradients',
    colocate_gradients_with_ops=False,
    gate_gradients=False,
    aggregation_method=None,
    stop_gradients=None,
    unconnected_gradients=tf.UnconnectedGradients.NONE
)

ys and xs are tensors or lists of tensors. tf.gradients differentiates ys with respect to xs; the return value is a list of the same length as xs.

Let's look at the usage through an example (this is the example given in Teacher Li Jinhong's book):

import tensorflow as tf

w1 = tf.Variable([[1, 2]])   # note: integer tensors
w2 = tf.Variable([[3, 4]])

y = tf.matmul(w1, [[9], [10]])
grads = tf.gradients(y, [w1])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    gradval = sess.run(grads)
    print(gradval)

Running this code will report an error as follows:

TypeError: Fetch argument None has invalid type <class 'NoneType'>

The reason is that w1 was defined as an integer tensor. TensorFlow does not register gradients for integer ops, so tf.gradients returns None, and fetching None in sess.run raises the TypeError. Gradients in TensorFlow are generally float32, so we modify the code to change the integer tensors to floating point:

import tensorflow as tf
w1 = tf.Variable([[1.,2.]])
w2 = tf.Variable([[3.,4.]])

y = tf.matmul(w1, [[9.],[10.]])
grads = tf.gradients(y,[w1])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    gradval = sess.run(grads)
    print(gradval)

The output results are as follows

[array([[ 9., 10.]], dtype=float32)]

In the example above, since y = w1 × [[9], [10]], the derivative of y with respect to w1 is [[9, 10]] (i.e. the slope).

Note: if xs contains a variable that does not appear in the formula for ys, the system reports the same error, for example grads = tf.gradients(y, [w1, w2]).
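If your TensorFlow is new enough to have the unconnected_gradients parameter shown in the prototype above (1.13+), you can request zeros instead of None for such unconnected variables. A minimal sketch reusing w1, w2, and y from the floating-point example:

grads = tf.gradients(y, [w1, w2],
                     unconnected_gradients=tf.UnconnectedGradients.ZERO)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # y does not depend on w2, so its gradient comes back as zeros, not None.
    print(sess.run(grads))  # [array([[ 9., 10.]], ...), array([[0., 0.]], ...)]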

TensorFlow {TypeError} unhashable type: 'numpy.ndarray'

In my experiment I use feed_dict to fill in the data. The code around the session is as follows:

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    for epoch in range(a.epochs):
        # BUG: this reassigns the names `input` and `target`, shadowing the
        # placeholders defined outside the session.
        input, target = load_batch_data(batch_size=16, a=a)
        batch_input = input.astype(np.float32)
        batch_target = target.astype(np.float32)
        sess.run(predict_real, feed_dict={input: batch_input, target: batch_target})

When running it: {TypeError} unhashable type: 'numpy.ndarray'

Later I found the cause. Outside the session, input and target were defined as placeholders:

input = tf.placeholder(dtype=tf.float32, shape=[None, image_size, image_size, num_channels])
target = tf.placeholder(dtype=tf.float32, shape=[None, image_size, image_size, num_channels])

However, I then redefined input and target after opening the session. As a result, when I ran the following line of code,

sess.run(predict_real, feed_dict={input: batch_input, target: batch_target})

an error like {TypeError} unhashable type: 'numpy.ndarray' occurred, because this input and target are no longer the placeholders defined outside the session but numpy arrays, and numpy arrays cannot be used as feed_dict keys. Once you know the reason, the fix is easy: just rename input and target inside the session, as follows:

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    if a.mode == 'train':
        for epoch in range(a.epochs):
            # renamed: batch_input/batch_target no longer shadow the placeholders
            batch_input, batch_target = load_batch_data(a=a)
            batch_input = batch_input.astype(np.float32)
            batch_target = batch_target.astype(np.float32)
            sess.run(model, feed_dict={input: batch_input, target: batch_target})
            print('epoch' + str(epoch) + ':')
        saver.save(sess, 'model_parameter/train.ckpt')
        print('training finished!!!')
    elif a.mode == 'test':
        # test
        ckpt = tf.train.latest_checkpoint(a.checkpoint)
        saver.restore(sess, ckpt)
        # Get the image at test time and add the label
        batch_input, _ = load_batch_data(a=a)
        # batch_input = batch_input / 255.
        batch_input = batch_input.astype(np.float32)
        generator_output = sess.run(test_output, feed_dict={input: batch_input})
        # Process the result; subtract 3 from the image channels to get the RGB image
        result = process_generator_output(generator_output)
        if result:
            print('Done!')
    else:
        print('the MODE is not available...')

Problems and solutions when running TensorFlow

========== RESTART: D:/pythonwork/tensorflow/example/5/nonlinear.py ==========
Traceback (most recent call last):
  File "D:/pythonwork/tensorflow/example/5/nonlinear.py", line 6, in <module>
    from sklearn import datasets, cross_validation, metrics
ImportError: cannot import name 'cross_validation'

# sklearn deprecated the cross_validation module in 0.18 (and later removed it), so the code below is changed to use model_selection
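A minimal sketch of the fix (assuming X and y are already defined in the script):

# sklearn.cross_validation was removed; model_selection is its replacement.
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25)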

>>>
========== RESTART: D:/pythonwork/tensorflow/example/5/nonlinear.py ==========
Traceback (most recent call last):
  File "D:/pythonwork/tensorflow/example/5/nonlinear.py", line 13, in <module>
    from keras.models import Sequential
ModuleNotFoundError: No module named 'keras'
>>>
========== RESTART: D:/pythonwork/tensorflow/example/5/nonlinear.py ==========
Using TensorFlow backend.
Traceback (most recent call last):
  File "D:/pythonwork/tensorflow/example/5/nonlinear.py", line 37, in <module>
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y,
NameError: name 'cross_validation' is not defined
>>>
========== RESTART: D:/pythonwork/tensorflow/example/5/nonlinear.py ==========
Using TensorFlow backend.

Warning (from warnings module):
  File "D:\Python36\lib\site-packages\sklearn\preprocessing\data.py", line 625
    return self.partial_fit(X, y)
DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

Warning (from warnings module):
  File "D:\Python36\lib\site-packages\sklearn\base.py", line 462
    return self.fit(X, **fit_params).transform(X)
DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

Warning (from warnings module):
  File "D:/pythonwork/tensorflow/example/5/nonlinear.py", line 52
    model.add(Dense(10, input_dim=7, init='normal', activation='relu'))
UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(10, input_dim=7, activation="relu", kernel_initializer="normal")`

Warning (from warnings module):
  File "D:/pythonwork/tensorflow/example/5/nonlinear.py", line 53
    model.add(Dense(5, init='normal', activation='relu'))
UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(5, activation="relu", kernel_initializer="normal")`

Warning (from warnings module):
  File "D:/pythonwork/tensorflow/example/5/nonlinear.py", line 54
    model.add(Dense(1, init='normal'))
UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(1, kernel_initializer="normal")`

Warning (from warnings module):
  File "D:/pythonwork/tensorflow/example/5/nonlinear.py", line 60
    model.fit(X_train, y_train, nb_epoch=1000, validation_split=0.33, shuffle=True, verbose=2)
UserWarning: The `nb_epoch` argument in `fit` has been renamed `epochs`.
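Putting the warnings' suggestions together (a sketch based on the messages above; X_train and y_train come from the original script, and the compile step is assumed):

from keras.models import Sequential
from keras.layers import Dense

# Keras 2 API: init= becomes kernel_initializer=, nb_epoch= becomes epochs=
model = Sequential()
model.add(Dense(10, input_dim=7, activation='relu', kernel_initializer='normal'))
model.add(Dense(5, activation='relu', kernel_initializer='normal'))
model.add(Dense(1, kernel_initializer='normal'))
model.compile(loss='mean_squared_error', optimizer='adam')  # compile step assumed
model.fit(X_train, y_train, epochs=1000, validation_split=0.33, shuffle=True, verbose=2)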

After installing TensorFlow, import fails with: ImportError: DLL load failed: The specified module could not be found


TensorFlow can be installed through pip or Anaconda, but after installation, running this in a Python script:

import tensorflow as tf

gives the error: ImportError: DLL load failed: The specified module could not be found.

Three solutions were tried.

1) Installing tensorflow-gpu in addition to tensorflow

I later learned that tensorflow-gpu mainly provides GPU acceleration; installing it did not solve this problem.

2) Uninstalling and reinstalling

I upgraded pip, uninstalled with pip uninstall tensorflow, and reinstalled with pip install tensorflow. That still did not solve the problem.

3) Updating Pillow

Pillow is a Python image-processing library that ships with Anaconda. Perhaps because the bundled version was too old, it needed updating:

conda uninstall pillow
conda update pip
pip install pillow

These three commands first uninstall Anaconda's Pillow, then update pip, then install the latest Pillow with the upgraded pip. That solved the problem. It is a bit mysterious; I don't know why this Python package conflicted with TensorFlow. In any case, most strange installation problems are version problems, so check versions first: usually an upgrade fixes things, though sometimes, annoyingly, a downgrade is needed.

Reference:

https://blog.csdn.net/blueheart20/article/details/79612985

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

I tried changing the cuDNN version, but it didn't help. The real issue is GPU memory, solved as follows:

TensorFlow:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
with tf.Session(config=config) as session:
    ...  # build and run the model inside this session

Keras:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
keras.backend.tensorflow_backend.set_session(tf.Session(config=config))

My environment is cuDNN 7.6 + CUDA 10.0 + Python 3.6 on an RTX 2080 Ti, and with this change it runs.

“Failed to get convolution algorithm. This is probably because cuDNN failed to initialize”

I have recently been using TensorFlow 2.0. The following error occurred while running a program:

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

At first I suspected that CUDA and cuDNN were configured incorrectly (their versions must match), but after much trial and error the error remained, until I added the following:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

This tells TensorFlow to allocate GPU memory on demand.
The main reason is that my images are relatively large and consume a lot of GPU memory, while my graphics card (RTX 2060) has only 6 GB. The error message is very misleading: it makes people obsess over CUDA and cuDNN versions. I post it here so that others can avoid repeating the same mistake.
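If on-demand growth alone is not enough, ConfigProto can also cap the fraction of GPU memory TensorFlow may claim (an addition of mine, not from the original post, using the same compat.v1 API):

from tensorflow.compat.v1 import ConfigProto, InteractiveSession

config = ConfigProto()
# Let TensorFlow claim at most ~60% of the card's memory; tune to your GPU.
config.gpu_options.per_process_gpu_memory_fraction = 0.6
session = InteractiveSession(config=config)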


Reference:

https://github.com/tensorflow/tensorflow/issues/24828