[D] TensorFlow GPU memory management (TF_FORCE_GPU_ALLOW_GROWTH)
So this is more of an exploratory question. I am deploying models using a TF Serving Docker image with the TF_FORCE_GPU_ALLOW_GROWTH flag set. I am serving a small Fashion-MNIST model, a ResNet model (99 MB), and an Inception v3 model (92 MB). Because of the flag, the TF model server initially occupies only ~300 MB; then, on sequential requests to the models, GPU memory usage increases as follows (according to nvidia-smi):
~300 MB | after Inception request: ~4306 MB | after ResNet request: ~8402 MB
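For context, the requests are issued against TF Serving's REST predict endpoint, roughly like the sketch below. The container invocation, model names, port, and input shapes are placeholders for my actual setup, not an exact reproduction:

```python
# Sketch of how the sequential requests are issued against the TF Serving REST API.
# The serving container is started with allow-growth enabled, e.g.:
#   docker run --gpus all -p 8501:8501 -e TF_FORCE_GPU_ALLOW_GROWTH=true \
#       -v "$PWD/models:/models" -t tensorflow/serving:latest-gpu \
#       --model_config_file=/models/models.config
# Model names and input shapes below are placeholders for my actual models.
import json

import numpy as np
import requests

SERVER = "http://localhost:8501"

def predict(model_name, instances):
    """POST one predict request to a served model and return the JSON response."""
    url = f"{SERVER}/v1/models/{model_name}:predict"
    resp = requests.post(url, data=json.dumps({"instances": instances}))
    resp.raise_for_status()
    return resp.json()

# The first request to each model is what triggers its GPU memory allocation.
predict("inception_v3", np.random.rand(1, 299, 299, 3).tolist())
predict("resnet", np.random.rand(1, 224, 224, 3).tolist())
```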
If I send a request to ResNet first, GPU usage does not increase any further after that (even when I add more models):
~300 MB | after ResNet request: ~7888 MB | after Inception request: ~7888 MB
Why does GPU usage not increase after adding more models? Are models flushed from memory when new models are loaded for inference? How can I accurately estimate how many similar-sized models can be loaded on one GPU-enabled machine without resorting to trial and error? Is there a pattern to what fraction of GPU memory is progressively allocated?
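For completeness, this is roughly how I am doing the trial-and-error measurement right now: read `memory.used` from nvidia-smi before and after the first request to each model. Server address, model names, and dummy input shapes are placeholders for my setup:

```python
# Rough trial-and-error measurement: query nvidia-smi for GPU memory usage
# before and after the first request to each served model.
import json
import subprocess

import numpy as np
import requests

SERVER = "http://localhost:8501"

def gpu_memory_used_mib(gpu_index=0):
    """Return the memory.used value (in MiB) reported by nvidia-smi for one GPU."""
    out = subprocess.check_output([
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

def first_request(model_name, shape):
    """Send one predict request so TF Serving allocates GPU memory for the model."""
    payload = {"instances": np.random.rand(*shape).tolist()}
    resp = requests.post(f"{SERVER}/v1/models/{model_name}:predict",
                         data=json.dumps(payload))
    resp.raise_for_status()

# Placeholder model names and input shapes.
models = {"inception_v3": (1, 299, 299, 3), "resnet": (1, 224, 224, 3)}

previous = gpu_memory_used_mib()
print(f"baseline: {previous} MiB")
for name, shape in models.items():
    first_request(name, shape)
    used = gpu_memory_used_mib()
    print(f"after {name}: {used} MiB (+{used - previous} MiB)")
    previous = used
```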
Note: This is run on an EC2 instance with 11441 MiB of available GPU memory (Tesla K80). When I try to run the same setup on a machine with lower capacity (Quadro P2000, 5059 MiB), I face a similar situation where memory usage does not increase further. However, I also get the following in the logs:
```
2019-12-11 05:10:54.727985: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.25GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-12-11 05:10:54.736610: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
```