Google Colaboratory: misleading information about its GPU (only 5% RAM available to some users)

Question

update: this question is related to Google Colab's "Notebook settings: Hardware accelerator: GPU". This question was written before the "TPU" option was added.

After reading multiple excited announcements about Google Colaboratory providing a free Tesla K80 GPU, I tried to run a fast.ai lesson on it, only for it to never complete: it quickly ran out of memory. I started investigating why.

The bottom line is that “free Tesla K80” is not "free" for all - for some only a small slice of it is "free".

I connect to Google Colab from West Coast Canada and I get only 0.5GB of what is supposed to be 24GB of GPU RAM. Other users get access to 11GB of GPU RAM.

Clearly 0.5GB GPU RAM is insufficient for most ML/DL work.

If you're not sure what you get, here is a little debug function I scraped together (it only works with the GPU setting of the notebook):

# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import os

import psutil
import humanize
import GPUtil as GPU

def printm():
    # Query the GPU on every call: GPUtil returns a snapshot, not a live view
    # XXX: only one GPU on Colab, and it isn't guaranteed
    gpu = GPU.getGPUs()[0]
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()

Executing it in a jupyter notebook before running any other code gives me:

Gen RAM Free: 11.6 GB  | Proc size: 666.0 MB
GPU RAM Free: 566MB | Used: 10873MB | Util  95% | Total 11439MB

The lucky users who get access to the full card will see:

Gen RAM Free: 11.6 GB  | Proc size: 666.0 MB
GPU RAM Free: 11439MB | Used: 0MB | Util  0% | Total 11439MB

Do you see any flaw in my calculation of the GPU RAM availability, borrowed from GPUtil?
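One way to cross-check the GPUtil numbers is to ask nvidia-smi directly and compute the free fraction by hand. The following is only a minimal sketch, assuming nvidia-smi is on the PATH; the helper names parse_memory_csv and query_gpu_memory are made up for illustration and are not part of GPUtil or Colab.

```python
import shutil
import subprocess

# nvidia-smi query that prints one "total, used, free" row (in MiB) per GPU
QUERY = ["nvidia-smi",
         "--query-gpu=memory.total,memory.used,memory.free",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text):
    """Parse 'total, used, free' rows into a list of dicts, one per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        total, used, free = (float(field) for field in line.split(","))
        gpus.append({"total": total, "used": used, "free": free,
                     "free_fraction": free / total if total else 0.0})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi if it is installed; return [] when it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_memory_csv(result.stdout)


if __name__ == "__main__":
    for index, gpu in enumerate(query_gpu_memory()):
        print("GPU {}: {:.0f}MB free of {:.0f}MB ({:.1%} free)".format(
            index, gpu["free"], gpu["total"], gpu["free_fraction"]))
```

Feeding it the figures reported above (11439 total, 10873 used, 566 free) gives a free fraction of 566 / 11439, which is roughly 5%, so the "5%" in the title matches what the tool reports.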

Can you confirm that you get similar results if you run this code in a Google Colab notebook?

If my calculations are correct, is there any way to get more of that GPU RAM on the free box?

update: I'm not sure why some of us get 1/20th of what other users get. For example, the person who helped me debug this is from India and he gets the whole thing!

note: please don't send any more suggestions on how to kill the potentially stuck/runaway/parallel notebooks that might be consuming parts of the GPU. No matter how you slice it, if you are in the same boat as me and run the debug code, you'll see that you still get a total of 5% of GPU RAM (still true as of this update).

Any solution to this? why do i get different results when doing !cat /proc/meminfo — figs_and_nuts, Commented Feb 19, 2018 at 4:09
Yep, same problem, just around 500 mb of GPU ram...misleading description :( — Naveen, Commented Apr 10, 2018 at 8:31
Try IBM open source data science tools(cognitiveclass.ai) as they also have a free GPU with jupyter notebooks. — A Q, Commented Jun 24, 2018 at 11:14
I've rolled back this question to a state where there's actually a question in it. If you've done more research and found an answer, the appropriate place for that is in the answer box. It is incorrect to update the question with a solution. — Chris Hayes, Commented Aug 24, 2018 at 0:30
@ChrisHayes, I understand your intention, but this is not right, since your rollback deleted a whole bunch of relevant details that are now gone. If you'd like to suggest a better wording that better fits the rules of this community please do so, but otherwise please revert your rollback. Thank you. p.s. I already did post the answer. — stason, Commented Aug 24, 2018 at 1:02

stason · Accepted Answer · 2019-12-20 19:50:10Z

55

To head off another dozen answers suggesting !kill -9 -1, which doesn't apply to the problem described in this thread, let's close this thread:

The answer is simple:

As of this writing Google simply gives only 5% of GPU to some of us, whereas 100% to the others. Period.

dec-2019 update: The problem still exists - this question's upvotes continue still.

mar-2019 update: A year later a Google employee, @AmiF, commented on the state of things, stating that the problem doesn't exist and that anybody who seems to have it simply needs to reset their runtime to recover memory. Yet the upvotes continue, which tells me that the problem still exists, despite @AmiF's suggestion to the contrary.

dec-2018 update: I have a theory that Google may have a blacklist of certain accounts, or perhaps browser fingerprints, when its robots detect a non-standard behavior. It could be a total coincidence, but for quite some time I had an issue with Google Re-captcha on any website that happened to require it, where I'd have to go through dozens of puzzles before I'd be allowed through, often taking me 10+ min to accomplish. This lasted for many months. All of a sudden as of this month I get no puzzles at all and any google re-captcha gets resolved with just a single mouse click, as it used to be almost a year ago.

And why am I telling this story? Well, because at the same time I was given 100% of the GPU RAM on Colab. That's why I suspect that if you are on a hypothetical Google blacklist, you aren't trusted with a lot of free resources. I wonder if any of you find the same correlation between the limited GPU access and the Re-captcha nightmare. As I said, it could be a total coincidence as well.

edited Dec 20, 2019 at 19:50

answered Jul 4, 2018 at 18:02

stason


5

Your statement of "As of this writing Google simply gives only 5% of GPU to some of us, whereas 100% to the others. Period." is incorrect - Colab has never worked this way. All diagnosed cases of users seeing less than the full complement of GPU RAM available to them have boiled down to another process (started by the same user, possibly in another notebook) using the rest of the GPU's RAM.
– Ami F
Commented Mar 22, 2019 at 22:37
14

Future readers: if you think you're seeing this or similar symptoms of GPU RAM unavailability, "Reset all runtimes" in the Runtime menu will get you a fresh VM guaranteeing no stale processes are still holding on to GPU RAM. If you still see this symptom immediately after using that menu option please file a bug at github.com/googlecolab/colabtools/issues
– Ami F
Commented Mar 22, 2019 at 22:37
5

In case it was unclear: I'm not describing what I believe the implementation is based on observation of the system's behavior as a user. I'm describing what I directly know the implementation to be. I posted hoping that users who see less than full availability report it as an issue (either user error or system bug) instead of reading the incorrect statements above and assuming things are working as intended.
– Ami F
Commented Mar 24, 2019 at 1:28
1

In other words you're saying you're a Google employee and you're implying that colab stopped discriminating users and from now on if a user doesn't get 100% GPU RAM on their first connect and not due to some previous usage by the same user, there must be a bug in your system, which you request to report. And you will actually look at the problem and not deal with it like I have shown in this example where one of you lied to the user giving him an incorrect reason for the problem. @AmiF.
– stason
Commented Mar 24, 2019 at 15:17
2

No, GPUs have never been shared, and there are no lies in the example you linked (simply a guess at and explanation of the far-and-away most-common reason for the symptom reported).
– Ami F
Commented Mar 24, 2019 at 20:08


Nguyễn Tài Long · Accepted Answer · 2018-05-25 09:31:29Z

24

Last night I ran your snippet and got exactly what you got:

Gen RAM Free: 11.6 GB  | Proc size: 666.0 MB
GPU RAM Free: 566MB | Used: 10873MB | Util  95% | Total 11439MB

but today:

Gen RAM Free: 12.2 GB  | Proc size: 131.5 MB
GPU RAM Free: 11439MB | Used: 0MB | Util   0% | Total 11439MB

I think the most probable reason is that the GPUs are shared among VMs, so each time you restart the runtime you have a chance of being switched to a different GPU, and there is also some probability that you land on one that is being used by other users.

UPDATED: It turns out that I can use the GPU normally even when GPU RAM Free is 504 MB, which I had assumed was the cause of the ResourceExhaustedError I got last night.

edited May 25, 2018 at 9:31

answered Feb 15, 2018 at 8:53

Nguyễn Tài Long


1

I think I re-connected probably 50 times over the period of a few days and I was always getting the same 95% usage to start with. Only once I saw 0%. In all those attempts I was getting cuda out of memory error once it was coming close to 100%.
– stason
Commented Feb 16, 2018 at 4:40
What do you mean with your update? Can you still run stuff with 500Mb? I have the same problem, I am getting RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generated/../THCTensorMathCompare.cuh:84
– Ivan Bilan
Commented Mar 11, 2018 at 21:03


Ajaychhimpa1 · Accepted Answer · 2018-02-26 05:26:32Z

7

If you execute a cell that just has
!kill -9 -1
in it, that'll cause all of your runtime's state (including memory, filesystem, and GPU) to be wiped clean and restarted. Wait 30-60s and press the CONNECT button at the top-right to reconnect.

answered Feb 26, 2018 at 5:26

Ajaychhimpa1


2

thank you, but your suggestion doesn't change anything. I'm still getting 5% of GPU RAM.
– stason
Commented Mar 2, 2018 at 6:57
This doesn't help. After killing and reconnecting, the GPU memory is still at 500Mb out of ~12GB.
– Ivan Bilan
Commented Mar 11, 2018 at 21:04


mkczyk · Accepted Answer · 2018-04-22 03:22:34Z

3

Restart Jupyter IPython Kernel:

!pkill -9 -f ipykernel_launcher

answered Apr 22, 2018 at 3:22

mkczyk


1

close, but no cigar: GPU RAM Free: 564MB
– Ivan Bilan
Commented Apr 22, 2018 at 16:47
As a simpler method for restarting the kernel, you can just click Runtime | Restart runtime... or use the shortcut CMD/CTRL+M
– Agile Bean
Commented Nov 29, 2018 at 12:57


Manivannan Murugavel · Accepted Answer · 2018-04-05 06:33:26Z

2

Find the python3 PID and kill it. Please see the image below.

Note: kill only python3(pid=130) not jupyter python(122).

answered Apr 5, 2018 at 6:33

Manivannan Murugavel


will this help with the memory issue? aren't you killing all other people's runs then?
– Ivan Bilan
Commented Apr 6, 2018 at 11:17
this doesn't help, got same problem: GPU RAM Free: 564MB
– Ivan Bilan
Commented Apr 22, 2018 at 16:48


Jainil Patel · Accepted Answer · 2019-09-24 13:44:25Z

2

Just give Google Colab a heavy task; it will then offer to switch us to a runtime with 25 GB of RAM.

For example, run this code twice:

from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Flatten, Dense
from keras.models import Sequential
from keras.datasets import cifar10
(train_features, train_labels), (test_features, test_labels) = cifar10.load_data()
model = Sequential()

model.add(Conv2D(filters=16, kernel_size=(2, 2), padding="same", activation="relu", input_shape=(train_features.shape[1:])))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(filters=32, kernel_size=(3, 3), padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(filters=64, kernel_size=(4, 4), padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Flatten())

model.add(Dense(25600, activation="relu"))
model.add(Dense(25600, activation="relu"))
model.add(Dense(25600, activation="relu"))
model.add(Dense(25600, activation="relu"))
model.add(Dense(10, activation="softmax"))

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.fit(train_features, train_labels, validation_split=0.2, epochs=10, batch_size=128, verbose=1)

Then click on "Get more RAM" :)

edited Sep 24, 2019 at 13:44

answered Sep 24, 2019 at 13:29

Jainil Patel


I can confirm this. I had a 15 gig dataset of mostly HD pictures (my drive has 30 gigs instead of 15gigs) and I ran my code to resize the image dataset to 224,224,3 and I was switched to a high RAM runtime. Then when I began training RAM usage went up to 31.88gigs.
– Anshuman Kumar
Commented Feb 25, 2020 at 6:40
But I would like to add that once I finished that job, I have not been able to access another GPU/TPU for the past 24 hours. It is possible I was blacklisted.
– Anshuman Kumar
Commented Feb 25, 2020 at 7:01
@AnshumanKumar, apply the high load at the very beginning; otherwise, when the configuration changes, you will lose any previous work that was held in RAM. I didn't use the high configuration for 24 hours, so I don't know about blacklisting.
– Jainil Patel
Commented Feb 26, 2020 at 20:02
Yes, that did happen with me. However the work got done.
– Anshuman Kumar
Commented Feb 27, 2020 at 3:21


desertnaut · Accepted Answer · 2020-09-06 19:07:31Z

2

I'm not sure this blacklisting theory is true! It's quite possible that the cores are shared among users. I also ran the test, and my results are the following:

Gen RAM Free: 12.9 GB  | Proc size: 142.8 MB
GPU RAM Free: 11441MB | Used: 0MB | Util   0% | Total 11441MB

It seems I'm also getting the full core. I ran it a few times and got the same result each time. Maybe I will repeat this check a few times during the day to see if there is any change.

edited Sep 6, 2020 at 19:07

desertnaut


answered Feb 28, 2019 at 10:48

Kregnach


Ritwik G · Accepted Answer · 2018-02-25 13:10:20Z

1

I believe this happens if we have multiple notebooks open: just closing a notebook doesn't actually stop its process. I haven't figured out how to stop it properly, but I used top to find the PID of the python3 process that had been running longest and was using the most memory, and I killed it. Everything is back to normal now.
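The same search can be scripted instead of reading top by eye. This is only a rough sketch, assuming a Linux runtime where ps accepts -eo pid,etimes,rss,comm; the function names are hypothetical. Note that it ranks processes by host RAM (resident set size), as top does, not by GPU RAM.

```python
import subprocess


def parse_ps(text):
    """Parse `ps -eo pid,etimes,rss,comm` output into dicts (header skipped)."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, elapsed, rss_kib, command = parts
        rows.append({"pid": int(pid), "elapsed_s": int(elapsed),
                     "rss_kib": int(rss_kib), "command": command})
    return rows


def python_processes(rows):
    """Keep Python processes only, largest resident memory first."""
    matches = [row for row in rows if row["command"].startswith("python")]
    return sorted(matches, key=lambda row: row["rss_kib"], reverse=True)


def list_python_processes():
    """Run ps and return the Python processes sorted by memory use."""
    result = subprocess.run(["ps", "-eo", "pid,etimes,rss,comm"],
                            capture_output=True, text=True, check=True)
    return python_processes(parse_ps(result.stdout))
```

Calling list_python_processes() in a notebook cell returns the candidates with their PIDs and elapsed run times. Which one to kill is still a judgment call, since the notebook server itself may also show up as a Python process.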

answered Feb 25, 2018 at 13:10

Ritwik G


wall-e · Accepted Answer · 2024-01-21 18:22:05Z

0

I'm not sure about TensorFlow (I didn't test it), but JAX preallocates 75% of GPU memory by default.

Source: https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html

This helps in the case of JAX:

# Must set this environment variable before importing JAX
import os
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '.10'  # preallocate 10% instead of the default 75%.

import jax
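The JAX page linked above also describes XLA_PYTHON_CLIENT_PREALLOCATE, which turns preallocation off entirely. A small sketch that wraps both settings follows; the helper name configure_jax_gpu_memory is made up for illustration, and it must run before the first import of jax in the process.

```python
import os


def configure_jax_gpu_memory(fraction=None, preallocate=True):
    """Set XLA client memory options; call before the first `import jax`."""
    if not preallocate:
        # Allocate GPU memory on demand instead of reserving a pool up front.
        os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
    if fraction is not None:
        if not 0.0 < fraction <= 1.0:
            raise ValueError("fraction must be in (0, 1]")
        # Fraction of total GPU memory that JAX reserves when preallocating.
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "{:.2f}".format(fraction)


configure_jax_gpu_memory(fraction=0.10)
# import jax  # import only after the environment is configured
```

Either setting only changes how much memory JAX itself takes. It would explain a nearly full GPU only if a JAX process of your own is what holds the memory.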

answered Jan 21, 2024 at 18:22

wall-e


Ankit Veer Singh · Accepted Answer · 2020-09-10 11:24:04Z

-1

Google Colab resource allocation is dynamic and based on users' past usage. If one user has been using a lot of resources recently and another user uses Colab less frequently, the less frequent user will be given relatively more preference in resource allocation.

Hence, to get the most out of Colab, close all your Colab tabs and all other active sessions, then reset the runtime of the one you want to use. You'll definitely get better GPU allocation.

answered Sep 10, 2020 at 11:24

Ankit Veer Singh
