Possible Memory Leak with Nvidia GPU
Author | Message |
---|---|
Joined: 4 Apr 16 Posts: 3 Credit: 31,943 RAC: 0 |
My system: Ubuntu 14.04 server running kernel 3.16.0-69; BOINC client version 7.2.42 for x86_64-pc-linux-gnu; OpenCL: NVIDIA GPU 0: GeForce GTX 560 Ti (driver version 361.42, device version OpenCL 1.1 CUDA, 1023MB, 990MB available, 1306 GFLOPS peak). I've configured the project to run only GPU work units and the client to use no more than 25% of RAM. The problem I see on this system: every time a work unit starts, it uses up another 10 MB to 30 MB of RAM. I can monitor this by issuing a "free -h" command every time a work unit starts (a work unit only takes about 3 minutes to complete). This slowly fills up my RAM, and after a couple of hours the system has to start using swap space. I have tried to clear the RAM by issuing an "echo 3 > /proc/sys/vm/drop_caches" at regular intervals, but this does not keep the RAM from getting completely filled by the work units. Once the RAM is filled, I have to reboot the server. For the time being, I have switched to Einstein@home and my machine is humming along fine with no problems. This looks like a memory leak in the Milkyway@home work units (i.e. a work unit does not release all the RAM it used after it completes). It's also annoying that the client doesn't seem to pay any attention to the 25% limit. |
Joined: 15 Jun 16 Posts: 5 Credit: 199,739 RAC: 0 |
Looks like I'm also seeing this issue - my desktop is using quite a bit of swap space, and I don't think MilkyWay@Home is meant to be using 7.7 GB of RAM? I'm running only this project, with default settings (Ubuntu 16.04): -> sudo service boinc-client status ● boinc-client.service - Berkeley Open Infrastructure Network Computing Client Loaded: loaded (/lib/systemd/system/boinc-client.service; enabled; vendor preset: enabled) Active: active (running) since Thu 2016-06-16 09:11:06 CDT; 4 days ago Process: 14284 ExecStopPost=/bin/rm -f /var/lib/boinc-client/lockfile (code=exited, status=0/SUCCESS) Process: 10403 ExecStartPre=/bin/chown boinc:boinc /var/log/boinc.log /var/log/boincerr.log (code=exited, status=0/SUCCESS) Process: 10400 ExecStartPre=/usr/bin/touch /var/log/boinc.log /var/log/boincerr.log (code=exited, status=0/SUCCESS) Main PID: 10409 (sh) Tasks: 3 Memory: 7.7G CPU: 2w 3h 33min 57.426s CGroup: /system.slice/boinc-client.service ├─10409 /bin/sh -c /usr/bin/boinc --dir /var/lib/boinc-client >/var/log/boinc.log 2>/var/log/boincerr.log └─10414 /usr/bin/boinc --dir /var/lib/boinc-client |
Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hey guys, any idea whether you are seeing the memory issues with the N-body application or the Separation application? Jake |
Joined: 15 Jun 16 Posts: 5 Credit: 199,739 RAC: 0 |
Not sure how to tell. I can easily disable GPU jobs overnight and see what happens; is there a way to pick only one of the CPU applications to isolate those? |
Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Yes, you can stop receiving work units from specific applications by changing the MilkyWay@home preferences in your account settings here. Thanks for your willingness to help test this. Jake |
Joined: 15 Jun 16 Posts: 5 Credit: 199,739 RAC: 0 |
Well, that was easy. It's definitely the GPU jobs. With GPU suspended the service's RAM usage stays below 40 MB, but with GPU enabled it climbs to ~600 MB within a couple of hours. My GPU is an Nvidia Quadro 2000; I can give you more system details if they would be helpful. Killing boinc-client and simply restarting X (via logout/login) seems to clear it up, which is what led me to suspect the GPU app initially. Maybe a bug in the code that interacts with Xorg? |
Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
It looks like you are using BOINC client version 7.6.31, which is not currently available for download on the BOINC website (maybe an old beta version they have since updated). Can you try downloading their newest stable version, 7.6.22, and see whether the issue persists? Also, which Linux OS are you using? Jake |
Joined: 15 Jun 16 Posts: 5 Credit: 199,739 RAC: 0 |
Additional information: I should have mentioned that I determined those numbers during the day, not by actually leaving it overnight. It looks like suspending and resuming GPU jobs may be the culprit - when I suspend a job, memory usage drops by 25 MB, but when I resume, it goes up by 50 MB, a net gain of about 25 MB per cycle. So having dynamic 'Suspend when computer is busy' will cause some serious memory growth if it cycles very often. I can try using the older client, though I'm not sure why the download on the site is four minor versions behind. 7.6.31 is the version in the Ubuntu 16.04 repos, and was released Mar 3 (https://github.com/BOINC/boinc/releases/tag/client_release%2F7.6%2F7.6.31). 7.6.33 was released June 5, in fact. So that line is definitely not old, though it may not be considered stable in some way? I don't understand their versioning scheme. |
Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Sounds like whoever makes the packages for the package manager picked a strange version. I would try one of the other versions; please let me know what you see. Jake |
Joined: 15 Jun 16 Posts: 5 Credit: 199,739 RAC: 0 |
Haven't tried the older client, but A/B testing confirms the problem is with the Separation application. I left N-body running for the full long weekend and it's still at 41.1M RAM usage. Running just 'Separation (Modified Fit)' eats RAM fast. (I didn't know about the 'Reset project' button, so switching tasks to confirm things took a while.) |
Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
N-body is a CPU application, and Separation is both a GPU and a CPU application. It may be that the client version you are running has a memory leak when dealing with GPU runs. Since you are the first user to bring up this problem and we are unable to reproduce it, I find it most likely to be related to the BOINC version you are running. If you can test with the most recent stable version of the BOINC client and still see the problem, I will be able to help you more. Jake |
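
For anyone who wants to put numbers on the growth reported in this thread instead of running "free -h" by hand, here is a minimal Python sketch. It is not part of BOINC or MilkyWay@home; the function names (parse_meminfo, used_kb, growth_per_sample, monitor) are made up for illustration. It assumes a Linux system exposing /proc/meminfo and counts "used" memory as MemTotal minus MemFree, Buffers and Cached, so reclaimable page cache is not mistaken for a leak. The 180-second default interval is only a guess based on the roughly 3-minute work units mentioned in the first post.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def used_kb(meminfo):
    """Memory in use in kB, excluding free memory, buffers and page cache."""
    return (meminfo["MemTotal"] - meminfo["MemFree"]
            - meminfo.get("Buffers", 0) - meminfo.get("Cached", 0))


def growth_per_sample(samples_kb):
    """Differences between consecutive samples, in kB."""
    return [later - earlier
            for earlier, later in zip(samples_kb, samples_kb[1:])]


def monitor(interval_s=180, count=10, path="/proc/meminfo"):
    """Take `count` samples of used memory, `interval_s` seconds apart."""
    samples = []
    for i in range(count):
        with open(path) as f:
            samples.append(used_kb(parse_meminfo(f.read())))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

Calling growth_per_sample(monitor()) while only Separation GPU tasks are running, and again with GPU work suspended, should show whether used memory keeps climbing from sample to sample. A steady positive delta that survives dropping the caches would be consistent with the leak described above, though it would not by itself say whether the application, the driver or the client is responsible.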
©2024 Astroinformatics Group