MT Nbody 1.62 workunit locked up
Author | Message |
---|---|
Send message Joined: 9 Apr 14 Posts: 35 Credit: 9,708,616 RAC: 0 |
So, I thought I'd cook a few units last night, and left my CPU and GPUs to finish the batch I had pulled from the servers. For the most part everything worked out fine (some of the CPU MT units went by really fast), and I left it running overnight. I woke up this morning and noticed that I still had CPU activity. I checked BOINC and saw that one of the MT units had frozen at 91.019% and had been in that state for the past 11 hours or so. There was also a fairly considerable memory leak in progress: when I aborted the unit I freed about 3 GB of memory, even though Task Manager, Resource Monitor, and Process Explorer all showed it using only about 800 MB. The WU in question is de_nbody_8_1_16_v162_2k_3_1470395169_81723_4 |
Send message Joined: 9 Apr 14 Posts: 35 Credit: 9,708,616 RAC: 0 |
Would like to add (since it won't let me edit): I've also been getting a number of errors in Nbody, for both MT and ST tasks, all on v1.62. 2 MT units (the worse of the two highlighted above) and 5 (and counting) ST Nbody units have all failed with a computation error. The log shows that every failed unit hit an "Exceeded disk limit xx.xx MB > 50MB" error. I have 4 GB of disk space set aside for BOINC units, so running out of my own allocation can't be the cause. |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hey Captiosus, I will let Sidd (the scientist in charge of N-body) know to take a look at your post when I see him later today. Jake |
Send message Joined: 9 Apr 14 Posts: 35 Credit: 9,708,616 RAC: 0 |
Kewl, I'll be looking forward to the response, if any. |
Send message Joined: 27 Nov 12 Posts: 8 Credit: 126,516,924 RAC: 0 |
Also got some of those on Linux (Ubuntu). Task / WU pairs, all on computer 698989: 1739366241 / 1277165096; 1739366264 / 1274864663; 1739367380 / 1277019122; 1737734321 / 1275223469; 1736835394 / 1259652472; 1736836444 / 1275252939; 1736835522 / 1274974287; 1736835532 / 1271470693; 1733987832 / 1271863723; 1730688175 / 1271147729; 1725242958 / 1267766875; 1723391303 / 1266509406. Exit status 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED. Stderr output: <core_client_version>7.6.31</core_client_version> <![CDATA[ <message> Disk usage limit exceeded </message> <stderr_txt> <search_application> milkyway_nbody 1.62 Linux x86_64 double OpenMP, Crlibm </search_application> Using OpenMP 7 max threads on a system with 8 processors </stderr_txt> ]]> Milkyway assumes a disk limit of 50.0 MB; the tasks exceed it (the two still in my event log used 77.57 and 117.54 MB) and are aborted. There's still almost 20 GB available to BOINC. |
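The "Exceeded disk limit" message suggests the 50 MB cap applies to the individual task rather than to the disk space allotted to BOINC as a whole, which would explain why free space in the BOINC allocation does not help. A minimal sketch for checking this by hand, assuming the limit is measured on the task's working directory and that MB here means 1024*1024 bytes (both are assumptions, as are the function names):

```python
import os

DISK_LIMIT_MB = 50.0  # limit quoted in the error message above


def dir_size_mb(path):
    """Total size of regular files under path, in MB (1 MB = 1024*1024 bytes, assumed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.islink(fp):
                continue  # skip symlinks so linked files are not counted
            try:
                total += os.path.getsize(fp)
            except OSError:
                pass  # file vanished while scanning; ignore it
    return total / (1024 * 1024)


def exceeds_limit(path, limit_mb=DISK_LIMIT_MB):
    """Return (over_limit, used_mb) for the directory at path."""
    used = dir_size_mb(path)
    return used > limit_mb, used


if __name__ == "__main__":
    import sys
    # Pass the task's working directory as the first argument.
    if len(sys.argv) > 1:
        over, used = exceeds_limit(sys.argv[1])
        print(f"{used:.2f} MB used, limit {DISK_LIMIT_MB} MB, exceeded: {over}")
```

Running this periodically against a running Nbody task's directory would show whether its files grow past 50 MB before the client aborts it.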
Send message Joined: 23 Aug 11 Posts: 7 Credit: 498,188 RAC: 0 |
Similar problem with 3 tasks: 1742933685, 1742917766, 1742912895 |
Send message Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0 |
Here is another N-Body task that went rogue, using more than 800 MB of RAM and getting stuck: http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1742961635 After I restarted my BOINC Manager, it resumed normally and finished within a few minutes, without errors. This is not the only N-Body task that has got stuck on my PC. I am getting one such task almost every day, and if I don't restart my BOINC Manager it drags on for hours without any progress whatsoever. However, I caught this one tonight just as it got stuck, so I decided to post it here. |
Send message Joined: 9 Apr 14 Posts: 35 Credit: 9,708,616 RAC: 0 |
Adding that I just had an Nbody unit lock up on my secondary rig. It made it to 1.516% complete, then froze. WU is de_nbody_8_1_16_v162_2k_3_147135217_585063_0 I'm not entirely sure whether this one froze because the secondary rig is running a fairly significant overclock (+1.1 GHz on a C2Q), so I will back it down to 3.6 GHz (which I know is stable on this rig) just to be sure. |
Send message Joined: 9 Apr 14 Posts: 35 Credit: 9,708,616 RAC: 0 |
Well, the unit hasn't locked up, but I now have an Nbody WU that is leaking memory as it progresses. Idle for this rig is about 1 GB. In the past hour and a half, this unit has pushed the computer from 1 GB of memory usage to 1.9 GB, and the climb started in earnest about 10 minutes in. The increase looks to be about 100 MB every 10 minutes and is linear after the first 10 minutes. The WU that is leaking is de_nbody_8_1_16_v162_2k_3_1471352127_585064_0, although shorter workunits show the same memory behavior until they end (very slow growth for the first 10 minutes, then 100 MB every following 10 minutes). |
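The linear growth described above can be turned into a rough leak rate and a time-to-exhaustion estimate. A minimal sketch, assuming memory readings are noted by hand every 10 minutes; the sample values are illustrative numbers shaped like the figures in the post (about 100 MB per 10 minutes), not logged measurements, and the function names are made up for this example:

```python
def leak_rate_mb_per_min(samples):
    """Least-squares slope of (minutes, MB) samples, in MB per minute."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def minutes_until(limit_mb, current_mb, rate):
    """Minutes until memory use reaches limit_mb at a constant rate, or None if not growing."""
    if rate <= 0:
        return None
    return (limit_mb - current_mb) / rate


# Illustrative readings only: (minutes since start, system memory in MB).
samples = [(t, 1000 + 10 * t) for t in range(10, 100, 10)]

rate = leak_rate_mb_per_min(samples)
print(f"leak rate: {rate:.1f} MB/min")
print(f"minutes until 4000 MB: {minutes_until(4000, samples[-1][1], rate):.0f}")
```

Fitting a slope rather than comparing two readings makes the estimate less sensitive to the slow first 10 minutes and to noise from other processes.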
©2024 Astroinformatics Group