Welcome to MilkyWay@home

MT Nbody 1.62 workunit locked up

Captiosus

Joined: 9 Apr 14
Posts: 35
Credit: 9,708,616
RAC: 0
Message 64998 - Posted: 7 Aug 2016, 23:12:17 UTC

So, I thought I'd cook a few units last night, and left my CPU and GPUs to finish running the batch I pulled from the servers. For the most part, everything worked out fine (some of the CPU MT units went by really fast), and I left it running overnight.

I woke up this morning and noticed that I still had CPU activity. I checked BOINC and saw that one of the MT units had frozen at 91.019% and had been in that state for the past 11 hours or so.
There was also a fairly considerable memory leak in progress: when I aborted the unit I freed about 3 GB of memory, even though Task Manager, Resource Monitor, and Process Explorer all showed it using only about 800 MB.

The WU in question is de_nbody_8_1_16_v162_2k_3_1470395169_81723_4
Captiosus

Joined: 9 Apr 14
Posts: 35
Credit: 9,708,616
RAC: 0
Message 65001 - Posted: 8 Aug 2016, 0:20:11 UTC

I'd like to add (since it won't let me edit):

I've also been getting a number of errors in Nbody, for both MT and ST tasks, all on v1.62. Two MT units (the worse of the two is highlighted above) and five (and counting) ST Nbody units have failed with a computation error. The log shows that every failed unit hit an "Exceeded disk limit xx.xx MB > 50MB" error. I have 4 GB of disk space set aside for BOINC units, so running out of space can't be it.
Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 65006 - Posted: 8 Aug 2016, 12:58:39 UTC

Hey Captiosus,

I will let Sidd (the scientist in charge of N-body) know to take a look at your post when I see him later today.

Jake
Captiosus

Joined: 9 Apr 14
Posts: 35
Credit: 9,708,616
RAC: 0
Message 65017 - Posted: 11 Aug 2016, 12:28:41 UTC

Kewl, I'll be looking forward to the response, if any.
Kwartet!

Joined: 27 Nov 12
Posts: 8
Credit: 126,516,924
RAC: 0
Message 65064 - Posted: 22 Aug 2016, 18:20:21 UTC - in response to Message 65006.  

Also got some of those on Linux (Ubuntu):

task wu computer
1739366241 1277165096 698989
1739366264 1274864663 698989
1739367380 1277019122 698989
1737734321 1275223469 698989
1736835394 1259652472 698989
1736836444 1275252939 698989
1736835522 1274974287 698989
1736835532 1271470693 698989
1733987832 1271863723 698989
1730688175 1271147729 698989
1725242958 1267766875 698989
1723391303 1266509406 698989

Exit status 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED

Stderr output:
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
Disk usage limit exceeded
</message>
<stderr_txt>
<search_application> milkyway_nbody 1.62 Linux x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 7 max threads on a system with 8 processors

</stderr_txt>
]]>

MilkyWay assumes a disk limit of 50.0 MB per task. These tasks exceed that limit (the two still in my event log used 77.57 MB and 117.54 MB), so they get aborted.

There's still almost 20 GB available to BOINC.
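
For anyone who wants to watch how close a running task gets to the 50 MB figure from the log, here is a minimal Python sketch. It only sums file sizes under a directory and compares the total to a limit. The slot directory path in the usage note is hypothetical, the decimal-megabyte conversion is an assumption, and this is not necessarily how the BOINC client itself accounts for disk usage. The limit in the log looks like a per-task bound rather than the overall BOINC disk allowance, which would explain why having plenty of free space does not help.

```python
import os

# Assumption: the "50.0 MB" in the log is treated here as decimal megabytes.
DISK_LIMIT_BYTES = 50 * 1000 * 1000


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked project files are not counted here.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def over_limit(path, limit_bytes=DISK_LIMIT_BYTES):
    """Return (exceeded, used_bytes) for the directory at path."""
    used = dir_size_bytes(path)
    return used > limit_bytes, used
```

Usage would look like `over_limit("/var/lib/boinc-client/slots/0")`, where the path is only an example and depends on the installation. Running it every minute or so while an Nbody task is active should show which files are growing.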
Kylinblue

Joined: 23 Aug 11
Posts: 7
Credit: 498,188
RAC: 0
Message 65065 - Posted: 23 Aug 2016, 16:53:54 UTC

Similar problem with 3 tasks.

1742933685
1742917766
1742912895
Vortac

Joined: 22 Apr 09
Posts: 95
Credit: 4,808,181,963
RAC: 0
Message 65066 - Posted: 23 Aug 2016, 20:10:27 UTC

Here is another N-Body task that went rogue, using more than 800MB of RAM and getting stuck: http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1742961635

After I restarted my BOINC Manager, it resumed normally and finished within a few minutes, without errors. This is not the only N-Body task that has got stuck on my PC. I am getting one such task almost every day, and if I don't restart my BOINC Manager, it drags on for hours without any progress whatsoever. However, I caught this one tonight just as it got stuck, so I decided to post it here.
Captiosus

Joined: 9 Apr 14
Posts: 35
Credit: 9,708,616
RAC: 0
Message 65069 - Posted: 24 Aug 2016, 14:35:53 UTC

Adding that I just had an Nbody unit lock up on my secondary rig. It made it to 1.516% complete, then froze.

WU is de_nbody_8_1_16_v162_2k_3_147135217_585063_0

I'm not entirely sure whether this one froze because the secondary rig is running a fairly significant overclock (+1.1 GHz on a C2Q), so I will back it down to 3.6 GHz (which I know is stable on this rig) just to be sure.
Captiosus

Joined: 9 Apr 14
Posts: 35
Credit: 9,708,616
RAC: 0
Message 65070 - Posted: 24 Aug 2016, 19:11:42 UTC

Well, the unit hasn't locked up, but I now have an Nbody WU that is leaking memory as it progresses. Idle for this rig is about 1 GB. In the past hour and a half, this unit has pushed the computer from 1 GB of memory usage to 1.9 GB, and the climb started in earnest about 10 minutes in.
The rate of increase looks to be about 100 MB every 10 minutes or so, and growth is linear after the first 10 minutes.

The WU that is leaking is de_nbody_8_1_16_v162_2k_3_1471352127_585064_0, although shorter workunits also show the same memory behavior until they end (very slow growth for the first 10 minutes, then 100 MB every following 10 minutes).
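
As a rough sanity check on those numbers, here is a small Python sketch of the growth pattern described above: flat for the first 10 minutes, then linear at about 100 MB per 10 minutes (10 MB per minute). The function names and the 4000 MB ceiling in the usage note are hypothetical, and the model ignores the slow growth during the first 10 minutes.

```python
def projected_usage_mb(minutes, baseline_mb=1000.0,
                       ramp_start_min=10.0, rate_mb_per_min=10.0):
    """Piecewise model: flat at baseline, then linear growth after ramp_start_min."""
    growing_minutes = max(0.0, minutes - ramp_start_min)
    return baseline_mb + growing_minutes * rate_mb_per_min


def minutes_until(limit_mb, baseline_mb=1000.0,
                  ramp_start_min=10.0, rate_mb_per_min=10.0):
    """Minutes of runtime before total usage reaches limit_mb under the same model."""
    return ramp_start_min + (limit_mb - baseline_mb) / rate_mb_per_min


print(projected_usage_mb(90))   # 1800.0 MB after 90 minutes
print(minutes_until(4000.0))    # 310.0 minutes to reach a hypothetical 4000 MB
```

The model gives about 1.8 GB after 90 minutes, which is in the same range as the 1.9 GB observed, so the "100 MB every 10 minutes" estimate is at least self-consistent. At that rate a long-running task would reach a hypothetical 4000 MB ceiling a little over five hours in.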
