MilkyWay@Home N-Body Simulation errors...


Dunx

Joined: 13 Feb 11
Posts: 23
Credit: 605,307,351
RAC: 661,980
Message 48442 - Posted: 2 May 2011, 20:39:12 UTC

name de_nbody_orphan_test_2model_4_11216_1304343747
application MilkyWay@Home N-Body Simulation

All failed!

dunx
Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0
Message 48449 - Posted: 3 May 2011, 4:30:30 UTC - in response to Message 48442.  

See here: http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=2293&nowrap=true#48417
It seems not to happen if you aren't using app_info.xml.
Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 48452 - Posted: 3 May 2011, 12:54:52 UTC - in response to Message 48449.  
Last modified: 3 May 2011, 13:05:53 UTC

Hmmm...

I have two Phenom II 955s and one 945. One 955 is running XPP x64 and the other W7U x64 (similar to Techbird), and the 945 is running XPP x64, all with no app_info. Just for reference, all my hosts run multiple CPU projects, in case that makes any difference.

So far, the only task to complete successfully was on the host running W7U (yesterday morning). 10 consecutive tries on the XPP hosts have failed with code 128 (No child processes to wait for).

I went and took a look at the rest of the N-Body tasks I have assigned to all my hosts, and most of them have multiple failures on a variety of platforms (hardware and OS) and a number of different error codes (code 177, code 128, code 185, code 226).

So it looks like something really screwy is going on! ;-)

Unfortunately, since you're set to insta poof the valid tasks, it's hard to get a feel for what platforms are successfully completing N-Body from our end of the equation. :-(
Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0
Message 48454 - Posted: 3 May 2011, 14:24:40 UTC - in response to Message 48452.  

Hmmm...

I have two Phenom II 955s and one 945. One 955 is running XPP x64 and the other W7U x64 (similar to Techbird), and the 945 is running XPP x64, all with no app_info. Just for reference, all my hosts run multiple CPU projects, in case that makes any difference.

So far, the only task to complete successfully was on the host running W7U (yesterday morning). 10 consecutive tries on the XPP hosts have failed with code 128 (No child processes to wait for).

I went and took a look at the rest of the N-Body tasks I have assigned to all my hosts, and most of them have multiple failures on a variety of platforms (hardware and OS) and a number of different error codes (code 177, code 128, code 185, code 226).

The -177 one I think is the maximum time exceeded problem, which is this one. The others I think are all weird system problems, in particular the 128 no child processes one. There seem to be lots of references to it on the internet (like this: http://setiathome.berkeley.edu/forum_thread.php?id=31443).
Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 48455 - Posted: 3 May 2011, 15:09:34 UTC - in response to Message 48454.  
Last modified: 3 May 2011, 15:27:44 UTC

The -177 one I think is the maximum time exceeded problem, which is this one. The others I think are all weird system problems, in particular the 128 no child processes one. There seem to be lots of references to it on the internet (like this: http://setiathome.berkeley.edu/forum_thread.php?id=31443).


Agreed, and I believe you commented earlier that there were some N-Body tasks which had faulty FLOP estimates, which explains those.
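
For anyone wondering how a faulty FLOP estimate turns into a 177: as far as I understand it, the BOINC client aborts a task once its elapsed time passes roughly the workunit's FLOP bound (rsc_fpops_bound) divided by the estimated speed of the app version. Here is a minimal Python sketch of that arithmetic. The function names and the numbers are made up for illustration and are not taken from any real N-Body workunit or from the BOINC source.

```python
def max_elapsed_seconds(rsc_fpops_bound, est_flops):
    """Approximate run-time limit: FLOP bound divided by estimated speed."""
    if est_flops <= 0:
        raise ValueError("est_flops must be positive")
    return rsc_fpops_bound / est_flops


def exceeds_limit(elapsed_seconds, rsc_fpops_bound, est_flops):
    """True if a task running this long would be over the approximate limit."""
    return elapsed_seconds > max_elapsed_seconds(rsc_fpops_bound, est_flops)


# Hypothetical values: a bound of 1e13 FLOPs on a host estimated at 2e9 FLOPS.
bound = 1.0e13
flops = 2.0e9
print(max_elapsed_seconds(bound, flops))   # limit in seconds
print(exceeds_limit(6000, bound, flops))   # a 6000 s run against that limit
```

If the bound is set too low for what the task really needs, a perfectly healthy run can hit the limit, which would fit tasks failing with 177 on otherwise working hosts.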

However, all the other codes tend to indicate N-body is failing to start and/or initialize at all (the 128), or sometime soon thereafter (all the others I've seen so far).

<edit> I just checked my hosts status, and the W7U 955 has completed another N-Body successfully. Validation was inconclusive so it might stick around awhile and you can look at it. Here's the WU.

Interestingly, the failed task for the first wingman was a 177 on a C2D P8700 running XPP x86 SP3.
Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 48457 - Posted: 3 May 2011, 17:54:15 UTC
Last modified: 3 May 2011, 18:14:58 UTC

BOINC!!!!!!

Duhhhh...... Alinator!

I may have the answer. The only host I have currently completing N-Body tasks is the only one which doesn't have a GPU running tasks at all, due to being a W7U host with BOINC running as a service.

My hypothesis at this point is that when the MT NB task wants to run, the concurrently running GPU apps refuse to relinquish the core controlling them, and thus the NB task fails to initialize properly. Depending on just how a host is configured and what it's running at the time, the result would appear to range from not being able to start at all to failing somewhere else in the initialization chain of events.

<edit> I just checked the WU I posted about earlier, and now I'm not so confident about my hypothesis. The new wingman is running W7HP x64 with no GPUs. It looks like it might be a while before the task actually tries to run, so I'll have to wait and see.
Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 48458 - Posted: 3 May 2011, 19:33:45 UTC
Last modified: 3 May 2011, 19:37:59 UTC

Well, scratch the hypothesis. I just had an N-Body complete successfully on the Ph II 945 XPP x64 w/ GPU.

Back to ground zero on this mystery! :-)

<edit> Correction: I was looking at the wrong host, the hypothesis is still in play. The 945 w/ GPU *did* fail the task. Sorry.
Toppie*

Joined: 28 Mar 09
Posts: 68
Credit: 1,003,982,681
RAC: 0
Message 48471 - Posted: 4 May 2011, 17:41:44 UTC - in response to Message 48458.  

Well, scratch the hypothesis. I just had an N-Body complete successfully on the Ph II 945 XPP x64 w/ GPU.

Back to ground zero on this mystery! :-)

<edit> Correction: I was looking at the wrong host, the hypothesis is still in play. The 945 w/ GPU *did* fail the task. Sorry.


I had a similar problem on one of my machines. This machine was running Vista 64, and each and every N-body errored out. It even happened with the previous batch of N-body work (before the outage). Yesterday I upgraded to Win7 64 and the problem has gone away.
I am not a Win expert, but something I noticed after upgrading was that Windows was repairing the .NET 4.0 framework. Took a few minutes. I don't know what the role of the .NET framework is, or if it had any part in my problem. At least that machine can crunch happily now.
[AF>EDLS] Polynesia

Joined: 5 Apr 09
Posts: 71
Credit: 6,120,786
RAC: 0
Message 48472 - Posted: 4 May 2011, 19:08:59 UTC
Last modified: 4 May 2011, 19:12:44 UTC

Weird, because a few days ago it worked fine with my app_info file, and now everything ends in error...

It must come from the file, but I do not see why.

<app_info>
    <app>
        <name>milkyway_nbody</name>
        <user_friendly_name>MilkyWay@Home nbody Simulation</user_friendly_name>
    </app>
    <file_info>
        <name>milkyway_nbody_0.40_windows_x86_64__mt.exe</name>
        <executable/>
    </file_info>
    <file_info>
        <name>libgomp_64-1.dll</name>
        <executable/>
    </file_info>
    <file_info>
        <name>pthreadGC2_64.dll</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>milkyway_nbody</app_name>
        <version_num>40</version_num>
        <plan_class>mt</plan_class>
        <avg_ncpus>4</avg_ncpus>
        <max_ncpus>4</max_ncpus>
        <cmdline>--nthreads=4</cmdline>
        <file_ref>
            <file_name>milkyway_nbody_0.40_windows_x86_64__mt.exe</file_name>
            <main_program/>
        </file_ref>
        <file_ref>
            <file_name>libgomp_64-1.dll</file_name>
        </file_ref>
        <file_ref>
            <file_name>pthreadGC2_64.dll</file_name>
        </file_ref>
    </app_version>
    <app>
        <name>milkyway</name>
        <user_friendly_name>Milkyway@home Separation</user_friendly_name>
    </app>
    <file_info>
        <name>milkyway_0.52_windows_intelx86__cuda_opencl.exe</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>milkyway</app_name>
        <version_num>52</version_num>
        <plan_class>cuda_opencl</plan_class>
        <avg_ncpus>0.15</avg_ncpus>
        <max_ncpus>0.30</max_ncpus>
        <flops>1.0e11</flops>
        <coproc>
            <type>CUDA</type>
            <count>1</count>
        </coproc>
        <file_ref>
            <file_name>milkyway_0.52_windows_intelx86__cuda_opencl.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>
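
A quick way to rule out simple mistakes in a file like this is to check it for internal consistency. Below is a minimal Python sketch, assuming the layout shown above (app, file_info and app_version elements directly under app_info). The function name check_app_info and the three checks are my own illustration, not a BOINC tool, and the rule that --nthreads should match avg_ncpus is only an assumed heuristic. Passing these checks does not prove the file is what BOINC expects, and it says nothing about whether the listed files actually exist in the project directory.

```python
import re
import xml.etree.ElementTree as ET


def check_app_info(xml_text):
    """Return a list of human-readable problems found in an app_info.xml string."""
    root = ET.fromstring(xml_text)
    problems = []

    # Names declared at the top level of the file.
    declared_apps = {a.findtext("name", "").strip() for a in root.findall("app")}
    declared_files = {f.findtext("name", "").strip() for f in root.findall("file_info")}

    for av in root.findall("app_version"):
        app_name = av.findtext("app_name", "").strip()

        # Check 1: every app_version names a declared app.
        if app_name not in declared_apps:
            problems.append(f"app_version refers to undeclared app {app_name!r}")

        # Check 2: every file_ref has a matching file_info.
        for ref in av.findall("file_ref"):
            fname = ref.findtext("file_name", "").strip()
            if fname not in declared_files:
                problems.append(f"{app_name}: file_ref {fname!r} has no file_info")

        # Check 3 (assumed heuristic): --nthreads agrees with avg_ncpus.
        match = re.search(r"--nthreads=(\d+)", av.findtext("cmdline", ""))
        avg = av.findtext("avg_ncpus")
        if match and avg is not None and int(match.group(1)) != round(float(avg)):
            problems.append(
                f"{app_name}: --nthreads={match.group(1)} but avg_ncpus={avg.strip()}"
            )

    return problems
```

To use it, read the file into a string and print whatever comes back, for example `print(check_app_info(open("app_info.xml").read()))`. An empty list only means none of these three checks fired.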
Team Alliance francophone, boinc: 7.0.18

GA-P55-UD5, i7 860, Win 7 64 bits, 8g DDR3, GTX 470
Zoltan

Joined: 3 May 11
Posts: 2
Credit: 129,332
RAC: 0
Message 48482 - Posted: 5 May 2011, 9:41:09 UTC

All N-body simulations are failing for me. It's a 24 core machine + HT, with 4 GPUs.

Should I try disabling the GPUs, or are there any known solutions?
Toppie*

Joined: 28 Mar 09
Posts: 68
Credit: 1,003,982,681
RAC: 0
Message 48502 - Posted: 6 May 2011, 6:41:19 UTC - in response to Message 48482.  

All N-body simulations are failing for me. It's a 24 core machine + HT, with 4 GPUs.

Should I try disabling the GPUs, or are there any known solutions?


Same here. They are erroring out on the four machines which run MW: a Core2 quad without a GPU (Win XP), a Core2 quad with a 5870 GPU (Vista), a Core2 quad with hyperthreading with 5870 x2 (Win7), and an i7 2600K with a 6970 GPU (Win7). I am not sure, but I doubt any N-body workunits are being completed. At least they error out immediately; previously BOINC was stuck in limbo with them for up to 24 hrs, until I noticed it and aborted the w/u.

So..."everything comes to he who waits."
