MilkyWay@Home N-Body Simulation errors...
Author | Message |
---|---|
Joined: 13 Feb 11 Posts: 31 Credit: 1,403,524,537 RAC: 0 |
Name: de_nbody_orphan_test_2model_4_11216_1304343747. Application: MilkyWay@Home N-Body Simulation. All failed! dunx |
Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0 |
See here: http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=2293&nowrap=true#48417 It seems to not happen if you aren't using app_info.xml |
Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0 |
Hmmm... I have two Phenom II 955s and one 945. One 955 is running XPP x64 and the other W7U x64 (similar to Techbird), and the 945 is running XPP x64, all with no app_info. Just for reference, all my hosts run multiple CPU projects, in case that makes any difference. So far, the only task to complete successfully was on the host running W7U (yesterday morning). 10 consecutive tries on the XPP hosts have failed with code 128 (No child processes to wait for). I went and took a look at the rest of the N-Body tasks assigned to my hosts, and most of them have multiple failures on a variety of platforms (hardware and OS) with a number of different error codes (code 177, code 128, code 185, code 226). So it looks like something really screwy is going on! ;-) Unfortunately, since you're set to insta-poof the valid tasks, it's hard to get a feel for which platforms are successfully completing N-Body from our end of the equation. :-( |
Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0 |
Hmmm... The -177 one I think is the maximum time exceeded problem, which is this one. The others I think are all weird system problems, in particular the 128 no child processes one. There seem to be lots of references to it on the internet (like this: http://setiathome.berkeley.edu/forum_thread.php?id=31443). |
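For anyone trying to sort their own failures, here is a rough lookup of the codes mentioned in this thread. It is only a sketch: the BOINC names are as I recall them from the client's `lib/error_numbers.h`, and 128 is the Windows system error `ERROR_WAIT_NO_CHILDREN`, so check both against the actual sources before relying on it. The `describe` helper is made up for illustration.

```python
# Exit codes seen in this thread and what they most likely mean.
# Assumption: BOINC names recalled from lib/error_numbers.h, not verified here.
KNOWN_CODES = {
    -177: "ERR_RSC_LIMIT_EXCEEDED: task exceeded a resource limit (e.g. max elapsed time)",
    -185: "ERR_RESULT_START: the client could not start the task",
    -226: "ERR_TOO_MANY_EXITS: the app exited too many times without finishing",
    128: "ERROR_WAIT_NO_CHILDREN: Windows error, no child processes to wait for",
}


def describe(code: int) -> str:
    """Return a short description for an exit code (hypothetical helper).

    Posts often quote BOINC codes without the minus sign ("code 177"),
    so an unknown positive code is also tried as its negative.
    """
    if code in KNOWN_CODES:
        return KNOWN_CODES[code]
    if -code in KNOWN_CODES:
        return KNOWN_CODES[-code]
    return "unknown code %d" % code


if __name__ == "__main__":
    for c in (177, 128, 185, 226):
        print(c, "->", describe(c))
```

Note that 128 is the odd one out: it comes from the operating system, not from BOINC's own error list, which fits the "weird system problem" reading.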
Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0 |
"The -177 one I think is the maximum time exceeded problem, which is this one. The others I think are all weird system problems, in particular the 128 no child processes one. There seem to be lots of references to it on the internet (like this: http://setiathome.berkeley.edu/forum_thread.php?id=31443)." Agreed, and I believe you commented earlier that there were some N-Body tasks with faulty FLOP estimates, which would explain those. However, all the other codes tend to indicate N-Body is failing to start and/or initialize at all (the 128), or some time soon thereafter (all the others I've seen so far). <edit> I just checked my hosts' status, and the W7U 955 has completed another N-Body successfully. Validation was inconclusive, so it might stick around a while and you can look at it. Here's the WU. Interestingly, the failed task for the first wingman was a 177 on a C2D P8700 running XPP x86 SP3. |
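To make the "faulty FLOP estimate" explanation concrete, here is a minimal sketch of how the time limit works as I understand it: the client allows roughly `rsc_fpops_bound / host_flops` seconds and aborts the task (the -177) once elapsed time passes that. The function names and the numbers are hypothetical, and the real client applies more detail than this.

```python
def max_elapsed_seconds(rsc_fpops_bound: float, host_flops: float) -> float:
    """Approximate time limit: the workunit's FLOP bound over the host's speed."""
    if host_flops <= 0:
        raise ValueError("host_flops must be positive")
    return rsc_fpops_bound / host_flops


def would_be_aborted(actual_fpops: float, rsc_fpops_bound: float,
                     host_flops: float) -> bool:
    """True if the real work needed would run past the allowed time."""
    runtime = actual_fpops / host_flops
    return runtime > max_elapsed_seconds(rsc_fpops_bound, host_flops)


if __name__ == "__main__":
    # Hypothetical workunit: bound of 1e13 FLOPs on a 1 GFLOPS host.
    print(max_elapsed_seconds(1e13, 1e9))      # allowed seconds
    # If the task really needs 2e13 FLOPs, it runs out of time.
    print(would_be_aborted(2e13, 1e13, 1e9))
```

So a workunit whose bound was set too low for the work it actually contains would fail this way on any host, regardless of OS or GPU load.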
Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0 |
BOINC!!!!!! Duhhhh...... Alinator! I may have the answer. The only host I have currently completing N-Body tasks is the only one which doesn't have a GPU running tasks at all, due to being a W7U host with BOINC running as a service. My hypothesis at this point is that when the MT NB task wants to run, the concurrently running GPU apps refuse to relinquish the core controlling them, and thus the NB task fails to initialize properly. Depending on just how a host is configured and what it's running at the time, the result would appear to range from not being able to start at all to failing somewhere else in the initialization chain of events. <edit> I just checked the WU I posted about earlier, and now I'm not so confident about my hypothesis. The new wingman is running W7HP x64 with no GPUs. It looks like it might be a while before the task actually tries to run, so I'll have to wait and see. |
Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0 |
Well, scratch the hypothesis. I just had an N-Body task complete successfully on the Ph II 945 XPP x64 w/ GPU. Back to ground zero on this mystery! :-) <edit> Correction: I was looking at the wrong host, so the hypothesis is still in play. The 945 w/ GPU *did* fail the task. Sorry. |
Joined: 28 Mar 09 Posts: 68 Credit: 1,003,982,681 RAC: 0 |
"Well scratch the hypothesis. I just had an N-Body complete on the Ph II 945 XPP x64 w/ GPU successfully." I had a similar problem on one of my machines. This machine was running Vista 64, and each and every N-body task errored out. It even happened with the previous batch of N-body work (before the outage). Yesterday I upgraded to Win7 64 and the problem has gone away. I am not a Windows expert, but something I noticed after upgrading was that Windows was repairing the .NET 4.0 framework, which took a few minutes. I don't know what the role of the .NET framework is, or whether it had any part in my problem. At least that machine can crunch happily now. |
Joined: 5 Apr 09 Posts: 71 Credit: 6,120,786 RAC: 0 |
Weird, because a few days ago it worked fine with my app_info file, and now everything ends in error... It must come from the file, but I do not see why?

    <app_info>
        <app>
            <name>milkyway_nbody</name>
            <user_friendly_name>MilkyWay@Home nbody Simulation</user_friendly_name>
        </app>
        <file_info>
            <name>milkyway_nbody_0.40_windows_x86_64__mt.exe</name>
            <executable/>
        </file_info>
        <file_info>
            <name>libgomp_64-1.dll</name>
            <executable/>
        </file_info>
        <file_info>
            <name>pthreadGC2_64.dll</name>
            <executable/>
        </file_info>
        <app_version>
            <app_name>milkyway_nbody</app_name>
            <version_num>40</version_num>
            <plan_class>mt</plan_class>
            <avg_ncpus>4</avg_ncpus>
            <max_ncpus>4</max_ncpus>
            <cmdline>--nthreads=4</cmdline>
            <file_ref>
                <file_name>milkyway_nbody_0.40_windows_x86_64__mt.exe</file_name>
                <main_program/>
            </file_ref>
            <file_ref>
                <file_name>libgomp_64-1.dll</file_name>
            </file_ref>
            <file_ref>
                <file_name>pthreadGC2_64.dll</file_name>
            </file_ref>
        </app_version>
        <app>
            <name>milkyway</name>
            <user_friendly_name>Milkyway@home Separation</user_friendly_name>
        </app>
        <file_info>
            <name>milkyway_0.52_windows_intelx86__cuda_opencl.exe</name>
            <executable/>
        </file_info>
        <app_version>
            <app_name>milkyway</app_name>
            <version_num>52</version_num>
            <plan_class>cuda_opencl</plan_class>
            <avg_ncpus>0.15</avg_ncpus>
            <max_ncpus>0.30</max_ncpus>
            <flops>1.0e11</flops>
            <coproc>
                <type>CUDA</type>
                <count>1</count>
            </coproc>
            <file_ref>
                <file_name>milkyway_0.52_windows_intelx86__cuda_opencl.exe</file_name>
                <main_program/>
            </file_ref>
        </app_version>
    </app_info>

Team Alliance francophone, boinc: 7.0.18 GA-P55-UD5, i7 860, Win 7 64 bits, 8g DDR3, GTX 470 |
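When an app_info.xml that used to work starts failing, one cheap thing to rule out is an internal mismatch in the file itself. The sketch below is only a sanity check under my own assumptions: `check_app_info` is a made-up helper, it looks only at whether each `<app_version>` points at a declared `<app>`, whether every `<file_ref>` has a matching `<file_info>`, and whether exactly one `<main_program/>` is set. It does not check that the files exist on disk or that the client accepts the file.

```python
import xml.etree.ElementTree as ET


def check_app_info(xml_text):
    """Return a list of internal-consistency problems (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    problems = []
    # Names declared at the top level of <app_info>.
    apps = {e.findtext("name") for e in root.findall("app")}
    files = {e.findtext("name") for e in root.findall("file_info")}

    for ver in root.findall("app_version"):
        app_name = ver.findtext("app_name")
        if app_name not in apps:
            problems.append("app_version refers to undeclared app %r" % app_name)
        refs = ver.findall("file_ref")
        for ref in refs:
            fname = ref.findtext("file_name")
            if fname not in files:
                problems.append("file_ref %r has no matching file_info" % fname)
        mains = [r for r in refs if r.find("main_program") is not None]
        if len(mains) != 1:
            problems.append("app_version for %r has %d main programs"
                            % (app_name, len(mains)))
    return problems


# Tiny made-up example: the DLL is referenced but never declared.
SAMPLE = """
<app_info>
  <app><name>demo_app</name></app>
  <file_info><name>demo.exe</name><executable/></file_info>
  <app_version>
    <app_name>demo_app</app_name>
    <version_num>1</version_num>
    <file_ref><file_name>demo.exe</file_name><main_program/></file_ref>
    <file_ref><file_name>helper.dll</file_name></file_ref>
  </app_version>
</app_info>
"""

if __name__ == "__main__":
    for problem in check_app_info(SAMPLE):
        print(problem)
```

To run it against a real file, read the file's text and pass it to `check_app_info`. An empty list only means the cross-references line up, not that the configuration is right for the host.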
Joined: 3 May 11 Posts: 2 Credit: 129,332 RAC: 0 |
All N-body simulations are failing for me. It's a 24-core machine + HT, with 4 GPUs. Should I try disabling the GPUs, or are there any known solutions? |
Joined: 28 Mar 09 Posts: 68 Credit: 1,003,982,681 RAC: 0 |
"all N-body simulations are failing for me. It's a 24 core machine + HT, with 4 GPUs." Same here. They are erroring out on all four of my machines that run MW: a Core2 quad without a GPU (Win XP), a Core2 quad with a 5870 GPU (Vista), a Core2 quad with hyperthreading and 5870 x2 (Win7), and an i7 2600K with a 6970 GPU (Win7). I am not sure, but I doubt any N-body workunits are being completed. At least they error out immediately now; previously BOINC was stuck in limbo with them for up to 24 hrs, until I noticed and aborted the w/u. So... "everything comes to he who waits." |
©2024 Astroinformatics Group