Nbody 1.04

Message boards : News : Nbody 1.04

Miklos M
Joined: 29 Dec 11
Posts: 26
Credit: 1,462,682,655
RAC: 9,527
Message 56766 - Posted: 6 Jan 2013, 14:20:51 UTC - in response to Message 56759.  

It is bad on Windows too. Ties up the NVIDIA Card for too long.
ID: 56766

Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 56767 - Posted: 6 Jan 2013, 14:38:21 UTC - in response to Message 56766.  

It is bad on Windows too. Ties up the NVIDIA Card for too long.

I don't understand that remark. According to the applications page there is only one app_version installed for Windows, and it doesn't have the (entirely inappropriate) opencl plan classes that can identify NBody as a GPU application under Linux.

Certainly, there are some long tasks (WU 291391225 has reached 87% after 28 hours on my Windows machine), but the Windows NBody app uses CPU resources only.
ID: 56767

Ronald R Codney
Joined: 29 Nov 11
Posts: 18
Credit: 815,433
RAC: 0
Message 56770 - Posted: 6 Jan 2013, 17:41:20 UTC

Nbody computation error after 3 hrs; stderr report:
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
- exit code -1073740940 (0xc0000374)
</message>
<stderr_txt>
<search_application> milkyway_nbody 1.04 Windows x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 1 max threads on a system with 4 processors
Warning: not applying timestep correction for workunit with min version 0.80
Using OpenMP 1 max threads on a system with 4 processors
Using OpenMP 1 max threads on a system with 4 processors
<search_likelihood>-62200.827114903703000</search_likelihood>

</stderr_txt>
]]>


No GPU here, so why is it being sent to me at all?
ID: 56770

Jake Bauer
Project developer
Project tester
Project scientist
Joined: 20 Aug 12
Posts: 66
Credit: 406,916
RAC: 0
Message 56771 - Posted: 6 Jan 2013, 19:29:46 UTC

Please report the work unit numbers of some work units that error. We are currently looking at several issues. There will be an update very soon.

Thanks to all for their help and feedback. We are working to have these issues resolved promptly.

Jake
ID: 56771

Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 56773 - Posted: 6 Jan 2013, 19:43:00 UTC - in response to Message 56771.  

WU 290674321

That's the only WU which has errored for me this run (knock on wood), but all four replications failed - that seems significant.
ID: 56773

Ray Murray
Joined: 8 Oct 07
Posts: 24
Credit: 111,325
RAC: 0
Message 56774 - Posted: 6 Jan 2013, 20:24:42 UTC
Last modified: 6 Jan 2013, 21:07:13 UTC

Only WU 290721561 errored for me out of 47 completed.
Two wingmen got the same -1073741571 (0xffffffffc00000fd) "Unknown error number", and one got -1073741515 (0xffffffffc0000135) "Unknown error number" with zero runtime.

2 other WUs are so far inconclusive.
ID: 56774

Jake Bauer
Project developer
Project tester
Project scientist
Joined: 20 Aug 12
Posts: 66
Credit: 406,916
RAC: 0
Message 56775 - Posted: 6 Jan 2013, 20:30:30 UTC

Thank you.

Also, in regards to insanely long WU times, I believe they may just be estimated poorly. I have one that says 52 hours to completion that did 1% in 5 minutes. 500 minutes is clearly not 52 hours. Let them run unless they actually have been running for 30+ hours or something ridiculous. I don't know why the estimates are wrong though...

Jake
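
A minimal sketch of that back-of-the-envelope extrapolation, assuming progress really is roughly linear for these tasks (as it appears to be); the function name is just for illustration:

```python
def extrapolate_runtime(elapsed_seconds: float, fraction_done: float) -> float:
    """Project total runtime by assuming progress is linear."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_seconds / fraction_done

# The example above: 1% done after 5 minutes -> ~500 minutes total, not 52 hours.
total = extrapolate_runtime(elapsed_seconds=5 * 60, fraction_done=0.01)
print(f"projected total: {total / 60:.0f} minutes ({total / 3600:.1f} hours)")
```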
ID: 56775

Ray Murray
Joined: 8 Oct 07
Posts: 24
Credit: 111,325
RAC: 0
Message 56776 - Posted: 6 Jan 2013, 20:45:05 UTC
Last modified: 6 Jan 2013, 21:06:39 UTC

This one has gone 39 hrs with about 2 hrs left. Progress % is increasing (and it's checkpointing) and the estimate to finish is decreasing (although not in step with elapsed time), so it isn't stuck, and I'm just letting it run to completion. It hasn't been sent to anyone else yet.

It's a de_nbody_100K_104 rather than ps_.....
ID: 56776

Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 56777 - Posted: 6 Jan 2013, 21:29:06 UTC - in response to Message 56775.  

Thank you.

Also, in regards to insanely long WU times, I believe they may just be estimated poorly. I have one that says 52 hours to completion that did 1% in 5 minutes. 500 minutes is clearly not 52 hours. Let them run unless they actually have been running for 30+ hours or something ridiculous. I don't know why the estimates are wrong though...

Jake

WU 291391225 ran for 32 hours, completed and validated. (After an initial estimate of 120 hours)

I've been trying to work out what's going on with the estimation, but failing utterly.

My host is an i7-3770K, clocked at 4.5 GHz. The application details page says that for NBody tasks, it has an APR of 1613.9 (1.6 THz!) - the APR for standard Milkyway tasks is a much more sedate 6.58 GHz.

The <app_version> section of my client_state.xml for NBody v104 contains <flops>44345657128.456566</flops> or 44.3 GHz. That figure, at least, seems to be unchanging, and coupled with the <rsc_fpops_est>19198700000000000.000000</rsc_fpops_est> value for my 32 hour task gives the original estimate of 120 hours.

But the <rsc_fpops_est> values seem to be growing exponentially. I've just received a <rsc_fpops_est>12972200000000000.000000</rsc_fpops_est> (81 hours) estimate for WU 292161163, but the running speed (15% in 10 minutes - at least progress is linear) suggests it will finish in little over an hour.

I fear you may be running into some of the boundary condition handling problems of http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation.
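
For reference, both of those estimates can be reproduced from the client_state.xml values quoted above; this is a minimal sketch assuming the initial estimate is simply <rsc_fpops_est> divided by <flops> (any duration correction the client applies is ignored):

```python
def estimate_hours(rsc_fpops_est: float, flops: float) -> float:
    """Initial runtime estimate in hours, assuming estimate = fpops / (flops per second)."""
    return rsc_fpops_est / flops / 3600.0

flops = 44345657128.456566  # <flops> from client_state.xml, ~44.3 GFLOPS

# WU 291391225: ~120-hour initial estimate (actual run was ~32 hours)
print(estimate_hours(19198700000000000.0, flops))
# WU 292161163: ~81-hour initial estimate (actual pace suggested little over an hour)
print(estimate_hours(12972200000000000.0, flops))
```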
ID: 56777

Jeffery M. Thompson
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 23 Sep 12
Posts: 159
Credit: 16,977,106
RAC: 0
Message 56778 - Posted: 7 Jan 2013, 1:54:12 UTC

OK, we have another binary to take care of some checkpointing issues that may be related to some of these errors. I am in the process of testing it. Specifically, computational errors and long computation times may be addressed by it. The GPU resources issue we will need to explore collectively a bit more; I do not have an answer for that at this time.

I will continue to post details here until that binary is released, and we will then start a separate thread. I am using this thread and the reported work unit data to make a list of the errors and systems involved.

Jeff Thompson
ID: 56778

Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 56781 - Posted: 7 Jan 2013, 15:04:13 UTC

Just recorded the first "exit code -1073740940 (0xc0000374)" for this host, this run - WU 292476933.

The only thing different from all the others was that BOINC restarted while the task was active:

07/01/2013 14:37:20 | | Starting BOINC client version 7.0.42 for windows_x86_64
07/01/2013 14:37:21 | | Running CPU benchmarks
07/01/2013 14:37:54 | Milkyway@Home | Restarting task de_nbody_105_1356215205_156148_0 using milkyway_nbody version 104 in slot 7
07/01/2013 14:45:49 | Milkyway@Home | [sched_op] Deferring communication for 1 min 51 sec
07/01/2013 14:45:49 | Milkyway@Home | [sched_op] Reason: Unrecoverable error for task de_nbody_105_1356215205_156148_0
07/01/2013 14:45:49 | | [work_fetch] Request work fetch: application exited
07/01/2013 14:45:49 | Milkyway@Home | Computation for task de_nbody_105_1356215205_156148_0 finished

Maybe this (Windows) version of the app has problems re-initialising memory when restarting from a checkpoint?

"The exception code 0xc0000374 indicates a heap corruption"
ID: 56781

Alinator
Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 56783 - Posted: 7 Jan 2013, 17:34:11 UTC
Last modified: 7 Jan 2013, 17:34:56 UTC

Well, the hosts I have which can run nBody have had a pretty easy time running them so far (only CPU ones at this point).

My only comment is that since they are supposed to be MT apps, it's probably not a good idea to let more than one of them run at a time. Of course that could depend on a given host's local prefs, but I'm referring to a more or less 'default' settings host here.

Previous versions that worked would take over the CPU for the duration of their run when they came up in the task queue, and more than one nBody task NEVER ran at the same time. This is not the case for 1.04.

Al
ID: 56783

Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 56784 - Posted: 7 Jan 2013, 18:15:55 UTC

OK, may I claim the record for the highest runtime estimate so far? The third task below is for WU 290835322:

[screenshot of the BOINC task list with the runtime estimates - not preserved]

(it has <rsc_fpops_est>480966000000000000.000000</rsc_fpops_est>)

And, while I'm at it, a n00b question prompted by a colleague: with no uploaded result data file, what science do you get at the end of all this work? Could I simply paste in

<search_likelihood>-6438.250734594399500</search_likelihood>

from my wingmate's result, and save myself 269,046.80 CPU seconds? Neither of us can find any other data returned by this app to the project.
ID: 56784

Jake Bauer
Project developer
Project tester
Project scientist
Joined: 20 Aug 12
Posts: 66
Credit: 406,916
RAC: 0
Message 56786 - Posted: 7 Jan 2013, 21:03:41 UTC - in response to Message 56784.  

The number you are reporting is a likelihood of fit. We give you a set of parameters which are then used by n-body on your computer to generate a model of a stellar stream. It is then compared to the input model. The number reported is actually the EMD (earth mover distance) of the comparison. In principle, we should be able to use n-body to take real data and fit orbital parameters to stellar streams. This is very useful for the science we are doing.

I hope this answers your question.

Jake
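
As a toy illustration only (the project's actual comparison of binned stellar-stream data is not shown here, and the distributions below are made up), SciPy's one-dimensional Wasserstein distance gives the flavour of an earth mover distance comparison:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
observed = rng.normal(loc=0.0, scale=1.0, size=5000)   # stand-in for the input model
simulated = rng.normal(loc=0.2, scale=1.1, size=5000)  # stand-in for the n-body output

# A smaller EMD means the simulated distribution matches the input model more closely.
print(wasserstein_distance(observed, simulated))
```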
ID: 56786

Mr6686
Joined: 15 Oct 12
Posts: 3
Credit: 18,270,747
RAC: 0
Message 56790 - Posted: 8 Jan 2013, 14:40:03 UTC

For me, the same error twice (exit code -1073740940 (0xc0000374)), with WU 292055375 and WU 291398862.
ID: 56790

Miklos M
Joined: 29 Dec 11
Posts: 26
Credit: 1,462,682,655
RAC: 9,527
Message 56792 - Posted: 8 Jan 2013, 16:20:55 UTC - in response to Message 56767.  

My mistake, they are running on the CPUs only. Thanks for pointing it out.
ID: 56792

Overtonesinger
Joined: 15 Feb 10
Posts: 63
Credit: 1,836,010
RAC: 0
Message 56794 - Posted: 9 Jan 2013, 10:03:59 UTC - in response to Message 56784.  

Yes, you can have this RECORD. Very nice, btw! :)

I also have one of those strangely long WUs.

My highest estimated number of hours for an N-Body 1.04 task is only 558 hours.

But as it had done 1 percent after 12 hours of CPU time, I guess it will actually be done after 120 hours on one logical core of a Core i7 720QM (1.73 GHz at 8 threads... up to 2.8 GHz TurboBoost with one thread). ... Sometimes I even let her RUN alone, to speed her up ;)
----------------------------------------------------------------

And there is also HIGH probability that it will error out in the end... like on the other computer!

See it here:
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=291600433

Is it normal?
Because, to me, this work unit does not seem normal at all. :)

Melwen - Child of the Fangorn Forest
Rig "BRISINGR" [ASUS G73-JH, i7 720QM 1.73, 4x2GB DDR3 1333 CL7, ATi HD5870M 1GB GDDR5],bought on 2011-02-24
ID: 56794

Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 56797 - Posted: 9 Jan 2013, 15:44:11 UTC - in response to Message 56794.  

Well, my record-breaker finished successfully in 150602 seconds (42 hours), but has been declared inconclusive despite agreeing with my wingmate to 19 significant digits. Now some other poor sucker (sorry, Shane!) gets to spend three days calculating -6438.250734594399500 all over again.

My task finished without error, but I don't think that's a significant feature of the WU itself. Rather, I was careful not to restart BOINC during all that time: the task was pre-empted a couple of times, but kept in memory - so all it needed was a simple "resume" from the memory image, rather than a full "restart" from the checkpoint file.

I'll test that theory of mine more fully in due course, but for the time being I've got WU 292870082 estimating 50 days still to run (and on target for a 60-hour run, after 45% progress). That'll delay me upgrading to BOINC v7.0.44 for a couple of days - see you all on Friday.

BTW, we're beginning to get an idea where these estimates are coming from. I've posted before about the absurd APR (speed in gigaflops) value my host is getting - today it's up to 1696.96.

The server is supposed to tweak the task estimates by manipulating both <rsc_fpops_est> (workunit size) and <flops> (processor speed). It looks as if the size of the WU was calculated to suit a 1600 GFLOPS processor - it said roughly half an exa-fpop - but the transmitted <flops> was subjected to the John McEnroe sanity clause ("You cannot be serious!") and capped at (exactly, to 16 digits) ten times the CPU's Whetstone benchmark. Half an exa-fpop at 44.3 gigaflops gives the 3,015 hour estimate I screen-grabbed.
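
A minimal sketch of that capping theory; note that the 10x factor and the Whetstone figure are inferred from the posts above, not taken from the BOINC server code:

```python
def capped_flops(apr_gflops: float, whetstone_gflops: float, cap_factor: float = 10.0) -> float:
    """Apply the deduced sanity cap: use the APR unless it exceeds cap_factor x Whetstone."""
    return min(apr_gflops, cap_factor * whetstone_gflops) * 1e9

# Figures from the posts above: APR ~1696.96 GFLOPS, Whetstone ~4.43 GFLOPS.
flops = capped_flops(apr_gflops=1696.96, whetstone_gflops=4.4345657128456566)
hours = 480966000000000000.0 / flops / 3600.0
# Prints ~44.3 GFLOPS and ~3,000 hours, in line with the screen-grabbed estimate.
print(f"capped speed: {flops / 1e9:.1f} GFLOPS, estimate: {hours:.0f} hours")
```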
ID: 56797

Miklos M
Joined: 29 Dec 11
Posts: 26
Credit: 1,462,682,655
RAC: 9,527
Message 56800 - Posted: 9 Jan 2013, 19:05:25 UTC

I am curious: what are we accomplishing by running these units on our computers rather than on others, when they end up in errors? Also, I am wondering when replacements that do not end in so many errors will be sent out.
ID: 56800

Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 56801 - Posted: 9 Jan 2013, 19:19:32 UTC - in response to Message 56800.  

... what are we accomplishing ...

Testing, and helping the developers find and eradicate any remaining bugs.

State: All (20) | In progress (1) | Pending (7) | Valid (10) | Invalid (1) | Error (1)

for host 465695

Only one error here, which I reported. Try not to close down BOINC while a MW task is in progress, until they fix it.
ID: 56801