Nbody 1.06

Jeffery M. Thompson
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 23 Sep 12
Posts: 145
Credit: 12,420,708
RAC: 4,419

Message 57073 - Posted: 29 Jan 2013, 22:22:11 UTC

I have released a version of the nbody binaries to address the large work unit sizes and the errors with checkpoint files. Jake will be starting some runs to test the values generated. I am looking into the issue with threads on Windows 64-bit systems.


Jeff Thompson

Richard Haselgrove
Joined: 4 Sep 12
Posts: 218
Credit: 448,778
RAC: 0

Message 57074 - Posted: 29 Jan 2013, 22:35:53 UTC - in response to Message 57073.

I have released a version of the nbody binaries to address the large work unit sizes and the errors with checkpoint files. Jake will be starting some runs to test the values generated. I am looking into the issue with threads on Windows 64-bit systems.

Is this app supposed to be

1) Multithreaded?
2) CPU only?

You (or your BOINC Administrator) have defined Linux OpenCL plan_classes for "amd_ati" and "nvidia" - again.

And no MT plan_classes, for any platform, yet.

Richard Haselgrove

Message 57075 - Posted: 29 Jan 2013, 23:32:45 UTC
Last modified: 29 Jan 2013, 23:35:50 UTC

To add to that, I'm getting

signature verification failed

on

<download_url>http://milkyway.cs.rpi.edu/milkyway/download/milkyway_nbody_1.06_windows_x86_64__mt.exe</download_url>

Edit - tasks 391488628, 391487385

Jeffery M. Thompson

Message 57076 - Posted: 30 Jan 2013, 1:21:21 UTC

Thanks, looking at the signature file now.

Jeffery M. Thompson

Message 57077 - Posted: 30 Jan 2013, 5:05:02 UTC

I updated the signature file; it now verifies against the public key. Can you test via the interface again?

I will work with the admin to get the plan classes populated.

All the nbody binaries are compiled against OpenMP to be multithreaded.

Jeff

Richard Haselgrove

Message 57078 - Posted: 30 Jan 2013, 9:41:06 UTC - in response to Message 57077.

... can you test via the interface again.

Will do, but can you ask Jake to (re-)submit a batch of tasks, please? I think all last night's must have errored out and emptied the pool.

Richard Haselgrove

Message 57079 - Posted: 30 Jan 2013, 9:58:58 UTC

It's OK - managed to get some resends, and the app has downloaded properly. But from WU 297041595, it appears that the validator thinks that

-inf

is different from

-1.#INF00000000000
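For illustration only, here is a minimal Python sketch of how a comparison step could treat the two spellings as the same value. I do not know how the project's validator actually compares results; the function names `parse_result_value` and `results_match` and the tolerance are assumptions made up for this example. Older Microsoft C runtimes print infinities as `1.#INF` (padded with zeros to the requested precision), while other platforms print `inf`.

```python
import math


def parse_result_value(text: str) -> float:
    """Parse a printed result, accepting both 'inf' and MSVC-style '1.#INF' spellings."""
    token = text.strip()
    lowered = token.lower()
    # Older Microsoft C runtimes print "1.#INF" / "-1.#INF" for infinities.
    if "#inf" in lowered:
        return -math.inf if lowered.startswith("-") else math.inf
    # The same runtimes print "#IND", "#QNAN" or "#SNAN" for NaN values.
    if "#ind" in lowered or "#qnan" in lowered or "#snan" in lowered:
        return math.nan
    # float() already understands "inf", "-inf", "nan" and ordinary numbers.
    return float(token)


def results_match(a: str, b: str, rel_tol: float = 1e-9) -> bool:
    """Hypothetical comparison: equal if both are the same non-finite value or numerically close."""
    x, y = parse_result_value(a), parse_result_value(b)
    if math.isnan(x) or math.isnan(y):
        return math.isnan(x) and math.isnan(y)
    if math.isinf(x) or math.isinf(y):
        return x == y
    return math.isclose(x, y, rel_tol=rel_tol)
```

With this sketch, `results_match("-inf", "-1.#INF00000000000")` is true, while a positive `inf` against the same string is not.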

Jeffery M. Thompson

Message 57082 - Posted: 30 Jan 2013, 12:51:20 UTC

I will get Jake to start a new run.

And look at that work unit.

Thank you


Jeff

Richard Haselgrove

Message 57083 - Posted: 30 Jan 2013, 12:58:31 UTC - in response to Message 57082.

I will get Jake to start a new run.

And look at that work unit.

Thank you


Jeff

Have a look at WU 296570890 while you're at it.

Four successful returns: two 'inf', two '-1.#INF00000000000' - and all four still inconclusive.

Ray Murray
Joined: 8 Oct 07
Posts: 24
Credit: 100,180
RAC: 0

Message 57089 - Posted: 30 Jan 2013, 19:32:51 UTC

All 5 of my 1.06s failed on startup with -1073741511 (0xffffffffc0000139) Unknown error number
I was going to try a project reset in case the new ones had missed something in the download but I've got a couple of ordinary 1.00s running so won't be able to reset for a couple of hours until they've finished.

Win 7 (64)

Jeffery M. Thompson

Message 57094 - Posted: 31 Jan 2013, 1:13:56 UTC

I am seeing these being returned, but only on Windows clients. I am checking the files to make sure the support DLLs are coming down with the exe.

Thank you

Ronald R Codney
Joined: 29 Nov 11
Posts: 18
Credit: 815,433
RAC: 0

Message 57096 - Posted: 31 Jan 2013, 8:57:00 UTC

Received 5 N-body tasks and all of them ended with "computation error".

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
- exit code -1073741511 (0xc0000139)
</message>
]]>

This is the STDERR output for one of them. It is the same for all 5.

Ray Murray

Message 57097 - Posted: 31 Jan 2013, 9:45:40 UTC - in response to Message 57094.

Did the reset just now and got 3 more (estimates of 40s - 40hrs) and got the same error.
The accompanying downloads were:

libgomp_64-1_nbody_1.06.dll
pthreadGC2_64_nbody_1.06.dll
milkyway_nbody_1.06_windows_x86_64__mt.exe

Is that what you expect or should there be something else?

Richard Haselgrove

Message 57098 - Posted: 31 Jan 2013, 11:06:55 UTC - in response to Message 57094.

I am seeing these being returned, but only on Windows clients. I am checking the files to make sure the support DLLs are coming down with the exe.

Thank you

OK, some suggestions. Error 0xc0000139 is 'Entry point not found' - usually a DLL error, but more likely a wrong version rather than a completely missing DLL.
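As a quick check that the three spellings seen in this thread are the same code, here is a small Python sketch. The helper name `to_ntstatus` and the lookup table are illustrative assumptions; the two status names in the table are the standard Windows NTSTATUS names for those values.

```python
# Windows exit codes are 32-bit; BOINC reports show them as signed decimals
# (or sign-extended to 64 bits), so mask down to 32 bits before looking them up.
KNOWN_STATUS = {
    0xC0000135: "STATUS_DLL_NOT_FOUND",         # a required DLL is missing
    0xC0000139: "STATUS_ENTRYPOINT_NOT_FOUND",  # DLL present, but an expected function is not
}


def to_ntstatus(exit_code: int) -> int:
    """Reduce a signed or sign-extended exit code to its unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code: int) -> str:
    status = to_ntstatus(exit_code)
    name = KNOWN_STATUS.get(status, "unknown status")
    return f"0x{status:08x} ({name})"


print(describe(-1073741511))         # decimal form from the task report
print(describe(0xFFFFFFFFC0000139))  # sign-extended 64-bit form
```

Both calls reduce to 0xc0000139, which is why a wrong DLL version is a more likely cause than a missing DLL.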

I've just tried reactivating host 479865. The new DLLs downloaded correctly, and the first task ran OK. But...

Checking with Process Explorer, I found that the application was using copies of

libgomp_64-1.dll (48,128 bytes)
pthreadGC2_64.dll (49,664 bytes)

which I had downloaded manually and placed in the project directory when the relaunch of NBody was announced - my copies are dated 13 November 2012

When I removed these manual downloads, and tried using the 1.06 versions of the files - 49,152 bytes and 49,152 bytes respectively (??? identical size - suspicious) - the app failed with a 0xc0000139 error.

Ah - the two DLLs are binary identical, according to FC. Finger-fumble with the copying/versioning process?
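For anyone without FC to hand, the same check can be sketched in Python by hashing each file and grouping by digest. The function names here are made up for the example, and the paths you pass in would be wherever your BOINC project directory keeps the downloaded DLLs. Equal size alone proves nothing, so this compares the full contents.

```python
import hashlib


def sha256_of(path) -> str:
    """Hash a file in chunks so large binaries are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical(paths):
    """Return groups of paths whose contents are byte-for-byte identical."""
    by_hash = {}
    for path in paths:
        by_hash.setdefault(sha256_of(path), []).append(str(path))
    return [group for group in by_hash.values() if len(group) > 1]
```

Calling `find_identical` on the two 1.06 DLL paths would return one group containing both names if they really are the same file, and an empty list otherwise.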

Jeffery M. Thompson

Message 57105 - Posted: 31 Jan 2013, 17:54:30 UTC

I pulled the original files from the Windows development box. They have the same size on disk due to block sizes, but they differ; I ran diff to verify. I am currently replacing them on the server and updating the signatures. I will stop the current run and start a new one after I have tested that they are going out correctly.

Profile Ray Murray

Message 57108 - Posted: 31 Jan 2013, 20:05:22 UTC
Last modified: 31 Jan 2013, 20:30:03 UTC

Just did another reset to force download of the new/old dlls and all seems to be working fine again.
I missed the estimate of the first one, which finished in one minute, but of those I've paid attention to so far:
1+3/4 hour estimate finished in 2 mins
4+1/2 hour estimate finished in 6 mins
8 hour estimate finished in 8 mins
10+1/4 hour estimate finished in 10 mins
10+1/2 hour estimate finished in 12 mins
No rogue, huge estimates, and the estimate ticks down rather than up.

A 113-hour estimate just came in. The estimate ticks up between progress steps but drops at each progress update. If there is a similar initial overestimate of c. 60x, then I expect this to finish after about 113 minutes rather than 113 hours, which would sit about right compared to the wingman's faster machine finishing in 89 mins.
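The arithmetic behind that guess can be sketched in a few lines of Python, using the five estimate/actual pairs listed above. The variable and function names are just for this example, and applying one ratio to a new task is an assumption, not something the project guarantees.

```python
from statistics import median

# (estimated hours, actual minutes) for the five tasks listed above
observed = [(1.75, 2), (4.5, 6), (8.0, 8), (10.25, 10), (10.5, 12)]

# Overestimate factor for each task: estimated minutes divided by actual minutes.
ratios = [est_hours * 60 / actual_min for est_hours, actual_min in observed]
typical_ratio = median(ratios)


def projected_minutes(estimate_hours: float, ratio: float) -> float:
    """Scale a new estimate down by an assumed overestimate factor."""
    return estimate_hours * 60 / ratio


print(ratios)
print(typical_ratio)
print(projected_minutes(113, 60))             # using the rough 60x figure
print(projected_minutes(113, typical_ratio))  # using the median of the observed ratios
```

The individual factors range from 45x to 61.5x, so 60x sits at the upper end of what these five tasks show; the median would put the 113-hour task somewhat above two hours.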

Jeffery M. Thompson

Message 57109 - Posted: 31 Jan 2013, 20:24:05 UTC

The functions that calculate the time estimates assume that the parameter space involves a large number of calculations. For some of the extreme parameters we are sweeping over, there may be fewer calculations in that space. As these results come back, the server will target more relevant space and the actual number of calculations will go up.


Jeff

Profile Ray Murray

Message 57113 - Posted: 31 Jan 2013, 23:20:31 UTC
Last modified: 31 Jan 2013, 23:29:16 UTC

Should the corrected DLLs be going out with any new WUs, or are people who got the faulty ones now stuck with them until they either reset the project (as I did to get the corrected DLLs, as described in my earlier posts) or 1.07 is issued? I ask after looking at my only invalid task. All the others who got that WU errored out, voiding the WU. One wingman there who got the task after me presumably still has the faulty DLLs and so errored out, but he has over 1000 errors with 1.06.

Jeffery M. Thompson

Message 57116 - Posted: 1 Feb 2013, 2:36:53 UTC

They will need to reset at this point.

Pushing a 1.07 isn't happening right now.



Jeff

Ronald R Codney

Message 57118 - Posted: 1 Feb 2013, 4:04:50 UTC

Does resetting the project cause me to lose all previous results/data?



Copyright © 2018 AstroInformatics Group