Welcome to MilkyWay@home

Posts by (retired account)

1) Message boards : Number crunching : Computation Error on NBody Model (Message 42352)
Posted 24 Sep 2010 by (retired account)
Post:
Hi, I got a second one, which was given some additional time to finish. It finished today after 185-something hours. Since validation is inconclusive, it is still in the database here: workunit id 150689696, workunit name de_nbody_model1_1_52544_1284309885, task id 196714928. I guess I won't get those 3,956.93 claimed credits? *g*
2) Message boards : Number crunching : Computation Error on NBody Model (Message 42332)
Posted 23 Sep 2010 by (retired account)
Post:
This afternoon (12:23 UTC) I finished a workunit after more than 100 hours of run time. This unit, named de_nbody_model1_1_35711_1284295325_2, had previously ended with a "max. time exceeded" error on the wingmen's machines, so I decided to give it a bit more time. I guess it was valid, because it has already been purged from the database:
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=150672863

I was not at home this afternoon, so I'm not 100% sure, but if it were invalid, it should still be in the database, right?

Regards
Alex
3) Message boards : Number crunching : GPU detection problem with both ATI and nVidia cards installed (Message 42279)
Posted 21 Sep 2010 by (retired account)
Post:
I recently had a similar problem with a GTX260, an onboard ATI GPU, current drivers and Windows 7 64bit. I fixed it by renaming two ATI dlls after reading this post here at the AMD Dev forum. The dlls to be renamed were:

C:\windows\SysWOW64\aticfx32.dll
C:\windows\System32\aticfx64.dll

Just completely stop and restart BOINC after that. Of course, at your own risk! *g* I also used a cc_config.xml with the use_all_gpus option set to 1.
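
For reference, here is a minimal sketch of what such a cc_config.xml contains, generated with Python's standard library. The nesting (cc_config, then options, then use_all_gpus) follows the BOINC client configuration format; the function name build_cc_config is only illustrative, and the output still has to be saved as cc_config.xml in the BOINC data directory by hand.

```python
import xml.etree.ElementTree as ET


def build_cc_config():
    """Build a minimal cc_config.xml body that enables use_all_gpus."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    # 1 tells the client to use every detected GPU, not only the most capable one.
    ET.SubElement(options, "use_all_gpus").text = "1"
    return ET.tostring(root, encoding="unicode")


xml_text = build_cc_config()
print(xml_text)
```

The printed text is the whole file content. As noted above, BOINC should be stopped and restarted afterwards so the setting is picked up.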
4) Message boards : News : started a new nbody search: de_nbody_model1_1 (Message 42213)
Posted 17 Sep 2010 by (retired account)
Post:
We're at v0.07 for Windows already, please see here: http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=1917
5) Message boards : Number crunching : A problem with downloading the nbody app (Message 42168)
Posted 15 Sep 2010 by (retired account)
Post:
My guess is that your client_state.xml is corrupted, or something else that defines the project's parameters. The app_info content you posted looks all right, if these are all the entries for milkyway_nbody. However, BOINC is asking for three files, none of which is the right one.

I'd say save the app_info.xml, detach the project, reattach again and throw the app_info.xml into the .../projects/... directory again. You might want to download milkyway_nbody_0.07_windows_intelx86__sse2.exe manually, too.
6) Message boards : News : started a new nbody search: de_nbody_model1_1 (Message 42151)
Posted 15 Sep 2010 by (retired account)
Post:
We're also talking about doing the rough phases of the search with single precision which would allow more GPUs to work on it.


Would this be an application which does a part of the calculations on the (single precision) GPU and the other part on the CPU? A bit like the current Einstein CUDA application, using the GPU really as a coprocessor? Sounds interesting.
7) Message boards : News : updated the nbody applications again (Message 42145)
Posted 15 Sep 2010 by (retired account)
Post:

shmget in attach_shmem: Invalid argument
16:10:20 (83546): Can't set up shared mem: -1. Will run in standalone mode.



I've seen the same error on this wingman of mine, also a Mac running Darwin x86_64: http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=196251224
8) Message boards : News : started a new nbody search: de_nbody_model1_1 (Message 42143)
Posted 15 Sep 2010 by (retired account)
Post:
Are you seeing a lot of WUs with granted credit much lower than the claimed credit?


No. None with such big differences. My guess was that in the above case it was caused somehow by the restart, the checkpoint bug and the ongoing count of the run time, making the total run time somewhat bigger than it actually was.
9) Message boards : Number crunching : A problem with downloading the nbody app (Message 42136)
Posted 14 Sep 2010 by (retired account)
Post:
Hi, the app has just been updated to v0.07. Please note the two underscores between '64' and 'sse2'.
10) Message boards : News : updated the nbody applications again (Message 42127)
Posted 14 Sep 2010 by (retired account)
Post:
Depends on how you define 'working'. I'm not getting paid for what I do right now. But that's fine with me. :) Guess we shouldn't chat here too much. *g*

The result I linked to below was just purged from the database...
11) Message boards : News : updated the nbody applications again (Message 42125)
Posted 14 Sep 2010 by (retired account)
Post:

Failed to calculate chisq


Yeah, seen this, too. On my only valid v0.06 result here, it was included in the output of both results. So the big question is whether this is really a valid result for the project.

Btw, Guten Abend, Alexander! *g*
12) Message boards : News : updated the nbody applications again (Message 42123)
Posted 14 Sep 2010 by (retired account)
Post:
Hmm.. from my result list it seems that v0.06 will not validate against v0.04. Anyone else seeing this? My only valid v0.06 result (win 64bit) so far was against linux v0.06. On the other hand, I've also seen some v0.04 results which did not validate against another v0.04. Guess we could use some more data here *g*.
13) Message boards : News : updated the nbody applications again (Message 42116)
Posted 14 Sep 2010 by (retired account)
Post:
Another issue: Is there (or was there) some glitch in the database concerning the nbody application name? For example see here: http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=150916643 Above the tasklist it says: "This is displayed on the workunit pageDatabase Error" and in the cell for the application name there is noted "v0.00" which is also repeated in the task itself. I've seen a number of instances. Maybe this is already resolved, because here is a workunit where a task with linux app 0.06 is finished and the app version is shown correctly.

EDIT: Checkpointing seems to work now. I shut down BOINC on purpose and work was resumed at the last checkpoint. This is Win 7 64bit. Great.

Checkpoint: tnow = 2.01929. time since last = 361.459s
Checkpoint exists. Attempting to resume from it.
Thawing state
Successfully read checkpoint
Checkpoint: tnow = 2.46124. time since last = 972626s


Regards
Alex
14) Message boards : News : updated the nbody applications again (Message 42115)
Posted 14 Sep 2010 by (retired account)
Post:
I received no workunits for nbody 0.06, only for milkyway 0.19. However, I prefer to leave the latter to the ATI guys. *grin* So I set up an app_info.xml again and downloaded nbody 0.06 manually. This is the app_info part only for nbody 0.06 CPU tasks:

<app_info>
 <app>
  <name>milkyway_nbody</name>
 </app>
 <file_info>
  <name>milkyway_nbody_0.06_windows_x86_64__sse2.exe</name>
  <executable/>
 </file_info>
 <app_version>
  <app_name>milkyway_nbody</app_name>
  <version_num>6</version_num>
  <file_ref>
   <file_name>milkyway_nbody_0.06_windows_x86_64__sse2.exe</file_name>
   <main_program/>
  </file_ref>
 </app_version>
</app_info>


This is for Windows 64bit; for 32bit, remove the _64 part from the download link and the app_info file references. Btw, I think it's a good idea to include the sse2 requirement now; it should save people with old computers some frustration.
15) Message boards : News : started a new nbody search: de_nbody_model1_1 (Message 42113)
Posted 14 Sep 2010 by (retired account)
Post:

I currently have the longest running workunit up to now. 7 h of run time were already done and approx. 8 h were still to go when I had to close BOINC. After the restart, it started again at 0 % progress, but the run time started at the approx. 7 h where I had stopped it before. So currently I am at 5.4 % again and the total run time has risen from 15 h to approx. 22 h now.


Just for the record (because we have now moved to a new app version): the workunit mentioned above was finished this morning and is now validated. The stderr out has some interesting info about the checkpointing problem, excerpt:


Checkpoint: tnow = 1.20291. time since last = 360.466s
Checkpoint: tnow = 1.22032. time since last = 361.073s
Checkpoint: tnow = 1.238. time since last = 360.637s
Checkpoint: tnow = 1.25557. time since last = 362.311s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
<plummer_r> -38.146212235604 2.2104695431195 32.223568725294 </plummer_r>
<plummer_v> 69.480777935001 95.95483517654 -100.99755377651 </plummer_v>
Checkpoint: tnow = 0.0197762. time since last = 903435s
Checkpoint: tnow = 0.0406272. time since last = 394.064s
Checkpoint: tnow = 0.0593286. time since last = 366.626s
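
The restart in the excerpt above can also be spotted mechanically: tnow drops from 1.25557 back to 0.0197762. Below is a minimal sketch, assuming the stderr text uses exactly the "Checkpoint: tnow = ... time since last = ...s" line format shown here. The function name find_restarts and the shortened sample text are only illustrative.

```python
import re

# Matches lines such as "Checkpoint: tnow = 1.25557. time since last = 362.311s"
CHECKPOINT_RE = re.compile(
    r"Checkpoint: tnow = ([0-9.]+)\. time since last = ([0-9.]+)s"
)


def find_restarts(stderr_text):
    """Return (previous_tnow, new_tnow) pairs where simulation time went backwards."""
    restarts = []
    previous = None
    for line in stderr_text.splitlines():
        match = CHECKPOINT_RE.search(line)
        if not match:
            continue  # not a checkpoint line
        tnow = float(match.group(1))
        if previous is not None and tnow < previous:
            restarts.append((previous, tnow))
        previous = tnow
    return restarts


# Shortened copy of the stderr excerpt quoted above.
excerpt = """Checkpoint: tnow = 1.238. time since last = 360.637s
Checkpoint: tnow = 1.25557. time since last = 362.311s
Failed to resume checkpoint
Starting fresh nbody run
Checkpoint: tnow = 0.0197762. time since last = 903435s
Checkpoint: tnow = 0.0406272. time since last = 394.064s"""

restarts = find_restarts(excerpt)
print(restarts)
```

Any pair in the result means the simulation started over instead of resuming from the checkpoint.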



Btw, claimed credit 495.43, granted credit 65.73 is a bit disappointing. Never mind. ;)
16) Message boards : News : started a new nbody search: de_nbody_model1_1 (Message 42105)
Posted 13 Sep 2010 by (retired account)
Post:
has run 2 hours and is showing 9.259% done


Hi Paul, 10% in 2 hours should be 100% in 20 hours, right? So this should be fine.
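
The same estimate with the exact figure from the quote, as a small sketch. It assumes progress is linear in run time, which the nbody application does not guarantee; the function name estimate_total_hours is only illustrative.

```python
def estimate_total_hours(elapsed_hours, percent_done):
    """Extrapolate total run time, assuming progress grows linearly with time."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return elapsed_hours * 100.0 / percent_done


elapsed = 2.0  # hours run so far, from the quoted post
total = estimate_total_hours(elapsed, 9.259)
remaining = total - elapsed
print(f"estimated total: {total:.1f} h, remaining: {remaining:.1f} h")
```

With 9.259 % after 2 hours this gives about 21.6 hours in total, so roughly 19.6 hours remain.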

Brian has also reported here that the workunits will be terminated with a "max. time exceeded" error at some point (which should depend on the system they run on). I guess that means they cannot really run into the 8-day deadline unless you have a very slow system.
17) Message boards : News : started a new nbody search: de_nbody_model1_1 (Message 42102)
Posted 13 Sep 2010 by (retired account)
Post:

The Windows checkpointing is currently broken (it will always restart from the beginning), but I think I've fixed all the problems with it. (...) I'll try to update the binaries sometime today.


Hello Matt, is this fix already included in the current version 0.04? Or will it be in the upcoming one?

I currently have the longest running workunit up to now. 7 h of run time were already done and approx. 8 h were still to go when I had to close BOINC. After the restart, it started again at 0 % progress, but the run time started at the approx. 7 h where I had stopped it before. So currently I am at 5.4 % again and the total run time has risen from 15 h to approx. 22 h now. So something is wrong with checkpointing, I guess.

Regards
Alex
18) Message boards : News : started a new nbody search: de_nbody_model1_1 (Message 42088)
Posted 13 Sep 2010 by (retired account)
Post:
Update: Five more workunits, all de_nbody_model1_1, and for a change all now completed, four of them already validated. Run time between 30 and 60 minutes. Still don't have a clue why some crash and others not.
19) Message boards : News : started a new nbody search: de_nbody_model1_1 (Message 42085)
Posted 13 Sep 2010 by (retired account)
Post:
I got my first workunits tonight for N-Body Simulation v0.04. The outcome was rather odd: three workunits were from the de_nbody_test_10 series and they were all completed and validated. The five others were from the de_nbody_model1_1 series and they all crashed after one or two seconds. Looking at the wingmen, I cannot see any pattern. Sometimes they also crash on a wingman, sometimes they seem to finish without error.

Well, I'll try to get some more; maybe I'll catch a good one *g* ...

Regards

List of Error results
20) Message boards : Number crunching : nbody (Message 41672)
Posted 22 Aug 2010 by (retired account)
Post:

The parameters of the simulation can cause a wide variation in how long it takes. For example, over the range of masses being fit, the run times vary by a factor of 10. This also means on lots of systems, the workunits exceed the maximum allowed time and then get killed by BOINC before finishing.


Hmm.. is a variation in run time by a factor of 10 really that much? I remember that at RCN the workunit run times can vary between a few seconds and hundreds of hours, which is a factor of 1,000,000 (however, if not changed by the user, they will stop at 24 hours even if no result is found and are reissued later in a less complex calculation, if I remember correctly).

Maybe a dumb question, but can't you simply extend the maximum allowed time by a factor of 10? Does it really matter if some workunits run, for example, 10 hours instead of one as intended? Well, I'm sure I'm missing something. Otherwise you guys would have solved the problem already. ;-)

