Welcome to MilkyWay@home

Posts by Stick

1) Message boards : Number crunching : problem with de_nbody tasks never finishing (Message 67561)
Posted 4 Jun 2018 by Stick
Post:
I started this thread a few days ago and it looks like we're having the same problem. In my case, it is only occasional and restarting BOINC gets the tasks working again. Also, I first noticed the problem after updating BOINC to v7.10.2. (So, you might want to try v7.8.3.)

If there are any moderators out there, it's OK with me if you would like to combine our 2 threads. And, retitling would probably be a good idea, as well - maybe something like "3-CPU Nbody Task hang-ups"
2) Message boards : Number crunching : MilkyWay@Home N-Body Simulation v1.68 (mt) tasks hanging up (Message 67557)
Posted 2 Jun 2018 by Stick
Post:
This is a very minor, yet annoying, problem I've seen several times very recently. That is, these 3-CPU tasks show their status as running and elapsed time keeps counting up, but progress gets stuck (sometimes for hours). Restarting BOINC always gets it going again.

I updated BOINC about a week ago and wonder if the update might be the problem. I never noticed the problem with v7.8.3.

CPU type:Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz [Family 6 Model 142 Stepping 9]
Number of processors: 4
Operating System: Microsoft Windows 10 Core x64 Edition, (10.00.17134.00)
BOINC version: 7.10.2
3) Message boards : News : New Nbody version 1.46 (Message 62993)
Posted 12 Jan 2015 by Stick
Post:
OK - I just looked at Task 945867764 and it has checkpointed. It has now run for over 7 hours and is around 30% complete. When I looked earlier (before it had checkpointed), it had run for well over an hour and was around 7% complete.

I posted this message here a couple of weeks ago regarding "can't validate" 1.46 WU's - and I'm still getting a high percentage of them. Also posted that the program didn't appear to be checkpointing. Since then I've looked at a lot more "in progress" 1.46 tasks and still have not seen one that had checkpointed.
4) Message boards : News : New Nbody version 1.46 (Message 62991)
Posted 11 Jan 2015 by Stick
Post:
I posted this message here a couple of weeks ago regarding "can't validate" 1.46 WU's - and I'm still getting a high percentage of them. Also posted that the program didn't appear to be checkpointing. Since then I've looked at a lot more "in progress" 1.46 tasks and still have not seen one that had checkpointed.
5) Message boards : News : New Nbody version 1.46 (Message 62882)
Posted 24 Dec 2014 by Stick
Post:
ps_nbody_12_19_orphan_sim_0_1413455402_1435063 - Completed, can't validate Too many total results

EDIT: And it looks like de_nbody_12_20_orphan_sim_2_1413455402_1448056 may be headed that way.

EDIT2: Just noticed that another, in progress, unit has been underway for over 45 minutes and is over 40% complete but it has not yet checkpointed.
6) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 62717)
Posted 19 Nov 2014 by Stick
Post:
Unless someone else points us to the specific problem, we might just have to live with the errors and chalk them up to a flaky app or something.

I presume your nvidia driver is up to date.

As I'm sure you know, other versions of the Modified Fit program have had/still have bugs. (e.g., see this thread.) Therefore, it seems logical to assume that there may be problems with the opencl_ati_101 and opencl_nvidia_101 versions as well. In other words, it might be a good idea for you to block Mod Fit units for a while.
7) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 62714)
Posted 18 Nov 2014 by Stick
Post:
Haven't figured out what causes these validate errors. They all finish with the correct exit code but I notice the majority of them have an empty std_error.txt output. So, in all, about 3% of my tasks have these Invalid errors (32 compared with 964 valid results). Anyone have an idea what is going on?

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=257518&offset=0&show_names=0&state=5&appid=

Cheers, Keith


I was just checking up on one of my WU's and noticed that one of my wingmen is having similar validation issues. I know that's not much help but at least you know you are not alone.
8) Message boards : News : New Separation Modfit Version 1.36 (Message 62583)
Posted 16 Oct 2014 by Stick
Post:
Your Win32 machines will no longer be receiving Modfit work units so that won't be a problem any more.

I was aware of the ongoing Win32 problems and that you had subsequently blocked Modfit from Win32 machines. I was just trying to add some specifics to the discussion - that my Win32 errors were immediate computation errors [-1073741515 (0xffffffffc0000135) Unknown error number]. Other posts on this thread seemed to indicate that some Win32 units were failing after processing for significant amounts of time.

As for your 64 bit computer, that is completely normal. We require ~3 computers to return the same result for every work unit we send out. Validation inconclusive just means we are waiting for others to return their results for the work unit before we award credit. This ensures people aren't trying to game the system just for credits and it ensures we are getting reliable results for our optimizations.

It may be Milkyway "completely normal" but it's not BOINC "completely normal". ;-)
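
The Win32 error quoted above, -1073741515 (0xffffffffc0000135), is the same number shown two ways: BOINC prints the signed exit code followed by its two's-complement hex form. The following is a minimal sketch of that conversion. The helper name `to_unsigned_hex` is made up for illustration and is not part of BOINC. The low 32 bits, 0xC0000135, are the value Windows documents as STATUS_DLL_NOT_FOUND, which would be consistent with units that fail immediately rather than after running for a while.

```python
def to_unsigned_hex(code: int, bits: int = 32) -> str:
    """Return the two's-complement hex form of a signed exit code.

    Hypothetical helper for illustration only; not a BOINC function.
    """
    mask = (1 << bits) - 1          # e.g. 0xFFFFFFFF for 32 bits
    return f"0x{code & mask:0{bits // 4}x}"


if __name__ == "__main__":
    print(to_unsigned_hex(-1073741515))      # 32-bit view of the exit code
    print(to_unsigned_hex(-1073741515, 64))  # 64-bit view, as shown in the task list
```

The same helper applied with 64 bits to a code such as -185 gives the sign-extended form that the task pages display next to the decimal value.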
9) Message boards : News : New Separation Modfit Version 1.36 (Message 62580)
Posted 16 Oct 2014 by Stick
Post:
I've got 3 computers and all have had problems with Modfit v1.36 - but the symptoms differ. Two computers are 32 bit Intel processors and run Win XP. Modfit v1.36 units fail immediately with computation errors on these 2 computers. The third computer has a 64 bit AMD processor and runs Win 7. Modfit v1.36 units run to completion on this computer but immediately go to "validation inconclusive".
10) Message boards : Number crunching : Computation ERROR (Message 62141)
Posted 9 Aug 2014 by Stick
Post:
I was just browsing the message boards here (to report a problem I am having with the new N-Body Program) and I noticed that no one had replied to your post. I saw that your tasks list indicates that you have a mixture of recent successfully completed tasks and ones that have errored out. I looked at the Stderr from one of your errored tasks and noticed this:

Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'


I suggest you try resetting the project. That should delete files like these from BOINC and cause it to redownload them from the project. Hope this helps.
11) Message boards : Number crunching : New N-Body Release 1.42 (Message 62140)
Posted 9 Aug 2014 by Stick
Post:
Task 803429280 crashed on my computer with a different error:
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
The handle is invalid.
(0x6) - exit code 6 (0x6)
</message>
<stderr_txt>
<search_application> milkyway_nbody 1.42 Windows x86 double , Crlibm </search_application>
RHO MAX IS 57.28266
57.28266Could not load Ktm32.dll (126): The specified module could not be found.

Failed to find end marker in checkpoint file.
14:20:42 (764): called boinc_finish
</stderr_txt>

That seems a little strange because, as far as I know, the unit was the only unit in cache at the time and should not have had any reason to revert to a checkpoint. That is, neither the computer nor BOINC was restarted during the timeframe. But the unit had run for a long time and was very close to finishing up.

OTOH, Task 803429273 finished up OK but its Stderr is also a little strange:
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
<search_application> milkyway_nbody 1.42 Windows x86 double , Crlibm </search_application>
RHO MAX IS 33.35565
33.35565Could not load Ktm32.dll (126): The specified module could not be found.

Poor likelihood. Returning worst case.
<search_likelihood>-9999999.900000000400000</search_likelihood>
22:14:21 (1440): called boinc_finish
</stderr_txt>

It is currently in "Checked, but no consensus yet" state because its wingman's unit crashed with a -185 (0xffffffffffffff47) ERR_RESULT_START error and the replacement is "Unsent".
12) Message boards : News : New Separation Modified Fit Runs Started (Message 59790)
Posted 3 Sep 2013 by Stick
Post:
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=421532475
13) Message boards : Number crunching : New WUs - Complete, can't validate? (Message 38101)
Posted 5 Apr 2010 by Stick
Post:
I suggest that "DB purging" be suspended until this "Too many errors . . ." problem is fixed. Else, a lot of work will be lost. The WU I reported (below) is already gone.

IMO, the logic behind the "Too many error results, Too many total results" circuitbreaker and/or Milkyway's "3, 6, 1" thresholding of it, is flawed. That is, once the "circuitbreaker" is triggered, any (as yet unreturned), "In progress" results appear to be excluded from consideration, despite the fact that they could be "valid". That doesn't make sense - especially on a project where only one valid result is required.

At least, that seems to be the case with ps_13_3s_const_v2_5688232_1269636198 and Task 95459389 (mine) which lists its "Validate state" as: "Workunit error - check skipped".

EDIT: Sorry - I didn't see the other thread until after I posted here.

14) Message boards : Number crunching : New WUs - Complete, can't validate? (Message 38082)
Posted 5 Apr 2010 by Stick
Post:
IMO, the logic behind the "Too many error results, Too many total results" circuitbreaker and/or Milkyway's "3, 6, 1" thresholding of it, is flawed. That is, once the "circuitbreaker" is triggered, any (as yet unreturned), "In progress" results appear to be excluded from consideration, despite the fact that they could be "valid". That doesn't make sense - especially on a project where only one valid result is required.

At least, that seems to be the case with ps_13_3s_const_v2_5688232_1269636198 and Task 95459389 (mine) which lists its "Validate state" as: "Workunit error - check skipped".

EDIT: Sorry - I didn't see the other thread until after I posted here.
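
To make the complaint above concrete, here is a rough sketch of the behaviour as described in the post, next to the alternative it argues for. This is not the BOINC server code. It assumes "3, 6, 1" means a maximum of 3 error results, a maximum of 6 total results, and 1 required valid result, and it assumes the breaker trips when a count exceeds its limit. All names (`WorkunitLimits`, `workunit_errored`, `validate_state`, `count_late_results`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkunitLimits:
    # Assumed reading of the "3, 6, 1" thresholds from the post.
    max_error_results: int = 3
    max_total_results: int = 6
    min_valid_results: int = 1


def workunit_errored(n_error: int, n_total: int, limits: WorkunitLimits) -> bool:
    """Illustrative circuit breaker: trips once either limit is exceeded."""
    return n_error > limits.max_error_results or n_total > limits.max_total_results


def validate_state(result_is_valid: bool, wu_errored: bool,
                   count_late_results: bool = False) -> str:
    """State given to a result that is returned after the breaker has tripped.

    count_late_results=False models what the post observed;
    count_late_results=True models what the post proposes instead.
    """
    if wu_errored and not count_late_results:
        return "Workunit error - check skipped"
    return "Valid" if result_is_valid else "Invalid"


if __name__ == "__main__":
    limits = WorkunitLimits()
    tripped = workunit_errored(n_error=4, n_total=5, limits=limits)
    print(tripped)
    print(validate_state(True, tripped))                           # observed behaviour
    print(validate_state(True, tripped, count_late_results=True))  # proposed behaviour
```

Under these assumptions, a good result that comes back late is never checked in the first case, which is the lost work the post is pointing at.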
15) Message boards : Number crunching : "Team" software problem (???) (Message 33835)
Posted 27 Nov 2009 by Stick
Post:
I noticed that, after my last "Update", my account no longer has me listed as a member of a team - although I didn't do anything to "Quit team". My "Your account" page also has 3 "Database Error" messages at the top right corner of the page. Anybody else having this problem?
16) Message boards : Number crunching : 3rd.in - optimized apps (Message 17477)
Posted 3 Apr 2009 by Stick
Post:
what version would be right for a asus a8n sli-prem(amd cpu) and a 9800gtx with latest patch


Use CPU-Z to identify your computer's capabilities. (Follow the links to CPU-Z on the z-slip site.)
17) Message boards : Number crunching : So, what's the state of play (Message 4561)
Posted 30 Jul 2008 by Stick
Post:
I would add that all 3 types progress similarly in that most of the work is done between 0% and 50%. The last 50% goes by very quickly.
18) Message boards : Number crunching : Compute Errors (Message 3772)
Posted 13 Jun 2008 by Stick
Post:
I think I've found the problem with the gs_600s (and the older bad search). I've removed all the gs_600s from the database so you shouldn't get any more. Let me know if you see any problems with gs_601.


I noticed that my computers successfully completed several gs_600s WU's yesterday - but my "Tasks for user" page still shows all of them as "Pending". I am guessing this is because my results were returned after you removed the WU's from the DB.

Note: I am not looking for credit here - just pointing out another issue.
19) Message boards : Number crunching : application v1.21/v1.22 errors/memory leaks/crashes here (Message 2056)
Posted 7 Mar 2008 by Stick
Post:
I just noticed that v1.21's "Progress" meter is more linear than v1.19's. That is, as compared to what I reported here for v1.19, my last v1.21 unit took about 9 minutes to do the first 50% and about 2 minutes for the last 50%. Much better!
20) Message boards : Number crunching : Please post app 1.17/1.18/1.19 memory leaks/errors (Message 1957)
Posted 5 Mar 2008 by Stick
Post:
EDIT: I watched my second unit a little more carefully. The first 10% took about 9 minutes and the last 90% took about 2 minutes. In other words, the "Progress" meter works OK - it's just not linear.


yeah crunch3r had it right about this. the first 10% is doing an integral calculation (which is pretty computationally intensive), and the last 90% is comparing the star values to this integral calculation. there's really no good way to calculate the progress in a linear way, since on some architectures the integral takes a lot less time than on others (probably due to optimizations).


I didn't mean to imply that being non-linear was a problem. As long as we know what to expect, we can deal with it.
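
For anyone who wants to turn the meter reading into a rough time estimate, the timings above (first 10% in about 9 minutes, last 90% in about 2 minutes) can be used as a piecewise-linear map from reported progress to the fraction of run time actually elapsed. This is only a sketch using the numbers from this thread. The function name `time_fraction` and the phase list are made up for illustration, and the timings will differ on other hardware.

```python
def time_fraction(reported: float, phases: list[tuple[float, float]]) -> float:
    """Map reported progress (0..1) to the fraction of total run time elapsed.

    phases is an ordered list of (progress_share, minutes) pairs, e.g. the
    integral phase followed by the star-comparison phase.
    """
    total_minutes = sum(minutes for _, minutes in phases)
    elapsed = 0.0   # minutes spent in phases already completed
    start = 0.0     # reported progress at the start of the current phase
    for share, minutes in phases:
        end = start + share
        if reported <= end:
            # Interpolate linearly inside the current phase.
            return (elapsed + minutes * (reported - start) / share) / total_minutes
        elapsed += minutes
        start = end
    return 1.0


if __name__ == "__main__":
    # Timings reported in this thread: 10% in ~9 min, remaining 90% in ~2 min.
    v119 = [(0.10, 9.0), (0.90, 2.0)]
    for p in (0.05, 0.10, 0.55, 1.00):
        print(f"meter {p:4.0%} -> about {time_fraction(p, v119):.0%} of run time")
```

With these numbers, a meter reading of 10% corresponds to roughly 9 of 11 minutes already spent, which is why the remaining 90% seems to fly by.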


©2019 Astroinformatics Group