Welcome to MilkyWay@home

Posts by Stick

1) Message boards : Number crunching : problem with de_nbody tasks never finishing (Message 70730)
Posted 1 day ago by Stick
Post:
But you seem to be running mobile CPUs. I run my machines 24/7, since they are dedicated.
It could be one of the power-down tricks that Intel or Microsoft uses that causes the problem. I set my power options to "high performance" mode.
Thanks again for the reply. You are right. Both my multi-core computers are laptops. They are older and the batteries are shot. I run them plugged into the charger, pretty much 24/7, for BOINC. I run several different BOINC projects and it's only the 3-CPU Nbody tasks that have any problems.
Are the work units being suspended?
BOINC does not show the hung-up tasks as suspended. They are shown as Running with Elapsed time counting up, but Progress is frozen. Most Nbody tasks take 20 to 25 minutes to finish up - so when I see one with a longer elapsed time, I restart BOINC. When BOINC restarts, the hung-up task starts running again, but its Elapsed time has been reset to a much earlier value (less than 20 minutes). I'm guessing that only around 10% of tasks hang up. Some tasks hang up multiple times.
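For anyone who wants to automate that check, here is a rough watchdog sketch of the workaround, not anything official from the project. It assumes BOINC's bundled boinccmd tool is on the PATH and that its --get_tasks output still prints "name:" and "fraction done:" lines; the 10-minute poll interval is just a guess that fits the normal 20-25 minute run time.

import re
import subprocess
import time

POLL_SECONDS = 600  # re-check after 10 minutes; a hung task shows no Progress change

def snapshot():
    """Return {task name: fraction done} parsed from `boinccmd --get_tasks`."""
    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout
    tasks = {}
    name = None
    for line in out.splitlines():
        m = re.match(r"\s*name:\s*(\S+)", line)
        if m:
            name = m.group(1)
            continue
        m = re.match(r"\s*fraction done:\s*([\d.]+)", line)
        if m and name is not None:
            tasks[name] = float(m.group(1))
    return tasks

before = snapshot()
time.sleep(POLL_SECONDS)
after = snapshot()

# Only flag tasks that have started (progress > 0) and made no progress between polls;
# queued tasks sitting at 0% would otherwise be reported as stuck.
stuck = [t for t, frac in after.items()
         if t in before and frac == before[t] and 0.0 < frac < 1.0]

if stuck:
    print("No progress in", POLL_SECONDS, "seconds:", stuck)
    # Ask the running client to shut down. Restarting it afterwards is
    # installation-specific (Windows service vs. BOINC Manager), so that
    # step is not automated here.
    subprocess.run(["boinccmd", "--quit"], check=True)

If a task shows no progress between the two polls, the script just asks the client to shut down; how it gets started again depends on how BOINC was installed.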
2) Message boards : Number crunching : problem with de_nbody tasks never finishing (Message 70727)
Posted 3 days ago by Stick
Post:
After switching to MS Defender, it didn't take long for a Milkyway@home N-Body Simulation 1.76 (3 CPUs) hang-up to occur. But I had forgotten to exclude the BOINC folders. Restarting BOINC now with the folders excluded.
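For reference, this is roughly how the exclusions can be added from an elevated prompt - just a sketch using Defender's built-in Add-MpPreference cmdlet. The two paths are the usual default BOINC program and data folders and may differ on other machines.

import subprocess

boinc_dirs = [
    r"C:\Program Files\BOINC",   # program folder (default install location)
    r"C:\ProgramData\BOINC",     # data folder, where tasks and checkpoints live
]

for path in boinc_dirs:
    # Add-MpPreference is the Defender cmdlet for adding scan exclusions;
    # this needs to run from an elevated (administrator) prompt.
    subprocess.run(
        ["powershell", "-NoProfile", "-Command",
         f"Add-MpPreference -ExclusionPath '{path}'"],
        check=True,
    )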
3) Message boards : Number crunching : problem with de_nbody tasks never finishing (Message 70726)
Posted 3 days ago by Stick
Post:
Jim,
Thank you for the suggestion. I will change to Microsoft Defender (from Avast) on one of my computers to see if it makes a difference.
Stick
4) Message boards : Number crunching : problem with de_nbody tasks never finishing (Message 70724)
Posted 4 days ago by Stick
Post:
Tom,
Thanks for the reply! And it's good to know somebody is watching. If you read my earlier posts on the subject, you know that the problem is easily worked around by restarting BOINC. And, right now, I am restarting BOINC 2 or 3 times a day. If there is anything you would like me to do before restarting, please let me know.
Stick
5) Message boards : Number crunching : problem with de_nbody tasks never finishing (Message 70722)
Posted 4 days ago by Stick
Post:
Deleted accidental double post.
6) Message boards : Number crunching : problem with de_nbody tasks never finishing (Message 70721)
Posted 4 days ago by Stick
Post:
This problem has existed for roughly 3 years and, as far as I can tell, no project administrators or moderators have ever responded to this thread. I first reported the problem on 2 Jun 2018 in this post. Then, on 13 May 2020, I reported it again in this post. To be clear, there is a problem with N-Body Simulation (mt) (3 CPUs) tasks hanging up. The problem existed with V1.68 and continues with V1.76. And over the last 3 years it has continued to crop up under all versions of BOINC and under all versions of Windows that I have used - on 3 different computers.
7) Message boards : Number crunching : problem with de_nbody tasks never finishing (Message 69808)
Posted 13 May 2020 by Stick
Post:
I almost started a new thread for my problem, but then I realized that I am encountering essentially the same issue as others who have posted here. That is, tasks running the Milkyway@home N-Body Simulation v1.76 (mt) windows_x86_64 app are often hanging up for hours at a time until discovered by me. And restarting BOINC always gets them going again. I used to think that the problem might be related to incompatibility with other programs I might be running, but I have since convinced myself that other programs are irrelevant. The problem seems to occur just as often when other programs are not running as when they are. Conversely, I sometimes run other programs and then check BOINC to find that N-Body tasks are still running OK. However, I do think that the problem is somehow related to the characteristics of specific tasks. That is, some tasks require relatively few (0-2) restarts while others may need restarting 8+ times.

Although I have been a MilkyWay contributor for 10+ years, my participation rate has recently increased substantially (due to the SETI hibernation). And as a result, this issue has become very annoying. I am also getting older and my memory isn't what it used to be. Case in point: I had completely forgotten that I had posted this message a little over 2 years ago, essentially reporting this exact same issue. The only differences are updated versions of the N-Body app, BOINC, and Windows, as well as the addition of a new computer.
8) Message boards : Number crunching : problem with de_nbody tasks never finishing (Message 67561)
Posted 4 Jun 2018 by Stick
Post:
I started this thread a few days ago and it looks like we're having the same problem. In my case, it is only occasional and restarting BOINC gets the tasks working again. Also, I first noticed the problem after updating BOINC to v7.10.2. (So, you might want to try v7.8.3.)

If there are any moderators out there, it's OK with me if you would like to combine our 2 threads. And retitling would probably be a good idea as well - maybe something like "3-CPU Nbody Task hang-ups".
9) Message boards : Number crunching : MilkyWay@Home N-Body Simulation v1.68 (mt) tasks hanging up (Message 67557)
Posted 2 Jun 2018 by Stick
Post:
This is a very minor, yet annoying, problem I've seen several times very recently. That is, these 3-CPU tasks show their status as Running and Elapsed time counts up, but Progress gets stuck (sometimes for hours). Restarting BOINC always gets it going again.

I updated BOINC about a week ago and wonder if the update might be the problem. I never noticed the problem with v7.8.3.

CPU type: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz [Family 6 Model 142 Stepping 9]
Number of processors: 4
Operating System: Microsoft Windows 10 Core x64 Edition, (10.00.17134.00)
BOINC version: 7.10.2
10) Message boards : News : New Nbody version 1.46 (Message 62993)
Posted 12 Jan 2015 by Stick
Post:
OK - I just looked at Task 945867764 and it has checkpointed. It has now run for over 7 hours and is around 30% complete. When I looked earlier (before it had checkpointed), it had run for well over an hour and was around 7% complete.

I posted this message here a couple of weeks ago regarding "can't validate" 1.46 WU's - and I'm still getting a high percentage of them. I also posted that the program didn't appear to be checkpointing. Since then, I've looked at a lot more "in progress" 1.46 tasks and still have not seen one that had checkpointed.
11) Message boards : News : New Nbody version 1.46 (Message 62991)
Posted 11 Jan 2015 by Stick
Post:
I posted this message here a couple of weeks ago regarding "can't validate" 1.46 WU's - and I'm still getting a high percentage of them. I also posted that the program didn't appear to be checkpointing. Since then, I've looked at a lot more "in progress" 1.46 tasks and still have not seen one that had checkpointed.
12) Message boards : News : New Nbody version 1.46 (Message 62882)
Posted 24 Dec 2014 by Stick
Post:
ps_nbody_12_19_orphan_sim_0_1413455402_1435063 - Completed, can't validate (Too many total results)

EDIT: And it looks like de_nbody_12_20_orphan_sim_2_1413455402_1448056 may be headed that way.

EDIT2: Just noticed that another in-progress unit has been underway for over 45 minutes and is over 40% complete, but it has not yet checkpointed.
13) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 62717)
Posted 19 Nov 2014 by Stick
Post:
Unless someone else points us to the specific problem, we might just have to live with the errors and chalk them up to a flaky app or something.

I presume your nvidia driver is up to date.

As I'm sure you know, other versions of the Modified Fit program have had/still have bugs. (i.e., see this thread.) Therefore, it seems logical to assume that there may be problems with the opencl_ati_101 and opencl_nvidia_101 versions as well. In other words, it might be a good idea for you to block Mod Fit units for a while.
14) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 62714)
Posted 18 Nov 2014 by Stick
Post:
Haven't figured out what causes these validate errors. They all finish with the correct exit code, but I notice the majority of them have an empty std_error.txt output. So, in all, about 3% of my tasks have these invalid errors (32 out of 964 valid results). Anyone have an idea what is going on?

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=257518&offset=0&show_names=0&state=5&appid=

Cheers, Keith


I was just checking up on one of my WU's and noticed that one of my wingmen is having similar validation issues. I know that's not much help but at least you know you are not alone.
15) Message boards : News : New Separation Modfit Version 1.36 (Message 62583)
Posted 16 Oct 2014 by Stick
Post:
Your Win32 machines will no longer be receiving Modfit work units so that won't be a problem any more.

I was aware of the ongoing Win32 problems and that you had subsequently blocked Modfit from Win32 machines. I was just trying to add some specifics to the discussion - that my Win32 errors were immediate computation errors [-1073741515 (0xffffffffc0000135) Unknown error number]. Other posts on this thread seemed to indicate that some Win32 units were failing after processing for significant amounts of time.

As for your 64-bit computer, that is completely normal. We require ~3 computers to return the same result for every work unit we send out. "Validation inconclusive" just means we are waiting for others to return their results for the work unit before we award credit. This ensures people aren't trying to game the system just for credits, and it ensures we are getting reliable results for our optimizations.

It may be Milkyway "completely normal" but it's not BOINC "completely normal". ;-)
16) Message boards : News : New Separation Modfit Version 1.36 (Message 62580)
Posted 16 Oct 2014 by Stick
Post:
I've got 3 computers and all have had problems with Modfit v1.36 - but the symptoms differ. Two have 32-bit Intel processors and run Win XP. Modfit v1.36 units fail immediately with computation errors on these 2 computers. The third computer has a 64-bit AMD processor and runs Win 7. Modfit v1.36 units run to completion on this computer but immediately go to "validation inconclusive".
17) Message boards : Number crunching : Computation ERROR (Message 62141)
Posted 9 Aug 2014 by Stick
Post:
I was just browsing the message boards here (to report a problem I am having with the new N-Body Program) and I noticed that no one had replied to your post. I saw that your task list indicates that you have a mixture of recent successfully completed tasks and ones that have errored out. I looked at the Stderr from one of your errored tasks and noticed this:

Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'


I suggest you try resetting the project. That should delete files like these from BOINC and cause it to redownload them from the project. Hope this helps.
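If you prefer the command line, something like the sketch below should do the same thing. It assumes the boinccmd tool that ships with BOINC is available and can reach the local client; the "reset" operation discards the project's files and downloads fresh copies.

import subprocess

# MilkyWay@home's project URL as registered in BOINC
PROJECT_URL = "http://milkyway.cs.rpi.edu/milkyway/"

# "reset" tells the client to discard the project's files (including
# astronomy_parameters.txt) and re-download them from the project.
subprocess.run(["boinccmd", "--project", PROJECT_URL, "reset"], check=True)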
18) Message boards : Number crunching : New N-Body Release 1.42 (Message 62140)
Posted 9 Aug 2014 by Stick
Post:
Task 803429280 crashed on my computer with a different error:
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
The handle is invalid.
(0x6) - exit code 6 (0x6)
</message>
<stderr_txt>
<search_application> milkyway_nbody 1.42 Windows x86 double , Crlibm </search_application>
RHO MAX IS 57.28266
57.28266Could not load Ktm32.dll (126): The specified module could not be found.

Failed to find end marker in checkpoint file.
14:20:42 (764): called boinc_finish
</stderr_txt>

That seems a little strange because, as far as I know, the unit was the only unit in cache at the time and should not have had any reason to revert to a checkpoint. That is, neither the computer nor BOINC was restarted during the timeframe. But the unit had run for a long time and was very close to finishing up.

OTOH, Task 803429273 finished up OK but its Stderr is also a little strange:
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
<search_application> milkyway_nbody 1.42 Windows x86 double , Crlibm </search_application>
RHO MAX IS 33.35565
33.35565Could not load Ktm32.dll (126): The specified module could not be found.

Poor likelihood. Returning worst case.
<search_likelihood>-9999999.900000000400000</search_likelihood>
22:14:21 (1440): called boinc_finish
</stderr_txt>

It is currently in "Checked, but no consensus yet" state because its wingman's unit crashed with a -185 (0xffffffffffffff47) ERR_RESULT_START error and the replacement is "Unsent".
19) Message boards : News : New Separation Modified Fit Runs Started (Message 59790)
Posted 3 Sep 2013 by Stick
Post:
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=421532475
20) Message boards : Number crunching : New WUs - Complete, can't validate? (Message 38101)
Posted 5 Apr 2010 by Stick
Post:
I suggest that "DB purging" be suspended until this "Too many errors . . ." problem is fixed. Else, a lot of work will be lost. The WU I reported (below) is already gone.

IMO, the logic behind the "Too many error results, Too many total results" circuitbreaker and/or Milkway's "3, 6, 1" threshholding of it, is flawed. That is, once the "circuitbreaker" is triggered, any (as yet unreturned), "In progess" results appear to be excluded from consideration, despite the fact that they could be "valid". That doesn't make sense - especially on a project where only one valid result is required.

At least, that seems to be the case with ps_13_3s_const_v2_5688232_1269636198 and Task 95459389 (mine) which lists its "Validate state" as: "Workunit error - check skipped".

EDIT: Sorry - I didn't see the other thread until after I posted here.




©2021 Astroinformatics Group