Apology for recent bad batches of workunits

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 55799 - Posted: 14 Oct 2012, 22:11:37 UTC

Hi Everyone,

Just wanted to send out an apology for all the problematic workunits being sent out lately. This was mostly my fault -- we have quite a few new students using MilkyWay@Home for their research -- and the code to start up new searches wasn't doing a very good job of ensuring the quality of the input files they were submitting to it.

We didn't have much problem with this in the past as it was mostly just me and Matt Newby submitting jobs and we knew what we were doing, but the client applications are rather finicky about extra whitespace and things like that in the input files, and this was causing a lot of the crashing workunits.

I'm in the process of updating the search submission code to be a lot more robust and catch these errors before sending things out, so hopefully this won't be as much of an issue in the future (hopefully not an issue at all!).
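For illustration, a minimal sketch of the kind of check I mean, in Python - the one-list-of-floats-per-line format and the function name here are assumptions for the example, not the actual submission code:

import sys

def validate_search_input(path):
    # Reject input files with stray whitespace or unparseable values
    # before a search is submitted. Hypothetical format: one
    # whitespace-separated list of float parameters per line.
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            stripped = line.rstrip("\n")
            # Leading/trailing blanks and tabs are exactly what the
            # finicky client parser chokes on.
            if stripped != stripped.strip() or "\t" in stripped:
                problems.append("line %d: stray whitespace" % lineno)
            for token in stripped.split():
                try:
                    float(token)
                except ValueError:
                    problems.append("line %d: bad value %r" % (lineno, token))
    return problems

if __name__ == "__main__":
    errors = validate_search_input(sys.argv[1])
    if errors:
        print("refusing to submit search:")
        for e in errors:
            print("  " + e)
        sys.exit(1)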

So again I want to apologize for the lack of robustness in some of my code that was causing the crashes over the last few days. We're looking into the issue that's causing the last bit of errors that seem to be happening out there (it looks like there's some error in the client code that causes NANs with certain input parameters, which results in the client reporting an error for the workunit). So I hope we can get that cleared up soon as well.
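The natural guard for that is to test the computed likelihood for finiteness before reporting it. A minimal sketch in Python (the real client is native code, so this function is a stand-in; the -241 sentinel matches the value visible in failing task output later in this thread):

import math

def guard_likelihood(likelihood):
    # Check the computed likelihood before the result is reported.
    # NaN or infinity means certain input parameters drove the math
    # out of range; substitute a "very bad fit" sentinel rather than
    # letting the workunit error out.
    if not math.isfinite(likelihood):
        return -241.0  # sentinel seen in failing stderr output
    return likelihood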

Happy crunching!
--Travis
Tex1954
Joined: 22 Apr 11
Posts: 64
Credit: 899,271,356
RAC: 7,596
Message 55801 - Posted: 15 Oct 2012, 7:25:15 UTC

Hey! Y'all are doing a great job and I KNEW y'all were on top of things.

Keep up the good work!!!

8-)
Tackleway
Joined: 17 Mar 10
Posts: 20
Credit: 5,641,904
RAC: 0
Message 55803 - Posted: 15 Oct 2012, 10:12:06 UTC

Hi Travis,

Still having a few invalids occurring on machines which are usually very stable.
See Tasks below :-(

Task: 320542150
Name: de_separation_22_3s_free_3_1350173304_527364_1
Task: 320410712
Name: de_separation_22_3s_free_3_1350173304_448209_1
Task: 320192784
Name: de_separation_22_3s_free_3_1350173304_329191_0

Best regards, Tackleway


dskagcommunity
Joined: 26 Feb 11
Posts: 170
Credit: 205,557,553
RAC: 0
Message 55809 - Posted: 15 Oct 2012, 14:40:03 UTC

I think we must wait a few days until all the erroneous workunits have been marked completely invalid by multiple machines and are no longer sent out?
DSKAG Austria Research Team: http://www.research.dskag.at



Dataman
Joined: 5 Sep 08
Posts: 28
Credit: 245,585,043
RAC: 0
Message 55812 - Posted: 15 Oct 2012, 16:16:32 UTC

No harm, no foul as far as I am concerned. Thanks for the weekend support.
My last error was 15 Oct 2012 | 15:51:01 UTC.

astro-marwil
Joined: 3 Jul 12
Posts: 13
Credit: 7,601,982
RAC: 0
Message 55820 - Posted: 15 Oct 2012, 22:43:38 UTC - in response to Message 55809.  
Last modified: 15 Oct 2012, 22:54:44 UTC

Hello!
I think we must wait a few days until all the erroneous workunits have been marked completely invalid by multiple machines and are no longer sent out?

No, there are lots of crunching errors from tasks that were sent out today. I don't think the rate of errors has dropped significantly. Especially the "de_separation_22_3s_free_xxxxxxx" tasks tend to fail. Furthermore, many more tasks are waiting for validation. All erroneous tasks end with error code 1 (0x1), incorrect function. This is not limited to Win7/Vista; some also come from Linux.

Kind regards and happy crunching.
Martin
Gary Roberts
Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,499
RAC: 0
Message 55823 - Posted: 16 Oct 2012, 0:00:28 UTC - in response to Message 55820.  

No, there are lots of crunching errors from tasks that were sent out today....

Are you crunching two at a time?

Yesterday, all my hosts (which were crunching two at a time) were rapidly erroring out full caches of tasks. I noticed that if I got a new cache of work and immediately suspended all but one task, then that one task would complete without problems. If I then unsuspended a second task, it would also crunch successfully. The minute I tried to launch a second task with one already running, it would try to start and then error out after about 5 seconds.

So I reconfigured all my hosts to only crunch one at a time and the major problem went away. All have been crunching without further issue for about 12 hours now. I've had a look through a couple of lists of completed tasks and there are occasional (and quite different) failures - I would estimate about 5% of tasks are failing like this. The tasks that fail seem to run to normal completion and then fail right at the very end.

A normal and successful completion shows the following right at the end of the stderr.txt output:

Integration time: 26.084413 s. Average time per iteration = 40.756896 ms
Integral 2 time = 26.670313 s
Running likelihood with 66200 stars
Likelihood time = 0.281871 s
<background_integral> 0.000229475606607 </background_integral>
<stream_integral>  29.075788494514907  1751.674920113726300  265.410894993209520 </stream_integral>
<background_likelihood> -3.630395539836397 </background_likelihood>
<stream_only_likelihood>  -50.559887396327156  -4.291799741661306  -3.193754139881464 </stream_only_likelihood>
<search_likelihood> -2.933654500572366 </search_likelihood>
06:07:42 (4016): called boinc_finish

</stderr_txt>
]]>


A task that fails has the following output (note the 5th and 6th lines - the *** are mine):

Integration time: 26.096254 s. Average time per iteration = 40.775398 ms
Integral 2 time = 26.683731 s
Running likelihood with 66200 stars
Likelihood time = 0.345676 s
*** Non-finite result
Failed to calculate likelihood ***
<background_integral> 0.000127021322682 </background_integral>
<stream_integral>  0.000000000000000  21.097641379193686  135.128876475406910 </stream_integral>
<background_likelihood> -4.749804775987869 </background_likelihood>
<stream_only_likelihood>  -1.#IND00000000000  -3.735624970800870  -173.837339342064330 </stream_only_likelihood>
<search_likelihood> -241.000000000000000 </search_likelihood>
06:09:32 (1384): called boinc_finish

</stderr_txt>
]]>

It would appear that the likelihood calculation is failing, perhaps through a 'divide by zero' or something like that.
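That reading is consistent with the failing output above: the first <stream_integral> component is exactly zero, and a likelihood calculation typically takes the logarithm of (something proportional to) that integral, so log(0) gives -infinity and subsequent arithmetic turns it into a NaN, which the MSVC runtime prints as -1.#IND. A quick Python demonstration of the failure mode (the client's actual likelihood formula may differ):

import math

stream_integral = 0.0  # first component from the failing task above

# In C, log(0.0) quietly returns -inf (Python raises instead), and
# -inf multiplied by 0 produces NaN - printed as -1.#IND00000000000
# by the MSVC runtime.
try:
    math.log(stream_integral)
except ValueError:
    print("math domain error: log(0)")

neg_inf = float("-inf")
print(neg_inf * 0.0)  # nan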

Cheers,
Gary.
Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,143
RAC: 932
Message 55825 - Posted: 16 Oct 2012, 9:49:07 UTC - in response to Message 55823.  

No, there are lots of crunching errors from tasks that were sent out today....

Are you crunching two at a time?

Yesterday, all my hosts (which were crunching two at a time) were rapidly erroring out full caches of tasks. I noticed that if I got a new cache of work and immediately suspended all but one task, then that one task would complete without problems. If I then unsuspended a second task, it would also crunch successfully. The minute I tried to launch a second task with one already running, it would try to start and then error out after about 5 seconds.

No issues with running two at a time here; the tasks that error out for me also error out for everybody else. But it's not that many anymore - yesterday it was just 5 out of over 200.
dskagcommunity
Joined: 26 Feb 11
Posts: 170
Credit: 205,557,553
RAC: 0
Message 55828 - Posted: 16 Oct 2012, 12:47:30 UTC
Last modified: 16 Oct 2012, 12:49:57 UTC

You're right, I have had errors too, up until today.
DSKAG Austria Research Team: http://www.research.dskag.at



Phil
Joined: 29 Aug 10
Posts: 25
Credit: 2,172,252,217
RAC: 0
Message 55831 - Posted: 16 Oct 2012, 20:50:13 UTC

Still about 160 errors today on one of my clients.
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 55833 - Posted: 16 Oct 2012, 21:09:17 UTC - in response to Message 55831.  

I'm going to try to selectively delete the ps_separation_22_edge/free_1s workunits that are causing errors (I don't want to delete the valid workunits, because people would lose credit from them).

I think if there are more than 2 errors on one of these workunits, I should be able to safely delete it and its results so it doesn't keep getting sent out until it hits the bound on errors...
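A sketch of what that cleanup could look like against a standard BOINC server database, in Python - the outcome code and table names are BOINC's usual schema, but the threshold logic here is an assumption for illustration, not a tested script:

import MySQLdb  # assumes the standard BOINC MySQL database

RESULT_OUTCOME_CLIENT_ERROR = 3  # BOINC's result.outcome code for client errors

def find_bad_workunits(db, name_pattern, min_errors=3):
    # Return ids of workunits matching name_pattern that already
    # have more than 2 errored results.
    cur = db.cursor()
    cur.execute(
        "SELECT wu.id, COUNT(r.id) AS errors"
        " FROM workunit wu JOIN result r ON r.workunitid = wu.id"
        " WHERE wu.name LIKE %s AND r.outcome = %s"
        " GROUP BY wu.id HAVING errors >= %s",
        (name_pattern, RESULT_OUTCOME_CLIENT_ERROR, min_errors),
    )
    return [row[0] for row in cur.fetchall()]

# Example: find_bad_workunits(db, 'ps_separation_22_%')
# Safer than DELETEing rows outright is BOINC's own cancellation path:
# setting the workunit's error_mask so the transitioner retires it cleanly.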
Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,143
RAC: 932
Message 55838 - Posted: 17 Oct 2012, 8:15:41 UTC - in response to Message 55833.  

I think if there are more than 2 errors on one of these workunits, I should be able to safely delete it and its results so it doesn't keep getting sent out until it hits the bound on errors...

I wouldn't be so sure about it: click.

Currently the number of errors from separation_22_edge/free_1s is negligible IMHO; ATM I don't have even one of those in my error list. So I'm not sure if it's worth the effort.
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 55853 - Posted: 18 Oct 2012, 22:19:40 UTC - in response to Message 55838.  

I think if there are more than 2 errors on one of these workunits, I should be able to safely delete it and its results so it doesn't keep getting sent out until it hits the bound on errors...

I wouldn't be so sure about it: click.

Currently the number of errors from separation_22_edge/free_1s is negligible IMHO; ATM I don't have even one of those in my error list. So I'm not sure if it's worth the effort.


Okay, so hopefully all the errors out there are just from the NAN/infinity issue. That will put more pressure on the new students to update the binaries and fix the issue...
Matthew
Volunteer moderator
Project developer
Project scientist
Joined: 6 May 09
Posts: 217
Credit: 6,856,375
RAC: 0
Message 55863 - Posted: 19 Oct 2012, 17:23:47 UTC - in response to Message 55853.  

It is possible that the remaining errors are due to outdated BOINC applications or GPU drivers. So far, every erroring computer that I have looked at (which is not all of them) has an outdated GPU driver or BOINC version.

I'm still looking into the problem, but if you are having errors it wouldn't hurt to update your drivers and/or BOINC app and see if that fixes the problem. Please let us know if it does.
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 55865 - Posted: 19 Oct 2012, 17:44:07 UTC
Last modified: 19 Oct 2012, 17:46:07 UTC

Apparently I'm one of those people (I'm running Catalyst driver v12.4), but I'm also one of those people who had zero problems and zero errors before this NAN/infinity issue cropped up, so I'm reluctant to change what worked so well for so long. If you guys had updated the binaries or something, then I could understand the need to update the driver version or the BOINC platform. But if the NAN/infinity issue gets fixed, then technically nothing else should be required for things to go back to normal (error-free) on my end, right?
Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,143
RAC: 932
Message 55867 - Posted: 19 Oct 2012, 22:05:53 UTC - in response to Message 55863.  
Last modified: 19 Oct 2012, 22:18:18 UTC

It is possible that the remaining errors are due to outdated BOINC applications or GPU drivers. So far, every erroring computer that I have looked at (which is not all of them) has an outdated GPU driver or BOINC version.

No, I don't think so.

Example: WU 255498769. The host 436974 is using current nVidia drivers (306.97) and BOINC v7.0.31, yet he got the same error as everybody else.

Most of my wingmen have newer ATI drivers than mine (11.9), but with those strange version numbers it's hard to tell whether they are on the most current one. Quite a lot of them are on BOINC v7.0.28-31 (needed for OpenCL). The errors also occur on CPUs, for example in this WU, where at least the driver should not be an issue.

If you are familiar with ATI driver version numbers, you can check the wingmen of my errored tasks (not many, just 12 ATM); you'll certainly find many that have current drivers, BOINC, and science app versions.

I mean, yes, it happens every now and then that my old CAL app makes a mistake and I get an error (usually just missing output in the stderr, i.e. a validation error), but before the new separation runs the wingmen could always crunch such WUs successfully; now all of them fail as well. It is IMHO highly unlikely that all my errored-out tasks suddenly get resent to wingmen with unusably old drivers and/or BOINC versions.
Jari Pyyluoma
Joined: 6 Mar 09
Posts: 4
Credit: 1,027,033,744
RAC: 0
Message 55870 - Posted: 20 Oct 2012, 11:36:36 UTC

I have noticed that there are so many errors in my output that I have started to question this project. With every day that goes by, my willingness to waste electricity lessens.

I hope you solve your problems soon. Please consider introducing a way for you to kill bad tasks, so that they do not need to be processed needlessly. Other projects kill unneeded tasks remotely. In these circumstances it also seems unnecessary to have a bad task fail 4 times before it is removed, but that is your call.

Even though electricity will still be needlessly wasted, it may be argued that the project is responsible for the tasks failing and the donors should not be punished for that, meaning that credit should be retroactively awarded even for errored tasks.

Tackleway
Joined: 17 Mar 10
Posts: 20
Credit: 5,641,904
RAC: 0
Message 55871 - Posted: 20 Oct 2012, 11:52:44 UTC - in response to Message 55870.  

I have noticed that there are so many errors in my output that I have started to question this project. With every day that goes by, my willingness to waste electricity lessens.

I hope you solve your problems soon. Please consider introducing a way for you to kill bad tasks, so that they do not need to be processed needlessly. Other projects kill unneeded tasks remotely. In these circumstances it also seems unnecessary to have a bad task fail 4 times before it is removed, but that is your call.

Even though electricity will still be needlessly wasted, it may be argued that the project is responsible for the tasks failing and the donors should not be punished for that, meaning that credit should be retroactively awarded even for errored tasks.



Agreed! Also, all the task failures here were processed only on the CPU, so all the chat about different GPU drivers is not addressing the problem.
Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,143
RAC: 932
Message 55884 - Posted: 20 Oct 2012, 15:35:22 UTC - in response to Message 55870.  

I hope you solve your problems soon. Please consider introducing a way for you to kill bad tasks, so that they do not need to be processed needlessly. Other projects kill unneeded tasks remotely. In these circumstances it also seems unnecessary to have a bad task fail 4 times before it is removed, but that is your call.

How would they know that a task is bad before it fails? One or two failures on a good task are not uncommon and certainly do not indicate that a task is bad.



Also, all the task failures here were processed only on the CPU, so all the chat about different GPU drivers is not addressing the problem.

No, most separation WUs are processed on GPUs; n-Body WUs are CPU-only.

Tackleway
Joined: 17 Mar 10
Posts: 20
Credit: 5,641,904
RAC: 0
Message 55891 - Posted: 20 Oct 2012, 20:36:54 UTC - in response to Message 55884.  

Okay, whatever! You know best.