Apology for recent bad batches of workunits

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 55799 - Posted: 14 Oct 2012, 22:11:37 UTC

Hi Everyone,

Just wanted to send out an apology for all the problematic workunits being sent out lately. This was mostly my fault -- we have quite a few new students using MilkyWay@Home for their research -- and the code to start up new searches wasn't doing a very good job of ensuring the quality of the input files they were submitting to it.

We didn't have much of a problem with this in the past, as it was mostly just me and Matt Newby submitting jobs and we knew what we were doing; but the client applications are rather finicky about extra whitespace and the like in the input files, and that was causing a lot of the crashing workunits.

I'm in the process of updating the search submission code to be a lot more robust and catch these errors before sending things out, so hopefully this won't be as much of an issue in the future (hopefully not an issue at all!).
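As an illustration of the kind of check described here, a minimal Python sketch (hypothetical; the actual MilkyWay@Home submission code is not shown in this thread), assuming a whitespace-separated, purely numeric input format:

import sys

def validate_parameter_file(path):
    """Collect formatting problems of the kind described above: stray
    leading/trailing whitespace, blank lines, and fields that do not
    parse as numbers. An empty return list means the file looks clean."""
    errors = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            body = line.rstrip("\n")
            if not body.strip():
                errors.append("line %d: blank line" % lineno)
                continue
            if body != body.strip():
                errors.append("line %d: leading/trailing whitespace" % lineno)
            for field in body.split():
                try:
                    float(field)
                except ValueError:
                    errors.append("line %d: non-numeric field %r" % (lineno, field))
    return errors

if __name__ == "__main__":
    problems = validate_parameter_file(sys.argv[1])
    if problems:
        print("refusing to start search; input file problems:")
        for p in problems:
            print("  " + p)
        sys.exit(1)
    print("input file looks clean")

A submission script could refuse to start a search whenever that list is non-empty, which is the "catch these errors before sending things out" step described above.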

So again, I want to apologize for the lack of robustness in some of my code that was causing the crashes over the last few days. We're looking into the issue that's causing the last bit of errors that seem to be happening out there (it looks like there's some error in the client code that is causing NaNs with certain input parameters, which results in the client reporting an error for the workunit). I hope we can get that cleared up soon as well.

Happy crunching!
--Travis
Tex1954
Joined: 22 Apr 11
Posts: 66
Credit: 904,190,828
RAC: 35,136
Message 55801 - Posted: 15 Oct 2012, 7:25:15 UTC

Hey! Y'all are doing a great job and I KNEW y'all were on top of things...

Keep up the good work!!!

8-)
Tackleway
Joined: 17 Mar 10
Posts: 20
Credit: 5,641,904
RAC: 0
Message 55803 - Posted: 15 Oct 2012, 10:12:06 UTC

Hi Travis,

Still having a few invalids occurring on machines which are usually very stable.
See Tasks below :-(

Task: 320542150
Name: de_separation_22_3s_free_3_1350173304_527364_1
Task: 320410712
Name: de_separation_22_3s_free_3_1350173304_448209_1
Task: 320192784
Name: de_separation_22_3s_free_3_1350173304_329191_0

Best regards, Tackleway


dskagcommunity
Joined: 26 Feb 11
Posts: 170
Credit: 205,557,553
RAC: 0
Message 55809 - Posted: 15 Oct 2012, 14:40:03 UTC

I think we must wait some days until all the erroneous workunits have been marked invalid by multiple machines and are no longer sent out?
DSKAG Austria Research Team: http://www.research.dskag.at



Dataman
Joined: 5 Sep 08
Posts: 28
Credit: 245,585,043
RAC: 0
Message 55812 - Posted: 15 Oct 2012, 16:16:32 UTC

No harm, no foul as far as I am concerned. Thanks for the weekend support.
My last error was 15 Oct 2012 | 15:51:01 UTC.

astro-marwil
Joined: 3 Jul 12
Posts: 13
Credit: 7,601,982
RAC: 0
Message 55820 - Posted: 15 Oct 2012, 22:43:38 UTC - in response to Message 55809.  
Last modified: 15 Oct 2012, 22:54:44 UTC

Hello!
I think we must wait some days until all the erroneous workunits have been marked invalid by multiple machines and are no longer sent out?

No, there are lots of crunching errors from tasks that were sent out today. I don't think the rate of errors has dropped significantly. Especially the tasks of "de_separation_22_3s_free_xxxxxxx" tend to fail. Furthermore, many more tasks are waiting for validation. All the erroneous tasks end up with error code 1 (0x1), incorrect function. This is not limited to Win7/Vista; some also come from Linux.

Kind regards and happy crunching.
Martin
Gary Roberts
Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,499
RAC: 0
Message 55823 - Posted: 16 Oct 2012, 0:00:28 UTC - in response to Message 55820.  

No, there are lots of crunching errors from tasks that were sent out today....

Are you crunching two at a time?

Yesterday, all my hosts (which were crunching two at a time) were rapidly erroring out full caches of tasks. I noticed that if I got a new cache of work and immediately suspended all but one task, then that one task would complete without problems. If I then unsuspended a second task, it would also crunch successfully. The minute I tried to launch a second task with one already running, it would try to start and then error out after about 5 seconds.

So I reconfigured all my hosts to only crunch one at a time and the major problem went away. All have been crunching without further issue for about 12 hours now. I've had a look through a couple of lists of completed tasks and there are occasional (and quite different) failures - I would estimate about 5% of tasks are failing like this. The tasks that do fail seem to run to normal completion and then fail right at the very end.

A normal and successful completion shows the following right at the end of the stderr.txt output:

Integration time: 26.084413 s. Average time per iteration = 40.756896 ms
Integral 2 time = 26.670313 s
Running likelihood with 66200 stars
Likelihood time = 0.281871 s
<background_integral> 0.000229475606607 </background_integral>
<stream_integral>  29.075788494514907  1751.674920113726300  265.410894993209520 </stream_integral>
<background_likelihood> -3.630395539836397 </background_likelihood>
<stream_only_likelihood>  -50.559887396327156  -4.291799741661306  -3.193754139881464 </stream_only_likelihood>
<search_likelihood> -2.933654500572366 </search_likelihood>
06:07:42 (4016): called boinc_finish

</stderr_txt>
]]>


A task that fails has the following output (note the 5th and 6th lines - the *** are mine):

Integration time: 26.096254 s. Average time per iteration = 40.775398 ms
Integral 2 time = 26.683731 s
Running likelihood with 66200 stars
Likelihood time = 0.345676 s
*** Non-finite result
Failed to calculate likelihood ***
<background_integral> 0.000127021322682 </background_integral>
<stream_integral>  0.000000000000000  21.097641379193686  135.128876475406910 </stream_integral>
<background_likelihood> -4.749804775987869 </background_likelihood>
<stream_only_likelihood>  -1.#IND00000000000  -3.735624970800870  -173.837339342064330 </stream_only_likelihood>
<search_likelihood> -241.000000000000000 </search_likelihood>
06:09:32 (1384): called boinc_finish

</stderr_txt>
]]>

It would appear that the likelihood calculation is failing, perhaps through a 'divide by zero' or something like that.
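For illustration, a small Python sketch of that reading (not the project's actual C code): once an integral underflows to exactly zero, IEEE-754 arithmetic propagates -inf and NaN in just the pattern the failing stderr shows.

import math

# In C, log(0.0) yields -inf and 0*inf yields NaN; Python's math.log(0.0)
# raises instead, so explicit IEEE-754 values are used to mimic the client.
neg_inf = float("-inf")

stream_integral = 0.0  # the first value in the failing <stream_integral> line

# A log-likelihood term of the form log(stream_integral) is -inf when the
# integral is exactly zero:
log_term = math.log(stream_integral) if stream_integral > 0.0 else neg_inf
print(log_term)                    # -inf

# Arithmetic that combines such terms readily produces NaN:
print(neg_inf - neg_inf)           # nan (inf - inf is undefined)
print(0.0 * neg_inf)               # nan (0 * inf is undefined)

# Windows' printf renders NaN as -1.#IND, matching the stderr above, and a
# sanity check on the final value would then report "Non-finite result":
print(math.isnan(0.0 * neg_inf))   # True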

Cheers,
Gary.
Link
Joined: 19 Jul 10
Posts: 624
Credit: 19,302,355
RAC: 2,564
Message 55825 - Posted: 16 Oct 2012, 9:49:07 UTC - in response to Message 55823.  

No, there are lots of crunching errors from tasks that were sent out today....

Are you crunching two at a time?

Yesterday, all my hosts (which were crunching two at a time) were rapidly erroring out full caches of tasks. I noticed that if I got a new cache of work and immediately suspended all but one task, then that one task would complete without problems. If I then unsuspended a second task, it would also crunch successfully. The minute I tried to launch a second task with one already running, it would try to start and then error out after about 5 seconds.

No issues with running two at a time here; the tasks that error out for me also error out for everybody else. But it's not that many anymore, yesterday it was just 5 out of over 200.
dskagcommunity
Joined: 26 Feb 11
Posts: 170
Credit: 205,557,553
RAC: 0
Message 55828 - Posted: 16 Oct 2012, 12:47:30 UTC
Last modified: 16 Oct 2012, 12:49:57 UTC

You're right, I've had errors too, up until today.
DSKAG Austria Research Team: http://www.research.dskag.at



Phil
Joined: 29 Aug 10
Posts: 25
Credit: 2,172,252,217
RAC: 0
Message 55831 - Posted: 16 Oct 2012, 20:50:13 UTC

Still about 160 errors today on one of my clients.
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 55833 - Posted: 16 Oct 2012, 21:09:17 UTC - in response to Message 55831.  

I'm going to try to selectively delete the ps_separation_22_edge/free_1s workunits that are causing errors (I don't want to delete the valid workunits, because people would lose credit from them).

I think if there are more than 2 errors on one of these workunits, I should be able to safely delete it and its results so it doesn't keep getting sent out until it hits the bound on errors...
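A sketch of what such a cleanup could look like against a stock BOINC server database (Python with mysql.connector; the table and column names and outcome = 3 for "client error" follow the standard BOINC schema, but the credentials and the name pattern are assumptions, not the project's actual script):

import mysql.connector

# Hypothetical connection details; a real project would use its own config.
db = mysql.connector.connect(user="boincadm", password="...", database="milkyway")
cur = db.cursor()

# Workunits from the bad batches that already have more than 2 errored
# results (result.outcome = 3 is "client error" in the stock BOINC schema).
cur.execute("""
    SELECT r.workunitid
      FROM result r
      JOIN workunit w ON w.id = r.workunitid
     WHERE w.name LIKE 'ps_separation_22_%'
       AND r.outcome = 3
     GROUP BY r.workunitid
    HAVING COUNT(*) > 2
""")
bad_wus = [row[0] for row in cur.fetchall()]

# Delete the results first, then the workunit itself, so no orphaned
# results are left pointing at a deleted workunit.
for wuid in bad_wus:
    cur.execute("DELETE FROM result WHERE workunitid = %s", (wuid,))
    cur.execute("DELETE FROM workunit WHERE id = %s", (wuid,))
db.commit()

The HAVING COUNT(*) > 2 filter implements the threshold described above, so workunits with only the occasional chance failure are left to validate normally.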
Link
Joined: 19 Jul 10
Posts: 624
Credit: 19,302,355
RAC: 2,564
Message 55838 - Posted: 17 Oct 2012, 8:15:41 UTC - in response to Message 55833.  

I think if there are more than 2 errors on one of these workunits, I should be able to safely delete it and its results so it doesn't keep getting sent out until it hits the bound on errors...

I wouldn't be so sure about it: click.

Currently the number of errors from separation_22_edge/free_1s is negligible IMHO; ATM I don't have even one of those in my error list. So I'm not sure it's worth the effort.
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 55853 - Posted: 18 Oct 2012, 22:19:40 UTC - in response to Message 55838.  

I think if there are more than 2 errors on one of these workunits, I should be able to safely delete it and its results so it doesn't keep getting sent out until it hits the bound on errors...

I wouldn't be so sure about it: click.

Currently the number of errors from separation_22_edge/free_1s is negligible IMHO; ATM I don't have even one of those in my error list. So I'm not sure it's worth the effort.


Okay, so hopefully all the remaining errors out there are just from the NaN/infinity issue. There will be more pressure on the new students to update the binaries and fix the issue...
Matthew
Volunteer moderator
Project developer
Project scientist
Joined: 6 May 09
Posts: 217
Credit: 6,856,375
RAC: 0
Message 55863 - Posted: 19 Oct 2012, 17:23:47 UTC - in response to Message 55853.  

It is possible that the remaining errors are due to outdated BOINC versions or GPU drivers. So far, every erroring computer that I have looked at (which is not all of them) has an outdated GPU driver or BOINC version.

I'm still looking into the problem, but if you are having errors it wouldn't hurt to update your drivers and/or BOINC and see if that fixes the problem. Please let us know if it does.
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 55865 - Posted: 19 Oct 2012, 17:44:07 UTC
Last modified: 19 Oct 2012, 17:46:07 UTC

Apparently I'm one of those people (I'm running Catalyst driver v12.4), but I'm also one of those people who had zero problems and zero errors before this NaN/infinity issue cropped up, so I'm reluctant to change what worked so well for so long. If you guys had updated the binaries or something, then I could understand the need to update the driver version or the BOINC platform. But if the NaN/infinity issue gets fixed, then technically nothing else should be required for things to go back to normal (error-free) on my end, right?
Link
Joined: 19 Jul 10
Posts: 624
Credit: 19,302,355
RAC: 2,564
Message 55867 - Posted: 19 Oct 2012, 22:05:53 UTC - in response to Message 55863.  
Last modified: 19 Oct 2012, 22:18:18 UTC

It is possible that the remaining errors are due to outdated BOINC versions or GPU drivers. So far, every erroring computer that I have looked at (which is not all of them) has an outdated GPU driver or BOINC version.

No, I don't think so.

Example: WU 255498769. Host 436974 is using current nVidia drivers (306.97) and BOINC v7.0.31, yet he got the same error as everybody else.

Most of my wingmen have newer ATI drivers than mine (11.9), but with those strange version numbers it's hard to tell if they are on the most current one. Quite a lot of them are on BOINC v7.0.28-31 (needed for OpenCL). The errors also occur on CPUs, for example in this WU, where at least the driver should not be an issue.

If you are familiar with ATI driver version numbers, you can check the wingmen of my errored tasks (not many, just 12 ATM); you'll surely find many that have current drivers, BOINC, and science app versions.

I mean, yes, it happens every now and then that my old CAL app makes a mistake and I get an error (usually just missing output in the std_err, i.e. a validation error), but before the new separation runs the wingmen could always crunch such WUs successfully; now all of them fail as well. It is IMHO highly unlikely that all my errored-out tasks suddenly get resent to wingmen with unusably old drivers and/or BOINC versions.
Jari Pyyluoma
Joined: 6 Mar 09
Posts: 4
Credit: 1,027,033,744
RAC: 0
Message 55870 - Posted: 20 Oct 2012, 11:36:36 UTC

I have noticed that there are so many errors in my output that I have started to question this project. With every day that goes by, my willingness to waste electricity lessens.

I hope you solve your problems soon. Please consider introducing a way to kill bad tasks so that they do not need to be processed needlessly. Other projects kill unneeded tasks remotely. In these circumstances it also seems unnecessary to have a bad task fail 4 times before removing it, but that is your call.

Even though electricity will still be needlessly wasted, it may be argued that the project is responsible for the tasks failing and the donors should not be punished for that, meaning that credits should be retroactively awarded even for errored tasks.

Tackleway
Joined: 17 Mar 10
Posts: 20
Credit: 5,641,904
RAC: 0
Message 55871 - Posted: 20 Oct 2012, 11:52:44 UTC - in response to Message 55870.  

I have noticed that there are so many errors in my output that I have started to question this project. With every day that goes by, my willingness to waste electricity lessens.

I hope you solve your problems soon. Please consider introducing a way to kill bad tasks so that they do not need to be processed needlessly. Other projects kill unneeded tasks remotely. In these circumstances it also seems unnecessary to have a bad task fail 4 times before removing it, but that is your call.

Even though electricity will still be needlessly wasted, it may be argued that the project is responsible for the tasks failing and the donors should not be punished for that, meaning that credits should be retroactively awarded even for errored tasks.



Agreed! Also, all the task failures here were processed only by CPU, so all the chat about different GPU drivers is not addressing the problem.
Link
Joined: 19 Jul 10
Posts: 624
Credit: 19,302,355
RAC: 2,564
Message 55884 - Posted: 20 Oct 2012, 15:35:22 UTC - in response to Message 55870.  

I hope you solve your problems soon. Please consider introducing a way to kill bad tasks so that they do not need to be processed needlessly. Other projects kill unneeded tasks remotely. In these circumstances it also seems unnecessary to have a bad task fail 4 times before removing it, but that is your call.

How should they know that a task is bad before it fails? One or two failures on a good task are nothing uncommon and certainly do not indicate that the task is bad.
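A toy calculation supports this caution; the numbers are illustrative assumptions (Gary's rough 5% per-result failure estimate from earlier in the thread and an arbitrary batch size), not project statistics:

# Chance alone will stack 2 or even 3 failures on some good workunits.
p_fail = 0.05      # assumed probability that a single result errors on a good WU
wus = 100_000      # assumed number of workunits in flight

for k in (2, 3):
    p = p_fail ** k
    print("P(%d chance failures on one WU) = %.4f%%; "
          "expected good WUs affected out of %d: %.1f"
          % (k, p * 100.0, wus, wus * p))

With those assumptions, a kill-after-2 rule would hit roughly 250 good workunits per 100,000, while waiting for a third failure cuts that to about 12, which is one way to read the "more than 2 errors" threshold proposed above.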



Also, all the task failures here were processed only by CPU, so all the chat about different GPU drivers is not addressing the problem.

No, most separation WUs are processed on GPUs; n-Body WUs are CPU-only.

Tackleway
Joined: 17 Mar 10
Posts: 20
Credit: 5,641,904
RAC: 0
Message 55891 - Posted: 20 Oct 2012, 20:36:54 UTC - in response to Message 55884.  

Okay, whatever! You know best.