Computation Error of Milkyway@home task

mpujari Joined: 11 Dec 10 Posts: 3 Credit: 5,484 RAC: 0	Message 45290 - Posted: 25 Dec 2010, 6:13:24 UTC

This morning there was a power cut and my UPS shut off, which forced a system restart. After booting up, I checked the status of the task (below) and it says "Computation error", while the message log says the task has finished. The details are below. Is there a way to save the "08:37:12" of computation from a task that has gone into an error state? If anyone knows how to recover the task, please let me know.

Messages
12/25/2010 11:05:32 AM Milkyway@home Restarting task de_nbody_model6_3_34680_1293183807_0 using milkyway_nbody version 21
12/25/2010 11:06:54 AM Milkyway@home Computation for task de_nbody_model6_3_34680_1293183807_0 finished

Task Details
Project : Milkyway@home
Application : Milkyway@home N-Body Simulation 0.21 (sse2)
Name : de_nbody_model6_3_34680_1293183807_0
Elapsed : 08:37:12
Progress : 100.000%
To Completion : ---
Report deadline :
Status : Computation error

Task details from my Milkyway account are below:
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
19:30:21: Starting fresh nbody run
19:30:21: Starting nbody system
<plummer_r> 53.371822392274 -12.240286451316 -51.361384548074 </plummer_r>
<plummer_v> 36.686878292596 -74.28378407906 -36.093081147496 </plummer_v>
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
19:48:14: Checkpoint exists. Attempting to resume from it.
Thawing state
19:48:14: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
04:30:36: Checkpoint exists. Attempting to resume from it.
Thawing state
04:30:36: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
06:20:44: Checkpoint exists. Attempting to resume from it.
Thawing state
06:20:44: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
06:24:51: Checkpoint exists. Attempting to resume from it.
Thawing state
06:24:51: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
06:27:41: Checkpoint exists. Attempting to resume from it.
Thawing state
06:27:41: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
06:32:43: Checkpoint exists. Attempting to resume from it.
Thawing state
06:32:43: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
08:03:53: Checkpoint exists. Attempting to resume from it.
Thawing state
08:03:53: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
10:12:48: Checkpoint exists. Attempting to resume from it.
Thawing state
10:12:48: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
17:50:14: Checkpoint exists. Attempting to resume from it.
Thawing state
Failed to find end marker in checkpoint file.
Thawing state failed
17:50:14: Failed to read checkpoint
17:50:14: Starting fresh nbody run
17:50:14: Starting nbody system
<plummer_r> 53.371822392274 -12.240286451316 -51.361384548074 </plummer_r>
<plummer_v> 36.686878292596 -74.28378407906 -36.093081147496 </plummer_v>
Failed to update checkpoint with temporary: No error
Failed to write checkpoint
Failed to write checkpoint
17:51:32 (3812): called boinc_finish
</stderr_txt>
]]>

Len LE/GE Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0	Message 45293 - Posted: 25 Dec 2010, 9:44:54 UTC

Seems you got caught at the worst moment. A second earlier or later and you would have been fine.

Your computer lost power while writing the checkpoint for the WU. When restarting, the app could not find a proper checkpoint state ('Failed to find end marker in checkpoint file.') and had to start the WU from the beginning again. The next checkpoint could not overwrite the old file (most likely Windows still had it marked as open for writing), and so the app declared a computation error.
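To illustrate the 'end marker' check mentioned above, here is a minimal sketch of how a truncated checkpoint can be detected on load. It is not the nbody application's actual format: the marker bytes and the function names `write_checkpoint` and `read_checkpoint` are hypothetical, chosen only for the example.

```python
from typing import Optional

# Hypothetical marker; the real nbody checkpoint format is not shown in this thread.
END_MARKER = b"<<END-OF-CHECKPOINT>>"


def write_checkpoint(path: str, payload: bytes) -> None:
    """Write the payload first and the end marker last."""
    with open(path, "wb") as f:
        f.write(payload)
        f.write(END_MARKER)


def read_checkpoint(path: str) -> Optional[bytes]:
    """Return the payload, or None if the file is missing or incomplete."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    if not data.endswith(END_MARKER):
        # The write stopped before the marker was written, so the
        # contents cannot be trusted.
        return None
    return data[: -len(END_MARKER)]
```

Because the marker is written last, a write that is cut off partway leaves a file without it, and the loader can refuse to resume from that file instead of guessing.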

mpujari Joined: 11 Dec 10 Posts: 3 Credit: 5,484 RAC: 0	Message 45295 - Posted: 25 Dec 2010, 16:58:39 UTC - in response to Message 45293.

So you mean to say there is no way to recover from the error?

Len LE/GE Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0	Message 45302 - Posted: 26 Dec 2010, 4:06:09 UTC - in response to Message 45295.

No chance, sorry.

The checkpoint file holds the information about how far the application had gotten and the intermediate result at that point. (Your log shows it resumed from a checkpoint 8 times when switching between applications.) With that file damaged (an incomplete checkpoint was written because of the power loss), the application had no way to tell how far it had gotten on that WU. Guessing with only part of the checkpoint data gives unreliable results at best; missing parts will most likely lead to wrong results. At that point your 8.5 hours of crunching were already lost.

The application then tried to start the WU from the beginning again, to save at least the downloaded WU. That failed too, because the damaged checkpoint file could not even be overwritten. So the server was notified of a 'Computation error' and knows it has to send the WU to another computer to be worked on.

There should be a program that watches your UPS status and does a proper system shutdown before the UPS runs out of power. That should prevent a worst case like the one you just had.

One could think of a more complex checkpoint system where every checkpoint is written to a different file and held until the WU is finished. That way an application could go back through the checkpoint history until a valid one is found and recover from there, even if a single checkpoint file is locked. I would be surprised if this idea is new, but I haven't seen it in any BOINC project yet.

Ascholten Joined: 2 Nov 10 Posts: 17 Credit: 4,224,561 RAC: 0	Message 45310 - Posted: 26 Dec 2010, 14:53:25 UTC - in response to Message 45302.

Or it could just write a new file with a different name, possibly numbered, and dump the previous one. For example:

write checkpoint 1
write checkpoint 2, is it valid? dump checkpoint 1
write checkpoint 3, is it valid? dump checkpoint 2
write checkpoint 4, is it valid? dump checkpoint 3

Just do a simple verify after a checkpoint is written to disk. If the one last written is good, go ahead and dump the previous one. If the system crashes in the middle of a job or, as in his case, in the middle of a write, let's say in the middle of checkpoint 3, that file is corrupt, so it backs down to checkpoint 2, continues from there, and checkpoint 3 eventually gets a valid write. This way you are not holding on to a bunch of unneeded files.

Aaron

Matt Arsenault Volunteer moderator Project developer Project tester Project scientist Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0	Message 45318 - Posted: 26 Dec 2010, 17:57:47 UTC - in response to Message 45302.

Quoting Message 45302: "One could think of a more complex checkpoint system where every checkpoint is written to a different file and held until the WU is finished. That way an application could go back through the checkpoint history until a valid one is found and recover from there, even if a single checkpoint file is locked. I would be surprised if this idea is new, but I haven't seen it in any BOINC project yet."

The checkpointing would ideally be an atomic update, and it is on Linux / OS X. There isn't actually a way to do an atomic file update with the Win32 API pre-Vista (and even then, it's stupidly complex). The checkpointing writes to a temporary file and then does an atomic rename() to replace the checkpoint. On Windows, however, this atomic rename doesn't exist and is replaced with some kind of file move, so it can break if interrupted at the wrong time.
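The write-to-temporary-then-rename pattern described above looks roughly like this in Python. It is a sketch of the general technique, not the project's code; the helper name `atomic_write` and the temporary-file prefix are made up for the example. Python's `os.replace` is documented as atomic on POSIX when it succeeds; the sketch does not settle the Windows behaviour discussed in this post.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` by writing a temp file, then renaming it."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one
    # filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # get the bytes onto disk before renaming
        # Readers see either the old checkpoint or the new one at `path`.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, leave the old checkpoint untouched and clean up.
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this pattern an interrupted write damages only the temporary file, so the previous checkpoint remains readable on the next start.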

mpujari Joined: 11 Dec 10 Posts: 3 Credit: 5,484 RAC: 0	Message 45469 - Posted: 10 Jan 2011, 3:38:17 UTC

Yet again :( I lost 39 hours of CPU time. The task "de_nbody_model2_3_112303_1294298757_0" ended in a computation error. If there is any way I can recover this, please help me out.

Ascholten Joined: 2 Nov 10 Posts: 17 Credit: 4,224,561 RAC: 0	Message 45472 - Posted: 10 Jan 2011, 5:27:14 UTC - in response to Message 45469.

I think once something is lost like that, it's gone. Sorry.

For what it's worth, though, I do tend to notice problems when I do things on my computer that can turn CPU intensive or potentially stall until something requested is completed. Viewing some web pages is an example: the browser will lock up for a few seconds until whatever crap is on the page finally loads all the way, or until the app it just called to life finally initializes. During events like this, I have seen it cause a computation error.

I do not know what you have your settings at, but if you are going to be doing something that can get memory or cycle intensive, I would recommend lowering your max CPU use number or changing the 'suspend work if over' xx percent setting to a lower number. That way, if something does kick in (oh, and virus programs are another notorious resource hog when they start a scan or update), your work in progress on the project will temporarily suspend until after the event, which helps avoid an error.

Also, doing things like changing system settings, screen resolutions, frame rates, etc. can kill you as well. If you must change something like that, suspend first.

Not sure if this info will help you, but it can't hurt to try.

Aaron