
Computation Error of Milkyway@home task

mpujari

Joined: 11 Dec 10
Posts: 3
Credit: 5,484
RAC: 0
Message 45290 - Posted: 25 Dec 2010, 6:13:24 UTC

This morning there was a power cut and my UPS ran out, which caused the system to restart. After booting up, the status of the task (below) shows "Computation error", yet the message log says the computation finished. The details are below. Is there a way to save the "08:37:12" of computed work on the task that has gone into the error state?

If anyone knows how to recover the task, please let me know.

Messages
12/25/2010 11:05:32 AM Milkyway@home Restarting task de_nbody_model6_3_34680_1293183807_0 using milkyway_nbody version 21
12/25/2010 11:06:54 AM Milkyway@home Computation for task de_nbody_model6_3_34680_1293183807_0 finished


Task Details
Project : Milkyway@home
Application : Milkyway@home N-Body Simulation 0.21 (sse2)
Name : de_nbody_model6_3_34680_1293183807_0
Elapsed : 08:37:12
Progress : 100.000%
To Completion : ---
Report deadline :
Status : Computation error


The task details from my Milkyway@home account are below:

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
19:30:21: Starting fresh nbody run
19:30:21: Starting nbody system
<plummer_r> 53.371822392274 -12.240286451316 -51.361384548074 </plummer_r>
<plummer_v> 36.686878292596 -74.28378407906 -36.093081147496 </plummer_v>
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
19:48:14: Checkpoint exists. Attempting to resume from it.
Thawing state
19:48:14: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
04:30:36: Checkpoint exists. Attempting to resume from it.
Thawing state
04:30:36: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
06:20:44: Checkpoint exists. Attempting to resume from it.
Thawing state
06:20:44: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
06:24:51: Checkpoint exists. Attempting to resume from it.
Thawing state
06:24:51: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
06:27:41: Checkpoint exists. Attempting to resume from it.
Thawing state
06:27:41: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
06:32:43: Checkpoint exists. Attempting to resume from it.
Thawing state
06:32:43: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
08:03:53: Checkpoint exists. Attempting to resume from it.
Thawing state
08:03:53: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
10:12:48: Checkpoint exists. Attempting to resume from it.
Thawing state
10:12:48: Successfully read checkpoint
<search_application>milkywayathome nbody 0.21 Windows x86 double</search_application>
17:50:14: Checkpoint exists. Attempting to resume from it.
Thawing state
Failed to find end marker in checkpoint file.
Thawing state failed
17:50:14: Failed to read checkpoint
17:50:14: Starting fresh nbody run
17:50:14: Starting nbody system
<plummer_r> 53.371822392274 -12.240286451316 -51.361384548074 </plummer_r>
<plummer_v> 36.686878292596 -74.28378407906 -36.093081147496 </plummer_v>
Failed to update checkpoint with temporary: No error
Failed to write checkpoint
Failed to write checkpoint
17:51:32 (3812): called boinc_finish

</stderr_txt>
]]>
Len LE/GE

Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0
Message 45293 - Posted: 25 Dec 2010, 9:44:54 UTC

Seems you got caught at the worst moment. A second earlier or later and you would have been fine.
Your computer lost power while writing the checkpoint for the WU. When restarting, the app could not find a proper checkpoint state ('Failed to find end marker in checkpoint file.') and had to start the WU from the beginning again.
The next checkpoint could not overwrite the old file (most likely Windows still had it marked as open for writing), so the app declared a computation error.
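
A minimal sketch of the end-marker check described above, written in Python for illustration rather than in the app's own language. The marker bytes, file layout, and function names are assumptions made up for this example; they are not the real nbody checkpoint format.

```python
END_MARKER = b"END_CHECKPOINT"  # hypothetical marker, not the real nbody format


def write_checkpoint(path, payload):
    """Write the payload first and the end marker last."""
    with open(path, "wb") as f:
        f.write(payload)
        f.write(END_MARKER)


def read_checkpoint(path):
    """Return the payload, or None if the file is missing or incomplete."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    if not data.endswith(END_MARKER):
        # The write stopped before the marker was written, e.g. on power loss.
        return None
    return data[: -len(END_MARKER)]
```

Because the marker is the last thing written, a file that was cut off part-way through never ends with it, so the reader can reject it instead of resuming from partial data.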
mpujari

Joined: 11 Dec 10
Posts: 3
Credit: 5,484
RAC: 0
Message 45295 - Posted: 25 Dec 2010, 16:58:39 UTC - in response to Message 45293.  

So you mean there is no way to recover from the error?
Len LE/GE

Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0
Message 45302 - Posted: 26 Dec 2010, 4:06:09 UTC - in response to Message 45295.  

No chance, sorry.
The checkpoint file records how far the application had got and the intermediate result at that point. (Your log shows it resumed from a checkpoint 8 times, e.g. when switching between applications.) With that file damaged (an incomplete checkpoint was written because of the power loss), the application had no way to tell how far it had got on that WU. Guessing from only part of the checkpoint data gives unreliable results at best; missing parts will most likely give wrong results.
At that point your 8.5h of crunching was already lost.
The application then tried to start the WU from the beginning, to save at least the downloaded WU, but that failed too because the damaged checkpoint file could not even be overwritten.
So the server was notified of a 'Computation error' and knows it has to send the WU to another computer to be worked on.

There should be a program to watch your UPS status and to do a proper system shutdown before the UPS is out of power. That should prevent a worst case like you just had.

One could think of a more complex checkpoint system where every checkpoint is written to a different file and held until the WU is finished. That way an application could go back through the checkpoint history until a valid one is found and recover from there, even if a single checkpoint file is locked. I would be surprised if this idea is new, but I haven't seen it in any BOINC project yet.
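
The idea of walking back through a checkpoint history could look roughly like the sketch below. It is only an illustration in Python: the `checkpoint_<number>.dat` naming, the end marker, and the function names are all assumptions, not anything the nbody app or BOINC actually provides.

```python
import glob
import os

END_MARKER = b"END_CHECKPOINT"  # hypothetical marker for this sketch


def is_valid(path):
    """A checkpoint counts as valid only if it ends with the end marker."""
    try:
        with open(path, "rb") as f:
            return f.read().endswith(END_MARKER)
    except OSError:
        return False  # missing, locked, or unreadable: treat as invalid


def checkpoint_number(path):
    """Extract N from a file named checkpoint_N.dat (assumed naming scheme)."""
    name = os.path.basename(path)
    return int(name[len("checkpoint_"):-len(".dat")])


def newest_valid_checkpoint(workdir):
    """Go from the newest checkpoint to the oldest; return the first valid one."""
    paths = glob.glob(os.path.join(workdir, "checkpoint_*.dat"))
    for path in sorted(paths, key=checkpoint_number, reverse=True):
        if is_valid(path):
            return path
    return None  # nothing usable: the WU has to start from the beginning
```

The trade-off is disk space: every checkpoint stays on disk until the WU finishes, in exchange for being able to skip over a damaged or locked file.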
Ascholten

Joined: 2 Nov 10
Posts: 17
Credit: 4,224,561
RAC: 0
Message 45310 - Posted: 26 Dec 2010, 14:53:25 UTC - in response to Message 45302.  

Or it could just write each new file under a different name, possibly numbered, and dump the second-newest one.

For example:

write checkpoint 1
write checkpoint 2; if it is valid, dump checkpoint 1
write checkpoint 3; if it is valid, dump checkpoint 2
write checkpoint 4; if it is valid, dump checkpoint 3

Just do a simple verify after a checkpoint is written to disk. If the last one written is good, go ahead and dump the previous one. If the system crashes in the middle of a job or, as in his case, in the middle of a write (say, in the middle of checkpoint 3), that file is corrupt, so the app backs down to checkpoint 2, continues from there, and checkpoint 3 eventually gets a valid write. This way you are not holding on to a bunch of unneeded files.
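
A rough Python sketch of the write, verify, then dump-the-previous scheme above. The CRC32 prefix used as the "simple verify", the file naming, and the function names are assumptions chosen for this example only.

```python
import os
import zlib


def checkpoint_path(workdir, number):
    """Assumed naming scheme: checkpoint_<number>.dat."""
    return os.path.join(workdir, f"checkpoint_{number}.dat")


def write_record(path, payload):
    """Store a 4-byte CRC32 of the payload, followed by the payload itself."""
    record = zlib.crc32(payload).to_bytes(4, "big") + payload
    with open(path, "wb") as f:
        f.write(record)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk


def verify(path):
    """Re-read the file and check that the stored CRC32 matches the payload."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except OSError:
        return False
    return len(data) >= 4 and zlib.crc32(data[4:]).to_bytes(4, "big") == data[:4]


def checkpoint(workdir, number, payload):
    """Write checkpoint <number>; dump <number - 1> only if the new one verifies."""
    new_path = checkpoint_path(workdir, number)
    write_record(new_path, payload)
    if not verify(new_path):
        return False  # keep the previous checkpoint as the fallback
    old_path = checkpoint_path(workdir, number - 1)
    if os.path.exists(old_path):
        os.remove(old_path)
    return True
```

At most two checkpoint files exist at any moment, and the older one is only deleted after the newer one has been read back successfully. Note that reading back right after writing may be served from the OS cache, so this check is weaker than it looks against a sudden power loss.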

Aaron
Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0
Message 45318 - Posted: 26 Dec 2010, 17:57:47 UTC - in response to Message 45302.  

Len LE/GE wrote: "One could think of a more complex checkpoint system where every checkpoint is written to a different file and held until the WU is finished. That way an application could go back through the checkpoint history until a valid one is found and recover from there, even if a single checkpoint file is locked. I would be surprised if this idea is new, but I haven't seen it in any BOINC project yet."
Ideally the checkpoint would be an atomic update, and it is on Linux/OS X. There isn't actually a way to do an atomic file update with the Win32 API pre-Vista (and even then, it's stupidly complex). The checkpointing writes to a temporary file and then does an atomic rename() to replace the checkpoint. On Windows, however, this atomic rename doesn't exist and is replaced with some kind of file move, which can break if interrupted at the wrong time.
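
For readers unfamiliar with the pattern, here is a minimal Python sketch of the write-to-temporary-then-rename approach. It is not the nbody app's code; `atomic_write` and the `.tmp` suffix are names made up for this example. Python's `os.replace` is atomic on POSIX systems; on Windows it overwrites the destination, but this sketch does not assume the same atomicity guarantee there.

```python
import os
import tempfile


def atomic_write(path, payload):
    """Write payload to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        # Atomic on POSIX; no such guarantee is assumed here for Windows.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # do not leave half-written temporaries behind
        raise
```

With an atomic rename, a reader sees either the complete old checkpoint or the complete new one, never a half-written file.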
mpujari

Joined: 11 Dec 10
Posts: 3
Credit: 5,484
RAC: 0
Message 45469 - Posted: 10 Jan 2011, 3:38:17 UTC

Yet again :( I lost 39 hours of CPU time. The task "de_nbody_model2_3_112303_1294298757_0" ended in a computation error. If there is any way I can recover this, please help me out.
Ascholten

Joined: 2 Nov 10
Posts: 17
Credit: 4,224,561
RAC: 0
Message 45472 - Posted: 10 Jan 2011, 5:27:14 UTC - in response to Message 45469.  

I think once something is lost like that, it's gone. Sorry. For what it's worth, though, I do tend to notice computation errors when I do things on my computer that turn CPU-intensive or can stall until some request completes. One example is viewing web pages where the browser locks up for a few seconds until whatever crap is on the page finally loads all the way, or until the app it just called to life finishes initializing. During events like this, I have seen a computation error occur.

I do not know what your settings are, but if you are going to be doing something memory- or cycle-intensive, I would recommend lowering your max CPU use number or changing the 'suspend work if over' xx percent setting to a lower number. That way, if something does kick in (virus programs are another notorious resource hog when they start a scan or update), your work in progress on the project will temporarily suspend until after the event, which helps avoid an error. Changing system settings, screen resolutions, frame rates, etc. can kill you as well. If you must change something like that, suspend first.

Not sure if this info will help you but it can't hurt to try.
Aaron


©2025 Astroinformatics Group