
MW@H Computing Failures


Profile John Black

Joined: 3 May 10
Posts: 74
Credit: 1,532,760
RAC: 0
Message 55442 - Posted: 1 Sep 2012, 15:12:50 UTC

Hi people,

I have had a number of computation failures over the last 24 hours. Before that, MW@H calculations went well, and I am currently running SETI with no problems. I should say that all these calculations are done on a dual-core E4700 CPU, as my graphics card is too wimpy.

STDERR looks like this:


<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
<search_application> milkyway_separation 1.00 Windows x86 double </search_application>
Unrecognized XML in project preferences: nvidia_block_amount
Skipping: 128
Skipping: /nvidia_block_amount
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file
Using SSE3 path
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to update checkpoint file ('separation_checkpoint_tmp' to 'separation_checkpoint') (2): No such file or directory
Write checkpoint failed
12:50:15 (4548): called boinc_finish

</stderr_txt>
]]>

As I am not computer literate, I don't know whether this is a fault at my end (i.e. my PC is goosed) or an error with the workunit.

Does anybody have any advice?

Thanks
John
Profile John Black

Joined: 3 May 10
Posts: 74
Credit: 1,532,760
RAC: 0
Message 55445 - Posted: 1 Sep 2012, 19:58:40 UTC

Hi again people,

I rebooted after a power down and the first MW@H task has been running for about 30 minutes whereas they had been erroring out just after a minute.

I still don't know what went wrong but perhaps there was a wee bit of corruption somewhere on my system and the power down and reboot fixed it. I would appreciate any suggestions as to what caused this.

I will repost the news as it develops.

Thanks for any suggestions.

John
Len LE/GE

Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0
Message 55447 - Posted: 2 Sep 2012, 2:59:38 UTC

Guessing you have a standard install of BOINC, I would start with a file system check and then run a test of the hard drive.
Profile mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,601,995
RAC: 30,584
Message 55449 - Posted: 2 Sep 2012, 11:57:51 UTC - in response to Message 55445.  

Hi again people,

I rebooted after a power down and the first MW@H task has been running for about 30 minutes whereas they had been erroring out just after a minute.

I still don't know what went wrong but perhaps there was a wee bit of corruption somewhere on my system and the power down and reboot fixed it. I would appreciate any suggestions as to what caused this.

I will repost the news as it develops.

Thanks for any suggestions.

John


I know you said you aren't crunching with the GPU, but in Windows, when the GPU driver crashes, the ONLY way to get it working again is to reboot; Windows does not have a way to reset the GPU any other way. It could be that your GPU was causing issues.
Profile John Black

Joined: 3 May 10
Posts: 74
Credit: 1,532,760
RAC: 0
Message 55454 - Posted: 2 Sep 2012, 17:32:13 UTC

Hi,

Firstly, Len LE/GE: yep, standard 7.0.28. I have checked both the filesystem and the hard drive and found no problems (now, after the reboot).

Secondly, mikey: as I am set to do no GPU tasks, and SETI tasks were running OK on the same CPU whilst the MW@H tasks were erroring out, I cannot figure out why a GPU fault would give this result.

When I rebooted the MW@H tasks started to run ok so whatever it was it was not a permanent hardware fault. Unfortunately the error occurred overnight and I had managed to run about 20 errored tasks before I found out what was happening. Anyway all is now ok (for the moment?).

I cannot interpret the STDERR results or figure out why they refer to the NVIDIA GPU.

Thanks to both of you for your input.

John
Profile John Black

Joined: 3 May 10
Posts: 74
Credit: 1,532,760
RAC: 0
Message 55495 - Posted: 5 Sep 2012, 16:39:03 UTC

Hi again people,

Since I last reported, MW@H tasks ran successfully, but today the tasks started to show "error while computing" again. As per my previous posts, I have rebooted to see if that restores things and I will report here later.

My enquiry here is for someone to translate the STDERR output, as I am unsure what it is indicating. I notice that it says "Unrecognized XML in project preferences: nvidia_block_amount" and, as I have said, I do not use my wee NVIDIA GPU to do calculations, so I cannot figure it out.

Stderr output

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
<search_application> milkyway_separation 1.00 Windows x86 double </search_application>
Unrecognized XML in project preferences: nvidia_block_amount
Skipping: 128
Skipping: /nvidia_block_amount
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file
Using SSE3 path
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to update checkpoint file ('separation_checkpoint_tmp' to 'separation_checkpoint') (2): No such file or directory
Write checkpoint failed
16:54:28 (5448): called boinc_finish

</stderr_txt>
]]>


Thanks in advance for any help offered.

John
Profile John Black

Joined: 3 May 10
Posts: 74
Credit: 1,532,760
RAC: 0
Message 55498 - Posted: 5 Sep 2012, 19:49:52 UTC

Hi again,

as before a reboot has restored normal calculations of MW@H tasks.

I am at a total loss to figure this out, so can anyone help with interpretation of the STDERR output?

Thanks
John
Profile mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,601,995
RAC: 30,584
Message 55500 - Posted: 5 Sep 2012, 21:04:25 UTC - in response to Message 55498.  

Hi again,

as before a reboot has restored normal calculations of MW@H tasks.

I am at a total loss to figure this out, so can anyone help with interpretation of the STDERR output?

Thanks
John


Are you running an app_info or a cc_config file?
Profile John Black

Joined: 3 May 10
Posts: 74
Credit: 1,532,760
RAC: 0
Message 55502 - Posted: 5 Sep 2012, 22:11:59 UTC

Hi Mikey,

I don't know the answer to your question. Is there a way in which I can find out and then post it here?

Thanks

John
Len LE/GE

Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0
Message 55503 - Posted: 5 Sep 2012, 23:43:13 UTC - in response to Message 55495.  


<search_application> milkyway_separation 1.00 Windows x86 double </search_application>


You are running the CPU version of MW separation.


Unrecognized XML in project preferences: nvidia_block_amount
Skipping: 128
Skipping: /nvidia_block_amount
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file


Ignore those; you will see them even when running on an AMD GPU.
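
For anyone who wants to sort a long stderr dump the same way, here is a rough Python sketch. It is only an illustration built around the lines quoted in this thread: the list of "harmless" prefixes and the function name `triage` are assumptions, not anything that ships with BOINC or MilkyWay@home. It skips the lines that can be ignored and counts the numeric codes on the "Failed to ..." lines.

```python
import re
from collections import Counter

# Prefixes treated as harmless here, following the advice in this thread.
# This list is an assumption for illustration, not an official one.
HARMLESS_PREFIXES = (
    "Unrecognized XML in project preferences",
    "Skipping:",
    "Error loading Lua script",
    "Error reading astronomy parameters",
    "Trying old parameters file",
)

# A numeric code in parentheses followed by a colon, e.g. "(6801):".
ERROR_CODE = re.compile(r"\((\d+)\):")


def triage(stderr_text):
    """Return (number of harmless lines, Counter of codes on 'Failed to' lines)."""
    ignored = 0
    codes = Counter()
    for raw in stderr_text.splitlines():
        line = raw.strip()
        if line.startswith(HARMLESS_PREFIXES):
            ignored += 1
        elif line.startswith("Failed to"):
            # Only failure lines are inspected, so the process id on the
            # "called boinc_finish" line is not mistaken for an error code.
            match = ERROR_CODE.search(line)
            if match:
                codes[int(match.group(1))] += 1
    return ignored, codes
```

Calling `triage()` on the text between the `<stderr_txt>` tags gives a quick tally of which codes repeat, which is usually the part worth searching for.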


Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to update checkpoint file ('separation_checkpoint_tmp' to 'separation_checkpoint') (2): No such file or directory
Write checkpoint failed
16:54:28 (5448): called boinc_finish


MW writes checkpoints to save the current progress of the calculation. They are used each time BOINC switches back to MW, so MW knows where to continue the calculation from.

This repeated error is what you have to worry about.
Data is first written to a temp file, and then the regular checkpoint file gets replaced by the temp file.
A Google search shows that it is often a known bug in the Vista transaction manager, where the transaction log gets corrupted. I don't know if there is a patch available; I have only seen a "Microsoft Fix it 50140" for situations when the error occurs.
There are other cases where this error occurs too, described in the third link.

see
1) http://support.microsoft.com/kb/939399
2) http://serverfault.com/questions/350374/transaction-support-within-the-specified-resource-manager-is-not-started-or-was
3) http://errordecoder.com/system-error-codes/8/code-6801.html
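
To illustrate the "write to a temp file, then replace the checkpoint" pattern described above, here is a minimal Python sketch. It is not the MilkyWay@home code (I have not looked at how the application does it internally); the function names are made up for the example, and only the `_tmp` suffix is borrowed from the file names in the log.

```python
import os
import tempfile
from typing import Optional


def write_checkpoint(path: str, data: bytes) -> None:
    """Write data to '<path>_tmp' first, then swap it over the real checkpoint."""
    tmp_path = path + "_tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the swap
    # os.replace overwrites the destination in a single step. If the swap
    # fails it raises OSError and the old checkpoint should still be in place.
    os.replace(tmp_path, path)


def read_checkpoint(path: str) -> Optional[bytes]:
    """Return the saved progress, or None if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        checkpoint = os.path.join(folder, "separation_checkpoint")
        write_checkpoint(checkpoint, b"progress: 10%")
        write_checkpoint(checkpoint, b"progress: 20%")
        print(read_checkpoint(checkpoint))
```

The point of the two-step write is that a crash halfway through never leaves a half-written checkpoint behind. The flip side is that everything depends on the final move succeeding, which is the step failing with error 6801 in the log.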
Profile John Black

Joined: 3 May 10
Posts: 74
Credit: 1,532,760
RAC: 0
Message 55507 - Posted: 6 Sep 2012, 0:38:30 UTC

Hi Len LE/GE,

Thanks for your informative post. I ran the Microsoft Fix it routine ("fsutil resource setautoreset true c:\") to clean out the file system transaction log.
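
In case it helps anyone scripting this, here is a small Python sketch around the fsutil command quoted above. The wrapper functions are hypothetical helpers written for this example; only the command itself comes from this thread. It has to be run from an administrator prompt, and as far as I understand it the transaction log is only cleaned the next time the volume is mounted, so a reboot afterwards is likely needed.

```python
import platform
import subprocess


def autoreset_command(volume="c:\\"):
    """Build the argument list for the fsutil call quoted in this thread."""
    return ["fsutil", "resource", "setautoreset", "true", volume]


def run_autoreset(volume="c:\\"):
    """Run the command on Windows; do nothing and return False elsewhere."""
    if platform.system() != "Windows":
        return False
    # check=True raises CalledProcessError if fsutil reports a failure,
    # for example when the prompt is not elevated.
    subprocess.run(autoreset_command(volume), check=True)
    return True
```

Printing `" ".join(autoreset_command())` first is an easy way to confirm what will be executed before actually running it.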

Now everything seems to be working ok but as it is 01:35 in Scotland I am going back to my bed. If everything is ok tomorrow then I will post here to let you know.

Thanks again

John
Profile mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,601,995
RAC: 30,584
Message 55508 - Posted: 6 Sep 2012, 12:10:25 UTC - in response to Message 55502.  

Hi Mikey,

I don't know the answer to your question. Is there a way in which I can find out and then post it here?

Thanks

John


They are two files used to customize BOINC; since you don't know what they are, you are probably NOT using them.
Profile John Black

Joined: 3 May 10
Posts: 74
Credit: 1,532,760
RAC: 0
Message 55510 - Posted: 6 Sep 2012, 12:28:14 UTC

Hi Guys,

OK Mikey, fair enough. I just run a straight BOINC on a dual-core CPU.

Len LE/GE: the fsutil routine with the file transaction log seems to have worked so far, as MW@H is still running perfectly 12 hours later.

Thanks guys, this is what I like about BOINC: everyone is so helpful.

John

