MW@H Computing Failures
Author | Message |
---|---|
Joined: 3 May 10 Posts: 74 Credit: 1,532,760 RAC: 0 |
Hi people, I have had a number of computation failures over the last 24 hours. Before that, MW@H calculations went well, and I am currently running SETI with no problems. I should say that all these calculations are done on a dual-core E4700 CPU, as my graphics card is too wimpy. The stderr output looks like this:
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
<search_application> milkyway_separation 1.00 Windows x86 double </search_application>
Unrecognized XML in project preferences: nvidia_block_amount
Skipping: 128
Skipping: /nvidia_block_amount
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file
Using SSE3 path
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to update checkpoint file ('separation_checkpoint_tmp' to 'separation_checkpoint') (2): No such file or directory
Write checkpoint failed
12:50:15 (4548): called boinc_finish
As I am not computer literate, I don't know whether this is a fault at my end (i.e. my PC is goosed) or an error with the workunit. Does anybody have any advice? Thanks John |
Joined: 3 May 10 Posts: 74 Credit: 1,532,760 RAC: 0 |
Hi again people, I rebooted after a power down and the first MW@H task has now been running for about 30 minutes, whereas earlier tasks had been erroring out after about a minute. I still don't know what went wrong, but perhaps there was a wee bit of corruption somewhere on my system and the power down and reboot fixed it. I would appreciate any suggestions as to what caused this. I will post news here as it develops. Thanks for any suggestions. John |
Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0 |
Guessing you have a standard install of BOINC, I would start with a file system check and then run a test of the hard drive. |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
I know you said you aren't crunching with the GPU, but in Windows, when the GPU driver crashes, the ONLY way to get it working again is to reboot; Windows does not have a way to reset the GPU any other way. It could be that your GPU was causing issues. |
Joined: 3 May 10 Posts: 74 Credit: 1,532,760 RAC: 0 |
Hi, firstly Len LE/GE: yep, standard 7.0.28. I have checked both the file system and the hard drive and found no problems (now, after the reboot). Secondly mikey: as I am set to do no GPU tasks, and SETI tasks were running OK on the same CPU whilst the MW@H tasks were erroring out, I cannot figure out why a GPU fault would give this result. When I rebooted, the MW@H tasks started to run OK, so whatever it was, it was not a permanent hardware fault. Unfortunately the error occurred overnight, and about 20 tasks had errored out before I found out what was happening. Anyway, all is now OK (for the moment?). I cannot interpret the stderr results or figure out why they refer to the NVIDIA GPU. Thanks to both of you for your input. John |
Joined: 3 May 10 Posts: 74 Credit: 1,532,760 RAC: 0 |
Hi again people, since I last reported, MW@H tasks ran successfully, but today the tasks started to show "error while computing" again. As per my previous posts, I have rebooted to see if that restores things and I will report here later. My enquiry here is for someone to translate the stderr output, as I am unsure what it is indicating. I notice that it says "Unrecognized XML in project preferences: nvidia_block_amount", and as I have said, I do not use my wee NVIDIA GPU to do calculations, so I cannot figure it out.
Stderr output
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
<search_application> milkyway_separation 1.00 Windows x86 double </search_application>
Unrecognized XML in project preferences: nvidia_block_amount
Skipping: 128
Skipping: /nvidia_block_amount
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file
Using SSE3 path
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to update checkpoint file ('separation_checkpoint_tmp' to 'separation_checkpoint') (2): No such file or directory
Write checkpoint failed
16:54:28 (5448): called boinc_finish
</stderr_txt>
]]>
Thanks in advance for any help offered. John |
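For anyone wanting a quick overview of a stderr dump like the one above, here is a rough Python sketch that counts the "Failed to ..." lines per numeric error code. The regular expression is only an assumption based on the lines quoted in this thread, not a documented BOINC or MilkyWay@Home log format, and the function name `summarize_failures` is made up for illustration.

```python
import re
from collections import Counter

# Assumed shape, taken from the lines quoted in this thread:
#   Failed to <something> (<code>): <message>
FAILED_RE = re.compile(r"Failed to .*?\((\d+)\):\s*(.+)$")


def summarize_failures(stderr_text):
    """Return (counts per error code, first message seen per code)."""
    counts = Counter()
    messages = {}
    for raw_line in stderr_text.splitlines():
        match = FAILED_RE.match(raw_line.strip())
        if match is None:
            continue  # not a "Failed to ..." line, e.g. the boinc_finish line
        code = int(match.group(1))
        counts[code] += 1
        messages.setdefault(code, match.group(2))
    return counts, messages


if __name__ == "__main__":
    # Shortened sample built from lines in the post above.
    sample = "\n".join([
        "Using SSE3 path",
        "Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): "
        "Transaction support within the specified file system resource manager "
        "is not started or was shutdown due to an error.",
        "Failed to update checkpoint file ('separation_checkpoint_tmp' to "
        "'separation_checkpoint') (2): No such file or directory",
        "12:50:15 (4548): called boinc_finish",
    ])
    counts, messages = summarize_failures(sample)
    for code, n in counts.most_common():
        print(code, n, messages[code])
```

Grouping by code makes it easier to see that one error (6801 here) accounts for nearly all of the failures, which is the one worth searching for.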
Joined: 3 May 10 Posts: 74 Credit: 1,532,760 RAC: 0 |
Hi again, as before, a reboot has restored normal calculations of MW@H tasks. I am at a total loss to figure this out, so can anyone help with interpretation of the stderr output? Thanks John |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
Are you running an app_info or a cc_config file? |
Joined: 3 May 10 Posts: 74 Credit: 1,532,760 RAC: 0 |
Hi Mikey, I don't know the answer to your question. Is there a way in which I can find out and then post it here? Thanks John |
Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0 |
You are running the CPU version of MW separation.
Ignore the "Unrecognized XML in project preferences: nvidia_block_amount" and "Skipping" lines; you will see them even when running on an AMD GPU.
MW writes checkpoints to save the current progress of the calculation. They are used each time BOINC switches back to MW, so MW knows where to continue the calculation. The repeated "Failed to move file" error (6801) is what you have to worry about. Data is first written to a temp file, and then the regular checkpoint file gets replaced by the temp file. A Google search shows that it is often a known bug in the transaction manager of Vista, where the transaction log gets corrupted. I don't know if there is a patch available; I have only seen a "Microsoft Fix it 50140" for situations when the error occurs. There are other cases where this error occurs too, described in the third link. See 1) http://support.microsoft.com/kb/939399 2) http://serverfault.com/questions/350374/transaction-support-within-the-specified-resource-manager-is-not-started-or-was 3) http://errordecoder.com/system-error-codes/8/code-6801.html |
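The "write to a temp file, then replace the regular checkpoint" step described above can be sketched in Python roughly like this. This is only an illustration of the general pattern, not the actual MilkyWay@Home code: the JSON format and the function names `write_checkpoint` and `read_checkpoint` are made up for the example, and only the two file names come from the log.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write state to '<path>_tmp', then swap it into place."""
    tmp_path = path + "_tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before the swap
    # os.replace overwrites the destination if it already exists.
    # If this step fails, the previous checkpoint is left untouched.
    os.replace(tmp_path, path)


def read_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    checkpoint = os.path.join(workdir, "separation_checkpoint")
    write_checkpoint(checkpoint, {"step": 1})
    write_checkpoint(checkpoint, {"step": 2})
    print(read_checkpoint(checkpoint))
```

The point of the two-step write is that a crash halfway through never leaves a half-written checkpoint behind. The downside, as seen in this thread, is that the task cannot save its progress at all when the file system refuses the move.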
Joined: 3 May 10 Posts: 74 Credit: 1,532,760 RAC: 0 |
Hi Len LE/GE, thanks for your informative post. I ran the Microsoft Fix it routine, "fsutil resource setautoreset true c:\", to clean out the file system transaction log. Now everything seems to be working OK, but as it is 01:35 in Scotland I am going back to my bed. If everything is OK tomorrow then I will post here to let you know. Thanks again John |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
They are two files used to customize BOINC; since you don't know what they are, you are probably NOT using them. |
Joined: 3 May 10 Posts: 74 Credit: 1,532,760 RAC: 0 |
Hi guys, OK Mikey, fair enough, I just run a straight BOINC on a dual-core CPU. Len LE/GE: the fsutil routine with the file transaction log seems to have worked so far, as MW@H is still running perfectly 12 hours later. Thanks guys, this is what I like about BOINC: everyone is so helpful. John |
©2024 Astroinformatics Group