Welcome to MilkyWay@home

Compute errors

[BAT] tutta55
Joined: 27 Aug 07
Posts: 46
Credit: 8,529,766
RAC: 0
Message 4491 - Posted: 27 Jul 2008, 9:53:27 UTC

Yesterday I got some compute errors on 2 of my machines. See http://milkyway.cs.rpi.edu/milkyway/results.php?userid=9, if they haven't been purged already <grmbl>. In all cases I got "process got signal 11".

This could indicate a hardware problem, if it weren't that it occurs on 2 different (physical) machines. Both clients are Ubuntu 8.04 64-bit running in VMware Workstation 6.0.4 build 93057, hosted by 64-bit Vista.

BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning
Tutta55's Lair

Mac-Nic
Joined: 26 Mar 08
Posts: 15
Credit: 2,045,502
RAC: 0
Message 4493 - Posted: 27 Jul 2008, 11:59:42 UTC - in response to Message 4491.  
Last modified: 27 Jul 2008, 12:01:19 UTC

Hi Tutta,

"Signal 11" can be a sign of low virtual memory.

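For reference, the number in "process got signal 11" can be mapped to a signal name with Python's standard `signal` module. On Linux, 11 is SIGSEGV (an invalid memory access), which tells you how the process died rather than why. A minimal sketch; the helper name `describe_signal` is made up for illustration:

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name for a signal number, or 'unknown' if undefined."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal defined on this platform.
        return "unknown"


if __name__ == "__main__":
    # On Linux this reports SIGSEGV for signal 11.
    print(f"signal 11 = {describe_signal(11)}")
```
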
Best regards

Jayargh
Joined: 8 Oct 07
Posts: 289
Credit: 3,690,838
RAC: 0
Message 4496 - Posted: 27 Jul 2008, 12:53:53 UTC - in response to Message 4491.  

Guy - Compute errors have been going on for over 6 months here on reboots, abnormal shut-offs, aborting another project's work, abnormal project switching, and a few other obscure times... all get "process got signal 11".

On the short tasks no one much cared, as it only lost a few minutes of CPU time; now the loss may be hours, so it is noticeable. It happens on both Windows and Linux.

STE\/E
Joined: 29 Aug 07
Posts: 486
Credit: 576,514,284
RAC: 37,074
Message 4497 - Posted: 27 Jul 2008, 12:55:37 UTC

If something interrupts a disk write it will give you that error too, Tutta; see the following: http://www.boinc-wiki.info/index.php?title=Unrecoverable_error_for_result_%27(result)%27_(process_got_signal_11) ... :)

[BAT] tutta55
Joined: 27 Aug 07
Posts: 46
Credit: 8,529,766
RAC: 0
Message 4500 - Posted: 27 Jul 2008, 21:53:27 UTC

Whatever it was, it didn't occur again today. I guess yesterday both machines had a bad hair day :p

BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning
Tutta55's Lair

stefsaber
Joined: 2 Apr 08
Posts: 32
Credit: 1,017,362
RAC: 0
Message 4503 - Posted: 28 Jul 2008, 14:22:53 UTC - in response to Message 4500.  

Just noticed I got a compute error on a 3 hour WU...

"<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
too many normally harmless exit(s)
</message>
<stderr_txt>
No heartbeat from core client for 31 sec - exiting

</stderr_txt>
]]>
"
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=45159505

-Stefan-

Altivo
Joined: 5 Dec 07
Posts: 6
Credit: 1,687,632
RAC: 0
Message 4504 - Posted: 28 Jul 2008, 15:42:23 UTC - in response to Message 4503.  

I've seen this before, and not just on this project. It happens to me on machines that have slow or dialup network access. Apparently there are timeouts set much too tightly for that condition, so that instead of waiting patiently, things freak out and abort. It's irritating to say the least, and especially with WUs that were running successfully.

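As a rough picture of the kind of timeout behind "No heartbeat from core client for 31 sec - exiting", here is a minimal watchdog sketch. It is not BOINC's actual code: the class name, the 30-second default (guessed from the "31 sec" in the log), and the injectable clock are all assumptions made for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks when the last heartbeat arrived and reports when it is overdue."""

    def __init__(self, timeout_s: float = 30.0, clock=time.monotonic):
        # timeout_s is an assumed value, not taken from BOINC's source.
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Record that a heartbeat was just received."""
        self._last_beat = self._clock()

    def seconds_since_beat(self) -> float:
        """Seconds elapsed since the most recent heartbeat."""
        return self._clock() - self._last_beat

    def is_overdue(self) -> bool:
        """True once more than timeout_s seconds pass without a heartbeat."""
        return self.seconds_since_beat() > self.timeout_s
```

A worker loop would call `beat()` whenever the controlling process checks in and stop cleanly once `is_overdue()` turns true. Anything that stalls the controlling process long enough would trip a check like this.
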

STE\/E
Joined: 29 Aug 07
Posts: 486
Credit: 576,514,284
RAC: 37,074
Message 4505 - Posted: 28 Jul 2008, 16:13:45 UTC - in response to Message 4497.  

The "No heartbeat from core client" message is related to the signal 11 error; see the post I made to Tutta ... :)

stefsaber
Joined: 2 Apr 08
Posts: 32
Credit: 1,017,362
RAC: 0
Message 4507 - Posted: 29 Jul 2008, 0:26:11 UTC - in response to Message 4505.  

Thanks, just looked it up. I'll keep an eye on the WU, see if the other person gets the error or if my computer was in a quirky mood.
-Stefan-

Stefan Ver3
Joined: 17 May 08
Posts: 16
Credit: 528,507
RAC: 0
Message 4546 - Posted: 30 Jul 2008, 14:58:38 UTC - in response to Message 4497.  

Thanks for the info, Poorboy. I had 3 WUs crash yesterday during an Ubuntu update while it was writing to disk. I lost 10,000 seconds' worth of crunch time to that little error. :(

koschi
Joined: 10 Aug 08
Posts: 5
Credit: 10,366,849
RAC: 22,124
Message 4666 - Posted: 10 Aug 2008, 22:09:16 UTC
Last modified: 10 Aug 2008, 22:35:40 UTC

Hello everyone!

I attached to Milkyway just one hour ago and first commanded my quad core to do some work, but the 8 units it received resulted in compute errors immediately.
Those broke after 0 seconds :-(
http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=25394

Then I also attached a small C2D and it runs fine; at least up to now there has been no error ;-)

The quad (Q6600) is running Kubuntu 8.04 with some additions from Intrepid, such as kernel 2.6.26 and some other stuff...
The dual core (E7200) is running Debian testing plus some packages from unstable.

Any idea what might cause these immediate errors on the quad? Maybe some library issue or something like that?

This host is crunching QMC, SETI v8, Einstein SSE2 and a lot more projects without any problems, so I can't imagine that this is a hardware issue...

Any ideas will be appreciated :-)

edit:

After a little reading here in the forums I found cruncher's statically linked version for SUSE and maybe others, but the site is not reachable anymore. Can someone please send me the file? That could be very helpful =)
My mail address would be m-a_r_t-i-n_.k-o-s-_c-_h_m-i-d-er@_w-_e-b.de (please remove the - and _ :) Thanks a lot in advance!

Rudy Toody
Joined: 29 Jul 08
Posts: 9
Credit: 100,721,007
RAC: 0
Message 4671 - Posted: 11 Aug 2008, 14:18:24 UTC

I have been getting similar errors since Friday 8:32AM PDT. Before that, I had crunched for a week and earned 163K credits on my 3 quads---Debian/GNU Linux-AMD64 and BOINC client and manager 6.2.12. Others seem to have problems with the same WUs.

I have found no pattern to suggest that it is AMD/Intel, Windows/Linux, 64-bit/32-bit, or BOINC versions.

How do I get some good stuff to crunch?

STE\/E
Joined: 29 Aug 07
Posts: 486
Credit: 576,514,284
RAC: 37,074
Message 4674 - Posted: 11 Aug 2008, 16:00:33 UTC - in response to Message 4671.  

Strange, I'm running Ubuntu Linux on multiple boxes with no errors at all. I see the WUs have the signal 11 error, which generally is a write error when trying to write to the disk or a "No heartbeat" from the BOINC client.

Try upgrading to the new 6.2.15 if you can. I've run 6.2.14 on 1 box with no problems & then upgraded that box to 6.2.15 with no problems so far. The rest of my boxes are running 5.10.45 with no problems ...

Rudy Toody
Joined: 29 Jul 08
Posts: 9
Credit: 100,721,007
RAC: 0
Message 4675 - Posted: 11 Aug 2008, 16:31:55 UTC - in response to Message 4674.  

I have upgraded to 6.2.14, which is the latest stable version available for lenny.
I also unchecked the "get test WUs" option in preferences.

Neither change has had any effect on the 3 quads. I will look into the file permissions to see if there is something there that would prevent a write to disk.

Rudy Toody
Joined: 29 Jul 08
Posts: 9
Credit: 100,721,007
RAC: 0
Message 4677 - Posted: 11 Aug 2008, 18:07:20 UTC

I checked permissions and they are fine.

I reduced my memory frequency from 1066 to 533. That didn't change anything.

If it is a hardware problem it is finding the same problem on 3 different boxes.

The only things the 3 boxes have in common are Phenom 9850BE, 4GB of OCZ Reaper 1066 and Seagate 500GB SATA drives.

STE\/E
Joined: 29 Aug 07
Posts: 486
Credit: 576,514,284
RAC: 37,074
Message 4679 - Posted: 11 Aug 2008, 20:11:03 UTC

Don't know if you got your boxes running yet or not, but if not, I still think it may be something to do with the permissions, since the WUs error out right away, so let's check those again please ... :)

1. Are you able to run other projects okay?

2. Find your BOINC projects folder and see if the "astronomy_1.23_x86_64-pc-linux-gnu" file has the "Allow executing file as program" box checked.

3. Find the boinc or boinc_client, the boinccmd, the boinc_cmd & the boincmgr executables & make sure they have the proper permissions, and also that they have the "Allow executing file as program" box checked.

4. If all else fails, try a re-install of the BOINC client & BOINC Manager, reset the project & try again, then detach & re-attach to the project & try again.

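The executable-bit checks in steps 2 and 3 can also be done from a script instead of the file manager. This is only a sketch: the function name is made up, and the project directory below is an assumed location that will differ depending on how BOINC was installed.

```python
import stat
from pathlib import Path


def missing_exec_bit(paths):
    """Return the paths that are regular files with no execute bit set."""
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    flagged = []
    for p in map(Path, paths):
        # Skip anything that does not exist or is not a regular file.
        if p.is_file() and not (p.stat().st_mode & exec_bits):
            flagged.append(str(p))
    return flagged


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever your BOINC data directory lives.
    project_dir = Path("/var/lib/boinc-client/projects/milkyway.cs.rpi.edu_milkyway")
    candidates = sorted(project_dir.glob("astronomy_*")) if project_dir.is_dir() else []
    for path in missing_exec_bit(candidates):
        print(f"not executable: {path} (fix with: chmod +x)")
```

An empty result only means the execute bit is present; it says nothing about ownership or whether the binary's libraries can be found.
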
Hopefully something there will work for you & get your boxes running the project again, good luck ... :)

Too bad all the devs have abandoned the project, or maybe they could suggest something too ...

Rudy Toody
Joined: 29 Jul 08
Posts: 9
Credit: 100,721,007
RAC: 0
Message 4682 - Posted: 11 Aug 2008, 20:41:39 UTC - in response to Message 4679.  
Last modified: 11 Aug 2008, 20:42:32 UTC

I have done all of the above.

Currently running 3x+1 on all 3 quads and doing quite well.
I checked all the permissions. I went so far as to download 8 WUs and suspend so I could catch them in the act. All permissions were ok.

I installed and tested BOINC 5.2.10, 6.2.12, and now 6.2.14. All work with 3x+1 and none work with Milkyway.

What do you mean, the devs have abandoned the project? Is this no longer a viable project? Maybe I should stay on 3x+1.

STE\/E
Joined: 29 Aug 07
Posts: 486
Credit: 576,514,284
RAC: 37,074
Message 4683 - Posted: 11 Aug 2008, 20:57:48 UTC - in response to Message 4682.  
Last modified: 11 Aug 2008, 21:00:15 UTC

Okay, I was just trying to give you some things to try; I figured you may have tried them already ... The dev remark was just thrown in there since none of them have responded to anything for some time now, so nobody knows what's happened to them.

About the only thing we've heard from them in the last few months is when they cut the WU limit to 8 (which is too low really; 20 was too low as far as I was concerned) and asked that we let them know if there was a problem with that. Well, a lot of people running the Quads & OctaQuads complained about it, but there's been no response or action from the devs on it so far. Summer dev blahs I guess ... :)

Rudy Toody
Joined: 29 Jul 08
Posts: 9
Credit: 100,721,007
RAC: 0
Message 4684 - Posted: 11 Aug 2008, 21:19:39 UTC

Thanks, Bob!

I'll stick with 3x+1 and let the rest of TeAm AnandTech pass me by! Oh, well.

STE\/E
Joined: 29 Aug 07
Posts: 486
Credit: 576,514,284
RAC: 37,074
Message 4685 - Posted: 11 Aug 2008, 22:22:00 UTC

You could try posting your problem over at the BOINC Dev Forum. There are people there with way more knowledge about Linux than I have, and they're willing to work with you too.

Post the following error from 1 of your Wu's:

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
Unrecognized XML in parse_init_data_file: computation_deadline
Skipping: 1218907845.136000
Skipping: /computation_deadline
Unrecognized XML in GLOBAL_PREFS::parse_override: mod_time
Skipping: /mod_time
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100.000000
Skipping: /max_ncpus_pct

</stderr_txt>

All those "Unrecognized" lines have me wondering what they're about, and they may be the root cause of your problem ... bob
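
To pull just the tags the application did not recognize out of a stderr dump like the one above, something along these lines might help when posting to another forum. It is a sketch only: the function name is made up, and the sample text is copied from the log above.

```python
import re

# Matches lines such as "Unrecognized XML in <parser>: <tag>".
UNRECOGNIZED = re.compile(r"^Unrecognized XML in (\S+): (\S+)\s*$")


def unrecognized_tags(stderr_text: str):
    """Return (parser, tag) pairs for each 'Unrecognized XML' line in stderr."""
    pairs = []
    for line in stderr_text.splitlines():
        match = UNRECOGNIZED.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


SAMPLE = """\
Unrecognized XML in parse_init_data_file: computation_deadline
Skipping: 1218907845.136000
Skipping: /computation_deadline
Unrecognized XML in GLOBAL_PREFS::parse_override: mod_time
Skipping: /mod_time
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100.000000
Skipping: /max_ncpus_pct
"""

if __name__ == "__main__":
    for parser, tag in unrecognized_tags(SAMPLE):
        print(f"{parser} did not recognize <{tag}>")
```
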
©2024 Astroinformatics Group