Compute errors
Author | Message |
---|---|
Joined: 27 Aug 07 Posts: 46 Credit: 8,529,766 RAC: 0 |
Yesterday I got some compute errors on 2 of my machines. See http://milkyway.cs.rpi.edu/milkyway/results.php?userid=9, if they haven't been purged already <grmbl>. In all cases I got "process got signal 11". This could indicate a hardware problem, if it weren't that it occurs on 2 different (physical) machines. Both clients are Ubuntu 8.04 64-bit running in VMware Workstation 6.0.4 build 93057, hosted by 64-bit Vista. BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning Tutta55's Lair |
Joined: 26 Mar 08 Posts: 15 Credit: 2,045,502 RAC: 0 |
Quote: "Yesterday I got some compute errors on 2 of my machines. See http://milkyway.cs.rpi.edu/milkyway/results.php?userid=9, if they haven't been purged already <grmbl>. In all cases I got 'process got signal 11'." Hi Tutta, "Signal 11" can be a sign of low virtual memory. Best regards |
Joined: 8 Oct 07 Posts: 289 Credit: 3,690,838 RAC: 0 |
Quote: "Yesterday I got some compute errors on 2 of my machines. See http://milkyway.cs.rpi.edu/milkyway/results.php?userid=9, if they haven't been purged already <grmbl>. In all cases I got 'process got signal 11'." Guy - Compute errors have been going on here for over 6 months: on reboots, abnormal shut-offs, aborting another project's work, abnormal project switching, and at a few other obscure times. All of them report "process got signal 11". On the short tasks no one much cared, as it only lost a few minutes of CPU time; now the loss may be hours, so it is noticeable. It happens on both Windows and Linux. |
Joined: 29 Aug 07 Posts: 486 Credit: 576,548,171 RAC: 0 |
If something interrupts a disk write, it will give you that error too, Tutta. See the following: http://www.boinc-wiki.info/index.php?title=Unrecoverable_error_for_result_%27(result)%27_(process_got_signal_11) ... :) |
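For readers unfamiliar with the message: on Linux, signal number 11 is SIGSEGV (segmentation fault), so "process got signal 11" means the application process was terminated by that signal. The sketch below only illustrates how a parent process sees such a crash. It is not BOINC code; the helper names `describe_exit` and `run_crashing_child` are made up for this example, and it assumes a Linux/POSIX host, where Python's `subprocess` reports a child killed by signal N as return code -N.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short description.

    On POSIX, subprocess reports a child killed by signal N as -N.
    """
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return f"process got signal {signum} ({name})"
    return f"process exited with code {returncode}"


def run_crashing_child() -> int:
    """Start a child Python process that sends SIGSEGV to itself."""
    code = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    return subprocess.run([sys.executable, "-c", code]).returncode


if __name__ == "__main__":
    print(describe_exit(run_crashing_child()))
```

The signal number alone does not say why the crash happened (bad memory, an interrupted write, or a bug in the application), which is consistent with the different explanations offered in this thread.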
Joined: 27 Aug 07 Posts: 46 Credit: 8,529,766 RAC: 0 |
Whatever it was, it didn't occur again today. I guess yesterday both machines had a bad hair day :p |
Joined: 2 Apr 08 Posts: 32 Credit: 1,017,362 RAC: 0 |
Just noticed I got a compute error on a 3 hour WU... "<core_client_version>5.10.45</core_client_version> <![CDATA[ <message> too many normally harmless exit(s) </message> <stderr_txt> No heartbeat from core client for 31 sec - exiting </stderr_txt> ]]> " http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=45159505 -Stefan- |
Joined: 5 Dec 07 Posts: 6 Credit: 1,687,632 RAC: 0 |
I've seen this before, and not just on this project. It happens to me on machines that have slow or dial-up network access. Apparently there are timeouts set much too tightly for that condition, so that instead of waiting patiently, things freak out and abort. It's irritating to say the least, especially with WUs that were running successfully. Quote: "Just noticed I got a computer error on a 3 hour WU..." |
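The "No heartbeat from core client for 31 sec - exiting" message quoted above suggests a watchdog pattern: the application expects a periodic heartbeat from the client and gives up once none has arrived within some limit. Below is a minimal sketch of that idea only. The class name `HeartbeatWatchdog`, the 30-second limit and the logic are assumptions for illustration, not BOINC's actual implementation.

```python
import time
from typing import Optional


class HeartbeatWatchdog:
    """Track the last heartbeat and report when it is overdue (hypothetical)."""

    def __init__(self, timeout_sec: float = 30.0, now: Optional[float] = None) -> None:
        self.timeout_sec = timeout_sec
        self.last_beat = time.monotonic() if now is None else now

    def beat(self, now: Optional[float] = None) -> None:
        """Record that a heartbeat arrived."""
        self.last_beat = time.monotonic() if now is None else now

    def silence(self, now: Optional[float] = None) -> float:
        """Seconds elapsed since the last heartbeat."""
        current = time.monotonic() if now is None else now
        return current - self.last_beat

    def expired(self, now: Optional[float] = None) -> bool:
        """True once the silence exceeds the allowed timeout."""
        return self.silence(now) > self.timeout_sec


if __name__ == "__main__":
    # Explicit timestamps make the example deterministic.
    watchdog = HeartbeatWatchdog(timeout_sec=30.0, now=0.0)
    if watchdog.expired(now=31.0):
        print(f"No heartbeat for {watchdog.silence(now=31.0):.0f} sec - exiting")
```

With a fixed limit like this, anything that stalls the sender for longer than the limit would abort an otherwise healthy task, which matches the complaint about timeouts being set too tightly.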
Joined: 29 Aug 07 Posts: 486 Credit: 576,548,171 RAC: 0 |
Quote: "If something interrupts a disk write, it will give you that error too, Tutta. See the following: http://www.boinc-wiki.info/index.php?title=Unrecoverable_error_for_result_%27(result)%27_(process_got_signal_11) ... :)" The "No heartbeat from core client" error is related to the signal 11 error; see the post I made to Tutta ... :) |
Joined: 2 Apr 08 Posts: 32 Credit: 1,017,362 RAC: 0 |
Quote: "If something interrupts a disk write, it will give you that error too, Tutta. See the following: http://www.boinc-wiki.info/index.php?title=Unrecoverable_error_for_result_%27(result)%27_(process_got_signal_11) ... :)" Thanks, just looked it up. I'll keep an eye on the WU, see if the other person gets the error or if my computer was in a quirky mood. -Stefan- |
Joined: 17 May 08 Posts: 16 Credit: 528,507 RAC: 0 |
Quote: "If something interrupts a disk write, it will give you that error too, Tutta. See the following: http://www.boinc-wiki.info/index.php?title=Unrecoverable_error_for_result_%27(result)%27_(process_got_signal_11) ... :)" Thanks for the info, Poorboy. I had 3 WUs crash yesterday during an Ubuntu update while it was writing to disk. I lost 10,000 seconds' worth of crunch time on that little error. :( |
Joined: 10 Aug 08 Posts: 5 Credit: 19,885,042 RAC: 43,606 |
Hello everyone! I attached to Milkyway just one hour ago and commanded my quad core first to do some work, but the 8 units it received resulted in compute errors immediately. Those broke after 0 seconds :-( http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=25394 Then I also attached a small C2D and it runs fine, at least up to now there was no error ;-) The quad (Q6600) is running Kubuntu 8.04 with some additions from intrepid, namely kernel 2.6.26 and some other stuff... The dual core (E7200) runs Debian testing plus some packages from unstable. Any idea what might cause these immediate errors on the quad? Maybe some library issue or something like that? This host is crunching QMC, SETI v8, Einstein SSE2 and a lot more projects without any problems, so I can't imagine that this is a hardware issue... Any idea will be appreciated :-) Edit: after a little reading here in the forums I found crunchers statically linked version for Suse and maybe others, but the site is not reachable anymore. Can someone please send me the file?!? That could be very helpful maybe =) Mail address would be m-a_r_t-i-n_.k-o-s-_c-_h_m-i-d-er@_w-_e-b.de (please remove the - and _ :) Thanks a lot in advance! |
Joined: 29 Jul 08 Posts: 9 Credit: 100,721,007 RAC: 0 |
I have been getting similar errors since Friday 8:32AM PDT. Before that, I had crunched for a week and earned 163K credits on my 3 quads---Debian/GNU Linux-AMD64 and BOINC client and manager 6.2.12. Others seem to have problems with the same WUs. I have found no pattern to suggest that it is AMD/Intel, Windows/Linux, 64-bit/32-bit, or BOINC versions. How do I get some good stuff to crunch? |
Joined: 29 Aug 07 Posts: 486 Credit: 576,548,171 RAC: 0 |
Quote: "I have been getting similar errors since Friday 8:32AM PDT. Before that, I had crunched for a week and earned 163K credits on my 3 quads---Debian/GNU Linux-AMD64 and BOINC client and manager 6.2.12. Others seem to have problems with the same WUs." Strange, I'm running Ubuntu Linux on multiple boxes with no errors at all. I see the WUs have the signal 11 error, which generally is a write error when trying to write to the disk or a "No heartbeat" from the BOINC client. Try upgrading to the new 6.2.15 if you can; I've run 6.2.14 on 1 box with no problems & then upgraded that box to 6.2.15 with no problems so far. The rest of my boxes are running 5.10.45 with no problems ... |
Joined: 29 Jul 08 Posts: 9 Credit: 100,721,007 RAC: 0 |
Quote: "I have been getting similar errors since Friday 8:32AM PDT. Before that, I had crunched for a week and earned 163K credits on my 3 quads---Debian/GNU Linux-AMD64 and BOINC client and manager 6.2.12. Others seem to have problems with the same WUs." I have upgraded to 6.2.14, which is the latest stable version available for lenny. I also unchecked the "get test WUs" option in preferences. Neither change had any effect on the 3 quads. I will look into the file permissions to see if there is something there that would prevent a write to disk. |
Joined: 29 Jul 08 Posts: 9 Credit: 100,721,007 RAC: 0 |
I checked permissions and they are fine. I reduced my memory frequency from 1066 to 533. That didn't change anything. If it is a hardware problem, it is the same problem on 3 different boxes. The only things the 3 boxes have in common are Phenom 9850BE, 4GB of OCZ Reaper 1066 and Seagate 500GB SATA drives. |
Joined: 29 Aug 07 Posts: 486 Credit: 576,548,171 RAC: 0 |
Don't know if you got your boxes running yet or not, but if not, I still think it may have something to do with the permissions, since the WUs error out right away, so let's check those again please ... :) 1. Are you able to run other projects okay? 2. Find your BOINC projects folder and see if "astronomy_1.23_x86_64-pc-linux-gnu" has the "Allow executing file as program" box checked. 3. Find the boinc or boinc_client, the boinccmd, the boinc_cmd & the boincmgr executables & make sure they have the proper permissions & that they also have the "Allow executing file as program" box checked. 4. If all else fails, try a re-install of the BOINC client & BOINC manager, reset the project & try again, then detach & re-attach to the project & try again. Hopefully something there will work for you & get your boxes running the project again, good luck ... :) Too bad all the devs have abandoned the project, or maybe they could suggest something too ... |
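The "Allow executing file as program" checkbox mentioned in steps 2 and 3 corresponds to the execute permission bits on the file, so the same check can be done from a script. The following is a small sketch under that assumption. The function name `not_executable` is invented for this example, and the paths to check are passed on the command line because the location of the BOINC projects folder differs between installations.

```python
import os
import stat
import sys
from typing import Iterable, List


def not_executable(paths: Iterable[str]) -> List[str]:
    """Return the paths that are missing, not regular files, or lack every execute bit."""
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    problems = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # Missing or unreadable file: report it as a problem too.
            problems.append(path)
            continue
        if not stat.S_ISREG(mode) or not (mode & exec_bits):
            problems.append(path)
    return problems


if __name__ == "__main__":
    # Example: python check_exec.py /path/to/astronomy_1.23_x86_64-pc-linux-gnu
    for path in not_executable(sys.argv[1:]):
        print(f"not executable or missing: {path}")
```

This only looks at the permission bits. It does not tell whether the user account that runs the BOINC client is the one those bits apply to, so that still needs a manual look.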
Joined: 29 Jul 08 Posts: 9 Credit: 100,721,007 RAC: 0 |
Quote: "Don't know if you got your boxes running yet or not, but if not, I still think it may have something to do with the permissions, since the WUs error out right away, so let's check those again please ... :)" I have done all of the above. Currently running 3x+1 on all 3 quads and doing quite well. I checked all the permissions. I went so far as to download 8 WUs and suspend them so I could catch them in the act. All permissions were OK. I installed and tested BOINC 5.2.10, 6.2.12, and now 6.2.14. All work with 3x+1 and none work with Milkyway. What do you mean, the devs have abandoned the project? Is this no longer a viable project? Maybe I should stay on 3x+1. |
Joined: 29 Aug 07 Posts: 486 Credit: 576,548,171 RAC: 0 |
Quote: "Don't know if you got your boxes running yet or not, but if not, I still think it may have something to do with the permissions, since the WUs error out right away, so let's check those again please ... :)" Okay, I was just trying to give you some things to try; I figured you may have tried them already ... The dev remark was just thrown in there since none of them have responded to anything for some time now, so nobody knows what's happened to them. About the only thing we've heard from them in the last few months is when they cut the WU limit to 8 (which is too low really; 20 was too low as far as I was concerned) and asked that we let them know if there was a problem with that. Well, a lot of people running the Quads & OctaQuads complained about it, but there's been no response or action from the devs on it so far. Summer dev blahs I guess ... :) |
Joined: 29 Jul 08 Posts: 9 Credit: 100,721,007 RAC: 0 |
Thanks, Bob! I'll stick with 3x+1 and let the rest of TeAm AnandTech pass me by! Oh, well. |
Joined: 29 Aug 07 Posts: 486 Credit: 576,548,171 RAC: 0 |
You could try posting your problem over at the BOINC Dev Forum. There are people there with way more knowledge about Linux than I have who are willing to work with you too. Post the following error from 1 of your WUs: <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process got signal 11 </message> <stderr_txt> Unrecognized XML in parse_init_data_file: computation_deadline Skipping: 1218907845.136000 Skipping: /computation_deadline Unrecognized XML in GLOBAL_PREFS::parse_override: mod_time Skipping: /mod_time Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100.000000 Skipping: /max_ncpus_pct </stderr_txt> All those "Unrecognized" lines have me wondering what they're about, and they may be the root cause of your problem ... bob |
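When posting output like the above to another forum, it can help to pull out the two pieces that matter: the `<message>` text and the list of tags reported as unrecognized. The sketch below does that with regular expressions on the pasted text. The function name `summarize_result_output` and the returned dictionary layout are assumptions made for this example; the sample text is copied from the post above, and nothing here is part of BOINC.

```python
import re
from typing import Dict, List, Optional, Tuple, Union

SAMPLE = (
    "<core_client_version>6.2.14</core_client_version> <![CDATA[ "
    "<message> process got signal 11 </message> <stderr_txt> "
    "Unrecognized XML in parse_init_data_file: computation_deadline "
    "Skipping: 1218907845.136000 Skipping: /computation_deadline "
    "Unrecognized XML in GLOBAL_PREFS::parse_override: mod_time "
    "Skipping: /mod_time "
    "Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct "
    "Skipping: 100.000000 Skipping: /max_ncpus_pct </stderr_txt>"
)

Summary = Dict[str, Union[Optional[str], List[Tuple[str, str]]]]


def summarize_result_output(text: str) -> Summary:
    """Extract the <message> text and the (parser, tag) pairs flagged as unrecognized."""
    message = re.search(r"<message>\s*(.*?)\s*</message>", text, re.DOTALL)
    unrecognized = re.findall(r"Unrecognized XML in (\S+): (\S+)", text)
    return {
        "message": message.group(1) if message else None,
        "unrecognized": unrecognized,
    }


if __name__ == "__main__":
    summary = summarize_result_output(SAMPLE)
    print(summary["message"])
    for parser, tag in summary["unrecognized"]:
        print(f"{parser} did not recognize <{tag}>")
```

Whether the unrecognized tags are related to the crash cannot be decided from this text alone; the summary just makes the question easier to ask.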