Message boards : Number crunching : Never Ending WU's + Invalid "Separation"
Send message Joined: 25 Dec 11 Posts: 20 Credit: 119,498,787 RAC: 0 |
Hello, is there a way to avoid running these never ending WUs? i.e., why don't they automatically stop after X minutes?
Also, I have a lot of INVALID "Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)" results... is this "normal"? Thank you. Best regards, Phil1966 |
Send message Joined: 25 Dec 11 Posts: 20 Credit: 119,498,787 RAC: 0 |
Hello, another "never ending" WU today:
1144844168 843256198 7 Jun 2015, 13:52:23 UTC 7 Jun 2015, 16:45:48 UTC Aborted by user 9,113.11 0.87 --- MilkyWay@Home v1.02 (opencl_amd_ati) :/
NB: I temporarily stopped running my 4 × GTX970s and bought 3 × HD7950s just to be able to crunch for MW. I am also running Milkyway (Campaign #3), credit 4,720,110,000, on BU. I really hope to get an answer to my two questions this time; the opposite would be really disappointing and frustrating. Thank you.
cf. http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3772&postid=63648#63648
http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3772&postid=63674#63674 |
Send message Joined: 25 Dec 11 Posts: 20 Credit: 119,498,787 RAC: 0 |
Had to cancel another one this morning, runtime > 7 hours :/
1145173555 843500743 7 Jun 2015, 20:42:16 UTC 8 Jun 2015, 4:05:44 UTC Aborted by user 26,320.96 1.03 --- MilkyWay@Home v1.02 (opencl_amd_ati) |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hi Phil1966, This is the first I'm hearing about never ending work units on the separation application. Would you mind checking your BOINC version number and posting that here for me? Jake W. |
Send message Joined: 25 Dec 11 Posts: 20 Credit: 119,498,787 RAC: 0 |
Dear Jake, thank you for your message. I am running BOINC 7.4.42 / W7 Ultimate / Machine ID 551092.
NB: Yesterday it ran just fine. Below are the references of the WUs I am referring to:
1145173555 843500743 551092 7 Jun 2015, 20:42:16 UTC 8 Jun 2015, 4:05:44 UTC Aborted by user 26,320.96 1.03 --- MilkyWay@Home v1.02 (opencl_amd_ati)
1143729764 842451651 551092 6 Jun 2015, 13:20:28 UTC 6 Jun 2015, 17:30:54 UTC Aborted by user 13,937.59 2.15 --- Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)
1143108316 841998131 551092 6 Jun 2015, 0:00:59 UTC 6 Jun 2015, 7:32:21 UTC Aborted by user 24,707.89 0.80 --- MilkyWay@Home v1.02 (opencl_amd_ati)
1142937681 841873815 551092 5 Jun 2015, 20:18:12 UTC 6 Jun 2015, 0:24:16 UTC Computation error 13,704.75 13,607.81 --- Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)
Thank you. Kind regards, Philippe |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
I've noticed that a few of your work units, the invalid ones, are returning blank stderr.txt files. This is very strange, and the fact that it only happens on a small percentage of tasks is interesting. Equally interesting is that you are the only one who seems to be getting stuck on the work units in question. I will do my best to reproduce the issue here. If you can give me any more info about the state of your system when this happens, please let me know. This is only happening on your dual-GPU system, correct? Jake W. |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
I do also get never ending WUs. Suspending/resuming them works, but I do get a bunch of invalids/errors:
http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=620091&offset=0&show_names=0&state=0&appid=
All (3581) · In progress (120) · Validation pending (0) · Validation inconclusive (182) · Valid (2632) · Invalid (527) · Error (120)
3 × HD7950 GPUs, running 2 WU/GPU.
No invalids on my Nvidia GTX980 (also running 2 WU/GPU):
All (329) · In progress (39) · Validation pending (0) · Validation inconclusive (30) · Valid (155) · Invalid (0) · Error (105) |
Send message Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0 |
Re: invalids due to stderr blankness. A lot of us get 5% to 7% invalids using HD7950s. On my AMD FX-8350 the whole stderr is always blank; on my Haswell-E with an HD7950 it always cuts off the same portion of the stderr, just at a different spot than [AF>EDLS]zOU's. The FX-8350 also had many more errors per day than the Haswell. The Haswell-E always truncates the stderr at the same place, right after the initial wait:
Iteration area: 560000
Chunk estimate: 1
Num chunks: 2
Chunk size: 559104
Added area: 558208
Effective area: 1118208
Initial wait: 16 ms
</stderr_txt>
]]>
That is where my Haswell-E always truncates the stderr. I've never had a hang, though. |
Send message Joined: 8 May 09 Posts: 3319 Credit: 520,349,186 RAC: 22,117 |
"i do also get never ending wu."
Have you tried running just one WU per GPU at a time to see if the errors still occur? Do you have an SLI cable connecting the GPUs? If so, take it off and see if the errors stop. I realize that SLI IS better for gaming, but it is NOT helpful for crunching; in fact it can cause errors. What else do you do while the PC is crunching on the GPUs? Do you leave any CPU cores free just for the GPUs to use? |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
"i do also get never ending wu."
Hello Mickey (are you the same person who helped me with my BU issue?). This computer is dedicated to BOINC. There are 2 Crossfire cables. There is a free CPU core. I've noticed that in the past 2 hours my GPU temperature has dropped dramatically (it used to reach 85 °C, now it is stable at 40 °C), even though the ambient temperature hasn't dropped and each GPU is still doing 2 WUs at a time.
==================================================
I've stopped the computer, removed the Crossfire connectors, and restarted it. We'll see. Thank you for the suggestion. (FYI, I initially used the Crossfire connectors because I couldn't connect each GPU to a monitor to get them to process WUs.)
Note: When I remove the Crossfire connectors, my GPUs show as connected to a PCI-E v1.1 slot. With the Crossfire connectors, they show a PCI-E 2.0 bus... well, except now they still show PCI-E 1.1 and WUs are taking a lot longer, which explains the temps. Anyway, I'm going to troubleshoot all that and post back... |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
I finally managed to get them to work at the proper PCI-E speed with the Crossfire connectors removed (don't ask me how...). I will monitor for "never ending WUs" now. (They're still crunching 2/GPU.) |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
All my invalids are Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101). I've removed that application from my preferences for now. |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hey everyone, Thanks for the feedback. I am still looking into what could be causing this. Can you all confirm that you do not have these issues with the regular separation application and only the "Modified Fit" application? Jake W. |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
Yes, check here for details :) http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=620091 This is my main Milkyway machine. But you can check the other one too :) |
Send message Joined: 24 Jan 11 Posts: 708 Credit: 543,330,824 RAC: 139,501 |
Jake, please look at this thread: http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3662
The problem is with the underlying BOINC code. I have just come to accept the 3% error rate here at MW. The MW applications are mainly to blame for exposing the BOINC coding problem. Most of my errors are with the 1.36 Modified Fit application, but I have seen the truncated stderr.txt problem with the 1.02 app also. The problem is worst with the FX-8350, though it occurs with other chip types, as mentioned here in the thread. It doesn't matter whether you are running SLI, Crossfire, or single or multiple tasks per card; the problem occurs because of an issue with the BOINC code. I have brought the problem to the BOINC developers' attention and they have acknowledged it, just with no timeline for when the issue will get resolved. Cheers, Keith |
Send message Joined: 24 Jan 11 Posts: 708 Credit: 543,330,824 RAC: 139,501 |
Haven't seen any response from Jake yet. I just wanted to comment that the MW app developers could help with the problem by increasing work unit completion times, i.e. doing more work with larger computations, as has been discussed. Simply put, the Modified Fit task completes too quickly for the underlying BOINC code to clean up the slot after the task finishes. If anyone wants to see what is happening, just turn on slot_debug in the BOINC Manager and look at the logfiles. Match up the invalid task entries and you will see that the slot cleanup doesn't work correctly for the very short completion times of the 1.36 Modified Fit tasks. That is why it looks like only the 1.36 app has problems, but in fact it can occur with the 1.02 app too if the system is busy crunching work for other projects and system resources can't be applied in a timely manner to get ahead of the underlying problems in the BOINC code. Cheers, Keith
P.S. So get busy on releasing a new Modified Fit app with larger work tasks. |
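[Editor's note: for anyone who wants to reproduce this diagnostic, slot_debug is one of BOINC's standard client log flags. A minimal cc_config.xml enabling it, placed in the BOINC data directory, would look like this:]

```xml
<cc_config>
  <log_flags>
    <!-- log when tasks claim and release slot directories -->
    <slot_debug>1</slot_debug>
  </log_flags>
</cc_config>
```

After saving the file, restart the client (or have it re-read the config files from the BOINC Manager menu); the per-slot messages then appear in the event log, where you can match slot numbers against the invalidated tasks.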
Send message Joined: 8 May 09 Posts: 3319 Credit: 520,349,186 RAC: 22,117 |
"i do also get never ending wu."
Yes, it's me... The drop in temps and the extra-long work unit run times could be because your GPUs crashed and were running at the default low clocks. Rebooting the PC should bring them back up to normal speed. As for the PCI-E bus speed: if your cards HAVE the Crossfire connector on them, they will all think they are on the faster bus; once you remove it, they will show what they are really on. Do you use a cc_config.xml file like this to tell BOINC to use all your GPUs?
<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>
You MUST use Notepad to copy and paste the info and then save the file as a txt-type file (it will complain, but say yes), and then save it in the C:\ProgramData\BOINC directory. That way all your BOINC projects can use it. You should exit and restart BOINC after putting the file in there. Each GPU will then crunch its own WU and finish in its own time. |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
Thank you. My cc_config was correct. I also had to connect each GPU to a monitor and extend the Windows desktop. I'm glad I have monitors with multiple inputs, as I didn't want to put resistors on the DVI-to-VGA adapters. Now I may have to do that if I want to use the APU engine... but I'm sure it's worth it.
Also, I had no errors or invalid WUs for 24 hours on this computer, although I had to suspend/resume some WUs. A few inconclusives, but that's usual. |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hey everyone, thank you for letting me know this is a BOINC issue! It seems strange that it would only affect the Modified Fit application if it were a BOINC issue, though, so I will still be looking into whether we are causing the issue somehow. Jake W. |
Send message Joined: 24 Jan 11 Posts: 708 Credit: 543,330,824 RAC: 139,501 |
Hi Jake, as I stated in the previous post, the main issue is the very short runtime of the Modified Fit app. It simply completes too fast on fast GPU crunchers for the BOINC slot cleanup code to finish before another task occupies the same slot as the previous task and ends up overwriting the stderr.txt file. This is why you get the completely truncated stderr.txt output, which then reports as an invalid. The solution is either to wait for the BOINC developers to fix this and the >4GB file deletion problem in the slot cleanup code, or to make the Modified Fit tasks run a lot longer. If they ran in the 2 to 2.5 minute completion time of the 1.02 app, then you wouldn't produce invalids in most cases. If you want to see the problem happen in real time, just enable slot_debug in the BOINC Manager and look at the logfile entries for the invalidated tasks. You will see that a new task occupies the slot of a previous task ahead of the final slot cleanup for that completed task. Cheers, Keith |
©2024 Astroinformatics Group