Message boards : Number crunching : Never Ending WU's + Invalid "Separation"
Send message Joined: 25 Dec 11 Posts: 20 Credit: 119,498,787 RAC: 0 |
Hello, is there a way to avoid running these never ending WUs? i.e., why don't they automatically stop after X minutes?
Also, I have a lot of INVALID "Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)" results... is this "normal"? Thank you. Best regards, Phil1966 |
Send message Joined: 25 Dec 11 Posts: 20 Credit: 119,498,787 RAC: 0 |
Hello, another "never ending" WU today:
1144844168 843256198 7 Jun 2015, 13:52:23 UTC 7 Jun 2015, 16:45:48 UTC Aborted by user 9,113.11 0.87 --- MilkyWay@Home v1.02 (opencl_amd_ati) :/
NB: I temporarily stopped running my 4 × GTX970s and bought 3 × HD7950s just to be able to crunch for MW. I am also running Milkyway (Campaign #3), credit 4,720,110,000, on BU. I really hope to get an answer to my two questions this time; the opposite would be really disappointing and frustrating. Thank you.
cf. http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3772&postid=63648#63648
http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3772&postid=63674#63674 |
Send message Joined: 25 Dec 11 Posts: 20 Credit: 119,498,787 RAC: 0 |
Had to cancel another one this morning, runtime > 7 hours :/
1145173555 843500743 7 Jun 2015, 20:42:16 UTC 8 Jun 2015, 4:05:44 UTC Aborted by user 26,320.96 1.03 --- MilkyWay@Home v1.02 (opencl_amd_ati) |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hi Phil1966, This is the first I'm hearing about never ending work units on the separation application. Would you mind checking your BOINC version number and posting that here for me? Jake W. |
Send message Joined: 25 Dec 11 Posts: 20 Credit: 119,498,787 RAC: 0 |
Dear Jake, thank you for your message. I am running BOINC 7.4.42 / W7 Ultimate / Machine ID 551092.
NB: Yesterday it ran just fine. Below are the references of the WUs I am referring to:
1145173555 843500743 551092 7 Jun 2015, 20:42:16 UTC 8 Jun 2015, 4:05:44 UTC Aborted by user 26,320.96 1.03 --- MilkyWay@Home v1.02 (opencl_amd_ati)
1143729764 842451651 551092 6 Jun 2015, 13:20:28 UTC 6 Jun 2015, 17:30:54 UTC Aborted by user 13,937.59 2.15 --- Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)
1143108316 841998131 551092 6 Jun 2015, 0:00:59 UTC 6 Jun 2015, 7:32:21 UTC Aborted by user 24,707.89 0.80 --- MilkyWay@Home v1.02 (opencl_amd_ati)
1142937681 841873815 551092 5 Jun 2015, 20:18:12 UTC 6 Jun 2015, 0:24:16 UTC Computation error 13,704.75 13,607.81 --- Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)
Thank you. Kind regards, Philippe |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
I've noticed that a few of your work units, the invalid ones, are returning blank stderr.txt files. This is very strange, and the fact that it only happens on a small percentage of tasks is interesting. Equally interesting is that you are the only one who seems to be getting stuck on the work units in question. I will do my best to reproduce the issue here. If you can give me any more info about the state of your system when this happens, please let me know. This is only happening on your dual-GPU system, correct? Jake W. |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
I do also get never ending WUs. Suspending/resuming them works, but I do get a bunch of invalids/errors:
http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=620091&offset=0&show_names=0&state=0&appid=
All (3581) · In progress (120) · Validation pending (0) · Validation inconclusive (182) · Valid (2632) · Invalid (527) · Error (120)
3 × HD7950 GPUs, running 2 WU/GPU.
No invalids on my Nvidia GTX980 (also running 2 WU/GPU):
All (329) · In progress (39) · Validation pending (0) · Validation inconclusive (30) · Valid (155) · Invalid (0) · Error (105) |
Send message Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0 |
Re: invalids due to stderr blankness. A lot of us get 5% to 7% invalids using HD7950s. On my AMD FX-8350 the whole stderr is always blank; on my Haswell-E with an HD7950 it always cuts off the same portion of the stderr, just at a different spot than [AF>EDLS]zOU's. The FX-8350 also had many more errors per day than the Haswell. The Haswell-E always truncates the stderr at the same place, right after the initial wait:
Iteration area: 560000
Chunk estimate: 1
Num chunks: 2
Chunk size: 559104
Added area: 558208
Effective area: 1118208
Initial wait: 16 ms
</stderr_txt>
]]>
That is where my Haswell-E always truncates the stderr. I've never had a hang, though. |
Send message Joined: 8 May 09 Posts: 3319 Credit: 520,349,186 RAC: 22,117 |
"i do also get never ending wu."
Have you tried running just one WU per GPU at a time to see if the errors still occur? Do you have an SLI cable connecting the GPUs? If so, take it off and see if the errors stop. I realize that SLI IS better for gaming, but it is NOT helpful for crunching; in fact it can cause errors. What else do you do while the PC is crunching on the GPUs? Do you leave any CPU cores free just for the GPUs to use? |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
"i do also get never ending wu."
Hello Mickey (are you the same person who helped me with my BU issue?). This computer is dedicated to BOINC. There are 2 Crossfire cables. There is a free CPU core. I've noticed that in the past 2 hours my GPU temperature has dropped dramatically (it used to reach 85 °C, now it is stable at 40 °C), even though the ambient temperature hasn't dropped and each GPU is still doing 2 WUs at a time.
==================================================
I've stopped the computer, removed the Crossfire connectors, and restarted it. We'll see. Thank you for the suggestion. (FYI, I initially used the Crossfire connectors because I couldn't connect each GPU to a monitor to get them to process WUs.)
Note: When I remove the Crossfire connectors, my GPUs show as connected to a PCI-E v1.1 slot. With the Crossfire connectors, they show a PCI-E 2.0 bus... well, except now they still show PCI-E 1.1 and WUs are taking a lot longer, which explains the temps. Anyway, I'm going to troubleshoot all that and post back... |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
I finally managed to get them to work at the proper PCI-E speed with the Crossfire connectors removed (don't ask me how...). I will monitor for "never ending WUs" now. (They're still crunching 2/GPU.) |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
All my invalids are Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101). I've removed that application from my preferences for now. |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hey everyone, Thanks for the feedback. I am still looking into what could be causing this. Can you all confirm that you do not have these issues with the regular separation application and only the "Modified Fit" application? Jake W. |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
Yes, check here for details :) http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=620091 This is my main Milkyway machine. But you can check the other one too :) |
Send message Joined: 24 Jan 11 Posts: 708 Credit: 543,330,824 RAC: 139,501 |
Jake, please look at this thread: http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3662
The problem is with the underlying BOINC code. I have just come to accept the 3% error rate here at MW. The MW applications are mainly to blame for exposing the BOINC coding problem. Most of my errors are with the 1.36 Modified Fit application, but I have seen the truncated stderr.txt problem with the 1.02 app also. The problem is worst with the FX-8350, though it occurs with other chip types, as mentioned here in the thread. It doesn't matter whether you are running SLI, Crossfire, or single or multiple tasks per card; the problem occurs because of an issue with the BOINC code. I have brought the problem to the BOINC developers' attention and they have acknowledged it, just with no timeline for when the issue will get resolved. Cheers, Keith |
Send message Joined: 24 Jan 11 Posts: 708 Credit: 543,330,824 RAC: 139,501 |
Haven't seen any response from Jake yet. I just wanted to comment that the MW app developers could help with the problem by increasing work unit completion times, i.e. doing more work with larger computations, as has been discussed. Simply put, the Modified Fit task completes too quickly for the underlying BOINC code to clean up the slot after the task finishes. If anyone wants to see what is happening, just turn on slot_debug in the BOINC Manager and look at the logfiles. Match up the invalid task entries and you will see that the slot cleanup doesn't work correctly for the very short completion times of the 1.36 Modified Fit tasks. That is why it looks like only the 1.36 app has problems, but in fact it can occur with the 1.02 app too if the system is busy crunching work for other projects and system resources can't be applied in a timely manner to get ahead of the underlying problems in the BOINC code. Cheers, Keith
P.S. So get busy on releasing a new Modified Fit app with larger work tasks. |
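[Editor's note: for anyone who wants to reproduce this diagnostic, slot_debug is one of BOINC's standard client log flags. A minimal cc_config.xml enabling it, placed in the BOINC data directory, would look like this:]

```xml
<cc_config>
  <log_flags>
    <!-- log when tasks claim and release slot directories -->
    <slot_debug>1</slot_debug>
  </log_flags>
</cc_config>
```

After saving the file, restart the client (or have it re-read the config files from the BOINC Manager menu); the per-slot messages then appear in the event log, where you can match slot numbers against the invalidated tasks.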
Send message Joined: 8 May 09 Posts: 3319 Credit: 520,349,186 RAC: 22,117 |
"i do also get never ending wu."
Yes, it's me... The drop in temps and the extra-long work unit run times could be because your GPUs crashed and were running at the default low clocks. Rebooting the PC should bring them back up to normal speed. As for the PCI-E bus speed: if your cards HAVE the Crossfire connector on them, they will all think they are on the faster bus; once you remove it, they will show what they are really on. Do you use a cc_config.xml file like this to tell BOINC to use all your GPUs?
<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>
You MUST use Notepad to copy and paste the info and then save the file as a txt-type file (it will complain, but say yes), and then save it in the C:\ProgramData\BOINC directory. That way all your BOINC projects can use it. You should exit and restart BOINC after putting the file in there. Each GPU will then crunch its own WU and finish in its own time. |
Send message Joined: 31 Mar 08 Posts: 22 Credit: 84,159,673 RAC: 0 |
Thank you. My cc_config was correct. I also had to connect each GPU to a monitor and extend the Windows desktop. I'm glad I have monitors with multiple inputs, as I didn't want to put resistors on the DVI-to-VGA adapters. Now I may have to do that if I want to use the APU engine... but I'm sure it's worth it.
Also, I had no errors or invalid WUs for 24 hours on this computer, although I had to suspend/resume some WUs. A few inconclusives, but that's usual. |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hey everyone, thank you for letting me know this is a BOINC issue! It seems strange that it would only affect the Modified Fit application if it were a BOINC issue, though, so I will still be looking into whether we are causing the issue somehow. Jake W. |
Send message Joined: 24 Jan 11 Posts: 708 Credit: 543,330,824 RAC: 139,501 |
Hi Jake, as I stated in the previous post, the main issue is the very short runtime of the Modified Fit app. It simply completes too fast on fast GPU crunchers for the BOINC slot cleanup code to finish before another task occupies the same slot as the previous task and ends up overwriting the stderr.txt file. This is why you get the completely truncated stderr.txt output, which then reports as an invalid. The solution is either to wait for the BOINC developers to fix this and the >4GB file deletion problem in the slot cleanup code, or to make the Modified Fit tasks run a lot longer. If they ran in the 2 to 2.5 minute completion time of the 1.02 app, then you wouldn't produce invalids in most cases. If you want to see the problem happen in real time, just enable slot_debug in the BOINC Manager and look at the logfile entries for the invalidated tasks. You will see that a new task occupies the slot of a previous task ahead of the final slot cleanup for that completed task. Cheers, Keith |
©2024 Astroinformatics Group