Welcome to MilkyWay@home

new trend: open_cl not always terminating when done

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 48839 - Posted: 17 May 2011, 20:03:18 UTC
Last modified: 17 May 2011, 20:26:42 UTC

I noticed a disturbing trend: some open_cl tasks are running way past 100% completion. This is more obvious in BoincTasks than in BOINC Manager. I was told (at the setiathome project) that there is a 10x timeout, i.e., if the expected completion time was 30 minutes and the task had already run for over 300 minutes, it was supposed to be terminated automatically. I do not know if that is a boinc rule or just a setiathome rule.
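
For illustration, the 10x rule as relayed here could be written as the check below. This is only a sketch: the function name and the `factor` parameter are made up for this example, and whether BOINC or the project actually enforces such a limit is exactly the open question in this post.

```python
def exceeds_runtime_limit(elapsed_minutes: float,
                          expected_minutes: float,
                          factor: float = 10.0) -> bool:
    """Return True when a task has run longer than factor x its estimate.

    Hypothetical helper: the default factor of 10 is the rule as it was
    described to the poster, not a confirmed BOINC setting.
    """
    if expected_minutes <= 0:
        raise ValueError("expected_minutes must be positive")
    return elapsed_minutes > factor * expected_minutes


# The example from the post: a 30-minute estimate gives a 300-minute limit.
print(exceeds_runtime_limit(301, 30))  # past the limit
print(exceeds_runtime_limit(45, 30))   # over the estimate, but within the limit
```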

Anyway, I discovered that if I terminate boinc and restart it, the milkyway open_cl task shows up as done and reports successfully. Previously I had been aborting tasks that never completed, thinking they were hung.

This trend seems to have started recently. I updated the nvidia drivers to 270.61 last week and started switching my Windows systems to 6.12.26 about 3 days ago.

I also observed this same problem at setiathome on cuda_fermi. I am unsure how cuda_fermi compares to open_cl. I have not seen this problem on cuda23.

I started switching to 6.12.26 a few days ago, about the time I started noticing this problem. However, it could have been going on before then, as I never paid much attention. It is difficult to document the problem because milkyway deletes results shortly after they have been reported.

Does milkyway abort tasks that take too long to finish, or is boinc supposed to do that?

{EDIT} I just read some of the other posts where people terminate tasks that have run for a long time, i.e., the de_separation problem. Possibly this is the same problem: the task has actually finished but for some reason continues to run.
ExtraTerrestrial Apes
Joined: 1 Sep 08
Posts: 204
Credit: 219,354,537
RAC: 0
Message 48841 - Posted: 17 May 2011, 21:14:04 UTC - in response to Message 48839.  

What amount of time are we talking about here? The current ATI app (and probably the nVidia one too) does some calculations on the CPU after it reaches 100% (= GPU finished). On my C2Q at 2.88 GHz this takes 7 - 9 seconds per WU. But I don't think you're talking about this, since people hardly restart BOINC or abort tasks because they're a few seconds "overdue".

MrS
Scanning for our furry friends since Jan 2002
Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 48843 - Posted: 17 May 2011, 22:13:24 UTC - in response to Message 48839.  

.... I did update nvidia drivers last week to 270.61 .....


270.61 is turning out to be one they would rather forget about. It's causing many issues all over the place (Google it if curious ....). You'll also find a large number of references to it in BOINCLand.

Roll back from 270.61; going by past form elsewhere, it's likely that the issues experienced lately will disappear.

Regards
Zy
arkayn
Joined: 14 Feb 09
Posts: 999
Credit: 74,932,619
RAC: 0
Message 48844 - Posted: 17 May 2011, 23:23:47 UTC

They just released 275.27 today. I installed it on my GTX460 machine and will see if they got the downclock bug fixed.
Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 48847 - Posted: 18 May 2011, 11:47:59 UTC
Last modified: 18 May 2011, 12:16:40 UTC

OK - I got another one that keeps running as shown here:



However, it did NOT finish when I restarted boinc. Instead it simply picked up at 0.0 percent complete and is currently running.

Note that the percent complete shows up as 1611.906%, which, as I understand BoincTasks, indicates that the task ran 16x longer than the original estimate. 3:30:09 (3:25:18) indicates that the run time was 3:30 (3 hours, 30 minutes) and the cpu time was 3:25. This system has a gtx280 and EVGA-Precision is showing GPU utilization of 95%, so the GPU is currently being used. I would have expected something like the following (just under 16 minutes) for this other system that also has a gtx280:
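
The arithmetic behind that reading can be sketched as follows. It assumes, as the post does, that the BoincTasks percentage is simply elapsed time divided by the original estimate; the helper names are invented for this example and are not part of BoincTasks.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string such as '3:30:09' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def implied_estimate_seconds(elapsed_hms: str, percent_shown: float) -> float:
    """Back out the original estimate, assuming percent = elapsed / estimate."""
    if percent_shown <= 0:
        raise ValueError("percent_shown must be positive")
    return hms_to_seconds(elapsed_hms) / (percent_shown / 100.0)


# Values quoted in the post.
elapsed = "3:30:09"
percent = 1611.906

ratio = percent / 100.0
estimate = implied_estimate_seconds(elapsed, percent)
print(f"ran {ratio:.1f}x the estimate")
print(f"implied original estimate: {estimate / 60:.1f} minutes")
```

Under that assumption, the figures quoted above imply an original estimate of roughly 13 minutes for this task.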



As noted, this example did NOT finish when I restarted BOINC. I did have a project finish and I observed the result at the web site, but it is long gone because results are deleted per milkyway policy.

Back to my original question: Does milkyway have a policy to terminate tasks that take 10x as long as they should and show no progress? I was told that setiathome had a policy like that. It seems to me that the boinc program is the one that should terminate runaway tasks.

BTW, BoincMgr shows 0.0% completion for these tasks. Only BoincTasks shows the 1600% or so, since it calculates percent complete differently.

[EDIT] I just checked again and this task, de_separation_13_3s_fix10_2_2611259_1305697346, is not present on my system. Neither is it present on my results page at this web site. It is as if it disappeared off the face of the earth. Anyway, I assume it completed successfully and was deleted within seconds of being uploaded. I was away from my computer for about 20 minutes and I assume all this transpired during that time.
Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 48856 - Posted: 19 May 2011, 1:39:39 UTC
Last modified: 19 May 2011, 1:43:16 UTC

I think my problems are caused by an nvidia driver reset while cuda tasks are crunching. I am still having random resets even with the new beta driver 275.27. They are less frequent than with the 270 driver I just upgraded from. Project software (setiathome cuda or milkyway cuda) does not seem to recover from a driver reset, and I suspect the task hangs until boinc restarts it, i.e., restarting boinc causes the setiathome and milkyway tasks to restart, thus clearing up the problem.

I have one system with a combo of nvidia 570 and ati 5850 and have not seen a driver reset there, and that system does not have these problems. My other windows systems have a mix of 460, 280 and 9800 and those seem to have random driver resets.

IANE, but I suspect that this could be solvable if the windows part of the project software recognized the nvidia driver reset and restarted itself.
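
As a rough sketch of the idea, a watchdog could at least flag a task whose reported progress has stopped moving, which is the visible symptom described above. This does not detect the driver reset itself; the function name, the sample format, and the idle threshold are all assumptions made for this example, not anything BOINC or the project apps provide.

```python
from typing import Iterable, Tuple


def is_stalled(samples: Iterable[Tuple[float, float]],
               max_idle_seconds: float) -> bool:
    """Return True if the latest progress value has not changed for at
    least max_idle_seconds.

    samples: (timestamp_seconds, fraction_done) pairs in time order.
    Hypothetical helper; how the samples are collected is left open.
    """
    history = list(samples)
    if not history:
        return False
    last_time, last_fraction = history[-1]
    # Walk backwards to the start of the trailing run of identical values.
    idle_since = last_time
    for timestamp, fraction in reversed(history):
        if fraction != last_fraction:
            break
        idle_since = timestamp
    return last_time - idle_since >= max_idle_seconds


# Progress stuck at 20% from t=60s to t=600s, with a 5-minute threshold.
stuck = [(0, 0.10), (60, 0.20), (120, 0.20), (600, 0.20)]
print(is_stalled(stuck, max_idle_seconds=300))
```

A real fix would still have to decide what to do once a stall is flagged, for example restarting the task from its last checkpoint.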
ExtraTerrestrial Apes
Joined: 1 Sep 08
Posts: 204
Credit: 219,354,537
RAC: 0
Message 48864 - Posted: 19 May 2011, 18:49:22 UTC

I've had driver resets with ATI in the past and consider it normal that you have to restart BOINC afterwards. Otherwise BOINC still thinks the tasks are crunching, whereas in reality nothing ever happens. If I remember correctly GPU utilization even drops to 0 on the ATIs.

MrS
Scanning for our furry friends since Jan 2002
Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 48869 - Posted: 19 May 2011, 22:09:56 UTC - in response to Message 48856.  
Last modified: 19 May 2011, 22:11:24 UTC

..... I have one system with combo nvidia 570 and ati 5850 and have not seen a driver reset and that system does not have these problems. My other windows systems have a mix of 460, 280 and 9800 and those seem to have random driver resets, ......


There are two things which often get overlooked in terms of driver resets - they are not common as such, but worth a quick recheck if likely to be marginal.

First is card voltage, unlikely in your case as it would seem you don't overclock as such. However, an o/c can get to the stage where additional voltage is needed to move on further, and at that tipping point, additional voltage - usually only a small addition - often nails that one.

The other is slightly connected. PSUs are often "forgotten" as time moves on, and if in fact they are marginal (or slightly unstable), with output on the edge compared to demand from the machine, that'll cause a reset at the tipping point as well. If there are chunky PSUs in there, then it clearly will not apply to you. But if, on a sober cold recalculation with no assumptions (including the significant power draw of o/c CPU and GPU needs), the PSU is in fact marginal, it would bear looking at more closely.

Regards
Zy
Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 48943 - Posted: 24 May 2011, 13:26:57 UTC

There is another possible cause for this problem. When I (occasionally) run DVDFab8 to copy a movie (BD-R), I use "boinccmd.exe --quit" to terminate boinc and stop all tasks, as DVDFab utilizes CUDA for encoding when compression is required. Occasionally I see a flash on my monitors when this happens (stopping boinc). I do not recall seeing this flash before driver 270. Maybe something goes wrong in the driver when this happens and the flash is a symptom.

I posted a question over at the DVDFab forum about sharing the GPU with boinc but never got an answer. I assume I cannot run DVDFab8 concurrently with BOINC GPU apps when encoding is required, so I routinely stop BOINC before bringing up DVDFab8.
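
That routine could be scripted along the lines below. It is a minimal sketch: `boinccmd --quit` is the command quoted above, but the function names and the injectable `runner` argument are inventions for this example, and the script does not wait for the GPU tasks to actually exit before starting the second program.

```python
import subprocess
from typing import Callable, List, Sequence


def build_quit_command(boinccmd: str = "boinccmd") -> List[str]:
    """Build the command line that asks the BOINC client to quit."""
    return [boinccmd, "--quit"]


def stop_boinc_then_run(other_command: Sequence[str],
                        boinccmd: str = "boinccmd",
                        runner: Callable[..., object] = subprocess.run) -> None:
    """Ask BOINC to quit, then start another program.

    runner defaults to subprocess.run; passing a stand-in makes it
    possible to try the sequence without touching a real BOINC install.
    """
    # check=True raises if boinccmd reports a failure.
    runner(build_quit_command(boinccmd), check=True)
    runner(list(other_command), check=True)
```

A call might look like `stop_boinc_then_run(["DVDFab8.exe"], boinccmd="boinccmd.exe")`, where the DVDFab executable name is a placeholder for whatever path is used on the machine.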

Note: Power supplies are adequate and I do not overclock.
Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 48946 - Posted: 24 May 2011, 14:46:00 UTC - in response to Message 48943.  

...... I do not recall seeing this flash before driver 270. Maybe something goes wrong in the driver when this happens and the flash is a symptom.....


Rut Roh .....

If you are running the 270 driver, get out of it asap; it's a nightmare waiting to happen / is happening. That version was a disaster (acknowledged by NVidia). Load the latest 275 drivers; they appear to have resolved the 270 dramas.

When you've done that, make sure you are running MW "normally" - no fiddling or temp experiments et al - and let's see what happens then.

Regards
Zy
ExtraTerrestrial Apes
Joined: 1 Sep 08
Posts: 204
Credit: 219,354,537
RAC: 0
Message 48949 - Posted: 24 May 2011, 20:30:03 UTC - in response to Message 48943.  

I assume I cannot run DVDFab8 concurrently with BOINC GPU apps when encoding is required, so I routinely stop BOINC before bringing up DVDFab8.


Why not give it a try? It might run without any problems, just slower (well, and thereby defeat the purpose of using CUDA in the first place..).

MrS
Scanning for our furry friends since Jan 2002
