Welcome to MilkyWay@home

Longer WUs for GPUs?

therealjcool

Joined: 5 Oct 09
Posts: 22
Credit: 22,661,352
RAC: 0
Message 32161 - Posted: 8 Oct 2009, 22:16:33 UTC
Last modified: 8 Oct 2009, 22:17:53 UTC

I already mentioned it here, but I do feel the need to separately address this point so we can have a proper discussion about this:

WU runtimes on high-end GPUs are far too short, so everyone running a high-end GPU loses a noticeable amount of performance. The cause is the time it takes to start a new WU after the previous one has finished. On an HD4870 this is already quite significant, but it reaches a whole new level on an HD5800 series card (which will probably become pretty popular in the coming months).

To give you an idea of what I'm talking about, have a look at this graph:

[Image: graph of GPU load over time]

Every 21 seconds, the GPU drops down while a new WU is loaded into memory or whatever.

Anybody else seeing this issue?

kashi
Joined: 30 Dec 07
Posts: 311
Credit: 149,490,184
RAC: 0
Message 32168 - Posted: 8 Oct 2009, 23:37:45 UTC - in response to Message 32161.  
Last modified: 8 Oct 2009, 23:42:01 UTC

therealjcool wrote:
".....Anybody else seeing this issue?"

Yes, it's one of the main reasons I run 2 tasks concurrently on my HD 4890. That way the load only drops to 80% when a task finishes instead of to 0%. To stagger them, run one task alone until it reaches 50%, then quickly resume all the others from being suspended. That way the 2 tasks should finish at different times.

I don't do this for the slight amount of extra credit it yields. I just prefer a more constant load on the GPU, because I think that may be better for the card than going from full load to zero load every 40 seconds.

Gary Roberts
Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,499
RAC: 0
Message 32172 - Posted: 9 Oct 2009, 0:00:32 UTC - in response to Message 32161.  

therealjcool wrote:
"Every 21 seconds, the GPU drops down while a new WU is loaded into memory or whatever.

Anybody else seeing this issue?"

Everybody has the same behaviour. Most people probably don't notice and those that do probably don't see it as an 'issue' to be addressed.

I've been doing a little experimenting with a budget system: E3200 Celeron dual core on Asus P5KPL-AM/PS, 2GB, HD4850 512MB, WinXP 32 bit, Catalyst 8.12 drivers, BOINC 6.10.13. Here are some results:

App version   CPU secs   GPU secs   Wall secs   AP parameters
0.20          7.5        51.4       52.5        <count>1</count>, <cmdline>f6 w0.9</cmdline>
0.20b         4.3        51.5       52.6        <count>1</count>, <cmdline>f6</cmdline>
0.20b         4.6        51.4       101.1       <count>0.5</count>, <cmdline>f6</cmdline>

NOTES:
1. The numbers are averages of a few tasks (not enough to be really accurate) where both CPUs are 100% on Einstein.
2. The machine does nothing else but crunch.
3. The machine was allowed to 'settle' before looking at results on the website using a different machine.
4. Once things 'settle' the numbers seem to be rather constant - not observed for long enough yet to be sure.
5. During the period of observation only Einstein and Milkyway were active.
6. f6 was used so that the allowed runtime per iteration (166.67ms) would be larger than the predicted runtime per iteration (150ms).
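
As a rough check of the f6 figure: the 166.67 ms budget matches 1000 ms divided by 6. The sketch below assumes, without confirmation from the app's readme, that the f value behaves like a per-second rate. The function name is made up for illustration and is not part of the application.

```python
def allowed_ms_per_iteration(f: float) -> float:
    """Allowed runtime per iteration in milliseconds.

    Assumption: the f value acts like a rate per second,
    so the budget per iteration is 1000 ms / f.
    """
    return 1000.0 / f


predicted_ms = 150.0                      # predicted runtime per iteration from the note
allowed_ms = allowed_ms_per_iteration(6)  # the f6 setting used in the tests

print(f"f6 allows {allowed_ms:.2f} ms per iteration")
print("budget exceeds prediction:", allowed_ms > predicted_ms)
```

If that reading of f is right, a higher f value gives a smaller budget, so it has to be chosen with the predicted runtime in mind.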

CONCLUSIONS:
1. The new 0.20b gets ~100% GPU usage without needing the w option.
2. The new 0.20b has the same MW performance as 0.20.
3. The new 0.20b uses less CPU and so should allow a bit better efficiency for the CPU project.
4. You can gain a little MW efficiency by running 2 tasks concurrently - 101.1 secs instead of 105.2 secs wall clock time.
Cheers,
Gary.
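
The arithmetic behind conclusion 4 can be checked in a few lines of Python. The inputs are the two 0.20b wall-clock figures from the results above. The function name is only illustrative.

```python
def concurrency_gain(single_wall_s: float, paired_wall_s: float):
    """Compare two tasks run back to back with two tasks run concurrently.

    Returns (sequential seconds, seconds saved, fraction saved).
    """
    sequential_s = 2 * single_wall_s
    saved_s = sequential_s - paired_wall_s
    return sequential_s, saved_s, saved_s / sequential_s


# 52.6 s wall per task with count 1, 101.1 s wall for a pair with count 0.5
sequential_s, saved_s, fraction = concurrency_gain(52.6, 101.1)

print(f"sequential: {sequential_s:.1f} s, concurrent: 101.1 s")
print(f"saved: {saved_s:.1f} s per pair ({fraction:.1%})")
```

On these averages the gain is a little under 4%, which fits the "gain a little" wording. With so few tasks measured, the exact percentage should not be taken too literally.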

Cluster Physik
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 32174 - Posted: 9 Oct 2009, 0:06:35 UTC - in response to Message 32172.  

Gary Roberts wrote:
"You can gain a little MW efficiency by running 2 tasks concurrently - 101.1 secs instead of 105.2 secs wall clock time."

That will be quite significant for a HD5870. Something in the range of 10%.
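
One way to arrive at a figure near 10% is sketched below. It assumes that the time saved per task is a fixed start-up gap that does not shrink on a faster card, which is an assumption and not something measured in this thread. It also takes 21 s as the HD5870 task interval reported in the first post.

```python
# Wall-clock figures from the HD4850 measurements (0.20b, f6).
single_wall_s = 52.6    # one task at a time
paired_wall_s = 101.1   # two tasks run concurrently

# Time saved per task by overlapping two tasks on the HD4850.
saved_per_task_s = (2 * single_wall_s - paired_wall_s) / 2


def relative_gain(saved_s: float, task_interval_s: float) -> float:
    """Fraction of each task interval recovered, assuming a fixed gap per task."""
    return saved_s / task_interval_s


hd4850_gain = relative_gain(saved_per_task_s, single_wall_s)
hd5870_gain = relative_gain(saved_per_task_s, 21.0)  # 21 s per WU reported for the HD5870

print(f"HD4850: {hd4850_gain:.1%}")
print(f"HD5870: {hd5870_gain:.1%}")
```

The same two seconds or so per task matter far more when the task itself only lasts 21 seconds.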

therealjcool
Joined: 5 Oct 09
Posts: 22
Credit: 22,661,352
RAC: 0
Message 32175 - Posted: 9 Oct 2009, 0:08:52 UTC

Nice, thanks. I will update to 0.20b tomorrow/later today and also activate 2 WUs per GPU and see where that lands me.

I am already seeing close to 0 CPU time on the GPU WUs though, so that's definitely a plus.

Paul D. Buck
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 32184 - Posted: 9 Oct 2009, 4:52:17 UTC

Well, I agree it is a problem, and it is one of the things we have been asking the project to consider for some time now. My personal take is that tasks of about one to three hours would be a good balance ...

Sadly, it does not look like it is going to happen. Collatz, as an alternative, has tasks that take about 10 minutes for me, and we might be better able to prevail over there once it gets a little more stable. There may also be other issues I am not aware of that would limit the ability to extend the search if the tasks are extended in scope (running out of memory, for example).

Still we can hope Travis might look in and ... well ... I can hope ... :)

Glenn Rogers
Joined: 4 Jul 08
Posts: 165
Credit: 364,966
RAC: 0
Message 32186 - Posted: 9 Oct 2009, 8:54:11 UTC

Can someone please post a link to the new GPU/CPU apps, v0.20b and higher if any?

therealjcool
Joined: 5 Oct 09
Posts: 22
Credit: 22,661,352
RAC: 0
Message 32188 - Posted: 9 Oct 2009, 9:59:19 UTC

0.20b versions are here: http://www.file-upload.net/download-1931227/Milkyway_0.20b_ATI.zip.html

Ok, so I am getting mixed results with running 2 WUs concurrently. It seems to work for some time, but it looks like the program tries to "auto-sync" the WUs after a while. I deliberately paused one so that the 2 WUs would not finish together. That works for a few minutes, but then the WUs drift back into sync and I get worse performance than with only 1 WU running :(


kashi
Joined: 30 Dec 07
Posts: 311
Credit: 149,490,184
RAC: 0
Message 32192 - Posted: 9 Oct 2009, 12:01:13 UTC - in response to Message 32188.  
Last modified: 9 Oct 2009, 12:03:48 UTC

Ah the finishing times drift back together too quickly because your 5870 is so fast. I understand your concern but it's a problem I wouldn't mind having. :)

You could experiment with running 3 tasks concurrently. I've never used that myself because it was originally unstable on my HD 3850, but some others use it and the 5870 should be able to handle it. If that doesn't work then just go back to 1 at a time and cop it sweet.

I don't know what the recommended maximum number of concurrent tasks is for this application, but possibly the more you can run stably without any slowdown in processing, the smoother the GPU load will be. According to the readme file there is no increase in efficiency from running more than 2 tasks concurrently.

On a positive note, you're fortunate the server is unable to cope with a run of "shorties", because you would be completing those every 4-5 seconds.

I don't think you can take advantage of the HD 58xx series' ability to synchronise requests from different threads by running Collatz and MilkyWay tasks concurrently without slowdown. Collatz would probably require all the memory bandwidth available, and BOINC may not support that configuration yet anyway. Although I'm interested in it, that kind of stuff is beyond my limited understanding of how BOINC ATI applications work.

uBronan
Joined: 9 Feb 09
Posts: 166
Credit: 27,520,813
RAC: 0
Message 32194 - Posted: 9 Oct 2009, 12:17:46 UTC
Last modified: 9 Oct 2009, 12:49:49 UTC

Euhm, doesn't it help to load 3 on such fast cards?
I use 3 on my 4850 cards and I hardly notice any drops in GPU usage.
Of course, getting rid of them completely seems to be impossible, but when I run 2 they do indeed sync because of the upload/download cycle.
When I use 3, the start times slowly drift apart on my machines.
I did not check whether it costs any performance, but with those superfast cards I think it will only gain.
I even set 6 on a machine with 2 x 4890, and it seems to gain performance now.

PS. Again, the per-CPU limit on units does hamper the use of multiple units. The 4890 and up finish them so fast that the card simply ends up waiting for new units, and then of course the cycle starts again with all of them together.
So you gain a bit by running more than 2, but in the end the shortage of units before it settles will set it back.
An 8-core machine or a 4-core with HT will have less of a problem, but we don't have those, so we can't test.
Its new, its relative fast... my new bicycle

therealjcool
Joined: 5 Oct 09
Posts: 22
Credit: 22,661,352
RAC: 0
Message 32196 - Posted: 9 Oct 2009, 14:36:38 UTC
Last modified: 9 Oct 2009, 14:48:31 UTC

Ok, I did what you guys suggested and am now running 3 WUs simultaneously with 0.20b.

I'm getting a steady 90% load without any drops, so that is o-k. However, I don't see more than 90% GPU usage, which is a general problem with 0.20b for me. No matter how many WUs I crunch concurrently or what p/w parameters I use, it won't go over 90%. The only way is to pause the CPU applications, which is obviously not an option.

I don't see that problem with 0.20 so I'm just gonna revert back for now.

Edit:

uBronan wrote:
"An 8-core machine or a 4-core with HT will have less of a problem, but we don't have those, so we can't test."


Well the machine I'm running the 5870 on is a Gulftown Hexacore with HT, so it has 12 Threads - still I only get 48 WUs for that machine, or roughly 16 minutes of GPU work buffer. So much for the quota system working properly, lol.

I guess I could put it in the Dual Gainestown rig (16 threads) instead if that helps with the crappy quota system...

Does it _ever_ assign more than 48 WUs per system? Doesn't seem to.
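
For reference, the buffer estimate works out as follows. The sketch assumes one WU completes every 21 seconds on the HD5870, as reported in the first post, and that this rate stays the same regardless of how many WUs run concurrently. The function name is illustrative only.

```python
def buffer_minutes(task_count: int, seconds_per_task: float) -> float:
    """How long a cache of tasks keeps the GPU busy, in minutes."""
    return task_count * seconds_per_task / 60.0


cache_size = 48     # WUs handed out to this machine
task_seconds = 21   # assumed completion interval per WU on the HD5870

print(f"{buffer_minutes(cache_size, task_seconds):.1f} minutes of GPU work")
```

That gives just under 17 minutes, in line with the rough figure above.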

therealjcool
Joined: 5 Oct 09
Posts: 22
Credit: 22,661,352
RAC: 0
Message 32197 - Posted: 9 Oct 2009, 14:52:55 UTC

Sorry for double posting, but I think I got it as good as it gets on the 5870 now... using 0.20 ATI Win64 (NOT 0.20b) with the app_info set to 0.33 GPUs (no other parameters changed)

[Image: graph of GPU load over time]

Now on to optimizing the other rigs, lol ^^

kashi
Joined: 30 Dec 07
Posts: 311
Credit: 149,490,184
RAC: 0
Message 32199 - Posted: 9 Oct 2009, 16:54:44 UTC
Last modified: 9 Oct 2009, 17:00:32 UTC

That looks good, not much drop in GPU load there. Your experience and that of Team_Elteor_Borislavj~Webbie has inspired me to experiment with running 3 tasks concurrently myself tomorrow.

I am happy with how 0.20 runs so I haven't tried 0.20b and probably won't do so until Gipsel gives the all clear.

I think it is normal that the server uses a maximum of 8 CPUs for quotas and other limits. This is to avoid someone claiming to have a 200 core computer so that they could download 1200 tasks.
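
The numbers in this thread are consistent with a simple capped rule, sketched below. The figure of 6 tasks per CPU is inferred from 48 tasks for 8 counted CPUs and from the 1200 tasks for 200 cores example. It is not taken from the server code, and the names used here are hypothetical.

```python
from typing import Optional

TASKS_PER_CPU = 6      # inferred: 48 / 8, and also 1200 / 200
MAX_COUNTED_CPUS = 8   # cap on the number of CPUs the server counts


def task_limit(reported_cpus: int, cpu_cap: Optional[int] = MAX_COUNTED_CPUS) -> int:
    """Maximum number of tasks in progress for a host under the assumed rule."""
    counted = reported_cpus if cpu_cap is None else min(reported_cpus, cpu_cap)
    return TASKS_PER_CPU * counted


print(task_limit(4))                   # quad core
print(task_limit(12))                  # hexacore with HT, capped at 8 CPUs
print(task_limit(200, cpu_cap=None))   # what a claimed 200-core host would get without the cap
```

Under this rule any host reporting 8 or more CPUs ends up with the same 48 tasks, whatever GPU it has.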

I am keen to get a HD 5850 or HD 5870, but currently they are in short supply in my country and priced 30-40% above the suggested retail price they sell for in the US. I don't know if this will change any time soon, as I suspect they wish to sell HD 57XX models instead.

ExtraTerrestrial Apes
Joined: 1 Sep 08
Posts: 204
Credit: 219,354,537
RAC: 0
Message 32207 - Posted: 9 Oct 2009, 20:46:06 UTC

On XP 32-bit I had MW running along with 4 CPU tasks on my quad. Then I switched to Win 7 64 and chose ABC as the CPU project (advantage due to 64 bit, low power consumption and formerly nice credits). My MW times increased by ~30% until I set BOINC to use 3 of 4 cores.
ABC runs at notoriously high priority, so I suppose the reduced GPU usage you're seeing under 0.20b is a matter of CPU priority. I'm still running 0.20b with 3 of 4 cores loaded and I'm still seeing 98-99% GPU utilization. 0.20b has new options to change the priority and the waiting / polling behaviour, so you may need to set these more aggressively. Or just stick to 0.20(a) for the time being.

BTW: MW acknowledges a maximum of 8 cores, so it hands out not more than 48 WUs at a time. The issues arising from this and possible solutions are currently being discussed in thread "Problem with tiny cache in MW" .. though I guess it's a tough read.

MrS
Scanning for our furry friends since Jan 2002

therealjcool
Joined: 5 Oct 09
Posts: 22
Credit: 22,661,352
RAC: 0
Message 32209 - Posted: 9 Oct 2009, 21:50:51 UTC - in response to Message 32207.  



ExtraTerrestrial Apes wrote:
"BTW: MW acknowledges a maximum of 8 cores, so it hands out not more than 48 WUs at a time. The issues arising from this and possible solutions are currently being discussed in thread "Problem with tiny cache in MW" .. though I guess it's a tough read.

MrS"


Wow, then they are really far behind. My Quad 8347HE Opteron is from 2007 (> 2 year old HW) and it has 16 cores (4 x 4)...


ExtraTerrestrial Apes
Joined: 1 Sep 08
Posts: 204
Credit: 219,354,537
RAC: 0
Message 32211 - Posted: 9 Oct 2009, 23:36:23 UTC - in response to Message 32209.  

I don't think this limit is meant to represent the maximum number of cores expected in a client machine. It's just some artificially set limit.

MrS
Scanning for our furry friends since Jan 2002

TomaszPawel
Joined: 9 Nov 08
Posts: 41
Credit: 92,786,635
RAC: 0
Message 32421 - Posted: 16 Oct 2009, 8:38:59 UTC - in response to Message 32211.  
Last modified: 16 Oct 2009, 8:40:23 UTC

Why does MW send such small tasks?

It would be better for the server and for network traffic to send tasks that are, e.g., 4 times bigger and generate 4 times more credit... (E.g. on an HD5870, a task now takes 21 s for 53.45 credits and could instead take 84 s for 213.8 credits...)
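
The effect of such a change can be sketched as below. It assumes runtime and credit both scale linearly with task size and that the GPU is busy all the time, neither of which is confirmed by the project. The function names are illustrative.

```python
def scale_task(runtime_s: float, credit: float, factor: float):
    """Runtime and credit of a task made `factor` times bigger (linear scaling assumed)."""
    return runtime_s * factor, credit * factor


def tasks_per_hour(runtime_s: float) -> float:
    """Tasks one continuously busy GPU completes, and so requests, per hour."""
    return 3600.0 / runtime_s


runtime_now, credit_now = 21.0, 53.45   # HD5870 figures quoted above
runtime_big, credit_big = scale_task(runtime_now, credit_now, 4)

print(f"bigger task: {runtime_big:.0f} s for {credit_big:.1f} credits")
print(f"tasks per hour: {tasks_per_hour(runtime_now):.1f} now, "
      f"{tasks_per_hour(runtime_big):.1f} with 4x tasks")
print(f"credit per hour: {credit_now * tasks_per_hour(runtime_now):.0f} now, "
      f"{credit_big * tasks_per_hour(runtime_big):.0f} with 4x tasks")
```

Credit per hour stays the same, while the number of tasks the server has to hand out and take back per GPU drops to a quarter.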
A proud member of the Polish National Team

COME VISIT US at Polish National Team FORUM


banditwolf
Joined: 12 Nov 07
Posts: 2425
Credit: 524,164
RAC: 0
Message 32439 - Posted: 16 Oct 2009, 19:13:13 UTC - in response to Message 32421.  

TomaszPawel wrote:
"Why does MW send such small tasks?

It would be better for the server and for network traffic to send tasks that are, e.g., 4 times bigger and generate 4 times more credit... (E.g. on an HD5870, a task now takes 21 s for 53.45 credits and could instead take 84 s for 213.8 credits...)"


This has been brought up many times. The GPUs were supposed to get their own site, with the WUs increased 100+ times so that each task would keep a GPU busy for an hour and do more complex calculations. That has yet to happen, and it seems it probably won't. The WUs have been increased hundreds of times since the beginning to keep pace with the optimised apps speeding the process up.
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

©2024 Astroinformatics Group