Welcome to MilkyWay@home

GPU computation errors on one host but none on the other



Message boards : Number crunching : GPU computation errors on one host but none on the other

capeITLabs

Joined: 19 Nov 12
Posts: 3
Credit: 330,132,224
RAC: 1
Message 59497 - Posted: 1 Aug 2013, 17:36:49 UTC

Hi @all,

For the past couple of days, one of my hosts has been getting OpenCL computation errors. It runs Linux (Debian 7, 64-bit) and uses an HD5970 GPU (crunching 3 WUs on each GPU). Another Linux host (same OS version) with two HD6950s (crunching 4 WUs on each GPU) doesn't show any computation errors.

good host: http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=484075
bad host: http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=494095

The "bad" host shows CL_OUT_OF_RESOURCES and CL_MAP_FAILURE errors.

What's happening here?

best regards,
Rene
ID: 59497
Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 59649 - Posted: 23 Aug 2013, 15:21:37 UTC - in response to Message 59497.  

The "bad" host shows CL_OUT_OF_RESOURCES and CL_MAP_FAILURE errors.


That's the clue: basically, BOINC ran out of resources on the hardware and gave up. 5XXX cards are very different beasts from 6XXX cards in their technical architecture and abilities. No matter which capability variant is used, each variant has its own finite capacity. Crudely speaking, and as a generalisation, 6XXX cards are faster and better than 5XXX cards and have greater capacity and flexibility. That is why the 5XXX bombed out first.

As a generalisation, don't run more than two WUs of one type on a GPU. Sometimes, rarely, three will run successfully. All that happens when more than two or three run concurrently is that BOINC starts to run out of resources to cope. In any case, once full capacity has been reached, usually with two WUs running concurrently, it time-shares between the WUs, so little or no time is saved. Any minuscule time saved by running more than two (sometimes three) is far outweighed by the time lost while the machine is down after a crash, until you get it going again.
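
To put rough numbers on the time-sharing point, here is a toy model in Python. It is only a sketch under stated assumptions, not a measurement: it assumes the GPU is fully used once `saturation` WUs run side by side (2 here, following the rule of thumb above) and that anything beyond that is shared evenly. The function name and the 60-second solo runtime are made up for illustration.

```python
def throughput(n_tasks, solo_seconds, saturation=2):
    """WUs finished per hour under a toy time-sharing model.

    Assumption: up to `saturation` WUs run side by side at full speed.
    Beyond that the GPU is shared evenly, so every WU slows down by
    a factor of n_tasks / saturation.
    """
    slowdown = max(1.0, n_tasks / saturation)
    seconds_per_task = solo_seconds * slowdown
    return n_tasks * 3600.0 / seconds_per_task


# Hypothetical WU that takes 60 s when it has the GPU to itself.
for n in (1, 2, 3, 4):
    print(n, "WUs at once:", throughput(n, solo_seconds=60.0), "WUs/hour")
```

In this model the hourly output stops growing once the saturation point is reached, so a third or fourth concurrent WU adds memory pressure and crash risk without adding throughput. Real cards will not behave this cleanly.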

A good rule of thumb for BOINC is a maximum of two WUs per GPU when the WUs are short, running for seconds or a minute or so. Don't bother with more than that; in fact, you will eventually lose time by running too many at once.
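
One way to apply the two-per-GPU rule, assuming your BOINC client is new enough to read an `app_config.xml` file from the project directory, is to set `gpu_usage` to 0.5 so that each WU claims half a GPU. The Python sketch below just generates such a file. The helper name `build_app_config`, the app name `milkyway`, and the `cpu_usage` value of 0.05 are assumptions for illustration; check the application name your client actually reports before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, tasks_per_gpu, cpu_usage=0.05):
    """Return app_config.xml text that limits concurrent WUs per GPU.

    gpu_usage is the fraction of one GPU each WU claims, so
    1 / tasks_per_gpu lets that many WUs share a single GPU.
    app_name and cpu_usage are illustrative assumptions.
    """
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = f"{1.0 / tasks_per_gpu:.3g}"
    ET.SubElement(gpu_versions, "cpu_usage").text = f"{cpu_usage:.3g}"
    return ET.tostring(root, encoding="unicode")


# Two WUs per GPU, per the rule of thumb above.
print(build_app_config("milkyway", tasks_per_gpu=2))
```

Save the output as `app_config.xml` in the project's folder inside the BOINC data directory, then have the client re-read its config files or restart it.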

ID: 59649


©2019 Astroinformatics Group