Welcome to MilkyWay@home

Some GPU tasks failing

jjv
Joined: 17 May 10
Posts: 9
Credit: 721,598,063
RAC: 9,700
Message 71718 - Posted: 10 Feb 2022, 10:49:16 UTC

I've noticed a fair number of errors among my GPU tasks. Some work, others don't.
The logs seem to indicate a problem with device detection:
Success:
https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=106001665
Fail:
https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=106001938

This is a dual-GPU machine, but I can't figure out from the log which GPU the WU was using or trying to use.

JJ
mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,522,075
RAC: 27,222
Message 71719 - Posted: 10 Feb 2022, 12:11:21 UTC - in response to Message 71718.  

I looked at one of your invalid tasks and found this:
Found 2 CL devices
Device 'NVIDIA GeForce RTX 3090' (NVIDIA Corporation:0x10de) (CL_DEVICE_TYPE_GPU)

I looked at another of your invalid tasks and found this:
Found 2 CL devices
Device 'GeForce GTX 1080 Ti' (NVIDIA Corporation:0x10de) (CL_DEVICE_TYPE_GPU)

I also noticed that your 1080Ti is using driver version: 460.97
while the 3090 is using driver version: 511.23

Are you perhaps trying to run more than one task at a time on the different GPUs?
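
For anyone comparing task logs by hand, here is a minimal Python sketch that pulls the device lines out of a task's stderr text. It assumes the log lines look exactly like the two excerpts quoted above; the function name `summarize_devices` and the regular expressions are my own illustration, not part of the MilkyWay@home application. Note that these lines only show which devices were listed in the excerpt, not necessarily which one the task finally ran on.

```python
import re

# Assumed line shapes, taken from the excerpts quoted in this thread:
#   Found 2 CL devices
#   Device 'NVIDIA GeForce RTX 3090' (NVIDIA Corporation:0x10de) (CL_DEVICE_TYPE_GPU)
COUNT_RE = re.compile(r"Found (\d+) CL devices?")
DEVICE_RE = re.compile(
    r"Device '([^']+)' \(([^:()]+):(0x[0-9a-fA-F]+)\) \((CL_DEVICE_TYPE_\w+)\)"
)


def summarize_devices(log_text):
    """Return the reported device count and every device line found in log_text."""
    count_match = COUNT_RE.search(log_text)
    found = int(count_match.group(1)) if count_match else None
    devices = [
        {
            "name": m.group(1),
            "vendor": m.group(2),
            "vendor_id": m.group(3),
            "type": m.group(4),
        }
        for m in DEVICE_RE.finditer(log_text)
    ]
    return {"found": found, "devices": devices}


if __name__ == "__main__":
    sample = (
        "Found 2 CL devices\n"
        "Device 'NVIDIA GeForce RTX 3090' (NVIDIA Corporation:0x10de) "
        "(CL_DEVICE_TYPE_GPU)\n"
    )
    print(summarize_devices(sample))
```

Running this over the stderr text of a passing and a failing task makes it easier to see whether the two logs mention different cards.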
jjv
Joined: 17 May 10
Posts: 9
Credit: 721,598,063
RAC: 9,700
Message 71721 - Posted: 10 Feb 2022, 13:33:11 UTC - in response to Message 71719.  

Well, yes. It's kind of hard to fully utilize a 3090, not to mention two of them, without running multiple WUs.
The error might actually come from the WU not fitting in GPU memory. That would be silly since these cards have 24GB each, but there seem to be limitations on GPU memory utilization with BOINC.
Where on earth did that 1080Ti reference come from? Those haven't been in use for over half a year.

JJ
Wailing Angus Beef
Joined: 24 Dec 07
Posts: 33
Credit: 1,919,002,265
RAC: 14,431
Message 71722 - Posted: 10 Feb 2022, 14:03:30 UTC
Last modified: 10 Feb 2022, 14:10:45 UTC

I've got a couple of clients with failing WUs. The failing WUs all have names that begin with de_modfit_83_bundle5_3s_south_pt2_1643910122f

EDIT:
Admin posted this in the news...
https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4834#71697
mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,522,075
RAC: 27,222
Message 71740 - Posted: 11 Feb 2022, 12:12:37 UTC - in response to Message 71721.  

You may need to flush all the drivers from your system using the DDU utility, which you can download from the net, and then reinstall so that you only have one driver. Also, unless you game, I would stay away from the latest and greatest drivers, especially at projects like MilkyWay that are pretty slow to upgrade their own software. Let someone else try a new driver first, then update once you know it works. New software can mean new ways of doing things, and the BOINC projects need to test new drivers to see whether they work with their current tasks.
jjv
Joined: 17 May 10
Posts: 9
Credit: 721,598,063
RAC: 9,700
Message 71741 - Posted: 11 Feb 2022, 14:33:34 UTC - in response to Message 71740.  

I don't usually bother with DDU or its ilk. Several decades of experience say they are of limited usefulness.
I did, however, recently do a thorough driver cleanup due to an unrelated problem.
I also went through the BOINC logs and can find no mention of older GPUs or drivers in the recent ones.
I think you accidentally looked at a year-old invalid task. I don't do a lot of invalids ;-)

I'm fully aware of the risks involved in using bleeding-edge software (or hardware, for that matter). Recent driver updates have, however, included certain fixes for issues I'm personally experiencing. Also, since I'm a very technical person by profession and hobby, I consider it my role to be a tester in these things.

I dialed down the number of WUs allowed to run simultaneously per GPU. I will keep monitoring the situation, but no similar failures have happened since.

JJ
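
For reference, the number of tasks BOINC runs per GPU is normally controlled through an app_config.xml file in the project's directory, where `gpu_usage` is the fraction of a GPU each task claims (0.5 means two tasks per GPU, 1 means one). Below is a small Python sketch that generates such a file. The app name `milkyway`, the helper name `build_app_config`, and the default `cpu_usage` of 1.0 are assumptions for illustration only; check the application name your client actually reports before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, tasks_per_gpu, cpu_usage=1.0):
    """Build app_config.xml text that lets tasks_per_gpu tasks share one GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    # Each task claims 1/N of a GPU, so N tasks fit on one card.
    ET.SubElement(gpu_versions, "gpu_usage").text = f"{1.0 / tasks_per_gpu:.3g}"
    # CPU fraction reserved per GPU task (assumed default, adjust to taste).
    ET.SubElement(gpu_versions, "cpu_usage").text = f"{cpu_usage:.3g}"
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "milkyway" is an assumed app name; two tasks per GPU gives gpu_usage 0.5.
    print(build_app_config("milkyway", tasks_per_gpu=2))
```

Dropping `tasks_per_gpu` to 1 gives `gpu_usage` of 1, which is the "dialed down" setting described above. The client picks up the change after re-reading its config files or after a restart.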