GPU Issues Mega Thread

doktorOblivion

Joined: 15 Jul 12
Posts: 5
Credit: 118,703,779
RAC: 4,965
Message 66092 - Posted: 7 Jan 2017, 13:49:52 UTC

I'm seeing a BSOD that I think is caused by the MilkyWay GPU software. I have several Win10 minidumps of the issue (I hope), or whatever Windows writes when it goes blue. I think it's GPU-related because I have GPU work set to run only when I am away from the computer and the screensaver is running, and that's when the crash always happens.

CPU = AMD FX-8370 Eight-Core Processor, 4013 MHz, 4 cores, 8 logical processors
Mem = 16GB
Graphics = GeForce GTX 1080
Driver = Nvidia 376.33
OS = Microsoft Windows 10 Pro, 10.0.14393 Build 14393

Boinc = 7.5.33 (x64), wxWidgets = 3.0.1

I'm not sure whether Microsoft informs you about issues related to your product. Please advise what you need next.
captainjack

Joined: 22 Jun 13
Posts: 44
Credit: 64,258,609
RAC: 0
Message 66093 - Posted: 7 Jan 2017, 14:21:38 UTC

wb8ili,

Glad you got it fixed.

Just in case you are still wondering, most of the client files are in /var/lib/boinc-client (projects folder and slots folder included). The procedure that starts/restarts the boinc client is in /etc/init.d and is called boinc-client.
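A minimal sketch for checking that layout on your own machine. It assumes the default data directory named above; the function name `boinc_layout` is only illustrative and is not part of BOINC.

```python
from pathlib import Path


def boinc_layout(data_dir="/var/lib/boinc-client"):
    """Report whether the projects and slots folders exist under data_dir.

    The default path is the one mentioned in this post; other
    distributions or manual installs may use a different location.
    """
    root = Path(data_dir)
    return {name: (root / name).is_dir() for name in ("projects", "slots")}


if __name__ == "__main__":
    for folder, present in boinc_layout().items():
        print(folder, "found" if present else "missing")
```

If either folder shows as missing, the client is probably using a different data directory on that system.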

Happy crunching.
doktorOblivion

Joined: 15 Jul 12
Posts: 5
Credit: 118,703,779
RAC: 4,965
Message 66094 - Posted: 7 Jan 2017, 17:20:42 UTC - in response to Message 66092.  

I think this is actually a BOINC issue. I stopped MilkyWay and loaded a different project that uses the GPU, and hit the same issue even faster. I will post on their forum.
Chris Rampson
Joined: 14 May 11
Posts: 7
Credit: 87,497,784
RAC: 0
Message 66139 - Posted: 25 Jan 2017, 0:53:59 UTC

Speaking about the "#define Q_INV_SQR inf" Milkyway error:

I have been running BOINC/SETI for over 20 years. I have projects in Einstein, SETI, GPUGrid, Rosetta and Milkyway. Within the Milkyway project, I have noticed 3 different types of jobs. ONLY ONE OF THOSE IS FAILING.

I do not subscribe to the idea that there is a configuration problem when all my other projects are screaming along (including alternate Milkyway jobs).

The most reasonable explanation is a TYPO in the project files. Someone fat-fingered a "t" into an "f", which is not too hard to do since they are next to each other on the keyboard.

What needs to happen is for a Milkyway developer to explain why this happens, and hopefully fix the code.
captainjack

Joined: 22 Jun 13
Posts: 44
Credit: 64,258,609
RAC: 0
Message 66141 - Posted: 25 Jan 2017, 15:12:04 UTC
Last modified: 25 Jan 2017, 15:48:26 UTC

Chris Rampson,

If you believe that there is a problem with the code, you might ask yourself this question: Why are other Milkyway users able to successfully complete those tasks using Linux and NVIDIA GPUs?

Edit:

Re: this thread http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4087&postid=66127
Sebastian*

Joined: 8 Apr 09
Posts: 70
Credit: 11,027,167,827
RAC: 0
Message 66172 - Posted: 10 Feb 2017, 18:26:52 UTC

Hello again, and a happy new year to everybody.

I still have issues when running several WUs in parallel on the Hawaii-based GPUs. One WU at a time still runs fine.

Could someone look at my invalid WUs with the Validate errors? I can't make anything out of the text.

https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=705276&offset=0&show_names=0&state=5&appid=

Maybe someone can help me figure out what is broken.

AMD drivers have improved: the WUs no longer hang, but there are still Validate errors.
blitzzkreeg
Joined: 15 Aug 10
Posts: 1
Credit: 59,828,702
RAC: 0
Message 66231 - Posted: 19 Mar 2017, 17:13:41 UTC

Good day,

Lately I've been having issues specifically with "MilkyWay@home 1.43 (opencl_ati_101)" units, which fail 100% of the time about 2 seconds after starting up.

The "MilkyWay@home 1.43 (opencl_nvidia_101)" application actually works great on a separate machine, which leads me to believe that this problem is specific to the AMD Radeon GPU (6800 series, an HD6870 in my case). The adapter has never been overclocked and is working on a machine crunching for 16 distinct BOINC projects without issues on any other project.

I'm running the latest AMD Catalyst Software Suite available for Windows 10 (non-beta drivers).

Driver Packaging ver. 15.201.1151.1008-151104a-296217E
Provider Advanced Micro Devices, Inc.
2D Driver Version 8.01.01.1500
Direct3D Version 9.14.10.01128
OpenGL Version 6.14.10.13399
Mantle Driver Version 9.1.10.0083
Mantle API Version Not Available
AMD Catalyst CCV 2015.1104.1643.30033

The breakdown of end results by application is as follows:

- MilkyWay@home 1.43 (opencl_ati_101) => Computational error after 2 secs.

- MilkyWay@home 1.43 (opencl_nvidia_101) => 100% successful completion.

- MilkyWay@Home N-Body Simulation 1.62(mt) => 100% successful completion.

- MilkyWay@Home 1.42 => 100% successful completion.


regards,

Peter
bluestang

Joined: 13 Oct 16
Posts: 112
Credit: 1,174,293,644
RAC: 0
Message 66232 - Posted: 19 Mar 2017, 22:06:34 UTC - in response to Message 66231.  

Did Win 10 automatically update your drivers? I think it screws things up if it did. You need to uninstall them, clean out the old drivers, and then reinstall from the AMD site.
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,759,412
RAC: 27,748
Message 66234 - Posted: 20 Mar 2017, 10:29:01 UTC - in response to Message 66231.  

The 2 or 3 second errors can also be caused by not installing this:
For Windows, the most recent Visual Studio 2012 C++ runtime
Tom*

Joined: 4 Oct 11
Posts: 38
Credit: 309,729,457
RAC: 0
Message 66236 - Posted: 20 Mar 2017, 21:49:32 UTC

According to this url
https://en.wikipedia.org/wiki/Radeon_HD_6000_Series

The HD6870 does not support double precision.
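If you want to check a card yourself, here is a minimal sketch. It assumes you copy the device's extension list by hand from the "Device Extensions" line of clinfo output; `supports_fp64` is a hypothetical helper name, and `cl_khr_fp64` / `cl_amd_fp64` are the OpenCL extension names that advertise double precision support.

```python
# OpenCL extensions that advertise double precision support on a device.
FP64_EXTENSIONS = {"cl_khr_fp64", "cl_amd_fp64"}


def supports_fp64(device_extensions):
    """Return True if the space-separated extension list names an fp64 extension."""
    return bool(FP64_EXTENSIONS & set(device_extensions.split()))


if __name__ == "__main__":
    # Paste the extension list reported for your own device here.
    example = "cl_khr_icd cl_khr_global_int32_base_atomics"
    print(supports_fp64(example))
```

A device that reports neither extension would not be expected to run an application that needs doubles.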
Šarūnas Burdulis
Joined: 27 Apr 15
Posts: 4
Credit: 427,409,763
RAC: 0
Message 66703 - Posted: 19 Oct 2017, 13:30:47 UTC

I have been successfully running GPU tasks with both AMD (amdgpu-pro) and Nvidia devices, using their provided OpenCL libraries.

Yesterday I upgraded one of the AMD workstations to use ROCm and its OpenCL (amdgpu-pro doesn't work on Ubuntu 17.10). The GPU is an RX 480 (Ellesmere/Polaris, or 'gfx803' in ROCm). Since then, MilkyWay@home GPU tasks have been failing. Below is what I see in the task log, followed by part of the clinfo output. Let me know if there is already a solution to this or if more info is needed.

<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
<search_application> milkyway_separation 1.46 Linux x86_64 double OpenCL </search_application>
BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
Setting process priority to 0 (13): Permission denied
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 5 </number_WUs>
<number_params_per_WU> 20 </number_params_per_WU>
Using AVX path
Error getting number of platform (-1001): CL_PLATFORM_NOT_FOUND_KHR
Failed to get information about device
Error getting device and context (1): MW_CL_ERROR
Failed to calculate likelihood
Using AVX path
Error getting number of platform (-1001): CL_PLATFORM_NOT_FOUND_KHR
Failed to get information about device
Error getting device and context (1): MW_CL_ERROR
Failed to calculate likelihood
Using AVX path
Error getting number of platform (-1001): CL_PLATFORM_NOT_FOUND_KHR
Failed to get information about device
Error getting device and context (1): MW_CL_ERROR
Failed to calculate likelihood
Using AVX path
Error getting number of platform (-1001): CL_PLATFORM_NOT_FOUND_KHR
Failed to get information about device
Error getting device and context (1): MW_CL_ERROR
Failed to calculate likelihood
Using AVX path
Error getting number of platform (-1001): CL_PLATFORM_NOT_FOUND_KHR
Failed to get information about device
Error getting device and context (1): MW_CL_ERROR
Failed to calculate likelihood
09:15:21 (13641): called boinc_finish(1)

</stderr_txt>
]]>

clinfo|head -20
Number of platforms 1
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.0 AMD-APP (2508.0)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback
Platform Extensions function suffix AMD

Platform Name AMD Accelerated Parallel Processing
Number of devices 1
Device Name gfx803
Device Vendor Advanced Micro Devices, Inc.
Device Vendor ID 0x1002
Device Version OpenCL 1.2
Driver Version 1.1 (HSA,LC)
Device OpenCL C Version OpenCL C 2.0
Device Type GPU
Device Profile FULL_PROFILE
Max compute units 36
Max clock frequency 1288MHz
...
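
The task log above repeats the same pair of errors once per work unit. As a rough sketch (not part of BOINC or the MilkyWay application; the function name and the regular expression are my own assumptions about the line format shown above), this collapses such a log into counts per error code:

```python
import re
from collections import Counter

# Matches lines ending in "(<code>): <SYMBOLIC_NAME>", for example
# "Error getting number of platform (-1001): CL_PLATFORM_NOT_FOUND_KHR".
# Lines such as "(13): Permission denied" are skipped because the text
# after the code is not an all-caps identifier.
ERROR_LINE = re.compile(r"\((-?\d+)\): ([A-Z][A-Z0-9_]+)\s*$", re.MULTILINE)


def summarize_errors(stderr_text):
    """Count each (numeric code, symbolic name) pair found in the log text."""
    return Counter(
        (int(code), name) for code, name in ERROR_LINE.findall(stderr_text)
    )


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < stderr.txt
    for (code, name), count in summarize_errors(sys.stdin.read()).items():
        print(f"{name} ({code}): {count} occurrence(s)")
```

This only summarizes what the log already says; it does not explain why the application fails to find a platform that clinfo can see.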
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,759,412
RAC: 27,748
Message 66710 - Posted: 20 Oct 2017, 1:31:45 UTC - in response to Message 65046.  
Last modified: 20 Oct 2017, 1:33:00 UTC

Sorry wrong thread!
©2024 Astroinformatics Group