Welcome to MilkyWay@home

New Linux system trashes all tasks


Message boards : Number crunching : New Linux system trashes all tasks

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67481 - Posted: 18 May 2018, 0:34:49 UTC

Just put a new system online but MilkyWay tasks are all erroring out. Event log says each task has unrecoverable error. Task runs for 1 second and then moves to the next.

I have all the required CUDA and OpenCL drivers installed. Event log shows no problems with the drivers.

This is the stderr.txt output.

<core_client_version>7.4.44</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
<search_application> milkyway_separation 1.46 Linux x86_64 double OpenCL </search_application>
Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation'
Setting process priority to 0 (13): Permission denied
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 4 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>
stream sigma 0.0 is invalid
Failed to get stream constants
16:17:09 (13088): called boinc_finish(1)

</stderr_txt>
]]>

Both SETI and Einstein are running fine on this new system.

BeemerBiker
Joined: 18 Nov 08
Posts: 116
Credit: 843,538,433
RAC: 882,812
Message 67484 - Posted: 18 May 2018, 3:33:36 UTC - in response to Message 67481.  
Last modified: 18 May 2018, 3:34:46 UTC

Your line "stream sigma 0.0 is invalid" seems to be the first real error. Comparing your system with mine, the stderr is the same, but I show "Using SSE4.1 path" where you have that invalid warning.

Googling that error message did not get a single hit, sorry; I am not sure what the problem is. I assume it is related to OpenCL, as the next line after "Using SSE4.1 path" is the platform that OpenCL finds.
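
For anyone comparing a failing stderr against a healthy one, here is a rough Python sketch of that "find the first real error" step. It assumes the lines that also show up on a working system (the preferences, priority and Lua messages) can be skipped, and the keyword list is my own guess, not anything official from the project.

```python
import re

# Lines that also appear in stderr from working hosts in this thread,
# so they are skipped. This list is an assumption; adjust for your logs.
BENIGN = (
    "Reading preferences ended prematurely",
    "Setting process priority",
    "Error loading Lua script",
)

# Words that usually mark a real failure (a guess, not an official list).
SUSPICIOUS = re.compile(r"invalid|failed|failure|error", re.IGNORECASE)


def first_real_error(stderr_text):
    """Return the first line that looks like a failure, or None."""
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line or line.startswith(BENIGN):
            continue
        if SUSPICIOUS.search(line):
            return line
    return None


sample = """\
Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation'
Setting process priority to 0 (13): Permission denied
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 4 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>
stream sigma 0.0 is invalid
Failed to get stream constants
"""
print(first_real_error(sample))
```

Run against the stderr from the first post, this picks out the "stream sigma" line rather than the earlier messages that appear on healthy systems too.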


Your kernel 4.15.0-20 is newer than mine. I assume you have 18.x while my build was 17.04

However, you are running a really old build of BOINC. My apt-get got me 7.8.3, while you show 7.4.44, which is really old. I assume that is not a problem, since Einstein and SETI are working.

The official boinc download site shows an even older 7.4.22 but they make it clear that one should use the package manager to get the newest version.

May not help, but you might consider getting 7.8.3.

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67490 - Posted: 18 May 2018, 20:14:17 UTC - in response to Message 67484.  

Thanks for the reply. I guess I will have to delete the project from that computer or just put it to indefinite suspend till it gets sorted out by the project.

I am running a specially compiled version of BOINC made for SETI users. It does not have the current 1000 task restriction on the number of tasks allowed like any BOINC version > 7.02.

I will not update it to anything later as that would defeat my SETI usage which is my primary project.

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67491 - Posted: 18 May 2018, 20:17:12 UTC

Googling that error message did not get a single hit, sorry; I am not sure what the problem is. I assume it is related to OpenCL, as the next line after "Using SSE4.1 path" is the platform that OpenCL finds.


Strange that it does not find the SSE4.1 path. In fact, the SETI SSE4.1 CPU app is the preferred app on Ryzen and Linux. I have been running it for a year on all my Linux machines since it is much faster than AVX.

mikey
Joined: 8 May 09
Posts: 2228
Credit: 255,790,796
RAC: 150,093
Message 67494 - Posted: 19 May 2018, 11:04:49 UTC - in response to Message 67490.  

Thanks for the reply. I guess I will have to delete the project from that computer or just put it to indefinite suspend till it gets sorted out by the project.

I am running a specially compiled version of BOINC made for SETI users. It does not have the current 1000 task restriction on the number of tasks allowed like any BOINC version > 7.02.

I will not update it to anything later as that would defeat my SETI usage which is my primary project.


Load a separate instance of Boinc just for MW then.

BeemerBiker
Joined: 18 Nov 08
Posts: 116
Credit: 843,538,433
RAC: 882,812
Message 67497 - Posted: 19 May 2018, 15:58:05 UTC - in response to Message 67481.  

This may not apply to you, but I just discovered that one of my Linux systems "lost" OpenCL. This was a minimum server install and I mistakenly allowed it to apply updates and upgrades on its own when first set up several months ago.

It seems to have rebooted just recently, possibly power spike, but I noticed yesterday that all asteroids, collatz and Einstein were erroring out. Collatz had almost 300 failed units.

Unaccountably, I could not reinstall my original cuda 384 and had to download and install the 390.59 which also failed but at least the error messages showed up on google and I was able to install and get opencl working again.

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67498 - Posted: 19 May 2018, 16:54:59 UTC - in response to Message 67497.  

Einstein and Seti OpenCL tasks continue to run with no issues so the OpenCL drivers are still working for those applications. BOINC still shows the OpenCL driver in the Event Log startup.

@Mikey I've never tried to run another BOINC instance. I think you have to make a completely different data structure. All the posts I have read have been Windows-centric.
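
For reference, a rough sketch of what a second client on Linux might look like, written as a small Python helper that only builds the command line and does not run anything. The directory and port are made-up examples, and the flags (`--allow_multiple_clients`, `--dir`, `--gui_rpc_port`) are the client options as I understand them; please check `boinc --help` on your own build before relying on this.

```python
import shlex


def second_client_command(data_dir, rpc_port):
    """Build (but do not run) a command line for an extra BOINC client.

    data_dir and rpc_port are hypothetical values chosen by the user;
    the second client needs its own data directory and its own RPC port.
    """
    return [
        "boinc",
        "--allow_multiple_clients",
        "--dir", str(data_dir),
        "--gui_rpc_port", str(rpc_port),
    ]


# Example values only: a separate folder and a port other than the default.
cmd = second_client_command("/home/me/boinc_milkyway", 31418)
print(shlex.join(cmd))
```

The second instance would then be managed by pointing the manager or `boinccmd --host localhost:31418` at the alternate port, again assuming that port is the one chosen above.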

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67505 - Posted: 19 May 2018, 20:06:55 UTC

Found another errored task and this one has a lot more information in the stderr.txt output. It looks like the application had a problem compiling the OpenCL wisdom file. I have not had any issues with either Seti or Einstein compiling their OpenCL applications wisdom files.

Has anyone else had issues with Linux compiling the OpenCL wisdom files before?

Stderr output
<core_client_version>7.4.44</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
<search_application> milkyway_separation 1.46 Linux x86_64 double OpenCL </search_application>
Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation'
Setting process priority to 0 (13): Permission denied
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 5 </number_WUs>
<number_params_per_WU> 20 </number_params_per_WU>
Using AVX path
Found 1 platform
Platform 0 information:
Name: NVIDIA CUDA
Version: OpenCL 1.2 CUDA 9.2.101
Vendor: NVIDIA Corporation
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
Profile: FULL_PROFILE
Using device 1 on platform 0
Found 3 CL devices
Device 'GeForce GTX 1070' (NVIDIA Corporation:0x10de) (CL_DEVICE_TYPE_GPU)
Board:
Driver version: 396.24
Version: OpenCL 1.2 CUDA
Compute capability: 6.1
Max compute units: 15
Clock frequency: 1683 Mhz
Global mem size: 8513978368
Local mem size: 49152
Max const buf size: 65536
Double extension: cl_khr_fp64
Build log:
--------------------------------------------------------------------------------
<kernel>:183:72: warning: unknown attribute 'max_constant_size' ignored
__constant real* _ap_consts __attribute__((max_constant_size(18 * sizeof(real)))),
^
<kernel>:185:62: warning: unknown attribute 'max_constant_size' ignored
__constant SC* sc __attribute__((max_constant_size(NSTREAM * sizeof(SC)))),
^
<kernel>:186:67: warning: unknown attribute 'max_constant_size' ignored
__constant real* sg_dx __attribute__((max_constant_size(256 * sizeof(real)))),
^
<kernel>:235:26: error: use of undeclared identifier 'inf'
tmp = mad((real) Q_INV_SQR, z * z, tmp); /* (q_invsqr * z^2) + (x^2 + y^2) */
^
<built-in>:35:19: note: expanded from here
#define Q_INV_SQR inf
^

--------------------------------------------------------------------------------
clBuildProgram: Build failure (-11): CL_BUILD_PROGRAM_FAILURE
Error building program from source (-11): CL_BUILD_PROGRAM_FAILURE
Error creating integral program from source
Failed to calculate likelihood
Background Epsilon (61.817300) must be >= 0, <= 1
18:13:51 (10595): called boinc_finish(1)

</stderr_txt>
]]>

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67506 - Posted: 19 May 2018, 20:34:46 UTC

Well, I just tried resetting the project for S&G's to see if that made any difference. Got another batch of tasks and set NNT to limit the damage if any.

No difference, all downloaded tasks immediately errored out till all were gone.

I did see one interesting line in the Event Log after each upload and report.

Sat 19 May 2018 01:21:07 PM PDT | Milkyway@Home | [sched_op] Reason: 3 consecutive failures fetching scheduler list.

It fetched the master file successfully after the reset. It had no issues reporting each failed task and getting an ack. Don't know if that error has anything to do with the execution failures.

Still getting the CL build failure in the new tasks. The stderr.txt all have the same failure as my last post.

alanb1951
Joined: 16 Mar 10
Posts: 36
Credit: 30,882,260
RAC: 10,897
Message 67510 - Posted: 20 May 2018, 0:29:45 UTC - in response to Message 67505.  

Keith,

I recognized that Q_INV_SQR error message! Rather than duplicating stuff posted back in early 2017, I'll refer you to a thread in the Linux forum, titled "Consistent "Validate error" status", in which I mentioned some research I'd done into why the client was apparently building bad GPU kernels:

http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4091

Also, another thread (in the Science board) called "Fix it or I'm gone":

http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4093

In precis, it looks as if the parameter reading can get out of sync in some versions of boinclib; folks seeing that error back then cleared it by moving to newer clients (usually 7.6 family...)

Now I know you're an enthusiastic Seti@Home person, so it might be you have a good reason for running 7.4.44 (which was reputed to have some singularities of its own!)- I'll just say that I've never had any problems with Seti, Einstein or MilkyWay using client 7.6.32 or .33 (using NVidia GPUs)

Don't know whether this will have helped any in your case, but at least it explains what causes the error message!

Cheers - Al.

P.S. I think it's actually trying to build a GPU kernel, not a wisdom file (but I'm not a MilkyWay developer so I could be wrong...)
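
To illustrate why a bad parameter read can surface as a compiler error, here is a minimal Python sketch. It assumes (this is not taken from the MilkyWay source) that the application formats constants into `-D` options for the OpenCL compiler, printf style, and that a misread parameter leaves q at zero. The names `build_defines` and `checked_defines` are hypothetical.

```python
import math


def build_defines(q):
    """Format a kernel constant as a -D compiler option, C printf style.

    In C, 1.0 / (0.0 * 0.0) quietly yields infinity; Python would raise
    ZeroDivisionError, so that case is emulated with math.inf here.
    """
    q_inv_sqr = math.inf if q == 0.0 else 1.0 / (q * q)
    return "-DQ_INV_SQR=%.15f" % q_inv_sqr


def checked_defines(q):
    """Same, but refuse values that cannot be written as a C literal."""
    if q == 0.0 or not math.isfinite(q):
        raise ValueError("q must be finite and non-zero, got %r" % q)
    return build_defines(q)


print(build_defines(0.5))  # a sane value gives a numeric literal
print(build_defines(0.0))  # a zeroed parameter gives the bare word 'inf'
```

The second call produces `-DQ_INV_SQR=inf`, and a kernel compiler treats the bare word `inf` as an identifier, which matches the "use of undeclared identifier 'inf'" line in the build log. A check like `checked_defines` would fail earlier with a clearer message, under the same assumptions.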

BeemerBiker
Joined: 18 Nov 08
Posts: 116
Credit: 843,538,433
RAC: 882,812
Message 67513 - Posted: 20 May 2018, 0:55:31 UTC - in response to Message 67510.  
Last modified: 20 May 2018, 0:56:29 UTC

You beat me to it. I just found that 4093 thread explaining that INF stands for infinity, along with the suggestion that old BOINC clients cause this problem. Poking around, I also found the source code that seems to cause that particular error here

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67517 - Posted: 20 May 2018, 14:57:03 UTC - in response to Message 67510.  

Thanks for the reply. Good to know where the problem lies. Now to determine how to fix it. It would be best to know just where in the BOINC code the problem occurs.

Do you know the bug# in the BOINC codebase by chance? If I could determine which module contains the bug, I could have the 7.4.44 developer pull the updated module into the build list so he could build a newer version of 7.4.44 which does not have the error.

I do not have the skills to do it myself. My attempt to build BOINC 7.8.3 ended in failure after many weeks, so I just resigned myself to using someone else's 7.4.44 build.

The snippet of code you provided I believe is from the MW application and not from BOINC so I can't use the code tracker function at BOINC github to determine the problem module.

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67518 - Posted: 20 May 2018, 15:07:47 UTC

Well now I have errored out 80 or so tasks on my Windows 7 machine using BOINC 7.8.3. So it appears not related to a too old version of BOINC.

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67519 - Posted: 20 May 2018, 21:53:23 UTC

Just trashed another 50 or so tasks on BOINC 7.8.3 under Windows7 64 bit. Not a Linux or too old BOINC version issue.

BeemerBiker
Joined: 18 Nov 08
Posts: 116
Credit: 843,538,433
RAC: 882,812
Message 67522 - Posted: 20 May 2018, 22:04:25 UTC - in response to Message 67519.  
Last modified: 20 May 2018, 22:19:34 UTC

Looks like it is picking the wrong device
Using device 2 on platform 0
Found 2 CL devices
Requested device is out of range of number found devices


Your win7 system has 3 gtx1070 but it only sees 2. When it picks devices 0 and 1 it gets (last time I looked at your computer) 127 valid tasks but when it picks device 2 (the 3rd one) all tasks fail. So far, 113 of them

Valid tasks for computer 257518

Next 20
State: All (495) · In progress (240) · Validation pending (0) · Validation inconclusive (15) · Valid (127) · Invalid (0) · Error (113) 
Application: All (495) · MilkyWay@Home (495) · MilkyWay@Home N-Body Simulation (0) 


Strange it sees only 2 opencl but tries the 3rd one anyway.
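
A small Python sketch of the check that seems to be failing here: the requested device index is compared against the number of devices the OpenCL platform actually reported. The function name and the device labels are hypothetical, and this is only an illustration of the bounds check, not the MilkyWay application's code.

```python
def pick_device(requested_index, devices):
    """Return the requested device, or raise if the index is out of range."""
    if not 0 <= requested_index < len(devices):
        raise IndexError(
            "Requested device %d but only %d found"
            % (requested_index, len(devices))
        )
    return devices[requested_index]


# Placeholder labels: the platform enumerated only 2 of the 3 cards.
found = ["device A", "device B"]
print(pick_device(1, found))
try:
    pick_device(2, found)  # BOINC assigned the third card
except IndexError as exc:
    print(exc)
```

With only two devices enumerated, indices 0 and 1 work and index 2 fails every time, which fits the pattern of valid results on two cards and errors on the third.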

What does your event message show? I have three gtx 1070ti and milkyway is using sse4.1, not avx on my i9-7900X


CUDA: NVIDIA GPU 0: GeForce GTX 1070 Ti (driver version 397.64, CUDA version 9.2, compute capability 6.1, 4096MB, 3558MB available, 8186 GFLOPS peak)	
CUDA: NVIDIA GPU 1: GeForce GTX 1070 Ti (driver version 397.64, CUDA version 9.2, compute capability 6.1, 4096MB, 3558MB available, 8186 GFLOPS peak)	
CUDA: NVIDIA GPU 2: GeForce GTX 1070 Ti (driver version 397.64, CUDA version 9.2, compute capability 6.1, 4096MB, 3558MB available, 8186 GFLOPS peak)	
OpenCL: NVIDIA GPU 0: GeForce GTX 1070 Ti (driver version 397.64, device version OpenCL 1.2 CUDA, 8192MB, 3558MB available, 8186 GFLOPS peak)	
OpenCL: NVIDIA GPU 1: GeForce GTX 1070 Ti (driver version 397.64, device version OpenCL 1.2 CUDA, 8192MB, 3558MB available, 8186 GFLOPS peak)	
OpenCL: NVIDIA GPU 2: GeForce GTX 1070 Ti (driver version 397.64, device version OpenCL 1.2 CUDA, 8192MB, 3558MB available, 8186 GFLOPS peak)	
Host name: JYSArea51	
Processor: 20 GenuineIntel Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz [Family 6 Model 85 Stepping 4]	
Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx tm2 dca pbe fsgsbase bmi1 hle	
OS: Microsoft Windows 10: Professional x64 Edition, (10.00.17134.00)	

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67523 - Posted: 20 May 2018, 23:52:56 UTC - in response to Message 67522.  

5/20/2018 16:45:22 | | log flags: file_xfer, sched_ops, task, sched_op_debug
5/20/2018 16:45:22 | | Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
5/20/2018 16:45:22 | | Data directory: C:\ProgramData\BOINC
5/20/2018 16:45:22 | | Running under account Keith
5/20/2018 16:45:23 | | CUDA: NVIDIA GPU 0: GeForce GTX 1080 (driver version 391.24, CUDA version 9.1, compute capability 6.1, 4096MB, 3038MB available, 9523 GFLOPS peak)
5/20/2018 16:45:23 | | CUDA: NVIDIA GPU 1: GeForce GTX 1070 (driver version 391.24, CUDA version 9.1, compute capability 6.1, 4096MB, 3042MB available, 6463 GFLOPS peak)
5/20/2018 16:45:23 | | CUDA: NVIDIA GPU 2: GeForce GTX 1070 (driver version 391.24, CUDA version 9.1, compute capability 6.1, 4096MB, 3042MB available, 6463 GFLOPS peak)
5/20/2018 16:45:23 | | OpenCL: NVIDIA GPU 0: GeForce GTX 1080 (driver version 391.24, device version OpenCL 1.2 CUDA, 8192MB, 3038MB available, 9523 GFLOPS peak)
5/20/2018 16:45:23 | | OpenCL: NVIDIA GPU 1: GeForce GTX 1070 (driver version 391.24, device version OpenCL 1.2 CUDA, 8192MB, 3042MB available, 6463 GFLOPS peak)
5/20/2018 16:45:23 | | OpenCL: NVIDIA GPU 2: GeForce GTX 1070 (driver version 391.24, device version OpenCL 1.2 CUDA, 8192MB, 3042MB available, 6463 GFLOPS peak)
5/20/2018 16:45:23 | SETI@home | Found app_info.xml; using anonymous platform
5/20/2018 16:45:23 | | Host name: Keith-Windows7
5/20/2018 16:45:23 | | Processor: 8 AuthenticAMD AMD FX-8370 Eight-Core Processor [Family 21 Model 2 Stepping 0]
5/20/2018 16:45:23 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 popcnt aes f16c syscall nx lm avx svm sse4a osvw ibs xop skinit wdt lwp fma4 tce tbm topx page1gb rdtscp bmi1
5/20/2018 16:45:23 | | OS: Microsoft Windows 7: Home Premium x64 Edition, Service Pack 1, (06.01.7601.00)
5/20/2018 16:45:23 | | Memory: 15.90 GB physical, 23.89 GB virtual
5/20/2018 16:45:23 | | Disk: 238.47 GB total, 128.73 GB free
5/20/2018 16:45:23 | | Local time is UTC -7 hours
5/20/2018 16:45:23 | Einstein@Home | Found app_config.xml
5/20/2018 16:45:23 | Milkyway@Home | Found app_config.xml
5/20/2018 16:45:23 | SETI@home | Found app_config.xml
5/20/2018 16:45:23 | | Config: GUI RPC allowed from any host
5/20/2018 16:45:23 | | Config: GUI RPCs allowed from:
5/20/2018 16:45:23 | | 192.168.2.192
5/20/2018 16:45:23 | | Config: event log limit disabled
5/20/2018 16:45:23 | | Config: use all coprocessors
5/20/2018 16:45:24 | Einstein@Home | URL http://einstein.phys.uwm.edu/; Computer ID 12444941; resource share 50
5/20/2018 16:45:24 | Milkyway@Home | URL http://milkyway.cs.rpi.edu/milkyway/; Computer ID 257518; resource share 50
5/20/2018 16:45:24 | SETI@home | URL http://setiathome.berkeley.edu/; Computer ID 5741129; resource share 900
5/20/2018 16:45:29 | SETI@home | General prefs: from SETI@home (last modified 07-Apr-2018 11:15:21)
5/20/2018 16:45:29 | SETI@home | Host location: none


I don't know why it was only seeing two cards when the Manager always showed all three in use. SIV also showed all 3 in use. If a card drops out I usually will see it disappear from SIV which is always up on the desktop. I also would see its utilization drop to nothing.

I just rebooted for S&G and will see if MW will run on all cards again. All the cards were in use all along for both SETI and Einstein. Only MW was having issues.

MW is just being obnoxious for me lately. I think it will be only running on my Win10 box in the short future as I will convert this daily driver tomorrow to Linux.

alanb1951
Joined: 16 Mar 10
Posts: 36
Credit: 30,882,260
RAC: 10,897
Message 67524 - Posted: 21 May 2018, 1:27:51 UTC - in response to Message 67517.  

Thanks for the reply. Good to know where the problem lies. Now to determine how to fix it. It would be best to know just where in the BOINC code the problem occurs.

Do you know the bug# in the BOINC codebase by chance? If I could determine which module contains the bug, I could have the 7.4.44 developer pull the updated module into the build list so he could build a newer version of 7.4.44 which does not have the error.

I do not have the skills to do it myself. My attempt to build BOINC 7.8.3 ended in failure after many weeks, so I just resigned myself to using someone else's 7.4.44 build.

The snippet of code you provided I believe is from the MW application and not from BOINC so I can't use the code tracker function at BOINC github to determine the problem module.


Keith,

Yes, the code snippet was indeed from the MW code!

As for where the error was in 7.4.44, I've no idea - I will, however, observe that some of the people reporting that problem here were on 7.2.something rather than 7.4.something, so it wasn't just happening for one version. As I had never been bitten by this problem myself, I didn't research it any further than working out what the crash was, and as people seemed to find that moving up to a 7.6 BOINC seemed to fix it...

(It acted as if it was a timing or file pick-up issue, with the application trying to access a data file that hadn't actually been put in place yet [but which should have been put in place by BOINC!] - if I had been going code-diving that's where I would've started, anyway...)

I'm not sure what version of Linux you're using (though I note it's a fairly current [Ryzen-friendly?] kernel.) I presume it's not based on Debian, because if it were you'd be able to get a viable BOINC from the repositories to see if it fixed the issue (though you would probably have to put up with it installing where it wants to and getting used to using sudo if you wanted to tune the configuration for your SETI usage!)

I don't know what the situation is for RedHat-based systems and I guess it's "not good" for some of the others - one often sees "I have to run this version of the client because it's all that's available for my distro" messages :-(

Sorry I can't be of more help; the BOINC code is above my C++ competence level, I suspect! I just hope you can find a pre-built client that resolves this particular issue. (I noticed you've a post in another thread about [different?] issues with 7.8.3 on Windows - perhaps your machines are just too powerful :-) ...)

Good luck finding a fix - Al

P.S. FWIW, I'm happily running a GTX 1050 Ti for MW, Einstein and SETI on XUbuntu 16.04 with client 7.6.31 on an I7-7700k; I get the occasional Invalid at Einstein but I don't think I've ever noticed an OpenCL build error that wasn't a product of a corrupt data file...

BeemerBiker
Joined: 18 Nov 08
Posts: 116
Credit: 843,538,433
RAC: 882,812
Message 67525 - Posted: 21 May 2018, 5:09:48 UTC - in response to Message 67523.  

You are using CUDA 9.1.84 which came out last year. However, your video driver is 391.24 which is fairly recent, March 18 it seems.

Something is not right. My windows 7 system shows 9.1.104 for CUDA and the driver that it came with is 388.71 which is older than your driver.

Did you install the CUDA toolkit? It comes with a default driver and is only up-to-date when a new toolkit is released. I suggest you download last week's release, WHQL 397.64, which should get you CUDA 9.2.x.

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67526 - Posted: 21 May 2018, 7:27:43 UTC - in response to Message 67524.  

Hi Al, yes I just converted my first Windows 7 machine to Ubuntu 18.04 LTS and the kernel that came with it is 4.15.0-20 and is supposed to have most of the Ryzen fixes baked in. I will convert my second Windows 7 machine this next outage Tuesday.

One of the benefits of the BOINC 7.4.44 that one of the BOINC Linux and Mac developers built is that it doesn't come from the repository and doesn't install BOINC in two different places with permissions that prevent changing any files or deciding where to put it.

You just drop it into a folder in your /Home directory and the user group assigned to it is Me. I can edit any file I want to. It also raises the task limit to 3000 onboard a host and that has come in handy for maintenance Tuesday outages where our special CUDA app can process a cache of 400 gpu tasks in an hour. Later BOINC versions have a 1000 task limit and would only last about 20 minutes till you were out of work. That is one of the reasons I don't want to move to a later version.

I could test the BOINC 7.8.3 Linux version that our developer made to see if the problem is solved. I would expect that is the case since that version is later than the 7.6 versions you say fixed the issue. But it has the 1000-task limit.

I am going to PM the BOINC developer of the 7.4.44 version just to give him a heads-up. I doubt that he will do anything about it. I asked him for a modified version of his 7.8.3 with the raised limit and he said I could compile that myself since the one line of code that sets the limit is easy to find. That is the compilation project I mentioned that was a failure for me. The interface of 7.4.44 is fine and the 7.8.3 interface doesn't really offer anything of substance, so I find 7.4.44 fine and easy to work with.

So it looks like I will drop MW from the Linux machines. I already did so on the new machine that exposed the problem. I have added GPUGrid as its substitute. Will do the same for the next machine.

Keith Myers
Joined: 24 Jan 11
Posts: 207
Credit: 107,333,127
RAC: 18,641
Message 67527 - Posted: 21 May 2018, 7:36:39 UTC - in response to Message 67525.  

You are using CUDA 9.1.84 which came out last year. However, your video driver is 391.24 which is fairly recent, March 18 it seems.

Something is not right. My windows 7 system shows 9.1.104 for CUDA and the driver that it came with is 388.71 which is older than your driver.

Did you install the CUDA toolkit? It comes with a default driver and is only up-to-date when a new toolkit is released. I suggest you download last week's release, WHQL 397.64, which should get you CUDA 9.2.x.


I don't game so there is never any reason to constantly update video drivers for compute. Plenty of people running older drivers with no issues as long as the driver supports the hardware.

No, not running any CUDA toolkit. Just the normal package downloaded from Nvidia.

I assume that the card that kept getting assigned a MW task had an issue with MW for some reason. It ran both SETI and Einstein with no issues.

Moot point. I have errored out all tasks and set NNT. I then just removed the project. Will do the same for the next computer conversion, only about 200 or so tasks left on that machine now, and will finish and report them after setting NNT. I will then remove the project and convert to Linux.

©2019 Astroinformatics Group