What is the cause of these 'validate errors'
Author | Message |
---|---|
Joined: 22 Apr 11 Posts: 66 Credit: 908,037,278 RAC: 41,369 |
I get funny waveforms in MSI Afterburner using the 13.2 modded drivers or the latest 14.2 drivers. The CPU makes no difference. The only difference between the E3-1230 setup and the 2600K setup is the PCIe bus speed: PCIe 3.0 vs. 2.0. Notice it sometimes drops into low clock mode; a slight reduction of the clock speeds below default corrects it, or seems to. I've run the clocks as low as 580 MHz, and it seems the slower speeds create fewer errors. In any case, there is something weird with the modified fit WUs for sure. Both setups perform the same whether using a reference 7970 or the latest Sapphire R9 280X. 8-) |
Joined: 22 Apr 11 Posts: 66 Credit: 908,037,278 RAC: 41,369 |
I might add that some playing around with clock and memory speeds on the GPUs has allowed me to sort of tune the setups and greatly reduce errors. We will see how it goes, but so far so good. 8-) |
Joined: 24 Jan 11 Posts: 716 Credit: 559,144,001 RAC: 53,096 |
I just revisited this thread and see that only Tex provided you a link to errors we see on MilkyWay 1.36 tasks. Basically, the stderr.txt file output gets truncated. The exit status is always [0], but because the file doesn't contain any result information, the tasks get invalidated. http://milkyway.cs.rpi.edu/milkyway/results.php?userid=147145&offset=0&show_names=0&state=5&appid= This is the list from my two computers. Thought I should provide some input also so it isn't from just one user. Cheers, Keith |
Joined: 24 Jan 11 Posts: 716 Credit: 559,144,001 RAC: 53,096 |
I just revisited this thread and see that only Tex provided you a link to errors we see on MilkyWay 1.36 tasks. Basically, the stderr.txt file output gets truncated. The exit status is always [0], but because the file doesn't contain any result information, the tasks get invalidated. Here is a link to some invalids you can actually see. Sorry about that. http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=257518&offset=0&show_names=0&state=5&appid= Cheers, Keith |
Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0 |
Here are mine: http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=573186&offset=0&show_names=0&state=5&appid= 18 validate errors so far just today on one system. |
Joined: 11 Feb 11 Posts: 57 Credit: 69,475,644 RAC: 0 |
I have some invalid modified fit workunits as well. http://milkyway.cs.rpi.edu/milkyway/results.php?userid=150155&offset=0&show_names=0&state=5&appid= R9 280X (3 WUs at a time), no other projects running atm. |
Joined: 24 Jan 11 Posts: 716 Credit: 559,144,001 RAC: 53,096 |
I've attracted some interest in this problem over on the Seti Number Cruncher forum and have some of the BOINC and app developers looking into the issue. They seem to think they might have a handle on just what the problem might be. It is not a user equipment failure but a problem in the underlying BOINC platform code. Let's hope something fruitful comes of their investigations. Cheers, Keith |
Joined: 24 Jan 11 Posts: 716 Credit: 559,144,001 RAC: 53,096 |
"Here are mine" Hi Tom, what is interesting is that it looks like only your AMD FX-8350 system has the truncated stderr.txt results. Your Intel system is just producing results that don't validate against your wingmen. What is most interesting is that I too am running two AMD FX-8350 hosts and get the truncated stderr.txt invalids on them. I wonder if this is some commonality I hadn't noticed before. Cheers, Keith |
Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0 |
Hi Keith, I view the validate errors on my Haswell i7 a little differently. Yes, my AMD FX-8350 always truncates the whole stderr. But my Haswell (also running an HD7950) truncates the stderr after the Initial wait, always at the same place, although the FX-8350 had many more errors per day than the Haswell. Iteration area: 560000 Chunk estimate: 1 Num chunks: 2 Chunk size: 559104 Added area: 558208 Effective area: 1118208 Initial wait: 16 ms </stderr_txt> ]]> |
Joined: 24 Jan 11 Posts: 716 Credit: 559,144,001 RAC: 53,096 |
I wasn't aware of an invalid task that produced semi-truncated stderr.txt output. Everything I've seen so far is the extreme truncated output like this: <core_client_version>7.4.42</core_client_version> <![CDATA[ <stderr_txt> </stderr_txt> ]]> As I stated earlier in the thread, it seems I have finally started an investigation by the developers into these kinds of invalid results. The answer previously was always that it is just an isolated incident common to your hardware. Now the developers have acknowledged that it is a lot more common than previously thought and is a problem with the underlying BOINC code and not just with specific projects. There is also a newly recognized problem of BOINC failing to delete files larger than 4 GB in project slots. Let us hope that the BOINC developers can release a new code level that fixes these issues ...... and doesn't introduce brand new problems. Cheers, Keith |
Joined: 30 Apr 09 Posts: 101 Credit: 29,874,293 RAC: 0 |
I also got 1 validate error (so far): http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=845965973 The <stderr_txt> isn't complete. |
Joined: 24 Jan 11 Posts: 716 Credit: 559,144,001 RAC: 53,096 |
Just a quick follow-up to the truncated stderr.txt problem that I started as OP. It looks like we finally understand the nature of the problem, and an analysis report has been submitted to the boinc_dev mailing list. Now we just have to wait for a fix or workaround from the BOINC developers. I'd like to thank Richard Haselgrove for working with me and for submitting the boinc_dev report. You can follow the analysis and discussion of the "race condition" over at SETI@Home Panic Mode thread here. Cheers, Keith |
Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
After intensive work with Keith Myers and others (mainly in the SETI message board thread Stderr Truncations), I think I've finally traced and recorded the full life-cycle of these little beasties. The easiest starting point is the debris left behind. The task completed, and for 'some reason' (we'll come back to that later) BOINC couldn't delete one of the files. So it left it for later, and moved to another slot for the next task. In the message log, that looks like 14-Jul-2015 15:49:11 [---] [slot] cleaning out slots/2: handle_exited_app() 14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/astronomy_parameters.txt 14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/boinc_finish_called 14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/boinc_task_state.xml 14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/init_data.xml 14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe 14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/separation_checkpoint 14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/stars.txt 14-Jul-2015 15:49:11 [---] [slot] failed to remove file slots/2/stderr.txt: Error 32 14-Jul-2015 15:49:11 [Milkyway@Home] Computation for task ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9901989_0 finished 14-Jul-2015 15:49:11 [---] [slot] cleaning out slots/2: get_free_slot() 14-Jul-2015 15:49:11 [---] [slot] failed to remove file slots/2/stderr.txt: Error 32 14-Jul-2015 15:49:11 [Milkyway@Home] [slot] failed to clean out dir: unlink() failed 14-Jul-2015 15:49:11 [---] [slot] cleaning out slots/10: get_free_slot() 14-Jul-2015 15:49:11 [Milkyway@Home] [slot] assigning slot 10 to de_80_DR8_Rev_8_5_00004_1434551187_13360920_0 Note that the timestamps match. According to MSDN, error 32 is ERROR_SHARING_VIOLATION - BOINC couldn't delete the file, because Milkyway was still writing to it. On the website, we see task 1187921853: Name ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9901989_0, Received 14 Jul 2015, 14:50:08 UTC - again it matches (my timezone is UTC+1). The stderr on the website ends ... - no final result or call to boinc_finish. But I just had time to copy stderr.txt to another part of my hard disk. That copy ends ... Again, note that the Integration time, Average time per iteration, and Integral 0 time all match (they vary from task to task), and that the call to boinc_finish timestamp matches the message log. If BOINC had waited until the last few lines had been appended to stderr.txt, as they later were, before preparing the report for the server, I have every reason to believe this would have been a valid report. It took at least 3,200 tasks to reach that point (and I think a few of the early ones have already been purged). I'll take a pause from this project for a while, and let the GPU chew on a nice restful GPUGrid task (17 hours with none of this frantic uploading and downloading). But I'll come back and test any fix that David can come up with. |
Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
David has applied a possible fix for this: "client (Win): when read stderr.txt, wait for write lock to be release first." And Rom has built an installer to test it: "I've built a new version of 7.6 with David's latest change to address this issue." Those of you who have some experience already with v7.6.2 might like to try this and see how it compares, bearing in mind that at this point it is totally untested. (That's our job!) I'm clocking off for the night, but I'll switch back tomorrow morning and add to the testing effort. Edit - additional comment from David: I checked in a workaround in which the client waits until Windows programmers are invited to look at http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commitdiff;h=f2d690029c6dab9d586a9ba1a2e0af03dc7f3c70 |
Joined: 24 Jan 11 Posts: 716 Credit: 559,144,001 RAC: 53,096 |
Great news, Richard, on capturing the wild beast. I know it is tough because of how MW cycles the slots every minute for 1.36 tasks. I'm off to give the new 7.6.6 drop the acid test on MW. Thanks again. Cheers, Keith |
Joined: 24 Jan 11 Posts: 716 Credit: 559,144,001 RAC: 53,096 |
This is the first validated task with the new 7.6.6 client. Everything looks the same except for the PID callout, which I don't remember seeing before. Haven't seen an invalid, blank result yet, just some inconclusives. Haven't had the new client running long enough and didn't think to turn off the 1.02 tasks and suspend Einstein for 15 minutes after installing the new client. Just SETI and MW running now. I'll be shutting the systems down soon for the night and start fresh tomorrow morning. Here is the log with the extra flags and the task result. 446 Milkyway@Home 7/14/2015 5:51:56 PM [slot] assigning slot 0 to ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9990546_0 455 Milkyway@Home 7/14/2015 5:51:56 PM [task] task_state=EXECUTING for ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9990546_0 from start 456 Milkyway@Home 7/14/2015 5:51:56 PM Starting task ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9990546_0 457 Milkyway@Home 7/14/2015 5:51:56 PM [cpu_sched] Starting task ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9990546_0 using milkyway_separation__modified_fit version 136 (opencl_nvidia_101) in slot 0 588 Milkyway@Home 7/14/2015 5:52:58 PM [task] result ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9990546_0 checkpointed 632 Milkyway@Home 7/14/2015 5:53:41 PM [task] Process for ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9990546_0 exited, exit code 0, task state 1 633 Milkyway@Home 7/14/2015 5:53:41 PM [task] task_state=EXITED for ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9990546_0 from handle_exited_app 634 7/14/2015 5:53:41 PM [slot] cleaning out slots/0: handle_exited_app() 635 7/14/2015 5:53:41 PM [slot] removed file slots/0/astronomy_parameters.txt 636 7/14/2015 5:53:41 PM [slot] removed file slots/0/boinc_finish_called 637 7/14/2015 5:53:41 PM [slot] removed file slots/0/boinc_task_state.xml 638 7/14/2015 5:53:41 PM [slot] removed file slots/0/init_data.xml 639 7/14/2015 5:53:41 PM [slot] removed file slots/0/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe 640 7/14/2015 5:53:41 PM [slot] removed file slots/0/separation_checkpoint 641 7/14/2015 5:53:41 PM [slot] removed file slots/0/stars.txt 642 7/14/2015 5:53:41 PM [slot] removed file slots/0/stderr.txt 643 Milkyway@Home 7/14/2015 5:53:41 PM Computation for task ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9990546_0 finished 644 Milkyway@Home 7/14/2015 5:53:41 PM [task] result state=FILES_UPLOADING for ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9990546_0 from CS::app_finished 645 Milkyway@Home 7/14/2015 5:53:41 PM [task] result state=FILES_UPLOADED for ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9990546_0 from CS::update_results 646 7/14/2015 5:53:41 PM [slot] cleaning out slots/0: get_free_slot() 650 7/14/2015 5:53:41 PM request_exit(): PID 5008 has 0 descendants 651 7/14/2015 5:53:41 PM [slot] removed file slots/0/init_data.xml 664 7/14/2015 5:53:41 PM [slot] removed file slots/0/boinc_temporary_exit <core_client_version>7.6.6</core_client_version> <![CDATA[ <stderr_txt> <search_application> milkyway_separation 1.36 Windows x86_64 double OpenCL </search_application> Reading preferences ended prematurely BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation' Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4' Switching to Parameter File Using AVX path Found 1 platform Platform 0 information: Name: NVIDIA CUDA Version: OpenCL 1.2 CUDA 7.5.9 Vendor: NVIDIA Corporation Extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts Profile: FULL_PROFILE Using device 1 on platform 0 Found 2 CL devices Device 'GeForce GTX 970' (NVIDIA Corporation:0x10de) (CL_DEVICE_TYPE_GPU) Board: Driver version: 353.30 Version: OpenCL 1.2 CUDA Compute capability: 5.2 Max compute units: 13 Clock frequency: 1279 Mhz Global mem size: 4294967296 Local mem size: 49152 Max const buf size: 65536 Double extension: cl_khr_fp64 Build log: -------------------------------------------------------------------------------- ptxas info : 0 bytes gmem ptxas info : Compiling entry function 'probabilities' for 'sm_52' ptxas info : Function properties for probabilities ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 96 registers, 420 bytes cmem[0], 152 bytes cmem[2] -------------------------------------------------------------------------------- Build log: -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Estimated Nvidia GPU GFLOP/s: 1064 SP GFLOP/s, 133 DP FLOP/s Using a target frequency of 60.0 Using a block size of 8320 with 8 blocks/chunk Using clWaitForEvents() for polling with initial wait of 13 ms (mode 0) Range: { nu_steps = 320, mu_steps = 800, r_steps = 700 } Iteration area: 560000 Chunk estimate: 8 Num chunks: 9 Chunk size: 66560 Added area: 39040 Effective area: 599040 Initial wait: 13 ms Integration time: 95.871923 s. Average time per iteration = 299.599760 ms Integral 0 time = 96.521371 s Running likelihood with 108458 stars Likelihood time = 4.002593 s <background_integral> 0.000342179663701 </background_integral> <stream_integral> 3.453552708168287 228.484507039272870 24.801193300493413 </stream_integral> <background_likelihood> -4.268041850474944 </background_likelihood> <stream_only_likelihood> -142.051140945806140 -4.812913902331247 -4.792916224352506 </stream_only_likelihood> <search_likelihood> -3.447714477738023 </search_likelihood> 17:53:38 (4848): called boinc_finish </stderr_txt> ]]> |
Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
Not a single validate error, from over 500 tasks processed under BOINC v7.6.6 since this morning. |
Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0 |
Results so far for 7.6.6 State: All (2458) · In progress (40) · Validation pending (0) · Validation inconclusive (92) · Valid (2326) · Invalid (0) · Error (0) Application: All (2458) · MilkyWay@Home (1331) · MilkyWay@Home N-Body Simulation (0) · Milkyway@Home Separation (0) · Milkyway@Home Separation (Modified Fit) (1127) Thanks Keith and Richard for pushing the workaround |
Joined: 24 Jan 11 Posts: 716 Credit: 559,144,001 RAC: 53,096 |
Over 200 valid 1.36 tasks so far since the systems came back online this morning with the new 7.6.6 client. Looking good so far, and I'm thinking of turning off the extra logging since we seem to have finally overcome the errors. Thanks for the help with the bug detection, Richard, and thanks to all the other beta testers like Jeff and Jason over at SETI who helped define the bug, and to the BOINC developers for coming up with a solution and a quickly implemented fix. Cheers, Keith |
Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
Heading close to 2,000 without error now. One additional problem at this project: the administrators have set quite a low 'maximum errors' threshold. Two validate errors together, plus one other glitch, and the whole workunit is killed. Once BOINC v7.6.6 (or its successor) is fully tested and released as 'recommended', I'd suggest you start a push to get as many people as possible to upgrade. |
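
For anyone who wants to sort their own invalids the way the thread does (blank stderr, stderr cut off partway, or a full report), here is a minimal sketch in Python. It is only a heuristic based on the examples posted above: it assumes a complete report contains a `<search_likelihood>` element and a `called boinc_finish` line, as in Keith's validated task. The function name `classify_stderr` and the three labels are made up for illustration; this is not how the project validator works.

```python
import re

# Markers taken from the complete stderr posted in the thread (assumption:
# every complete Separation report contains both of them).
STDERR_BLOCK = re.compile(r"<stderr_txt>(.*?)</stderr_txt>", re.DOTALL)
FINISH_MARKER = "called boinc_finish"
RESULT_MARKER = "<search_likelihood>"


def classify_stderr(report: str) -> str:
    """Return 'empty', 'partial', or 'complete' for a task's stderr report."""
    match = STDERR_BLOCK.search(report)
    # If only a fragment was pasted (no opening tag), inspect the whole text.
    body = match.group(1) if match else report
    if not body.strip():
        return "empty"
    if FINISH_MARKER in body and RESULT_MARKER in body:
        return "complete"
    return "partial"


if __name__ == "__main__":
    blank = "<core_client_version>7.4.42</core_client_version> <![CDATA[ <stderr_txt> </stderr_txt> ]]>"
    print(classify_stderr(blank))
```

Fed the blank report from the FX-8350 hosts it returns `empty`, and a report that stops at the "Initial wait" line comes back as `partial`.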
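
The race described in this thread (the client reads stderr.txt while the science app is still appending its last lines) can be illustrated with a small polling sketch in Python. To be clear about assumptions: this is not the BOINC fix. According to the commit message quoted above, the real change makes the Windows client wait for the write lock to be released, whereas this sketch simply re-reads the file until a caller-supplied completeness check passes or a timeout expires. The names `read_when_complete`, `is_complete`, `timeout_s`, and `poll_s` are hypothetical.

```python
import time
from pathlib import Path
from typing import Callable


def read_when_complete(
    path: str,
    is_complete: Callable[[str], bool],
    timeout_s: float = 5.0,
    poll_s: float = 0.05,
) -> str:
    """Re-read a file until is_complete(text) is true or the timeout expires.

    Returns the last text read, which may still be incomplete on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            text = Path(path).read_text(errors="replace")
        except OSError:
            # File missing or still locked by the writer; treat as empty and retry.
            text = ""
        if is_complete(text) or time.monotonic() >= deadline:
            return text
        time.sleep(poll_s)


if __name__ == "__main__":
    # Example: wait up to 5 seconds for the final line seen in complete reports.
    text = read_when_complete("stderr.txt", lambda s: "called boinc_finish" in s)
    print(len(text))
```

The design choice here is to return whatever was read once the deadline passes rather than raise, so a caller can still report a truncated file instead of blocking forever on an app that never finishes writing.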