Separation tasks with 7 sub-tasks - runtime and credit
Author | Message |
---|---|
Joined: 16 Mar 10 Posts: 213 Credit: 108,362,278 RAC: 4,516 |
I am now receiving a lot of tasks that turn out to have 7 WUs rather than the usual 4. As one might expect, they take about 75% longer to run (on the evidence of the few that have gone through so far). However, they don't appear to have an increased Estimated computation size, so [at first] the client fetches too much work at once; it may sort itself out over time, but this could be a bit of a nuisance for users for whom MilkyWay isn't the only project being run... Of less concern is that the tasks get the same credit as 4-WU tasks... And an observation about an error in the command line: it is fortunate that the application counts the parameters and works out the WU count from that, as the -np parameter for these 7-WU tasks is 416 instead of 182! I guess it has added the usual -np value of 104 for each extra WU :-) An appropriate warning is produced. Are these larger jobs a sign that we're nearing the end of this set of stripes? Cheers - Al. [Edited to "tidy up" the text.] |
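The -np arithmetic above can be checked with a few lines of Python. This is only a sketch of the reasoning in the post, not the application's own logic: the function names are made up for illustration, and the figure of 26 parameters per WU is inferred from the numbers quoted (104 for 4 WUs, 182 for 7).

```python
PARAMS_PER_WU = 26  # inferred from the post: 104 / 4 == 182 / 7 == 26


def expected_np(num_wus: int) -> int:
    """Value -np would carry if it simply counted every fit parameter."""
    return num_wus * PARAMS_PER_WU


def wus_from_param_count(num_params: int) -> int:
    """Work out the bundle size from the number of values actually supplied."""
    if num_params % PARAMS_PER_WU != 0:
        raise ValueError(f"{num_params} is not a multiple of {PARAMS_PER_WU}")
    return num_params // PARAMS_PER_WU


def guessed_bad_np(num_wus: int, usual_wus: int = 4) -> int:
    """Al's guess: the usual -np, plus one more usual -np for each extra WU."""
    usual = expected_np(usual_wus)
    return usual + (num_wus - usual_wus) * usual


print(expected_np(4), expected_np(7), guessed_bad_np(7))
```

Counting the supplied values (as the application is described as doing) gives the right bundle size even when -np itself is wrong, which is why the 7-WU tasks still ran.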
Joined: 16 Mar 10 Posts: 213 Credit: 108,362,278 RAC: 4,516 |
Correction to the above (missed the edit deadline) - not "a lot" but "several"; the command I was using to count them had an error in it, which didn't become apparent until I noticed that it was returning 4-WU tasks I didn't think I had! The points raised remain valid, though they are now less of a worry (unless the larger jobs become the norm at some point). Sorry about missing the edit deadline on the original :-( Cheers - Al. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
That's very strange, since the project isn't set up to bundle different numbers of WUs on the same runs. Right now, only 4-WU bundles should be going out on Separation. Can you link some of the 7-WU jobs that you had? In fact, I would have expected a 7-WU bundle to be impossible, since the command line overflows with 5 WUs, which meant that we had to reduce it to 4 per bundle. 7 WUs should not run if that's the case. Maybe they're very old WUs or something? |
Joined: 16 Mar 10 Posts: 213 Credit: 108,362,278 RAC: 4,516 |
Tom, I have just spotted one that was recently returned - task details as follows:
Task: 323137295
Name: de_modfit_85_bundle4_4s_south4s_gapfix_1627399316_34598455_0
Workunit: 178235307
Created: 9 Sep 2021, 4:38:13 UTC
Sent: 9 Sep 2021, 4:49:33 UTC
Report deadline: 21 Sep 2021, 4:49:33 UTC
Received: 10 Sep 2021, 22:44:24 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x00000000)
Computer ID: 817681
Run time: 6 min 24 sec
CPU time: 4 min
Validate state: Valid
Credit: 230.36
Device peak FLOPS: 1,111.81 GFLOPS
Application version: Milkyway@home Separation v1.46 (opencl_nvidia_101) x86_64-pc-linux-gnu
Peak working set size: 745.19 MB
Peak swap size: 6,195.81 MB
Peak disk usage: 0.02 MB
This particular task had an interesting extra wrinkle - some of the parameters had values that needed to be represented in scientific notation, and it appeared to take a dislike to one of them, so there was an error as well as a warning; however, it went on to process all 7 WUs... Here's the top of the stderr output:
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
Warning: Number of parameters remaining can't match expected: -np = 416
Error parsing command line fit parameters at '4.66372e-310' (34): Numerical result out of range
<search_application> milkyway_separation 1.46 Linux x86_64 double OpenCL </search_application>
Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation'
Setting process priority to 0 (13): Permission denied
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 7 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>
I've not reproduced the rest of the output, though I have kept a copy...
There are others outstanding, some of which also appear to have lots of "scientific notation" - one such example is task 325293500 (workunit 179771024, name de_modfit_84_bundle4_4s_south4s_gapfix_1627399316_36097698_0), for which the command line recorded in client_state.xml reads as follows (broken up into groups of 26 parameters for convenience!):
-f -np 416 -p
0.998217 1.36542 0.0183013 377.701 38.5872 2.00141 2.53033 0.451065 -0.631398 391.931 25.2571 0.882878 1.66531 2.73415 -1.55333 374.449 27.8719 0.168161 2.18661 0.441706 -15.8157 357.942 51.793 2.2979 2.13944 14.5041
0 2.42092e-322 5.35632e+199 1.23678e+224 6.28641e+151 1.98861 2.71258 1.90215e-321 4.66372e-310 4.66372e-310 1.10711e-47 2.09198e-76 8.24227e-72 1.12966e-42 8.52069e-96 1.17687e-47 7.11636e-38 1.51765e-47 7.71278e-43 1.03277e-47 2.57895e-57 2.48103e-91 3.5037e-33 3.06367 4.91512e-62 23.889
1.30457e-76 2.1102e-52 1.56536e-76 7.11732e-67 1.05221e-153 6.74642e-67 6.2324e-38 2.0062e-52 1.9014e-52 1.5761e-52 1.83019e-76 8.44253e-53 2.77026 8.52981e-96 -1.47476 1.31418e-71 3.66919e-62 2.00528e-76 3.67342e-62 8.23298e-67 2.48103e-91 3.93518e-62 3.37548e-57 1.91891e-76 2.21583e-52 1.479e-76
7.71276e-43 6.49721e-307 6.50068e-307 1.90215e-321 38.5872 4.66372e-310 9.15785e-72 7.11521e-38 8.24227e-72 4.65925e-33 1.08597e-95 2.11142e-52 1.08668e-71 2.89631e-57 3.69908e-57 2.90737e-33 27.8719 5.53288e-48 2.18661 0.441706 -17.3729 4.01272e-57 7.48534e-67 4.42134e-62 2.17561e-76 5.3601
0.991456 1.42038 -11.7379 377.701 38.5872 2.52953 2.16913 0.451065 -0.532608 390.172 24.9045 0.6091 1.48048 2.57283 -1.73009 374.449 27.8719 0.168161 2.18661 0.441706 -10.2167 373.79 4.50215 2.67771 1.14709 8.78304
0.98231 1.21693 -8.79135 377.701 38.5872 2.56202 2.79823 0.451065 -0.63577 390.525 25.3944 0.351936 2.03994 2.5284 -1.7801 374.449 27.8719 0.168161 2.18661 0.441706 -4.63546 379.789 24.6512 2.11044 1.9094 22.0583
0.981905 1.4735 -1.46697 377.701 38.5872 1.73276 2.53817 0.451065 -0.538353 391.529 24.5004 0.172322 2.08915 2.31049 -1.25135 374.449 27.8719 0.168161 2.18661 0.441706 -18.1465 396.641 39.2878 1.99856 0.532434 13.6343
Note that its 35th and 36th parameters have the same value that caused the error message in the earlier example, so I expect that when I eventually let this one run it'll show the same error message... I also noted that whilst the first parameter in each set of 26 is usually somewhere around 0.9999, in this case the second, third and fourth parameter sets don't adhere to that. I had a look at the parameter sets for one or two more of these that are/were still pending and they showed the same pattern...
One of the -np 416 tasks on my other machine that has recently returned a result had an even more interesting parameter set; here are the second, third and fourth parameter sets - note all the zeroes (which may or may not be legitimate, of course!):
0 8.74496e-322 -0.54365 4.66372e-310 0 0 4.66372e-310 0 0 0 0 1.18576e-322 4.03158e-320 1.97626e-323 0 0 0 2.53136 0 2.5049 4.94066e-324 0 8.69556e-322 8.69556e-322 0 4.66372e-310
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.46473 4.79244e-322 4.66372e-310 355.245 2.71736e-322 3.21143e-322 3.13955 6.93705e-310
6.60072e-320 1.63042e-322 4.66372e-310 6.93705e-310 4.74303e-322 1.50196e-321 4.66372e-310 4.66372e-310 3.53355e-57 386.912 2.48104e-91 3.37922e-57 4.27332e-33 2.57771e-57 4.28009e-33 8.54519e-72 1.5874e-47 2.54183 3.67057e-62 2.00702e-52 8.44252e-53 9.93609e-96 4.90748e-62 1.19988e-71 7.106e-38 6.38275e-67
Other sets looked more "usual"... The basic task details are:
Task: 324745670
Name: de_modfit_85_bundle4_4s_south4s_gapfix_1627399316_35715179_0
Workunit: 179380183
Created: 10 Sep 2021, 12:04:00 UTC
Sent: 10 Sep 2021, 12:17:22 UTC
Report deadline: 22 Sep 2021, 12:17:22 UTC
Received: 11 Sep 2021, 8:30:34 UTC
It raised an error about that 8.74496e-322 value but, as per usual, none of the sub-tasks failed to find a solution.
By the way, the over-long command line is less likely to be an issue with newer BOINC clients, which explains why the task(s) actually ran. I don't know whether it's related, but I have also seen three jobs which went to Error status with an argument parsing error: unfortunately, as they failed instantly, I can't guess whether they had the standard 4 WUs or not... One of them is still hanging around on your server at the time of writing, because the retry went as a CPU task to a machine that seems to be sitting on CPU tasks instead of running them, so you may be able to see more information about it (such as the parameters?) Here are the complete details for that particular task...
Task: 315437424
Name: de_modfit_85_bundle4_4s_south4s_gapfix_bgset2_1627399316_29321575_0
Workunit: 172772688
Created: 2 Sep 2021, 13:48:16 UTC
Sent: 2 Sep 2021, 14:00:37 UTC
Report deadline: 14 Sep 2021, 14:00:37 UTC
Received: 3 Sep 2021, 17:47:26 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 1 (0x00000001) Unknown error code
Computer ID: 571356
Run time: (blank)
CPU time: (blank)
Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 432.95 GFLOPS
Application version: Milkyway@home Separation v1.46 (opencl_nvidia_101) x86_64-pc-linux-gnu
Stderr output:
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message> process exited with code 1 (0x1, -255)</message>
<stderr_txt>
Argument parsing error: -11.2886: unknown option
</stderr_txt>
]]>
The other two tasks that had a similar error also failed for any wingmen that tried them, and after three failures the tasks naturally got written off! Because of the way I recorded notes on these, I may not have got the workunit numbers in the right order...
Workunits 175382603 and 175930791:
de_modfit_85_bundle4_4s_south4s_gapfix_bgset3_1627399316_31840719 - Argument parsing error: -1.06793: unknown option
de_modfit_85_bundle4_4s_south4s_gapfix_bgset3_1627399316_32367349 - Argument parsing error: -17.7233: unknown option
I was wondering if that happens if the first parameter after the -p option is negative (though I suspect it shouldn't be!) - given the other oddities I've seen recently, who knows?!? Hope the above isn't too much information, and that it helps somewhat. Cheers - Al. |
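For anyone wanting to repeat the check on their own client_state.xml command lines, here is a rough Python sketch of the grouping described in the post: split the values after -p into sets of 26 and list which sets contain subnormal doubles such as 4.66372e-310. It is not the project's code; the helper names are hypothetical, and treating "subnormal" as the marker of a suspicious value is only an assumption based on the values that drew the "Numerical result out of range" error.

```python
import sys

PARAMS_PER_WU = 26  # from <number_params_per_WU> in the stderr output


def split_into_sets(values, per_wu=PARAMS_PER_WU):
    """Split a flat list of fit parameters into one list per WU."""
    if len(values) % per_wu != 0:
        raise ValueError(f"{len(values)} values is not a multiple of {per_wu}")
    return [values[i:i + per_wu] for i in range(0, len(values), per_wu)]


def is_subnormal(x):
    """True for non-zero doubles below the smallest normal double in magnitude."""
    return x != 0.0 and abs(x) < sys.float_info.min


def odd_sets(values, per_wu=PARAMS_PER_WU):
    """Return {set number (1-based): [subnormal values]} for sets that have any."""
    report = {}
    for number, params in enumerate(split_into_sets(values, per_wu), start=1):
        tiny = [p for p in params if is_subnormal(p)]
        if tiny:
            report[number] = tiny
    return report


# Small demonstration using sets of 3 instead of 26, with values quoted above.
tokens = "0.998217 1.36542 0.0183013 0 2.42092e-322 4.66372e-310".split()
demo = [float(t) for t in tokens]
print(odd_sets(demo, per_wu=3))
```

Exact zeroes are deliberately not flagged, since (as noted above) they may or may not be legitimate.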
Joined: 16 Mar 10 Posts: 213 Credit: 108,362,278 RAC: 4,516 |
Tom, further to the above - that stalled task (315437424) has finally been run by the delaying wingman - it came back with an Error (no surprise!), as did the next retry (a GPU task, returned almost immediately!), so that job has also gone the "three strikes and out" route now :-) Cheers - Al. |
©2024 Astroinformatics Group