Welcome to MilkyWay@home

Separation tasks with 7 sub-tasks - runtime and credit

alanb1951
Joined: 16 Mar 10
Posts: 213
Credit: 108,362,278
RAC: 4,516
Message 71107 - Posted: 10 Sep 2021, 14:59:37 UTC
Last modified: 10 Sep 2021, 15:21:33 UTC

I am now receiving a lot of tasks that turn out to have 7 WUs rather than the usual 4. As one might expect, they take about 75% longer to run (on the evidence of the few that have gone through so far).

However, they don't appear to have an increased Estimated computation size so [at first] downloading fetches too much work at once; it may sort itself out over time, but this could be a bit of a nuisance for users for whom MilkyWay isn't the only project being run...

Of less concern is that the tasks get the same credit as 4-WU tasks...
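For what it's worth, the "about 75% longer" figure is what simple linear scaling would predict. The snippet below is only a back-of-envelope sketch: it assumes run time (and a "fair" credit) scale linearly with the number of bundled WUs, which is an assumption and not something the project has confirmed.

```python
# Back-of-envelope check, assuming run time and "fair" credit both scale
# linearly with the number of WUs bundled into a task (an assumption).
USUAL_WUS = 4
SEEN_WUS = 7


def scale_factor(seen: int, usual: int) -> float:
    """Ratio of work in a task with `seen` WUs to one with `usual` WUs."""
    return seen / usual


factor = scale_factor(SEEN_WUS, USUAL_WUS)
extra_runtime_pct = (factor - 1) * 100
# Same credit spread over more sub-tasks: share earned per sub-task.
credit_share = USUAL_WUS / SEEN_WUS

print(f"run time factor: {factor:.2f} (+{extra_runtime_pct:.0f}%)")
print(f"credit per sub-task relative to a 4-WU task: {credit_share:.0%}")
```

On that assumption, the estimated computation size for a 7-WU task would also need to be about 1.75 times the 4-WU figure for work fetch to behave.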

And an observation about an error in the command line: it is fortunate that the application counts parameters and works out the WU count from that, as the -np parameter for these 7-WU tasks is 416 instead of 182! I guess it's added the usual -np value of 104 for each extra WU :-) An appropriate warning is produced.
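The arithmetic behind that guess can be sketched as follows. The value of 26 parameters per WU comes from the number_params_per_WU line in the stderr output quoted later in this thread; the function names are made up for illustration, and the counting logic is only my reading of what the application appears to do, not its actual code.

```python
# Illustrative arithmetic only; function names are hypothetical.
PARAMS_PER_WU = 26  # from <number_params_per_WU> in the stderr output
USUAL_BUNDLE = 4    # normal number of WUs per Separation task


def expected_np(n_wus: int, params_per_wu: int = PARAMS_PER_WU) -> int:
    """The -np value one would expect for a bundle of n_wus WUs."""
    return n_wus * params_per_wu


def wus_from_values(values, params_per_wu: int = PARAMS_PER_WU) -> int:
    """Infer the WU count by counting the values after -p."""
    count, remainder = divmod(len(values), params_per_wu)
    if remainder:
        raise ValueError("parameter count is not a whole number of WUs")
    return count


usual_np = expected_np(USUAL_BUNDLE)                 # 104
correct_np = expected_np(7)                          # 182
guessed_np = usual_np + (7 - USUAL_BUNDLE) * usual_np  # 104 added per extra WU

print(usual_np, correct_np, guessed_np)
```

The last figure comes out at 416, which matches the -np value seen on the command line, so the "104 added per extra WU" guess is at least numerically consistent.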

Are these larger jobs a sign that we're nearing the end of this set of stripes?

Cheers - Al.

[Edited to "tidy up" the text.]
ID: 71107 · Rating: 0
alanb1951
Joined: 16 Mar 10
Posts: 213
Credit: 108,362,278
RAC: 4,516
Message 71108 - Posted: 10 Sep 2021, 16:27:08 UTC - in response to Message 71107.  

Correction to the above (missed the edit deadline) - not "a lot" but "several"; the counter command I was using had an error in it, which didn't become apparent until I noticed that it was returning 4-WU tasks I didn't think I had!

The points raised remain valid, though they are now less of a worry (unless the larger jobs become the norm at some point).

Sorry about missing the edit deadline on the original :-(

Cheers - Al.
ID: 71108 · Rating: 0
Profile Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71109 - Posted: 10 Sep 2021, 21:21:33 UTC

That's very strange, since the project isn't set up to bundle different numbers of WUs on the same runs. Right now, only 4-WU bundles should be going out on Separation. Can you link some of the 7-WU jobs that you had?

In fact, I would have expected that a 7-WU bundle was impossible, since the command line overflows with 5 WUs, which meant that we had to reduce it to 4 per bundle. 7 WUs should not run if that's the case. Maybe they're very old WUs or something?
ID: 71109 · Rating: 0
alanb1951
Joined: 16 Mar 10
Posts: 213
Credit: 108,362,278
RAC: 4,516
Message 71110 - Posted: 11 Sep 2021, 9:56:53 UTC - in response to Message 71109.  

Tom.

I have just spotted one that was recently returned - task details as follows:

Task 323137295
Name 	de_modfit_85_bundle4_4s_south4s_gapfix_1627399316_34598455_0
Workunit 	178235307
Created 	9 Sep 2021, 4:38:13 UTC
Sent 	9 Sep 2021, 4:49:33 UTC
Report deadline 	21 Sep 2021, 4:49:33 UTC
Received 	10 Sep 2021, 22:44:24 UTC
Server state 	Over
Outcome 	Success
Client state 	Done
Exit status 	0 (0x00000000)
Computer ID 	817681
Run time 	6 min 24 sec
CPU time 	4 min
Validate state 	Valid
Credit 	230.36
Device peak FLOPS 	1,111.81 GFLOPS
Application version 	Milkyway@home Separation v1.46 (opencl_nvidia_101) x86_64-pc-linux-gnu
Peak working set size 	745.19 MB
Peak swap size 	6,195.81 MB
Peak disk usage 	0.02 MB

This particular task had an interesting extra wrinkle - some of the parameters had values that needed to be represented in scientific notation, and it appeared to take a dislike to one of them, so there was an error as well as a warning; however, it went on to process all 7 WUs...

Here's the top of the stderr output:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
Warning: Number of parameters remaining can't match expected: -np = 416
Error parsing command line fit parameters at '4.66372e-310' (34): Numerical result out of range
<search_application> milkyway_separation 1.46 Linux x86_64 double OpenCL </search_application>
Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation'
Setting process priority to 0 (13): Permission denied
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4' 
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 7 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>

I've not reproduced the rest of the output, though I have kept a copy...
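A possible explanation for the "Numerical result out of range" message: the flagged value is smaller than the smallest normal double (about 2.225e-308), so it is a subnormal number, and C conversion routines such as strtod may set errno to ERANGE on underflow. The "(34)" in the message would be consistent with ERANGE on Linux. Whether the application really parses its parameters this way is an assumption on my part. The sketch below just classifies values by magnitude; classify is a made-up helper name.

```python
import math
import sys

# Smallest positive *normal* double, about 2.225e-308.
SMALLEST_NORMAL = sys.float_info.min


def classify(text: str) -> str:
    """Classify a command-line value by magnitude (illustrative helper)."""
    try:
        value = float(text)
    except ValueError:
        return "not a number"
    if math.isinf(value) or math.isnan(value):
        return "overflow or NaN"
    if value == 0.0:
        return "zero"
    if abs(value) < SMALLEST_NORMAL:
        return "subnormal"
    return "normal"


# Values copied from this thread.
for sample in ["4.66372e-310", "8.74496e-322", "2.48103e-91", "377.701", "0"]:
    print(f"{sample:>14}  {classify(sample)}")
```

Python's own float() accepts subnormal input without complaint, so this only shows which values fall in the range where a stricter parser might object.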

There are others outstanding, some of which also appear to have lots of "scientific notation" - one such example is task 325293500 (workunit 179771024, name de_modfit_84_bundle4_4s_south4s_gapfix_1627399316_36097698_0), for which the command line recorded in client_state.xml reads as follows (broken up into shorter lines, and into groups of 26 parameters for convenience!):

-f  -np 416 -p 0.998217 1.36542 0.0183013 377.701 38.5872 2.00141 2.53033 
 0.451065 -0.631398 391.931 25.2571 0.882878 1.66531 2.73415 -1.55333
 374.449 27.8719 0.168161 2.18661 0.441706 -15.8157 357.942 51.793 2.2979
 2.13944 14.5041

 0 2.42092e-322 5.35632e+199 1.23678e+224 6.28641e+151
 1.98861 2.71258 1.90215e-321 4.66372e-310 4.66372e-310 1.10711e-47
 2.09198e-76 8.24227e-72 1.12966e-42 8.52069e-96 1.17687e-47 7.11636e-38
 1.51765e-47 7.71278e-43 1.03277e-47 2.57895e-57 2.48103e-91 3.5037e-33
 3.06367 4.91512e-62 23.889

 1.30457e-76 2.1102e-52 1.56536e-76 7.11732e-67  1.05221e-153 6.74642e-67
 6.2324e-38 2.0062e-52 1.9014e-52 1.5761e-52  1.83019e-76 8.44253e-53 2.77026
 8.52981e-96 -1.47476 1.31418e-71  3.66919e-62 2.00528e-76 3.67342e-62
 8.23298e-67 2.48103e-91 3.93518e-62  3.37548e-57 1.91891e-76 2.21583e-52
 1.479e-76

 7.71276e-43 6.49721e-307 6.50068e-307 1.90215e-321 38.5872 4.66372e-310
 9.15785e-72 7.11521e-38 8.24227e-72 4.65925e-33 1.08597e-95 2.11142e-52
 1.08668e-71 2.89631e-57 3.69908e-57 2.90737e-33 27.8719 5.53288e-48  2.18661
 0.441706 -17.3729 4.01272e-57 7.48534e-67 4.42134e-62 2.17561e-76  5.3601

 0.991456 1.42038  -11.7379 377.701 38.5872 2.52953 2.16913 0.451065 -0.532608
 390.172  24.9045 0.6091 1.48048 2.57283 -1.73009 374.449 27.8719 0.168161
 2.18661 0.441706 -10.2167 373.79 4.50215 2.67771 1.14709 8.78304

 0.98231 1.21693 -8.79135 377.701 38.5872 2.56202 2.79823 0.451065 -0.63577
 390.525  25.3944 0.351936 2.03994 2.5284 -1.7801 374.449 27.8719 0.168161
 2.18661  0.441706 -4.63546 379.789 24.6512 2.11044 1.9094 22.0583

 0.981905 1.4735 -1.46697 377.701 38.5872 1.73276 2.53817 0.451065 -0.538353
 391.529 24.5004 0.172322 2.08915 2.31049 -1.25135 374.449 27.8719 0.168161
 2.18661 0.441706 -18.1465 396.641 39.2878 1.99856 0.532434 13.6343

Note that its 35th and 36th parameters have the same value that caused the error message in the earlier example, so I expect that when I eventually let this one run it'll show the same error message...

I also noted that whilst the first parameter in each set of 26 is usually somewhere around 0.9999, in this case the second, third and fourth parameter sets don't adhere to that. I had a look at the parameter sets for one or two more of these that are/were still pending and they showed the same pattern...
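For anyone wanting to repeat this sort of check, here is a rough sketch of how the grouping and position counting could be scripted. It uses only the first two parameter sets from the command line above as sample data; the helper names and the 0.05 tolerance for "somewhere around 0.9999" are my own assumptions.

```python
# Rough sketch; helper names and the tolerance value are assumptions.
PARAMS_PER_WU = 26

# First two parameter sets of task 325293500, copied from the command line.
SAMPLE = """
0.998217 1.36542 0.0183013 377.701 38.5872 2.00141 2.53033
0.451065 -0.631398 391.931 25.2571 0.882878 1.66531 2.73415 -1.55333
374.449 27.8719 0.168161 2.18661 0.441706 -15.8157 357.942 51.793 2.2979
2.13944 14.5041
0 2.42092e-322 5.35632e+199 1.23678e+224 6.28641e+151
1.98861 2.71258 1.90215e-321 4.66372e-310 4.66372e-310 1.10711e-47
2.09198e-76 8.24227e-72 1.12966e-42 8.52069e-96 1.17687e-47 7.11636e-38
1.51765e-47 7.71278e-43 1.03277e-47 2.57895e-57 2.48103e-91 3.5037e-33
3.06367 4.91512e-62 23.889
""".split()


def split_sets(values, size=PARAMS_PER_WU):
    """Break the flat parameter list into per-WU groups."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def locate(values, target):
    """1-based positions of a value, compared as text."""
    return [i for i, v in enumerate(values, start=1) if v == target]


def suspicious_sets(sets, expected=1.0, tolerance=0.05):
    """1-based numbers of sets whose first parameter is not near `expected`."""
    return [n for n, s in enumerate(sets, start=1)
            if abs(float(s[0]) - expected) > tolerance]


sets = split_sets(SAMPLE)
print("positions of 4.66372e-310:", locate(SAMPLE, "4.66372e-310"))
print("sets with an odd first parameter:", suspicious_sets(sets))
```

Run against the full list of 182 values, the same helpers should flag the second, third and fourth sets described above.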

One of the -np 416 tasks on my other machine that has recently returned a result had an even more interesting parameter set; here are the second, third and fourth parameter sets - note all the zeroes (which may or may not be legitimate, of course!):
 0 8.74496e-322 -0.54365 4.66372e-310
 0 0 4.66372e-310 0 0 0 0 1.18576e-322 4.03158e-320 1.97626e-323 0
 0 0 2.53136 0 2.5049 4.94066e-324 0 8.69556e-322 8.69556e-322 0
 4.66372e-310

 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.46473 4.79244e-322
 4.66372e-310 355.245 2.71736e-322 3.21143e-322 3.13955 6.93705e-310

 6.60072e-320 1.63042e-322 4.66372e-310 6.93705e-310 4.74303e-322
 1.50196e-321 4.66372e-310 4.66372e-310 3.53355e-57 386.912 2.48104e-91
 3.37922e-57 4.27332e-33 2.57771e-57 4.28009e-33 8.54519e-72 1.5874e-47
 2.54183 3.67057e-62 2.00702e-52 8.44252e-53 9.93609e-96 4.90748e-62
 1.19988e-71 7.106e-38 6.38275e-67

Other sets looked more "usual"...
The basic task details are:
Task 324745670
Name 	de_modfit_85_bundle4_4s_south4s_gapfix_1627399316_35715179_0
Workunit 	179380183
Created 	10 Sep 2021, 12:04:00 UTC
Sent 	10 Sep 2021, 12:17:22 UTC
Report deadline 	22 Sep 2021, 12:17:22 UTC
Received 	11 Sep 2021, 8:30:34 UTC

It raised an error about that 8.74496e-322 value but, as per usual, none of the sub-tasks failed to find a solution.

By the way, the over-long command line is less likely to be an issue with newer BOINC clients, which explains why the task(s) actually ran.

I don't know whether it's related, but I have also seen three jobs which went to Error status with an argument parsing error: unfortunately, as they failed instantly, I can't guess whether they had the standard 4 WUs or not...

One of them is still hanging around on your server at the time of writing, because the retry went as a CPU task to a machine that seems to be sitting on CPU tasks instead of running them, so you may be able to see more information about it (such as the parameters?).

Here are the complete details for that particular task:

Task 315437424
Name 	de_modfit_85_bundle4_4s_south4s_gapfix_bgset2_1627399316_29321575_0
Workunit 	172772688
Created 	2 Sep 2021, 13:48:16 UTC
Sent 	2 Sep 2021, 14:00:37 UTC
Report deadline 	14 Sep 2021, 14:00:37 UTC
Received 	3 Sep 2021, 17:47:26 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	1 (0x00000001) Unknown error code
Computer ID 	571356
Run time 	
CPU time 	
Validate state 	Invalid
Credit 	0.00
Device peak FLOPS 	432.95 GFLOPS
Application version 	Milkyway@home Separation v1.46 (opencl_nvidia_101) x86_64-pc-linux-gnu
Stderr output

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
Argument parsing error: -11.2886: unknown option
</stderr_txt>
]]>


The other two tasks that had a similar error also failed for every wingman that tried them, and after three failures the tasks naturally got written off! Because of the way I recorded notes on these, I may not have got the workunit numbers in the right order...

Workunits 175382603 and 175930791

  de_modfit_85_bundle4_4s_south4s_gapfix_bgset3_1627399316_31840719
      Argument parsing error: -1.06793: unknown option

  de_modfit_85_bundle4_4s_south4s_gapfix_bgset3_1627399316_32367349
      Argument parsing error: -17.7233: unknown option


I was wondering if that happens if the first parameter after the -p option is negative (though I suspect it shouldn't be!) - given the other oddities I've seen recently, who knows?!?
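If the command lines for the failed tasks can be recovered on the server, that guess would be easy to test. Below is a minimal sketch of such a check; the helper names are invented, the second command line is a made-up example (I don't have the real ones for the failed tasks), and nothing here reflects how the application actually parses its options.

```python
import shlex


def first_value_after_p(command_line: str):
    """Return the token that follows -p, or None if there isn't one."""
    tokens = shlex.split(command_line)
    if "-p" not in tokens:
        return None
    index = tokens.index("-p") + 1
    return tokens[index] if index < len(tokens) else None


def starts_with_negative(command_line: str) -> bool:
    """True if the first parameter after -p is a negative number."""
    token = first_value_after_p(command_line)
    if token is None:
        return False
    try:
        return float(token) < 0
    except ValueError:
        return False


# Start of the command line recorded for task 325293500 (ran successfully).
ran_ok = "-f  -np 416 -p 0.998217 1.36542 0.0183013"
# Hypothetical command line, invented to illustrate the suspected failure.
suspect = "-f -np 104 -p -11.2886 1.2 0.5"

print(starts_with_negative(ran_ok))
print(starts_with_negative(suspect))
```

If every failed task turned out to have a negative first value and no successful task did, that would support the idea; a single counter-example would rule it out.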

Hope the above isn't too much information, and that it helps somewhat.

Cheers - Al.
ID: 71110 · Rating: 0
alanb1951
Joined: 16 Mar 10
Posts: 213
Credit: 108,362,278
RAC: 4,516
Message 71116 - Posted: 14 Sep 2021, 3:17:30 UTC - in response to Message 71110.  

Tom,

Further to the above - that stalled task (315437424) has finally been run by the delaying wingman - it came back with an Error (no surprise!), as did the next retry (a GPU task, returned almost immediately!), so that job has also gone the "three strikes and out" route now :-)

Cheers - Al.
ID: 71116 · Rating: 0

©2024 Astroinformatics Group