Message boards : News : Admin Updates Discussion
---
Joined: 11 Sep 24 · Posts: 13 · Credit: 32,459 · RAC: 1,485

We have found what appears to be the source of the differing results between Windows and Linux. The problematic function is not essential, so we will be doing runs without it until the code is replaced in our next version. We expect the number of invalids to be significantly reduced going forward, but we will be keeping an eye out for additional issues. As mentioned, one might be an issue with runs that restart after a shutdown. If you happen to shut down during some tasks, let us know if you notice anything!
---
Joined: 11 Sep 24 · Posts: 13 · Credit: 32,459 · RAC: 1,485

The invalid results given when resuming a run were caused by some checkpoint files not storing all of the information that was needed. A fix will be included in the next update.
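For illustration only (this is not MilkyWay's actual code): a resume bug of this kind typically happens when a checkpoint saves the simulation state but omits something the run needs to continue deterministically, such as the random-number generator state. A minimal sketch of a complete checkpoint, with hypothetical field names:

```python
import pickle
import random

def save_checkpoint(path, bodies, step, rng):
    # Persist *everything* needed to reproduce the run exactly:
    # particle state, the current step, and the RNG's internal state.
    # Omitting rng.getstate() is the classic mistake: the resumed run
    # draws a different random sequence and the result fails validation.
    with open(path, "wb") as f:
        pickle.dump({"bodies": bodies, "step": step,
                     "rng_state": rng.getstate()}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        state = pickle.load(f)
    rng = random.Random()
    rng.setstate(state["rng_state"])  # resume the exact RNG position
    return state["bodies"], state["step"], rng

rng = random.Random(42)
save_checkpoint("state.pkl", bodies=[(0.0, 0.0, 0.0)], step=100, rng=rng)
bodies, step, rng2 = load_checkpoint("state.pkl")
assert rng2.random() == random.Random(42).random()  # identical continuation
```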
---
Joined: 19 Jul 10 · Posts: 775 · Credit: 20,502,551 · RAC: 9,758

Thank you for keeping us updated. :-)
---
Joined: 23 Aug 11 · Posts: 58 · Credit: 18,324,988 · RAC: 21,589

Good to know. Thanks for the update.
---
Bill F · Joined: 4 Jul 09 · Posts: 108 · Credit: 18,317,753 · RAC: 2,586

The tasks generated after the changes on Nov 11 appear to have increased overall task flow, if you look at BOINCStats at the project level: https://www.boincstats.com/stats/61/project/detail/credit

Bill F
---
Joined: 19 Jul 10 · Posts: 775 · Credit: 20,502,551 · RAC: 9,758

The rsc_fpops_est update has been implemented on the server. Here are the results for my first 9 WUs generated after this point:

| WU # | Name | Created | Est. GFLOPs | Est. run time | Run time | Avg GFLOP/s |
|---|---|---|---|---|---|---|
| 1011108637 | de_nbody_orbit_fitting_10_23_2025_v193_OCS_north_MW2014__data__3_1762888230_1618 | 11 Nov 2025, 20:14:36 UTC | 743,470 | 19:01:28 | 00:00:02 | — |
| 1011109141 | de_nbody_10_27_2025_v193_OCS_north_MW2014__data__01_1762888230_2122 | 11 Nov 2025, 20:46:39 UTC | 93,901 | 02:24:10 | 00:27:22 | 57.19 |
| 1011118536 | de_nbody_orbit_fitting_10_23_2025_v193_OCS_north_MW2014__data__3_1762888230_11517 | 12 Nov 2025, 7:13:57 UTC | 992,244 | 1d 01:23:25 | 00:00:02 | — |
| 1011125009 | de_nbody_10_27_2025_v193_OCS_north_MW2014__data__01_1762888230_17990 | 12 Nov 2025, 14:52:50 UTC | 14,058 | 00:20:27 | 00:01:24 | 167.36 |
| 1011127295 | de_nbody_orbit_fitting_10_23_2025_v193_OCS_north_MW2014__data__3_1762888230_20276 | 12 Nov 2025, 17:54:14 UTC | 318,937 | 08:00:53 | 01:36:33 | 55.06 |
| 1011127296 | de_nbody_orbit_fitting_10_23_2025_v193_OCS_north_MW2014__data__3_1762888230_20277 | 12 Nov 2025, 17:54:14 UTC | 364,805 | 09:10:02 | 01:52:22 | 54.11 |
| 1011127322 | de_nbody_10_27_2025_v193_OCS_north_MW2014__data__01_1762888230_20303 | 12 Nov 2025, 17:54:54 UTC | 13,250 | 00:19:58 | 00:01:15 | 176.67 |
| 1011125579 | de_nbody_10_27_2025_v193_OCS_north_MW2014__data__01_1762888230_18560 | 12 Nov 2025, 15:52:54 UTC | 64,766 | 01:36:08 | 00:17:22 | 62.16 |
| 1011127324 | de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_20305 | 12 Nov 2025, 17:55:02 UTC | 23,655 | 00:35:39 | 00:07:04 | 55.79 |

I assume the estimated runtimes will be lower in general once the APR adjusts itself to the new estimates. But the tasks seem to be split into two groups (ignoring those tasks which nearly filled up my 1.2-day cache only to end after 2 seconds): one group of longer-running tasks and one group of "shorties", and as you can see from the calculated processing rates, one of those groups has a GFLOPs estimate that is too high or too low by a factor of about 3. If this can be corrected (and once the APR adjusts itself), I think the new estimates are going to be OK-ish.
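The "Avg GFLOP/s" column above is simply the estimated GFLOPs divided by the measured wall-clock runtime. A minimal sketch of that calculation, reusing two WUs from the table (nothing here is project code):

```python
def effective_gflops(est_gflop, runtime):
    """Effective processing rate: estimated GFLOP / measured runtime (s).

    runtime is an "HH:MM:SS" string (the "1d ..." format is not handled).
    """
    h, m, s = (int(x) for x in runtime.split(":"))
    return est_gflop / (h * 3600 + m * 60 + s)

# The ~3x gap between the two groups suggests one group's
# rsc_fpops_est is off by roughly that factor.
print(round(effective_gflops(93_901, "00:27:22"), 2))  # 57.19  (long task)
print(round(effective_gflops(14_058, "00:01:24"), 2))  # 167.36 (shorty)
```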
---
Joined: 30 Dec 14 · Posts: 35 · Credit: 911,354,348 · RAC: 29,368

Deleted.
---
Joined: 19 Jul 10 · Posts: 775 · Credit: 20,502,551 · RAC: 9,758

> one group of longer-running tasks and one group of "shorties", and as you can see from the calculated processing rates, one of those groups has a GFLOPs estimate that is too high or too low by a factor of about 3.

After checking some random short tasks, it seems like it might be some kind of exponential overestimation for short tasks, which starts to become visible on tasks running below 15-20 minutes on my system. Everything above those 15-20 minutes has an APR of 55-60 GFLOPs. The task in my previous post, which ran 17m22s, had a calculated APR of 62 GFLOPs, so slightly above; tasks running around 10 minutes are already at around 80-120 GFLOPs, and the real shorties (1-2 minutes run time) are somewhere in the range of 150-180 GFLOPs. But in general it gets better and better as the APR of the app adapts to the new estimates, so I think this should be good enough to keep the desired cache size, at least as long as there are not too many 2-second WUs in there.

Btw, any reason why you still didn't set initial replication to 2, considering that AFAICT every WU needs two results to validate? This would speed up validation a lot.
---
Joined: 1 Nov 10 · Posts: 18 · Credit: 2,335,992 · RAC: 6,149

What you say would be true if the vast majority of tasks required a third (or more) result to be returned and validated against the others. But the majority of tasks validate with only two results returned (on my computers, only about 2% appear to need more).

There's also an issue with the way MilkyWay categorises tasks: when the first result is returned it shows as "validation inconclusive", which is not right, since there is no returned result to validate against yet. Such tasks should be categorised as "validation pending". Only once two results have been returned and they fail to validate should "validation inconclusive" be applied, and thus a third task sent out. Sending out three tasks initially would mean about a third of returned work being "wasted", in that it would not be required in the validation process.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
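A conceptual sketch of the status lifecycle being argued for here (how the labels should behave, not MilkyWay's actual server code):

```python
def result_status(results_returned, quorum=2, agree=None):
    """'pending' until a quorum of results is back; 'inconclusive' only
    when a full quorum is back but the results disagree."""
    if results_returned < quorum:
        return "validation pending"      # nothing to compare against yet
    if agree:
        return "valid"                   # quorum reached, results match
    return "validation inconclusive"     # quorum reached, results disagree

print(result_status(1))                  # validation pending, not "inconclusive"
print(result_status(2, agree=True))      # valid
print(result_status(2, agree=False))     # validation inconclusive -> send a third
```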
---
Joined: 19 Jul 10 · Posts: 775 · Credit: 20,502,551 · RAC: 9,758

> Sending out three tasks initially would mean about a third of returned work being "wasted", in that it would not be required in the validation process.

I wrote they should send 2 tasks, not 3. They are currently sending just one, and that's not enough for validation. IIRC it was enough for most Separation WUs, so an initial replication of 1 was perfect there, but that part of the project finished a few years ago, and for N-body this setting seems wrong IMHO; at least I can't remember N-body WUs ever validating with just one result.
---
Joined: 1 Nov 10 · Posts: 18 · Credit: 2,335,992 · RAC: 6,149

Sorry, I was confused by having a number of tasks on my screen showing "initial replication" as 2 and 3. Looking at other tasks, I see that the real initial task shows "initial replication" as 1. This being the case, I agree with you: the project should move to the nomenclature and practice used by most other projects that require "proper" validation - send out 2 tasks in the first batch with "initial replication" set to 2, use the "validation pending" status correctly, and not abuse "validation inconclusive" when only one of the initial pair of tasks has been returned.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
---
Joined: 23 Aug 11 · Posts: 58 · Credit: 18,324,988 · RAC: 21,589

The issues with an initial replication of 1, and the incorrect use of "validation inconclusive" to lump actual inconclusives together with pendings, have existed all along though... What is new since this last change is the awfully long incorrect estimates. I have estimates of up to 14 days, while few WUs take more than one day (running single-threaded) and most take far less. Right now I'm looking at three running tasks with initial estimates of around a week that should be done in under a day. So the buffer, which so far tended to need to be kept low (because estimates could run short and a big buffer risked tasks timing out), now has to be set high so you don't risk running dry.
---
Joined: 1 Nov 10 · Posts: 18 · Credit: 2,335,992 · RAC: 6,149

Hmmm... I am seeing estimated runtimes in the same ballpark as the real runtimes - the disparity between our estimated vs. real runtimes is a bit of a concern. Perhaps it is time for MilkyWay to sit down, look at their server configuration and BOINC software, and cure these issues rather than leaving us to develop workarounds...

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
---
Joined: 19 Jul 10 · Posts: 775 · Credit: 20,502,551 · RAC: 9,758

I guess it's going to take some time for the estimates to adjust; mine are definitely getting better and better every day.

@Cavalary: Contrary to the original assumption, running MilkyWay in single-thread mode isn't the most efficient way to run it; you might want to check out my "Milkyway Nbody ST vs. MT: real benchmarking" thread. On my 5700G it's best to run 2x 7-thread tasks (I leave 2 threads free for iGPU feeding); we are talking about around 2-4x the work done per day, depending on the size of the WU. Since your 8700G has the same cache size, it should behave similarly.
---
Joined: 19 Jul 10 · Posts: 775 · Credit: 20,502,551 · RAC: 9,758

> The update to N-body version 1.94 will be happening tomorrow, November 18th at around 18:00 UTC. (...)

In case the new application is not compatible with WUs created for v1.93 (and those release notes suggest that it's not, or at least that it will generate different results), please make sure that resends of old WUs won't be assigned to the new version, like happened during the switch from v1.87 to v1.92.
---
Joined: 11 Sep 24 · Posts: 13 · Credit: 32,459 · RAC: 1,485

We should not have any issues with WUs this time around, since we are putting the new version on the nbody application, which has had no tasks for a while now. This update should hopefully be much less trouble than the last.

We are going to keep the initial replication for runs at 1 for the time being. We are doing this because we plan to make changes to improve our optimizations sometime soon, and as part of this we will not require all WUs to validate with two results (it will work like Separation used to).
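For readers wondering how single-result validation can work at all: one common BOINC approach is adaptive replication, where hosts with a long record of valid results are trusted to validate alone, while unproven hosts still get a quorum partner. The sketch below is a conceptual illustration of that idea (the thresholds and function are made up, not what MilkyWay has announced):

```python
def target_replication(host_valid, host_invalid,
                       min_history=100, max_error_rate=0.01):
    """Conceptual adaptive replication: trusted hosts validate alone,
    unproven or error-prone hosts still get a second result to compare."""
    total = host_valid + host_invalid
    if total >= min_history and host_invalid / total <= max_error_rate:
        return 1   # trusted host: accept its result without a partner
    return 2       # untrusted host: require a matching second result

print(target_replication(host_valid=500, host_invalid=1))  # 1
print(target_replication(host_valid=20, host_invalid=0))   # 2 (history too short)
```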
---
Joined: 19 Jul 10 · Posts: 775 · Credit: 20,502,551 · RAC: 9,758

Thanks for the info. There seems to be an issue with "Max # of threads for each MilkyWay@home task" for the new application. I have it set to 7, and for v1.93 I was getting only MT tasks using 7 threads, but for v1.94 I now got a single-thread task. I fixed this for myself, since BOINC can be a major PITA when it has to run a mix of ST and MT tasks, but maybe you can find the reason why the server sends ST tasks for Milkyway@home N-Body Simulation but not for Milkyway@home N-Body Simulation with Orbit Fitting when the user has set the preference to a specific value.

This is how I fixed it via app_config.xml (not tested yet, but it should work, as it's the same binary; BOINC Manager already shows it as a 7-thread WU):

```xml
<app_config>
  <app>
    <name>milkyway_nbody</name>
    <fraction_done_exact/>
  </app>
  <app>
    <name>milkyway_nbody_orbit_fitting</name>
    <fraction_done_exact/>
  </app>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <version_num>194</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>7.000000</avg_ncpus>
    <cmdline>--nthreads 7</cmdline>
  </app_version>
  <project_max_concurrent>16</project_max_concurrent>
</app_config>
```
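A note on applying a file like this (standard BOINC client behavior, not specific to this project): app_config.xml goes in the project's folder under the BOINC data directory (for MilkyWay that folder is typically named after the project URL, e.g. projects/milkyway.cs.rpi.edu_milkyway - an assumption about the exact name), and the client picks it up after Options → Read config files in BOINC Manager or a client restart. The <cmdline> setting only takes effect when a task process (re)starts.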
---
Joined: 19 Jul 10 · Posts: 775 · Credit: 20,502,551 · RAC: 9,758

OK, it's a different binary this time, so my method didn't work. Hopefully you can fix it on the server side.
---
Joined: 18 Feb 10 · Posts: 62 · Credit: 224,641,383 · RAC: 4,104

I had the same problem, though some time ago. So far I have fixed it by choosing "No limit" for max threads and using app_config.xml to set the number I want.
---
Joined: 19 Jul 10 · Posts: 775 · Credit: 20,502,551 · RAC: 9,758

Well, yes, there are ways to fix it for yourself, but we are supposed to report bugs here so they can fix it for everyone.