Admin Updates Discussion

Author	Message
Keith Myers Send message Joined: 24 Jan 11 Posts: 738 Credit: 565,324,271 RAC: 15,895	Message 77527 - Posted: 28 Jun 2025, 0:20:34 UTC - in response to Message 77522. More of those: de_nbody_orbit_fitting_06_25_2025_v192_OCS_lmc__data__01_1750887887_475 de_nbody_orbit_fitting_06_25_2025_v192_OCS_lmc__data__01_1750887887_6989 de_nbody_orbit_fitting_06_25_2025_v192_OCS_lmc__data__01_1750887887_6990 de_nbody_orbit_fitting_06_25_2025_v192_OCS_lmc__data__01_1750887887_7030 And also it was perhaps not a good idea to let v1.92 cruch tasks originally made for v1.87: de_nbody_orbit_fitting_03_25_2025_v186_OCS__data__32_1747416295_73267 de_nbody_orbit_fitting_03_25_2025_v186_OCS__data__32_1747416295_75085 The new application doesn't like them: <core_client_version>8.0.2</core_client_version> <![CDATA[ <message> The data is invalid. (0xd) - exit code 13 (0xd)</message> <stderr_txt> <search_application> milkyway_nbody 1.92 Windows x86_64 double OpenMP, Crlibm </search_application> Using OpenMP 8 max threads on a system with 8 processors Running MilkyWay@home Nbody v1.86 Optimal Softening Length = 0.000011993015169 kpc Error evaluating NBodyCtx: [string "-- /* Copyright (c) 2016 - 2018 Siddhartha ..."]:106: bad argument #1 to 'create' (Missing required named argument 'PMSigma') Failed to read input parameters file strftime() failed called boinc_finish(13) </stderr_txt> ]]> Getting errors for the 1.90 applications for missing parameters. <core_client_version>8.3.0</core_client_version> <![CDATA[ <message> process exited with code 13 (0xd, -243)</message> <stderr_txt> <search_application> milkyway_nbody 1.90 Linux x86_64 double OpenMP, Crlibm </search_application> Using OpenMP 1 max threads on a system with 32 processors Running MilkyWay@home Nbody v1.90 Optimal Softening Length = 0.010487539811713 kpc Error evaluating NBodyCtx: [string "-- /* Copyright (c) 2016 - 2018 Siddhartha ..."]:108: Unknown named argument 'PMCorrect' Unknown named argument 'PMSigma' Unknown named argument 'usePropMot' 3 bad named arguments found Failed to read input parameters file 2025-06-27 05:19:47 (346627): called boinc_finish(13) </stderr_txt> ]]> ID: 77527 · Rating: 0 · rate: / Reply Quote

GWGeorge007 Send message Joined: 6 Jan 18 Posts: 18 Credit: 91,076,455 RAC: 13,393	Message 77528 - Posted: 28 Jun 2025, 15:55:42 UTC - in response to Message 77527. Last modified: 28 Jun 2025, 16:01:02 UTC Getting errors for the 1.90 applications for missing parameters. <core_client_version>8.3.0</core_client_version> <![CDATA[ <message> process exited with code 13 (0xd, -243)</message> <stderr_txt> <search_application> milkyway_nbody 1.90 Linux x86_64 double OpenMP, Crlibm </search_application> Using OpenMP 1 max threads on a system with 32 processors Running MilkyWay@home Nbody v1.90 Optimal Softening Length = 0.010487539811713 kpc Error evaluating NBodyCtx: [string "-- /* Copyright (c) 2016 - 2018 Siddhartha ..."]:108: Unknown named argument 'PMCorrect' Unknown named argument 'PMSigma' Unknown named argument 'usePropMot' 3 bad named arguments found Failed to read input parameters file 2025-06-27 05:19:47 (346627): called boinc_finish(13) </stderr_txt> ]]> I'm also still getting errors for Nbody v1.92, or more correctly "Validate state : Invalid" https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1017908109 Stderr output <core_client_version>8.3.0</core_client_version> <![CDATA[ <stderr_txt> <search_application> milkyway_nbody 1.92 Linux x86_64 double OpenMP, Crlibm </search_application> Using OpenMP 4 max threads on a system with 32 processors Running MilkyWay@home Nbody v1.92 Optimal Softening Length = 0.037574956190444 kpc Dwarf Initial Position: [-40.554933916194180,-45.504290620411119,-8.264150885489647] Dwarf Initial Velocity: [95.530208657967421,113.610439063046329,-59.959442932363203] Initial LMC position: [115.779021398409526,773.941580420324840,-160.834750267893781] Initial LMC velocity: [-10.181006980321788,-86.536892927247862,-2.421415618438683] <search_likelihood>-661.279708036776128</search_likelihood> <search_likelihood_EMD>-82.120914388430975</search_likelihood_EMD> <search_likelihood_Mass>-55.044884148421936</search_likelihood_Mass> <search_likelihood_Beta>-134.711993513170285</search_likelihood_Beta> <search_likelihood_BetaAvg>-162.502110001586573</search_likelihood_BetaAvg> <search_likelihood_VelAvg>-108.997646384525780</search_likelihood_VelAvg> <search_likelihood_Dist>-117.902159600640516</search_likelihood_Dist> 2025-06-28 08:31:39 (2432949): called boinc_finish(0) </stderr_txt> ]]> George ID: 77528 · Rating: 0 · rate: / Reply Quote

Link Send message Joined: 19 Jul 10 Posts: 775 Credit: 20,504,381 RAC: 9,820	Message 77529 - Posted: 28 Jun 2025, 17:59:22 UTC - in response to Message 77528. I'm also still getting errors for Nbody v1.92, or more correctly "Validate state : Invalid" You have some invalids also from version 1.90 and 1.87, but I guess nearly everyone gets here some invalids every now and than, I've got some too. ID: 77529 · Rating: 0 · rate: / Reply Quote

Link Send message Joined: 19 Jul 10 Posts: 775 Credit: 20,504,381 RAC: 9,820	Message 77530 - Posted: 29 Jun 2025, 9:32:03 UTC - in response to Message 77529. I'm also still getting errors for Nbody v1.92, or more correctly "Validate state : Invalid" You have some invalids also from version 1.90 and 1.87, but I guess nearly everyone gets here some invalids every now and than, I've got some too. I mean, just look at this WU, so far 3 different results (and v1.92 won't be able to finish it, but that's a different story). I've seen many WUs like that, too many to blame unstable computers, it must be something with the application (perhaps getting sometimes different result from different CPUs?). ID: 77530 · Rating: 0 · rate: / Reply Quote

GWGeorge007 Send message Joined: 6 Jan 18 Posts: 18 Credit: 91,076,455 RAC: 13,393	Message 77533 - Posted: 2 Jul 2025, 16:59:11 UTC Last modified: 2 Jul 2025, 17:02:19 UTC I'm continuing to get new invalid tasks today of the v1.92 N-Body Simulation with Orbit Fitting https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1018353718 ( Stderr output is just to long to post) George ID: 77533 · Rating: 0 · rate: / Reply Quote

GWGeorge007 Send message Joined: 6 Jan 18 Posts: 18 Credit: 91,076,455 RAC: 13,393	Message 77536 - Posted: 2 Jul 2025, 21:17:09 UTC Now that we have v1.93 available, and hopefully this one will work without any invalids or errors, should we delete any 'leftover' v1.92 tasks in our Linux task cache? George ID: 77536 · Rating: 0 · rate: / Reply Quote

Cavalary Send message Joined: 23 Aug 11 Posts: 58 Credit: 18,326,123 RAC: 21,278	Message 77549 - Posted: 8 Jul 2025, 0:58:47 UTC Can definitely say that 1.93 did NOT fix the Windows/Linux differences. If anything, it worsened them, since I had a few invalids with Linux computers validating each other before, now I get a couple a day. And it'd sure help tracking things if inconclusive would just be used for actual inconclusives, with the pending validation WUs properly listed as pending. But since I went through them one by one, let's see. These should also end up being considered invalid for me: https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004524410 https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004403249 https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004361235 https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004488161 These should validate for me and not for the Linux host: https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004594868 https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004530420 The odd one out at the moment is https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004488154 which has differences between my computer and another Windows one that has no oddities reported otherwise, just a few invalids when checked against 2 Linux hosts. And then there's https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1003735350 that's on 1.90 where the other computer that reported it so far fell really foul of the resume bug, resuming 7 times and results way out of whack. ID: 77549 · Rating: 0 · rate: / Reply Quote

gimmyk Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 11 Sep 24 Posts: 13 Credit: 32,581 RAC: 1,457	Message 77552 - Posted: 9 Jul 2025, 21:51:44 UTC We're looking into the source of these invalids. It looks like the cause can be narrowed down to one of the new features added in 1.92 which may be returning different results on windows and linux. We should be able to find the exact problem in the next few days. ID: 77552 · Rating: 0 · rate: / Reply Quote

Cavalary Send message Joined: 23 Aug 11 Posts: 58 Credit: 18,326,123 RAC: 21,278	Message 77569 - Posted: 1 Aug 2025, 1:05:30 UTC - in response to Message 77552. Any updates on this? Quite a waste of computing power to have a couple of invalids per day because of the issue. And, in terms of the science, if the results differ depending on OS, how do you know what the actual correct one is? ID: 77569 · Rating: 0 · rate: / Reply Quote

Link Send message Joined: 19 Jul 10 Posts: 775 Credit: 20,504,381 RAC: 9,820	Message 77570 - Posted: 1 Aug 2025, 20:54:42 UTC Eventually there's some general "instability" in the application, de_nbody_07_02_2025_v193_OCS_north__data__01_1752499202_600831 has now 3 quite different results, all from Windows: #1 <search_likelihood>-8.782164729492026</search_likelihood> <search_likelihood_EMD>-7.013343962106111</search_likelihood_EMD> <search_likelihood_Mass>-0.200299889044651</search_likelihood_Mass> <search_likelihood_Beta>-1.568520878341263</search_likelihood_Beta> #2 <search_likelihood>-6.230592315846064</search_likelihood> <search_likelihood_EMD>-4.578218253136948</search_likelihood_EMD> <search_likelihood_Mass>-0.017953695121116</search_likelihood_Mass> <search_likelihood_Beta>-1.634420367588000</search_likelihood_Beta> #3 <search_likelihood>-14.002150320864439</search_likelihood> <search_likelihood_EMD>-11.407574843044049</search_likelihood_EMD> <search_likelihood_Mass>-0.006884998195510</search_likelihood_Mass> <search_likelihood_Beta>-2.587690479624881</search_likelihood_Beta> ID: 77570 · Rating: 0 · rate: / Reply Quote

alanb1951 Send message Joined: 16 Mar 10 Posts: 218 Credit: 110,420,422 RAC: 3,848	Message 77571 - Posted: 1 Aug 2025, 23:38:49 UTC - in response to Message 77570. Last modified: 1 Aug 2025, 23:45:02 UTC Useful example, link! It's probably the same sort of instability that different GPUs (and CPUs) used to see on certain Separation tasks, typically (it seemed) when there were lots of near-zero values being processed -- cumulative differences in rounding behaviour, possibly aggravated by processed data blocks having different boundaries on different hardware? One of those tasks used a single thread, one used 4 threads and one used 16 threads. That could well cause similar issues, and if that's the case I'm not sure there's much they can do about it :-( As for which result is right -- if there are enough different work units all looking at the same data points with slightly different parameters a consensus might be possible; doesn't nBody work like that? Cheers - Al. [Edited - "tasks" -> "work units" in the last paragraph...] ID: 77571 · Rating: 0 · rate: / Reply Quote

Link Send message Joined: 19 Jul 10 Posts: 775 Credit: 20,504,381 RAC: 9,820	Message 77572 - Posted: 2 Aug 2025, 11:53:05 UTC - in response to Message 77571. One of those tasks used a single thread, one used 4 threads and one used 16 threads. That could well cause similar issues, and if that's the case I'm not sure there's much they can do about it :-( Well, if running different amount of threads results in different results, I'd call it a bug. And there's actually quite simple solution to it: find out which is right and stick to that. As for which result is right -- if there are enough different work units all looking at the same data points with slightly different parameters a consensus might be possible; doesn't nBody work like that? IIRC it does. Regarding the Windows vs. Linux incompatibility: this might be the issue. ID: 77572 · Rating: 0 · rate: / Reply Quote

Bill F Send message Joined: 4 Jul 09 Posts: 108 Credit: 18,317,753 RAC: 2,586	Message 77574 - Posted: 2 Aug 2025, 23:09:49 UTC And it all has to be CPU's as Milyways is no longer doing any GPU work. Bill F ID: 77574 · Rating: 0 · rate: / Reply Quote

alanb1951 Send message Joined: 16 Mar 10 Posts: 218 Credit: 110,420,422 RAC: 3,848	Message 77575 - Posted: 3 Aug 2025, 3:05:16 UTC - in response to Message 77572. Last modified: 3 Aug 2025, 3:30:14 UTC As my intent was to use Link's earlier post to try to answer Cavalary's post about "which result is right?", I think I need to flesh this out a bit more... One of those tasks used a single thread, one used 4 threads and one used 16 threads. That could well cause similar issues, and if that's the case I'm not sure there's much they can do about it :-( Well, if running different amount of threads results in different results, I'd call it a bug. And there's actually quite simple solution to it: find out which is right and stick to that. Bug versus "known to happen" issue in certain (less common?) situations? By the way, picking a single multi-thread count in the case of BOINC OpenMP applications won't work because a user can tweak the settings to get a different thread count if so inclined -- been there, done that when I needed to free up cores for something else... I suspect that the folks who do massive "finite element" problems (such as weather applications) spend a lot of time either coding for (overlapping?) boundaries or ensuring that the problem is always cut up into sub-problems of a constant size; unfortunately, most BOINC projects that might use multiple threads probably can't put the resources into doing that :-( And that was why I asked the next question... As for which result is right -- if there are enough different work units all looking at the same data points with slightly different parameters a consensus might be possible; doesn't nBody work like that? IIRC it does. And that means that unless every work-unit for a specific data point fails to validate, there will be a decision made as to the validity of the results; if that's the case, the main problem is user dissatisfaction, not scientific validity, I think... Regarding the Windows vs. Linux incompatibility: this might be the issue. Yup, and that's why places like WCG use homogeneous redundancy; it may not stop problems completely (witness some recent issues with ARP1 and Darwin systems), but it tends to get rid of most problems that arise from compiler variations such as different code ordering and different instruction choices! It's an interesting topic, and it usually comes down to development resources in the end... Cheers - Al. [Edited to re-organize slightly] ID: 77575 · Rating: 0 · rate: / Reply Quote

Link Send message Joined: 19 Jul 10 Posts: 775 Credit: 20,504,381 RAC: 9,820	Message 77581 - Posted: 5 Aug 2025, 16:08:09 UTC - in response to Message 77570. Eventually there's some general "instability" in the application, de_nbody_07_02_2025_v193_OCS_north__data__01_1752499202_600831 has now 3 quite different results, all from Windows: #1 <search_likelihood>-8.782164729492026</search_likelihood> <search_likelihood_EMD>-7.013343962106111</search_likelihood_EMD> <search_likelihood_Mass>-0.200299889044651</search_likelihood_Mass> <search_likelihood_Beta>-1.568520878341263</search_likelihood_Beta> #2 <search_likelihood>-6.230592315846064</search_likelihood> <search_likelihood_EMD>-4.578218253136948</search_likelihood_EMD> <search_likelihood_Mass>-0.017953695121116</search_likelihood_Mass> <search_likelihood_Beta>-1.634420367588000</search_likelihood_Beta> #3 <search_likelihood>-14.002150320864439</search_likelihood> <search_likelihood_EMD>-11.407574843044049</search_likelihood_EMD> <search_likelihood_Mass>-0.006884998195510</search_likelihood_Mass> <search_likelihood_Beta>-2.587690479624881</search_likelihood_Beta> And the winner is... <search_likelihood>-6.230592315846064</search_likelihood> <search_likelihood_EMD>-4.578218253136948</search_likelihood_EMD> <search_likelihood_Mass>-0.017953695121116</search_likelihood_Mass> <search_likelihood_Beta>-1.634420367588000</search_likelihood_Beta> So an AMD Ryzen 9 5950X running 16 threads per task got the same result as an Intel Core Ultra 5 135U running 12 threads per task. So different CPUs and different amount of threads were not an issue in this case. ID: 77581 · Rating: 0 · rate: / Reply Quote

gimmyk Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 11 Sep 24 Posts: 13 Credit: 32,581 RAC: 1,457	Message 77585 - Posted: 10 Aug 2025, 0:30:51 UTC Unfortunately we have still not solved the issue with the invalids. It does seem that the issue arises from how the two different versions are compiled and I have been looking for a solution, but no luck yet. The case with three windows machines disagreeing is rather interesting. All linux and all windows results generally agree, so this may be pointing towards something else. As for the validity of the results, the good news is that the differences between OS are relatively small and a good result on one is still a good result on the other, so our optimization will be able to work as usual. If we do get a final result for a run that disagrees between OS, it will still give us enough useful information to be useful at this time. Thankfully the issue is not affecting all results either, so we are not guaranteed to get a problematic result. This could definitely cause an issue for our final results, but we have some time to fix it before its a major issue. Currently we believe the linux results are the correct ones. Hopefully we can get these annoying invalids out of the way soon! ID: 77585 · Rating: 0 · rate: / Reply Quote

Link Send message Joined: 19 Jul 10 Posts: 775 Credit: 20,504,381 RAC: 9,820	Message 77588 - Posted: 10 Aug 2025, 13:42:43 UTC - in response to Message 77585. Thanks for the update. ID: 77588 · Rating: 0 · rate: / Reply Quote

Link Send message Joined: 19 Jul 10 Posts: 775 Credit: 20,504,381 RAC: 9,820	Message 77640 - Posted: 8 Sep 2025, 13:36:45 UTC - in response to Message 77585. Hopefully we can get these annoying invalids out of the way soon! Another thing, that you might need to get out of the way are the 11616 Milkyway@home N-Body Simulation results, that the server staus page shows as in progress since few weeks. ID: 77640 · Rating: 0 · rate: / Reply Quote

gimmyk Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 11 Sep 24 Posts: 13 Credit: 32,581 RAC: 1,457	Message 77641 - Posted: 8 Sep 2025, 21:16:38 UTC - in response to Message 77640. Thanks for pointing these out, it looks like these results have been in limbo going as far back as 2020. Having them up didn't do any harm, but we have set them to completed so they will stop showing up. To give a brief update on the invalids issue: some of our improvements to the code seem to have had an effect on the windows results. The results still do not match with those found on linux (which did not change) but these fixes should be at least a part of the solution. ID: 77641 · Rating: 0 · rate: / Reply Quote

Cavalary Send message Joined: 23 Aug 11 Posts: 58 Credit: 18,326,123 RAC: 21,278	Message 77643 - Posted: 20 Sep 2025, 1:49:47 UTC Seems like the potential invalid result for a task running when the system is shut down issue is back... Hadn't seen that in quite a while, but now I (belatedly) noticed https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1007593313 https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1007596693 Seems like they were almost finished when I had to shut down, since they were sent about an hour after I booted back up. But that was a week ago. Huh, didn't think I hadn't checked since. ID: 77643 · Rating: 0 · rate: / Reply Quote