Message boards : News : Admin Updates Discussion
Keith Myers · Joined: 24 Jan 11 · Posts: 738 · Credit: 565,324,271 · RAC: 15,895
More of those: getting errors for the 1.90 applications for missing parameters.

<core_client_version>8.3.0</core_client_version>
<![CDATA[
<message>
process exited with code 13 (0xd, -243)</message>
<stderr_txt>
<search_application> milkyway_nbody 1.90 Linux x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 1 max threads on a system with 32 processors
Running MilkyWay@home Nbody v1.90
Optimal Softening Length = 0.010487539811713 kpc
Error evaluating NBodyCtx: [string "-- /* Copyright (c) 2016 - 2018 Siddhartha ..."]:108:
Unknown named argument 'PMCorrect'
Unknown named argument 'PMSigma'
Unknown named argument 'usePropMot'
3 bad named arguments found
Failed to read input parameters file
2025-06-27 05:19:47 (346627): called boinc_finish(13)
</stderr_txt>
]]>
GWGeorge007 · Joined: 6 Jan 18 · Posts: 18 · Credit: 91,076,455 · RAC: 13,393
> Getting errors for the 1.90 applications for missing parameters.

I'm also still getting errors for Nbody v1.92, or more correctly "Validate state : Invalid":
https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1017908109

Stderr output:
<core_client_version>8.3.0</core_client_version>
<![CDATA[
<stderr_txt>
<search_application> milkyway_nbody 1.92 Linux x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 4 max threads on a system with 32 processors
Running MilkyWay@home Nbody v1.92
Optimal Softening Length = 0.037574956190444 kpc
Dwarf Initial Position: [-40.554933916194180,-45.504290620411119,-8.264150885489647]
Dwarf Initial Velocity: [95.530208657967421,113.610439063046329,-59.959442932363203]
Initial LMC position: [115.779021398409526,773.941580420324840,-160.834750267893781]
Initial LMC velocity: [-10.181006980321788,-86.536892927247862,-2.421415618438683]
<search_likelihood>-661.279708036776128</search_likelihood>
<search_likelihood_EMD>-82.120914388430975</search_likelihood_EMD>
<search_likelihood_Mass>-55.044884148421936</search_likelihood_Mass>
<search_likelihood_Beta>-134.711993513170285</search_likelihood_Beta>
<search_likelihood_BetaAvg>-162.502110001586573</search_likelihood_BetaAvg>
<search_likelihood_VelAvg>-108.997646384525780</search_likelihood_VelAvg>
<search_likelihood_Dist>-117.902159600640516</search_likelihood_Dist>
2025-06-28 08:31:39 (2432949): called boinc_finish(0)
</stderr_txt>
]]>

George
Link · Joined: 19 Jul 10 · Posts: 775 · Credit: 20,504,381 · RAC: 9,820
> I'm also still getting errors for Nbody v1.92, or more correctly "Validate state : Invalid"

You have some invalids also from versions 1.90 and 1.87, but I guess nearly everyone gets some invalids here every now and then; I've got some too.
Link · Joined: 19 Jul 10 · Posts: 775 · Credit: 20,504,381 · RAC: 9,820
> > I'm also still getting errors for Nbody v1.92, or more correctly "Validate state : Invalid"
>
> You have some invalids also from versions 1.90 and 1.87, but I guess nearly everyone gets some invalids here every now and then; I've got some too.

I mean, just look at this WU: so far 3 different results (and v1.92 won't be able to finish it, but that's a different story). I've seen many WUs like that, too many to blame on unstable computers; it must be something with the application (perhaps sometimes getting different results from different CPUs?).
GWGeorge007 · Joined: 6 Jan 18 · Posts: 18 · Credit: 91,076,455 · RAC: 13,393
I'm continuing to get new invalid tasks today of the v1.92 N-Body Simulation with Orbit Fitting:
https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1018353718
(Stderr output is just too long to post)

George
GWGeorge007 · Joined: 6 Jan 18 · Posts: 18 · Credit: 91,076,455 · RAC: 13,393
Now that we have v1.93 available, and hopefully this one will work without any invalids or errors, should we delete any 'leftover' v1.92 tasks in our Linux task cache?

George
Cavalary · Joined: 23 Aug 11 · Posts: 58 · Credit: 18,326,123 · RAC: 21,278
Can definitely say that 1.93 did NOT fix the Windows/Linux differences. If anything, it worsened them: I had a few invalids with Linux computers validating each other before, now I get a couple a day. And it'd sure help with tracking things if "inconclusive" were just used for actual inconclusives, with the pending-validation WUs properly listed as pending.

But since I went through them one by one, let's see. These should also end up being considered invalid for me:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004524410
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004403249
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004361235
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004488161

These should validate for me and not for the Linux host:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004594868
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004530420

The odd one out at the moment is
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004488154
which has differences between my computer and another Windows one that has no oddities reported otherwise, just a few invalids when checked against 2 Linux hosts.

And then there's
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1003735350
on 1.90, where the other computer that reported it so far fell really foul of the resume bug, resuming 7 times with results way out of whack.
Joined: 11 Sep 24 · Posts: 13 · Credit: 32,581 · RAC: 1,457
We're looking into the source of these invalids. It looks like the cause can be narrowed down to one of the new features added in 1.92, which may be returning different results on Windows and Linux. We should be able to find the exact problem in the next few days.
Cavalary · Joined: 23 Aug 11 · Posts: 58 · Credit: 18,326,123 · RAC: 21,278
Any updates on this? It's quite a waste of computing power to have a couple of invalids per day because of this issue. And, in terms of the science, if the results differ depending on OS, how do you know which one is actually correct?
Link · Joined: 19 Jul 10 · Posts: 775 · Credit: 20,504,381 · RAC: 9,820
Perhaps there's some general "instability" in the application; de_nbody_07_02_2025_v193_OCS_north__data__01_1752499202_600831 now has 3 quite different results, all from Windows:

#1
<search_likelihood>-8.782164729492026</search_likelihood>
<search_likelihood_EMD>-7.013343962106111</search_likelihood_EMD>
<search_likelihood_Mass>-0.200299889044651</search_likelihood_Mass>
<search_likelihood_Beta>-1.568520878341263</search_likelihood_Beta>

#2
<search_likelihood>-6.230592315846064</search_likelihood>
<search_likelihood_EMD>-4.578218253136948</search_likelihood_EMD>
<search_likelihood_Mass>-0.017953695121116</search_likelihood_Mass>
<search_likelihood_Beta>-1.634420367588000</search_likelihood_Beta>

#3
<search_likelihood>-14.002150320864439</search_likelihood>
<search_likelihood_EMD>-11.407574843044049</search_likelihood_EMD>
<search_likelihood_Mass>-0.006884998195510</search_likelihood_Mass>
<search_likelihood_Beta>-2.587690479624881</search_likelihood_Beta>
Al · Joined: 16 Mar 10 · Posts: 218 · Credit: 110,420,422 · RAC: 3,848
Useful example, Link!

It's probably the same sort of instability that different GPUs (and CPUs) used to see on certain Separation tasks, typically (it seemed) when there were lots of near-zero values being processed -- cumulative differences in rounding behaviour, possibly aggravated by processed data blocks having different boundaries on different hardware?

One of those tasks used a single thread, one used 4 threads and one used 16 threads. That could well cause similar issues, and if that's the case I'm not sure there's much they can do about it :-(

As for which result is right -- if there are enough different work units all looking at the same data points with slightly different parameters a consensus might be possible; doesn't nBody work like that?

Cheers - Al.

[Edited - "tasks" -> "work units" in the last paragraph...]
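To make the thread-count point concrete, here is a minimal C sketch (illustrative only; none of this is MilkyWay@home source, and the file and function names are made up): an OpenMP reduction combines per-thread partial sums in an order that depends on the thread count, so the same data can yield results that differ in the last bits.

```c
/* reduction_demo.c -- minimal sketch, not MilkyWay@home code.
 * Summing the same array with different OpenMP thread counts changes
 * the order of the floating-point additions, and therefore (often)
 * the last bits of the rounded result.
 * Build: gcc -O2 -fopenmp reduction_demo.c -o reduction_demo
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static double sum_with_threads(const double *x, long n, int nthreads)
{
    double sum = 0.0;
    omp_set_num_threads(nthreads);
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += x[i];   /* per-thread partials, merged in an order that
                          depends on the thread count */
    return sum;
}

int main(void)
{
    const long n = 1000000;
    double *x = malloc(n * sizeof *x);
    for (long i = 0; i < n; i++)
        x[i] = 1.0 / (double)(i + 1);   /* wide range of magnitudes */

    /* Same data, same machine; only the thread count changes. */
    printf(" 1 thread : %.17g\n", sum_with_threads(x, n, 1));
    printf(" 4 threads: %.17g\n", sum_with_threads(x, n, 4));
    printf("16 threads: %.17g\n", sum_with_threads(x, n, 16));
    free(x);
    return 0;
}
```

The three sums typically agree to 14 or 15 significant digits but not the last one or two; over the millions of accumulations in a long N-body run, differences like that can plausibly grow into the likelihood spread quoted above.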
Link · Joined: 19 Jul 10 · Posts: 775 · Credit: 20,504,381 · RAC: 9,820
> One of those tasks used a single thread, one used 4 threads and one used 16 threads. That could well cause similar issues, and if that's the case I'm not sure there's much they can do about it :-(

Well, if running different numbers of threads results in different results, I'd call it a bug. And there's actually a quite simple solution to it: find out which is right and stick to that.

> As for which result is right -- if there are enough different work units all looking at the same data points with slightly different parameters a consensus might be possible; doesn't nBody work like that?

IIRC it does.

Regarding the Windows vs. Linux incompatibility: this might be the issue.
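For what it's worth, a user can pin a fixed thread count per task on their own machine with BOINC's app_config.xml. A sketch follows; the app_name matches the <search_application> string in the stderr outputs above, but the plan_class value and the --nthreads flag are my assumptions about this project's setup, not something confirmed in this thread:

```xml
<!-- app_config.xml, placed in the MilkyWay@home project directory.
     plan_class "mt" and the --nthreads flag are assumptions. -->
<app_config>
    <app_version>
        <app_name>milkyway_nbody</app_name>
        <plan_class>mt</plan_class>
        <avg_ncpus>4</avg_ncpus>
        <cmdline>--nthreads 4</cmdline>
    </app_version>
</app_config>
```

As the next post points out, though, precisely because users can set this however they like, the project can't rely on everyone running the same thread count.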
Bill F · Joined: 4 Jul 09 · Posts: 108 · Credit: 18,317,753 · RAC: 2,586
And it all has to be CPUs, as MilkyWay@home is no longer doing any GPU work.

Bill F
Al · Joined: 16 Mar 10 · Posts: 218 · Credit: 110,420,422 · RAC: 3,848
As my intent was to use Link's earlier post to try to answer Cavalary's post about "which result is right?", I think I need to flesh this out a bit more...

> > One of those tasks used a single thread, one used 4 threads and one used 16 threads. That could well cause similar issues, and if that's the case I'm not sure there's much they can do about it :-(
>
> Well, if running different numbers of threads results in different results, I'd call it a bug. And there's actually a quite simple solution to it: find out which is right and stick to that.

Bug versus "known to happen" issue in certain (less common?) situations?

By the way, picking a single multi-thread count in the case of BOINC OpenMP applications won't work, because a user can tweak the settings to get a different thread count if so inclined -- been there, done that when I needed to free up cores for something else...

I suspect that the folks who do massive "finite element" problems (such as weather applications) spend a lot of time either coding for (overlapping?) boundaries or ensuring that the problem is always cut up into sub-problems of a constant size; unfortunately, most BOINC projects that might use multiple threads probably can't put the resources into doing that :-( And that was why I asked the next question...

> > As for which result is right -- if there are enough different work units all looking at the same data points with slightly different parameters a consensus might be possible; doesn't nBody work like that?
>
> IIRC it does.

And that means that unless every work unit for a specific data point fails to validate, there will be a decision made as to the validity of the results; if that's the case, the main problem is user dissatisfaction, not scientific validity, I think...

> Regarding the Windows vs. Linux incompatibility: this might be the issue.

Yup, and that's why places like WCG use homogeneous redundancy; it may not stop problems completely (witness some recent issues with ARP1 and Darwin systems), but it tends to get rid of most problems that arise from compiler variations such as different code ordering and different instruction choices!

It's an interesting topic, and it usually comes down to development resources in the end...

Cheers - Al.

[Edited to re-organize slightly]
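The "sub-problems of a constant size" idea can be made concrete. In the C sketch below (illustrative only, not project code; deterministic_sum and BLOCK are invented names), partial sums are always computed over fixed-size blocks and combined in one fixed order, so the result is bit-identical for any thread count:

```c
/* Thread-count-independent reduction sketch: fixed-size blocks, each
 * summed sequentially, combined single-threaded in one fixed order. */
#include <stdlib.h>

#define BLOCK 4096   /* constant block size, independent of thread count */

double deterministic_sum(const double *x, long n)
{
    long nblocks = (n + BLOCK - 1) / BLOCK;
    double *bsum = calloc((size_t)nblocks, sizeof *bsum);

    /* Threads may pick up blocks in any order; each block's sum is
     * computed sequentially, so it is the same whichever thread runs it. */
    #pragma omp parallel for schedule(static)
    for (long b = 0; b < nblocks; b++) {
        long lo = b * BLOCK;
        long hi = lo + BLOCK < n ? lo + BLOCK : n;
        double s = 0.0;
        for (long i = lo; i < hi; i++)
            s += x[i];
        bsum[b] = s;
    }

    /* Combine in fixed block order, single-threaded. */
    double total = 0.0;
    for (long b = 0; b < nblocks; b++)
        total += bsum[b];
    free(bsum);
    return total;
}
```

This removes the thread-count sensitivity Al describes, though it does nothing about cross-compiler or cross-OS differences, which is where the homogeneous-redundancy point comes in.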
Link · Joined: 19 Jul 10 · Posts: 775 · Credit: 20,504,381 · RAC: 9,820
> Perhaps there's some general "instability" in the application; de_nbody_07_02_2025_v193_OCS_north__data__01_1752499202_600831 now has 3 quite different results, all from Windows:

And the winner is...

<search_likelihood>-6.230592315846064</search_likelihood>
<search_likelihood_EMD>-4.578218253136948</search_likelihood_EMD>
<search_likelihood_Mass>-0.017953695121116</search_likelihood_Mass>
<search_likelihood_Beta>-1.634420367588000</search_likelihood_Beta>

So an AMD Ryzen 9 5950X running 16 threads per task got the same result as an Intel Core Ultra 5 135U running 12 threads per task. So different CPUs and different numbers of threads were not an issue in this case.
Joined: 11 Sep 24 · Posts: 13 · Credit: 32,581 · RAC: 1,457
Unfortunately we have still not solved the issue with the invalids. It does seem that the issue arises from how the two versions are compiled, and I have been looking for a solution, but no luck yet. The case with three Windows machines disagreeing is rather interesting; all-Linux and all-Windows results generally agree, so this may be pointing towards something else.

As for the validity of the results, the good news is that the differences between OSes are relatively small, and a good result on one is still a good result on the other, so our optimization will be able to work as usual. If we do get a final result for a run that disagrees between OSes, it will still give us enough information to be useful at this time. Thankfully the issue is not affecting all results either, so we are not guaranteed to get a problematic result. This could definitely cause an issue for our final results, but we have some time to fix it before it's a major issue. Currently we believe the Linux results are the correct ones.

Hopefully we can get these annoying invalids out of the way soon!
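One concrete way two builds of identical source can disagree, offered purely as a speculative illustration (the thread does not identify the actual cause): fused multiply-add contraction. A compiler that contracts a*b + c into one FMA rounds once; a separate multiply and add round twice, and the results can differ in the last bit:

```c
/* fma_demo.c -- speculative illustration, not the confirmed cause of
 * the Windows/Linux mismatch.
 * Build: gcc fma_demo.c -lm -o fma_demo
 * (try -ffp-contract=off vs -ffp-contract=fast to see the effect)
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0 + 0x1p-27;   /* chosen so a*b needs more than 53 bits */
    double b = 1.0 + 0x1p-27;
    double c = -1.0;

    /* May be two roundings, or contracted to one FMA, depending on the
     * compiler and its flags (e.g. -ffp-contract on GCC, /fp on MSVC). */
    double maybe_contracted = a * b + c;
    double always_fused     = fma(a, b, c);   /* always a single rounding */

    printf("a*b + c      : %.17g\n", maybe_contracted);
    printf("fma(a, b, c) : %.17g\n", always_fused);
    return 0;
}
```

Differences like this, or differing math-library implementations of functions such as exp and log (note the "Crlibm" tag in the stderr outputs above; that library exists precisely to make such functions correctly rounded), can accumulate over a long simulation, which is consistent with the suspicion that the builds rather than the hardware are at fault.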
Link · Joined: 19 Jul 10 · Posts: 775 · Credit: 20,504,381 · RAC: 9,820
Thanks for the update.
Link · Joined: 19 Jul 10 · Posts: 775 · Credit: 20,504,381 · RAC: 9,820
> Hopefully we can get these annoying invalids out of the way soon!

Another thing that you might need to get out of the way: the 11616 Milkyway@home N-Body Simulation results that the server status page has been showing as in progress for a few weeks now.
Joined: 11 Sep 24 · Posts: 13 · Credit: 32,581 · RAC: 1,457
Thanks for pointing these out; it looks like these results have been in limbo going as far back as 2020. Having them up didn't do any harm, but we have set them to completed so they will stop showing up.

To give a brief update on the invalids issue: some of our improvements to the code seem to have had an effect on the Windows results. The results still do not match those found on Linux (which did not change), but these fixes should be at least a part of the solution.
Cavalary · Joined: 23 Aug 11 · Posts: 58 · Credit: 18,326,123 · RAC: 21,278
Seems like the issue where a task running while the system is shut down can produce an invalid result is back... Hadn't seen that in quite a while, but now I (belatedly) noticed:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1007593313
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1007596693

Seems like they were almost finished when I had to shut down, since they were sent about an hour after I booted back up. But that was a week ago. Huh, I didn't realize I hadn't checked since.