Welcome to MilkyWay@home

Admin Updates Discussion

Message boards : News : Admin Updates Discussion


Keith Myers
Joined: 24 Jan 11
Posts: 738
Credit: 565,321,565
RAC: 15,744
Message 77527 - Posted: 28 Jun 2025, 0:20:34 UTC - in response to Message 77522.  

More of those:
de_nbody_orbit_fitting_06_25_2025_v192_OCS_lmc__data__01_1750887887_475
de_nbody_orbit_fitting_06_25_2025_v192_OCS_lmc__data__01_1750887887_6989
de_nbody_orbit_fitting_06_25_2025_v192_OCS_lmc__data__01_1750887887_6990
de_nbody_orbit_fitting_06_25_2025_v192_OCS_lmc__data__01_1750887887_7030


And it was also perhaps not a good idea to let v1.92 crunch tasks originally made for v1.86:
de_nbody_orbit_fitting_03_25_2025_v186_OCS__data__32_1747416295_73267
de_nbody_orbit_fitting_03_25_2025_v186_OCS__data__32_1747416295_75085

The new application doesn't like them:
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The data is invalid.
 (0xd) - exit code 13 (0xd)</message>
<stderr_txt>
<search_application> milkyway_nbody 1.92 Windows x86_64 double  OpenMP, Crlibm </search_application>
Using OpenMP 8 max threads on a system with 8 processors
Running MilkyWay@home Nbody v1.86
Optimal Softening Length = 0.000011993015169 kpc
Error evaluating NBodyCtx: [string "-- /* Copyright (c) 2016 - 2018 Siddhartha ..."]:106: bad argument #1 to 'create' (Missing required named argument 'PMSigma') 
Failed to read input parameters file
strftime() failed called boinc_finish(13)

</stderr_txt>
]]>


Getting errors for the 1.90 applications for missing parameters.

<core_client_version>8.3.0</core_client_version>
<![CDATA[
<message>
process exited with code 13 (0xd, -243)</message>
<stderr_txt>
<search_application> milkyway_nbody 1.90 Linux x86_64 double  OpenMP, Crlibm </search_application>
Using OpenMP 1 max threads on a system with 32 processors
Running MilkyWay@home Nbody v1.90
Optimal Softening Length = 0.010487539811713 kpc
Error evaluating NBodyCtx: [string "-- /* Copyright (c) 2016 - 2018 Siddhartha ..."]:108: 
  Unknown named argument 'PMCorrect'
  Unknown named argument 'PMSigma'
  Unknown named argument 'usePropMot'
  3 bad named arguments found 
Failed to read input parameters file
2025-06-27 05:19:47 (346627): called boinc_finish(13)

</stderr_txt>
]]>

GWGeorge007
Joined: 6 Jan 18
Posts: 18
Credit: 91,075,329
RAC: 13,406
Message 77528 - Posted: 28 Jun 2025, 15:55:42 UTC - in response to Message 77527.  
Last modified: 28 Jun 2025, 16:01:02 UTC

Getting errors for the 1.90 applications for missing parameters.

<core_client_version>8.3.0</core_client_version>
<![CDATA[
<message>
process exited with code 13 (0xd, -243)</message>
<stderr_txt>
<search_application> milkyway_nbody 1.90 Linux x86_64 double  OpenMP, Crlibm </search_application>
Using OpenMP 1 max threads on a system with 32 processors
Running MilkyWay@home Nbody v1.90
Optimal Softening Length = 0.010487539811713 kpc
Error evaluating NBodyCtx: [string "-- /* Copyright (c) 2016 - 2018 Siddhartha ..."]:108: 
  Unknown named argument 'PMCorrect'
  Unknown named argument 'PMSigma'
  Unknown named argument 'usePropMot'
  3 bad named arguments found 
Failed to read input parameters file
2025-06-27 05:19:47 (346627): called boinc_finish(13)

</stderr_txt>
]]>


I'm also still getting errors for Nbody v1.92, or more correctly "Validate state : Invalid"
https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1017908109

Stderr output

<core_client_version>8.3.0</core_client_version>
<![CDATA[
<stderr_txt>
<search_application> milkyway_nbody 1.92 Linux x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 4 max threads on a system with 32 processors
Running MilkyWay@home Nbody v1.92
Optimal Softening Length = 0.037574956190444 kpc
Dwarf Initial Position: [-40.554933916194180,-45.504290620411119,-8.264150885489647]
Dwarf Initial Velocity: [95.530208657967421,113.610439063046329,-59.959442932363203]
Initial LMC position: [115.779021398409526,773.941580420324840,-160.834750267893781]
Initial LMC velocity: [-10.181006980321788,-86.536892927247862,-2.421415618438683]
<search_likelihood>-661.279708036776128</search_likelihood>
<search_likelihood_EMD>-82.120914388430975</search_likelihood_EMD>
<search_likelihood_Mass>-55.044884148421936</search_likelihood_Mass>
<search_likelihood_Beta>-134.711993513170285</search_likelihood_Beta>
<search_likelihood_BetaAvg>-162.502110001586573</search_likelihood_BetaAvg>
<search_likelihood_VelAvg>-108.997646384525780</search_likelihood_VelAvg>
<search_likelihood_Dist>-117.902159600640516</search_likelihood_Dist>
2025-06-28 08:31:39 (2432949): called boinc_finish(0)

</stderr_txt>
]]>
George
Link
Joined: 19 Jul 10
Posts: 775
Credit: 20,504,165
RAC: 9,827
Message 77529 - Posted: 28 Jun 2025, 17:59:22 UTC - in response to Message 77528.  

I'm also still getting errors for Nbody v1.92, or more correctly "Validate state : Invalid"
You have some invalids also from versions 1.90 and 1.87, but I guess nearly everyone gets some invalids here every now and then; I've got some too.
Link
Joined: 19 Jul 10
Posts: 775
Credit: 20,504,165
RAC: 9,827
Message 77530 - Posted: 29 Jun 2025, 9:32:03 UTC - in response to Message 77529.  

I'm also still getting errors for Nbody v1.92, or more correctly "Validate state : Invalid"
You have some invalids also from versions 1.90 and 1.87, but I guess nearly everyone gets some invalids here every now and then; I've got some too.
I mean, just look at this WU: so far 3 different results (and v1.92 won't be able to finish it, but that's a different story). I've seen many WUs like that, too many to blame on unstable computers; it must be something in the application (perhaps sometimes getting different results from different CPUs?).
GWGeorge007
Joined: 6 Jan 18
Posts: 18
Credit: 91,075,329
RAC: 13,406
Message 77533 - Posted: 2 Jul 2025, 16:59:11 UTC
Last modified: 2 Jul 2025, 17:02:19 UTC

I'm continuing to get new invalid tasks today from the v1.92 N-Body Simulation with Orbit Fitting:

https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1018353718

(Stderr output is just too long to post)
George
GWGeorge007
Joined: 6 Jan 18
Posts: 18
Credit: 91,075,329
RAC: 13,406
Message 77536 - Posted: 2 Jul 2025, 21:17:09 UTC

Now that we have v1.93 available, and hopefully this one will work without any invalids or errors, should we delete any 'leftover' v1.92 tasks in our Linux task cache?
George
Cavalary
Joined: 23 Aug 11
Posts: 58
Credit: 18,325,956
RAC: 21,475
Message 77549 - Posted: 8 Jul 2025, 0:58:47 UTC

Can definitely say that 1.93 did NOT fix the Windows/Linux differences. If anything, it worsened them: before, I had a few invalids where Linux computers validated each other; now I get a couple a day.

And it'd sure help with tracking things if "inconclusive" were used just for actual inconclusives, with the pending-validation WUs properly listed as pending.
But since I went through them one by one, let's see.
These should also end up being considered invalid for me:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004524410
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004403249
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004361235
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004488161
These should validate for me and not for the Linux host:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004594868
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004530420
The odd one out at the moment is https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1004488154 which has differences between my computer and another Windows one that has no oddities reported otherwise, just a few invalids when checked against 2 Linux hosts.
And then there's https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1003735350, which is on 1.90, where the other computer that reported it so far fell really foul of the resume bug, resuming 7 times with results way out of whack.
gimmyk
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 11 Sep 24
Posts: 13
Credit: 32,581
RAC: 1,457
Message 77552 - Posted: 9 Jul 2025, 21:51:44 UTC

We're looking into the source of these invalids. It looks like the cause can be narrowed down to one of the new features added in 1.92, which may be returning different results on Windows and Linux. We should be able to find the exact problem in the next few days.
Cavalary
Joined: 23 Aug 11
Posts: 58
Credit: 18,325,956
RAC: 21,475
Message 77569 - Posted: 1 Aug 2025, 1:05:30 UTC - in response to Message 77552.  

Any updates on this? Quite a waste of computing power to have a couple of invalids per day because of the issue. And, in terms of the science, if the results differ depending on OS, how do you know what the actual correct one is?
Link
Joined: 19 Jul 10
Posts: 775
Credit: 20,504,165
RAC: 9,827
Message 77570 - Posted: 1 Aug 2025, 20:54:42 UTC

Possibly there's some general "instability" in the application: de_nbody_07_02_2025_v193_OCS_north__data__01_1752499202_600831 now has 3 quite different results, all from Windows:

#1
<search_likelihood>-8.782164729492026</search_likelihood>
<search_likelihood_EMD>-7.013343962106111</search_likelihood_EMD>
<search_likelihood_Mass>-0.200299889044651</search_likelihood_Mass>
<search_likelihood_Beta>-1.568520878341263</search_likelihood_Beta>

#2
<search_likelihood>-6.230592315846064</search_likelihood>
<search_likelihood_EMD>-4.578218253136948</search_likelihood_EMD>
<search_likelihood_Mass>-0.017953695121116</search_likelihood_Mass>
<search_likelihood_Beta>-1.634420367588000</search_likelihood_Beta>

#3
<search_likelihood>-14.002150320864439</search_likelihood>
<search_likelihood_EMD>-11.407574843044049</search_likelihood_EMD>
<search_likelihood_Mass>-0.006884998195510</search_likelihood_Mass>
<search_likelihood_Beta>-2.587690479624881</search_likelihood_Beta>

alanb1951

Joined: 16 Mar 10
Posts: 218
Credit: 110,420,422
RAC: 3,848
Message 77571 - Posted: 1 Aug 2025, 23:38:49 UTC - in response to Message 77570.  
Last modified: 1 Aug 2025, 23:45:02 UTC

Useful example, Link!

It's probably the same sort of instability that different GPUs (and CPUs) used to see on certain Separation tasks, typically (it seemed) when there were lots of near-zero values being processed -- cumulative differences in rounding behaviour, possibly aggravated by processed data blocks having different boundaries on different hardware?

One of those tasks used a single thread, one used 4 threads and one used 16 threads. That could well cause similar issues, and if that's the case I'm not sure there's much they can do about it :-(
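
That effect is easy to reproduce: floating-point addition isn't associative, so an OpenMP-style reduction that splits the data into per-thread slices can round differently depending on the thread count. A toy Python sketch (purely illustrative, not the actual N-body code):

```python
def chunked_sum(xs, nthreads):
    """Mimic an OpenMP parallel-for reduction: each 'thread' sums its own
    contiguous slice sequentially, then the partial sums are combined."""
    n = len(xs)
    bounds = [i * n // nthreads for i in range(nthreads + 1)]
    partials = []
    for i in range(nthreads):
        s = 0.0
        for x in xs[bounds[i]:bounds[i + 1]]:
            s += x
        partials.append(s)
    total = 0.0
    for p in partials:
        total += p
    return total

# Values chosen so the grouping changes which low-order bits get rounded
# away (the spacing between adjacent doubles near 1e16 is 2.0):
xs = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(xs, 1))  # 1.0 -- fully sequential
print(chunked_sum(xs, 2))  # 0.0 -- both 1.0s are absorbed into the partials
```

Real likelihood sums won't differ this dramatically in one step, but over millions of operations the last-bit differences accumulate.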

As for which result is right -- if there are enough different work units all looking at the same data points with slightly different parameters a consensus might be possible; doesn't nBody work like that?

Cheers - Al.

[Edited - "tasks" -> "work units" in the last paragraph...]
Link
Joined: 19 Jul 10
Posts: 775
Credit: 20,504,165
RAC: 9,827
Message 77572 - Posted: 2 Aug 2025, 11:53:05 UTC - in response to Message 77571.  

One of those tasks used a single thread, one used 4 threads and one used 16 threads. That could well cause similar issues, and if that's the case I'm not sure there's much they can do about it :-(
Well, if running a different number of threads produces different results, I'd call it a bug. And there's actually a quite simple solution to it: find out which is right and stick to that.


As for which result is right -- if there are enough different work units all looking at the same data points with slightly different parameters a consensus might be possible; doesn't nBody work like that?
IIRC it does.


Regarding the Windows vs. Linux incompatibility: this might be the issue.
Bill F
Joined: 4 Jul 09
Posts: 108
Credit: 18,317,753
RAC: 2,586
Message 77574 - Posted: 2 Aug 2025, 23:09:49 UTC

And it all has to be CPUs, as MilkyWay@home is no longer doing any GPU work.

Bill F
alanb1951

Joined: 16 Mar 10
Posts: 218
Credit: 110,420,422
RAC: 3,848
Message 77575 - Posted: 3 Aug 2025, 3:05:16 UTC - in response to Message 77572.  
Last modified: 3 Aug 2025, 3:30:14 UTC

As my intent was to use Link's earlier post to try to answer Cavalary's post about "which result is right?", I think I need to flesh this out a bit more...

One of those tasks used a single thread, one used 4 threads and one used 16 threads. That could well cause similar issues, and if that's the case I'm not sure there's much they can do about it :-(
Well, if running a different number of threads produces different results, I'd call it a bug. And there's actually a quite simple solution to it: find out which is right and stick to that.
Bug versus "known to happen" issue in certain (less common?) situations?

By the way, picking a single multi-thread count in the case of BOINC OpenMP applications won't work because a user can tweak the settings to get a different thread count if so inclined -- been there, done that when I needed to free up cores for something else...

I suspect that the folks who do massive "finite element" problems (such as weather applications) spend a lot of time either coding for (overlapping?) boundaries or ensuring that the problem is always cut up into sub-problems of a constant size; unfortunately, most BOINC projects that might use multiple threads probably can't put the resources into doing that :-(
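
A sketch of that "constant sub-problem size" idea (a hypothetical Python illustration, not the project's code): if every fixed-size block gets its own partial sum and the partials are always combined in block order, the result no longer depends on how many threads do the work.

```python
def blocked_sum(xs, nthreads, block=4):
    """Sum with fixed-size blocks: each 'thread' computes one partial per
    whole block it owns, and partials are combined in block order, so the
    rounding pattern is identical for any thread count."""
    nblocks = (len(xs) + block - 1) // block
    partials = [0.0] * nblocks
    for tid in range(nthreads):              # round-robin block ownership
        for b in range(tid, nblocks, nthreads):
            s = 0.0
            for x in xs[b * block:(b + 1) * block]:
                s += x
            partials[b] = s
    total = 0.0
    for p in partials:                       # fixed combine order
        total += p
    return total

xs = [1e16, 1.0, -1e16, 1.0]
# Identical answer no matter how many "threads" run:
print(blocked_sum(xs, 1, block=2))
print(blocked_sum(xs, 16, block=2))
```

The price is the extra partials array and a block size fixed up front regardless of hardware, which is exactly the kind of development effort most BOINC projects can't easily spare.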

And that was why I asked the next question...
As for which result is right -- if there are enough different work units all looking at the same data points with slightly different parameters a consensus might be possible; doesn't nBody work like that?
IIRC it does.

And that means that unless every work-unit for a specific data point fails to validate, there will be a decision made as to the validity of the results; if that's the case, the main problem is user dissatisfaction, not scientific validity, I think...

Regarding the Windows vs. Linux incompatibility: this might be the issue.
Yup, and that's why places like WCG use homogeneous redundancy; it may not stop problems completely (witness some recent issues with ARP1 and Darwin systems), but it tends to get rid of most problems that arise from compiler variations such as different code ordering and different instruction choices!

It's an interesting topic, and it usually comes down to development resources in the end...

Cheers - Al.

[Edited to re-organize slightly]
Link
Joined: 19 Jul 10
Posts: 775
Credit: 20,504,165
RAC: 9,827
Message 77581 - Posted: 5 Aug 2025, 16:08:09 UTC - in response to Message 77570.  

Possibly there's some general "instability" in the application: de_nbody_07_02_2025_v193_OCS_north__data__01_1752499202_600831 now has 3 quite different results, all from Windows:

#1
<search_likelihood>-8.782164729492026</search_likelihood>
<search_likelihood_EMD>-7.013343962106111</search_likelihood_EMD>
<search_likelihood_Mass>-0.200299889044651</search_likelihood_Mass>
<search_likelihood_Beta>-1.568520878341263</search_likelihood_Beta>

#2
<search_likelihood>-6.230592315846064</search_likelihood>
<search_likelihood_EMD>-4.578218253136948</search_likelihood_EMD>
<search_likelihood_Mass>-0.017953695121116</search_likelihood_Mass>
<search_likelihood_Beta>-1.634420367588000</search_likelihood_Beta>

#3
<search_likelihood>-14.002150320864439</search_likelihood>
<search_likelihood_EMD>-11.407574843044049</search_likelihood_EMD>
<search_likelihood_Mass>-0.006884998195510</search_likelihood_Mass>
<search_likelihood_Beta>-2.587690479624881</search_likelihood_Beta>
And the winner is...
<search_likelihood>-6.230592315846064</search_likelihood>
<search_likelihood_EMD>-4.578218253136948</search_likelihood_EMD>
<search_likelihood_Mass>-0.017953695121116</search_likelihood_Mass>
<search_likelihood_Beta>-1.634420367588000</search_likelihood_Beta>

So an AMD Ryzen 9 5950X running 16 threads per task got the same result as an Intel Core Ultra 5 135U running 12 threads per task. So different CPUs and different numbers of threads were not the issue in this case.
gimmyk
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 11 Sep 24
Posts: 13
Credit: 32,581
RAC: 1,457
Message 77585 - Posted: 10 Aug 2025, 0:30:51 UTC

Unfortunately we have still not solved the issue with the invalids. It does seem that the issue arises from how the two different versions are compiled, and I have been looking for a solution, but no luck yet. The case with three Windows machines disagreeing is rather interesting. All-Linux and all-Windows results generally agree, so this may be pointing towards something else.

As for the validity of the results, the good news is that the differences between OSes are relatively small, and a good result on one is still a good result on the other, so our optimization will be able to work as usual. If we do get a final result for a run that disagrees between OSes, it will still give us enough information to be useful at this time. Thankfully the issue is not affecting all results either, so we are not guaranteed to get a problematic result. This could definitely cause an issue for our final results, but we have some time to fix it before it's a major issue. Currently we believe the Linux results are the correct ones.

Hopefully we can get these annoying invalids out of the way soon!
Link
Joined: 19 Jul 10
Posts: 775
Credit: 20,504,165
RAC: 9,827
Message 77588 - Posted: 10 Aug 2025, 13:42:43 UTC - in response to Message 77585.  

Thanks for the update.
Link
Joined: 19 Jul 10
Posts: 775
Credit: 20,504,165
RAC: 9,827
Message 77640 - Posted: 8 Sep 2025, 13:36:45 UTC - in response to Message 77585.  

Hopefully we can get these annoying invalids out of the way soon!
Another thing that you might need to get out of the way: the 11,616 MilkyWay@home N-Body Simulation results that the server status page has been showing as in progress for a few weeks.
gimmyk
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 11 Sep 24
Posts: 13
Credit: 32,581
RAC: 1,457
Message 77641 - Posted: 8 Sep 2025, 21:16:38 UTC - in response to Message 77640.  

Thanks for pointing these out; it looks like these results have been in limbo going as far back as 2020. Having them up didn't do any harm, but we have set them to completed so they will stop showing up.

To give a brief update on the invalids issue: some of our improvements to the code seem to have had an effect on the Windows results. The results still do not match those found on Linux (which did not change), but these fixes should be at least a part of the solution.
Cavalary
Joined: 23 Aug 11
Posts: 58
Credit: 18,325,956
RAC: 21,475
Message 77643 - Posted: 20 Sep 2025, 1:49:47 UTC

Seems like the issue where a task running when the system is shut down can produce an invalid result is back... Hadn't seen that in quite a while, but now I (belatedly) noticed:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1007593313
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1007596693
Seems like they were almost finished when I had to shut down, since they were sent out about an hour after I booted back up. But that was a week ago. Huh, I didn't realize I hadn't checked since.


©2025 Astroinformatics Group