Request for Windows ARM64 support - Snapdragon X2 Elite Extreme (Oryon)

Author	Message
kasdashdfjsah Joined: 3 Feb 24 Posts: 16 Credit: 232,541 RAC: 0	Message 77926 - Posted: 22 Apr 2026, 13:09:19 UTC Just got a new X2 Elite Extreme Snapdragon laptop. The CPU's single- and multi-core speed is insane. Performance could be much better with native ARM64 support: x86 emulation hits these chips harder than it hits Apple Silicon on macOS. Please consider adding native Windows ARM64 support. I would be happy to test it, and I can also do the work myself if I get access to the relevant project files. My 18-core setup is ready for testing. Data shows native apps run much faster than emulated ones on Oryon.

kasdashdfjsah Joined: 3 Feb 24 Posts: 16 Credit: 232,541 RAC: 0	Message 77927 - Posted: 24 Apr 2026, 8:51:26 UTC Update: Subject: Windows ARM64 support - Benchmarking the Snapdragon X2 Elite (Oryon) I see that Nbody v1.95 was recently released. I am currently running native Windows ARM64 builds on Asteroids@home and Einstein@home with great results on the new Snapdragon X2 Elite Extreme. My 18-core setup is delivering high-end desktop throughput at only 25-30W CPU power. I would love to bring this efficiency to MilkyWay@home. Since the Adreno GPU lacks FP64 support, I am looking for a native Windows ARM64 CPU build (MSVC/CMake). Given the current source supports OpenMP and Double Precision, a native Oryon build should be extremely efficient for N-body simulations. Is there a windows_arm64 plan class in the works, or could a test binary be provided? I am ready to provide benchmarks and stability data immediately.

ahorek's team Joined: 8 Sep 07 Posts: 10 Credit: 2,566,112 RAC: 658	Message 77936 - Posted: 26 Apr 2026, 15:55:29 UTC The current CPU application can be built on ARM, but it produces incorrect results. I tried to fix it, but I wasn’t successful: https://github.com/Milkyway-at-home/milkywayathome_client/pull/224/changes Without a proper fix to make it match the x64 results, the ARM version won’t be useful. The admins don’t seem to have much interest in those platforms either, so it is unlikely there will be an ARM version for Apple, Android, or Windows on ARM. Windows on ARM x64 emulation may still work, though. > should be extremely efficient for N-body simulations Don’t make those assumptions if you haven’t tested it. Just because your brand-new laptop’s CPU only consumes 30W doesn’t necessarily mean it’s efficient. If you compare the performance-per-watt ratio with recent x64 CPUs, you’ll see it’s not that impressive.

gimmyk Volunteer moderator Project administrator Project developer Project scientist Joined: 11 Sep 24 Posts: 33 Credit: 710,354 RAC: 477	Message 77937 - Posted: 27 Apr 2026, 17:25:42 UTC I would love to be able to release a version on ARM. I've looked into it a little bit, but I don't think it's something that's feasible. The simulations we run are very chaotic, so even a small difference in a single calculation will propagate and produce a very different result. Chances are that the only way we could get results within the needed precision to validate against each other on the server is by forcing identical results between the two architectures for every calculation. I'm not even sure if doing that is possible, and if it is, it would take more work than we are able to dedicate to that at this time. As mentioned above, the application can be built on ARM, but it can't be used for much but a low accuracy approximation. You're free to look at it and try to fix it if you want, but I wouldn't recommend it.

ahorek's team Joined: 8 Sep 07 Posts: 10 Credit: 2,566,112 RAC: 658	Message 77938 - Posted: 30 Apr 2026, 16:45:36 UTC > it can't be used for much but a low accuracy approximation ARM chips do support double-precision arithmetic, so they can absolutely be used to compute results with sufficient accuracy. However, your current code depends on undefined behaviour, as explained here: https://github.com/Milkyway-at-home/milkywayathome_client/pull/224#issuecomment-4354305606 This is a bug, not a precision issue or a hardware limitation. Unfortunately, my fix alone isn’t enough. Either there are other places with the same bug, or there is another problem. Comparing results at each step between 2 architectures is pretty time-consuming...

gimmyk Volunteer moderator Project administrator Project developer Project scientist Joined: 11 Sep 24 Posts: 33 Credit: 710,354 RAC: 477	Message 77940 - Posted: 1 May 2026, 16:45:25 UTC Just to clarify, when I talk about the accuracy of the result I am referring to our final simulation result. While individual calculations may be extremely close, any errors we have will grow exponentially in this kind of simulation. I worry that the only way to ensure results close enough to validate would be to have exact bitwise identity, which would be difficult to do and likely cost a lot of performance. I'm certainly no expert on the topic, but I think we would need to change our math in many places to ensure this kind of strict reproducibility. From my understanding nbody applications typically do not ensure this level of consistency between architectures. It may be possible to keep many things the same and fix bugs like what you have pointed out to get statistically similar results that we can still consider "good", but I don't know if these will consistently be close enough that we could have them validate against each other.
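The point about errors growing exponentially can be illustrated with a toy chaotic system. The sketch below uses the logistic map x -> 4x(1-x) purely as a stand-in; it is not the MilkyWay@home N-body integrator, and the function name and parameters are hypothetical. It counts how many iterations two trajectories need before a tiny initial difference becomes large.

```c
#include <assert.h>

/* Toy illustration only: the logistic map x -> 4x(1-x) stands in for a
 * chaotic simulation. It is NOT the project's N-body code. */

/* Returns the number of iterations after which two trajectories that
 * start `delta` apart differ by more than `threshold`, or `max_steps`
 * if that never happens. All names here are hypothetical. */
unsigned int steps_until_divergence(double x0, double delta,
                                    double threshold, unsigned int max_steps)
{
    assert(threshold > 0.0);

    double a = x0;
    double b = x0 + delta;

    for (unsigned int step = 0; step < max_steps; ++step) {
        double diff = a > b ? a - b : b - a;
        if (diff > threshold) {
            return step;
        }
        /* Advance both trajectories by one step of the map. */
        a = 4.0 * a * (1.0 - a);
        b = 4.0 * b * (1.0 - b);
    }
    return max_steps;
}
```

Calling `steps_until_divergence(0.3, 1e-15, 1e-3, 1000)` starts the two trajectories only a few units in the last place apart, yet they separate by more than 1e-3 well before the step limit, while a `delta` of exactly 0.0 never diverges. That is the sense in which a single differently rounded operation can change a final result.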

Ian&Steve C. Joined: 18 Nov 22 Posts: 97 Credit: 653,649,280 RAC: 14,807	Message 77941 - Posted: 1 May 2026, 20:19:17 UTC - in response to Message 77940. Last modified: 1 May 2026, 20:21:34 UTC I have (with much AI help) fixed the issue with accuracy on aarch64 in Linux. A single small change to the nbody_histogram.c file will correct the issue for aarch64 Linux builds (excluding other small changes needed to target aarch64). My short test WU that I ran with this is accurate to 4 decimals on all 8 output parameters. Not sure how strict the validator is, but I am testing this now via anonymous platform. Edit: I tried to post the code fix here but the forum freaks out. Maybe you have something on the forums that prevents posting code bits?

Ian&Steve C. Joined: 18 Nov 22 Posts: 97 Credit: 653,649,280 RAC: 14,807	Message 77942 - Posted: 1 May 2026, 20:24:36 UTC - in response to Message 77941.
diff --git a/nbody/src/nbody_histogram.c b/nbody/src/nbody_histogram.c
index 47fb6715..1731984d 100644
--- a/nbody/src/nbody_histogram.c
+++ b/nbody/src/nbody_histogram.c
@@ -915,12 +915,21 @@ MainStruct* nbCreateHistogram(const NBodyCtx* ctx, /* Simulation context
         mu_ras[ub_counter] = DEFAULT_NOT_USE;
         mu_decs[ub_counter] = DEFAULT_NOT_USE;
-        /* Find the indices */
-        lambdaIndex = (unsigned int) mw_floor((lambda - lambdaStart) / lambdaSize);
-        betaIndex = (unsigned int) mw_floor((beta - betaStart) / betaSize);
+        /* Find the indices. Casting a negative double to unsigned int is
+         * undefined behavior, and x86_64 vs aarch64 implement it
+         * differently: x86_64 wraps to a huge unsigned (correctly fails the
+         * < lambdaBins check), aarch64 saturates to 0 (incorrectly bins
+         * out-of-range particles into bin 0). Bound-check on the float
+         * first to keep behavior identical across architectures. */
+        real lambdaIdxF = mw_floor((lambda - lambdaStart) / lambdaSize);
+        real betaIdxF = mw_floor((beta - betaStart) / betaSize);
+        mwbool inRange = (lambdaIdxF >= 0.0 && lambdaIdxF < (real) lambdaBins
+                          && betaIdxF >= 0.0 && betaIdxF < (real) betaBins);
+        lambdaIndex = inRange ? (unsigned int) lambdaIdxF : lambdaBins;
+        betaIndex = inRange ? (unsigned int) betaIdxF : betaBins;
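The idea in the patch, checking the range while the value is still a floating-point number and only then converting, can be shown in isolation. The following is a minimal one-dimensional sketch; `bin_index_checked` and its parameters are hypothetical and do not exist in the MilkyWay@home sources. It assumes, like the patch, that the bin count itself serves as the "out of range" marker.

```c
#include <assert.h>

/* Hypothetical helper, not part of the MilkyWay@home sources.
 * Maps `value` to a bin index in [0, bins). Returns `bins` itself as an
 * "out of range" sentinel, following the convention in the patch. */
unsigned int bin_index_checked(double value, double start,
                               double width, unsigned int bins)
{
    assert(width > 0.0);

    double pos = (value - start) / width;

    /* Range-check while still in floating point. The negated form also
     * rejects NaN, because every comparison with NaN is false. */
    if (!(pos >= 0.0 && pos < (double) bins)) {
        return bins;
    }

    /* pos is now in [0, bins), so the conversion is well defined, and
     * truncation toward zero gives the same result as floor(). */
    return (unsigned int) pos;
}
```

With ten unit-wide bins starting at 0.0, a value of 2.7 lands in bin 2, while -0.5 and 10.0 both return the sentinel 10 on every architecture, because no out-of-range double is ever converted to an unsigned integer.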

Ian&Steve C. Joined: 18 Nov 22 Posts: 97 Credit: 653,649,280 RAC: 14,807	Message 77943 - Posted: 1 May 2026, 20:36:34 UTC Last modified: 1 May 2026, 20:48:36 UTC This is the first workunit with the newest code (ignore earlier runs with more error): https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1020117489 (mine is the anonymous platform result). Very, very close, but it looks like the validator still doesn't accept it, since it wasn't judged valid. Is this really an invalid result? Or can the validator strictness be loosened to allow this?
Metric                      aarch64                 x86_64                  Delta (x86_64 - aarch64)
--------------------------- ----------------------- ----------------------- ------------------------
search_likelihood           -751.206790741635700    -751.232077413926845    -0.025286672291145
search_likelihood_EMD       -16.071059557630502     -16.099033208117227     -0.027973650486725
search_likelihood_Mass      -48.396293126839204     -48.396293126839204     0.000000000000000
search_likelihood_Beta      -112.139917748761746    -111.501316752592331    +0.638600996169415
search_likelihood_BetaAvg   -142.601938056101801    -143.279383333977933    -0.677445277876132
search_likelihood_VelAvg    -101.435569456358564    -101.418970908206191    +0.016598548152373
search_likelihood_Dist      -112.500000000000000    -112.500000000000000    0.000000000000000
search_likelihood_Momentum  -218.062012795943843    -218.037080084194088    +0.024932711749755
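The thread does not say how the validator compares results, so the following is only a sketch of one common approach, a relative-tolerance check, applied to values from the table. The function names and the 1e-4 tolerance are assumptions made for illustration and are not the project's actual validator logic or limits.

```c
#include <assert.h>

/* Absolute value without pulling in <math.h>. */
static double abs_double(double v)
{
    return v < 0.0 ? -v : v;
}

/* Relative difference |a - b| / max(|a|, |b|); 0 when both inputs are 0. */
double relative_difference(double a, double b)
{
    double abs_a = abs_double(a);
    double abs_b = abs_double(b);
    double scale = abs_a > abs_b ? abs_a : abs_b;

    if (scale == 0.0) {
        return 0.0;
    }
    return abs_double(a - b) / scale;
}

/* Hypothetical acceptance rule, NOT the project's real validator:
 * returns 1 when the two values agree to within `rel_tol`. */
int within_tolerance(double a, double b, double rel_tol)
{
    assert(rel_tol >= 0.0);
    return relative_difference(a, b) <= rel_tol;
}
```

Under an assumed tolerance of 1e-4, the two `search_likelihood` values (-751.206790741635700 and -751.232077413926845) would agree, but the two `search_likelihood_Beta` values (-112.139917748761746 and -111.501316752592331) would not. A per-component check of this kind would therefore reject this result even though the headline likelihood is close.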

gimmyk Volunteer moderator Project administrator Project developer Project scientist Joined: 11 Sep 24 Posts: 33 Credit: 710,354 RAC: 477	Message 77944 - Posted: 1 May 2026, 21:05:22 UTC This looks promising. I guess none of us ever caught that behavior with the casting! I'll have to look into testing this with some other cases and see how well it does, particularly around the important regions of the likelihood surface. If it is consistently this good we may be able to increase the range on the validator, but that's something I'd need to bring up and get permission for.

Ian&Steve C. Joined: 18 Nov 22 Posts: 97 Credit: 653,649,280 RAC: 14,807	Message 77945 - Posted: 1 May 2026, 21:11:42 UTC - in response to Message 77944. Thanks. I just realized that my build for that sample WU didn't include some FP math strictness flags that might also be necessary on aarch64. I've added those in and will keep testing to see if the results come out closer.

Ian&Steve C. Joined: 18 Nov 22 Posts: 97 Credit: 653,649,280 RAC: 14,807	Message 77947 - Posted: 2 May 2026, 1:56:49 UTC - in response to Message 77945. https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1019638235 I added three build arguments: -ffp-contract=off -fno-associative-math -fno-finite-math-only. Now my results seem to be bitwise identical to the x86_64 results, and they are validating. Just -ffp-contract=off alone might be enough to get within the validator limits; I'll try that tomorrow.
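A plausible reason -ffp-contract=off matters: with contraction allowed, a compiler may merge a multiply and an add into one fused instruction that skips the intermediate rounding. AArch64 always has such an instruction, while baseline x86_64 builds do not use one. The sketch below shows how that single skipped rounding changes a result. It is an illustration under stated assumptions, not code from the project: both function names are hypothetical, and the extended-precision path is only a stand-in for a hardware fused multiply-add. It is exact for the inputs discussed below when `long double` carries at least 64 significand bits (true for GCC and Clang on x86_64 and aarch64 Linux, not for MSVC).

```c
#include <float.h>

#if LDBL_MANT_DIG < 64
#error "this sketch assumes long double has at least 64 significand bits"
#endif

/* Multiply, round the product to double, then add. `volatile` forces the
 * product to be stored as a double, so the compiler cannot fuse the two
 * operations into a single instruction. */
double mul_add_rounded(double a, double b, double c)
{
    volatile double product = a * b;
    return product + c;
}

/* Stand-in for a fused multiply-add (hypothetical helper): the product
 * is kept at extended precision, so it is not rounded to double before
 * the add. Only the final result is rounded to double. */
double mul_add_unrounded(double a, double b, double c)
{
    long double product = (long double) a * (long double) b;
    return (double) (product + (long double) c);
}
```

Take a = 1 + 2^-27, b = 1 - 2^-27 and c = -1. The exact product is 1 - 2^-54, which lies halfway between two doubles and rounds to 1.0, so `mul_add_rounded` returns 0.0. Without the intermediate rounding, `mul_add_unrounded` returns -2^-54. The difference is tiny, but in a chaotic simulation it is enough to break bitwise agreement between builds.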

Ian&Steve C. Joined: 18 Nov 22 Posts: 97 Credit: 653,649,280 RAC: 14,807	Message 77948 - Posted: 2 May 2026, 18:14:15 UTC - in response to Message 77947. Looks like just -ffp-contract=off is enough.

Ian&Steve C. Joined: 18 Nov 22 Posts: 97 Credit: 653,649,280 RAC: 14,807	Message 77952 - Posted: 3 May 2026, 14:21:48 UTC - in response to Message 77948. Last modified: 3 May 2026, 14:27:32 UTC I have 6 different aarch64 SBCs all running my app now, all producing valid results against the stock v1.95 x86_64 app. I think you should be good to push out an official Linux aarch64 build, if you have an appropriate device to build it on. My hosts running this app:
Nvidia Jetson Orin Nano: https://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=1066634
Nvidia Jetson Orin NX: https://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=1066909
Nvidia Jetson Orin NX: https://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=1066906
Radxa Rock 5C: https://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=1066913
Radxa Rock 5C: https://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=1066911
Raspberry Pi 5: https://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=1066934
Maybe ahorek can build a Win_arm version as well for kasdashdfjsah to test (on topic for this thread).

gimmyk Volunteer moderator Project administrator Project developer Project scientist Joined: 11 Sep 24 Posts: 33 Credit: 710,354 RAC: 477	Message 77953 - Posted: 6 May 2026, 21:28:16 UTC This seems to be working for me as well. I'll begin testing out the Linux version, but I don't know if we'll be able to build for Windows on ARM.

Ian&Steve C. Joined: 18 Nov 22 Posts: 97 Credit: 653,649,280 RAC: 14,807	Message 77956 - Posted: 7 May 2026, 13:39:59 UTC - in response to Message 77953. Thanks!