Message boards :
Number crunching :
problem with de_nbody tasks never finishing
Author | Message |
---|---|
Send message Joined: 2 Jul 12 Posts: 1 Credit: 8,074,768 RAC: 0 |
I have recently been having problems with the de_nbody files never finishing. They will start, elapsed time continues to increment but the time remaining also increments instead of declining. These tasks will also use every cpu available (ryzen 1600 6 cores 12 threads) which means no other projects I am connected to get any processor time. I finally abort these files and things proceed normally until another one is scheduled. Some of them do run and finish normally but it is getting annoying. Anyone know why this is happening and what to do about it? Running win10 home, boinc 7.6.33 Dennis |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
The n-body workunits don't work for everyone. They work for a lot of people but not all, and it's a work in progress to keep up with all the new features and CPUs that come out all the time. I suggest just running the standard units; you can run 11 of them at a time if you also use your GPU for crunching. |
Send message Joined: 8 Oct 07 Posts: 52 Credit: 5,859,587 RAC: 3,794 |
I started this thread a few days ago and it looks like we're having the same problem. In my case, it is only occasional and restarting BOINC gets the tasks working again. Also, I first noticed the problem after updating BOINC to v7.10.2. (So, you might want to try v7.8.3.) If there are any moderators out there, it's OK with me if you would like to combine our 2 threads. And, retitling would probably be a good idea, as well - maybe something like "3-CPU Nbody Task hang-ups" |
Send message Joined: 25 Oct 18 Posts: 1 Credit: 8,548,574 RAC: 681 |
It happens to me on nbody and on single CPU tasks. Started killing the long running stalled tasks (what a waste) before realising rebooting fixed the problem for a while. |
Send message Joined: 3 Mar 20 Posts: 1 Credit: 676,963 RAC: 0 |
I am having the same long-run problems. In my experience, simply suspending and immediately restarting BOINC takes care of the problem for a time. I suspect that the problem occurs because of two issues. The first is inadequate error handling in the code. A simple solution would be a timer that starts counting with each iteration; should the count get beyond some number, a suspend and restart occurs in that specific work unit. The second is numeric noise: my experience with engineering and scientific code has shown that coders sometimes do not appreciate how numeric noise corrupts the calculations. A simple example is SPICE code for electronic simulation. Use of double precision is necessary but even then sometimes not sufficient. I re-coded the software in long doubles and almost never had a problem. Iterative solutions using various approaches with sparse matrices are fundamentally susceptible to these types of errors. Suspending the process dumps most of the computation at that time, accumulated errors included. I am confident these folks will get this straightened out. BW |
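As a small illustration of the numeric-noise point in the post above, the sketch below adds the same value a million times in ordinary double precision and compares the result with a correctly rounded sum. Python has no built-in long double, so `math.fsum` stands in here for the higher-precision recomputation the poster describes; it is a swapped-in technique, not the poster's method, and `naive_sum` is a hypothetical helper. Nothing here is taken from the N-Body application itself.

```python
import math


def naive_sum(values):
    """Add values left to right in plain double precision."""
    total = 0.0
    for v in values:
        total += v  # each addition can introduce a tiny rounding error
    return total


# 0.1 has no exact binary representation, so the errors accumulate.
values = [0.1] * 1_000_000

naive = naive_sum(values)
accurate = math.fsum(values)  # tracks the lost low-order bits internally

print("naive sum:   ", repr(naive))
print("accurate sum:", repr(accurate))
print("difference:  ", abs(naive - accurate))
```

The difference is tiny for a single sum, but an iterative solver feeds each result into the next step, which is how such errors can grow until a calculation stops converging.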
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
You can always PM an admin with your thoughts. |
Send message Joined: 7 May 14 Posts: 57 Credit: 206,540,646 RAC: 6 |
Hi all, I made a video on YouTube with instructions for running multiple instances, shown at full load on a Radeon VII: "RADEON VII GIGABYTE// 3 Instances_ Milkyway@home WUs BOINC_ 3_instances" https://www.youtube.com/watch?v=4xKy9wGKmz4 All the best and welcome to earth. |
Send message Joined: 8 Oct 07 Posts: 52 Credit: 5,859,587 RAC: 3,794 |
I almost started a new thread for my problem but then I realized that I am encountering essentially the same issue as others who have posted here. That is, tasks running the Milkyway@home N-Body Simulation v1.76 (mt) windows_x86_64 app are often hanging up for hours at a time until discovered by me. And restarting BOINC always gets them going again. I used to think that the problem might be related to incompatibility with other programs I might be running, but I have since convinced myself that other programs are irrelevant. The problem seems to occur just as often when other programs are not running as when they are running. Conversely, I sometimes run other programs and then check BOINC to find that N-Body tasks are still running OK. However, I do think that the problem is somehow related to the characteristics of specific tasks. That is, some tasks require relatively few (0-2) restarts while others may need restarting 8+ times. Although I have been a MilkyWay contributor for 10+ years, my participation rate has recently increased substantially (due to the SETI hibernation). As a result, this issue has become very annoying. I am also getting older and my memory isn't what it used to be. Case in point: I had completely forgotten that I had posted this message a little over 2 years ago, essentially reporting this exact same issue. The only differences are updated versions of the N-Body app, BOINC, and Windows, as well as the addition of a new computer. |
Send message Joined: 20 May 18 Posts: 5 Credit: 2,082,833 RAC: 2 |
I wish I had read this thread a few days ago. On May 16, I got 52 de_nbody WUs, all due on May 28th. The WUs that were estimated to take 3 hours all seemed to function normally, but the WUs that were estimated at 8 hours had their time remaining count up instead of down and would stall, unmoving, for hours at a time. I ended up having to abort 28 WUs, a few of them half-completed, because today's deadline was a few hours away, I could see they were not going to make it in time, and they were causing other projects to miss their deadlines as well. |
Send message Joined: 8 Mar 15 Posts: 30 Credit: 78,352,636 RAC: 172 |
How can I set M@H to use the CPU only for N-Body Simulation and the GPU for Separation? Maybe with an app_config, but I don't know what I have to write. System: ASUS X570 E-Gaming; AMD Ryzen 9 3950X (16 cores / 32 threads, 4.4 GHz); AMD Radeon Sapphire RX 480 4GB Nitro+; Nvidia GTX 1080 Ti Gaming X Trio; 4x16 GB Corsair Vengeance RGB 3466 MHz |
Send message Joined: 30 Oct 13 Posts: 2 Credit: 248,374 RAC: 450 |
I have the same issue. It is quite annoying to have to intervene. I thought the entire run was corrupt, so I have just been aborting it. It didn't even occur to me to suspend and restart BOINC. They really need a timer that compares estimated run time with actual usage to see whether the task is actually making progress; if it has failed, restart it at its last known good point and continue. If it still fails to make progress, abort automatically. |
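The policy suggested in the post above (check for progress, restart from the last good point, abort if that does not help) can be sketched as a small state machine. This is only an illustration under stated assumptions: the `Watchdog` class, its thresholds, and the `"continue"` / `"restart"` / `"abort"` return values are hypothetical and do not correspond to anything in BOINC or the N-Body application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Watchdog:
    """Hypothetical stall detector; thresholds are illustrative only."""
    max_stalled_checks: int = 3   # checks without progress before acting
    max_restarts: int = 2         # restarts allowed before giving up
    last_progress: Optional[float] = None
    stalled_checks: int = 0
    restarts: int = 0

    def check(self, progress: float) -> str:
        """Return 'continue', 'restart', or 'abort' for one periodic check."""
        # First observation (or first after a restart), or real progress.
        if self.last_progress is None or progress > self.last_progress:
            self.last_progress = progress
            self.stalled_checks = 0
            return "continue"

        self.stalled_checks += 1
        if self.stalled_checks < self.max_stalled_checks:
            return "continue"

        # Stalled for too long: forget the old baseline, because a restart
        # from a checkpoint may resume at a lower progress value.
        self.stalled_checks = 0
        self.last_progress = None
        if self.restarts < self.max_restarts:
            self.restarts += 1
            return "restart"
        return "abort"


if __name__ == "__main__":
    wd = Watchdog(max_stalled_checks=2, max_restarts=1)
    for p in [0.10, 0.20, 0.20, 0.20, 0.15, 0.15, 0.15]:
        print(p, wd.check(p))
```

The restart counter is never reset in this sketch, so a task that repeatedly stalls is eventually aborted even if it advances a little each time; a real implementation would have to decide whether that is the desired behaviour.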
Send message Joined: 8 Oct 07 Posts: 52 Credit: 5,859,587 RAC: 3,794 |
This problem has existed for roughly 3 years and, as far as I can tell, no project administrators or moderators have ever responded to this thread. I first reported the problem on 2 Jun 2018. Then, on 13 May 2020, I reported it again. To be clear, there is a problem with N-Body Simulation (mt) (3 CPUs) tasks hanging up. The problem existed with V1.68 and continues with V1.76. Over the last 3 years it has continued to crop up under all versions of BOINC and all versions of Windows that I have used, on 3 different computers. |
Send message Joined: 8 Oct 07 Posts: 52 Credit: 5,859,587 RAC: 3,794 |
Deleted accidental double post. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Hey there, Just checking in to say that I do monitor things and I'm aware of this thread. I'm not at all sure what the problem may be, but please know that we aren't ignoring you. Best, Tom |
Send message Joined: 8 Oct 07 Posts: 52 Credit: 5,859,587 RAC: 3,794 |
Tom, Thanks for the reply! And it's good to know somebody is watching. If you read my earlier posts on the subject you know that the problem is easily gotten around by restarting BOINC. And, right now I am restarting BOINC 2 or 3 times a day. If there is anything you would like me to do before restarting, please let me know. Stick |
Send message Joined: 9 Jul 17 Posts: 100 Credit: 16,967,906 RAC: 0 |
I have never seen the problem, and though I have not run a very large number of N-body, I have probably done enough to expect to see it. I always suspect AV software when something is hanging up. It may be the "real time protection", which often still monitors processes even if you exclude the BOINC Data folder. I use Microsoft Defender for Win10 and don't have problems with it. |
Send message Joined: 8 Oct 07 Posts: 52 Credit: 5,859,587 RAC: 3,794 |
Jim, thank you for the suggestion. I will change to Microsoft Defender (from Avast) on one of my computers to see if it makes a difference. Stick |
Send message Joined: 8 Oct 07 Posts: 52 Credit: 5,859,587 RAC: 3,794 |
After switching to MS Defender it didn't take long for a Milkyway@home N-Body Simulation 1.76 (3 CPUs) hangup to occur. But I had forgotten to exclude the BOINC folders. Restarting BOINC now with folders excluded. |
Send message Joined: 9 Jul 17 Posts: 100 Credit: 16,967,906 RAC: 0 |
"After switching to MS Defender it didn't take long for a Milkyway@home N-Body Simulation 1.76 (3 CPUs) hangup to occur. But I had forgotten to exclude the BOINC folders." I haven't excluded them either. But you seem to be running mobile CPUs. Are the work units being suspended? I run my machines 24/7, since they are dedicated. It could be one of the power-down tricks that Intel or Microsoft uses that causes the problem. I set my power options to "high performance" mode. Good luck. |
Send message Joined: 8 Oct 07 Posts: 52 Credit: 5,859,587 RAC: 3,794 |
"But you seem to be running mobile CPUs" ... "I run my machines 24/7, since they are dedicated." Thanks again for the reply. You are right. Both my multi-core computers are laptops. They are older and the batteries are shot. I run them plugged into the charger, pretty much 24/7, for BOINC. I run several different BOINC projects and it's only the 3 CPUs Nbody tasks that have any problems. "Are the work units being suspended?" BOINC does not show the hung-up tasks as suspended. They are shown as Running, with Elapsed time counting up, but Progress is frozen. Most Nbody tasks take 20 to 25 minutes to finish, so when I see one with a longer elapsed time, I restart BOINC. When BOINC restarts, the hung-up task starts running again, but its Elapsed time has been reset to a much earlier time (less than 20 minutes). I'm guessing that only around 10% of tasks hang up. Some tasks hang up multiple times. |
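The manual check described in the post above (a task running longer than the usual 20 to 25 minutes while its Progress stays frozen) can be written down as a rule over two snapshots of the task list. This is a minimal sketch, assuming the snapshots have already been collected by hand or by some other tool; `TaskSnapshot`, `find_hung`, the sample task names, and the 25-minute default are all hypothetical and are not part of BOINC.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskSnapshot:
    """One row of the task list at a moment in time (hypothetical type)."""
    name: str
    elapsed_minutes: float
    progress: float  # fraction done, 0.0 to 1.0


def find_hung(previous: List[TaskSnapshot],
              current: List[TaskSnapshot],
              typical_minutes: float = 25.0) -> List[str]:
    """Names of tasks that ran past the typical time with frozen progress."""
    before = {t.name: t.progress for t in previous}
    hung = []
    for task in current:
        earlier = before.get(task.name)
        if earlier is None:
            continue  # task is new since the last snapshot; nothing to compare
        if task.elapsed_minutes > typical_minutes and task.progress <= earlier:
            hung.append(task.name)
    return hung


if __name__ == "__main__":
    earlier_list = [
        TaskSnapshot("nbody_a", 30.0, 0.42),
        TaskSnapshot("nbody_b", 10.0, 0.40),
    ]
    later_list = [
        TaskSnapshot("nbody_a", 40.0, 0.42),  # past 25 minutes, no progress
        TaskSnapshot("nbody_b", 20.0, 0.85),  # progressing normally
        TaskSnapshot("nbody_c", 1.0, 0.02),   # just started
    ]
    print(find_hung(earlier_list, later_list))
```

Comparing two snapshots instead of looking at elapsed time alone avoids flagging a task that is merely slow but still advancing.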
©2024 Astroinformatics Group