Welcome to MilkyWay@home

problem with de_nbody tasks never finishing


Message boards : Number crunching : problem with de_nbody tasks never finishing

Dennis
Joined: 2 Jul 12
Posts: 1
Credit: 8,042,153
RAC: 0
Message 67559 - Posted: 3 Jun 2018, 22:18:13 UTC

I have recently been having problems with de_nbody tasks never finishing. They start and the elapsed time continues to increment, but the time remaining also increments instead of declining. These tasks also use every CPU thread available (Ryzen 1600, 6 cores / 12 threads), which means the other projects I am connected to get no processor time. I end up aborting these tasks, and things proceed normally until another one is scheduled. Some of them do run and finish normally, but it is getting annoying.

Does anyone know why this is happening and what to do about it?
Running Windows 10 Home, BOINC 7.6.33

Dennis
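
One possible reading of the symptom above: if the remaining-time figure is extrapolated from the fraction of work done, then a frozen fraction makes the figure grow as elapsed time grows. The sketch below is a simplified model under that assumption, not BOINC's actual estimator, and `estimate_remaining` is a hypothetical helper.

```python
def estimate_remaining(elapsed_s: float, fraction_done: float) -> float:
    """Extrapolate remaining seconds from progress so far (simplified model)."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    # Projected total runtime is elapsed / fraction_done; subtract time already spent.
    return elapsed_s / fraction_done - elapsed_s


# A task frozen at 40% progress: the estimate rises as elapsed time rises.
frozen = [estimate_remaining(t, 0.40) for t in (600, 1200, 2400)]

# A healthy task whose progress keeps pace: the estimate falls to zero.
healthy = [estimate_remaining(t, f)
           for t, f in ((600, 0.25), (1200, 0.50), (2400, 1.00))]

print("frozen :", frozen)
print("healthy:", healthy)
```

Under this model, a rising "time remaining" next to a rising "elapsed time" is a sign that progress has stopped, not that the task is merely slow.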
ID: 67559 · Rating: 0
mikey
Joined: 8 May 09
Posts: 2541
Credit: 462,666,679
RAC: 142
Message 67560 - Posted: 4 Jun 2018, 11:28:47 UTC - in response to Message 67559.  

The n-body workunits don't work for everyone. They work for a lot of people, but not all, and it's a work in progress to keep up with the new features and CPUs that come out all the time. I suggest just running the standard units; you can run 11 of them at a time if you also use your GPU for crunching.
ID: 67560 · Rating: 0
Stick
Joined: 8 Oct 07
Posts: 50
Credit: 1,802,105
RAC: 3,931
Message 67561 - Posted: 4 Jun 2018, 18:58:36 UTC - in response to Message 67559.  

I started this thread a few days ago and it looks like we're having the same problem. In my case, it is only occasional and restarting BOINC gets the tasks working again. Also, I first noticed the problem after updating BOINC to v7.10.2. (So, you might want to try v7.8.3.)

If there are any moderators out there, it's OK with me if you would like to combine our 2 threads. Retitling would probably be a good idea as well, maybe something like "3-CPU Nbody Task hang-ups".
ID: 67561 · Rating: 0
Dad
Joined: 25 Oct 18
Posts: 1
Credit: 4,828,862
RAC: 8,445
Message 69176 - Posted: 17 Oct 2019, 3:18:22 UTC

It happens to me on both N-body and single-CPU tasks. I started killing the long-running stalled tasks (what a waste) before realising that rebooting fixes the problem for a while.
ID: 69176 · Rating: 0
Robert D. Washburn
Joined: 3 Mar 20
Posts: 1
Credit: 676,963
RAC: 0
Message 69741 - Posted: 19 Apr 2020, 14:50:47 UTC

I am having the same long-run problems. In my experience, simply suspending and immediately restarting BOINC takes care of the problem for a time. I suspect that the problem occurs because of two issues. One is inadequate error handling in the code. A simple solution would be a timer that starts counting with each iteration; should the count get beyond some number, a suspend and restart occurs in that specific work unit. The other is numeric noise: my experience with engineering and scientific code has shown that coders sometimes do not appreciate how numeric noise corrupts the calculations. A simple example is SPICE code for electronic simulation. Use of double precision is necessary, but even then sometimes not sufficient. I re-coded the software in long doubles and almost never had a problem. Iterative solutions using various approaches with sparse matrices are fundamentally susceptible to these types of errors. Suspending the process dumps most of the computation at that time, accumulated errors included. I am confident these folks will get this straightened out.
BW
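
To illustrate the kind of numeric noise described above, here is a minimal sketch of rounding error building up in a long running sum. It is not taken from the N-Body code. It also swaps in a different remedy from the long doubles mentioned in the post: Kahan compensated summation, which stays in ordinary double precision. `kahan_sum` is a hypothetical helper written for this example.

```python
import math


def kahan_sum(values):
    """Sum floats while carrying a correction term for lost low-order bits."""
    total = 0.0
    compensation = 0.0
    for value in values:
        adjusted = value - compensation           # apply the previous correction
        new_total = total + adjusted              # low-order bits of adjusted may be lost here
        compensation = (new_total - total) - adjusted  # recover what was lost
        total = new_total
    return total


values = [0.1] * 1_000_000

# Plain left-to-right accumulation: every addition rounds, and the errors pile up.
naive = 0.0
for value in values:
    naive += value

kahan = kahan_sum(values)
reference = math.fsum(values)  # correctly rounded sum from the standard library

print("naive error:", abs(naive - reference))
print("kahan error:", abs(kahan - reference))
```

The plain loop drifts measurably after a million additions, while the compensated loop stays close to the reference. Whether anything like this plays a role in the hanging tasks is not established in this thread.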
ID: 69741 · Rating: 0
mikey
Joined: 8 May 09
Posts: 2541
Credit: 462,666,679
RAC: 142
Message 69743 - Posted: 19 Apr 2020, 20:43:19 UTC - in response to Message 69741.  

You can always PM an admin with your thoughts.
ID: 69743 · Rating: 0
Hurr1cane78
Joined: 7 May 14
Posts: 30
Credit: 51,502,561
RAC: 3
Message 69794 - Posted: 10 May 2020, 8:44:45 UTC

Hi all, I made a video on YouTube with instructions for running multiple instances at full load on a Radeon VII:
RADEON VII GIGABYTE// 3 Instances_ Milkyway@home WUs BOINC_ 3_instances
https://www.youtube.com/watch?v=4xKy9wGKmz4
all the best and welcome to earth
ID: 69794 · Rating: 0
Stick
Joined: 8 Oct 07
Posts: 50
Credit: 1,802,105
RAC: 3,931
Message 69808 - Posted: 13 May 2020, 1:26:55 UTC

I almost started a new thread for my problem, but then I realized that I am encountering essentially the same issue as others who have posted here. That is, tasks running the Milkyway@home N-Body Simulation v1.76 (mt) windows_x86_64 app often hang for hours at a time until I discover them. Restarting BOINC always gets them going again. I used to think that the problem might be related to incompatibility with other programs I might be running, but I have since convinced myself that other programs are irrelevant. The problem seems to occur just as often when other programs are not running as when they are. Conversely, I sometimes run other programs and then check BOINC to find that N-Body tasks are still running OK. However, I do think that the problem is somehow related to the characteristics of specific tasks: some tasks require relatively few (0-2) restarts, while others may need restarting 8+ times.

Although I have been a MilkyWay contributor for 10+ years, my participation rate has recently increased substantially (due to the SETI hibernation). And as a result, this issue has become very annoying. I am also getting older and my memory isn't what it used to be. Case in point, I had completely forgotten that I had posted this message a little over 2 years ago - essentially reporting this exact same issue. The only difference being updated versions of the N-Body app, BOINC, and Windows, as well as the addition of a new computer.
ID: 69808 · Rating: 0
Jonathan Melusky
Joined: 20 May 18
Posts: 5
Credit: 1,164,182
RAC: 1,386
Message 69868 - Posted: 29 May 2020, 5:13:00 UTC

I wish I had read this thread a few days ago. On May 16, I got 52 de_nbody WUs, all due on May 28th. The WUs that were estimated to take 3 hours all seemed to function normally, but the WUs estimated at 8 hours had their countdowns run in reverse and would stall for hours at a time. I ended up having to abort 28 WUs, a few of them half-completed, because the deadline today was a few hours away, I could see they were not going to make it in time, and they were causing other projects to miss their deadlines as well.
ID: 69868 · Rating: 0
Alessio Susi
Joined: 8 Mar 15
Posts: 30
Credit: 77,377,061
RAC: 4,926
Message 69877 - Posted: 30 May 2020, 8:02:51 UTC

How can I set M@H to use the CPU only for N-Body Simulation and the GPU for Separation? Maybe with an app_config, but I don't know what I have to write.
ASUS X570 E-Gaming
AMD Ryzen 9 3950X, 16 core / 32 thread 4.4 GHz
AMD Radeon Sapphire RX 480 4GB Nitro+
Nvidia GTX 1080 Ti Gaming X Trio
4x16 GB Corsair Vengeance RGB 3466 MHz
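
For the app_config question above, here is a hedged sketch that generates an `app_config.xml` using the element names from the BOINC client configuration documentation. As far as I know, `app_config.xml` cannot by itself pin an application to a device (that choice normally lives in the project preferences on the website); what it can do is budget threads for a multithreaded app and set the share of a GPU each task claims. The app names `milkyway_nbody` and `milkyway`, the `mt` plan class, and the `--nthreads` option are assumptions here and should be checked against `client_state.xml` on your own machine. `build_app_config` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_app_config(nbody_app: str, nbody_threads: int,
                     gpu_app: str, tasks_per_gpu: int) -> str:
    """Return app_config.xml text limiting N-Body threads and sharing each GPU."""
    root = ET.Element("app_config")

    # Multithreaded (mt) CPU app: budget and pass an explicit thread count.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = nbody_app
    ET.SubElement(version, "plan_class").text = "mt"
    ET.SubElement(version, "avg_ncpus").text = str(nbody_threads)
    ET.SubElement(version, "cmdline").text = f"--nthreads {nbody_threads}"

    # GPU app: each task claims 1/tasks_per_gpu of a GPU plus one CPU thread.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = gpu_app
    gpu = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu, "gpu_usage").text = f"{1 / tasks_per_gpu:g}"
    ET.SubElement(gpu, "cpu_usage").text = "1"

    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


# App names below are assumptions; check client_state.xml for the real ones.
config_text = build_app_config("milkyway_nbody", 4, "milkyway", 2)
print(config_text)
```

The generated file would go in the project's folder inside the BOINC data directory, and the client picks it up after "Read config files" in BOINC Manager or a restart. Capping the N-Body thread count this way would also leave CPU threads free for other projects.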

ID: 69877 · Rating: 0
Curtis Owens
Joined: 30 Oct 13
Posts: 2
Credit: 71,812
RAC: 0
Message 70717 - Posted: 8 Apr 2021, 12:48:33 UTC - in response to Message 69741.  

I have the same issue. It is quite annoying to have to intervene. I thought the entire run was corrupt, so I have just been aborting it. It didn't even occur to me to suspend and restart BOINC. They really need a timer that compares estimated run time against actual usage to see whether the task is actually progressing; if it has failed, restart it at its last known good point and continue. If it still fails to make progress, abort automatically.
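
The policy suggested above (restart from the last good point, then abort if progress still does not advance) can be sketched as a small state machine. This is an illustration only and is not part of BOINC or the N-Body app; `StallWatchdog`, its field names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StallWatchdog:
    """Escalate from 'restart' to 'abort' when progress stops advancing."""
    stall_limit_s: float          # how long progress may stay flat
    max_restarts: int             # restarts allowed without any new progress
    best_fraction: float = 0.0    # highest fraction done seen so far
    last_advance_s: float = 0.0   # time of last progress (or last restart)
    restarts: int = 0

    def check(self, now_s: float, fraction_done: float) -> str:
        if fraction_done > self.best_fraction:
            # Progress beyond the previous high-water mark: reset everything.
            self.best_fraction = fraction_done
            self.last_advance_s = now_s
            self.restarts = 0
            return "ok"
        if now_s - self.last_advance_s < self.stall_limit_s:
            return "ok"
        if self.restarts >= self.max_restarts:
            return "abort"
        # Stalled: request a restart from the last checkpoint and re-arm the timer.
        self.restarts += 1
        self.last_advance_s = now_s
        return "restart"


watchdog = StallWatchdog(stall_limit_s=600, max_restarts=2)
samples = [(0, 0.0), (300, 0.2), (600, 0.4), (900, 0.4),
           (1200, 0.4), (1800, 0.35), (2400, 0.35)]
events = [watchdog.check(t, f) for t, f in samples]
print(events)
```

Progress is compared against a high-water mark because a restarted task resumes from its checkpoint and may briefly report a lower fraction than before; only progress past the old maximum counts as recovery.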
ID: 70717 · Rating: 0
Stick
Joined: 8 Oct 07
Posts: 50
Credit: 1,802,105
RAC: 3,931
Message 70721 - Posted: 9 Apr 2021, 20:55:04 UTC
Last modified: 9 Apr 2021, 21:02:10 UTC

This problem has existed for roughly 3 years and, as far as I can tell, no project administrators or moderators have ever responded to this thread. I first reported the problem on 2 Jun 2018 in this post. Then, on 13 May 2020, I reported it again in this post. To be clear, there is a problem with N-Body Simulation (mt) (3 CPUs) tasks hanging up. The problem existed with V1.68 and continues with V1.76. Over the last 3 years it has continued to crop up under all versions of BOINC and all versions of Windows that I have used, on 3 different computers.
ID: 70721 · Rating: 0
Stick
Joined: 8 Oct 07
Posts: 50
Credit: 1,802,105
RAC: 3,931
Message 70722 - Posted: 9 Apr 2021, 20:55:18 UTC
Last modified: 9 Apr 2021, 21:05:30 UTC

Deleted accidental double post.
ID: 70722 · Rating: 0
Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 131
Credit: 56,987,253
RAC: 69,734
Message 70723 - Posted: 9 Apr 2021, 21:19:11 UTC

Hey there,

Just checking in to say that I do monitor things and I'm aware of this thread. I'm not at all sure what the problem may be, but please know that we aren't ignoring you.

Best,
Tom
ID: 70723 · Rating: 0
Stick
Joined: 8 Oct 07
Posts: 50
Credit: 1,802,105
RAC: 3,931
Message 70724 - Posted: 10 Apr 2021, 0:12:32 UTC - in response to Message 70723.  

Tom,
Thanks for the reply! And it's good to know somebody is watching. If you read my earlier posts on the subject you know that the problem is easily gotten around by restarting BOINC. And, right now I am restarting BOINC 2 or 3 times a day. If there is anything you would like me to do before restarting, please let me know.
Stick
ID: 70724 · Rating: 0
Jim1348
Joined: 9 Jul 17
Posts: 88
Credit: 14,147,438
RAC: 503
Message 70725 - Posted: 10 Apr 2021, 2:16:59 UTC
Last modified: 10 Apr 2021, 2:18:03 UTC

I have never seen the problem, and though I have not run a very large number of N-body tasks, I have probably done enough to expect to see it.

I always suspect AV software when something is hanging up. It may be the "real time protection", which often still monitors processes even if you exclude the BOINC Data folder.
I use Microsoft Defender for Win10 and don't have problems with it.
ID: 70725 · Rating: 0
Stick
Joined: 8 Oct 07
Posts: 50
Credit: 1,802,105
RAC: 3,931
Message 70726 - Posted: 10 Apr 2021, 14:44:25 UTC - in response to Message 70725.  

Jim,
Thank you for the suggestion. I will change to Microsoft Defender (from Avast) on one of my computers to see if it makes a difference.
Stick
ID: 70726 · Rating: 0
Stick
Joined: 8 Oct 07
Posts: 50
Credit: 1,802,105
RAC: 3,931
Message 70727 - Posted: 10 Apr 2021, 19:16:17 UTC

After switching to MS Defender it didn't take long for a Milkyway@home N-Body Simulation 1.76 (3 CPUs) hangup to occur. But I had forgotten to exclude the BOINC folders. Restarting BOINC now with folders excluded.
ID: 70727 · Rating: 0
Jim1348
Joined: 9 Jul 17
Posts: 88
Credit: 14,147,438
RAC: 503
Message 70728 - Posted: 10 Apr 2021, 20:21:48 UTC - in response to Message 70727.  

"After switching to MS Defender it didn't take long for a Milkyway@home N-Body Simulation 1.76 (3 CPUs) hangup to occur. But I had forgotten to exclude the BOINC folders."

I haven't excluded them either. But you seem to be running mobile CPUs. Are the work units being suspended? I run my machines 24/7, since they are dedicated.
It could be one of the power-down tricks that Intel or Microsoft uses that causes the problem. I set my power options to "high performance" mode.
Good luck.
ID: 70728 · Rating: 0
Stick
Joined: 8 Oct 07
Posts: 50
Credit: 1,802,105
RAC: 3,931
Message 70730 - Posted: 13 Apr 2021, 1:14:20 UTC - in response to Message 70728.  

"But you seem to be running mobile CPUs. I run my machines 24/7, since they are dedicated.
It could be one of the power-down tricks that Intel or Microsoft uses that causes the problem. I set my power options to 'high performance' mode."
Thanks again for the reply. You are right. Both my multi-core computers are laptops. They are older and the batteries are shot. I run them plugged into the charger, pretty much 24/7 for BOINC. I run several different BOINC projects and it's only the 3 CPUs Nbody tasks that have any problems.
"Are the work units being suspended?"
BOINC does not show the hung-up tasks as suspended. They are shown as Running, with Elapsed time counting up but Progress frozen. Most Nbody tasks take 20 to 25 minutes to finish, so when I see one with a longer elapsed time, I restart BOINC. When BOINC restarts, the hung-up task starts running again, but its Elapsed time has been reset to a much earlier time (less than 20 minutes). I am guessing that only around 10% of tasks hang up. Some tasks hang up multiple times.
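
The manual routine described above (spot a task running well past its usual 20 to 25 minutes, then bounce it) could in principle be scripted. The sketch below only builds `boinccmd --task <project URL> <task name> suspend|resume` command lines and runs nothing unless asked to. Whether suspending and resuming a single task clears the hang the way a full BOINC restart does has not been tested in this thread. The helper names, the factor of 2, and the example URL and task name are all hypothetical.

```python
import subprocess


def is_overdue(elapsed_s: float, typical_s: float = 25 * 60,
               factor: float = 2.0) -> bool:
    """True when a task has run well past its usual duration."""
    return elapsed_s > typical_s * factor


def bounce_task(project_url: str, task_name: str,
                dry_run: bool = True) -> list[list[str]]:
    """Build (and optionally run) commands that suspend, then resume, one task."""
    base = ["boinccmd", "--task", project_url, task_name]
    commands = [base + ["suspend"], base + ["resume"]]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # needs boinccmd on PATH
    return commands


# Placeholder URL and task name; substitute the values shown by your own client.
commands = bounce_task("https://example.org/project/", "hypothetical_task_0")
for command in commands:
    print(" ".join(command))
```

Keeping `dry_run=True` as the default means the script only prints what it would do, which seems the safer starting point for something that touches running work units.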
ID: 70730 · Rating: 0

©2021 Astroinformatics Group