Welcome to MilkyWay@home

Stalled computation

lmeeny

Joined: 18 Nov 21
Posts: 3
Credit: 34,918,994
RAC: 0
Message 71901 - Posted: 8 Mar 2022, 18:07:47 UTC

Hello,

Lately on my laptop, active work units chug along fine with decreasing time remaining, then apparently go idle: CPU usage on each of the eight cores drops to a few percent while the elapsed time and the estimated time to completion keep increasing. I have to exit and restart BOINC to get usage back to normal.

More recently I get the following message in the event log.
"3/8/2022 12:52:02 PM | Milkyway@Home | Task de_nbody_08_31_2021_v176_40k__data__11_1645561443_646216_1 postponed for 600 seconds: Waiting to acquire slot directory lock. Another instance may be running."

According to Task Manager, I see only one instance.
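For anyone who wants to scan a saved copy of the event log for this message, here is a minimal Python sketch. It assumes the log lines follow the "timestamp | project | message" layout of the line quoted above; the function name `parse_postponed` is made up for illustration and is not part of BOINC.

```python
import re

# Layout assumed from the quoted event log line: "timestamp | project | message".
LOG_LINE = re.compile(
    r"^(?P<stamp>[^|]+?)\s*\|\s*(?P<project>[^|]*?)\s*\|\s*(?P<message>.*)$"
)
# Wording taken from the quoted "postponed" message.
POSTPONED = re.compile(
    r"^Task (?P<task>\S+) postponed for (?P<seconds>\d+) seconds: (?P<reason>.*)$"
)


def parse_postponed(line):
    """Return details of a 'postponed' event, or None if the line is something else."""
    entry = LOG_LINE.match(line.strip().strip('"'))
    if entry is None:
        return None
    event = POSTPONED.match(entry.group("message"))
    if event is None:
        return None
    return {
        "stamp": entry.group("stamp"),
        "project": entry.group("project"),
        "task": event.group("task"),
        "seconds": int(event.group("seconds")),
        "reason": event.group("reason"),
    }


sample = (
    "3/8/2022 12:52:02 PM | Milkyway@Home | Task "
    "de_nbody_08_31_2021_v176_40k__data__11_1645561443_646216_1 "
    "postponed for 600 seconds: Waiting to acquire slot directory lock. "
    "Another instance may be running."
)
result = parse_postponed(sample)
print(result["task"], result["seconds"])
```

Counting how often the same task name shows up this way would at least show whether one work unit keeps getting postponed or several do.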

I'd appreciate any thoughts on what I'm doing wrong.

Thank you,

Ed Machak

d_a_dempsey
Joined: 7 Jan 21
Posts: 14
Credit: 84,539,898
RAC: 4,680
Message 71916 - Posted: 10 Mar 2022, 14:52:37 UTC - in response to Message 71901.  

You're not alone. I'm having this happen on 4 separate computers, but each time it is an N-Body simulation. Number of cores/age of chip doesn't seem to matter.
Very frustrating to find that your crunching has been held hostage by these units for hours. I don't babysit my computers and shouldn't have to.

David

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,947,628
RAC: 22,118
Message 71925 - Posted: 11 Mar 2022, 11:30:29 UTC - in response to Message 71901.  
Last modified: 11 Mar 2022, 11:32:33 UTC

Since your PCs are hidden, only an admin can look at your tasks and see what's going on. To see what would be shared if you unhide them, click on my name, then View, then Computers, and you can see mine. Nothing personal about me is shared, just the PCs.

Kyle
Joined: 13 Feb 22
Posts: 1
Credit: 3,871,749
RAC: 0
Message 71929 - Posted: 11 Mar 2022, 21:21:58 UTC - in response to Message 71901.  

I'm having the exact same problem: I find the tasks get stuck after about an hour. If this doesn't get fixed I'm going to have to drop the project, as it's not only affecting this BOINC project but also delaying processing for other projects. I can't babysit my computer just to donate computation power.

d_a_dempsey
Joined: 7 Jan 21
Posts: 14
Credit: 84,539,898
RAC: 4,680
Message 71936 - Posted: 12 Mar 2022, 16:07:23 UTC - in response to Message 71901.  
Last modified: 12 Mar 2022, 16:07:47 UTC

Ed,

What event log options do you have selected? I'm not seeing any messages when mine stall.
David

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,031,557
RAC: 35,973
Message 71937 - Posted: 12 Mar 2022, 16:52:28 UTC - in response to Message 71936.  

Are you guys seeing stalls on both N-Body AND Separation?

d_a_dempsey
Joined: 7 Jan 21
Posts: 14
Credit: 84,539,898
RAC: 4,680
Message 71949 - Posted: 14 Mar 2022, 14:17:27 UTC - in response to Message 71937.  

I'm only getting them on N-Body, and a lot of them. No issues with Separation.

David

corysmath
Joined: 18 Feb 22
Posts: 5
Credit: 1,138,901
RAC: 0
Message 72766 - Posted: 13 Apr 2022, 20:24:50 UTC

I'm having problems with some tasks that go on indefinitely. The time remaining goes up rather than down, so the task never finishes. I have to abort the task so others can run. This happens on several Windows 10 computers. I don't know if it is just N-Body or not. I will have to shift to other projects if this problem persists.

corysmath
Joined: 18 Feb 22
Posts: 5
Credit: 1,138,901
RAC: 0
Message 72771 - Posted: 14 Apr 2022, 1:40:03 UTC - in response to Message 72766.  

Here is an example of an N-Body task that stalled:
task de_nbody_08_31_2021_v176_40k__data__12_1647295263_10374507_0

After it stalled, the CPU usage was just a few percent, suggesting computation was truly stalled.

I tried a suspend and resume, but it remained stalled at 20.843% done.

I also tried upping the CPU time usage from 75% to 100%, but it still remained stalled.

After looking through earlier comments in this thread I tried the following:
I exited from the BOINC manager and restarted it.

This worked; the task progressed normally and ran to completion.

Hope this example may help if someone wants to dig into what's happening more than I can.

XaurreauX
Joined: 14 Feb 20
Posts: 1
Credit: 7,430
RAC: 0
Message 72844 - Posted: 15 Apr 2022, 14:59:34 UTC

I would like to know why a file that is supposed to download in, say, 28 minutes is taking several hours--or more--to complete.

Thanks,

XaurreauX

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,947,628
RAC: 22,118
Message 72880 - Posted: 16 Apr 2022, 9:53:54 UTC - in response to Message 72844.  

Server problems on MilkyWay's end.

corysmath
Joined: 18 Feb 22
Posts: 5
Credit: 1,138,901
RAC: 0
Message 73068 - Posted: 21 Apr 2022, 20:59:45 UTC - in response to Message 72766.  

I have had several more stalled n-body tasks today. The latest stalled at 14.004% done after running for over an hour, blocking all other BOINC tasks from running, so I will have to abort it and maybe switch to another project.

Here are the stalled task details:

Application: Milkyway@home N-Body Simulation 1.82 (mt)
Name: de_nbody_08_31_2021_v176_40k__data__11_1647295263_13261164
State: Running
Received: 4/18/2022 9:20:16 AM
Report deadline: 4/30/2022 9:20:15 AM
Resources: 8 CPUs
Estimated computation size: 12,303 GFLOPs
CPU time: 00:08:57
CPU time since checkpoint: 00:02:01
Elapsed time: 01:27:29
Estimated time remaining: 08:57:12
Fraction done: 14.005%
Virtual memory size: 11.23 MB
Working set size: 13.29 MB
Directory: slots/4
Process ID: 6096
Progress rate: 9.720% per hour
Executable: milkyway_nbody_1.82_windows_x86_64__mt.exe
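
One way to put a number on "stalled" from these properties is to compare CPU time with elapsed time across the allocated CPUs. A rough Python sketch, assuming the reported CPU time is summed over all threads (I have not confirmed how BOINC accounts it for multithreaded tasks); the helper names are made up for illustration:

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_utilization(cpu_time, elapsed_time, n_cpus):
    """Fraction of the available CPU-seconds the task actually used.

    Assumes cpu_time is the total across all threads.
    """
    return hms_to_seconds(cpu_time) / (hms_to_seconds(elapsed_time) * n_cpus)


# Values copied from the task properties above.
util = cpu_utilization("00:08:57", "01:27:29", 8)
print(f"{util:.2%}")  # prints 1.28%
```

A task that is really crunching on 8 CPUs should be far closer to 100% than to 1%, which fits the "few percent" CPU usage seen in Task Manager.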

corysmath
Joined: 18 Feb 22
Posts: 5
Credit: 1,138,901
RAC: 0
Message 73069 - Posted: 22 Apr 2022, 0:52:16 UTC - in response to Message 73068.  

This n-body task stalled at just 0.038% done with 232 days remaining. I give up but will check back later.

Application: Milkyway@home N-Body Simulation 1.82 (mt)
Name: de_nbody_08_31_2021_v176_40k__data__13_1647295263_13271884
State: Project suspended by user
Received: 4/18/2022 9:20:16 AM
Report deadline: 4/30/2022 9:20:15 AM
Resources: 8 CPUs
Estimated computation size: 12,267 GFLOPs
CPU time: 00:05:00
CPU time since checkpoint: 00:03:33
Elapsed time: 02:06:50
Estimated time remaining: 233d 03:26:21
Fraction done: 0.038%
Virtual memory size: 11.22 MB
Working set size: 5.24 MB
Directory: slots/4
Process ID: 13236
Executable: milkyway_nbody_1.82_windows_x86_64__mt.exe
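
The huge remaining-time figure is roughly what a plain linear extrapolation from elapsed time and fraction done gives. The sketch below is only that simple extrapolation; BOINC's own estimator may use a different formula, and the function name is made up for illustration:

```python
def estimate_remaining_days(elapsed_seconds, fraction_done):
    """Linear extrapolation: if fraction_done took elapsed_seconds, how long is left?"""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total_seconds = elapsed_seconds / fraction_done
    return (total_seconds - elapsed_seconds) / 86400  # 86400 seconds per day


# Values copied from the task properties above.
elapsed = 2 * 3600 + 6 * 60 + 50  # Elapsed time 02:06:50
days = estimate_remaining_days(elapsed, 0.038 / 100)  # Fraction done 0.038%
print(f"about {days:.0f} days")  # prints about 232 days
```

So the estimate looks like a symptom of the stall (elapsed time growing while the fraction done stays put), not a separate problem.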

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,947,628
RAC: 22,118
Message 73071 - Posted: 22 Apr 2022, 9:59:56 UTC - in response to Message 73069.  

What else does this PC do besides crunch MilkyWay N-Body tasks, and is it doing that when this happens? I just ran through a few N-Body tasks on my laptop and had zero slowdown, but the laptop was not doing anything else at the time, and it wasn't using 100% of the CPUs either. I ran some using 12 cores and then a few more with 2 cores, running multiple tasks at once, but I also have BOINC set to use a max of 75% of the CPU cores and then only 90% of the CPU time.

corysmath
Joined: 18 Feb 22
Posts: 5
Credit: 1,138,901
RAC: 0
Message 73323 - Posted: 5 May 2022, 5:05:52 UTC - in response to Message 72771.  

Turns out I almost fixed the stalling problem.

A preventative workaround is to set the CPU time usage to 100% in the computing preferences, if it is set lower than that. After doing this, none of the N-Body tasks that I have run have stalled on the PCs that previously did stall, all of which have Intel i7 CPUs.
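
For machines where clicking through the Manager is inconvenient, the same setting can probably be applied through a local preferences override file. The Python sketch below only builds the XML text. It assumes the BOINC client reads a `global_prefs_override.xml` file from its data directory and that `cpu_usage_limit` and `max_ncpus_pct` are the element names for "% of CPU time" and "% of the CPUs", which is my understanding of the client documentation; please verify against your client version before relying on it. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(cpu_time_pct=100.0, ncpus_pct=None):
    """Build the text of a preferences override fragment.

    Element names are assumptions taken from the BOINC client docs, not from this thread.
    """
    for value in (cpu_time_pct, ncpus_pct):
        if value is not None and not 0 < value <= 100:
            raise ValueError("percentages must be in (0, 100]")
    root = ET.Element("global_preferences")
    # "Use at most X% of CPU time" (the setting changed in the workaround above).
    ET.SubElement(root, "cpu_usage_limit").text = f"{cpu_time_pct:.1f}"
    if ncpus_pct is not None:
        # "Use at most X% of the CPUs" (left untouched unless requested).
        ET.SubElement(root, "max_ncpus_pct").text = f"{ncpus_pct:.1f}"
    return ET.tostring(root, encoding="unicode")


override_xml = build_prefs_override(cpu_time_pct=100.0)
print(override_xml)
```

The sketch deliberately does not write the file; where the data directory lives and how the client is told to re-read preferences depend on the installation.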

As noted in my earlier post, if an N-Body task has already stalled, the only fix I know is to exit BOINC and restart.

Note setting CPU time usage to the maximum 100% will likely increase the CPU temperature. I was able to bring it down by eliminating overclocking and setting performance options to maximize stability and save energy.

Some older PCs with dual- and quad-core CPUs (both Intel and AMD) have never stalled on N-Body tasks, so I have left the CPU time usage at the default 75% on those.

May the 4th be with you.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,947,628
RAC: 22,118
Message 73330 - Posted: 5 May 2022, 12:43:21 UTC - in response to Message 73323.  

I'm glad you found the solution!!

Juhis
Joined: 12 Mar 22
Posts: 2
Credit: 979,779
RAC: 0
Message 73811 - Posted: 10 Jun 2022, 13:51:49 UTC - in response to Message 73330.  

The N-Body task runs forever. If I stop and restart it, then it finishes in just a few minutes.
I'll try changing the computing preferences. If that does not help, I will have to quit running Milkyway.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,881
RAC: 267
Message 73812 - Posted: 10 Jun 2022, 14:43:48 UTC - in response to Message 73811.  

Have you allocated all of your CPUs to it? Try a reduced number, like 4 if you have 8, i.e. 50% of your resources.

Juhis
Joined: 12 Mar 22
Posts: 2
Credit: 979,779
RAC: 0
Message 73814 - Posted: 11 Jun 2022, 15:05:20 UTC - in response to Message 73812.  

Yes, I have. I dropped CPU usage to 80% and I still have the same problem.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,881
RAC: 267
Message 73815 - Posted: 11 Jun 2022, 15:14:24 UTC - in response to Message 73814.  

How many cores/processors have you got?

©2024 Astroinformatics Group