Welcome to MilkyWay@home

WUs not downloaded in time - rig is idling - doing no work ...

Message boards : Number crunching : WUs not downloaded in time - rig is idling - doing no work ...

Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 555,638,306
RAC: 42,264
Message 69051 - Posted: 18 Sep 2019, 3:02:39 UTC

Anybody running MW exclusively on the BOINC 7.16.1 client and see very different pull requests? There are significant changes to work_fetch.cpp that eliminate a lot of the previous issues.

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69052 - Posted: 18 Sep 2019, 13:50:24 UTC - in response to Message 69051.  
Last modified: 18 Sep 2019, 13:51:28 UTC

Anybody running MW exclusively on the BOINC 7.16.1 client and see very different pull requests? There are significant changes to work_fetch.cpp that eliminate a lot of the previous issues.


I cannot find a Windows 7.16.1 client. I poked around over at GitHub but didn't find an executable. I can no longer run 18.04 with my AMD S9000 boards (a long and sad story) and do not have the expertise to cross-compile the 7.16.1 source for Windows. Do you know where to find the Windows 7.16.1 build?

Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 555,638,306
RAC: 42,264
Message 69053 - Posted: 18 Sep 2019, 18:42:00 UTC - in response to Message 69052.  
Last modified: 18 Sep 2019, 18:47:18 UTC

Anybody running MW exclusively on the BOINC 7.16.1 client and see very different pull requests? There are significant changes to work_fetch.cpp that eliminate a lot of the previous issues.


I cannot find a Windows 7.16.1 client. I poked around over at GitHub but didn't find an executable. I can no longer run 18.04 with my AMD S9000 boards (a long and sad story) and do not have the expertise to cross-compile the 7.16.1 source for Windows. Do you know where to find the Windows 7.16.1 build?


GitHub contains the source code. It does not house any executables. You download the source code and compile it yourself.

The client 7.16.1 branch can be found with the client tag on GitHub. Just click on the branch arrow and scroll down to the client_release/7.16.1
https://github.com/BOINC/boinc/tree/client_release/7/7.16

The Windows build artifacts are over at AppVeyor.
https://ci.appveyor.com/api/buildjobs/4bvvgoug1ej0x5mh/artifacts/deploy%2Fwin-client%2Fwin-client_master_2019-09-18_15ffc98a.7z

gambatesa
Joined: 23 Feb 18
Posts: 26
Credit: 4,744,416,145
RAC: 0
Message 69069 - Posted: 19 Sep 2019, 16:05:05 UTC - in response to Message 69050.  

Yes, this was reported in May. From the last result there needs to be a 10min period of no requests until the clients can get more work.
https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4424&postid=68441#68441

I'm about to setup a script to turn off networking for like 11min, resume and do a project update, allow for 30min or so then repeat.


That's not true. If you report the last results immediately and wait out the first deferred communication interval of 1 minute and 30 seconds, the new workunits will arrive.

Automatic:
102726 Milkyway@Home 19/09/2019 15:37:18 Sending scheduler request: To fetch work.
102727 Milkyway@Home 19/09/2019 15:37:18 Reporting 5 completed tasks
102728 Milkyway@Home 19/09/2019 15:37:18 Requesting new tasks for AMD/ATI GPU
102729 Milkyway@Home 19/09/2019 15:37:25 Scheduler request completed: got 0 new tasks
102730 Milkyway@Home 19/09/2019 15:48:00 Sending scheduler request: To fetch work.
102731 Milkyway@Home 19/09/2019 15:48:00 Requesting new tasks for AMD/ATI GPU
102732 Milkyway@Home 19/09/2019 15:48:22 Scheduler request completed: got 300 new tasks


with manual intervention:
104225 Milkyway@Home 19/09/2019 17:53:10 update requested by user
104226 Milkyway@Home 19/09/2019 17:53:13 Sending scheduler request: Requested by user.
104227 Milkyway@Home 19/09/2019 17:53:13 Reporting 1 completed tasks
104228 Milkyway@Home 19/09/2019 17:53:13 Requesting new tasks for AMD/ATI GPU
104229 Milkyway@Home 19/09/2019 17:53:15 Scheduler request completed: got 0 new tasks
104230 Milkyway@Home 19/09/2019 17:53:15 Not sending work - last request too recent: 17 sec
104231 Milkyway@Home 19/09/2019 17:54:50 Sending scheduler request: To fetch work.
104232 Milkyway@Home 19/09/2019 17:54:50 Requesting new tasks for AMD/ATI GPU
104233 Milkyway@Home 19/09/2019 17:54:56 Scheduler request completed: got 300 new tasks

This idle time can be minimized manually, but that isn't the solution: reported results should be replaced with new workunits right away, rather than waiting until the queue is empty, waiting out the timeouts, and then downloading a full stack.
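The waits in the two log excerpts above can be measured rather than estimated. The following is only a minimal sketch: it assumes the event-log layout shown in this post (sequence number, project, day/month/year date, 24-hour time, message), and the helper names `replies` and `wait_after_empty_reply` are made up for illustration.

```python
from datetime import datetime

# Scheduler replies copied from the two excerpts above (day/month/year, 24-hour clock).
AUTOMATIC = """\
102729 Milkyway@Home 19/09/2019 15:37:25 Scheduler request completed: got 0 new tasks
102732 Milkyway@Home 19/09/2019 15:48:22 Scheduler request completed: got 300 new tasks
"""
MANUAL = """\
104229 Milkyway@Home 19/09/2019 17:53:15 Scheduler request completed: got 0 new tasks
104233 Milkyway@Home 19/09/2019 17:54:56 Scheduler request completed: got 300 new tasks
"""


def replies(log_text):
    """Yield (timestamp, tasks_received) for each 'Scheduler request completed' line."""
    for line in log_text.splitlines():
        if "Scheduler request completed: got" not in line:
            continue
        fields = line.split()
        stamp = datetime.strptime(fields[2] + " " + fields[3], "%d/%m/%Y %H:%M:%S")
        tasks = int(line.split("got ")[1].split()[0])
        yield stamp, tasks


def wait_after_empty_reply(log_text):
    """Seconds from the first reply with 0 tasks to the next reply that delivered work."""
    empty_at = None
    for stamp, tasks in replies(log_text):
        if tasks == 0 and empty_at is None:
            empty_at = stamp
        elif tasks > 0 and empty_at is not None:
            return (stamp - empty_at).total_seconds()
    return None


print(wait_after_empty_reply(AUTOMATIC))  # 657.0 seconds, i.e. 10 min 57 s
print(wait_after_empty_reply(MANUAL))     # 101.0 seconds, i.e. 1 min 41 s
```

On these two excerpts the automatic cycle waited roughly eleven minutes for work, while the manually triggered one waited under two minutes.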
Want your Kids stay off from Drugs? Get them building Crunching PC's and they'll never have enough money for drugs

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69070 - Posted: 19 Sep 2019, 17:02:22 UTC - in response to Message 69053.  


The Windows build artifacts are over at AppVeyor.
https://ci.appveyor.com/api/buildjobs/4bvvgoug1ej0x5mh/artifacts/deploy%2Fwin-client%2Fwin-client_master_2019-09-18_15ffc98a.7z


Just got around to downloading. It is 7.15.0

does that have the new feature that 16.1 has?

Going to let it run for a while and see what happens.

thanks!

Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 555,638,306
RAC: 42,264
Message 69071 - Posted: 19 Sep 2019, 17:54:25 UTC - in response to Message 69070.  
Last modified: 19 Sep 2019, 17:59:05 UTC


The Windows build artifacts are over at AppVeyor.
https://ci.appveyor.com/api/buildjobs/4bvvgoug1ej0x5mh/artifacts/deploy%2Fwin-client%2Fwin-client_master_2019-09-18_15ffc98a.7z


Just got around to downloading. It is 7.15.0

does that have the new feature that 16.1 has?

Going to let it run for a while and see what happens.

thanks!

It is made from the current master, which is 7.15.0 and has the same work_fetch.cpp module that is in the client_release/7.16.1. So in that regard they are the same. Mainly, the client has the commit for the bugfix I requested, #3076:
https://github.com/BOINC/boinc/commit/0b5bae4cc98660538b76842dea8b5cf4a16d06f6

There are other changes in 7.16.1 compared to the master 7.15.0, but in areas I don't think have an impact on the inability of the client to request work for work turned in at MW. That is why I suggested running one of the later artifacts that has the latest code in work_fetch.cpp. All I ask is that someone run it and see if anything changes for work scheduling.

[Edit]Just wanted to point out that the changes to work_fetch.cpp didn't just cover the issue with using a max_concurrent statement. DA also implemented many lines of new code specifically calling rr_simulation routines. It's those routines that help determine the correct shortfall in work requested and the part that I think will have the greatest effect on work requested at MW.

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69076 - Posted: 19 Sep 2019, 20:44:50 UTC - in response to Message 69071.  
Last modified: 19 Sep 2019, 20:45:58 UTC


There are other changes in 7.16.1 compared to the master 7.15.0, but in areas I don't think have an impact on the inability of the client to request work for work turned in at MW. That is why I suggested running one of the later artifacts that has the latest code in work_fetch.cpp. All I ask is that someone run it and see if anything changes for work scheduling.


Did not affect the problem where I run out of data. I was hoping that after each upload there would be a few downloads, but no, my BT task had to issue an update as shown: between lines 2186 and 2190

2183	Milkyway@Home	9/19/2019 2:27:45 PM	Computation for task de_modfit_86_bundle4_4s_south4s_bgset_2_1564052102_18863666_0 finished	
2184	Milkyway@Home	9/19/2019 2:28:16 PM	Sending scheduler request: To fetch work.	
2185	Milkyway@Home	9/19/2019 2:28:16 PM	Reporting 16 completed tasks	
2186	Milkyway@Home	9/19/2019 2:28:16 PM	Requesting new tasks for AMD/ATI GPU	
2187	Milkyway@Home	9/19/2019 2:28:19 PM	Scheduler request completed: got 0 new tasks	
2188	Milkyway@Home	9/19/2019 2:30:59 PM	update requested by user	
2189	Milkyway@Home	9/19/2019 2:31:00 PM	Sending scheduler request: Requested by user.	
2190	Milkyway@Home	9/19/2019 2:31:00 PM	Requesting new tasks for AMD/ATI GPU	
2191	Milkyway@Home	9/19/2019 2:31:04 PM	Scheduler request completed: got 900 new tasks	
2192	Milkyway@Home	9/19/2019 2:31:07 PM	Starting task de_modfit_80_bundle4_4s_south4s_bgset_2_1564052102_18598672_1	


However, I could disable my script and see how long it takes before the client asks again on its own. I will try that.
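For readers without BoincTasks, a forced project update of the kind discussed in this thread can be sketched with the stock `boinccmd` tool. This is only an illustration, not the script used above: the project URL is an assumption (use the URL your client shows for the project), and the function names are hypothetical.

```python
import subprocess

# Assumption: the project URL exactly as the client has it attached.
PROJECT_URL = "https://milkyway.cs.rpi.edu/milkyway/"


def update_command(project_url=PROJECT_URL, boinccmd="boinccmd"):
    """Build the boinccmd call that asks the client to contact the project now."""
    return [boinccmd, "--project", project_url, "update"]


def request_update(run=subprocess.run, project_url=PROJECT_URL):
    """Issue one project update.

    `run` is injectable so the logic can be exercised without a BOINC client.
    """
    return run(update_command(project_url), check=False)
```

Calling `request_update()` from cron or Windows Task Scheduler every few minutes would mimic the manual "update requested by user" lines in the log. It only shortens the idle gap; it does not remove the cause.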

Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 555,638,306
RAC: 42,264
Message 69081 - Posted: 20 Sep 2019, 0:00:32 UTC - in response to Message 69076.  

Sorry to hear that didn't accomplish anything. I thought I'd give it a shot, what the heck. I assume that if you enable the sched_op_debug flag in the logging options, it shows the client asking for 0 seconds of GPU work? Have you used the work_fetch_debug flag to see what the shortfalls are for each component and project?

I am beginning to think that the consensus opinion here that the server is misconfigured and doesn't allow requests for work at the same time work is reported is correct.
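The two debug flags mentioned above live in the `<log_flags>` section of the client's cc_config.xml (the client spells the first one `sched_op_debug`). A minimal sketch that generates such a fragment follows; the function name is made up, and the output should be merged by hand into an existing cc_config.xml, not written over it.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags=("sched_op_debug", "work_fetch_debug")):
    """Return a cc_config.xml fragment that switches the given log flags on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


print(build_cc_config())
# <cc_config><log_flags><sched_op_debug>1</sched_op_debug><work_fetch_debug>1</work_fetch_debug></log_flags></cc_config>
```

After editing the file in the BOINC data directory, `boinccmd --read_cc_config` makes a running client pick up the change without a restart.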

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69085 - Posted: 20 Sep 2019, 1:07:07 UTC - in response to Message 69081.  

Sorry to hear that didn't accomplish anything


You spoke too soon (maybe)

Just looked at results and at the very least it seems to be only in the order of 12 to 15 minutes, not a huge amount though some may consider that a lifetime ;<)

4360	Milkyway@Home	9/19/2019 5:00:25 PM	Computation for task de_modfit_14_bundle5_testing_4s3f_2_1564052102_18633077_1 finished	
4361	Milkyway@Home	9/19/2019 5:01:07 PM	Sending scheduler request: To fetch work.	
4362	Milkyway@Home	9/19/2019 5:01:07 PM	Reporting 15 completed tasks	
4363	Milkyway@Home	9/19/2019 5:01:07 PM	Requesting new tasks for AMD/ATI GPU	
4364	Milkyway@Home	9/19/2019 5:01:10 PM	Scheduler request completed: got 0 new tasks	
4365	Milkyway@Home	9/19/2019 5:12:50 PM	Sending scheduler request: To fetch work.	
4366	Milkyway@Home	9/19/2019 5:12:50 PM	Requesting new tasks for AMD/ATI GPU	
4367	Milkyway@Home	9/19/2019 5:12:53 PM	Scheduler request completed: got 900 new tasks	


Note there is no user request for update, so the scheduler took about 12 minutes and then issued a new request for data. At least it was not an hour or more.
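To put that wait in proportion, the timestamps from this log and from the earlier excerpt in this thread can be turned into an idle fraction. This is a rough sketch under one stated assumption: that the 900 tasks fetched at 2:31:04 PM are the same cache that ran dry at 5:00:25 PM.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"  # timestamp format used in these event-log excerpts


def seconds_between(start, end):
    """Elapsed seconds between two event-log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Timestamps copied from the log excerpts posted in this thread.
got_900_first = "9/19/2019 2:31:04 PM"   # earlier refill (log line 2191)
last_task_done = "9/19/2019 5:00:25 PM"  # last task finished (log line 4360)
empty_reply = "9/19/2019 5:01:10 PM"     # reported, got 0 new tasks (log line 4364)
got_900_again = "9/19/2019 5:12:53 PM"   # next refill (log line 4367)

wait = seconds_between(empty_reply, got_900_again)     # wait after the empty reply
idle = seconds_between(last_task_done, got_900_again)  # time with nothing to crunch
cycle = seconds_between(got_900_first, got_900_again)  # one refill-to-refill cycle

print(wait, idle, cycle)              # 703.0 748.0 9709.0
print(round(100 * idle / cycle, 1))   # 7.7 (percent of the cycle spent idle)
```

Under that assumption the rig idles for about 12.5 minutes out of every 2 h 42 min, or a little under 8% of the time.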

Looks like this on my data analysis graph. The 6 minute delay was from my BT update script which shaved off about 9 minutes of idle.

Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 555,638,306
RAC: 42,264
Message 69086 - Posted: 20 Sep 2019, 3:18:43 UTC

OK, that is encouraging. Still wondering where that 12 minute timeout is coming from. I doubt it is a hard coded delay in the client. Could it be that the reason was the one espoused by the project scientist that stated you could just be hitting the RTS buffer when it is empty?

gambatesa
Joined: 23 Feb 18
Posts: 26
Credit: 4,744,416,145
RAC: 0
Message 69093 - Posted: 20 Sep 2019, 17:27:11 UTC

When you run out of WUs, the first deferred communication is always more or less 1:40 minutes. In this first stage it reports the latest results (and doesn't download anything).

Then the second deferred communication begins, of more or less 12 minutes (can't remember). The GPU is idling, and then, since there are no results to report, the download of 300 WUs begins.

This loop is always the same on all hosts. Next time I run out of WUs I will post the exact times.
Want your Kids stay off from Drugs? Get them building Crunching PC's and they'll never have enough money for drugs

Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 555,638,306
RAC: 42,264
Message 69094 - Posted: 20 Sep 2019, 17:59:03 UTC - in response to Message 69093.  

Yes, it is hard for me to diagnose these issues, as I have never seen the behavior described. But I don't run MW exclusively on my hosts. They always have around 900 tasks in their cache and just keep topping off to reach 900. The only time I ever ran out of tasks was when the project was offline and I crunched through the entire cache. But as soon as the project came back, the first scheduler connection started refilling back to 900. I run a spoofed client compiled from the latest source, so I don't know if that is what insulates me. However, I don't remember this issue when I ran the bone stock 7.14.2 client either.

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69096 - Posted: 20 Sep 2019, 21:04:11 UTC - in response to Message 69094.  
Last modified: 20 Sep 2019, 21:05:22 UTC

Yes, it is hard for me to diagnose these issues as I have never seen the behavior described


Perhaps your Linux version is different enough from Windows that the timing and scheduling functions are causing the problem only in Windows. I tried using the 7.16.1 Linux app with my AMD S9x00 boards but gave up after about 7 attempts using different AMD drivers and kernels. I had it running for a short time (an hour) on an old 18.04.1 with a 4.x kernel, but an upgrade ruined things; I never got kernel 5.00x to work and was unable to go back to the original 18.04.1, as my USB install was only "live". I had failed to install the full package when downloading to the USB flash drive earlier in the year. Also, your FP64 2080 specs are still only 1/3 of my S9100 specs, so possibly you are not in the timing area where the problem occurs. Just a guess.

mmonnin
Joined: 2 Oct 16
Posts: 167
Credit: 1,008,062,758
RAC: 1,668
Message 69098 - Posted: 21 Sep 2019, 1:53:13 UTC - in response to Message 69093.  

When you run out of WUs, the first deferred communication is always more or less 1:40 minutes. In this first stage it reports the latest results (and doesn't download anything).

Then the second deferred communication begins, of more or less 12 minutes (can't remember). The GPU is idling, and then, since there are no results to report, the download of 300 WUs begins.

This loop is always the same on all hosts. Next time I run out of WUs I will post the exact times.


There is nothing to actually upload. Set the client to no networking and the tasks go straight to Waiting to Report and not Uploading. There are no data files to download or upload for this project per task.

Chooka
Joined: 13 Dec 12
Posts: 101
Credit: 1,782,758,310
RAC: 0
Message 69099 - Posted: 21 Sep 2019, 4:44:27 UTC - in response to Message 69020.  

The problem after 5 months is still present. Any update?

There has been some discussion of this in the News thread 30 Workunit Limit Per Request - Fix Implemented. However, that seems to have gone quiet - perhaps the "to do" list is rather long at present?!?

To summarize, what appears to happen is that if your BOINC client sends in an update to report completed tasks and ask for new work MilkyWay spits the work request out! If another update is requested after the wait time (90 seconds?) but before there's another completed task to report you'll almost certainly get some work then. Unfortunately, if you can process a work-unit in under 90 seconds, there'll be another report (and no work) and you'll get the empty queue problem anyway.

Of the projects I do work for, this seems to be peculiar to MilkyWay! In particular, SETI@Home (with relatively short work-units and a longer wait time (5 minutes)) doesn't have this issue... Indeed, someone suggested that perhaps the SETI@Home people might be able to advise on possible set-up issues.

For more details look at that News thread, especially the second page.

Cheers - Al.


PrimeGrid has some very short workunits as well and doesn't have this problem either!!
One would think the Admins would talk to each other rather than just try and muddle thru on their own, this isn't the stone age!!


Agree with Mikey. Other projects with quick work units don't have this issue. What makes M@H so unique?
It is frustrating a little to have great crunching gear but it sits idle. One reason I like Einstein@Home.

It just works. Never any hiccups.
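The behaviour summarized in the quoted post above can be illustrated with a toy model. To be clear about what is assumed: the rule "a request that also reports finished tasks gets no new work" is the thread's hypothesis, not confirmed server code; the fixed 90-second contact interval is the guess from the quote; and the function and parameter names are invented for this sketch.

```python
def idle_seconds(task_seconds, backoff_seconds=90, cache_limit=300, horizon=24 * 3600):
    """Toy model: seconds a single GPU sits idle over `horizon` seconds.

    Hypothetical server rule (the thread's guess, not confirmed project code):
    a scheduler request that also reports finished tasks receives no new work.
    """
    queue = cache_limit   # tasks on hand, including the one running
    progress = 0          # seconds spent on the running task
    unreported = 0        # finished tasks not yet reported
    idle = 0
    for t in range(1, horizon + 1):
        if queue > 0:
            progress += 1
            if progress >= task_seconds:
                queue -= 1
                progress = 0
                unreported += 1
        else:
            idle += 1
        if t % backoff_seconds == 0:   # client contacts the scheduler
            if unreported > 0:
                unreported = 0         # report accepted, work request refused
            elif queue < cache_limit:
                queue = cache_limit    # nothing to report, so the cache is refilled
    return idle


print(idle_seconds(task_seconds=40))   # tasks shorter than the contact interval
print(idle_seconds(task_seconds=200))  # tasks longer than the contact interval
```

In this model, tasks shorter than the contact interval always leave something to report, so the cache drains to zero before any work arrives. Longer tasks leave gaps with nothing to report, and the cache is topped up in those gaps. That matches the pattern described in the quote, but it does not show that the server actually works this way.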


Chooka
Joined: 13 Dec 12
Posts: 101
Credit: 1,782,758,310
RAC: 0
Message 69100 - Posted: 21 Sep 2019, 4:47:32 UTC - in response to Message 69093.  

Wow gambatesa!
I just noticed you have 1.8 Billion credits and only a 1 year badge!
Then I read your signature about keeping kids off drugs and it all made sense! LOL

Dedication.


mikey
Joined: 8 May 09
Posts: 3339
Credit: 524,010,781
RAC: 0
Message 69102 - Posted: 21 Sep 2019, 12:28:47 UTC - in response to Message 69096.  

Yes, it is hard for me to diagnose these issues as I have never seen the behavior described


Perhaps your Linux version is different enough from Windows that the timing and scheduling functions are causing the problem only in Windows. I tried using the 7.16.1 Linux app with my AMD S9x00 boards but gave up after about 7 attempts using different AMD drivers and kernels. I had it running for a short time (an hour) on an old 18.04.1 with a 4.x kernel, but an upgrade ruined things; I never got kernel 5.00x to work and was unable to go back to the original 18.04.1, as my USB install was only "live". I had failed to install the full package when downloading to the USB flash drive earlier in the year. Also, your FP64 2080 specs are still only 1/3 of my S9100 specs, so possibly you are not in the timing area where the problem occurs. Just a guess.


No I have the same problem in both my Linux and Windows pc's!!

Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 555,638,306
RAC: 42,264
Message 69103 - Posted: 21 Sep 2019, 19:55:59 UTC - in response to Message 69070.  


The Windows build artifacts are over at AppVeyor.
https://ci.appveyor.com/api/buildjobs/4bvvgoug1ej0x5mh/artifacts/deploy%2Fwin-client%2Fwin-client_master_2019-09-18_15ffc98a.7z


Just got around to downloading. It is 7.15.0

does that have the new feature that 16.1 has?

Going to let it run for a while and see what happens.

thanks!

The client is up to 7.16.2 now and has about 60 more commits from the master added to it. A lot of polishing. The close with the red X issue seems to be fixed now. From what I hear the only thing hanging up this as the new master release is the translations are still waiting to come in.

©2024 Astroinformatics Group