Welcome to MilkyWay@home

Server Downtime March 28, 2022 (12 hours starting 00:00 UTC)


Kiska

Joined: 31 Mar 12
Posts: 88
Credit: 149,784,912
RAC: 64,881
100 million credit badge · 10 year member badge
Message 72428 - Posted: 1 Apr 2022, 20:34:21 UTC - in response to Message 72427.  
Last modified: 1 Apr 2022, 20:35:25 UTC

Probably a while more, I am getting tasks generated on the 16th of March :D
Doesn't that mean the server has caught up, but we haven't?


Server has caught up with work that was sent back. But tasks generated on the 16th of March haven't been sent out yet.

Server has done the heavy lifting of validating work that we've completed; now we can compute knowing stuff is running normally.
ID: 72428 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,470,899
RAC: 92,093
300 million credit badge · 11 year member badge
Message 72429 - Posted: 1 Apr 2022, 20:47:50 UTC - in response to Message 72428.  

Server has caught up with work that was sent back. But tasks generated on the 16th of March haven't been sent out yet.

Server has done the heavy lifting of validating work that we've completed; now we can compute knowing stuff is running normally.
Not sure that means anything. Just that the work was generated a while ago. The telescope data it's from is probably 6 months old.
ID: 72429 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
JohnDK
Joined: 18 Feb 10
Posts: 40
Credit: 216,817,691
RAC: 2,298
200 million credit badge · 12 year member badge
Message 72431 - Posted: 1 Apr 2022, 20:51:38 UTC - in response to Message 72422.  

project: http://milkyway.cs.rpi.edu/milkyway/
gpu_limit: 30
report_delay: 750

The report_delay gets around the MW server misconfiguration where you are unable to request work if reporting work in the same scheduler connection.
The delay is greater than the MW 600 second timeout period after depleting a cache that normal clients incur.
The scheduler connection is the default 91 seconds. So it asks for new work every 91 seconds to top off my cache.
I always have my 30 task count maintained.
I can't find "report_delay" in the Boinc configuration files. Where do I put it? This is why I currently let it get the maximum 300 per GPU because there will be a 10 minute gap at the end.

That option is only for those who use a custom-made BOINC; it's also only for Linux.
ID: 72431 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,470,899
RAC: 92,093
300 million credit badge · 11 year member badge
Message 72432 - Posted: 1 Apr 2022, 20:55:32 UTC - in response to Message 72431.  

project: http://milkyway.cs.rpi.edu/milkyway/
gpu_limit: 30
report_delay: 750

The report_delay gets around the MW server misconfiguration where you are unable to request work if reporting work in the same scheduler connection.
The delay is greater than the MW 600 second timeout period after depleting a cache that normal clients incur.
The scheduler connection is the default 91 seconds. So it asks for new work every 91 seconds to top off my cache.
I always have my 30 task count maintained.
I can't find "report_delay" in the Boinc configuration files. Where do I put it? This is why I currently let it get the maximum 300 per GPU because there will be a 10 minute gap at the end.

That option is only for those who use a custom-made BOINC; it's also only for Linux.
Can I have a big description of this? I'm going to ask on GitHub for them to put it into normal BOINC, as it would be very useful.
ID: 72432 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
JohnDK
Joined: 18 Feb 10
Posts: 40
Credit: 216,817,691
RAC: 2,298
200 million credit badge · 12 year member badge
Message 72433 - Posted: 1 Apr 2022, 21:01:37 UTC - in response to Message 72432.  
Last modified: 1 Apr 2022, 21:02:00 UTC

That option is only for those who use a custom-made BOINC; it's also only for Linux.
Can I have a big description of this? I'm going to ask on GitHub for them to put it into normal BOINC, as it would be very useful.

Well, I don't know that much, but maybe Keith Myers can explain things; try sending him a PM.
ID: 72433 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Keith Myers

Joined: 24 Jan 11
Posts: 640
Credit: 496,346,501
RAC: 158,640
300 million credit badge · 11 year member badge · extraordinary contributions badge
Message 72434 - Posted: 1 Apr 2022, 21:02:45 UTC - in response to Message 72422.  
Last modified: 1 Apr 2022, 21:11:52 UTC

project: http://milkyway.cs.rpi.edu/milkyway/
gpu_limit: 30
report_delay: 750

The report_delay gets around the MW server misconfiguration where you are unable to request work if reporting work in the same scheduler connection.
The delay is greater than the MW 600 second timeout period after depleting a cache that normal clients incur.
The scheduler connection is the default 91 seconds. So it asks for new work every 91 seconds to top off my cache.
I always have my 30 task count maintained.
I can't find "report_delay" in the Boinc configuration files. Where do I put it? This is why I currently let it get the maximum 300 per GPU because there will be a 10 minute gap at the end.

It's not in the standard BOINC client. I use our optimized GPUUG team client which has the setting especially for Milkyway.
I asked our dev to put it in just for me since I am the only team member that does MW. [Edit] Or initially was. A few newer team members also do MW now.
Our client does a lot more to overcome the failures of the standard BOINC client. Setting specific task count sizes is the one everyone uses on all projects.
The next most used is the request_min_cooldown setting, which lets us choose our own scheduler connection intervals.
It's pretty ridiculous to ping a CPU-only project server every 11 seconds for a scheduler connection as the project default when the tasks take 1-12 hours to complete. With most of my projects I add 2-3 minutes to the stock project interval.
And back in the old SETI project days, when it was difficult to get enough work to keep our GPUUG team "special sauce" application busy, we could also spoof however many CPUs or GPUs we wanted to build up large cache sizes.

There is another MW-optimized client by a different developer over at GitHub that achieves the same thing.
https://github.com/JStateson/MilkywayNewWork
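For anyone trying to picture what report_delay buys you, here is a rough Python sketch of the idea as described above. It is not the team client's code. The names `simulate` and `Contact`, the one-task-per-minute finish times, and the simplified rule "a scheduler contact that reports finished tasks gets no new work" are all assumptions made for illustration; only the 91 second interval and the 750 second delay come from the post.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    time: int           # seconds since the start of the run
    reported: int       # finished tasks reported in this contact
    work_granted: bool  # would the modelled server hand out work?


def simulate(finish_times, report_delay, interval=91, horizon=1800):
    """Model scheduler contacts every `interval` seconds.

    Assumed rule (simplified from the thread): a contact that reports
    finished tasks gets no new work; a contact with nothing to report does.
    Finished tasks are held until the oldest one is `report_delay` seconds
    old, then everything pending is reported in a single contact.
    """
    remaining = sorted(finish_times)
    pending = []   # finish times of tasks not yet reported
    contacts = []
    for now in range(interval, horizon + 1, interval):
        # Collect tasks that have finished since the last contact.
        while remaining and remaining[0] <= now:
            pending.append(remaining.pop(0))
        due = bool(pending) and now - pending[0] >= report_delay
        reported = len(pending) if due else 0
        if due:
            pending.clear()
        contacts.append(Contact(now, reported, work_granted=(reported == 0)))
    return contacts


# Hypothetical workload: one task finishes every 60 seconds.
tasks = list(range(60, 1801, 60))
for delay in (0, 750):
    run = simulate(tasks, report_delay=delay)
    granted = sum(c.work_granted for c in run)
    print(f"report_delay={delay}: {granted} of {len(run)} contacts could fetch work")
```

With these made-up numbers, reporting immediately means every contact carries a report and none can fetch work, while holding reports back leaves most contacts free to top off the cache.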
ID: 72434 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
unixchick
Joined: 21 Feb 22
Posts: 66
Credit: 645,190
RAC: 6,582
500 thousand credit badge
Message 72435 - Posted: 1 Apr 2022, 21:27:22 UTC

I get a date for the WU and another date for the task. I should have been more specific. I'm looking at task date.

For Example... I get this info when looking at the workunit : created 15 Mar 2022, 13:16:34 UTC
but when I look at the specific task that I have I get : Created 24 Mar 2022, 20:57:13 UTC

I'm guessing that a workunit is created, but isn't used to generate a task for the queue until the queue size falls below some threshold. So I'm looking at the task creation (date, time) as I think it is more interesting to see where in the queue we are.

A couple of days ago we were as much as 10+ days behind in the queue, and now we are 7-8 days behind. I'm not sure what is "normal", but it gives me another data point to watch as the system heals.
ID: 72435 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Kiska

Joined: 31 Mar 12
Posts: 88
Credit: 149,784,912
RAC: 64,881
100 million credit badge · 10 year member badge
Message 72436 - Posted: 1 Apr 2022, 21:43:38 UTC - in response to Message 72435.  
Last modified: 1 Apr 2022, 21:45:30 UTC

I get a date for the WU and another date for the task. I should have been more specific. I'm looking at task date.

For Example... I get this info when looking at the workunit : created 15 Mar 2022, 13:16:34 UTC
but when I look at the specific task that I have I get : Created 24 Mar 2022, 20:57:13 UTC

I'm guessing that a workunit is created, but isn't used to generate a task for the queue until the queue size falls below some threshold. So I'm looking at the task creation (date, time) as I think it is more interesting to see where in the queue we are.

A couple of days ago we were as much as 10+ days behind in the queue, and now we are 7-8 days behind. I'm not sure what is "normal", but it gives me another data point to watch as the system heals.


I got this task about... 4 hours ago https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=174582283
Created 16 Mar 2022, 3:10:33 UTC
And the associated workunit
created 15 Mar 2022, 3:29:37 UTC

The difference between workunit creation and task creation is due to the transitioner being backed up.

Workunits are typically generated by the work generator, independent of queue size. Tasks are generated by the transitioner, also independent of queue size; it'll keep generating as long as there are workunits without associated tasks.

You can see a flaw...
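
For anyone who wants to put a number on the gap between workunit creation and task creation, a small Python sketch follows. It only parses the timestamp format shown on the workunit and task pages and subtracts; the helper names `parse_stamp` and `creation_lag` are made up for this example, and the two pairs of timestamps are the ones quoted in this thread.

```python
import re
from datetime import datetime, timezone

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Matches e.g. "15 Mar 2022, 3:29:37 UTC" as displayed on the site.
STAMP = re.compile(r"(\d{1,2}) (\w{3}) (\d{4}), (\d{1,2}):(\d{2}):(\d{2}) UTC")


def parse_stamp(text):
    """Turn a displayed timestamp into a timezone-aware datetime."""
    match = STAMP.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised timestamp: {text!r}")
    day, month, year, hour, minute, second = match.groups()
    return datetime(int(year), MONTHS[month], int(day),
                    int(hour), int(minute), int(second), tzinfo=timezone.utc)


def creation_lag(workunit_created, task_created):
    """Time between workunit creation and task creation."""
    return parse_stamp(task_created) - parse_stamp(workunit_created)


# The separation task and workunit quoted earlier in the thread.
print(creation_lag("15 Mar 2022, 13:16:34 UTC", "24 Mar 2022, 20:57:13 UTC"))
# prints: 9 days, 7:40:39

# The task and workunit linked in this post.
print(creation_lag("15 Mar 2022, 3:29:37 UTC", "16 Mar 2022, 3:10:33 UTC"))
# prints: 23:40:56
```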
ID: 72436 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
unixchick
Joined: 21 Feb 22
Posts: 66
Credit: 645,190
RAC: 6,582
500 thousand credit badge
Message 72438 - Posted: 1 Apr 2022, 23:22:43 UTC

Thanks Kiska for the info.

I'm not doing any nbody tasks so it looks like the nbody tasks have a short window between generation and being sent out even though the queue is huge (17 million +).

I just got some new separation tasks and it looks like I'm working on tasks created 24 Mar 2022, 21:04:17 UTC. I'm betting there are a ton of resends for March 30, 31 later in the queue as the system finally worked through the backlog of validations.

Is there really no limit on queue size? I guess not if the nbody queue was over 18 million.
ID: 72438 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Septimus

Joined: 8 Nov 11
Posts: 186
Credit: 2,375,109
RAC: 5,136
2 million credit badge · 11 year member badge
Message 72439 - Posted: 2 Apr 2022, 9:24:41 UTC

Another day without credits……
ID: 72439 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
entigy

Joined: 10 Jun 09
Posts: 6
Credit: 9,615,751
RAC: 4,410
5 million credit badge · 13 year member badge
Message 72440 - Posted: 2 Apr 2022, 10:23:17 UTC

Over 17 million WUs "Ready to Send", yet I still can't get any new tasks.
What gives?

02/04/2022 11:17:54 | Milkyway@Home | work fetch resumed by user
02/04/2022 11:17:57 | Milkyway@Home | Sending scheduler request: To fetch work.
02/04/2022 11:17:57 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
02/04/2022 11:18:00 | Milkyway@Home | Scheduler request completed: got 0 new tasks
02/04/2022 11:18:00 | Milkyway@Home | Project requested delay of 91 seconds
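
If you want to keep an eye on this over a longer stretch of the event log, a short Python sketch like the one below can tally it up. The function name `summarize` and the returned keys are invented for this example; the only thing taken as given is the `date time | project | message` line layout and the two message texts visible in the log lines above.

```python
import re

# Log lines copied from the post above.
LOG = """\
02/04/2022 11:17:54 | Milkyway@Home | work fetch resumed by user
02/04/2022 11:17:57 | Milkyway@Home | Sending scheduler request: To fetch work.
02/04/2022 11:17:57 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
02/04/2022 11:18:00 | Milkyway@Home | Scheduler request completed: got 0 new tasks
02/04/2022 11:18:00 | Milkyway@Home | Project requested delay of 91 seconds
"""

GOT = re.compile(r"Scheduler request completed: got (\d+) new tasks")
DELAY = re.compile(r"Project requested delay of (\d+) seconds")


def summarize(log_text, project="Milkyway@Home"):
    """Count received tasks and empty fetches, and keep the last requested delay."""
    summary = {"tasks_received": 0, "empty_fetches": 0, "requested_delay": None}
    for line in log_text.splitlines():
        fields = [field.strip() for field in line.split("|", 2)]
        if len(fields) != 3 or fields[1] != project:
            continue  # not a log line for the project we care about
        message = fields[2]
        got = GOT.search(message)
        if got:
            count = int(got.group(1))
            summary["tasks_received"] += count
            if count == 0:
                summary["empty_fetches"] += 1
        delay = DELAY.search(message)
        if delay:
            summary["requested_delay"] = int(delay.group(1))
    return summary


print(summarize(LOG))
# prints: {'tasks_received': 0, 'empty_fetches': 1, 'requested_delay': 91}
```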
ID: 72440 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Kiska

Joined: 31 Mar 12
Posts: 88
Credit: 149,784,912
RAC: 64,881
100 million credit badge · 10 year member badge
Message 72441 - Posted: 2 Apr 2022, 10:42:09 UTC - in response to Message 72440.  

Over 17 million WUs "Ready to Send", yet I still can't get any new tasks.
What gives?

02/04/2022 11:17:54 | Milkyway@Home | work fetch resumed by user
02/04/2022 11:17:57 | Milkyway@Home | Sending scheduler request: To fetch work.
02/04/2022 11:17:57 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
02/04/2022 11:18:00 | Milkyway@Home | Scheduler request completed: got 0 new tasks
02/04/2022 11:18:00 | Milkyway@Home | Project requested delay of 91 seconds


Let Tom have the weekend off so he doesn't have to focus on the project. It's running fairly well at this time and I just got 100 tasks for my CPU, so it may be that everyone drains the buffer before it has filled completely.
ID: 72441 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Septimus

Joined: 8 Nov 11
Posts: 186
Credit: 2,375,109
RAC: 5,136
2 million credit badge · 11 year member badge
Message 72442 - Posted: 2 Apr 2022, 11:45:45 UTC - in response to Message 72441.  

I have had a machine ready for an hour and even tried manually: nothing. Doing other work now.
ID: 72442 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
stfn

Joined: 17 Jun 21
Posts: 4
Credit: 7,158,192
RAC: 12,834
5 million credit badge · 1 year member badge
Message 72444 - Posted: 2 Apr 2022, 12:37:24 UTC - in response to Message 72441.  



Let Tom have the weekend off so he doesn't have to focus on the project. It's running fairly well at this time and I just got 100 tasks for my CPU, so it may be that everyone drains the buffer before it has filled completely.


Exactly. Thank you Tom for all your hard work in making the situation stable, have a great weekend and get some rest :)
ID: 72444 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
unixchick
Joined: 21 Feb 22
Posts: 66
Credit: 645,190
RAC: 6,582
500 thousand credit badge
Message 72446 - Posted: 2 Apr 2022, 13:53:35 UTC

Been 2 to 3 hours since I've gotten a WU, just to confirm what others are seeing. I've got the machine doing other projects, so no stress, hopefully the server will start giving WUs again on its own at some point.

I was working on some resends from March 26 (separation queue) the last time I had WUs to work on.
ID: 72446 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,470,899
RAC: 92,093
300 million credit badge · 11 year member badge
Message 72448 - Posted: 2 Apr 2022, 14:50:15 UTC - in response to Message 72439.  

Another day without credits……
But think of the big Christmas present you'll soon get!
ID: 72448 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
San-Fernando-Valley

Joined: 13 Apr 17
Posts: 214
Credit: 131,238,853
RAC: 17,328
100 million credit badge · 5 year member badge · extraordinary contributions badge
Message 72450 - Posted: 2 Apr 2022, 15:43:56 UTC - in response to Message 72448.  

Another day without credits……
But think of the big Christmas present you'll soon get!

SOON?
I'd hate to have to wait till December ....
ID: 72450 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,470,899
RAC: 92,093
300 million credit badge · 11 year member badge
Message 72451 - Posted: 2 Apr 2022, 16:15:38 UTC - in response to Message 72450.  

Another day without credits……
But think of the big Christmas present you'll soon get!

SOON?
I'd hate to have to wait till December ....
Does it really matter?
ID: 72451 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
San-Fernando-Valley

Joined: 13 Apr 17
Posts: 214
Credit: 131,238,853
RAC: 17,328
100 million credit badge · 5 year member badge · extraordinary contributions badge
Message 72453 - Posted: 2 Apr 2022, 17:37:22 UTC - in response to Message 72451.  

Another day without credits……
But think of the big Christmas present you'll soon get!

SOON?
I'd hate to have to wait till December ....
Does it really matter?

+1
ID: 72453 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 372
Credit: 96,713,798
RAC: 136,708
50 million credit badge · 3 year member badge
Message 72458 - Posted: 2 Apr 2022, 20:07:09 UTC

I get the feeling that the server thinks there are 17M jobs ready to send out, so it doesn't make more jobs. However, I cancelled all of those jobs in order to try to clear the validation backlog. I'm not sure where the jobs are stuck, but I will turn things off and reset their transition times, and see if that clears them.

The jobs should get removed from the DB once they are cancelled, but they just might not have been transitioned yet because they haven't gone out to volunteers (because they were cancelled).
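
A toy Python model of that suspicion, purely illustrative: it assumes a generator that only tops the queue up to some target and a "ready to send" count that still includes cancelled jobs that were never transitioned. The `Job` class, the function names, and the numbers are all hypothetical and are not how the real server code is written.

```python
from dataclasses import dataclass


@dataclass
class Job:
    sent: bool = False
    cancelled: bool = False


def ready_to_send_count(jobs):
    # Suspected behaviour: cancelled jobs that were never transitioned
    # are still counted as unsent.
    return sum(1 for job in jobs if not job.sent)


def actually_sendable(jobs):
    # What volunteers could really receive.
    return sum(1 for job in jobs if not job.sent and not job.cancelled)


def jobs_to_generate(jobs, target):
    # Assumed rule: only create enough new jobs to reach the target queue size.
    return max(0, target - ready_to_send_count(jobs))


def purge_cancelled(jobs):
    # Stand-in for the cancelled jobs finally being transitioned and removed.
    return [job for job in jobs if not job.cancelled]


queue = [Job(cancelled=True) for _ in range(17)]  # stand-in for the 17M jobs
print(ready_to_send_count(queue), actually_sendable(queue), jobs_to_generate(queue, target=10))
queue = purge_cancelled(queue)
print(ready_to_send_count(queue), actually_sendable(queue), jobs_to_generate(queue, target=10))
```

In this model the queue looks full while nothing is sendable, so no new jobs are made until the cancelled ones are cleared out.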
ID: 72458 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote

©2022 Astroinformatics Group