Welcome to MilkyWay@home

Server Downtime March 28, 2022 (12 hours starting 00:00 UTC)


Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 372
Credit: 96,767,310
RAC: 134,561
50 million credit badge3 year member badge
Message 72331 - Posted: 29 Mar 2022, 21:57:34 UTC - in response to Message 72326.  

No no no, please don't set that; something more sensible, please? Unless you actually have that amount of RAM, it's not going to help.

You also need to consider that the database also needs to use memory for its own caching ability.

Maybe something like 512MiB or 1GiB should be sufficient for shared memory.

If you use /etc/sysctl.conf for system config, just edit it, then use sysctl -p to reload the changes.

As for turning things off, ONLY these 2 should be turned off to test whether the database tasks table is just really full:
stream_fit_work_generator (milkyway)
nbody_work_generator (milkyway_nbody)


Yeah, I was surprised when I saw that it was already set to that value. Maybe in the future I'll set it to 1 GB in order to moderate the amount of memory allocated to the feeder and scheduler pool?

I just cycled a bunch of unsent WUs that were waiting to get sent, which should reduce the load on the server after they get deleted from the DB. I will try turning off the WU generators if the number of WUs waiting to get sent doesn't decrease after the server status page updates.
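
For reference, the 512 MiB and 1 GiB figures suggested above can be turned into the byte values that /etc/sysctl.conf expects. The sketch below is only an illustration: it assumes the setting being discussed is the Linux kernel.shmmax key (the thread does not name it), and the helper name shmmax_line is made up for this example.

```python
# Sketch: convert a shared-memory size in MiB into a sysctl.conf line.
# ASSUMPTION: the setting in question is the Linux key kernel.shmmax
# (maximum size of one shared memory segment, in bytes). The thread
# itself does not name the key.

MIB = 1024 * 1024  # bytes per mebibyte


def shmmax_line(size_mib: int) -> str:
    """Return a sysctl.conf entry limiting a shared memory segment to size_mib MiB."""
    if size_mib <= 0:
        raise ValueError("size must be positive")
    return f"kernel.shmmax = {size_mib * MIB}"


if __name__ == "__main__":
    # 512 MiB and 1 GiB, the two values suggested in the quoted message.
    for size in (512, 1024):
        print(shmmax_line(size))
```

The printed line would go into /etc/sysctl.conf and be loaded with sysctl -p, as described in the quoted message.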
ID: 72331 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 372
Credit: 96,767,310
RAC: 134,561
50 million credit badge3 year member badge
Message 72332 - Posted: 29 Mar 2022, 21:58:40 UTC - in response to Message 72327.  

Tom - thanks for the clarification.

I'd just edited my earlier message to acknowledge seeing your "killed tasks" comment, but I'll leave the edit as is...

The task numbers appear unchanged (as the page on display at 20:00 UTC is apparently from before your most recent activity) -- I presume the server status page task data won't get updated again today - do you know how often it should update those numbers? (It didn't seem to be updated more than once a day whilst the server was struggling...)

Thanks for your efforts, especially given the other calls on your time! I just hope it sorts out soon, and you can get a bit of relative peace and quiet...

Cheers - Al.


I think it's supposed to update every 30 min, or at least every hour. It used to update pretty frequently. I think it just takes forever for the DB queries that generate it to go through because of how slow things are at the moment.
ID: 72332 · Rating: 0
San-Fernando-Valley

Joined: 13 Apr 17
Posts: 214
Credit: 131,248,237
RAC: 17,314
100 million credit badge5 year member badgeextraordinary contributions badge
Message 72333 - Posted: 29 Mar 2022, 22:55:54 UTC - in response to Message 72304.  

@ Tom:

@ Tom:

My stats say 600 tasks in progress:
State: All (82650) · In progress (600) · Validation pending (28154) · Validation inconclusive (8569) · Valid (45223) · Invalid (0) · Error (104)
Application: All (82650) · Milkyway@home N-Body Simulation (0) · Milkyway@home Separation (82650)
I wish I could find out where they are.
None of my rigs show any tasks.
...

It looks like those 600 tasks in progress, but nowhere to be seen in BOINC Manager, are tasks solely with a minimum quorum of one.

Nothing has changed regarding the unfindable 600 tasks in progress.
Still not getting any GPU work ...
It is late - "see" you tomorrow!
ID: 72333 · Rating: 0
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,472,918
RAC: 89,227
300 million credit badge11 year member badge
Message 72334 - Posted: 29 Mar 2022, 23:29:33 UTC - in response to Message 72304.  

It looks like those 600 tasks in progress, but nowhere to be seen in BOINC Manager, are tasks solely with a minimum quorum of one.
What should they be? I can't access most of mine, it says "can't find workunit", but those I can say "minimum quorum" 1, but "initial replication" 2. If they only need 1, why make 2?
ID: 72334 · Rating: 0
alanb1951

Joined: 16 Mar 10
Posts: 151
Credit: 85,382,825
RAC: 55,523
50 million credit badge12 year member badgeextraordinary contributions badge
Message 72336 - Posted: 30 Mar 2022, 1:51:16 UTC

To those of you commenting on tasks that the MilkyWay site says are "In Progress" but aren't showing up in BOINC Manager (or equivalent)...

They are tasks that have got "orphaned" because of network issues -- if you check your BOINC log for around the time the MW site claims the tasks were sent you'll probably find there's a request for work that failed on a timeout.

For what it's worth, one of my machines currently has 195 In Progress according to the MW site but BOINC Manager shows 131; the other is alleged to have 117 but actually has 100! In both cases, I can associate all the "lost" tasks with attempted connections that timed out.
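
The gap described above is simple arithmetic: 195 - 131 = 64 tasks on one machine and 117 - 100 = 17 on the other. A minimal Python sketch of that comparison follows. The function names and the example task names are hypothetical, and it assumes you have copied the task names from the website and from BOINC Manager by hand.

```python
# Sketch: work out which tasks the website lists as "In Progress"
# but the local BOINC client does not know about.
# ASSUMPTION: task names have been collected manually from both places.


def lost_count(server_in_progress: int, client_in_progress: int) -> int:
    """Number of tasks the server counts that the client does not have."""
    return max(server_in_progress - client_in_progress, 0)


def lost_tasks(server_names, client_names):
    """Names listed by the server but missing on the client, sorted."""
    return sorted(set(server_names) - set(client_names))


if __name__ == "__main__":
    # The two machines described in this post.
    print(lost_count(195, 131))
    print(lost_count(117, 100))

    # Hypothetical task names, for illustration only.
    server = ["task_a", "task_b", "task_c"]
    client = ["task_b"]
    print(lost_tasks(server, client))
```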

I don't know whether these can be picked up again if one resets the project; in theory, the server should be able to determine what tasks are "lost" as the result of the reset and send them again (up to 12 at a time, I think). However, when I tried that with one of my machines a few days ago I only got new tasks when the reset completed; that might've been because of the server issues, though. I'll try again next time I run out of work (unless someone else beats me to a conclusive test now the server is behaving somewhat better...)

If they can't be re-claimed by a reset, I guess we're stuck with them in the same way some of us have odd tasks left over (in various states) from the database record renumbering exercise of early 2021 :-)

Cheers - Al.
ID: 72336 · Rating: 0
San-Fernando-Valley

Joined: 13 Apr 17
Posts: 214
Credit: 131,248,237
RAC: 17,314
100 million credit badge5 year member badgeextraordinary contributions badge
Message 72337 - Posted: 30 Mar 2022, 5:12:09 UTC - in response to Message 72334.  

@ Peter:

It looks like those 600 tasks in progress, but nowhere to be seen in BOINC Manager, are tasks solely with a minimum quorum of one.
What should they be? I can't access most of mine, it says "can't find workunit", but those I can say "minimum quorum" 1, but "initial replication" 2. If they only need 1, why make 2?

Haven't tried to access all of the 600.
They are distributed over several rigs.
I checked about 50-60 of them and they all have
a "minimum quorum" and "initial replication" of 1 (one).
Strange - hope Tom can fix this.
ID: 72337 · Rating: 0
San-Fernando-Valley

Joined: 13 Apr 17
Posts: 214
Credit: 131,248,237
RAC: 17,314
100 million credit badge5 year member badgeextraordinary contributions badge
Message 72338 - Posted: 30 Mar 2022, 7:03:30 UTC - in response to Message 72336.  

@ alanb1951:

To those of you commenting on tasks that the MilkyWay site says are "In Progress" but aren't showing up in BOINC Manager (or equivalent)
...
They are tasks that have got "orphaned" because of network issues -- if you check your BOINC log for around the time the MW site claims the tasks were sent you'll probably find there's a request for work that failed on a timeout.
...
I don't know whether these can be picked up again if one resets the project;
...
If they can't be re-claimed by a reset, I guess we're stuck with them in the same way some of us have odd tasks left over (in various states) from the database record renumbering exercise of early 2021
...

Thanks for the interesting info!

But what do you mean by "BOINC log"? Do you mean the "Event log"? It gets cleared when you exit BOINC.
I can't find any other log anywhere.

I did a "Reset", but nothing has changed.

My other (main) problem seems to be that I am still unable to get tasks.
The "update" request gives the "Event log" message
"3/30/2022 8:52:03 AM | Milkyway@Home | Scheduler request completed: got 0 new tasks"
repeatedly.
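
As a rough aid for spotting this pattern, the Event log line quoted above can be picked apart with a small Python sketch. The regular expression is based only on the single line shown in this post, so other BOINC versions or locales may format the line differently; the function name tasks_received is made up for this example.

```python
# Sketch: extract the project name and the number of new tasks from a
# BOINC Event log line of the form quoted in this post.
# ASSUMPTION: the "timestamp | project | message" layout is taken from that
# one example line only; it is not a documented format.
import re

SCHEDULER_REPLY = re.compile(
    r"^(?P<when>.+?) \| (?P<project>.+?) \| "
    r"Scheduler request completed: got (?P<count>\d+) new tasks$"
)


def tasks_received(line: str):
    """Return (project, task_count) for a scheduler reply line, else None."""
    match = SCHEDULER_REPLY.match(line.strip())
    if match is None:
        return None
    return match.group("project"), int(match.group("count"))


if __name__ == "__main__":
    example = (
        "3/30/2022 8:52:03 AM | Milkyway@Home | "
        "Scheduler request completed: got 0 new tasks"
    )
    print(tasks_received(example))
```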

Hope Tom can do something about this.

Have a nice day!
ID: 72338 · Rating: 0
San-Fernando-Valley

Joined: 13 Apr 17
Posts: 214
Credit: 131,248,237
RAC: 17,314
100 million credit badge5 year member badgeextraordinary contributions badge
Message 72339 - Posted: 30 Mar 2022, 9:20:03 UTC

I'm completely at a loss.
Restarted BOINC for "I don't know how many times" and suddenly I am getting tasks on all rigs without having changed anything.
GREAT! Sounds like magic ...
ID: 72339 · Rating: 0
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,472,918
RAC: 89,227
300 million credit badge11 year member badge
Message 72340 - Posted: 30 Mar 2022, 12:58:56 UTC - in response to Message 72337.  

@ Peter:

It looks like those 600 tasks in progress, but nowhere to be seen in BOINC Manager, are tasks solely with a minimum quorum of one.
What should they be? I can't access most of mine, it says "can't find workunit", but those I can say "minimum quorum" 1, but "initial replication" 2. If they only need 1, why make 2?

Haven't tried to access all of the 600.
They are distributed over several rigs.
I checked about 50-60 of them and they all have
a "minimum quorum" and "initial replication" of 1 (one).
Strange - hope Tom can fix this.
I've looked at more and the replication is only 2 on some. I guess that was a mistake or one got lost as per earlier messages in here.
ID: 72340 · Rating: 0
Monty

Joined: 11 Jul 20
Posts: 2
Credit: 25,191,071
RAC: 24,389
20 million credit badge2 year member badge
Message 72341 - Posted: 30 Mar 2022, 14:06:23 UTC

Good day.
I want to ask if anyone has the same problem with their own account as me.
Since 28 Mar, no credit has been added to my account for finished work I have submitted. All finished work goes to the "waiting for validation" folder and then moves to the "partial validation" folder. Since the same date, my average credit has not changed either, even though the average credit should decrease when no credit is being applied.
ID: 72341 · Rating: 0
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,472,918
RAC: 89,227
300 million credit badge11 year member badge
Message 72342 - Posted: 30 Mar 2022, 14:10:20 UTC - in response to Message 72341.  

Good day.
I want to ask if anyone has the same problem with their own account as me.
Since 28 Mar, no credit has been added to my account for finished work I have submitted. All finished work goes to the "waiting for validation" folder and then moves to the "partial validation" folder. Since the same date, my average credit has not changed either, even though the average credit should decrease when no credit is being applied.
It's not you, the server is overloaded after a disk problem. Credit will appear when it's caught up. Keep an eye on the server status page at the 3 million tasks not yet validated. When those get done, you should get credit.
ID: 72342 · Rating: 0
unixchick
Joined: 21 Feb 22
Posts: 66
Credit: 649,969
RAC: 6,691
500 thousand credit badge
Message 72343 - Posted: 30 Mar 2022, 14:19:10 UTC - in response to Message 72341.  

Monty- it isn't you, it is the system. The server lost a hard drive and was limping along for a while, and now that there is a new hard drive in the system, it will take some time for it to recover and catch up on all the validating.

The system did manage to go from over 4 million waiting for validation to 3088742 now, so I'll take that as a good sign of things moving in the right direction.

I seem to be getting WUs in a random way. I do a manual request here and there so the time between requests doesn't get too long, but otherwise I haven't restarted BOINC or my computer. I've got some now, so I'll hope the system is on the mend after all the stuff Tom did the other day. I do think the one poster was right and that the DB is overloaded and slow, which is why there are WUs but we don't always get them when we ask.
ID: 72343 · Rating: 0
Monty

Joined: 11 Jul 20
Posts: 2
Credit: 25,191,071
RAC: 24,389
20 million credit badge2 year member badge
Message 72344 - Posted: 30 Mar 2022, 14:36:00 UTC

Thank you for the explanation
ID: 72344 · Rating: 0
Septimus

Joined: 8 Nov 11
Posts: 186
Credit: 2,375,109
RAC: 4,482
2 million credit badge11 year member badge
Message 72345 - Posted: 30 Mar 2022, 14:50:29 UTC
Last modified: 30 Mar 2022, 14:53:20 UTC

As far as I can see, validation is still at least 9-10 days behind. Although my numbers are low compared to others', my Validation inconclusive number is steadily increasing, covering almost the whole of March.

I am still bemused that the volume of WUs ready to send is up around 18 million; I thought this was going to be stopped until things stabilised. Surely it would be better to stop generating new WUs until things have caught up?
ID: 72345 · Rating: 0
Kiska

Joined: 31 Mar 12
Posts: 88
Credit: 149,784,912
RAC: 58,765
100 million credit badge10 year member badge
Message 72346 - Posted: 30 Mar 2022, 14:54:02 UTC - in response to Message 72340.  
Last modified: 30 Mar 2022, 15:05:24 UTC

@ Peter:

It looks like those 600 tasks in progress, but nowhere to be seen in BOINC Manager, are tasks solely with a minimum quorum of one.
What should they be? I can't access most of mine, it says "can't find workunit", but those I can say "minimum quorum" 1, but "initial replication" 2. If they only need 1, why make 2?

Haven't tried to access all of the 600.
They are distributed over several rigs.
I checked about 50-60 of them and they all have
a "minimum quorum" and "initial replication" of 1 (one).
Strange - hope Tom can fix this.
I've looked at more and the replication is only 2 on some. I guess that was a mistake or one got lost as per earlier messages in here.


It's not a mistake, it's a feature of BOINC. Some "Workunits" get chosen to replicate 2 "tasks"; others may only have 1 "task".

Source: https://boinc.berkeley.edu/trac/wiki/ValidationSummary#Adaptivereplication and https://boinc.berkeley.edu/trac/wiki/AdaptiveReplication
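
As a rough illustration of the idea (not BOINC's actual code), the sketch below models a scheduler that sends a single copy to hosts with a good track record and two copies otherwise, with an occasional spot check. The threshold of 10 consecutive valid results and the 10% spot-check rate are assumptions chosen for the example; see the linked wiki pages for the real rules.

```python
# Toy model of adaptive replication. This is NOT BOINC's implementation.
# ASSUMPTIONS: the trust threshold (10 consecutive valid results) and the
# spot-check probability (0.1) are illustrative values only.
import random


def initial_replication(
    consecutive_valid: int,
    trust_threshold: int = 10,
    spot_check_probability: float = 0.1,
    rng=None,
) -> int:
    """Return how many copies of a workunit to issue for this host."""
    rng = rng or random.Random()
    if consecutive_valid < trust_threshold:
        return 2  # host not yet trusted: always cross-check with a second copy
    if rng.random() < spot_check_probability:
        return 2  # trusted host, but occasionally verify anyway
    return 1  # trusted host: a single result is accepted


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same sequence
    copies = [initial_replication(50, rng=rng) for _ in range(20)]
    print(copies)  # mostly 1s, with the occasional 2
```

In this toy model a mix of workunits with one task and workunits with two tasks on the same account is the expected outcome rather than a sign that something was lost.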
ID: 72346 · Rating: 0
Kiska

Joined: 31 Mar 12
Posts: 88
Credit: 149,784,912
RAC: 58,765
100 million credit badge10 year member badge
Message 72347 - Posted: 30 Mar 2022, 14:55:35 UTC - in response to Message 72345.  

As far as I can see, validation is still at least 9-10 days behind. Although my numbers are low compared to others', my Validation inconclusive number is steadily increasing, covering almost the whole of March.

I am still bemused that the volume of WUs ready to send is up around 18 million; I thought this was going to be stopped until things stabilised. Surely it would be better to stop generating new WUs until things have caught up?


It has according to my stats:


You can view this dashboard at: https://grafana.kiska.pw/goto/hA52pAy7k?orgId=1
ID: 72347 · Rating: 0
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,472,918
RAC: 89,227
300 million credit badge11 year member badge
Message 72348 - Posted: 30 Mar 2022, 15:01:35 UTC - in response to Message 72345.  

As far as I can see, validation is still at least 9-10 days behind. Although my numbers are low compared to others', my Validation inconclusive number is steadily increasing, covering almost the whole of March.

I am still bemused that the volume of WUs ready to send is up around 18 million; I thought this was going to be stopped until things stabilised. Surely it would be better to stop generating new WUs until things have caught up?
Boinc server software, like the client software, is a cobbled together piece of crap. It's not easy to control what it does.
ID: 72348 · Rating: 0
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,472,918
RAC: 89,227
300 million credit badge11 year member badge
Message 72349 - Posted: 30 Mar 2022, 15:02:43 UTC - in response to Message 72346.  

@ Peter:

It looks like those 600 tasks in progress, but nowhere to be seen in BOINC Manager, are tasks solely with a minimum quorum of one.
What should they be? I can't access most of mine, it says "can't find workunit", but those I can say "minimum quorum" 1, but "initial replication" 2. If they only need 1, why make 2?

Haven't tried to access all of the 600.
They are distributed over several rigs.
I checked about 50-60 of them and they all have
a "minimum quorum" and "initial replication" of 1 (one).
Strange - hope Tom can fix this.
I've looked at more and the replication is only 2 on some. I guess that was a mistake or one got lost as per earlier messages in here.


It's not a mistake, it's a feature of BOINC. Some "Workunits" get chosen to replicate 2 "tasks"; others may only have 1 "task".

Source: https://boinc.berkeley.edu/trac/wiki/ValidationSummary#Adaptivereplication
WTF? Boinc being intelligent?
ID: 72349 · Rating: 0
Septimus

Joined: 8 Nov 11
Posts: 186
Credit: 2,375,109
RAC: 4,482
2 million credit badge11 year member badge
Message 72350 - Posted: 30 Mar 2022, 15:09:34 UTC - in response to Message 72348.  

As far as I can see, validation is still at least 9-10 days behind. Although my numbers are low compared to others', my Validation inconclusive number is steadily increasing, covering almost the whole of March.

I am still bemused that the volume of WUs ready to send is up around 18 million; I thought this was going to be stopped until things stabilised. Surely it would be better to stop generating new WUs until things have caught up?
Boinc server software, like the client software, is a cobbled together piece of crap. It's not easy to control what it does.


Love the explanation!
ID: 72350 · Rating: 0
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,472,918
RAC: 89,227
300 million credit badge11 year member badge
Message 72351 - Posted: 30 Mar 2022, 15:13:38 UTC - in response to Message 72350.  

As far as I can see, validation is still at least 9-10 days behind. Although my numbers are low compared to others', my Validation inconclusive number is steadily increasing, covering almost the whole of March.

I am still bemused that the volume of WUs ready to send is up around 18 million; I thought this was going to be stopped until things stabilised. Surely it would be better to stop generating new WUs until things have caught up?
Boinc server software, like the client software, is a cobbled together piece of crap. It's not easy to control what it does.


Love the explanation!
:-) If you say anything like that on the main boinc forums you get banned. I've been banned 38 times.
ID: 72351 · Rating: 0

©2022 Astroinformatics Group