Nbody WU Flush

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73018 - Posted: 19 Apr 2022, 18:56:47 UTC

The server has been brought back up. I'm going to wait for the server status page to update and then assess the situation. Hopefully things will clear out and I can turn the Nbody WU generator back on.

Septimus

Joined: 8 Nov 11
Posts: 205
Credit: 2,882,881
RAC: 267
Message 73019 - Posted: 19 Apr 2022, 19:56:48 UTC - in response to Message 73018.  

That looks a lot better…. Thanks

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73020 - Posted: 19 Apr 2022, 20:04:07 UTC

I was able to bring the number of Nbody WUs down to a reasonable number. I'm going to see how quickly they decrease, and then turn the Nbody WU generator on once it's reasonable to do so.

HRFMguy

Joined: 12 Nov 21
Posts: 236
Credit: 575,032,715
RAC: 35,573
Message 73021 - Posted: 19 Apr 2022, 21:03:05 UTC - in response to Message 73012.  

OK. The N-body backlog is out of control. I think I will abort all Separation _0 CPU work for a week or so. This will free up 36 CPU threads for the N-body WU flush. Any wingman Separation tasks that are _1 and above will be processed normally. All GPU Separation tasks will also be processed normally. Separation retests will also go through normally.

I'm starting to get worked up into campaign mode here.

Thoughts?

I just got told by the server that there were no N-body tasks left, so perhaps Tom cleared it? I'm also getting loads of GPU Separation every time I ask, so I think the server is OK now and you can run what you like.

I try not to do Separation on the CPUs anyway, as the GPUs are many, many times faster and it seems a waste. It's a pity the server options don't let me choose that completely.

716 tasks had been aborted by the time I saw your post here. I could go another 30, but will hold off for a bit.
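
For readers unfamiliar with the _0/_1 notation above: BOINC appends a replica index to each task name, so an "_0" task is the first copy of a workunit sent out and higher indices are later copies (wingmen and resends). A small illustrative snippet for filtering task names by that index (the names below are made up, not real MilkyWay@home tasks):

# Hypothetical task names, purely for illustration.
tasks = [
    "example_nbody_wu_1234_0",   # first copy of the workunit sent out
    "example_nbody_wu_1234_1",   # second copy (wingman or resend)
    "example_nbody_wu_5678_0",
]

def replica_index(task_name: str) -> int:
    """The trailing _N on a BOINC task name is the replica index."""
    return int(task_name.rsplit("_", 1)[1])

# "_0 CPU work" above means tasks whose replica index is 0.
initial_copies = [t for t in tasks if replica_index(t) == 0]
print(initial_copies)   # ['example_nbody_wu_1234_0', 'example_nbody_wu_5678_0']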

Mr P Hucker

Joined: 5 Jul 11
Posts: 990
Credit: 376,142,956
RAC: 2
Message 73022 - Posted: 19 Apr 2022, 21:05:52 UTC - in response to Message 73020.  
Last modified: 19 Apr 2022, 21:06:26 UTC

Tom Donlon wrote:
I was able to bring the number of Nbody WUs down to a reasonable number. I'm going to see how quickly they decrease, and then turn the Nbody WU generator on once it's reasonable to do so.

Don't forget to turn up the GPU Separation limits like you promised me :)
Pretty please?
The 10-minute delay otherwise is frustrating.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73023 - Posted: 19 Apr 2022, 23:26:24 UTC - in response to Message 73022.  

Mr P Hucker wrote:
Don't forget to turn up the GPU Separation limits like you promised me :)

All done! I'm not sure whether the changes will take effect immediately or whether they will need a reset. If a reset is needed, it will happen soon when I turn the Nbody WU generator back on (probably tomorrow or Thursday, based on how the numbers are trending).

Mr P Hucker

Joined: 5 Jul 11
Posts: 990
Credit: 376,142,956
RAC: 2
Message 73025 - Posted: 19 Apr 2022, 23:35:39 UTC - in response to Message 73023.  
Last modified: 19 Apr 2022, 23:37:23 UTC

Mr P Hucker wrote:
Don't forget to turn up the GPU Separation limits like you promised me :)

Tom Donlon wrote:
All done! I'm not sure whether the changes will take effect immediately or whether they will need a reset. If a reset is needed, it will happen soon when I turn the Nbody WU generator back on (probably tomorrow or Thursday, based on how the numbers are trending).

Thank you! I just tested it and got a limit of 900 tasks for a 3-GPU machine, as before, so I guess the reset needs to happen first. My new GPUs won't be here for a few days anyway, but by then I want six 280Xs running non-stop on MilkyWay!

When I ask for CPU work I seem to get mainly Separation.

alanb1951

Joined: 16 Mar 10
Posts: 208
Credit: 105,453,596
RAC: 36,231
Message 73026 - Posted: 20 Apr 2022, 0:13:19 UTC - in response to Message 73023.  
Last modified: 20 Apr 2022, 0:16:07 UTC

Tom,

A couple of questions now that normality may have been restored :-)

Is there some way of limiting the number of new work units being created to stop any further possibility of the unsent tasks getting flooded again? I mentioned the idea in the Server trouble thread, where it merely provoked a critique of BOINC software...

I realize this next one isn't about N-body, but rather than posting in two places...

Is there a particular reason why Separation work units that don't validate the first result straight away seem to go on to need two retries, with all three results ending up validated? It seems to have been doing this ever since the disk problem, and several folks have commented on it in various threads. I've looked at the stderr.txt of quite a few of the tasks where I'm waiting for the second retry to return, and in almost every case the values from both results seem to be well within the sort of tolerances that I thought would be passed without question (the differences are out around the 12th decimal place!).

Cheers - Al.

P.S. here's hoping there'll be enough Separation work to allow those who don't want/need a giant data feed to get work after the big hitters have grabbed the increased numbers of tasks :-)

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73027 - Posted: 20 Apr 2022, 0:15:08 UTC - in response to Message 73026.  
Last modified: 20 Apr 2022, 0:17:49 UTC

alanb1951 wrote:
Is there some way of limiting the number of new work units being created to stop any further possibility of the unsent tasks getting flooded again? I mentioned the idea in the Server trouble thread, where it merely provoked a critique of BOINC software...

There is one in place, actually. It normally keeps the number of jobs ready to be sent at ~10k (like how Separation is self-monitoring right now). This just got screwed up somehow with the disk problem and spiraled out of control.
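
For anyone curious what such a limiter looks like: BOINC-style work generators typically watch the number of unsent results and only create more when it falls below a target cushion. A toy sketch of that idea in Python (the cushion, batch size, and send rate are illustrative assumptions, and this simulates the logic rather than reproducing the project's actual daemon):

import random

# Illustrative numbers only, not MilkyWay@home's real settings.
CUSHION = 10_000      # target count of "ready to send" N-body results
BATCH_SIZE = 500      # workunits created per generator pass

def run_generator(passes: int, unsent: int = 0) -> int:
    """Toy simulation of a cushion-limited work generator: new work is only
    created when the unsent count drops below the cushion, so the queue
    cannot run away the way the flooded N-body queue did."""
    for _ in range(passes):
        unsent = max(unsent - random.randint(0, 400), 0)   # tasks handed out to hosts
        if unsent < CUSHION:
            unsent += min(BATCH_SIZE, CUSHION - unsent)    # top up only the shortfall
    return unsent

print(run_generator(passes=100))   # settles at the 10,000 cushion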

alanb1951 wrote:
Also, is there a particular reason why Separation work units that don't validate the first result straight away seem to go on to need two retries, with all three results ending up validated? It seems to have been doing this ever since the disk problem, and several folks have commented on it in various threads. I've looked at the stderr.txt of quite a few of the tasks where I'm waiting for the second retry to return, and in almost every case the values from both results seem to be well within the sort of tolerances that I thought would be passed without question (the differences are out around the 12th decimal place!).

That is by design. We can choose how many times Separation WUs need to go out for validation, and whoever came before me decided that 2 retries was the correct number. But yes, the differences should be small, because they should just depend on how different operating systems execute the code.
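
To make the tolerance point concrete: a fuzzy validator of this kind would normally accept two results whose numbers agree within a relative tolerance far looser than the twelfth-decimal-place differences Al describes. A minimal sketch (the tolerance and the sample values are assumptions, not the project's actual validator code):

import math

# Assumed tolerance for illustration; the project's real validator may differ.
REL_TOLERANCE = 1e-8

def results_match(value_a: float, value_b: float) -> bool:
    """Treat two returned likelihood values as equivalent if they agree to
    within a relative tolerance, absorbing OS/compiler floating-point drift."""
    return math.isclose(value_a, value_b, rel_tol=REL_TOLERANCE)

# Values differing only around the 12th decimal place, as described above,
# pass comfortably under such a check.
print(results_match(-2.753141883465123, -2.753141883465512))   # True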

alanb1951

Joined: 16 Mar 10
Posts: 208
Credit: 105,453,596
RAC: 36,231
Message 73028 - Posted: 20 Apr 2022, 0:20:48 UTC - in response to Message 73027.  
Last modified: 20 Apr 2022, 0:28:56 UTC

Tom Donlon wrote:
alanb1951 wrote:
Is there some way of limiting the number of new work units being created to stop any further possibility of the unsent tasks getting flooded again? I mentioned the idea in the Server trouble thread, where it merely provoked a critique of BOINC software...

There is one in place, actually. It normally keeps the number of jobs ready to be sent at ~10k (like how Separation is self-monitoring right now). This just got screwed up somehow with the disk problem and spiraled out of control.

Thanks for the response, Tom; the only question that raises (specifically for Separation!) is whether that limit might need to be raised if the number of tasks sent per request goes up for the big hitters.

Just saw your edit in response to my second question too... I wondered whether it was a deliberate choice or not; it just had an unfortunate side effect whilst retries were severely delayed, but it's less of an issue now that normal service seems to have been resumed!

Cheers - Al.

[Edited to acknowledge added response to second question...]

Mr P Hucker

Joined: 5 Jul 11
Posts: 990
Credit: 376,142,956
RAC: 2
Message 73029 - Posted: 20 Apr 2022, 0:24:24 UTC - in response to Message 73028.  
Last modified: 20 Apr 2022, 0:25:00 UTC

alanb1951 wrote:
Thanks for the response, Tom; the only question that raises (specifically for Separation!) is whether that limit might need to be raised if the number of tasks sent per request goes up for the big hitters.

Don't worry: as a big hitter, if I don't get my huge number I will make it known. I will be watching the number of tasks received at once...

AnandBhat

Joined: 14 Feb 22
Posts: 12
Credit: 2,876,230
RAC: 2,278
Message 73031 - Posted: 20 Apr 2022, 2:33:54 UTC - in response to Message 73020.  

Tom Donlon wrote:
I was able to bring the number of Nbody WUs down to a reasonable number. I'm going to see how quickly they decrease, and then turn the Nbody WU generator on once it's reasonable to do so.
I noticed that some of my WUs with "Validation inconclusive" status, which had a second task created and waiting to be assigned, have had that second task cancelled ("Didn't need"), while the WU itself remains in the "Validation inconclusive" state. For example, WU 414046094.

Will such work units get a new task created and assigned once the NBody WU generator is turned on so that these can be validated?

mikey

Joined: 8 May 09
Posts: 3315
Credit: 519,950,216
RAC: 22,077
Message 73034 - Posted: 20 Apr 2022, 10:20:56 UTC - in response to Message 73008.  

Quoting Message 73008:
I am also confused by this; my Separation GPU WUs have:

minimum quorum 2
initial replication 3

Why send out 3 if only 2 are needed? This seems like a waste of processing time. If two of us have agreed on the result, why get a third GPU to run it?

That setting has been around forever. It's there because a lot of people had problems with their machines and didn't complete the tasks for one reason or another, so to speed up the process they did that. I would have thought it wasn't needed today, but I guess it is.
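
To put rough numbers on that "speed up the process" point: with a minimum quorum of 2, sending 3 initial replicas makes it far more likely the quorum is reached without waiting for a resend, at the cost of extra computation. A back-of-the-envelope sketch, assuming an illustrative 10% task failure rate (not project data):

from math import comb

def p_quorum_without_resend(initial_replicas: int, min_quorum: int, p_fail: float) -> float:
    """Probability that at least min_quorum of the initial replicas come back
    valid, i.e. the workunit validates without waiting on any resend.
    Assumes replicas fail (error or miss the deadline) independently."""
    p_ok = 1.0 - p_fail
    return sum(
        comb(initial_replicas, k) * p_ok**k * p_fail**(initial_replicas - k)
        for k in range(min_quorum, initial_replicas + 1)
    )

p = 0.10  # assumed: roughly 1 in 10 tasks errors out or misses its deadline
print(p_quorum_without_resend(2, 2, p))  # ~0.81
print(p_quorum_without_resend(3, 2, p))  # ~0.97

Under those assumed numbers, roughly one workunit in five under a 2-of-2 policy would stall waiting on a resend, which is the latency problem the third replica is meant to avoid, in exchange for about 50% more computing per workunit.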

Mr P Hucker

Joined: 5 Jul 11
Posts: 990
Credit: 376,142,956
RAC: 2
Message 73036 - Posted: 20 Apr 2022, 10:37:43 UTC - in response to Message 73034.  

mikey wrote:
Quoting Message 73008:
I am also confused by this; my Separation GPU WUs have:

minimum quorum 2
initial replication 3

Why send out 3 if only 2 are needed? This seems like a waste of processing time. If two of us have agreed on the result, why get a third GPU to run it?

That setting has been around forever. It's there because a lot of people had problems with their machines and didn't complete the tasks for one reason or another, so to speed up the process they did that. I would have thought it wasn't needed today, but I guess it is.

Perhaps it would be more efficient if the deadline were decreased but the replication reduced to 1 or 2? Then if you don't do yours, it wouldn't be long before they were resent to me.

Mr P Hucker

Joined: 5 Jul 11
Posts: 990
Credit: 376,142,956
RAC: 2
Message 73037 - Posted: 20 Apr 2022, 10:39:05 UTC

TOM: What's the status of things now? I see the Nbody count on the server status page sitting at exactly 1000, as it should be. Is the generator back on? I'm still getting limited to 300 Separation tasks per GPU :-(

Also see my reply to Mikey above about efficiency.

pawg

Joined: 29 Oct 11
Posts: 1
Credit: 241,174
RAC: 225
Message 73060 - Posted: 21 Apr 2022, 9:53:45 UTC - in response to Message 73031.  

I have about 200 WUs with that status.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73106 - Posted: 25 Apr 2022, 23:37:36 UTC

Hmm, I did change the values within the config.xml file on the server and then rebooted everything. Maybe I changed the wrong settings... I'll check what's up and see if anything obvious jumps out at me.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73108 - Posted: 25 Apr 2022, 23:51:06 UTC - in response to Message 73106.  

I found where to change the individual limits, but I don't know why my changes to the feeder/scheduler pool never took effect. I'll have to dig around some more because that seems funky. I can't boost the number of WUs we can send out to a single volunteer until I raise the pool numbers, because otherwise one person could drain the pool in one go and there wouldn't be any WUs left until the pool refilled.
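
A rough way to picture the constraint: the feeder keeps a fixed-size pool of ready tasks in shared memory, and every scheduler request is served from that pool, so if the per-request limit approaches the pool size a single large request can empty it faster than the feeder refills. A toy illustration with assumed numbers (not the project's real settings):

# Illustrative numbers only, not the project's real feeder/scheduler settings.
POOL_SLOTS = 1000          # tasks the feeder keeps ready in shared memory
REFILL_PER_SECOND = 50     # rate at which the feeder tops the pool back up
PER_REQUEST_LIMIT = 900    # maximum tasks a single scheduler request may take

def seconds_pool_is_starved(per_request_limit: int) -> float:
    """How long other hosts see a (nearly) empty pool after one maximal
    request, assuming the feeder refills at a constant rate."""
    drained = min(per_request_limit, POOL_SLOTS)
    return drained / REFILL_PER_SECOND

print(seconds_pool_is_starved(PER_REQUEST_LIMIT))   # 18.0 s after a 900-task grab
print(seconds_pool_is_starved(300))                 # 6.0 s after a 300-task grab

With these made-up numbers, a 900-task request starves the pool three times as long as a 300-task one, which is why the per-host limit and the pool size have to be raised together.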

Mr P Hucker

Joined: 5 Jul 11
Posts: 990
Credit: 376,142,956
RAC: 2
Message 73109 - Posted: 26 Apr 2022, 0:59:03 UTC

Thanks, Tom. Maybe you could ask on a BOINC forum?

Robert Coplin

Joined: 23 Sep 13
Posts: 19
Credit: 36,217,133
RAC: 0
Message 73110 - Posted: 26 Apr 2022, 1:11:54 UTC

Why are we getting new N-body work units when we have a lot of Validation Inconclusive work units that still need to be validated?