Welcome to MilkyWay@home

Validation inconclusive

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,027,893
RAC: 38,188
Message 72673 - Posted: 11 Apr 2022, 2:30:28 UTC - in response to Message 72669.  

What may help is recruiting some of those BOINC teams that do various marathons and sprints focusing on one project at a time. Getting a few hundred or maybe thousand volunteers to focus on N-Body for a period of time would clear the large queue quickly. One uncertainty is whether the server will be able to handle the high jump in traffic.

What would also help is just turning off the N-Body generator for a few days. Tom was going to do that a few days ago. Catching up on N-Body will be even slower once the WCG folks leave.
ID: 72673

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,437,067
RAC: 37,187
Message 72675 - Posted: 11 Apr 2022, 3:30:27 UTC - in response to Message 72673.  

What would also help is just turning off the N-Body generator for a few days. Tom was going to do that a few days ago. Catching up on N-Body will be even slower once the WCG folks leave.

I think you'll find the NBody work unit generator is already turned off! Tom did indeed say he was going to do so several days ago, and it no longer appears on the server status list.

However, retries to resolve inconclusive validations still get added to the queue of work to do, as they don't involve the generator :-)

In the case of Separation work units, one side-effect of the performance issues seemed to be requests for a second retry which would turn out to be unnecessary as eventually all three results would validate. If the same is happening with NBody, that could be a very large number of retries, and as has been pointed out earlier in this thread the retries [probably] end up on the end of the queue -- spot the problem...

As NBody doesn't seem to attract as many users as Separation (non-availability of a GPU version being a factor) this could take some time to sort itself out, even if some WCG users stick around to help out :-(

Cheers - Al.

P.S. I don't run MW CPU tasks, as I don't do GPU and CPU work from a project on the same systems. I therefore have no first-hand experience of NBody throughput...
ID: 72675

GolfSierra
Joined: 11 Mar 22
Posts: 42
Credit: 21,902,543
RAC: 0
Message 72685 - Posted: 11 Apr 2022, 9:35:36 UTC - in response to Message 72665.  

+1
ID: 72685

GolfSierra
Joined: 11 Mar 22
Posts: 42
Credit: 21,902,543
RAC: 0
Message 72686 - Posted: 11 Apr 2022, 9:39:40 UTC - in response to Message 72675.  


As NBody doesn't seem to attract as many users as Separation (non-availability of a GPU version being a factor) this could take some time to sort itself out, even if some WCG users stick around to help out :-(


I changed my profile to only receive NBody WUs and will keep this setting until WCG is back online.
ID: 72686

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,940,047
RAC: 22,627
Message 72688 - Posted: 11 Apr 2022, 10:20:34 UTC - in response to Message 72669.  

Second tasks of N-Body Work Units are being sent out but at a very slow pace. I've recently gotten some tasks for WUs that had the first task completed almost a month ago. I think that, like has been mentioned before, the 13+ million queue needs to be processed first before things can go back to normal. It'll take time though since N-Body is CPU only and so tasks take longer and not as many people process them. There are 5 times as many users in the last 24 hours for Separation compared to N-Body.

What may help is recruiting some of those BOINC teams that do various marathons and sprints focusing on one project at a time. Getting a few hundred or maybe thousand volunteers to focus on N-Body for a period of time would clear the large queue quickly. One uncertainty is whether the server will be able to handle the high jump in traffic.


One problem is that new people see a single task taking every CPU core in their PC, say 'whoa, not for me', drop the whole batch of tasks, and leave. That only applies if the task has been sent out, though. Personally, I think something like what PrimeGrid did, letting the user choose how many CPU cores to put on a single task, is much better for the user, as most newbies have no clue how to write app_config.xml files to make such changes on their own.
ID: 72688

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,834
RAC: 271
Message 72689 - Posted: 11 Apr 2022, 10:46:28 UTC - in response to Message 72688.  

Second tasks of N-Body Work Units are being sent out but at a very slow pace. I've recently gotten some tasks for WUs that had the first task completed almost a month ago. I think that, like has been mentioned before, the 13+ million queue needs to be processed first before things can go back to normal. It'll take time though since N-Body is CPU only and so tasks take longer and not as many people process them. There are 5 times as many users in the last 24 hours for Separation compared to N-Body.

What may help is recruiting some of those BOINC teams that do various marathons and sprints focusing on one project at a time. Getting a few hundred or maybe thousand volunteers to focus on N-Body for a period of time would clear the large queue quickly. One uncertainty is whether the server will be able to handle the high jump in traffic.


One problem is that new people see a single task taking every CPU core in their PC, say 'whoa, not for me', drop the whole batch of tasks, and leave. That only applies if the task has been sent out, though. Personally, I think something like what PrimeGrid did, letting the user choose how many CPU cores to put on a single task, is much better for the user, as most newbies have no clue how to write app_config.xml files to make such changes on their own.


I have 16 processors and find the simplest way to handle NBody is to set BOINC to 25% CPU (4 CPUs) and get a load of NBody WUs; they all say 4 CPUs. Once they start processing, I change BOINC to 50% or 75% CPU so I can run 2 or 3 tasks at once, and each task will still only use 4 CPUs. I think another reason is that the credits are quite poor and seem to be based on elapsed time, not CPU time. CPU time is obviously 2 or 3 times the elapsed time, depending on how many CPUs you let it grab. Once the load of WUs is finished, I do the whole process again. It is a bit of a nuisance, but otherwise, if I let WUs just flow in, any new ones will grab all 12 CPUs. Works for me.
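The arithmetic behind this approach can be sketched in a few lines of Python. This is only an illustration, not anything BOINC exposes: the function names are made up, rounding down is an assumption, and it takes for granted (as described above) that a task keeps the CPU count it was assigned when it was downloaded.

```python
def cpus_allowed(total_cpus, percent):
    """CPUs BOINC may use at a given 'use at most N% of the CPUs' setting.

    Rounding down is an assumption made for this sketch.
    """
    return max(1, int(total_cpus * percent / 100))


def concurrent_tasks(total_cpus, fetch_percent, run_percent):
    """Return (CPUs per task, tasks running at once).

    fetch_percent: CPU limit in force while the tasks are downloaded,
                   which fixes how many CPUs each task claims.
    run_percent:   CPU limit set afterwards, while the tasks run.
    """
    per_task = cpus_allowed(total_cpus, fetch_percent)
    running = cpus_allowed(total_cpus, run_percent) // per_task
    return per_task, running


print(concurrent_tasks(16, 25, 50))  # (4, 2)
print(concurrent_tasks(16, 25, 75))  # (4, 3)
```

With 16 processors, fetching at 25% and then running at 50% or 75% gives the 2 or 3 simultaneous 4-CPU tasks mentioned above.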
ID: 72689

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,834
RAC: 271
Message 72692 - Posted: 11 Apr 2022, 12:13:21 UTC

Also, based on my experience, I don't think NBody will run on fewer than 4 CPUs. It certainly won't run on my old Intel Yorkfield Quad with usage set at 75%, whereas Separation tasks will.
ID: 72692

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,027,893
RAC: 38,188
Message 72693 - Posted: 11 Apr 2022, 13:57:08 UTC - in response to Message 72675.  

I think you'll find the NBody work unit generator is already turned off!

Correct! I missed that. These glasses aren't worth crap
ID: 72693

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,027,893
RAC: 38,188
Message 72695 - Posted: 11 Apr 2022, 13:59:43 UTC - in response to Message 72689.  
Last modified: 11 Apr 2022, 14:05:03 UTC

I think another reason is that the credits are quite poor and seem to be based on elapsed time, not CPU time. CPU time is obviously 2 or 3 times the elapsed time, depending on how many CPUs you let it grab.


That is a big issue for me. I did a study several weeks ago which showed that N-Body only yields about 20% of the credit for the same amount of CPU time.

Another issue is that I don't get 100% CPU on N-Body like I get with Separation. So right now I reluctantly run N-Body.
ID: 72695

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,027,893
RAC: 38,188
Message 72697 - Posted: 11 Apr 2022, 14:08:04 UTC - in response to Message 72689.  

I have 16 processors and find the simplest way to handle NBody is to set BOINC to 25% CPU (4 CPUs) and get a load of NBody WUs; they all say 4 CPUs. Once they start processing, I change BOINC to 50% or 75% CPU so I can run 2 or 3 tasks at once, and each task will still only use 4 CPUs.

Gonna try this. I have 12 cores/24 threads. Will see how it works.
ID: 72697

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,027,893
RAC: 38,188
Message 72705 - Posted: 11 Apr 2022, 20:07:08 UTC - in response to Message 72697.  
Last modified: 11 Apr 2022, 20:08:20 UTC

I have 16 processors and find the simplest way to handle NBody is to set BOINC to 25% CPU (4 CPUs) and get a load of NBody WUs; they all say 4 CPUs. Once they start processing, I change BOINC to 50% or 75% CPU so I can run 2 or 3 tasks at once, and each task will still only use 4 CPUs.

Gonna try this. I have 12 cores/24 threads. Will see how it works.

Works well! 8 instances of 3 CPUs each! Utilization is now at 94%, instead of 45%! Cool!
ID: 72705

AndreyOR
Joined: 13 Oct 21
Posts: 43
Credit: 225,017,873
RAC: 9,954
Message 72706 - Posted: 11 Apr 2022, 20:13:50 UTC

I'm pretty sure N-Body will run on any number of cores you set it to, up to 16. I know for sure it'll run on as few as 2, as I run them on an old laptop with an Intel Celeron N3050 2-core CPU running Lubuntu off a thumb drive. An app_config.xml file is a good way to limit CPU usage, or you can do what Septimus does. I agree that a web preferences option would be nice. Besides PrimeGrid (as per mikey), LHC@Home also has this option in web preferences. It allows you to limit the number of tasks to download as well as the number of CPUs to dedicate to a task.

This needs to be verified, but I think that running them single-core would be best for max points, and 16 cores per task would be best for max throughput.
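For anyone unsure what such an app_config.xml might look like, here is a minimal Python sketch that builds one. The element names (app_config, app_version, app_name, plan_class, avg_ncpus, cmdline) follow the BOINC client configuration format, but the app name "milkyway_nbody", the plan class "mt", and the "--nthreads" option are assumptions that should be checked against the project's applications page and your own client_state.xml before use.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, ncpus):
    """Build an app_config.xml document limiting one app to `ncpus` CPUs per task."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # avg_ncpus tells the client how many CPUs to budget for each task.
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Assumed option: asks the application itself to use that many threads.
    ET.SubElement(version, "cmdline").text = f"--nthreads {ncpus}"
    return ET.tostring(root, encoding="unicode")


# App name and plan class below are assumptions, not confirmed in this thread.
config_xml = build_app_config("milkyway_nbody", "mt", 4)
print(config_xml)
```

The resulting text would be saved as app_config.xml in the project's directory inside the BOINC data directory, after which the client needs to re-read its config files or be restarted.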
ID: 72706

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,834
RAC: 271
Message 72708 - Posted: 11 Apr 2022, 20:48:25 UTC - in response to Message 72705.  

I have 16 processors and find the simplest way to handle NBody is to set BOINC to 25% CPU (4 CPUs) and get a load of NBody WUs; they all say 4 CPUs. Once they start processing, I change BOINC to 50% or 75% CPU so I can run 2 or 3 tasks at once, and each task will still only use 4 CPUs.

Gonna try this. I have 12 cores/24 threads. Will see how it works.

Works well! 8 instances of 3 CPUs each! Utilization is now at 94%, instead of 45%! Cool!


Glad it works for you…great…
ID: 72708

GolfSierra
Joined: 11 Mar 22
Posts: 42
Credit: 21,902,543
RAC: 0
Message 72724 - Posted: 12 Apr 2022, 14:51:16 UTC - in response to Message 72708.  

I didn't restrict BOINC this way so I get NBody (16 CPUs) tasks. Each task needs 1:48 min to complete.

However, I noticed that for some reason some tasks freeze at different stages. The only way to restart them is to exit BOINC and restart it. This also happens during the night. The only cause I can think of is my antivirus software, which probably suppresses BOINC occasionally.
ID: 72724

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,834
RAC: 271
Message 72725 - Posted: 12 Apr 2022, 15:29:21 UTC - in response to Message 72724.  
Last modified: 12 Apr 2022, 15:30:45 UTC

I would think you are locking the system out from its own functions with that many high CPU tasks. I run with 2 x 6 CPU and tasks take 3.5 mins on average.
ID: 72725

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,940,047
RAC: 22,627
Message 72755 - Posted: 13 Apr 2022, 10:52:55 UTC - in response to Message 72724.  

I didn't restrict BOINC this way so I get NBody (16 CPUs) tasks. Each task needs 1:48 min to complete.

However, I noticed that for some reason some tasks freeze at different stages. The only way to restart them is to exit BOINC and restart it. This also happens during the night. The only cause I can think of is my antivirus software, which probably suppresses BOINC occasionally.


Set your a/v program to ignore the BOINC folders. Yes, the constant back-and-forth sending of data can look like a real virus, so BOINC sometimes gets flagged. Any real virus will try to escape the BOINC folders and get caught by your a/v program, but you will no longer get the false-positive problems. Every project uses an a/v program on its end to ensure we users don't get a virus from it, so you can be pretty sure you won't get a virus from a BOINC project.
ID: 72755

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,834
RAC: 271
Message 72760 - Posted: 13 Apr 2022, 17:30:08 UTC

I think it’s time we had an explanation about the hold-up on the NBody Simulation WUs.

Are they getting validated? If so, how many days behind after processing?

Why does the total of 13.8 million outstanding WUs barely change from day to day?

Would a fixed credit of, say, 76 encourage more processing, bearing in mind that 20 times more users seem to do Separation runs rather than Simulations?

I would gladly do more Simulation WUs, but at present I see little point. Compared to others my contribution is minimal, but I would gladly contribute more.
ID: 72760

Wrend
Joined: 4 Nov 12
Posts: 96
Credit: 251,528,484
RAC: 0
Message 72761 - Posted: 13 Apr 2022, 18:16:29 UTC
Last modified: 13 Apr 2022, 18:30:46 UTC

For me, Separation tasks have been very hit or miss, mostly miss: I have to repeatedly request updates to get WUs, and then only occasionally get any.
I'm doing this for the science, not the credits. Credits are a nice milestone to help keep track of things, though, so naturally the less artificially manipulated or inflated they are, the better for me. I do Separation specifically because my two Titan Black cards are more competent at crunching them due to their DP/FP64 capabilities. I'm doing Einstein@Home on the CPU.

You can see the batches of GPU tasks stuttering along here in my RAC listing in the BOINC manager UI. https://i.imgur.com/5gSp9WY.png

The GPUs have been actively crunching maybe 1/3 to 1/5 of the time on average within that RAC upward trajectory.
ID: 72761

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,437,067
RAC: 37,187
Message 72762 - Posted: 13 Apr 2022, 18:27:38 UTC - in response to Message 72760.  

I am only a user, not an MW scientist or technician, but I'll have a go at this...

"I think it's time we had an explanation about the hold-up on the NBody Simulation WUs."

The explanation is almost certainly the 13+ million tasks waiting. Even if there weren't a huge backlog on validation, that's still too many tasks for a sensible database load (and filestore load if each task has any unique data files, though I don't know if that applies to NBody). See also below...

"Are they getting validated? If so, how many days behind after processing?"

I very much doubt any of them are getting validated (unless "trusted status" operates on NBody and a user gets lucky).
I don't run NBody myself (GPU tasks only) but I looked at the NBody results for some other users who posted about inconclusive results, and those users had results dating back several weeks that were tagged as inconclusive...

"Why does the total of 13.8 million outstanding WUs barely change from day to day?"

When a retry is requested, the "result record" for it [presumably] goes on the end of the queue of tasks to be sent out. So each result returned that needs a retry to confirm it will create another task to run - one in, one to go out! Not good...

"Would a fixed credit of, say, 76 encourage more processing, bearing in mind that 20 times more users seem to do Separation runs rather than Simulations?

"I would gladly do more Simulation WUs, but at present I see little point. Compared to others my contribution is minimal, but I would gladly contribute more."

I don't think the answer is getting more people to run NBody (especially if each initial task generates a retry!); it will probably take some fairly drastic action to resolve this, as I don't think it is simple to get the feeder to give precedence to retries (which would certainly thin out the inconclusives were it possible) and it may be "bad science" to just kill off millions of tasks that are waiting.
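The "one in, one to go out" effect can be illustrated with a toy Python simulation. It is a sketch under stated assumptions, not a model of the real MilkyWay@home scheduler: the queue is strictly first-in, first-out, every first result comes back inconclusive and spawns exactly one retry at the back of the queue, a retry always settles its work unit, and no new work is generated. The function name and the numbers are made up.

```python
from collections import deque


def simulate_queue(initial_tasks, processed_per_day, days):
    """Return the queue length at the end of each day."""
    queue = deque(["first"] * initial_tasks)
    lengths = []
    for _ in range(days):
        for _ in range(min(processed_per_day, len(queue))):
            task = queue.popleft()
            if task == "first":
                # Inconclusive result: the retry joins the END of the queue.
                queue.append("retry")
            # A "retry" completes its work unit and adds nothing.
        lengths.append(len(queue))
    return lengths


print(simulate_queue(initial_tasks=10, processed_per_day=2, days=8))
# [10, 10, 10, 10, 10, 8, 6, 4]
```

Under these assumptions the queue stays flat until every first task has gone out, and only starts to shrink once the retries reach the front, which is consistent with a total that barely changes from day to day.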

As far as I know, NBody is still Eric Mendelsohn's project at present - any drastic solution is likely to require his say-so...

Cheers- Al.
ID: 72762

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,834
RAC: 271
Message 72765 - Posted: 13 Apr 2022, 20:11:44 UTC - in response to Message 72762.  
Last modified: 13 Apr 2022, 20:50:39 UTC

Thanks very much for your explanation, Alanb. Really helpful. The thing that is confusing me is that all my second tasks, although they have a number, are shown as unsent. In reality, the database must just be getting full of tasks going nowhere.
ID: 72765

©2024 Astroinformatics Group