Welcome to MilkyWay@home

Nbody WU Flush

Message boards : News : Nbody WU Flush

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 72791 - Posted: 14 Apr 2022, 18:04:27 UTC

Hey Everyone,

I'm going to turn the project off for a little while to cancel a bunch of WUs. There's no easy way to cancel only Nbody WUs (unless I want to sit and cancel them 999 at a time, because of how BOINC servers are set up), so I'm just going to cancel ~75% of the currently available jobs. This will reduce the enormous Nbody backlog and hopefully get things flowing more smoothly.

Best,
Tom
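
To give a sense of why cancelling 999 at a time is impractical, here is a minimal sketch of the arithmetic. The 14 million figure is only an assumption taken from the backlog size mentioned later in this thread, and the helper names are hypothetical; nothing here talks to a real BOINC server.

```python
import math

def batches_needed(total_jobs: int, batch_size: int = 999) -> int:
    """Number of cancel requests needed if each request handles at most batch_size jobs."""
    return math.ceil(total_jobs / batch_size)

def id_ranges(first_id: int, last_id: int, batch_size: int = 999):
    """Split an inclusive ID range into consecutive chunks of at most batch_size IDs."""
    start = first_id
    while start <= last_id:
        end = min(start + batch_size - 1, last_id)
        yield (start, end)
        start = end + 1

# Assumed backlog of roughly 14 million jobs (illustrative figure only).
print(batches_needed(14_000_000))   # 14015
print(list(id_ranges(1, 2500)))     # [(1, 999), (1000, 1998), (1999, 2500)]
```

At that scale a single bulk cancellation by ID range is far less work than thousands of manual batches.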

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 72792 - Posted: 14 Apr 2022, 18:09:53 UTC
Last modified: 14 Apr 2022, 18:10:13 UTC

Okay, all done. When the server status page updates there should be a lot fewer jobs in the backlog.

Also, I've noticed that the DB isn't struggling with queries that took hours/days to complete anymore, so it does in fact look like the project is healing.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,881
RAC: 267
Message 72793 - Posted: 14 Apr 2022, 18:11:31 UTC - in response to Message 72792.  

Thanks….

AndreyOR
Joined: 13 Oct 21
Posts: 43
Credit: 225,022,930
RAC: 9,511
Message 72794 - Posted: 14 Apr 2022, 19:08:35 UTC

Cool, I was about to make a post with some grim numbers about the declining user base and how long it'd likely take for us to process those almost 14 million N-Body tasks. I may be wrong but I still think that it'll take a bit of time before inconclusive validations get cleared out.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 72795 - Posted: 14 Apr 2022, 19:23:44 UTC

It looks like it updated and the numbers didn't change. I think I have to reset the transition times of all WUs before they'll get cleared out - I'm going to turn the project off for a little while and do that.
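
As a rough illustration of why resetting transition times might matter: the sketch below assumes a workunit is only looked at once its `transition_time` is at or before the current time. That rule and the data layout are simplifying assumptions for illustration, not a description of the actual BOINC transitioner.

```python
import time

def due_for_transition(workunits, now):
    """Return IDs of workunits whose transition_time has already passed."""
    return [wu["id"] for wu in workunits if wu["transition_time"] <= now]

def reset_transition_times(workunits, now):
    """Set every workunit's transition_time to now so all of them become due."""
    for wu in workunits:
        wu["transition_time"] = now

now = int(time.time())
# Hypothetical workunits: one already due, two scheduled for later.
workunits = [
    {"id": 1, "transition_time": now - 60},
    {"id": 2, "transition_time": now + 86_400},
    {"id": 3, "transition_time": now + 7 * 86_400},
]

print(due_for_transition(workunits, now))   # [1]
reset_transition_times(workunits, now)
print(due_for_transition(workunits, now))   # [1, 2, 3]
```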

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 72798 - Posted: 14 Apr 2022, 20:45:53 UTC

The status page has updated again, and it looks like all the Separation WUs were cancelled but none of the Nbody WUs were. I don't know how that's possible.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 72799 - Posted: 14 Apr 2022, 20:46:49 UTC

There is still a 1.5 hour transitioner backlog. Maybe once that is worked through the Nbody WU count will drop.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 72801 - Posted: 14 Apr 2022, 23:25:15 UTC

Well, there are 1.2M Separation WUs in the queue now. Hopefully the volunteers are noticing that there are more tasks being given out by the server?

Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 540,080,398
RAC: 86,658
Message 72803 - Posted: 14 Apr 2022, 23:39:25 UTC

Tom, at this point, since the project is not producing the expected results from configuration changes, it might be wisest and most efficient to just clear the database entirely and start fresh.

That would also get rid of the orphaned tasks from a year ago that you said would take a complete database clear to be rid of.

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,462,460
RAC: 36,126
Message 72805 - Posted: 15 Apr 2022, 1:34:36 UTC

Tom,

As a "GPU jobs only" user I have to say things seem to be behaving a lot better, but then I don't get NBody jobs :-) That said, that huge number still there against NBody makes me curious...

You refer to cancelling currently available jobs... Can you offer a little clarity here, as I/we have no idea what tools are available to a BOINC admin...?

If that meant flushing tasks (aka results within the database), I presume the transitioner will simply notice that a given work-unit needs a new task to replace each removed task, so it may effectively just juggle tasks to the back of the queue! If that is what's happening, it explains why the number of NBody tasks isn't dropping significantly (though it might help bring some of the retries forward).

If, however, it meant work units then the database may have a problem... There appears to be no "referential integrity" in the d/b schema and constraints (perhaps because old versions of MySQL didn't actually implement it!), so if the tool for removing work-units doesn't explicitly clear out the associated task/result items too, there will probably be [future] issues :-(

I'm guessing you've already done MySQL queries against the workunit and result tables to see how many WUs there are and get a flavour of the state of tasks? I've had a quick look at the schema stuff that purports to be for the server version quoted on the "Server status" page, and it does seem to be a bag of nails, doesn't it?

Thanks for your current efforts, and good luck with NBody!

Cheers - Al.
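
The referential-integrity concern above can be sketched with a small self-contained example. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the two tables are heavily simplified assumptions (just `workunit.id` and `result.workunitid`), not the real BOINC schema. Without a foreign-key constraint, deleting a workunit leaves its results behind, and a LEFT JOIN finds them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in tables with no foreign-key constraint between them.
conn.executescript("""
    CREATE TABLE workunit (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE result (id INTEGER PRIMARY KEY, workunitid INTEGER, name TEXT);
    INSERT INTO workunit VALUES (1, 'wu_a'), (2, 'wu_b');
    INSERT INTO result VALUES (10, 1, 'wu_a_0'), (11, 2, 'wu_b_0'), (12, 2, 'wu_b_1');
""")

def orphaned_results(conn):
    """Return IDs of results whose workunitid matches no row in workunit."""
    rows = conn.execute("""
        SELECT r.id
        FROM result AS r
        LEFT JOIN workunit AS w ON w.id = r.workunitid
        WHERE w.id IS NULL
        ORDER BY r.id
    """).fetchall()
    return [row[0] for row in rows]

print(orphaned_results(conn))   # []

# Removing a workunit without touching its results creates orphans.
conn.execute("DELETE FROM workunit WHERE id = 2")
print(orphaned_results(conn))   # [11, 12]
```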

waffleironhead
Joined: 10 Aug 09
Posts: 9
Credit: 70,402,530
RAC: 0
Message 72806 - Posted: 15 Apr 2022, 2:24:20 UTC

I am now consistently getting separation WUs.
Thank you!

Wrend
Joined: 4 Nov 12
Posts: 96
Credit: 251,528,484
RAC: 0
Message 72808 - Posted: 15 Apr 2022, 4:41:22 UTC
Last modified: 15 Apr 2022, 4:54:09 UTC

It had been suggested in another forum thread that having more systems crunching Nbody WUs would help. I just have the one computer of course, but wouldn't mind switching over from Einstein@H on the CPU to MW@H once the queue is empty (I've set it to no new WUs), if that's the case. Thanks.

Usually I just crunch on the GPUs for MW@H, because my Titan Black cards can be DP/FP64 optimized beyond most other Nvidia cards, figuring that they could do more good here, while helping out other projects on the CPU threads.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 72842 - Posted: 15 Apr 2022, 14:09:16 UTC - in response to Message 72805.  

There is a tool within BOINC that provides an "OPS" page for project admins. This page has several scripts that you can run, one of which allows you to cancel jobs. To cancel jobs, you can either provide a SQL query (limited to 999 jobs) or a range of WU ids. These WUs aren't dropped from the DB, they are just marked as "no longer needed" and transitioned out from the list of WUs that need to be completed. By the way, "jobs" is the terminology that BOINC uses in its scripts. What it is really doing is querying and updating the workunit table.

I could manually run DB queries to do all of this myself, but I figure that the BOINC script should do everything correctly, and avoid me missing something when I'm going through the process. Yes, I've done queries to look at the workunit and results tables. As of right now, the workunit table contains 1.4M WUs for Nbody, versus 3M for Separation. So the numbers don't exactly correspond to the server status page for whatever reason.
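
The "marked, not dropped" behaviour described above can be sketched as follows. This is a toy model using Python's built-in sqlite3 module: the `cancelled` column, the `CANCELLED` value, and the table layout are all placeholders chosen for illustration, not the columns or constants that the BOINC scripts actually use.

```python
import sqlite3

CANCELLED = 1  # placeholder flag value, not BOINC's real constant

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workunit ("
    "id INTEGER PRIMARY KEY, app TEXT, cancelled INTEGER DEFAULT 0)"
)
# 3000 made-up workunits: every third one is Separation, the rest Nbody.
conn.executemany(
    "INSERT INTO workunit (id, app) VALUES (?, ?)",
    [(i, "separation" if i % 3 == 0 else "nbody") for i in range(1, 3001)],
)

def cancel_id_range(conn, first_id, last_id):
    """Mark workunits in an inclusive ID range as cancelled; rows are kept."""
    cur = conn.execute(
        "UPDATE workunit SET cancelled = ? "
        "WHERE id BETWEEN ? AND ? AND cancelled = 0",
        (CANCELLED, first_id, last_id),
    )
    return cur.rowcount

def open_counts(conn):
    """Count workunits per application that are not marked as cancelled."""
    rows = conn.execute(
        "SELECT app, COUNT(*) FROM workunit WHERE cancelled = 0 GROUP BY app"
    ).fetchall()
    return dict(rows)

marked = cancel_id_range(conn, 1, 999)
total_rows = conn.execute("SELECT COUNT(*) FROM workunit").fetchone()[0]
print(marked)             # 999
print(total_rows)         # 3000 (nothing was deleted)
print(open_counts(conn))  # {'nbody': 1334, 'separation': 667}
```

Note that cancelling by ID range hits both applications at once, which matches the point made earlier in the thread that there is no easy way to target only Nbody WUs this way.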

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,462,460
RAC: 36,126
Message 72857 - Posted: 15 Apr 2022, 18:25:59 UTC - in response to Message 72842.  

Tom,

Thanks for the response -- this bit clarified what I was worried about:
"These WUs aren't dropped from the DB, they are just marked as 'no longer needed' and transitioned out from the list of WUs that need to be completed."
and the next bit highlighted something I'd wondered about:
"As of right now, the workunit table contains 1.4M WUs for Nbody, versus 3M for Separation. So the numbers don't exactly correspond to the server status page for whatever reason."
So the best we can do at the moment is ignore the numbers on the server status page :-)

Again, thanks for your efforts and responses.

Cheers - Al.

poppinfresh99
Joined: 28 Feb 22
Posts: 16
Credit: 2,400,538
RAC: 0
Message 72866 - Posted: 16 Apr 2022, 5:09:24 UTC - in response to Message 72857.  
Last modified: 16 Apr 2022, 5:14:34 UTC

Tom, I think you are mistakenly saying WORKUNITS when you mean that you're deleting TASKS. When you clear jobs via the "milkyway_ops" page, I'm pretty sure it just removes the current tasks from the workunits, which perhaps are counted as an error towards the workunit, but then the workunit can send more tasks. I might be wrong since it was a while ago that I set up a BOINC server on my laptop on my local network for fun. Since N-body is multithreaded, each computer is only running 1 task at a time, so deleting tasks does minimal harm, especially since the tasks are so incredibly tiny (could easily be 50 times larger).

Whatever you did, I'm finally getting my 30-day-old N-body workunits to become valid!! My computer is also now helping validate other old N-body workunits! Thank you!

There probably are 1.4 million N-body workunits since, before today, only like 5 of my over 4000 workunits have validated in the past 30 days, and I'm just one computer, so approximately 1,400,000/4000 = 350 computers could be responsible for this.

As for why there are still 13.8 million N-body tasks to send (source: Server Status page), these might be some kind of orphaned tasks??

If you decide to clear the database to remove orphaned tasks, before you do it, please allow my 4000+ N-body workunits to finish! I'd hate to have all this work by me and others go to waste!
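
The back-of-the-envelope estimate above (1.4 million workunits at roughly 4000 per computer) written out as a tiny sketch. The function name is hypothetical, and the assumption that every computer holds about the same number of workunits is only as good as the original guess.

```python
def computers_needed(total_workunits: int, workunits_per_computer: int) -> float:
    """Estimate how many similarly loaded computers account for a workunit total."""
    return total_workunits / workunits_per_computer

# Figures quoted in the post: 1.4M Nbody workunits, about 4000 per computer.
print(computers_needed(1_400_000, 4000))   # 350.0
```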

poppinfresh99
Joined: 28 Feb 22
Posts: 16
Credit: 2,400,538
RAC: 0
Message 72869 - Posted: 16 Apr 2022, 6:39:56 UTC - in response to Message 72866.  

Oh never mind, I was thinking about something else when I had corrupted workunit filenames. I just tested it on my BOINC server and the WORKUNIT is what is deleted.

None of my MilkyWay workunits seem to have been deleted though!

San-Fernando-Valley
Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 72870 - Posted: 16 Apr 2022, 7:06:37 UTC - in response to Message 72869.  

"... None of my MilkyWay workunits seem to have been deleted though!"

I'm wondering how you can tell if a task, which is part of a workunit (not belonging to anyone specific), has been deleted?
Hmm, actually, what do you mean by "deleted"?

poppinfresh99
Joined: 28 Feb 22
Posts: 16
Credit: 2,400,538
RAC: 0
Message 72889 - Posted: 16 Apr 2022, 15:30:42 UTC - in response to Message 72870.  
Last modified: 16 Apr 2022, 15:36:39 UTC

"I'm wondering how you can tell if a task, which is part of a workunit (not belonging to anyone specific), has been deleted?
Hmm, actually, what do you mean by 'deleted'?"

I meant cancel (via a page like "cancel jobs" at milkyway_ops)

A task being "deleted"/cancelled says something like "Didn't need" on the task's status.

A workunit being "deleted"/cancelled says "Cancelled by server" or "Completed, can't validate" on the workunit's status. They should be sorted with "error" or "invalid" tasks.

San-Fernando-Valley
Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 72890 - Posted: 16 Apr 2022, 15:50:59 UTC - in response to Message 72889.  

Thanks for explaining.
Have never seen this information.
Probably because I never needed it?

Still wondering what "milkyway_ops" is.

I guess I'm just a plain noob ...

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,037,238
RAC: 35,415
Message 72908 - Posted: 16 Apr 2022, 20:40:48 UTC - in response to Message 72890.  

"I guess I'm just a plain noob ..."


YOU TOO?!?!

©2024 Astroinformatics Group