Nbody WU Flush
Author | Message |
---|---|
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Hey Everyone, I'm going to turn the project off for a little while to cancel a bunch of WUs. There's no easy way to cancel only Nbody WUs (unless I want to sit and cancel them 999 at a time, because of how BOINC servers are set up), so I'm just going to cancel ~75% of the currently available jobs. This will reduce the enormous Nbody backlog and hopefully get things flowing more smoothly. Best, Tom |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Okay, all done. When the server status page updates there should be a lot fewer jobs in the backlog. Also, I've noticed that the DB isn't struggling with queries that took hours/days to complete anymore, so it does in fact look like the project is healing. |
Send message Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
Thanks…. |
Send message Joined: 13 Oct 21 Posts: 44 Credit: 227,168,986 RAC: 7,426 |
Cool, I was about to make a post with some grim numbers about the declining user base and how long it'd likely take for us to process those almost 14 million N-Body tasks. I may be wrong but I still think that it'll take a bit of time before inconclusive validations get cleared out. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
It looks like it updated and the numbers didn't change. I think I have to reset the transition times of all WUs before they'll get cleared out - I'm going to turn the project off for a little while and do that. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
The status page has updated again, and it looks like all the Separation WUs were cancelled but none of the Nbody WUs were. I don't know how that's possible. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
There is still a 1.5 hour transitioner backlog. Maybe once that is worked through the Nbody WU count will drop. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Well, there are 1.2M Separation WUs in the queue now. Hopefully the volunteers are noticing that there are more tasks being given out by the server? |
Send message Joined: 24 Jan 11 Posts: 715 Credit: 556,584,433 RAC: 49,363 |
Tom, since the project is not producing the expected results from the configuration changes, it might be wisest and most efficient to just clear the database entirely and start fresh. That would also get rid of the orphaned tasks from a year ago, which you said would take a complete database clear to remove. |
Send message Joined: 16 Mar 10 Posts: 213 Credit: 108,720,796 RAC: 23,366 |
Tom, As a "GPU jobs only" user I have to say things seem to be behaving a lot better, but then I don't get NBody jobs :-) That said, that huge number still there against NBody makes me curious... You refer to cancelling currently available jobs... Can you offer a little clarity here, as I/we have no idea what tools are available to a BOINC admin...? If that meant flushing tasks (aka results within the database), I presume the transitioner will simply notice that a given work-unit needs a new task to replace each removed task, so it may effectively just juggle tasks to the back of the queue! If that is what's happening, it explains why the number of NBody tasks isn't dropping significantly (though it might help bring some of the retries forwards). If, however, it meant work units then the database may have a problem... There appears to be no "referential integrity" in the d/b schema and constraints (perhaps because old versions of MySQL didn't actually implement it!), so if the tool for removing work-units doesn't explicitly clear out the associated task/result items too, there will probably be [future] issues :-( I'm guessing you've already done MySQL queries against the workunit and result tables to see how many WUs there are and get a flavour of the state of tasks? I've had a quick look at the schema stuff that purports to be for the server version quoted on the "Server status" page, and it does seem to be a bag of nails, doesn't it? Thanks for your current efforts, and good luck with NBody! Cheers - Al. |
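The referential-integrity concern above can be illustrated with a small sketch. It uses an in-memory SQLite database rather than the project's MySQL server, and the two-table schema is a toy: the table names `workunit` and `result` come from the post, while the `workunitid` column and the helper names are assumptions made for illustration. The point is only that, without a foreign-key constraint, deleting a work unit leaves its results behind, and a `LEFT JOIN` can find them.

```python
import sqlite3


def find_orphaned_results(conn):
    """Return ids of result rows whose workunitid matches no workunit row."""
    rows = conn.execute(
        """
        SELECT r.id
        FROM result AS r
        LEFT JOIN workunit AS w ON w.id = r.workunitid
        WHERE w.id IS NULL
        ORDER BY r.id
        """
    ).fetchall()
    return [row[0] for row in rows]


def demo_orphans():
    conn = sqlite3.connect(":memory:")
    # Toy schema (assumed column names); note there is no FOREIGN KEY constraint.
    conn.execute("CREATE TABLE workunit (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, workunitid INTEGER)")
    conn.executemany("INSERT INTO workunit VALUES (?, ?)", [(1, "wu_a"), (2, "wu_b")])
    conn.executemany(
        "INSERT INTO result VALUES (?, ?)", [(10, 1), (11, 1), (12, 2), (13, 2)]
    )
    # Remove a work unit without touching its results: nothing prevents this.
    conn.execute("DELETE FROM workunit WHERE id = 2")
    return find_orphaned_results(conn)


if __name__ == "__main__":
    print(demo_orphans())
```

The same join, adapted to the real schema, is one way an admin might count orphaned results before deciding whether a full database clear is needed.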
Send message Joined: 10 Aug 09 Posts: 9 Credit: 70,518,679 RAC: 0 |
I am now consistently getting separation WU's. Thank you! |
Send message Joined: 4 Nov 12 Posts: 96 Credit: 251,528,484 RAC: 0 |
It had been suggested in another forum thread that having more systems crunching Nbody WUs would help. I just have the one computer of course, but wouldn't mind switching over from Einstein@H on the CPU to MW@H once the queue is empty (I've set it to no new WUs), if that's the case. Thanks. Usually I just crunch on the GPUs for MW@H, because my Titan Black cards can be DP/FP64 optimized beyond most other Nvidia cards, figuring that they could do more good here, while helping out other projects on the CPU threads. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Tom, There is a tool within BOINC that provides an "OPS" page for project admins. This page has several scripts that you can run, one of which allows you to cancel jobs. To cancel jobs, you can either provide a SQL query (limited to 999 jobs) or a range of WU ids. These WUs aren't dropped from the DB, they are just marked as "no longer needed" and transitioned out from the list of WUs that need to be completed. By the way, when this says "jobs", that's the terminology that BOINC is using in their scripts. What it is really doing is querying and updating the workunit table. I could manually run DB queries to do all of this myself, but I figure that the BOINC script should do everything correctly, and avoid me missing something when I'm going through the process. Yes, I've done queries to look at the workunit and result tables. As of right now, the workunit table contains 1.4M WUs for Nbody, versus 3M for Separation. So the numbers don't exactly correspond to the server status page for whatever reason. |
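As a rough sketch of the two behaviours described above (cancelling marks rows rather than deleting them, and a query-based cancel is capped at 999 jobs per run), here is a toy model on an in-memory SQLite table. It is not the BOINC ops script: the `cancelled` column, the `CANCELLED_FLAG` value, and the function names are all hypothetical, chosen only to make the idea concrete.

```python
import sqlite3

BATCH_LIMIT = 999    # per-query cap quoted in the post
CANCELLED_FLAG = 1   # hypothetical marker value, not BOINC's real constant


def id_batches(ids, size=BATCH_LIMIT):
    """Split a list of ids into consecutive chunks of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def cancel_range(conn, first_id, last_id):
    """Mark every work unit in [first_id, last_id] as cancelled; delete nothing."""
    cur = conn.execute(
        "UPDATE workunit SET cancelled = ? WHERE id BETWEEN ? AND ?",
        (CANCELLED_FLAG, first_id, last_id),
    )
    return cur.rowcount


def demo_cancel():
    conn = sqlite3.connect(":memory:")
    # Toy table; the `cancelled` column is an assumption for this sketch.
    conn.execute(
        "CREATE TABLE workunit "
        "(id INTEGER PRIMARY KEY, app TEXT, cancelled INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO workunit (id, app) VALUES (?, ?)",
        [(i, "nbody" if i % 2 else "separation") for i in range(1, 2501)],
    )
    marked = cancel_range(conn, 1, 1875)  # 75% of the 2500 toy rows
    total = conn.execute("SELECT COUNT(*) FROM workunit").fetchone()[0]
    pending = conn.execute(
        "SELECT COUNT(*) FROM workunit WHERE cancelled = 0"
    ).fetchone()[0]
    return marked, total, pending


if __name__ == "__main__":
    print(demo_cancel())
    print([len(b) for b in id_batches(list(range(2500)))])
```

In this toy, the id-range cancel touches both applications alike, while cancelling the same rows through capped queries would need three separate batches, which hints at why a range was the more practical option for a large backlog.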
Send message Joined: 16 Mar 10 Posts: 213 Credit: 108,720,796 RAC: 23,366 |
Tom, Thanks for the response -- this bit clarified what I was worried about: "These WUs aren't dropped from the DB, they are just marked as "no longer needed" and transitioned out from the list of WUs that need to be completed." And the next bit highlighted something I'd wondered about: "As of right now, the workunit table contains 1.4M WUs for Nbody, versus 3M for Separation. So the numbers don't exactly correspond to the server status page for whatever reason." So the best we can do at the moment is ignore the numbers on the server status page :-) Again, thanks for your efforts and responses. Cheers - Al. |
Send message Joined: 28 Feb 22 Posts: 16 Credit: 2,400,538 RAC: 0 |
Tom, I think you are mistakenly saying WORKUNITS when you mean that you're deleting TASKS. When you clear jobs via the "milkyway_ops" page, I'm pretty sure it just removes the current tasks from the workunits, which perhaps are counted as an error towards the workunit, but then the workunit can send more tasks. I might be wrong since it was a while ago that I set up a BOINC server on my laptop on my local network for fun. Since N-body is multithreaded, each computer is only running 1 task at a time, so deleting tasks does minimal harm, especially since the tasks are so incredibly tiny (could easily be 50 times larger). Whatever you did, I'm finally getting my 30-day-old N-body workunits to become valid!! My computer is also now helping validate other old N-body workunits! Thank you! There probably are 1.4 million N-body workunits since, before today, only like 5 of my over 4000 workunits have validated in the past 30 days, and I'm just one computer, so approximately 1,400,000/4000 = 350 computers could be responsible for this. As for why there are still 13.8 million N-body tasks to send (source: Server Status page), these might be some kind of orphaned tasks?? If you decide to clear the database to remove orphaned tasks, before you do it, please allow my 4000+ N-body workunits to finish! I'd hate to have all this work by me and others go to waste! |
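The back-of-the-envelope estimate in the post above can be written out as a one-line calculation. The function name is made up for illustration, and the figures are the ones quoted in the post (1.4 million work units, roughly 4000 per computer); it assumes every computer holds about the same number of unvalidated work units.

```python
def hosts_needed(total_workunits, workunits_per_host):
    """Estimate how many hosts would account for a backlog of work units."""
    return total_workunits / workunits_per_host


if __name__ == "__main__":
    print(hosts_needed(1_400_000, 4_000))  # 350.0
```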
Send message Joined: 28 Feb 22 Posts: 16 Credit: 2,400,538 RAC: 0 |
"Tom, I think you are mistakenly saying WORKUNITS when you mean that you're deleting TASKS. When you clear jobs via the "milkyway_ops" page, I'm pretty sure it just removes the current tasks from the workunits, which perhaps are counted as an error towards the workunit, but then the workunit can send more tasks. I might be wrong since it was a while ago that I set up a BOINC server on my laptop on my local network for fun. Since N-body is multithreaded, each computer is only running 1 task at a time, so deleting tasks does minimal harm, especially since the tasks are so incredibly tiny (could easily be 50 times larger)." Oh never mind, I was thinking about something else when I had corrupted workunit filenames. I just tested it on my BOINC server and the WORKUNIT is what is deleted. None of my MilkyWay workunits seem to have been deleted though! |
Send message Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0 |
... I'm wondering how you can tell if a task, which is part of a workunit (not belonging to anyone specific), has been deleted? Hmm, actually, what do you mean by "deleted"? |
Send message Joined: 28 Feb 22 Posts: 16 Credit: 2,400,538 RAC: 0 |
I meant cancel (via a page like "cancel jobs" at milkyway_ops). A task being "deleted"/cancelled says something like "Didn't need" on the task's status. A workunit being "deleted"/cancelled says "Cancelled by server" or "Completed, can't validate" on the workunit's status. They should be sorted with "error" or "invalid" tasks. |
Send message Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0 |
Thanks for explaining. Have never seen this information. Probably because I never needed it? Still wondering what "milkyway_ops" is. I guess I'm just a plain noob ... |
Send message Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
"I guess I'm just a plain noob ..." YOU TOO?!?! |
©2024 Astroinformatics Group