Message boards : News : bypassing server set cache limits
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
While we appreciate that everyone wants to crunch more MilkyWay@Home by increasing their cache limits, this is part of the reason we've had so many server problems lately, including an unresponsive validator. Our machine/database simply isn't fast enough to keep up with the additional workunits this puts in the database. So if anyone is modifying their BOINC client to artificially increase their cache, we're asking you to stop so the project will be more stable (until we can further improve our hardware). A few of the worst-offending clients (those that have cached 1k+ workunits) are being banned, as they haven't responded to us and they're hurting everyone's ability to crunch the project as a whole. In short, we need you to work with us: our limited hardware can't handle more than ~500k workunits at a time, and our cache limit is low partly for that reason.

Second, as we've said in previous threads, the nature of the science we're doing here requires a low cache, because it really improves the quality of the work you report. As you (hopefully) know by now, we search for structure within the Milky Way galaxy and try to optimize parameters to fit that structure. Lately we've also been running N-body simulations of the formation of those structures. Your workunits are trying to find the optimal set of parameters for those N-body simulations to best represent our sky survey data, or to fit the different structures (like dwarf galaxy tidal streams) from that data. To do this, we use strategies that mimic evolution. The server keeps track of a population of good potential solutions to these problems and generates workunits by mutating some solutions and using others to create offspring. You crunch the data and return the result; if it's a good one, we insert it into the population, which improves as a whole.

Over time, we get very good solutions that aren't really possible using deterministic approaches. If people have large caches, the work they're crunching can come from very old versions of those populations, which have since evolved quite a bit away from where they were when the user filled up their cache. When those results are finally returned, there's a lower chance they will improve the population we're currently working with. So that's why our cache is so low, and we'd really appreciate it if you worked with us on this. There are other great BOINC projects out there that can help fill in missing crunch time when we go down, and the BOINC client can definitely handle running more than one project at a time. So it might not be too bad to explore some of the other great research going on out there. :) Thanks again for your time and understanding, --Travis |
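The server-side evolutionary search described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the names, population size, and the placeholder `fitness` function are all hypothetical (the real project evaluates fitness by sending a workunit to a client and waiting for the returned result).

```python
import random

POP_SIZE = 20   # hypothetical population size
N_PARAMS = 4    # hypothetical number of model parameters

def fitness(params):
    # Placeholder objective standing in for "how well these parameters
    # fit the sky survey data"; in reality a client computes this.
    return -sum((p - 0.5) ** 2 for p in params)

def mutate(params, scale=0.1):
    # Perturb each parameter with small Gaussian noise.
    return [p + random.gauss(0, scale) for p in params]

def crossover(a, b):
    # Build an offspring by picking each parameter from one parent.
    return [random.choice(pair) for pair in zip(a, b)]

# Initialise a random population of candidate solutions.
population = [[random.random() for _ in range(N_PARAMS)]
              for _ in range(POP_SIZE)]
initial_best = max(fitness(p) for p in population)

for _ in range(200):
    # "Generate a workunit": mutate one solution, or cross two parents.
    if random.random() < 0.5:
        candidate = mutate(random.choice(population))
    else:
        candidate = crossover(*random.sample(population, 2))
    # "Return a result": if it beats the worst member, insert it,
    # so the population improves as a whole.
    worst = min(range(POP_SIZE), key=lambda i: fitness(population[i]))
    if fitness(candidate) > fitness(population[worst]):
        population[worst] = candidate

best = max(population, key=fitness)
```

The staleness problem Travis describes also falls out of this sketch: a result computed against a population from days ago is compared against a population that has since moved on, so it rarely beats the current worst member.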
Send message Joined: 20 Mar 08 Posts: 108 Credit: 2,607,924,860 RAC: 0 |
You have two good points, Travis, but also two different ones. In order to prevent the clients from crunching old data, the report deadline ought to be short. Presently, that's 8 days. In order to minimise the number of work units the server needs to keep track of, you can limit the number of cached work units per core. Presently that limit is 6. My fastest rig (as an example) is allowed a cache of 72 WUs. That's equivalent to ~23 minutes. When that runs out, it gets work from a backup project. Hours and hours of work (much more than my set preferences suggest, for some reason). While this lot is worked on, new MW WUs usually trickle in after a few minutes, just sitting there getting older before BOINC is done with the other project's work. (This FIFO behaviour might change in a future release of the BOINC core client.)
A small per-core cache is good (not really, but there aren't many better ways) for limiting the server load. But for keeping crunched data recent, it's actually a little counter-productive. As long as the N-body WUs are CPU-only, how about shortening the deadline on the separation WUs (like 1/4 or 1/8 of the current value) and reserving them for GPUs? If you also increase the separation WUs' per-core limit by 1, GPU caches will run out noticeably less often. Admittedly this will increase the server load slightly when everyone behaves, but it will also enable BOINC's built-in mechanisms to work better towards discouraging ridiculous caches on those misbehaving clients. |
Send message Joined: 28 Nov 07 Posts: 14 Credit: 94,794,818 RAC: 0 |
I am glad to see this being posted in the news feed. Not everyone reads the forums often or at all. Providing announcements here is helpful for highlighting important issues such as this. Thank you for the update. |
Send message Joined: 4 Jan 10 Posts: 86 Credit: 51,753,924 RAC: 0 |
Travis, What do you think about this post: http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=2170&nowrap=true#45787 |
Send message Joined: 26 Nov 09 Posts: 33 Credit: 62,675,234 RAC: 0 |
I have two GPUs in one system... I'd just like them to have a backup of 12 hours. I work between 3 projects. If I want MW to have any work, I have to disallow GPU work from the other projects, because those two will download a half day of work while MW will only download 24 units per day (about 2 hours of work). I'm lazy; I want the system to work without me having to turn projects on and off. I will work with what I've got and support MW. |
Send message Joined: 21 Mar 10 Posts: 5 Credit: 19,414,466 RAC: 0 |
My HD3850 and HD5850 both crunch MW@h exclusively. I noted before that the HD3850 has a limit of 12 tasks and the HD5850 has a limit of 24 tasks. Regardless, both caches are run through very quickly: 1½-2 hours' worth of work. Here's a question: how does CPU work compare to Radeon work and GeForce work? Are they doing different tasks? |
Send message Joined: 12 Nov 07 Posts: 2425 Credit: 524,164 RAC: 0 |
My HD3850 and HD5850 both crunch MW@h exclusively. I noted before that the HD3850 has a limit of 12 tasks and the HD5850 has a limit of 24 tasks. Regardless, both caches are run through very quickly; 1½-2 hours worth of work. All tasks are generic, except that CPUs can also run N-body tasks. A couple of years ago making GPU-specific tasks was discussed, but that hasn't come about yet. Doesn't expecting the unexpected make the unexpected the expected? If it makes sense, DON'T do it. |
Send message Joined: 2 Mar 10 Posts: 5 Credit: 105,634,798 RAC: 0 |
I'm pretty sure that guys like Bill Gates can toss in a few million to solve your hardware problem. There's more to this project than screwing around with code. Get off your butt and find the funding to make this project work! |
Send message Joined: 22 Apr 09 Posts: 38 Credit: 27,377,932 RAC: 0 |
Haven't seen MW work in some time now. It looks like projects are basically underfunded and, like SETI and some others, crunchers need not only to 'donate' their free computer cycles but also money for new servers?! Since the start of CUDA/CAL/OpenCL GPU crunching, only a few people are needed to 'crunch', judging by the huge difference in RAC (average work throughput) between CPU-only and GPU crunchers. Maybe the whole concept of Distributed Computing has to be reviewed. Hardware (Moore's Law), and thus software, is changing so fast that some projects are probably better off doing the crunching on a few mega-GPU computers, in order to keep their server problems under control. Just my 2 (Euro) cents... Knight Who says Ni |
Send message Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
I've got a cache full of WUs... The more crunchers, the more science gets done, and as more scientists comprehend the computing resources available to them, more projects will eventuate. All up, Distributed Computing is a massive success! |
Send message Joined: 15 Jul 08 Posts: 383 Credit: 729,293,740 RAC: 0 |
Haven't seen MW work in some time now See the reply I left you on the Collatz board, it should solve some of your problems. |
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
When that runs out, it gets work from a backup project. Hours and hours of work (much more than my set preferences suggest, for some reason). While this lot is worked on, new MW WUs usually trickle in after a few minutes, just sitting there getting older before BOINC is done with the other project's work. (This FIFO behaviour might change in a future release of the BOINC core client.) FIFO for the GPU is finally gone in the 6.12.x series... |
Send message Joined: 18 Dec 09 Posts: 9 Credit: 236,002,715 RAC: 0 |
I agree that this project requires a new population every few hours, but there is a great gap between GPUs, and my 5870 runs out of work in less than 18 minutes (if no work is sent). I don't want to participate in math projects, so my card has been idle 70% of the time over the last month. Some projects have the ability to choose longer WUs, but I didn't find this feature in the settings. What can I do to reduce my card's idle time? I can't find any suggestions here on the forums. |
Send message Joined: 11 Nov 09 Posts: 17 Credit: 7,324,208 RAC: 0 |
I have a task on SETI and I'm worried it might expire before I can download it; their servers haven't been liking me for a while. |
Send message Joined: 19 Aug 09 Posts: 23 Credit: 631,303 RAC: 0 |
Docter, if you post this question in Number Crunching at SETI@Home, they will probably remind you that when the servers are down (like now), expired tasks can still get credit once the servers are fixed. If it's stuck in download, don't worry, S@H will sort things out. |
Send message Joined: 1 Sep 08 Posts: 204 Credit: 219,354,537 RAC: 0 |
Travis, thanks for posting this in the news. It should give many people a better understanding of how MW works and is supposed to work. However, I've got a nice (IMO) idea for improving things further. Let me first describe my two personal hosts:

Nr. 1 has an i7 and an HD4870. It's allowed 42 units (HT on, 7 cores used) and needs ~175 s of GPU time per WU, i.e. a cache of 2 h. Nr. 2 has a C2Q and a pimped HD6950. It's allowed 24 units and needs ~65 s of GPU time per WU, i.e. a cache of 26 min.

The faster host is allowed to cache much less work, which totally doesn't make sense. If someone ran the work on an HT-enabled notebook i7, that cache might last an entire day. This clearly contradicts what you want. I've also observed that host Nr. 2 runs out of MW work rather often, far more often than Nr. 1. With such a small cache, any small glitch, be it on my side or yours, will cause a backup project to kick in and keep the machine busy with (almost useless) non-MW work. And as someone else said, after a short time 24 new MW WUs will arrive and sit in the cache for hours, waiting for their turn. Again, this is not what you want.

So may I make the following suggestion? Use the "Average turnaround time" as the basis for calculating the number of allowed WUs per host. It's a value readily available to the server, as it's already in the database. A setting of no more than 1 h should probably work.

This will solve several problems for you:
- it's independent of CPU count (which has been meaningless ever since the first GPUs appeared here)
- it scales automatically with the number and speed of GPUs (or whatever co-processor is used in the future)
- it allows fast hosts to contribute with fewer interruptions
- it prevents slow hosts from caching days of CPU work
- it prevents excessive caching by modified BOINC clients -- no matter how much they request, if they can't return work in time, they won't get more (and will get less next time)

Best regards, MrS

Scanning for our furry friends since Jan 2002 |
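The quota rule proposed above could be sketched as follows. This is only an illustration of the idea, not BOINC server code: the function name, the one-hour target, and the min/max clamps are all assumptions, and the two inputs (average turnaround time and average per-WU runtime) stand in for values the scheduler already tracks per host.

```python
# Hypothetical per-host quota based on measured turnaround rather
# than CPU core count. All names and constants are illustrative.

TARGET_CACHE_SECONDS = 3600   # aim for roughly one hour of cached work
MIN_QUOTA = 6                 # floor so new/slow hosts still get work
MAX_QUOTA = 200               # hard ceiling against runaway caching

def allowed_workunits(avg_turnaround_s, avg_wu_runtime_s):
    """How many WUs a host may hold, given its measured behaviour."""
    if avg_turnaround_s <= 0 or avg_wu_runtime_s <= 0:
        return MIN_QUOTA  # no history yet: start conservatively
    # A fast-returning host gets enough WUs to cover the target window.
    quota = int(TARGET_CACHE_SECONDS / avg_wu_runtime_s)
    if avg_turnaround_s > TARGET_CACHE_SECONDS:
        # Returning results later than the target window: shrink the
        # quota proportionally, so hoarders get less next time.
        quota = int(quota * TARGET_CACHE_SECONDS / avg_turnaround_s)
    return max(MIN_QUOTA, min(MAX_QUOTA, quota))
```

With the numbers from the post, a host like Nr. 2 (~65 s per WU) returning within half an hour would be allowed ~55 WUs, while a client hoarding work for two days would be throttled down to the minimum regardless of how much it requests.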
Send message Joined: 22 Mar 09 Posts: 99 Credit: 503,422,495 RAC: 0 |
Thanks MrS, very good idea! It's worth thinking about! |
Send message Joined: 6 Apr 09 Posts: 26 Credit: 1,021,301,443 RAC: 0 |
Thank you MrS (Extra Terrestrial Apes) for summarizing. Your idea makes far more sense than the current limit of 6 work units per (CPU) core. I have exactly the problem you described. Because of the physical size of the 5970, the only computer case the monster would fit into has a single-core processor (old MB), allowing only 6 work units at a time. With a dual-GPU video card -- unless everything works perfectly at MW -- I run out of WUs in ±9 minutes... I have set DNETC as my backup project on that machine, and most days it does more DNETC work than MW! I think your idea is great! I also vote for you for president!!! |
Send message Joined: 1 Sep 08 Posts: 204 Credit: 219,354,537 RAC: 0 |
Thanks for the flowers, guys :) MrS Scanning for our furry friends since Jan 2002 |
©2024 Astroinformatics Group