
Massive server issues and WU validation delays


Profile Michael H.W. Weber

Joined: 22 Jan 08
Posts: 29
Credit: 242,726,778
RAC: 0
Message 65537 - Posted: 27 Oct 2016, 8:06:08 UTC
Last modified: 27 Oct 2016, 8:25:25 UTC

For around three days now I have again been trying to support your project using 4 GPUs (2x 280X, 2x 290X). Although I raised these points earlier in this forum, to date you still have not solved the following issues:

1. GPU WUs are too short (280X: 9 sec/WU, 290X: 13 sec/WU).
2. Your server hands out only a limited number of WUs at a time.

At least you have now implemented automatic detection of the 290X GPUs. Thanks for that (although it took you years to do so).

Now, your server is running into massive database issues roughly every 15-30 minutes, resulting in:

1. failure to upload result data
2. failure to download new WUs (which results in idling machines)
3. failure to login to my account
4. inability to report this problem to you because your forum does not work either.

On top of that, your validator does not seem to keep up with the incoming results. Of the 11,446 tasks I managed to upload to your server within the past 3 days (and this is only a tiny, tiny fraction of what would have been possible if your server didn't crash every other minute), only 1,244 were validated. The rest 'hang in the air' waiting for validation (credit acquisition is accordingly delayed).

I do not know the reason for all of these issues, but if you want people to support this project, you need to address them quickly. Not in a year, please.

I think I do have an explanation for your server problem, though:
Because your WUs are so small, your server can't keep up with the connections made by all the clients working on your tasks.
At the RNA World distributed computing project, we solved exactly the same problem by simply bundling the tiny WUs into larger WU packets. This massively reduced the number of connection requests and also helped deliver more work to the clients.

PLEASE take it seriously when you get free advice from people running their own DC projects and servers. :)
If you like, contact Yoyo from our admin & project team and ask for code details on how we bundle the tasks.
And remember: I suggested this long ago, too.
What you are currently doing amounts to a self-induced DDoS attack on your own server(s).

Michael.
President of Rechenkraft.net e.V. - This planet's first and largest distributed computing organization.

bluestang

Joined: 13 Oct 16
Posts: 112
Credit: 1,174,293,644
RAC: 0
Message 65538 - Posted: 27 Oct 2016, 14:01:48 UTC - in response to Message 65537.  

+1 mil

Hopefully the grant they recently got will help solve these issues quickly!

There is no reason for this to be set up this way; you're killing this project's capability of doing a massive amount of work.
Profile Michael H.W. Weber

Joined: 22 Jan 08
Posts: 29
Credit: 242,726,778
RAC: 0
Message 65539 - Posted: 27 Oct 2016, 14:14:22 UTC - in response to Message 65538.  
Last modified: 27 Oct 2016, 14:15:53 UTC

What I am offering is a clear description of the problem plus a solution. For the latter, no grant is required. Three initial measures (a config sketch follows below):

1. Increase the client connection delay so that connections are allowed only every 30 minutes. Not nice, yes, but server stability has priority in the current situation.

2. Increase the runtime to at least 1 hour per WU, better 5 hours. If a simple increase per WU is impossible for whatever reason, bundle 100 or more tasks into a single packet.

3. Keep your database small: generate new WUs only when the server is running out of WUs ready for delivery, and delete WUs once they have been completed - do not keep them around for long.

With these measures we run such projects on a laptop even during worldwide challenges. Without any grant.
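For illustration only, here is a minimal sketch of how measures 1 and the work buffering could map onto a stock BOINC project config.xml. The option names are the standard scheduler settings; the values and the idea of applying them here are my assumptions, not MilkyWay@home's current configuration:

<!-- Hypothetical excerpt from a BOINC project's config.xml; values are illustrative only -->
<config>
    <!-- Measure 1: accept a work request from each host at most every 30 minutes -->
    <min_sendwork_interval>1800</min_sendwork_interval>
    <!-- Let each host hold enough GPU tasks in progress to bridge that interval without idling -->
    <max_wus_in_progress_gpu>200</max_wus_in_progress_gpu>
    <!-- Optional safety cap on per-host daily turnover -->
    <daily_result_quota>2000</daily_result_quota>
</config>

Whether 200 in-progress tasks per GPU is enough obviously depends on how far the per-WU runtime is increased at the same time.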

Just try it. ;)

Michael.
President of Rechenkraft.net e.V. - This planet's first and largest distributed computing organization.

Dataman

Joined: 5 Sep 08
Posts: 28
Credit: 245,585,043
RAC: 0
Message 65540 - Posted: 27 Oct 2016, 14:25:42 UTC

+1 Michael! I would like to return to this project but the frustration level prevents me from doing that. There is an immediate need to improve the efficiency of MW crunching. Cheers.

mmonnin

Joined: 2 Oct 16
Posts: 162
Credit: 1,004,365,967
RAC: 16,481
Message 65541 - Posted: 27 Oct 2016, 15:15:51 UTC

I agree with Michael's 1st post but NOT the 2nd under as-is conditions. 60 WUs last me only several minutes, so 30 minutes between WU transfers is not enough. I won't run this project if that's the case; most of the time my GPU would be idle or on something else.

I've had to implement 2 separate scripts in an attempt to keep a single 280X busy with MW, and then with E@H when there is no MW work. By no work I mean no work available to me, as I usually see WUs available on the server but none in my queue.

WUs need to be 1000 times as big. WUs last about 30 seconds at 4x, and some of that is the crunching at the end.

I just started with MW, but only about 4/5 of my WUs have been validated - over 15k pending and climbing as of last night. The project's backlog of WUs waiting to be validated has climbed from the lowest I've seen of ~800k to over 3.4 million in less than 2 weeks.

Once Universe releases its GPU app, which will be DP, I'm sure it'll get much of my support if it has WU availability.
Henk Haneveld

Joined: 15 Aug 14
Posts: 6
Credit: 1,617,472
RAC: 1,168
Message 65542 - Posted: 27 Oct 2016, 19:42:31 UTC - in response to Message 65541.  

WUs need to be 1000 times as big. WUs last about 30 seconds at 4x, and some of that is the crunching at the end.

Hold on, there are users who don't have a GPU and use older hosts.
A single result takes about 30 minutes on my system.

If results become bigger, there need to be separate results for GPU and non-GPU hosts.
MindCrime

Joined: 5 Mar 14
Posts: 24
Credit: 500,964,006
RAC: 0
Message 65543 - Posted: 27 Oct 2016, 21:59:36 UTC
Last modified: 27 Oct 2016, 22:08:37 UTC

About a year ago I ramped up MilkyWay production on a 7970. I set it up to run about 7 concurrent tasks to reduce the percentage of time spent transitioning between the short WUs. At 7 concurrent tasks on one 7970 I think the run times are about 1 min. I used to get between 400-450k a day.
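For anyone wanting to do the same: running several MilkyWay tasks per GPU is usually done with an app_config.xml in the project folder on the client. A minimal sketch, where the app name "milkyway" and the 1/7 GPU share are assumptions you should check against your own client's application list:

<!-- Hypothetical app_config.xml for the BOINC client; adjust the name and shares to your setup -->
<app_config>
    <app>
        <name>milkyway</name>
        <gpu_versions>
            <gpu_usage>0.14</gpu_usage>  <!-- roughly 7 tasks share one GPU -->
            <cpu_usage>0.2</cpu_usage>   <!-- fraction of a CPU core reserved per task -->
        </gpu_versions>
    </app>
</app_config>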

Winter is coming and I'm firing up the heater again. After some trial-and-error testing I found that 8 concurrent tasks produced the most credit per unit of time. The math doesn't come close to 400-450k a day, so I'm assuming the WUs or the credit changed since I last ran the project heavily.

As most of us have noticed, we aren't being granted anywhere near the daily credit we're producing on our end. One 7970 running ~60% of the time on MW@H quickly racks up 10k+ pending.

I have so many tasks pending that the server has trouble loading the "tasks" page. I agree with the aforementioned solution of increasing WU size; the sheer number of these tiny WUs is probably behind the majority of the server-side problems. Since there is such a discrepancy between high-DP and low-DP cards, would doubling the runtime on a Tahiti with its 1/4 DP increase a 1/32 DP card's time 16x? A 750 Ti looks to run 1 WU in about 100 sec, and a 1080 in about 25 sec. If the Tahiti time doubled, I think that would mean 1600 seconds for a 750 Ti and 400 seconds for a 1080.

It seems to me that the nature of desktop graphics cards and their SP/DP ratios might hold sway here, since we alienate anyone with a bad SP/DP ratio if we cater to the Tahiti. BUT the Tahiti (and the 58xx/69xx) cards do the vast majority of the work here. This is a science project, so what gets the most work done: supporting your biggest producers, or supporting more of the smaller producers?

WUs need to be 1000 times as big. WUs last about 30 seconds at 4x, and some of that is the crunching at the end.

Hold on, there are users who don't have a GPU and use older hosts.
A single result takes about 30 minutes on my system.

If results become bigger, there need to be separate results for GPU and non-GPU hosts.


I think he was speaking of GPU WUs. And to be honest, I don't think catering to decade-old hardware, especially CPUs, is the best use of this project's time.
Jesse Viviano

Joined: 4 Feb 11
Posts: 86
Credit: 60,913,150
RAC: 0
Message 65544 - Posted: 28 Oct 2016, 4:44:44 UTC - in response to Message 65542.  
Last modified: 28 Oct 2016, 4:45:56 UTC

MilkyWay@home has some tasks that cannot be done on GPUs: the N-Body Simulation tasks. I think this project should feed those to CPUs first, and only send the other task types (which can be processed on GPUs) to CPUs if the scheduler runs out of N-Body Simulation tasks.
Profile Michael H.W. Weber

Joined: 22 Jan 08
Posts: 29
Credit: 242,726,778
RAC: 0
Message 65546 - Posted: 28 Oct 2016, 9:08:42 UTC

A few clarifications and additional thoughts to address some of the postings above:

(1) I was talking about GPU tasks, ONLY.

(2) The idea of running tasks in parallel is nice, e.g. if you use systems with a single 280X or several of them. In my case, however, I run combinations of 290X and 280X cards in the same machine. That is possible as they use the same driver, and this approach (a) combines different capabilities in the same system, (b) saves me hardware (power supplies) and (c) increases electrical efficiency. In contrast to the 280X, however, the 290X does not allow processing multiple MW WUs in parallel. Well, to be precise, you can of course make it do this, but then most of those WUs end up not getting validated - so you simply waste your electricity. In short: running tasks in parallel is not a generally applicable solution.
Moreover, it does not reduce the server load at all; on the contrary, it rather worsens it because, as stated above, running tasks in parallel is more time efficient than running them individually. Hence, in the same time frame, more work is requested from the server.

(3) Using scripts in an attempt to counteract the obvious issues of the MW server configuration can't be the right choice: a DC project has to expect that its volunteers are either unable to create such workarounds or simply don't want to spend their time on them. In short: participation has to be kept simple. The solution has to be implemented on the project's server side, not at the user's end.

In fact, there is no need to discuss this much further.
I have given clear suggestions as to which measures will help relieve the server.
Bundling WUs is mandatory when increasing the server connection delay, to ensure that machines do not sit idle during these longer intervals.
Try these suggestions and then we will see.

Michael.
President of Rechenkraft.net e.V. - This planet's first and largest distributed computing organization.

mmonnin

Joined: 2 Oct 16
Posts: 162
Credit: 1,004,365,967
RAC: 16,481
Message 65547 - Posted: 28 Oct 2016, 14:25:13 UTC - in response to Message 65542.  

WUs need to be 1000 times as big. WUs last about 30 seconds at 4x, and some of that is the crunching at the end.

Hold on, there are users who don't have a GPU and use older hosts.
A single result takes about 30 minutes on my system.

If results become bigger, there need to be separate results for GPU and non-GPU hosts.


There are some other projects where users can specify the approximate runtime: 2 hours, 4 hours, etc. GPUGrid has short and long tasks. The number of current WUs bundled together could then vary to suit the processor. If the same application remains available for both GPUs and CPUs after bundling, this could be a way to suit different types of processors.
Finrond

Joined: 8 Nov 10
Posts: 5
Credit: 624,965,185
RAC: 0
Message 65548 - Posted: 28 Oct 2016, 14:48:41 UTC - in response to Message 65543.  


I have so many tasks pending that the server has trouble loading the "tasks" page. I agree with the aforementioned solution of increasing WU size; the sheer number of these tiny WUs is probably behind the majority of the server-side problems. Since there is such a discrepancy between high-DP and low-DP cards, would doubling the runtime on a Tahiti with its 1/4 DP increase a 1/32 DP card's time 16x? A 750 Ti looks to run 1 WU in about 100 sec, and a 1080 in about 25 sec. If the Tahiti time doubled, I think that would mean 1600 seconds for a 750 Ti and 400 seconds for a 1080.


If you double the length of the WU, the runtime doubles for all cards.

I.e., right now all cards are running a 100-yard dash; if you made it a 200-yard dash, all the cards' times would go up by a factor of 2.
Captiosus

Joined: 9 Apr 14
Posts: 35
Credit: 9,708,616
RAC: 0
Message 65549 - Posted: 28 Oct 2016, 16:00:34 UTC - in response to Message 65548.  


I have so many tasks pending that the server has trouble loading the "tasks" page. I agree with the aforementioned solution of increasing WU size; the sheer number of these tiny WUs is probably behind the majority of the server-side problems. Since there is such a discrepancy between high-DP and low-DP cards, would doubling the runtime on a Tahiti with its 1/4 DP increase a 1/32 DP card's time 16x? A 750 Ti looks to run 1 WU in about 100 sec, and a 1080 in about 25 sec. If the Tahiti time doubled, I think that would mean 1600 seconds for a 750 Ti and 400 seconds for a 1080.

If you double the length of the WU, the runtime doubles for all cards.

I.e., right now all cards are running a 100-yard dash; if you made it a 200-yard dash, all the cards' times would go up by a factor of 2.


Well, I'm wondering if it would be possible to hand out longer and shorter workunits to different video cards according to their processing capability (in GFLOPS) and manufacturer, so that cards falling within certain performance brackets get adequately sized workunits.

I mean, I have a mismatched pair of video cards in my main rig that struggle to get above 150 GFLOPS double precision, and they each chew through a pair of units in about 3-4 minutes tops. Why the hell am I getting the same-sized units as the ones sent out to machines equipped with an R9 280X or other TFLOPS-class DP cards, cards that as I understand it have to run 8 or more concurrent workunits to maximize utilization because of how fast they burn through them?
Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 65550 - Posted: 28 Oct 2016, 17:01:17 UTC

Hey Everyone,

Sorry, I have been out of town for the last week, although I have still been working on the server. I would like to clarify a few things as to why making changes on MilkyWay@home takes so long. Two years ago, we lost NSF funding for the project. This led to a shortage of development staff. As for the staff we have: Sidd is working full time on getting N-body to produce something that can be published as a scientific result (there is a huge amount of catch-up to be played here, as much of the code was messy, buggy, and undocumented before he got here). Clayton is working on getting N-body running on GPUs while also taking classes for his master's degree. I am working on getting the server stable, but my time is split between that, getting science out of Separation, preparing the next project I will be working on, and taking a class. Travis is splitting his time between teaching and running several other projects.

Since we now have NSF funding again (YAY!) and I am no longer taking a full class load, you can expect to see continued work on improving the server for the rest of my tenure here (expected graduation in Spring 2018). After that, I am hopeful there will be new students to take over who are dedicated to keeping this running smoothly.

I will take into consideration changing how we bundle work units (hopefully packing 4-10 workunits together would be nice), but at the moment that is technically challenging, since the framework we have set up for workunits and their generation does not allow for it (imagine a lot of coding and bugs for at least 6 months). Since I know this is going to be a challenging process, we are discussing other potential ways of fixing our issues, for example hardware improvements to the server to increase the memory available for the database. This would allow the full database to be stored in RAM and only written to disk occasionally, thus improving the response time of the database. If these easier and faster solutions don't fully fix the problem, the next step will potentially be rewriting the work unit generation algorithms to allow for work unit bundling.

Jake
Rymorea

Joined: 6 Oct 14
Posts: 46
Credit: 20,017,425
RAC: 0
Message 65558 - Posted: 28 Oct 2016, 22:51:47 UTC

Thanks for your response, and I'm happy to hear about the NSF funding again. Looking forward to a more responsive project from now on :) And also looking forward to N-body running on GPUs so we can see the scientific results.
Rich

Joined: 14 Nov 14
Posts: 9
Credit: 214,644,261
RAC: 0
Message 65559 - Posted: 28 Oct 2016, 23:26:44 UTC

Thank you Jake for the update.

I hope people will understand the problems your team faces and the challenges that have to be overcome. The funding will help. I appreciate what all of you are doing with the limited resources that you have. Keep up the good work, and thank you as well for keeping us informed of the progress and goals. Don't forget to take some time for yourself; a person only lives once.

Rich
paris

Joined: 26 Apr 08
Posts: 87
Credit: 64,801,496
RAC: 0
Message 65560 - Posted: 28 Oct 2016, 23:39:06 UTC

Thank you for the amazing job of keeping us informed and also for all of the other work you do. It is much appreciated.


Plus SETI Classic = 21,082 WUs
Profile Michael H.W. Weber

Joined: 22 Jan 08
Posts: 29
Credit: 242,726,778
RAC: 0
Message 65561 - Posted: 29 Oct 2016, 22:21:51 UTC - in response to Message 65550.  
Last modified: 29 Oct 2016, 22:33:08 UTC

I will take into consideration changing how we bundle work units (hopefully packing 4-10 workunits together would be nice), but at the moment that is technically challenging, since the framework we have set up for workunits and their generation does not allow for it (imagine a lot of coding and bugs for at least 6 months).

Packing 4-10 tasks into one bundle won't solve the problem when runtimes of 9 seconds per WU are taken into consideration.

This is what you need to do:

1. Modify the WU generator to define criteria by which WUs are bundled.
2. Modify the application such that the bundled WUs are processed individually, one after another.
3. Modify the validator to validate the WUs.

To keep things simple, use the wrapper approach:

For 1: Pack as many tasks into one .zip bundle as will keep the estimated run time below 5 hrs, up to a maximum of 200 tasks.

For 2: Using the wrapper, you only need to list the required program calls in the job.xml; unpack the .zip bundle on the client machine at the start and re-pack everything into a .zip once computation is complete (see the sketch after this list).

For 3: Adapt the validator to read the .zip returned by the client and compare the individual result files with a second result (or call a different validation algorithm).
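A minimal sketch of what such a job.xml could look like with the stock BOINC wrapper, assuming a wrapper build that supports zipped input/output; the application name and file names are purely illustrative:

<!-- Hypothetical job.xml for the BOINC wrapper; one <task> entry per bundled sub-WU -->
<job_desc>
    <unzip_input>
        <zipfilename>wu_bundle.zip</zipfilename>
    </unzip_input>
    <task>
        <application>milkyway_separation</application>
        <command_line>-i input_000 -o output_000</command_line>
    </task>
    <task>
        <application>milkyway_separation</application>
        <command_line>-i input_001 -o output_001</command_line>
    </task>
    <!-- ... further tasks ... -->
    <zip_output>
        <zipfilename>result_bundle.zip</zipfilename>
        <filename>output_000</filename>
        <filename>output_001</filename>
    </zip_output>
</job_desc>

The wrapper then runs the science binary once per listed task and ships a single result archive back, so the scheduler and upload handler see one connection instead of hundreds.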

That's it.
No further year of fiddling around required.

:)

Michael.

P.S.: As I offered earlier, contact me or my teammates for code samples from RNA World. But don't wait too long.
President of Rechenkraft.net e.V. - This planet's first and largest distributed computing organization.

