Out Of Work?

Author	Message
Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 65414 - Posted: 10 Oct 2016, 7:31:25 UTC
Thanks for keeping us up to date, Jake. Yes, database errors are still occurring, sometimes followed by "no tasks available" or by getting only a few tasks. But generally, it's still possible to get plenty of workunits after a (manual) update or two. However, a manual update isn't an option for unattended machines, and after a database error they have to wait 60 minutes before the next communication is attempted. Unfortunately, 60 minutes is quite a long time for Milkyway; even slower GPUs can empty their queue much sooner than that. Perhaps it's possible to reduce the database error timeout to 15 minutes? Or is it hard-coded in BOINC?

SuperSluether Joined: 2 Jul 14 Posts: 15 Credit: 20,991,384 RAC: 0	Message 65418 - Posted: 10 Oct 2016, 18:45:01 UTC - in response to Message 65414.
What is the underlying cause of the recent problems? For example, was there a recent change to the server that caused this, or is MilkyWay being overloaded with new crunchers now that Einstein and Poem have shut down their GPU projects?

Camille Joined: 14 Oct 12 Posts: 8 Credit: 5,841,106 RAC: 0	Message 65420 - Posted: 10 Oct 2016, 21:56:55 UTC - in response to Message 65414.
Vortac wrote: "[...] However, a manual update isn't an option for unattended machines, and after a database error they have to wait 60 minutes before the next communication is attempted. Unfortunately, 60 minutes is quite a long time for Milkyway; even slower GPUs can empty their queue much sooner than that. Perhaps it's possible to reduce the database error timeout to 15 minutes? Or is it hard-coded in BOINC?"
Yeah, that idea seems great. In 60 minutes even my CPU can finish its tasks, and indeed, with the recent database errors my computer ended up with no tasks to do and 30 minutes left on the counter. Anyway, tasks are flowing again now, which is very good because I was getting cold and that computer was bored ;D

Jake Weiss Volunteer moderator Project developer Project tester Project scientist Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0	Message 65423 - Posted: 11 Oct 2016, 13:59:19 UTC
Hey Everyone,
The underlying cause of the problems is the decrease in work unit crunch times due to some optimizations (an 8x speed increase). This means GPU crunchers need significantly more work units to keep busy, and the database was having trouble keeping up. I will look into decreasing the timeout. Thank you all for your patience and help.
Jake

Camille Joined: 14 Oct 12 Posts: 8 Credit: 5,841,106 RAC: 0	Message 65424 - Posted: 11 Oct 2016, 20:18:11 UTC
I'm not really confident about how work units work, but maybe a better way would be to increase the amount of work in each work unit so that our GPUs can crunch for longer? I mean, it takes less than 10 seconds to complete a task on my GPU, and we're talking about a very underclocked R9 280X, so nothing extremely powerful there ^^
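To put the numbers quoted so far side by side: with a 60-minute wait after a database error and GPU tasks finishing in about 10 seconds, a host would need a few hundred queued tasks to stay busy until the next contact. A rough sketch of that arithmetic in Python follows. The function name is made up for illustration, and it assumes one task at a time running at a constant rate, which is a simplification.

```python
import math

def tasks_to_cover_backoff(backoff_minutes: float, seconds_per_task: float) -> int:
    """Queued tasks needed to keep one GPU busy for the whole back-off period.

    Assumes tasks run one at a time and each takes the same time.
    """
    return math.ceil(backoff_minutes * 60 / seconds_per_task)

# Figures taken from the posts above: 10 s per task, 60 min vs. 15 min back-off.
print(tasks_to_cover_backoff(60, 10))  # 360
print(tasks_to_cover_backoff(15, 10))  # 90
```

Under these assumptions, a shorter back-off cuts the queue a host needs to ride out an error by the same factor.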

Tom* Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0	Message 65425 - Posted: 11 Oct 2016, 22:46:47 UTC
"R9 280X so nothing extremely powerful there ^^"
As far as AMD GPUs are concerned, the Tahiti-based GPUs are still the most powerful double-precision GPUs available. Only the NVIDIA Titan is more powerful.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 65426 - Posted: 12 Oct 2016, 7:52:41 UTC - in response to Message 65424.
"I'm not really confident about how work units work, but maybe a better way would be to increase the amount of work in each work unit so that our GPUs can crunch for longer? I mean, it takes less than 10 seconds to complete a task on my GPU, and we're talking about a very underclocked R9 280X, so nothing extremely powerful there ^^"
Yes, the R9 280X is in fact a rebadged HD7970: still very powerful for double-precision computing and very reliable for BOINC in general. You should run at least 3 or 4 workunits simultaneously on it, otherwise your GPU idles a lot when switching between workunits every 10 seconds. Of course, longer workunits would be very welcome, but apparently that's not an easy thing to do (from a technical standpoint). Anyway, there are a lot of workunits available now, and with 4 of them running in parallel it's a smooth ride all along, even with short workunits.

Jake Weiss Volunteer moderator Project developer Project tester Project scientist Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0	Message 65427 - Posted: 12 Oct 2016, 15:06:54 UTC
Hey Everyone,
I tried setting the delay after a database error to 15 minutes. Can anyone confirm this is working as intended?
Jake

Camille Joined: 14 Oct 12 Posts: 8 Credit: 5,841,106 RAC: 0	Message 65429 - Posted: 12 Oct 2016, 19:13:17 UTC - in response to Message 65426. Last modified: 12 Oct 2016, 19:54:42 UTC
Vortac wrote: "[...] and with 4 of them running in parallel, it's a smooth ride all along, even with short workunits."
~~How do you set the max amount of parallel running work units?~~
~~Found the way to do this ^^~~
Okay, it worked for a second and now it's not working anymore... I created the file app_config.xml and changed gpu_usage from .5 to .2, and when I reloaded the config file it stopped working. Even restarting the BOINC Manager and closing tasks didn't help.
As for the database error, if I see one I'll tell you what delay it shows.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 65430 - Posted: 12 Oct 2016, 19:47:44 UTC - in response to Message 65429.
"How do you set the max amount of parallel running work units ?"
To run multiple workunits per GPU, you need to create app_config.xml in Milkyway's folder (default path in Win7 is C:\ProgramData\BOINC\projects\milkyway.cs.rpi.edu_milkyway). BOINC Manager will automatically detect app_config.xml and read it upon start. For 4 workunits per GPU, I have it like this:
<app_config>
  <app>
    <name>milkyway</name>
    <gpu_versions>
      <gpu_usage>0.25</gpu_usage>
      <cpu_usage>0.25</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
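For anyone who would rather generate this file than type it by hand, here is a minimal Python sketch that builds the same structure with the standard library's xml.etree.ElementTree. The function name build_app_config is hypothetical, and setting gpu_usage to 1 divided by the number of simultaneous tasks is an assumption based on the 0.25 value for 4 workunits in the post above.

```python
import xml.etree.ElementTree as ET

def build_app_config(app_name: str, tasks_per_gpu: int, cpu_usage: float) -> str:
    """Return app_config.xml text for running `tasks_per_gpu` tasks on one GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu = ET.SubElement(app, "gpu_versions")
    # Each task claims an equal fraction of the GPU, e.g. 4 tasks -> 0.25.
    ET.SubElement(gpu, "gpu_usage").text = str(round(1 / tasks_per_gpu, 4))
    ET.SubElement(gpu, "cpu_usage").text = str(cpu_usage)
    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")

print(build_app_config("milkyway", 4, 0.25))
```

The returned text would then be saved as app_config.xml in the project folder mentioned in the post. Because the library writes the closing tags itself, this route also avoids hand-typed tag typos.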

Camille Joined: 14 Oct 12 Posts: 8 Credit: 5,841,106 RAC: 0	Message 65431 - Posted: 12 Oct 2016, 19:57:42 UTC
My bad. I put a : instead of a / in </app_config>
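A typo like this one can be caught before restarting anything by checking that the file parses as XML. The sketch below uses Python's standard xml.etree.ElementTree. The helper name is_well_formed is made up for illustration, and the `bad` string is only a reconstruction of the typo described above, not the actual file.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<app_config><app><name>milkyway</name></app></app_config>"
# Assumed reconstruction of the typo: ':' typed instead of '/' in the closing tag.
bad = "<app_config><app><name>milkyway</name></app><:app_config>"

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

This only checks XML syntax. It does not tell you whether BOINC accepts the element names or values.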

Rymorea Joined: 6 Oct 14 Posts: 46 Credit: 20,017,425 RAC: 0	Message 65433 - Posted: 12 Oct 2016, 22:00:26 UTC - in response to Message 65427.
"Hey Everyone, I tried setting the delay after a database error to 15 minutes. Can anyone confirm this is working as intended? Jake"
I will try to monitor the new setting. I hope it works.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 65434 - Posted: 13 Oct 2016, 8:14:27 UTC
Got the database error message at 08:02 UTC. Timeout is 15 minutes now. At 08:08 I did a successful manual update (without database error), but no new tasks were available.

Rymorea Joined: 6 Oct 14 Posts: 46 Credit: 20,017,425 RAC: 0	Message 65435 - Posted: 13 Oct 2016, 10:19:58 UTC
Last night BOINC switched to E@H because it could not reach the server, and now it can't report the waiting WUs:
Milkyway@Home | Server can't open database

mikey Joined: 8 May 09 Posts: 3334 Credit: 524,010,781 RAC: 962	Message 65436 - Posted: 13 Oct 2016, 11:14:01 UTC - in response to Message 65435.
"last night boinc switch to E@H cause not reach server and now cant report waiting wus Milkyway@Home | Server can't open database"
If you WANT to crunch MW but will crunch anything when you can't, set the resource share of the backup projects to zero. That will only get you enough units to crunch RIGHT NOW without filling your cache. When those units are done, BOINC will check here at MW for work, and if there isn't any, you will get work from a backup project again. I have a primary and a backup project for both my CPUs and GPUs, so my PCs are ALWAYS crunching something. And yes, I do have several backup projects for my CPUs, as sometimes my backup projects don't have units, so it rolls over to the next one.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 65437 - Posted: 13 Oct 2016, 14:19:22 UTC
Work was unavailable from 08:00 UTC onwards, with a lot of database errors. Even when the server responded properly, there was no work available, so the reduced database timeout (15 minutes) didn't help. Now everything is normal and there's plenty of work again.
I keep Milkyway at 100% resource share and Collatz at 0%. When Milkyway runs out of work, BOINC fetches Collatz workunits automatically. But Collatz uses only FP32, and there are plenty of newer cards today with much better FP32 performance than my 7970s, so Milkyway is always my first choice.

Jake Weiss Volunteer moderator Project developer Project tester Project scientist Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0	Message 65439 - Posted: 13 Oct 2016, 15:57:07 UTC
Okay, I think I know what the issue is. We do nightly backups of the database and server. This locks the database for a little while. I might look into scaling the database backups back to once every other day or something.
Jake

bluestang Joined: 13 Oct 16 Posts: 112 Credit: 1,174,293,644 RAC: 0	Message 65475 - Posted: 18 Oct 2016, 17:59:54 UTC - in response to Message 65439.
I still think a timeout of 15 minutes is too long. That's a lot of idle time for fast cards that want to do work. Please raise the number of tasks allowed in progress (to something like 3x or 4x the current limit) to help with the DB timeouts and related issues. It may even help reduce server load.
Thanks for your hard work!

TimeRanger Joined: 31 Oct 10 Posts: 83 Credit: 38,632,375 RAC: 0	Message 65486 - Posted: 19 Oct 2016, 3:31:10 UTC
I am getting plenty of GPU work ... zero for CPU.

wb8ili Joined: 18 Jul 10 Posts: 76 Credit: 635,998,708 RAC: 0	Message 65535 - Posted: 25 Oct 2016, 14:11:34 UTC Last modified: 25 Oct 2016, 14:11:56 UTC
Jake - Is it possible to make us users feel good and get an update on the server(s) performance problems?
All of the "issues" listed in this thread and other threads are still present (cannot open database, SQL errors on the homepage, over 3,000,000 tasks waiting for validation, and others).
Your last update that I found was 18 OCT, which implied the problems are related to nightly backups of the database and that the whole system may be overloaded due to too many users having too many fast graphics cards.
Is there any timetable for resolution? More DB tuning? More hardware? etc.?
Thanks.