Message boards : Number crunching : Out Of Work?
Author | Message |
---|---|
Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,771,888,435 RAC: 2,619,671 |
Thanks for keeping us up to date, Jake. Yes, database errors are still occurring, sometimes followed by "no tasks available" or by getting only a few tasks. But generally, it's still possible to get plenty of workunits after a (manual) update or two. However, a manual update isn't an option for unattended machines, and after a database error they have to wait 60 minutes before the next communication is attempted. Unfortunately, 60 minutes is quite a long time for Milkyway; even slower GPUs can empty their queue much sooner than that. Perhaps it's possible to reduce the database error timeout to 15 minutes? Or is it hard-coded in BOINC? |
Joined: 2 Jul 14 Posts: 15 Credit: 20,986,979 RAC: 0 |
What is the underlying cause of the recent problems? For example, was there a recent change to the server that caused this, or is MilkyWay being overloaded with new crunchers now that Einstein and Poem shut down their GPU projects? |
Joined: 14 Oct 12 Posts: 8 Credit: 5,841,106 RAC: 0 |
Vortac wrote: "[...] However, manual update isn't an option for unattended machines and after a database error, they have to wait 60 minutes before next communication is attempted. Unfortunately, 60 minutes is quite a long time for Milkyway, even slower GPUs can empty their queue much sooner than that. Perhaps it's possible to reduce database error timeout to 15 minutes? Or it's hard coded in BOINC?" Yeah, that idea seems great indeed. In 60 minutes even my CPU can finish its task, and indeed, with the recent database errors my computer ended up with no task to do and 30 minutes left on the counter. Anyway, tasks are flowing again now, which is very good because I was getting cold and that computer was bored ;D |
Jake Weiss Volunteer moderator Project developer Project tester Project scientist Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 135 |
Hey Everyone, The underlying cause of the problems is the decrease in work unit crunch times due to some optimizations (an 8x speed increase). This means GPU crunchers need significantly more work units to keep busy, which the database was having trouble keeping up with. I will look into decreasing the timeout. Thank you all for your patience and help. Jake |
Joined: 14 Oct 12 Posts: 8 Credit: 5,841,106 RAC: 0 |
I'm not really confident about how work units work, but maybe a better way would be to increase the amount of work in each work unit so that our GPUs can crunch for longer? I mean, it takes less than 10 seconds to complete a task on my GPU, and we're talking about a very underclocked R9 280X, so nothing *extremely* powerful there ^^ |
Tom* Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 7 |
"R9 280X so nothing *extremely* powerful there ^^" As far as AMD GPUs are concerned, the Tahiti-based GPUs are still the most powerful double-precision GPUs available. Only the NVIDIA Titan is more powerful. |
Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,771,888,435 RAC: 2,619,671 |
"I'm not really confident about how work units work, but maybe a better way would be to increase the amount of work in each work unit so that our GPUs can crunch for longer? I mean, it takes less than 10 seconds to complete a task on my GPU, and we're talking about a very underclocked R9 280X, so nothing *extremely* powerful there ^^" Yes, the R9 280X is in fact a rebadged HD7970: still very powerful for double-precision computing and very reliable for BOINC in general. You should run at least 3 or 4 workunits simultaneously on it, otherwise your GPU idles a lot when switching between workunits every 10 seconds. Of course, longer workunits would be very welcome, but apparently that's not an easy thing to do (from a technical standpoint). Anyway, there are a lot of workunits available now, and with 4 of them running in parallel it's a smooth ride all along, even with short workunits. |
Jake Weiss Volunteer moderator Project developer Project tester Project scientist Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 135 |
Hey Everyone, I tried setting the delay after a database error to 15 minutes. Can anyone confirm this is working as intended? Jake |
Joined: 14 Oct 12 Posts: 8 Credit: 5,841,106 RAC: 0 |
Vortac wrote: "[...] and with 4 of them running in parallel, it's a smooth ride all along, even with short workunits." Okay, it worked for a second and now it's not working anymore... I created the file app_config.xml and changed gpu_usage from .5 to .2, and when I reloaded the config file it stopped working. Even restarting the BOINC Manager and closing tasks didn't help. As for the database error, if I see one I'll tell you what timeout it shows. |
Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,771,888,435 RAC: 2,619,671 |
"How do you set the max amount of parallel running work units ?" To run multiple workunits per GPU, you need to create app_config.xml in Milkyway's folder (default path in Win7 is C:\ProgramData\BOINC\projects\milkyway.cs.rpi.edu_milkyway). BOINC Manager will automatically detect app_config.xml and read it upon start. For 4 workunits per GPU, I have it like this: `<app_config> <app> <name>milkyway</name> <gpu_versions> <gpu_usage>0.25</gpu_usage> <cpu_usage>0.25</cpu_usage> </gpu_versions> </app> </app_config>` |
Joined: 14 Oct 12 Posts: 8 Credit: 5,841,106 RAC: 0 |
My bad. I put a : instead of a / in `</app_config>` |
Rymorea Joined: 6 Oct 14 Posts: 46 Credit: 20,005,844 RAC: 1 |
|
Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,771,888,435 RAC: 2,619,671 |
Got the database error message at 08:02 UTC. Timeout is 15 minutes now. At 08:08 I did a successful manual update (without database error), but no new tasks were available. |
Rymorea Joined: 6 Oct 14 Posts: 46 Credit: 20,005,844 RAC: 1 |
|
Joined: 8 May 09 Posts: 3105 Credit: 518,026,956 RAC: 23,929 |
"last night boinc switch to E@H cause not reach server and now cant report waiting wus" If you WANT to crunch MW but will crunch anything when you can't, set the resource share of the backup projects to zero. That will only get you enough units to crunch RIGHT NOW but not fill your cache; then, when those units are done, it will check here at MW for work, and if there isn't any you will get work from a backup project again. I have a primary and a backup project for both my CPUs and GPUs, so my PCs are ALWAYS crunching something. And yes, I do have several backup projects for my CPUs, as sometimes my backup projects don't have units, so it rolls over to the next one. |
Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,771,888,435 RAC: 2,619,671 |
Work was unavailable from 08:00 UTC onwards, with a lot of database errors. Even when the server responded properly, there was no work available, so the reduced database timeout (15 mins) didn't help. Now everything is normal and there's plenty of work again. I keep Milkyway at 100% resource share and Collatz at 0%. When Milkyway runs out of work, BOINC fetches Collatz workunits automatically. But Collatz uses only FP32, and there are plenty of newer cards today with much better FP32 performance than my 7970s, so Milkyway is always my first choice. |
Jake Weiss Volunteer moderator Project developer Project tester Project scientist Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 135 |
Okay, I think I know what the issue is. We do nightly backups of the database and server. This locks the database for a little while. I might look into scaling the database backups back to once every other day or something. Jake |
bluestang Joined: 13 Oct 16 Posts: 108 Credit: 1,167,820,514 RAC: 283,471 |
I still think a timeout of 15 minutes is too long. That's a lot of idle time for fast cards that want to do work. Please raise the number of tasks allowed in progress (to something like 3 or 4x the current limit) to help with the DB timeouts and issues. It may even help reduce server load. Thanks for your hard work! |
Joined: 31 Oct 10 Posts: 83 Credit: 38,632,375 RAC: 0 |
I am getting plenty of GPU work... zero for CPU. |
wb8ili Joined: 18 Jul 10 Posts: 76 Credit: 629,987,312 RAC: 296,113 |
Jake - Is it possible to make us users feel good and get an update on the server(s) performance problems? All of the "issues" listed in this thread and other threads are still present (cannot open database, SQL errors on the homepage, over 3,000,000 tasks waiting for validation, and others). Your last update that I found was 18 OCT which implied the problems are related to nightly backups of the database and that the whole system may be overloaded due to too many users having too many fast graphics cards. Is there any timetable for resolution? More DB tuning? More hardware? etc.? Thanks. |
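
For reference, here is the `app_config.xml` that Vortac posted above for running 4 workunits per GPU, laid out one element per line. The app name `milkyway` and both values are taken from that post; the comments are an interpretation (a `gpu_usage` of 0.25 lets four tasks share one GPU, which matches the post's "4 workunits per GPU"), and other setups may need different values. As the follow-up reply in the thread shows, a single mistyped closing tag was enough to stop the file from working.

```xml
<app_config>
  <app>
    <name>milkyway</name>
    <gpu_versions>
      <!-- Fraction of one GPU used by each task: 0.25 means 4 tasks share a GPU -->
      <gpu_usage>0.25</gpu_usage>
      <!-- Fraction of one CPU core budgeted for each GPU task (value from the post) -->
      <cpu_usage>0.25</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```
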
©2023 Astroinformatics Group