Welcome to MilkyWay@home

Out Of Work?

Message boards : Number crunching : Out Of Work?

Vortac
Joined: 22 Apr 09
Posts: 95
Credit: 4,808,181,963
RAC: 0
Message 65414 - Posted: 10 Oct 2016, 7:31:25 UTC

Thanks for keeping us up to date, Jake. Yes, database errors are still occurring, sometimes followed by "no tasks available" or by getting only a few tasks. But generally, it's still possible to get plenty of workunits after a (manual) update or two. However, a manual update isn't an option for unattended machines, and after a database error they have to wait 60 minutes before the next communication is attempted. Unfortunately, 60 minutes is quite a long time for Milkyway; even slower GPUs can empty their queue much sooner than that. Perhaps it's possible to reduce the database error timeout to 15 minutes? Or is it hard-coded in BOINC?
ID: 65414 · Rating: 0
SuperSluether
Joined: 2 Jul 14
Posts: 15
Credit: 20,991,384
RAC: 6
Message 65418 - Posted: 10 Oct 2016, 18:45:01 UTC - in response to Message 65414.  

What is the underlying cause for the recent problems?

e.g. Was there a recent change to the server that caused this, or is MilkyWay being overloaded with new crunchers now that Einstein and Poem shut down their GPU projects?
ID: 65418 · Rating: 0
Camille
Joined: 14 Oct 12
Posts: 8
Credit: 5,841,106
RAC: 0
Message 65420 - Posted: 10 Oct 2016, 21:56:55 UTC - in response to Message 65414.  

Vortac wrote:
[...]However, a manual update isn't an option for unattended machines, and after a database error they have to wait 60 minutes before the next communication is attempted. Unfortunately, 60 minutes is quite a long time for Milkyway; even slower GPUs can empty their queue much sooner than that. Perhaps it's possible to reduce the database error timeout to 15 minutes? Or is it hard-coded in BOINC?


Yeah, that idea seems quite great indeed. In 60 minutes even my CPU can finish its task, and with the recent database errors my computer ended up with no task to do and 30 minutes left on the counter.

Anyway, now tasks are flowing again and it's very good because I was getting cold and that computer was bored ;D
ID: 65420 · Rating: 0
Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 65423 - Posted: 11 Oct 2016, 13:59:19 UTC

Hey Everyone,

The underlying cause of the problems is the decrease in work unit crunch times due to some optimizations (8 times speed increase). This means GPU crunchers need a significant increase in work units to keep busy, which the database was having trouble keeping up with.

I will look into decreasing the timeout.

Thank you all for your patience and help.

Jake
ID: 65423 · Rating: 0
Camille
Joined: 14 Oct 12
Posts: 8
Credit: 5,841,106
RAC: 0
Message 65424 - Posted: 11 Oct 2016, 20:18:11 UTC

I'm not really confident about how work units work, but maybe a better way would be to increase the amount of work in each work unit so that our GPUs can crunch for longer? I mean, it takes less than 10 seconds to complete a task on my GPU, and we're talking about a very underclocked R9 280X, so nothing *extremely* powerful there ^^
ID: 65424 · Rating: 0
Tom*
Joined: 4 Oct 11
Posts: 38
Credit: 309,729,457
RAC: 0
Message 65425 - Posted: 11 Oct 2016, 22:46:47 UTC

Camille wrote:
[...]R9 280X, so nothing *extremely* powerful there ^^


As far as AMD GPUs are concerned, the Tahiti-based GPUs are still the most powerful double-precision GPUs available. Only the NVIDIA Titan is more powerful.
ID: 65425 · Rating: 0
Vortac
Joined: 22 Apr 09
Posts: 95
Credit: 4,808,181,963
RAC: 0
Message 65426 - Posted: 12 Oct 2016, 7:52:41 UTC - in response to Message 65424.  

Camille wrote:
I'm not really confident about how work units work, but maybe a better way would be to increase the amount of work in each work unit so that our GPUs can crunch for longer? I mean, it takes less than 10 seconds to complete a task on my GPU, and we're talking about a very underclocked R9 280X, so nothing *extremely* powerful there ^^

Yes, the R9 280X is in fact a rebadged HD7970: still very powerful for double-precision computing and very reliable for BOINC in general. You should run at least 3 or 4 workunits simultaneously on it, otherwise your GPU idles a lot while switching between workunits every 10 seconds. Of course, longer workunits would be very welcome, but apparently that's not an easy thing to do (from a technical standpoint). Anyway, there are a lot of workunits available now, and with 4 of them running in parallel it's a smooth ride all along, even with short workunits.
ID: 65426 · Rating: 0
Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 65427 - Posted: 12 Oct 2016, 15:06:54 UTC

Hey Everyone,

I tried setting the delay after a database error to 15 minutes. Can anyone confirm this is working as intended?

Jake
ID: 65427 · Rating: 0
Camille
Joined: 14 Oct 12
Posts: 8
Credit: 5,841,106
RAC: 0
Message 65429 - Posted: 12 Oct 2016, 19:13:17 UTC - in response to Message 65426.  
Last modified: 12 Oct 2016, 19:54:42 UTC

Vortac wrote:
[...]and with 4 of them running in parallel, it's a smooth ride all along, even with short workunits.


How do you set the maximum number of work units running in parallel?
Found the way to do this ^^
Okay, it worked for a second and it's not working anymore...
I created the file app_config.xml and changed gpu_usage from .5 to .2, and when I reloaded the config file it wasn't working anymore... Even restarting the BOINC Manager and closing tasks didn't work.

As for the database error, if I see one I'll tell you what delay it reports.
ID: 65429 · Rating: 0
Vortac
Joined: 22 Apr 09
Posts: 95
Credit: 4,808,181,963
RAC: 0
Message 65430 - Posted: 12 Oct 2016, 19:47:44 UTC - in response to Message 65429.  

Camille wrote:
How do you set the maximum number of work units running in parallel?

To run multiple workunits per GPU, you need to create app_config.xml in Milkyway's project folder (the default path in Win7 is C:\ProgramData\BOINC\projects\milkyway.cs.rpi.edu_milkyway). BOINC will automatically detect app_config.xml and read it at startup.
For 4 workunits per GPU, I have it like this:

<app_config>
  <app>
    <name>milkyway</name>
    <gpu_versions>
      <gpu_usage>0.25</gpu_usage>
      <cpu_usage>0.25</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
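
For anyone who would rather generate the file than type it by hand, here is a minimal Python sketch. It is only an illustration: the helper names `build_app_config` and `is_well_formed` are made up for this example and are not part of BOINC. It assumes the same layout as the config above, sets gpu_usage to 1/N for N tasks per GPU, and checks that the result parses as XML before you copy it into the project folder.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, tasks_per_gpu: int, cpu_per_task: float) -> str:
    """Return app_config.xml text that lets `tasks_per_gpu` tasks share one GPU.

    Hypothetical helper for illustration; it mirrors the hand-written file above.
    """
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    # Each task claims 1/N of a GPU, so N tasks fit on one device.
    ET.SubElement(versions, "gpu_usage").text = f"{1 / tasks_per_gpu:.4g}"
    ET.SubElement(versions, "cpu_usage").text = f"{cpu_per_task:.4g}"
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False if a tag is broken."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    text = build_app_config("milkyway", tasks_per_gpu=4, cpu_per_task=0.25)
    print(text)
    print("well-formed:", is_well_formed(text))
```

Running a hand-edited file through a check like `is_well_formed` is a quick way to catch a mistyped closing tag before BOINC reads the file.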
ID: 65430 · Rating: 0
Camille
Joined: 14 Oct 12
Posts: 8
Credit: 5,841,106
RAC: 0
Message 65431 - Posted: 12 Oct 2016, 19:57:42 UTC

My bad. I put a : instead of a / in </app_config>
ID: 65431 · Rating: 0
Rymorea
Joined: 6 Oct 14
Posts: 46
Credit: 20,017,425
RAC: 0
Message 65433 - Posted: 12 Oct 2016, 22:00:26 UTC - in response to Message 65427.  

Jake Weiss wrote:
Hey Everyone,

I tried setting the delay after a database error to 15 minutes. Can anyone confirm this is working as intended?

Jake


I will try to monitor the new setting. I hope it works.
ID: 65433 · Rating: 0
Vortac
Joined: 22 Apr 09
Posts: 95
Credit: 4,808,181,963
RAC: 0
Message 65434 - Posted: 13 Oct 2016, 8:14:27 UTC

Got the database error message at 08:02 UTC. Timeout is 15 minutes now. At 08:08 I did a successful manual update (without database error), but no new tasks were available.
ID: 65434 · Rating: 0
Rymorea
Joined: 6 Oct 14
Posts: 46
Credit: 20,017,425
RAC: 0
Message 65435 - Posted: 13 Oct 2016, 10:19:58 UTC

Last night BOINC switched to E@H because it could not reach the server, and now it can't report the waiting WUs.

Milkyway@Home | Server can't open database

ID: 65435 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3319
Credit: 520,337,380
RAC: 21,644
Message 65436 - Posted: 13 Oct 2016, 11:14:01 UTC - in response to Message 65435.  

Rymorea wrote:
Last night BOINC switched to E@H because it could not reach the server, and now it can't report the waiting WUs.

Milkyway@Home | Server can't open database


If you WANT to crunch MW but will crunch anything when you can't, set the resource share of the backup projects to zero. That will only get you enough units to crunch RIGHT NOW but will not fill your cache; when those units are done, BOINC will check here at MW for work, and if there isn't any you will get work from a backup project again. I have a primary and a backup project for both my CPUs and GPUs, so my PCs are ALWAYS crunching something. And yes, I do have several backup projects for my CPUs, as sometimes my backup projects don't have units, so it rolls over to the next one.
ID: 65436 · Rating: 0
Vortac
Joined: 22 Apr 09
Posts: 95
Credit: 4,808,181,963
RAC: 0
Message 65437 - Posted: 13 Oct 2016, 14:19:22 UTC

Work was unavailable from 08:00 UTC onwards, with a lot of database errors. Even when the server responded properly, there was no work available, so the reduced database timeout (15 minutes) didn't help. Now everything is normal and there's plenty of work again.

I keep Milkyway at 100% resource share and Collatz at 0%. When Milkyway runs out of work, BOINC fetches Collatz workunits automatically. But Collatz uses only FP32 and there are plenty of newer cards today with much better FP32 performance than my 7970s, so Milkyway is always my first choice.
ID: 65437 · Rating: 0
Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 65439 - Posted: 13 Oct 2016, 15:57:07 UTC

Okay, I think I know what the issue is. We do nightly backups of the database and server. This locks the database for a little while. I might look into scaling the database backups back to once every other day or something.

Jake
ID: 65439 · Rating: 0
bluestang
Joined: 13 Oct 16
Posts: 112
Credit: 1,174,293,644
RAC: 0
Message 65475 - Posted: 18 Oct 2016, 17:59:54 UTC - in response to Message 65439.  

I still think the timeout set to 15 min is too long. That's a lot of idle time for fast cards that want to do work.

Please raise the number of tasks allowed in progress (to something like 3 or 4x the current limit) to help with the DB timeouts and issues. It may even help reduce server load.

Thanks for your hard work!
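
To put rough numbers on the idle-time argument, here is a back-of-the-envelope Python sketch. It is an illustration only: the function name `tasks_to_bridge` is hypothetical, and the 10-second runtime is the figure reported earlier in this thread for a single underclocked R9 280X, so other cards will differ. It estimates how many tasks must already be in the local queue for a GPU to stay busy through one full backoff period.

```python
import math


def tasks_to_bridge(backoff_minutes: float, seconds_per_task: float) -> int:
    """Estimate the queued tasks needed to keep a GPU busy for one backoff period.

    Simple throughput model (an assumption): the GPU finishes one task
    every `seconds_per_task` seconds and fetches no new work meanwhile.
    """
    if backoff_minutes < 0:
        raise ValueError("backoff_minutes must not be negative")
    if seconds_per_task <= 0:
        raise ValueError("seconds_per_task must be positive")
    return math.ceil(backoff_minutes * 60 / seconds_per_task)


if __name__ == "__main__":
    for minutes in (60, 15):
        needed = tasks_to_bridge(minutes, seconds_per_task=10)
        print(f"{minutes}-minute backoff at 10 s per task: {needed} tasks")
```

Under these assumptions a 60-minute backoff needs about 360 queued tasks and a 15-minute backoff about 90, which is why both the timeout and the in-progress limit matter for fast cards.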
ID: 65475 · Rating: 0
TimeRanger
Joined: 31 Oct 10
Posts: 83
Credit: 38,632,375
RAC: 0
Message 65486 - Posted: 19 Oct 2016, 3:31:10 UTC

I am getting plenty of GPU work... zero for CPU.
ID: 65486 · Rating: 0
wb8ili
Joined: 18 Jul 10
Posts: 76
Credit: 635,998,708
RAC: 0
Message 65535 - Posted: 25 Oct 2016, 14:11:34 UTC
Last modified: 25 Oct 2016, 14:11:56 UTC

Jake -

Is it possible to make us users feel good and get an update on the server(s) performance problems?

All of the "issues" listed in this thread and other threads are still present (cannot open database, SQL errors on the homepage, over 3,000,000 tasks waiting for validation, and others).

Your last update that I found was 18 OCT which implied the problems are related to nightly backups of the database and that the whole system may be overloaded due to too many users having too many fast graphics cards.

Is there any timetable for resolution? More DB tuning? More hardware? etc.?

Thanks.
ID: 65535 · Rating: 0

©2024 Astroinformatics Group