Message boards : Number crunching : Milkyway not running...
valterc

Joined: 28 Aug 09
Posts: 22
Credit: 673,135,926
RAC: 165,477
Message 32601 - Posted: 21 Oct 2009, 9:48:19 UTC
Last modified: 21 Oct 2009, 9:50:46 UTC

Hi all,

In short, this is my problem:

I would like to run MilkyWay most of the time and use Collatz as a backup project. I get occasional network drops and the connection doesn't come back automatically, so I really need a backup project.

My dual-core runs Collatz, SETI and PrimeGrid with resource share 100 and MW with resource share 700.

However, although the resource share is respected by the CPU-only projects (SETI & PrimeGrid), it seems to be ignored by the ATI GPU-only projects. Collatz is running almost all the time...

Any hints?

BTW, I use client v6.10.13, all but PrimeGrid with app_info. No problems at all crunching, i.e. no errors, just the resource share problem. BOINC connects every 0.1 days and keeps an additional work buffer of 4 days.
Vid Vidmar*

Joined: 29 Aug 07
Posts: 81
Credit: 60,360,858
RAC: 0
Message 32603 - Posted: 21 Oct 2009, 11:43:33 UTC - in response to Message 32601.  

> Hi all,
>
> In short, this is my problem:
>
> I would like to run MilkyWay most of the time and use Collatz as a backup project. I get occasional network drops and the connection doesn't come back automatically, so I really need a backup project.
>
> My dual-core runs Collatz, SETI and PrimeGrid with resource share 100 and MW with resource share 700.
>
> However, although the resource share is respected by the CPU-only projects (SETI & PrimeGrid), it seems to be ignored by the ATI GPU-only projects. Collatz is running almost all the time...
>
> Any hints?
>
> BTW, I use client v6.10.13, all but PrimeGrid with app_info. No problems at all crunching, i.e. no errors, just the resource share problem. BOINC connects every 0.1 days and keeps an additional work buffer of 4 days.


It might be because MW is low/out of work?
On the other hand, so far BOINC only supports resource share per project. I proposed making it per resource (CPU, GPU, ...) and was calmly ignored. I also reported a bug in that feature which, for a while, let me achieve per-resource shares (using 2 clients in parallel - I was even able to process MW on CPU and GPU at the same time), but no devs stirred. So, if you manage to make any of the devs even flinch, all kudos to you.
BR,
Paul D. Buck

Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 32607 - Posted: 21 Oct 2009, 15:38:55 UTC

DA decided that GPU work should be done in strict FIFO order ... that means that if you want to run MW mostly you need to trim your queue to try to avoid large chunks of Collatz being DL. Of course, the issue here is that you have network drops and lowering the queue means you will run out of work ...

In that this would not solve your problem ... you need to complain on the alpha mailing list ... I have, but so far to silence ... What you really want is to have Collatz on hand and only run it if you cannot get MW work ... or you run dry because of a network outage ...

Theory says that Resource Share will be respected over time ... but I cannot prove that ... in that the debt calculations seem to have been bent for some time my suspicion is that this is also not true ...
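In principle the debt bookkeeping is simple - here is a toy sketch of the idea only (this is NOT BOINC's actual code; the project names, shares and slice lengths are made up to mirror the 700/100 case above):

```python
# Toy sketch of resource-share scheduling via "long-term debt":
# every project earns debt in proportion to its share, the project
# that runs pays its debt down, and the scheduler always runs the
# project with the highest debt.

def pick_project(projects):
    """Pick the project with the highest accumulated debt."""
    return max(projects, key=lambda p: p["debt"])

def run_slice(projects, seconds):
    """Run one scheduling slice and update each project's debt."""
    total_share = sum(p["share"] for p in projects)
    chosen = pick_project(projects)
    for p in projects:
        # Debt accrues in proportion to resource share...
        p["debt"] += seconds * p["share"] / total_share
    # ...and the project that actually ran pays its debt down.
    chosen["debt"] -= seconds
    return chosen["name"]

projects = [
    {"name": "MilkyWay", "share": 700, "debt": 0.0},
    {"name": "Collatz",  "share": 100, "debt": 0.0},
]

history = [run_slice(projects, 60) for _ in range(80)]
print(history.count("MilkyWay"), history.count("Collatz"))  # 70 10
```

Over 80 one-minute slices this toy scheduler gives MW 70 slices and Collatz 10 - exactly the 7:1 share ratio - which is the outcome the strict-FIFO GPU rule overrides.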

I run with a one day something cache and right now my ATI system seems to run a day on MW if it can get work, then it will DL 90 or so Collatz and spend a day running them off ... then again, I also run with 100/100 share allocation so I am not quite in your situation...

Anyway, your problem is that you are running into one of the later brainstorms of UCB ... oh, alternatively, downlevel your installation and use app_info to make it work (of course you may not be able to run the CPU versions of MW and Collatz) ...

valterc

Joined: 28 Aug 09
Posts: 22
Credit: 673,135,926
RAC: 165,477
Message 32611 - Posted: 21 Oct 2009, 17:25:40 UTC - in response to Message 32607.  

> DA decided that GPU work should be done in strict FIFO order ... that means that if you want to run MW mostly you need to trim your queue to try to avoid large chunks of Collatz being DL. Of course, the issue here is that you have network drops and lowering the queue means you will run out of work ...
>
> In that this would not solve your problem ... you need to complain on the alpha mailing list ... I have, but so far to silence ... What you really want is to have Collatz on hand and only run it if you cannot get MW work ... or you run dry because of a network outage ...
>
> Theory says that Resource Share will be respected over time ... but I cannot prove that ... in that the debt calculations seem to have been bent for some time my suspicion is that this is also not true ...
>
> I run with a one day something cache and right now my ATI system seems to run a day on MW if it can get work, then it will DL 90 or so Collatz and spend a day running them off ... then again, I also run with 100/100 share allocation so I am not quite in your situation...
>
> Anyway, your problem is that you are running into one of the later brainstorms of UCB ... oh, alternatively, downlevel your installation and use app_info to make it work (of course you may not be able to run the CPU versions of MW and Collatz) ...



I already use app_info; in fact I have a setup that forces CPU-only for SETI & PrimeGrid and GPU-only for MW & Collatz.

The only way I know of to get some MW crunching is babysitting the whole thing and suspending Collatz from time to time...

Oh, btw, a couple of days ago I reset both MW & Collatz after letting them run empty (no new tasks), but I still see the same behavior...
Paul D. Buck

Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 32615 - Posted: 21 Oct 2009, 18:38:17 UTC - in response to Message 32611.  

> The only way I know of to get some MW crunching is babysitting the whole thing and suspending Collatz from time to time...
>
> Oh, btw, a couple of days ago I reset both MW & Collatz after letting them run empty (no new tasks), but I still see the same behavior...

Until UCB removes the rule of strict FIFO on GPU tasks that is all you can do ...

The silliest thing is that the rule was added because of the instability of scheduling tasks on the GPU ... which was actually caused by other factors (some of which they also don't acknowledge) and a couple of bugs that have since been removed (somewhere in 6.10.7 through 6.10.11)...
Gary Roberts

Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,272
RAC: 0
Message 32630 - Posted: 22 Oct 2009, 0:40:06 UTC - in response to Message 32615.  

> The only way I know of to get some MW crunching is babysitting the whole thing and suspending Collatz from time to time...
>
> Until UCB removes the rule of strict FIFO on GPU tasks that is all you can do ...

Well, not quite.... It is possible to have MW crunching continuously and out of strict FIFO order, with older Collatz tasks sitting in the wings and waiting, with not much baby-sitting needed. I've been doing it for the last week or so and it is working quite well. My desired resource share is 90/10 in favour of MW and my current RAC ratio is 850K/90K so I'm quite happy with the results so far.

The aim is for Collatz to crunch only when the MW mini outages cause the restricted MW cache to be exhausted. This probably happens at least once or twice per day so part of the plan has to be to NOT have a huge Collatz cache so that MW can resume in a reasonably short time. On 4850s a CC task takes around 12mins so a cache setting of 0/0.1 days gives around 12 tasks after which MW will resume if the mini outage has been cleared up. If not, more CC tasks will download as required to maintain the buffer until MW is available.
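The sizing arithmetic there checks out (using the figures from the post - roughly 12 minutes per CC task on a 4850 and a 0.1-day buffer):

```python
# Rough cache sizing: how many ~12-minute Collatz tasks fit a 0.1-day buffer?
buffer_days = 0.1
task_minutes = 12
minutes_per_day = 24 * 60  # 1440
tasks_buffered = buffer_days * minutes_per_day / task_minutes
print(tasks_buffered)  # 12.0 -> about a dozen CC tasks before MW can resume
```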

Most of my hosts are dual cores so the max cache for MW is only 12 tasks. The trick to making sure MW crunches continuously and out of FIFO order is to set the <count> tag in app_info.xml to 0.5 so that there are always two tasks running. Running 'out of phase' is also required to prevent a newer CC task from 'sneaking in' if the two MW tasks happen to finish at exactly the same time. Even a few seconds 'out of phase' is enough to keep the CC tasks at bay. CC gets to run immediately ONLY when the MW cache runs out or when MW has been running for so long that resource share considerations intervene. Yes, I do believe I've actually (by chance) watched that happen - a MW task getting suspended 'in flight' to allow a CC task to start.
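The <count> trick looks roughly like this in app_info.xml - a hypothetical fragment only (the app name, version number and executable file name all depend on which MW ATI application you actually have installed); the relevant line is <count>0.5</count>:

```xml
<app_info>
  <app>
    <name>milkyway</name>
  </app>
  <file_info>
    <name>astronomy_0.20_ATI_x64.exe</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>milkyway</app_name>
    <version_num>20</version_num>
    <coproc>
      <type>ATI</type>
      <count>0.5</count>  <!-- half a GPU per task, so two run at once -->
    </coproc>
    <file_ref>
      <file_name>astronomy_0.20_ATI_x64.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>
```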

There are two 'problems' with this technique. Firstly, if both projects happen to run dry simultaneously, the GPUs will grow cold. Secondly, some micromanagement is needed to maintain work for the project running on the CPUs. In my case I run Einstein and I like to have a cache of several days. So, every second day or so, on each GPU host I set NNT for Collatz, raise the cache from 0/0.1 to 0/6.1 and allow a whole bunch of EAH tasks to download. When they are finished, I set the cache back to 0/0.1 and remove NNT on Collatz. Problem solved for a couple of days.

Cheers,
Gary.
verstapp

Joined: 26 Jan 09
Posts: 589
Credit: 497,834,261
RAC: 0
Message 32632 - Posted: 22 Oct 2009, 1:55:24 UTC - in response to Message 32630.  
Last modified: 22 Oct 2009, 1:56:03 UTC

> if both projects happen to run dry simultaneously
There's always FaH. :p
Cheers,

PeterV



©2019 Astroinformatics Group