Welcome to MilkyWay@home

Total estimated CPU-time increases constantly during calculations - why?

Paul Vleugels

Joined: 1 Apr 09
Posts: 2
Credit: 18,201
RAC: 0
Message 43458 - Posted: 3 Nov 2010, 22:34:43 UTC

While running a WU with an initial estimated run-time of 24 hrs, the CPU time spent is now 36 hrs and the estimated time to finish is an additional 14 hrs. So, instead of 24 hrs, the total CPU time is now set to 50 hrs (but it could just as well be 60 or 70 or ? hrs by the time it has really finished).

What's going on? Why is the CPU time increasing constantly? The bottleneck is that I need to spend at least 50 hrs of CPU time and will therefore be late reporting back to the server.

Can the estimated CPU-time of a WU be decreased to 6 hrs or so (being constant!) in order to run within the expected reporting time + give other projects a chance to participate?
ID: 43458 · Rating: 0
Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 43465 - Posted: 4 Nov 2010, 4:52:50 UTC - in response to Message 43458.  

The problem is the way BOINC calculates time to completion. It's a topic all on its own as to why BOINC went the way it did, and it is the subject of much controversy. It would seem a simple thing on the face of it; however, BOINC is tracking a lot of disparate Projects, and some it knows nothing about yet. That complicates things massively, and the conversation rapidly heads South into Geek Land :)

The net effect is that it will "learn" as time goes on how long it takes to crunch a Project WU. That can rapidly change if you change Projects and crunch dramatically shorter or longer WUs; that will put it back to a way-out estimate.

Stay as you are; you will find the completion time will settle the more WUs you do. Meanwhile, a good way of estimating is to take the time taken to date and the percentage completed already, and extrapolate that to 100%. It's usually on the money after the first 30% or so of the WU has been crunched. So (for example) if a Project WU shows as 40% completed and took 4 hours, then you've got another 6 hours to go.
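The extrapolation can be written down as a couple of lines of Python. This is only a sketch of the rule of thumb described in this post; the function name `remaining_hours` is made up for illustration and is not part of BOINC.

```python
def remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Extrapolate the time still needed from progress so far.

    elapsed_hours: CPU time spent on the WU so far.
    fraction_done: progress as a fraction (0.40 means 40% completed).
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    # Projected total run time if the WU keeps crunching at the same rate.
    projected_total = elapsed_hours / fraction_done
    return projected_total - elapsed_hours


# The example from the post: 40% done after 4 hours.
print(round(remaining_hours(4.0, 0.40), 2))
```

It assumes progress is roughly linear in time, which is why the figure tends to be more trustworthy once a fair chunk of the WU is done.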

Project Staff can, and do, try to mitigate the effect of this by inserting a parameter value at server level, but that in itself has problems (back to geek land).

Bottom line: extrapolate from the percentage completed and the time taken to get that far, ignore the time to completion, and all's well.

Regards
Zy
ID: 43465 · Rating: 0
55degrees

Joined: 8 Sep 09
Posts: 62
Credit: 61,330,584
RAC: 0
Message 43470 - Posted: 4 Nov 2010, 17:42:15 UTC - in response to Message 43465.  

zydor, you are full of good stuff, and I appreciate your 68xx/69xx insight as well. hopefully the 6970s come December. thanks.

to this topic: is there a calculation for gpus for the initial "to completion?" I recently crunched a project, 5870s, and the initial to completion was nearly 24 hours yet two WUs were crunched <10min. it progressed as usual so that eventually my cache normalized around a more accurate crunching time, but that initial to completion was a bit too far out there. so I dismissed it as probably something like sifting out the pretenders and not handing out too many WUs for someone just messing. that is, until I read your post.

is the initial to completion determined by cpu specs regardless?

thanks.
ID: 43470 · Rating: 0
Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 43504 - Posted: 5 Nov 2010, 12:19:57 UTC - in response to Message 43470.  
Last modified: 5 Nov 2010, 12:38:27 UTC

The creaking noise you might be hearing is the lid of Pandora's box being lifted ..... there is no shortcut in explaining this, and the background needs to be understood, else we end up with a "why" still hanging around, so bear with me. Any BOINC Devs/experienced Alpha testers out there, check me out here and feel free to correct detail: putting this lot down without taking up the equivalent of "War and Peace" to cover every niggly twist and turn is always full of potential grief :)

First, be clear on responsibility for processing work. Project staff handle the Work Unit generation and present it to the BOINC scheduler for delivery. The BOINC server provides 100 slots by default for the Project staff to place WUs ready for collection by the BOINC scheduler. The work generator says "ok, there you go, come and get 'em". At that point all Project responsibility ceases. The BOINC scheduler knows the priority you want for WUs from the percentages you set as preferences, and goes to each Project in turn and picks up from the 100 slots as many WUs as it thinks are needed to meet your set Project preferences for total cache.

The slots get refreshed by the Project every two seconds. More slots can be added by each Project via a server parameter to increase capacity; however, there are limitations, as each increase steals some kernel memory and a recompile is needed. An easy process, but a fiddly fact, which has its ultimate end point in the memory available to the server, balancing off other needs for that server memory.

Ok, now the scheduler picks up its target WUs and transports them to your BOINC client: "here ya go, crunch these". Easy, you might say, but at this point "devil's in the detail" surfaces. How does the BOINC scheduler know the total WUs it needs to present to your Client? It gets that from the buffer setting in your preferences (4 days plus XX additional, etc). To do that it needs to know how long each WU will take to crunch. Simplistic example: WU one hour long, 24 needed for one day's cache. How does BOINC know the time taken to crunch 'em? It gets that from an estimate placed in the WU by Project staff.
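The "one hour WU, 24 for a day's cache" arithmetic looks like this in Python. It is a simplified sketch, not the scheduler's actual code; `tasks_for_cache` and its parameters are hypothetical names chosen for illustration.

```python
import math


def tasks_for_cache(buffer_days: float, estimate_hours_per_task: float) -> int:
    """Number of WUs needed to fill a cache of `buffer_days` days,
    given the per-WU run-time estimate in hours."""
    if estimate_hours_per_task <= 0:
        raise ValueError("estimate must be positive")
    # Round up: a partly filled slot still needs a whole WU.
    return math.ceil(buffer_days * 24 / estimate_hours_per_task)


# One-hour WUs and a one-day buffer, as in the example above.
print(tasks_for_cache(1, 1))
```

The point to notice is that the count depends entirely on the per-WU estimate, so a wrong estimate gives a wrong cache size.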

However, that estimate is a nightmare considering the number of different machines and hardware capability now and in the future. So another fundamental value surfaces: it's a value based on the theoretical BOINC "power" measure, Cobblestones, in that the measure says how many cobblestones of effort are needed to crunch this WU. That value is known as the Duration Correction Factor (DCF); not strictly true, but it will do for this explanation. You can see yours by clicking a Project in the Projects tab and selecting "Properties" from the menu on the left.

The detail of the calculation of the project time estimate and DCF value is another Pandora's box, let's not go there for the moment .... a DCF of one is the "norm", a plus value means other projects owe you crunching time, a minus value means you owe other projects crunching time. Over time BOINC uses these values to ensure in the long term that you crunch to the proportions and priorities you set in preferences.

BOINC then counts the number of WUs you have on the PC, uses the DCF x Project estimate to work out how many more you need to meet your preferences, and goes and gets them. If the number held times the duration factors exceeds the cache value you set, BOINC refuses to get any more: the dreaded "no work requested".

Now another twist: all this is based on fixed PC ability, but you use the machine in different ways which affect time to crunch. So BOINC factors in that variance by dynamically calculating how long you are taking to do them, and changing the DCF value on the fly. It's this change that's at the core of your original question re completion time. The BOINC Client uses the DCF x Project estimate to paint to the PC screen the time left to complete. Hopefully now you can see why that value can be massively out initially.
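A rough sketch of that decision in Python, under the simplifying assumption that every WU on hand carries the same project estimate. The function `hours_to_request` and its parameter names are invented for this illustration and do not correspond to real BOINC client code.

```python
def hours_to_request(buffer_hours: float,
                     tasks_on_hand: int,
                     project_estimate_hours: float,
                     dcf: float) -> float:
    """Hours of extra work to ask for; 0.0 stands for "no work requested"."""
    # Corrected per-WU duration: DCF x Project estimate.
    corrected_hours = dcf * project_estimate_hours
    queued_hours = tasks_on_hand * corrected_hours
    # Never request a negative amount of work.
    return max(0.0, buffer_hours - queued_hours)


# Ten 1-hour WUs on hand against a 24-hour buffer.
print(hours_to_request(24.0, 10, 1.0, dcf=1.0))  # room for more work
print(hours_to_request(24.0, 10, 1.0, dcf=3.0))  # cache looks full already
```

With the same ten WUs on hand, a higher DCF alone is enough to make the cache look full and stop further requests.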
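To see why the displayed figure can stay far off for a while, here is a toy simulation in Python. Big caveat: the update rule below (nudging the DCF a small fraction `alpha` towards the observed ratio after each finished WU) is an assumption made for illustration only, not the BOINC client's actual rule, and `simulate_dcf` is a made-up name. The 24-hour estimate and 10-minute actual run time are the figures reported earlier in this thread.

```python
def simulate_dcf(project_estimate_hours: float,
                 actual_hours: float,
                 tasks: int,
                 alpha: float = 0.01,
                 dcf: float = 1.0) -> list[float]:
    """Return the displayed estimate (DCF x Project estimate, in hours)
    after each completed WU, using a hypothetical smoothing rule."""
    observed_ratio = actual_hours / project_estimate_hours
    estimates = []
    for _ in range(tasks):
        # Assumed rule: move the DCF a fraction alpha towards the observed ratio.
        dcf += alpha * (observed_ratio - dcf)
        estimates.append(dcf * project_estimate_hours)
    return estimates


# Project estimate of 24 hours, WUs actually finishing in 10 minutes.
history = simulate_dcf(24.0, 10 / 60, tasks=10)
for n, hours in enumerate(history, start=1):
    print(f"after WU {n}: displayed estimate {hours:.1f} h")
```

With a small correction step the displayed estimate drops only a little per WU, so it takes many completed WUs before it gets near the real run time. A larger `alpha` would converge faster but would also overreact to a single odd WU.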

Hence my original response: just take the completed percentage and time taken to get that far and extrapolate to 100%; it's usually accurate after around 30% of the WU is completed.

All these twists and turns are caused by a Fundamental Design decision by BOINC: that the software must handle unattended crunchers, for all projects, for all machines known or unknown now and in the future, of all types, sizes and power ability known or unknown, and account for infinite variance in how those machines are used by the user. If that's not a nightmare, what is rofl :)

I hope that's given an insight into the whys and wherefores, and it's not too much of a "clear as mud" explanation. There are many Pandora's boxes in that lot on the details .......

Overall, life's too short - just extrapolate the percentages and time taken to date

Regards
Zy
ID: 43504 · Rating: 0
55degrees

Joined: 8 Sep 09
Posts: 62
Credit: 61,330,584
RAC: 0
Message 43525 - Posted: 5 Nov 2010, 21:12:09 UTC - in response to Message 43504.  

wow. first, only by the POWER OF GREYSKULL did I find my way through your helpful explanation. thanks.

so, between the dcf and the project estimate then my guess, please correct, is that the dcf initially fights, for lack of a better word, the project estimate that's way off or out there. it is trying to correct ever so slowly from whatever estimate. now I better understand why after say 10 WUs crunched in <10 min each the to completion was still at 22+ hrs.

your effort to help explain is greatly appreciated.

thanks again.


ID: 43525 · Rating: 0
Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 43527 - Posted: 5 Nov 2010, 21:32:06 UTC - in response to Message 43525.  

You got it .... it's the source of much frustration; however, structuring the code this way is the only real alternative if the original design philosophy of complete software independence from any other factor is to be achieved. Many argue that it's too much to chew on and the original design criteria are too ambitious. I'll leave the resolution of that one to the gurus ... greater minds than mine, as they say rofl :)

Regards
Zy
ID: 43527 · Rating: 0
Crunch3r
Volunteer developer
Joined: 17 Feb 08
Posts: 363
Credit: 258,227,990
RAC: 0
Message 43530 - Posted: 5 Nov 2010, 22:06:21 UTC - in response to Message 43527.  

The only solution to 'fix' such issues is to rewrite the BOINC client code to match your needs, as the devs seem to be kind of resistant to input regarding such matters...

I pity all of you that run stock boinc clients...

LOL...


ID: 43530 · Rating: 0
55degrees

Joined: 8 Sep 09
Posts: 62
Credit: 61,330,584
RAC: 0
Message 43548 - Posted: 6 Nov 2010, 14:04:45 UTC - in response to Message 43530.  

why oh why didn't I take the [red] pill?
ID: 43548 · Rating: 0
Paul Vleugels

Joined: 1 Apr 09
Posts: 2
Credit: 18,201
RAC: 0
Message 43575 - Posted: 7 Nov 2010, 18:33:45 UTC

Thanks everybody for your clear comments. Looking at all the prerequisites, I can understand it is virtually impossible to calculate a reliable run-time in advance; I can live with that.

But my 2nd question is not answered: why not offer WUs that are, let's say, 6 times smaller (24 hrs on my laptop would initially be 4 hrs, maybe 8 hrs in reality)? Advantage: not blocking other projects, giving them a fair chance to get some CPU time. Disadvantage: higher traffic rate on the servers.

BTW: currently I again have another 24 hrs job waiting, expiring tomorrow, so I won't even start this one. A pity for MilkyWay!
ID: 43575 · Rating: 0


©2024 Astroinformatics Group