Short deadlines-High priority-suspend Milkyway

Author	Message
Thunder Joined: 9 Jul 08 Posts: 85 Credit: 44,842,651 RAC: 0	Message 5040 - Posted: 26 Aug 2008, 1:22:34 UTC - in response to Message 5038. Last modified: 26 Aug 2008, 1:23:40 UTC They all wind up running MW at high priority. I even have a machine with 10 days of queue and it still finds it necessary to run at "High Priority" Well, I hate to say this because it might seem obvious to some, but if you're asking the BOINC client to keep 10 days worth of work for a project that requires results to be returned in 3 days, then you're forcing it to run all of the tasks at high priority. You're forcing it into a situation in which it believes >2/3rds of the tasks cannot be completed in time. The project can't set the deadlines any longer or they're not going to be producing anything scientifically useful. You'd be getting a lot of credit, but the work would only be "flogging a dead horse". As much as I hate to discourage anyone from any project, if the other projects you volunteer for require you to keep a 10 day cache, then you're just not going to be able to run this project as well. I used to wonder why I kept hearing about everyone having high-priority problems when I never seemed to have any on even old computers with a relatively low resource share. The answer is simply that I never try to cache a lot of work. I have my cache set to "0" (always connected) with the "additional" at the default of .25 days. I almost never have results waiting to run on any project unless it's 6 hours or less from finishing the prior task. All the projects that I work for that are in a "normal" status do just great this way on 17 different computers and have for as long as I can remember. (Yes, I do have a full-time connection for all of them and I understand that modem connection users need to do things a bit differently)
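To put rough numbers on the cache-versus-deadline point above, here is a minimal sketch. It assumes a single project, tasks processed back to back at a steady rate, and one shared deadline counted from the moment the queue was filled. The function name is made up for illustration and none of this is BOINC's actual scheduling code.

```python
def fraction_late(cache_days: float, deadline_days: float) -> float:
    """Fraction of a full queue that would finish after the deadline.

    Assumption (not BOINC's real scheduler): the queue holds `cache_days`
    of work, is processed at a steady rate, and every task is due
    `deadline_days` after the queue was filled.
    """
    if cache_days <= deadline_days:
        return 0.0
    return (cache_days - deadline_days) / cache_days


# A 10-day queue against a 3-day deadline: 7 of the 10 days of work are late.
print(f"{fraction_late(10, 3):.0%}")    # 70%
# The small cache described in the post (0.25 days) never overruns.
print(f"{fraction_late(0.25, 3):.0%}")  # 0%
```

Under those assumptions the 10-day case comes out at 70%, in line with the ">2/3rds" figure in the post.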

Martin P. Joined: 21 Nov 07 Posts: 52 Credit: 1,756,052 RAC: 0	Message 5043 - Posted: 26 Aug 2008, 8:09:40 UTC - in response to Message 5040. They all wind up running MW at high priority. I even have a machine with 10 days of queue and it still finds it necessary to run at "High Priority" Well, I hate to say this because it might seem obvious to some, but if you're asking the BOINC client to keep 10 days worth of work for a project that requires results to be returned in 3 days, then you're forcing it to run all of the tasks at high priority. You're forcing it into a situation in which it believes >2/3rds of the tasks cannot be completed in time. The project can't set the deadlines any longer or they're not going to be producing anything scientifically useful. You'd be getting a lot of credit, but the work would only be "flogging a dead horse". As much as I hate to discourage anyone from any project, if the other projects you volunteer for require you to keep a 10 day cache, then you're just not going to be able to run this project as well. I used to wonder why I kept hearing about everyone having high-priority problems when I never seemed to have any on even old computers with a relatively low resource share. The answer is simply that I never try to cache a lot of work. I have my cache set to "0" (always connected) with the "additional" at the default of .25 days. I almost never have results waiting to run on any project unless it's 6 hours or less from finishing the prior task. All the projects that I work for that are in a "normal" status do just great this way on 17 different computers and have for as long as I can remember. (Yes, I do have a full-time connection for all of them and I understand that modem connection users need to do things a bit differently) Thunder, I have set my cache to 3 days for some good reasons.
One of them is the situation I had this weekend: On Friday night our electricity broke down and it took until this morning until all servers were up and running again. My 3-day cache kept my computer running and all projects produced valid results with just one exception: Milkyway@Home. Due to the short deadlines I lost 15 workunits which were completed properly but reported too late to validate. Besides that, I see no reason why longer deadlines (e.g. 5 days instead of 3) should lead to useless results. The stars have been here for several billion years now, so why should a delay of 2 days change the scientific outcome? This sounds like a very cheap excuse.

STE\/E Joined: 29 Aug 07 Posts: 486 Credit: 576,548,171 RAC: 0	Message 5044 - Posted: 26 Aug 2008, 10:17:58 UTC I think the Deadline is a little short too, but it is what it is, I guess for a reason. If the Deadline was 7 days and something happened to somebody so that they couldn't get their WUs in on time, then they would be wanting even longer Deadlines. So I guess it's the chance you take running a short Deadline Project. How did you keep your Computer running for more than 2 & 1/2 days, though? I have Backup UPSes but they are only good for 20 minutes or so, any longer than that and they shut the Boxes off ... :)

Martin P. Joined: 21 Nov 07 Posts: 52 Credit: 1,756,052 RAC: 0	Message 5045 - Posted: 26 Aug 2008, 10:56:45 UTC - in response to Message 5044. Last modified: 26 Aug 2008, 10:57:25 UTC ... how did you keep your Computer running for more than 2 & 1/2 days, though? I have Backup UPSes but they are only good for 20 minutes or so, any longer than that and they shut the Boxes off ... :) PoorBoy, electricity was back approx. 2 hours later and my computer is configured to automatically restart after a power failure. However, the servers and switches were down and therefore I had no network and internet access until my IT department could fix everything this morning (small company, took 2 days to replace the broken switch). In the meantime my computers were happily crunching the projects and WUs that were in the cache before the power failure.

STE\/E Joined: 29 Aug 07 Posts: 486 Credit: 576,548,171 RAC: 0	Message 5046 - Posted: 26 Aug 2008, 12:07:21 UTC - in response to Message 5045. ... how did you keep your Computer running for more than 2 & 1/2 days, though? I have Backup UPSes but they are only good for 20 minutes or so, any longer than that and they shut the Boxes off ... :) PoorBoy, electricity was back approx. 2 hours later and my computer is configured to automatically restart after a power failure. However, the servers and switches were down and therefore I had no network and internet access until my IT department could fix everything this morning (small company, took 2 days to replace the broken switch). In the meantime my computers were happily crunching the projects and WUs that were in the cache before the power failure. Ahhhhhh, okay, was just wondering, mine are configured to restart after the power comes back on too ... :)

Thunder Joined: 9 Jul 08 Posts: 85 Credit: 44,842,651 RAC: 0	Message 5052 - Posted: 26 Aug 2008, 23:22:16 UTC - in response to Message 5043. Besides that, I see no reason why longer deadlines (e.g. 5 days instead of 3) should lead to useless results. The stars have been here for several billion years now, so why should a delay of 2 days change the scientific outcome? This sounds like a very cheap excuse. The nature of their methodology dictates the requirement. They use what is known as a "genetic" algorithm. What this means is that the output from the results you return TODAY influences the parameters of the work generated TOMORROW. It would be IDEAL for the science of the project if no core kept more than one task ready to work at any given time, but they've extended it to 3 days so that certain computers can have a chance to contribute results, even if they're not quite as valuable as those that are returned/refreshed quickly. You're asking this entire project to change how it works just to allow you to deal with one extremely isolated event (at least I hope multi-day failures of your network are not a regular occurrence). You said it yourself: In the meantime my computers were happily crunching the projects and WUs that were in the cache before the power failure. That's a GOOD thing about BOINC. You were able to continue work for other projects. I understand your frustration because nobody likes to have idle time or work that they aren't credited for, but by the time you returned that work it was of little value to the project anyhow.
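For readers who want to experiment with the "today's results shape tomorrow's work" idea, the following is a toy sketch only. It is not the project's algorithm: the objective function, population size, mutation step and every name in it are assumptions made up for illustration. New candidates are bred from the population as it stands when a work unit is generated, and each result comes back `delay` steps later.

```python
import random


def fitness(x: float) -> float:
    """Hypothetical objective for the toy search; lower is better."""
    return (x - 3.0) ** 2


def best_after(delay: int, steps: int = 200, pop_size: int = 10, seed: int = 1) -> float:
    """Run a toy asynchronous evolutionary search and return the best fitness.

    Each step generates one "work unit" from the current population; its
    result is only folded back in `delay` steps later (the turnaround time).
    """
    rng = random.Random(seed)
    population = [rng.uniform(-10, 10) for _ in range(pop_size)]
    in_flight = []  # (step at which the result returns, candidate value)
    for step in range(steps):
        # Breed a candidate from the population as it stands right now.
        parent = min(rng.sample(population, 2), key=fitness)
        in_flight.append((step + delay, parent + rng.gauss(0, 1.0)))
        # Fold in every result whose turnaround time has elapsed.
        returned = [c for due, c in in_flight if due <= step]
        in_flight = [(due, c) for due, c in in_flight if due > step]
        for candidate in returned:
            worst = max(population, key=fitness)
            if fitness(candidate) < fitness(worst):
                population[population.index(worst)] = candidate
    return min(fitness(x) for x in population)


print(best_after(delay=0), best_after(delay=50))
```

Comparing small and large `delay` values over several seeds gives a feel for how slow returns interact with a search that feeds on its own results. The outcome depends on the made-up parameters, so treat it as a toy to play with and not as evidence about the real project.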

Ricky@SETI.USA Joined: 4 Feb 08 Posts: 19 Credit: 179,971 RAC: 0	Message 5053 - Posted: 27 Aug 2008, 0:03:51 UTC Last modified: 27 Aug 2008, 0:06:52 UTC Well, it looks like I will be dropping this project for now until some time when you have longer deadlines. My laptops can't keep up with them. In fact I have a Compaq that shows 2 WUs due tomorrow and they show 11 hours each. Milkyway is the only project on this laptop and by looking at the deadline I know only 1 will make it. I had one laptop that downloaded its limit but the deadlines were so near that by the time it finished 2 of them the other 4 had no chance of making the deadline. To top it off I don't have an "always on" network connection. So as my laptops finish I will be done.

Brian Silvers Joined: 21 Aug 08 Posts: 625 Credit: 558,425 RAC: 0	Message 5054 - Posted: 27 Aug 2008, 2:22:37 UTC - in response to Message 5053. Last modified: 27 Aug 2008, 2:26:27 UTC Well, it looks like I will be dropping this project for now until some time when you have longer deadlines. I had been watching this conversation, tempted to respond, but deciding against it...until now. People have repeatedly informed you that the project's needs are such that the tasks must be returned quickly. There are only two other ways they can do this: increase the number of initial replications to something like 3-5 and run the risk of having redundant work or spend time on application optimization. We know how the optimization has played out over the past few days, so that's a non-starter for the moment. Additionally, various volunteers also like to get way too bent out of shape about "wasted" work that is generated by over-replications, so much so that various individuals repeatedly bring up the topic in one of the other BOINC projects that has a longer deadline than this one... My laptops can't keep up with them. Then, unfortunately, this is not the best project for your laptops. Stuff like that happens. From what I can tell, assuming that what they say is true, that the quorum of 1 is "ok" from a scientific standpoint, then what they are doing now is the best course of action for them given the dynamics of their project. It is their project, not yours. While it is your computer, they set the rules. If you don't like the rules, then change projects. Consider it like dating... It didn't work out for the two of you, so it's time to move on...

Buster Gunn Joined: 3 Oct 07 Posts: 5 Credit: 329,770 RAC: 0	Message 5055 - Posted: 27 Aug 2008, 3:17:23 UTC Last modified: 27 Aug 2008, 3:21:13 UTC After reading the replies, I've decided to cut back MW to 1 machine. Impatience by the project for results is not a valid reason for weird deadlines. For the life of me I can't think of a valid reason that a result would be useless on day 5 or 6, but not on day 3. And yes, there are many reasons for days of queue to be set at 10. I only have it on 1 machine out of 12 but it is necessary. Great science, not so great execution of a solution.

Martin P. Joined: 21 Nov 07 Posts: 52 Credit: 1,756,052 RAC: 0	Message 5056 - Posted: 27 Aug 2008, 7:35:13 UTC - in response to Message 5052. Last modified: 27 Aug 2008, 7:37:39 UTC The nature of their methodology dictates the requirement. They use what is known as a "genetic" algorithm. What this means is that the output from the results you return TODAY influences the parameters of the work generated TOMORROW. It would be IDEAL for the science of the project if no core kept more than one task ready to work at any given time, but they've extended it to 3 days so that certain computers can have a chance to contribute results, even if they're not quite as valuable as those that are returned/refreshed quickly. Thanks Thunder! This is the first time I have received some useful information about WHY things are like that. However, in this case it would be a nice feature if I could define the number of WUs that are downloaded by each host. If my hosts downloaded fewer work-units all problems would be gone. Unfortunately my PowerMac Dual 2.7GHz insists on downloading 2 work-units every time it returns one and my Mac Pro 8x3 GHz always downloads the maximum number of 20 work-units (I already suspended this project on the other computers). Due to this fact both machines run Milkyway in high-priority mode. If I could limit the number of WUs downloaded to 1 and 10 respectively all problems would be gone. Currently I have to use this workaround: Let it download work, set Milkyway to "No new work" and manually abort some work-units until the rest runs in normal mode.

Odd-Rod Joined: 7 Sep 07 Posts: 444 Credit: 5,715,481 RAC: 0	Message 5057 - Posted: 27 Aug 2008, 9:19:01 UTC A few thoughts and questions. Thunder gave a good explanation of why short deadlines are needed by the project. He (sorry if that's wrong!) also said he doesn't have high priority problems. Neither do I because I also keep a very small cache. The only exception was when the crunch time was increased without increasing the estimated crunch time, but after 1 WU the DCF sorted that out. Those with cached WUs would probably have seen those running in high priority, and maybe even not completing in time. Since the estimate is still not adjusted, any new crunchers will have the same problems till the DCF adjusts and their initial cached WUs are done. And remember, after running in high priority Milkyway's Long Term Debt (LTD) will prevent it downloading more WUs till other projects have had their share. But my actual question is: Where does the time pressure come from? What I mean is, who actually says that 'x' amount of science has to be done within a certain amount of time? As long as there are still enough hosts here, it's not a problem. If enough hosts were to stop crunching, they still wouldn't get enough results back in time, but that's for the admins to keep an eye on. Consider this line from the front page: This particular project is being developed to better understand the power of volunteer computer resources. Maybe they're exploring where they can get the most science in the least time. When the returns drop too low, they could adjust things again, but - that won't guarantee a return of crunchers. Anyhow, those are my thoughts for now. I'm still crunching - even a PII 400MHz has no problem here.

Brian Silvers Joined: 21 Aug 08 Posts: 625 Credit: 558,425 RAC: 0	Message 5058 - Posted: 27 Aug 2008, 10:58:46 UTC - in response to Message 5055. After reading the replies, I've decided to cut back MW to 1 machine. Impatience by the project for results is not a valid reason for weird deadlines. For the life of me I can't think of a valid reason that a result would be useless on day 5 or 6, but not on day 3. The returned values of the results generated today would be used to generate the results "tomorrow" (within 3 days). This means that the longer the deadline, the easier it will be for the project to run out of work for everyone, since the results are additive, meaning the work that I do today was built upon the work you did yesterday, and the work you did yesterday was built upon the work Odd-Rod did two days ago, etc, etc, etc... If you have someone hanging onto tasks, it can impede progress and make it to where things cannot move forward.

Ricky@SETI.USA Joined: 4 Feb 08 Posts: 19 Credit: 179,971 RAC: 0	Message 5059 - Posted: 27 Aug 2008, 10:59:18 UTC Last modified: 27 Aug 2008, 11:00:16 UTC There is another reason why I am dropping the project and that is due to the weather and loss of power. I never know when a storm may pop up and I have to shut everything down for as much as 6 hours. If one comes in the middle of the night and I get up and shut things down I am not going to wait up to turn them on again... I go back to bed! BTW I have run this project before and had no problems with the deadlines. I hate missing deadlines. Yes I am the one to blame here because I didn't check before I came back to this project. I just re-attached because my Team picked Milkyway as the Project Of The Month (POTM). Like I said I will return some day when the deadlines are not so short.

Thunder Joined: 9 Jul 08 Posts: 85 Credit: 44,842,651 RAC: 0	Message 5060 - Posted: 27 Aug 2008, 12:30:24 UTC - in response to Message 5056. If my hosts downloaded fewer work-units all problems would be gone. Unfortunately my PowerMac Dual 2.7GHz insists on downloading 2 work-units every time it returns one and my Mac Pro 8x3 GHz always downloads the maximum number of 20 work-units (I already suspended this project on the other computers). Due to this fact both machines run Milkyway in high-priority mode. If I could limit the number of WUs downloaded to 1 and 10 respectively all problems would be gone. Currently I have to use this workaround: Let it download work, set Milkyway to "No new work" and manually abort some work-units until the rest runs in normal mode. Martin, I still think we can solve even these problems, but I'll have to have quite a bit of detailed information from your computers to try... I'm still not sure if we can solve it for your PowerMac G5. I don't question that the G5 is still a good processor, but it seems that the stock application on this project runs excruciatingly slowly on that processor. I have a few processors that produce at an unusually slow rate on this project only and I've chosen not to run it on them only. For now, you may have to do the same. The problem on the other computers either has to do with your cache settings or else it's an issue with the "debt" in the BOINC client for the project. If you're willing to share that info, please post your BOINC client version, your settings for "Computer is connected to the Internet about every..." and "Maintain enough work for an additional...". Then, from the client_state.xml file on each computer find the section that begins: <project> <master_url>http://milkyway.cs.rpi.edu/milkyway/</master_url> and look for the <short_term_debt> and <long_term_debt> lines and copy/paste them here. I'm not 100% positive that will reveal everything we'll need, but it's a good start.
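For anyone who would rather not dig through client_state.xml by hand, the lookup described above can be sketched in a few lines of Python. This assumes the file parses as well-formed XML and that the debt values sit directly inside the matching <project> element, as the post describes. The function name and the sample numbers are made up for illustration.

```python
import xml.etree.ElementTree as ET

MILKYWAY_URL = "http://milkyway.cs.rpi.edu/milkyway/"


def project_debts(xml_text: str, master_url: str = MILKYWAY_URL):
    """Return the two debt values for the project with the given master_url.

    Assumes client_state.xml is well-formed XML; returns None when the
    project is not found, and None for any debt element that is missing.
    """
    root = ET.fromstring(xml_text)
    for project in root.iter("project"):
        url = (project.findtext("master_url") or "").strip()
        if url == master_url:
            return {
                tag: (project.findtext(tag) or "").strip() or None
                for tag in ("short_term_debt", "long_term_debt")
            }
    return None


# Made-up sample in the shape the post describes (values are NOT real data).
SAMPLE = """
<client_state>
  <project>
    <master_url>http://milkyway.cs.rpi.edu/milkyway/</master_url>
    <short_term_debt>-1234.5</short_term_debt>
    <long_term_debt>-67890.0</long_term_debt>
  </project>
</client_state>
"""

print(project_debts(SAMPLE))
```

To run it against a real file, read the file into a string first (the location of client_state.xml depends on the installation) and pass that to `project_debts`.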

Brian Silvers Joined: 21 Aug 08 Posts: 625 Credit: 558,425 RAC: 0	Message 5062 - Posted: 27 Aug 2008, 15:59:16 UTC - in response to Message 5059. Like I said I will return some day when the deadlines are not so short. That day may not come. I think that the deadlines are actually appropriate for the task that they're trying to do as it has been described. Like the saying goes, "there is more than one way to skin a cat". From a relative perspective, if application performance is increased (read -> "optimization"), the relative pressure of a short deadline is decreased. The problem that you are seeing right now is because the app appears to have significant room for performance improvements, but they need the deadlines short so as to keep work flowing since the results are additive. So, IMO, instead of protesting for longer deadlines, the better thing to do is protest for application performance improvements... IMO, YMMV, etc, etc, etc...

Angus Joined: 8 Nov 07 Posts: 20 Credit: 257,763 RAC: 0	Message 5063 - Posted: 27 Aug 2008, 17:19:03 UTC Why are the estimated run times on these WUs so short (8:20) , when we all know they take hours to run? I can't get this project to download less than 20 WU every time, and there is NO chance that that many will finish in time. If the run times were adjusted properly, the client would (should!) only download what it can finish by deadline.

Thunder Joined: 9 Jul 08 Posts: 85 Credit: 44,842,651 RAC: 0	Message 5066 - Posted: 27 Aug 2008, 18:02:04 UTC - in response to Message 5063. Why are the estimated run times on these WUs so short (8:20) , when we all know they take hours to run? I can't get this project to download less than 20 WU every time, and there is NO chance that that many will finish in time. If the run times were adjusted properly, the client would (should!) only download what it can finish by deadline. Angus, you're exactly correct and I know for a fact that Travis is aware of the problem. They're working on a new application to be released soon and I know they're making every effort to have the estimated runtimes (technically it's the estimated fpops) for these tasks correct. In the meantime, there's almost no harm in aborting the tasks that you know your computer cannot complete and if you complete a few of the long tasks without resetting the project then your BOINC client will be much closer to a correct estimated time.

Brian Silvers Joined: 21 Aug 08 Posts: 625 Credit: 558,425 RAC: 0	Message 5067 - Posted: 27 Aug 2008, 18:21:35 UTC - in response to Message 5066. Last modified: 27 Aug 2008, 18:22:36 UTC I know for a fact that Travis is aware of the problem. They're working on a new application to be released soon and I know they're making every effort to have the estimated runtimes (technically it's the estimated fpops) for these tasks correct. There's definitely room for improvement there and it will likely help on reducing the level of angst from people that feel that the project is "hogging" their system. My current TDCF is: Task duration correction factor 35.910769 I think almost everyone would agree that the estimates need to be a bit more realistic... ;-) Anyway, with the TDCF in place, the system should get restricted to fewer than 20 tasks a day. I believe my system only picked up 15 yesterday, but it was sharing with 3 Cosmology tasks. Cosmology should be cleared out by tonight, unless a miracle happens and tasks start downloading. From there I'm going to let the system run here for a week and see how things are handled...
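A quick worked example of what a correction factor that large does to the client's arithmetic. Assumptions, not facts from the thread: the "8:20" estimate quoted earlier is read as 8 minutes 20 seconds, it is paired with the TDCF of 35.910769 from this post even though the two figures come from different machines, and one core crunches Milkyway around the clock against the 3-day deadline. The function names are made up for illustration.

```python
DEADLINE_S = 3 * 24 * 3600  # the 3-day deadline mentioned in the thread, in seconds


def corrected_estimate(raw_estimate_s: float, dcf: float) -> float:
    """Runtime estimate after scaling by the duration correction factor."""
    return raw_estimate_s * dcf


def tasks_before_deadline(estimate_s: float, deadline_s: float = DEADLINE_S) -> int:
    """Whole tasks one core could finish before the deadline, back to back."""
    return int(deadline_s // estimate_s)


raw = 8 * 60 + 20                           # "8:20" read as 500 seconds (assumption)
fixed = corrected_estimate(raw, 35.910769)  # about 17,955 s, just under 5 hours

print(tasks_before_deadline(raw))    # 518: twenty tasks look trivially safe
print(tasks_before_deadline(fixed))  # 14: twenty tasks per core no longer fit
```

On these assumed numbers the uncorrected estimate makes a 20-task download look harmless, while the corrected one shows a single core could not finish 20 in time.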

Odd-Rod Joined: 7 Sep 07 Posts: 444 Credit: 5,715,481 RAC: 0	Message 5071 - Posted: 27 Aug 2008, 22:58:43 UTC - in response to Message 5058. The returned values of the results generated today would be used to generate the results "tomorrow" (within 3 days). This means that the longer the deadline, the easier it will be for the project to run out of work for everyone, ... If you have someone hanging onto tasks, it can impede progress and make it to where things cannot move forward Aha, that's thrown some more light on it for me! I had the feeling that someone was just in a hurry to get results. I hadn't realised how the project could grind to a halt! Thanks, Brian, for this. Rod

Odd-Rod Joined: 7 Sep 07 Posts: 444 Credit: 5,715,481 RAC: 0	Message 5072 - Posted: 27 Aug 2008, 23:27:15 UTC - in response to Message 5063. Why are the estimated run times on these WUs so short (8:20) , when we all know they take hours to run? If you're seeing estimates of 8:20, that suggests you've not completed any of the longer WUs. If you had, your Duration Correction Factor (DCF) should have adjusted itself and you would see a more realistic estimate. My P4 3 GHz has an estimate of 9h 04m for a WU at the moment with a DCF of 42.9266. I can't get this project to download less than 20 WU every time, and there is NO chance that that many will finish in time. Try setting your "Connect about every xx days" and "Additional work buffer" to very low values (mine are 0 and .05) until you've finished a longer WU. This way you should only get 1 WU at a time. But you do need to let one of the longer WUs finish to get the DCF in the ballpark. Just in case you're not aware of it, the original (short) WU names start gs_371, the much longer ones gs_372 and the medium length ones gs_373. If the run times were adjusted properly, the client would (should!) only download what it can finish by deadline. Indeed! But since there is a new application coming soon, I think we need to live with it. However, if there wasn't I would certainly have expected it to be sorted by now. Rod