Message boards : News : maximum time limit elapsed bug
Author | Message |
---|---|
Send message Joined: 30 Dec 09 Posts: 3 Credit: 56,529,517 RAC: 31,926 |
Still having the problem, although I did roll back the driver to 11.3. |
Send message Joined: 11 Nov 07 Posts: 232 Credit: 178,229,009 RAC: 0 |
Strange... Maybe it is related to the BOINC manager version? I use BOINC 6.10.60. |
Send message Joined: 15 Jul 11 Posts: 14 Credit: 5,978,191 RAC: 0 |
I had the same problem until I changed the <duration_correction_factor> in the client_state.xml file (I changed it first to 100, which was way too high, then changed it to 10 and let it run). Currently it is 0.959884, so it's OK now. I found this in one of the forums; if it was mentioned here already, I am sorry. Currently running Win7 x64, Catalyst 11.6b. Found it in the Number Crunching thread - Massive "exceeded elapsed time limit" errors |
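For reference, the element being edited lives per-project in client_state.xml; a sketch of the layout (surrounding elements abbreviated and may vary by BOINC version — stop BOINC before editing, since the client rewrites this file on exit):

```xml
<project>
    <master_url>http://milkyway.cs.rpi.edu/milkyway/</master_url>
    ...
    <!-- Multiplies the client's runtime estimates; the poster above
         set it high (10) and let BOINC converge it back down. -->
    <duration_correction_factor>10.000000</duration_correction_factor>
</project>
```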
Send message Joined: 19 Sep 08 Posts: 4 Credit: 1,955,671 RAC: 0 |
Nope, doesn't work for me... (win xp 32, radeon hd4830) |
Send message Joined: 19 Jul 10 Posts: 578 Credit: 18,845,239 RAC: 856 |
Here is a thought for all having the problem, at least for those of you who like some experiments: 1. Get a full cache, disable network activity in BOINC and then stop BOINC. 2. For each WU, increase the <rsc_fpops_bound> value in its <workunit> block in client_state.xml by a factor of 1000 or even more; appending three extra zeros to the value is enough, and more won't hurt. You may want to back up your data directory before that. 3. Crunch all WUs with the network disabled, then report all of them without requesting new tasks, then request new tasks on the next scheduler contact. 4. If you are still getting the error, try once more. The idea is to tell the server how much time the computer actually needs for a WU. The time estimate is calculated by the server and the value it is counting with obviously got screwed up. Or at least that's my theory... might be wrong as well, but since everything works for me, I can't test it. |
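A minimal sketch of step 2, assuming client_state.xml parses as ordinary XML and each <workunit> carries an <rsc_fpops_bound> element (the function name and toy input are illustrative; run anything like this only with BOINC stopped and on a backup copy):

```python
import xml.etree.ElementTree as ET

def raise_fpops_bound(xml_text, factor=1000):
    """Multiply every <rsc_fpops_bound> in the given client_state XML by factor."""
    root = ET.fromstring(xml_text)
    for wu in root.iter("workunit"):
        bound = wu.find("rsc_fpops_bound")
        if bound is not None:
            bound.text = "%.6e" % (float(bound.text) * factor)
    return ET.tostring(root, encoding="unicode")

# Toy example with one workunit:
state = """<client_state>
  <workunit>
    <name>ps_separation_82_example</name>
    <rsc_fpops_bound>2.960429e+15</rsc_fpops_bound>
  </workunit>
</client_state>"""
print(raise_fpops_bound(state))  # bound becomes 2.960429e+18
```

Multiplying by 1000 is the same as appending three zeros, as the post describes; the exact factor doesn't matter as long as the new bound comfortably exceeds the real runtime.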
Send message Joined: 30 Sep 09 Posts: 211 Credit: 36,977,315 RAC: 0 |
Strange... Maybe it is related to the BOINC manager version? I use BOINC 6.10.60. I've seen a number of posts elsewhere indicating that it easily could be, and those with this problem (and some of those without it) should mention which BOINC version they are using. It seems that some of the 6.12.* versions of BOINC fail to initialize a certain variable often used in calculating time limits for workunits, and therefore produce time limits as if a random number had been substituted for this variable. Also, some people have seen similar problems under some of the 6.10.* versions, but apparently from some other cause not identified yet. I'm using BOINC 6.10.58 under 64-bit Windows Vista SP2 with a GTS 450, and don't see this problem. If you can find out just WHICH variable isn't initialized, there MIGHT be a way to have app_info.xml supply a substitute value for that variable and therefore provide a good workaround for the affected 6.12.* versions of BOINC, even if this does not also work for similar problems under the 6.10.* versions. |
Send message Joined: 15 Jul 11 Posts: 14 Credit: 5,978,191 RAC: 0 |
Running 6.12.33. I had the same problem until I changed the <duration_correction_factor> in the client_state.xml file (I changed it first to 100, which was way too high, then changed it to 10 and let it run). Currently it is 0.959884, so it's OK now. And I do hope you stopped BOINC before the change and restarted it afterwards? |
Send message Joined: 30 Sep 09 Posts: 211 Credit: 36,977,315 RAC: 0 |
Yeah, tried the 75% option, which didn't work. How many CPU cores does your computer have, and does that count the extra ones hyperthreading makes appear on some CPUs? 75% should be fine if your CPU is enough like mine (4 CPU cores, no hyperthreading), but you're likely to need some other value otherwise. |
Send message Joined: 15 Jul 11 Posts: 14 Credit: 5,978,191 RAC: 0 |
I did some testing (running the same program, milkyway_separation_0.82_windows_x86_64__ati14.exe). When I remove my app_info.xml file and look at the properties of the workunit: Estimated app speed 36043.67 GFLOPs/sec, Estimated task size 29604 GFLOPs, Resources 0.05 CPUs + 1.00 ATI GPUs (device 1). And now it crashes. With app_info.xml: Estimated app speed 100.00 GFLOPs/sec, Estimated task size 14777 GFLOPs, Resources 0.02 CPUs + 1.00 ATI GPUs (device 1). So it's no wonder the "maximum time elapsed" error occurs. Or am I wrong? I have the following in the app_info.xml: <flops>1.0e11</flops>. Perhaps a combination of the duration factor and the flops calculation is way off. It also would be nice if the files stayed a little bit longer on the server so one could check the results. |
Send message Joined: 19 Jul 10 Posts: 578 Credit: 18,845,239 RAC: 856 |
Well, you have not completed even one WU without the app_info, so the server has no idea how fast your GPU is and the initial estimate is waaay off. Have you tried to teach the server the way I described above? |
Send message Joined: 15 Jul 11 Posts: 14 Credit: 5,978,191 RAC: 0 |
True, but if the application bombs out on the estimates, how can it calculate the correct value? I am running with the app_info, so no problem for me. I also checked Collatz, which is running without app_info, and it reports a GFLOPS value below 200, not 36043.67, on the same GPU. I think it is better to start with a too-low GFLOPS value than a too-high one if the program uses this value to calculate how long the application should run. Too high a value and the maximum time will surely be exceeded; too low a value and it will correct itself via the duration factor. Since I don't know how the program calculates it, this could be total rubbish. I also think the program estimates how long it should run based on these values and gives the error if that estimate is exceeded. When I started a few days ago on MilkyWay, all the workunits crashed, so how can the server learn the correct values if they all crash? Only after adding the app_info and changing the duration factor did the program run without error. I have no problem changing the XML files to fool the server, but your average user won't do that. |
Send message Joined: 19 Jul 10 Posts: 578 Credit: 18,845,239 RAC: 856 |
True, but if the application bombs out on the estimates, how can it calculate the correct value? The problem is (as I understand it) that part of the estimate is calculated by the server, which in the case of a new host has to simply guess how fast it is, and that guess can be completely wrong, as we see. Once it has some valid results, it can use those to calculate the values for the next tasks, so the estimate gets better. However, this needs a few results before it kicks in; that's why I suggested the slightly more complicated approach. On SETI it's 10 valid results; I don't know the number here, but it was something close to that when my HD3850 started to crunch here. Sure, if you want to use the app_info anyway, there is no need to fix the values for the stock app. |
Send message Joined: 15 Jul 11 Posts: 14 Credit: 5,978,191 RAC: 0 |
As far as I understand the program and can see in the properties of the workunit without app_info, the server gives you a time in which you have to finish the WU; in my case it was a few days or so. But the WU takes about 2 to 5 minutes to finish and then gives the maximum time limit elapsed error. So it looks to me like the program estimates how long the WU should take, and with a too-high GFLOPS value it will never finish "in time", so the program gives the error even though it finished within the time given by the server. Without app_info: Estimated app speed 36043.67 GFLOPs/sec, Estimated task size 29604 GFLOPs; the time needed would be less than a second. Perhaps that is why it is crashing. With app_info.xml: Estimated app speed 100.00 GFLOPs/sec, Estimated task size 14777 GFLOPs; time needed 147 seconds. |
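A quick back-of-the-envelope check of the figures above, assuming the runtime estimate is simply task size divided by claimed app speed (which is what the numbers in this thread suggest):

```python
def estimated_runtime(task_gflops, speed_gflops_per_sec):
    """Rough runtime estimate in seconds: total work divided by claimed speed."""
    return task_gflops / speed_gflops_per_sec

# Without app_info: absurdly high claimed speed -> sub-second estimate
print(round(estimated_runtime(29604, 36043.67), 2))   # ~0.82 s

# With app_info (<flops>1.0e11</flops>, i.e. 100 GFLOPS):
print(round(estimated_runtime(14777, 100.0), 2))      # ~147.77 s
```

A sub-second estimate against a real runtime of 2-5 minutes explains why the elapsed-time bound is blown almost immediately.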
Send message Joined: 19 Jul 10 Posts: 578 Credit: 18,845,239 RAC: 856 |
As far as I understand the program and can see in the properties of the workunit without app_info, the server gives you a time in which you have to finish the WU; in my case it was a few days or so. That's the report deadline and has nothing to do with this problem. The max. runtime is calculated from the rsc_fpops_bound value, which, since the introduction of server-side DCF, is adjusted by the server for each host and each app so that the local DCF should always stay around 1. The idea behind that is that you should not need any flops entries and the time estimates should be right for GPU and CPU apps. That was not always the case before that change, which is why we needed different flops values in the app_info to get the right amount of tasks in the cache, and people running stock apps always had problems, at least at SETI; I don't know about here. But if the server got it completely wrong at the beginning and all tasks error out, it can't correct itself, since it doesn't get any data to do that adjustment. |
Send message Joined: 15 Jul 11 Posts: 14 Credit: 5,978,191 RAC: 0 |
OK, I can understand that. But where does the "Estimated app speed 36043.67 GFLOPs/sec" come from? That is way too high. Perhaps the programmer should change the program so that when this error occurs the task is allowed to continue and the new speed value is reported to the server, or the server-side value is reset when the error occurs. |
Send message Joined: 19 Jul 10 Posts: 578 Credit: 18,845,239 RAC: 856 |
Where do you get this value from? |
Send message Joined: 15 Jul 11 Posts: 14 Credit: 5,978,191 RAC: 0 |
When the program is running, check the properties of the WU in the BOINC Manager. That is without app_info, of course; otherwise it would show the value from app_info, if entered in the file. But perhaps it is only reporting this high number on my machine. |
Send message Joined: 19 Jul 10 Posts: 578 Credit: 18,845,239 RAC: 856 |
Hmm... my 6.10.18 Manager doesn't have that yet, but I guess the app speed is the value from the <app_version><flops> tag, which can be found between the <file_info> and <workunit> tags of each project in client_state.xml. And the "task size" is then <rsc_fpops_est>. So changing the <flops> in <app_version> might possibly help as well, but I would still go with changing <rsc_fpops_bound> and letting BOINC adjust the other values by itself after it gets some "experience" with the app. |
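For illustration, the layout being described, roughly as it appears in client_state.xml (element order abbreviated; the numeric values are placeholders taken from the figures quoted in this thread):

```xml
<app_version>
    <app_name>milkyway</app_name>
    <version_num>82</version_num>
    <flops>1.0e11</flops>                        <!-- "estimated app speed" -->
    ...
</app_version>
...
<workunit>
    <name>ps_separation_82_example</name>
    <rsc_fpops_est>1.4777e+13</rsc_fpops_est>     <!-- "estimated task size" -->
    <rsc_fpops_bound>2.9604e+15</rsc_fpops_bound> <!-- basis of the runtime limit -->
</workunit>
```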
Send message Joined: 1 May 09 Posts: 1 Credit: 18,434,272 RAC: 0 |
All my last workunits on the ATI HD5850 went wrong with the message:
19/07/2011 10:50:20 | Milkyway@home | Aborting task ps_separation_82_2s_mix3_1_130095_0: exceeded elapsed time limit 56.69 (2960429.33G/52224.92G)
19/07/2011 10:50:22 | Milkyway@home | Computation for task ps_separation_82_2s_mix3_1_130099_0 finished
19/07/2011 10:50:22 | Milkyway@home | Starting task ps_separation_82_2s_mix3_1_130091_0 using milkyway version 82
19/07/2011 10:51:20 | Milkyway@home | Aborting task ps_separation_82_2s_mix3_1_130091_0: exceeded elapsed time limit 56.69 (2960429.33G/52224.92G)
19/07/2011 10:51:20 | Milkyway@home | Computation for task ps_separation_82_2s_mix3_1_130095_0 finished
Is there any solution possible? |
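The two numbers in parentheses in the abort message appear to decode the limit directly: dividing the fpops bound by the assumed speed reproduces the 56.69-second ceiling (a sketch, assuming the log format is bound-in-GFLOPs over speed-in-GFLOPS):

```python
def elapsed_time_limit(fpops_bound_g, flops_g_per_sec):
    """Seconds BOINC allows before aborting: fpops bound / assumed speed."""
    return fpops_bound_g / flops_g_per_sec

# Numbers from the log line "56.69 (2960429.33G/52224.92G)":
limit = elapsed_time_limit(2960429.33, 52224.92)
print(round(limit, 2))  # 56.69
```

So the host is being credited with ~52 TFLOPS, and the real 2-5 minute runtime blows past the resulting 57-second bound every time.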
Send message Joined: 19 Jul 10 Posts: 578 Credit: 18,845,239 RAC: 856 |
Is there any solution possible? Yes, read this thread; there are many different solutions posted in it. EDIT: that's interesting and possibly something to look at when the admins try to fix this problem: the application details page of your host shows just a few completed WUs while the host has over 9 million credits, i.e. it got reset somehow recently, started estimating the runtimes from scratch, and then got it wrong like many others here. |
©2024 Astroinformatics Group