maximum time limit elapsed bug

TBeck_All
Joined: 30 Dec 09
Posts: 3
Credit: 56,529,517
RAC: 31,926
Message 50125 - Posted: 17 Jul 2011, 7:43:48 UTC

Still having the problem, although I did roll back the driver to 11.3.
Simplex0
Joined: 11 Nov 07
Posts: 232
Credit: 178,229,009
RAC: 0
Message 50126 - Posted: 17 Jul 2011, 8:29:12 UTC - in response to Message 50125.  

Strange... Maybe it is related to the BOINC manager version? I use BOINC 6.10.60.
S@NL - EStorm
Joined: 15 Jul 11
Posts: 14
Credit: 5,978,191
RAC: 0
Message 50140 - Posted: 17 Jul 2011, 18:45:33 UTC - in response to Message 50126.  
Last modified: 17 Jul 2011, 18:51:42 UTC

I had the same problem until I changed the <duration_correction_factor> in the client_state.xml file (I changed it first to 100, which was way too high, then changed it to 10 and let it run). Currently it is 0.959884, so it's OK now.
I found this in one of the forums; if it was already mentioned here, I apologize.
Currently running Win7 x64, Catalyst 11.6b.

Found it in the Number Crunching thread: Massive "exceeded elapsed time limit" errors
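
For anyone trying this fix: assuming a standard client_state.xml layout, the <duration_correction_factor> element sits inside the relevant <project> block, and BOINC should be stopped before editing and restarted afterwards (as pointed out later in this thread). A minimal sketch, using the intermediate value from this post:

<project>
(...)
<duration_correction_factor>10.000000</duration_correction_factor>
(...)
</project>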
Bornerdogge
Joined: 19 Sep 08
Posts: 4
Credit: 1,955,671
RAC: 0
Message 50143 - Posted: 17 Jul 2011, 20:41:25 UTC - in response to Message 50140.  

Nope, doesn't work for me... (Win XP 32-bit, Radeon HD 4830)
Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,239
RAC: 856
Message 50144 - Posted: 17 Jul 2011, 20:59:12 UTC
Last modified: 17 Jul 2011, 21:03:43 UTC

Here is a thought for everyone having the problem, at least those of you who like to experiment:

1. Get a full cache, disable network access in BOINC, and then stop BOINC.
2. For each WU, increase the rsc_fpops_bound value in client_state.xml by a factor of 1000 or even more; the exact factor doesn't matter. You may want to back up your data directory first.

<workunit>
(...)
<rsc_fpops_bound>140346600092088000.560000</rsc_fpops_bound>
(...)
</workunit>
The three extra zeros before the decimal point (shown in bold in the original post) are the ones you should add; you can add more of them if you want.
3. Crunch all WUs with the network disabled, then report all of them without requesting new tasks, then request new tasks on the next scheduler contact.
4. If you are still getting the error, try once more.

The idea is to tell the server how much time the computer actually needs for a WU. The time estimate is calculated by the server, and the value it is counting with obviously got screwed up. Or at least that's my theory... it might be wrong as well, but since everything works for me, I can't test it.
robertmiles
Joined: 30 Sep 09
Posts: 211
Credit: 36,977,315
RAC: 0
Message 50145 - Posted: 17 Jul 2011, 21:14:17 UTC - in response to Message 50126.  

Strange... Maybe it is related to the BOINC manager version? I use BOINC 6.10.60.


I've seen a number of posts elsewhere indicating that it easily could be, and those with this problem (and some of those who don't have it) should mention which BOINC version they are using. It seems that some of the 6.12.* versions of BOINC fail to initialize a certain variable often used in calculating time limits for workunits, and therefore produce time limits as if a random number had been substituted for this variable.

Also, some people have seen similar problems under some of the 6.10.* versions, but apparently from some other cause not yet identified.

I'm using BOINC 6.10.58 under 64-bit Windows Vista SP2 with a GTS 450, and don't see this problem.

If you can find out just WHICH variable isn't initialized, there MIGHT be a way to have app_info.xml supply a substitute value for that variable and therefore provide a good workaround for the affected 6.12.* versions of BOINC, even if this does not also work for the similar problems under the 6.10.* versions.
S@NL - EStorm
Joined: 15 Jul 11
Posts: 14
Credit: 5,978,191
RAC: 0
Message 50147 - Posted: 17 Jul 2011, 21:25:32 UTC - in response to Message 50140.  

Running 6.12.33.

I had the same problem until I changed the <duration_correction_factor> in the client_state.xml file (I changed it first to 100, which was way too high, then changed it to 10 and let it run). Currently it is 0.959884, so it's OK now.
I found this in one of the forums; if it was already mentioned here, I apologize.
Currently running Win7 x64, Catalyst 11.6b.

Found it in the Number Crunching thread: Massive "exceeded elapsed time limit" errors


And I do hope you stopped BOINC before the change and restarted it afterwards?
robertmiles
Joined: 30 Sep 09
Posts: 211
Credit: 36,977,315
RAC: 0
Message 50149 - Posted: 17 Jul 2011, 21:29:54 UTC - in response to Message 50007.  
Last modified: 17 Jul 2011, 21:41:43 UTC

Yeah, tried the 75% option, which didn't work.

GPU temp I've not checked, but the same GPU works in another rig without any problems, and all my systems are in Cooler Master test benches, open air.

Full legal OEM copy of Windows 7 Home Premium 64-bit.

Works fine with Collatz.

Thanks


How many CPU cores does your computer have, and does that count include the extra ones that hyperthreading makes appear on some CPUs?

75% should be fine if your CPU is enough like mine (4 CPU cores, no hyperthreading), but you're likely to need some other value otherwise.
S@NL - EStorm
Joined: 15 Jul 11
Posts: 14
Credit: 5,978,191
RAC: 0
Message 50159 - Posted: 18 Jul 2011, 7:44:32 UTC
Last modified: 18 Jul 2011, 7:47:31 UTC

I did some testing (running the same program, milkyway_separation_0.82_windows_x86_64__ati14.exe, in both cases):

When I remove my app_info.xml file and look at the properties of the workunit:
Estimated app speed 36043.67 GFLOPs/sec
Estimated task size 29604 GFLOPs
Resources 0.05 CPUs + 1.00 ATI GPUs (device 1)

And now it crashes.

With app_info.xml:
Estimated app speed 100.00 GFLOPs/sec
Estimated task size 14777 GFLOPs
Resources 0.02 CPUs + 1.00 ATI GPUs (device 1)


So it's no wonder the "maximum time limit elapsed" error occurs. Or am I wrong?

I have the following in app_info.xml: <flops>1.0e11</flops>.
Perhaps a combination of the duration correction factor and the flops calculation is way off.
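
For context, the <flops> tag mentioned here goes inside the <app_version> section of app_info.xml. A minimal sketch of such a file follows; the app name and version number are inferred from the executable name above and the "milkyway version 82" log lines later in the thread, so treat the exact values as assumptions rather than a verified configuration:

<app_info>
    <app>
        <name>milkyway</name>
    </app>
    <file_info>
        <name>milkyway_separation_0.82_windows_x86_64__ati14.exe</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>milkyway</app_name>
        <version_num>82</version_num>
        <flops>1.0e11</flops>
        <avg_ncpus>0.05</avg_ncpus>
        <coproc>
            <type>ATI</type>
            <count>1</count>
        </coproc>
        <file_ref>
            <file_name>milkyway_separation_0.82_windows_x86_64__ati14.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>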

It would also be nice if the files stayed on the server a little longer, so the results could be checked.
Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,239
RAC: 856
Message 50160 - Posted: 18 Jul 2011, 8:17:56 UTC - in response to Message 50159.  

Well, you have not completed even one WU without the app_info, so the server has no idea how fast your GPU is, and the initial estimate is waaay off. Have you tried to teach the server the way I described above?
S@NL - EStorm
Joined: 15 Jul 11
Posts: 14
Credit: 5,978,191
RAC: 0
Message 50162 - Posted: 18 Jul 2011, 8:27:17 UTC - in response to Message 50160.  
Last modified: 18 Jul 2011, 8:43:10 UTC

True, but if the application bombs out because of the estimates, how can it calculate the correct value?
I am running with the app_info, so it's no problem for me.
I also checked Collatz, which is running without app_info, and it reports a GFLOPS value below 200, not 36043.67, on the same GPU. I think it is better to start with a too-low GFLOPS value than a too-high one if the program is using this value to calculate how long the application should run: too high a value and the maximum time will surely be exceeded; too low a value and it will correct itself via the duration correction factor.
Since I don't know how the program calculates it, what I am saying could be total rubbish.
I also think the program estimates how long it should run based on these values, and if it exceeds that, it gives the error.
When I started on MilkyWay a few days ago, all the workunits crashed, so how can the server learn the correct values if they all crash? Only after adding the app_info and changing the duration correction factor did the program run without error.
I have no problem changing the XML files to fool the server, but your average user would.
Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,239
RAC: 856
Message 50163 - Posted: 18 Jul 2011, 8:45:57 UTC - in response to Message 50162.  

True, but if the application bombs out because of the estimates, how can it calculate the correct value?

The problem is (as I understand it) that part of the estimate is calculated by the server, which, in the case of a new host, has to simply guess how fast it is, and that guess can be completely wrong, as we see. Once it has some valid results, it can use those to calculate the values for the next tasks, so the estimate gets better. However, this needs a few results before it kicks in, which is why I suggested the slightly more complicated procedure above. On SETI it's 10 valid results; I don't know the number here, but it was something close to that when my HD 3850 started to crunch here.

Sure, if you want to use the app_info anyway, there's no need to fix the values for the stock app.
S@NL - EStorm
Joined: 15 Jul 11
Posts: 14
Credit: 5,978,191
RAC: 0
Message 50164 - Posted: 18 Jul 2011, 8:52:31 UTC - in response to Message 50163.  
Last modified: 18 Jul 2011, 9:01:34 UTC

As far as I understand the program, and as I see in the properties of the workunit without app_info, the server gives you a time in which you have to finish the WU; in my case it was a few days or so. But the WU takes about 2 to 5 minutes to finish and then gives the "maximum time limit elapsed" error. So it looks to me like the program estimates how long the task should take, and with a too-high GFLOPS value it can never finish within that estimate, so the program gives the error even though it finished well within the time given by the server.

Without app_info:
Estimated app speed 36043.67 GFLOPs/sec
Estimated task size 29604 GFLOPs

Time needed would be less than a second.
Perhaps that is why it is crashing.

With app_info.xml:
Estimated app speed 100.00 GFLOPs/sec
Estimated task size 14777 GFLOPs

Time needed: about 148 seconds.
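
Checking the arithmetic behind these two estimates, assuming runtime = task size / app speed: 29604 GFLOPs / 36043.67 GFLOPs/sec ≈ 0.82 seconds without app_info, and 14777 GFLOPs / 100 GFLOPs/sec ≈ 147.8 seconds with it, so both figures quoted above are consistent with that relation.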
Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,239
RAC: 856
Message 50168 - Posted: 18 Jul 2011, 10:37:14 UTC - in response to Message 50164.  

As far as I understand the program, and as I see in the properties of the workunit without app_info, the server gives you a time in which you have to finish the WU; in my case it was a few days or so.

That's the report deadline, and it has nothing to do with this problem. The max runtime is calculated from the rsc_fpops_bound value, which, since the introduction of server-side DCF, is adjusted by the server for each host and each app so that the local DCF should always stay around 1. The idea behind that is that you should not need any flops entries and the time estimates should be right for both GPU and CPU apps. That was not always the case before that change, which is why we needed different flops values in the app_info to have the right amount of tasks in the cache, and people running stock apps always had problems, at least at SETI; I don't know about here. But if the server gets it completely wrong at the beginning and all tasks error out, it can't correct itself, since it doesn't get any data to do that adjustment.
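
The relation being described, as far as it can be read off the abort messages quoted later in this thread, appears to be roughly:

max elapsed time = rsc_fpops_bound / estimated device speed (FLOPS)

so multiplying rsc_fpops_bound by 1000, as suggested earlier, raises the enforced time limit by the same factor.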


S@NL - EStorm
Joined: 15 Jul 11
Posts: 14
Credit: 5,978,191
RAC: 0
Message 50170 - Posted: 18 Jul 2011, 11:24:09 UTC - in response to Message 50168.  

OK, I can understand that.
But where does the "Estimated app speed 36043.67 GFLOPs/sec" come from? It is way too high.
Perhaps the programmer should change the program so that when this error occurs, the task is allowed to continue and the newly measured speed is reported to the server, or the server-side value is reset when the error occurs.
Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,239
RAC: 856
Message 50171 - Posted: 18 Jul 2011, 12:04:42 UTC - in response to Message 50170.  

Where do you get this value from?
S@NL - EStorm
Joined: 15 Jul 11
Posts: 14
Credit: 5,978,191
RAC: 0
Message 50173 - Posted: 18 Jul 2011, 12:12:13 UTC - in response to Message 50171.  
Last modified: 18 Jul 2011, 12:18:13 UTC

When the program is running, check the properties of the WU in the BOINC Manager.
That is without app_info, of course; otherwise it would show the value from app_info, if one is entered in the file. But perhaps it is only reporting this high number on my machine.
Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,239
RAC: 856
Message 50175 - Posted: 18 Jul 2011, 13:03:08 UTC - in response to Message 50173.  

Hmm... my 6.10.18 Manager doesn't have that yet, but I guess the app speed is the value from the <app_version><flops> tag, which can be found between the <file_info> and <workunit> tags of each project in client_state.xml. And the "task size" is then <rsc_fpops_est>. So changing the <flops> in <app_version> might possibly help as well, but I would still go with changing <rsc_fpops_bound> and letting BOINC adjust the other values by itself after it gets some "experience" with the app.
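
A sketch of the layout being described, assuming a standard client_state.xml; the tag placement follows this post, and the values are the ones quoted earlier in the thread:

<app_version>
    <app_name>milkyway</app_name>
    <version_num>82</version_num>
    <flops>1.0e11</flops>
    (...)
</app_version>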
morse [E.R.] - BOINC.Italy
Joined: 1 May 09
Posts: 1
Credit: 18,434,272
RAC: 0
Message 50227 - Posted: 19 Jul 2011, 9:06:18 UTC

All my last workunits on the ATI HD 5850 went wrong with the message:


19/07/2011 10:50:20 | Milkyway@home | Aborting task ps_separation_82_2s_mix3_1_130095_0: exceeded elapsed time limit 56.69 (2960429.33G/52224.92G)
19/07/2011 10:50:22 | Milkyway@home | Computation for task ps_separation_82_2s_mix3_1_130099_0 finished
19/07/2011 10:50:22 | Milkyway@home | Starting task ps_separation_82_2s_mix3_1_130091_0 using milkyway version 82
19/07/2011 10:51:20 | Milkyway@home | Aborting task ps_separation_82_2s_mix3_1_130091_0: exceeded elapsed time limit 56.69 (2960429.33G/52224.92G)
19/07/2011 10:51:20 | Milkyway@home | Computation for task ps_separation_82_2s_mix3_1_130095_0 finished


Is there any possible solution?
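
Reading the abort lines above under the assumption that the two bracketed numbers are rsc_fpops_bound and the estimated device speed: 2960429.33 GFLOPs / 52224.92 GFLOPs/sec ≈ 56.69 seconds, which is exactly the printed limit. In other words, the server is estimating this host at over 52 TFLOPS, far beyond what any single 2011 GPU can deliver, so the limit is hit after less than a minute.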
Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,239
RAC: 856
Message 50228 - Posted: 19 Jul 2011, 9:11:11 UTC - in response to Message 50227.  
Last modified: 19 Jul 2011, 9:30:47 UTC

Is there any possible solution?

Yes, read this thread; there are many different solutions posted in it.


EDIT: that's interesting, and possibly something to look at when the admins try to fix this problem: the application details page for your host shows just a few completed WUs, while the host has over 9 million credits, i.e. it somehow got reset recently, started estimating the runtimes from scratch, and then got it wrong like it did for many others here.
