Welcome to MilkyWay@home

Broken WUs

Message boards : Number crunching : Broken WUs
Profile gigadisk

Joined: 24 Apr 08
Posts: 3
Credit: 44,584,709
RAC: 0
Message 30965 - Posted: 15 Sep 2009, 16:13:43 UTC

I still have many invalid WUs. This is bad for me: about 30% of my WUs are invalid, like this one:
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=103466262

:( I can compute whatever, but I want credit for it :P
ID: 30965 · Rating: 0
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 30973 - Posted: 15 Sep 2009, 17:50:48 UTC

Just looking through the linked invalid WUs, one thing caught my eye. All of them were calculated on overclocked Intel CPUs (some of them quite heavily overclocked). Just coincidence? Is there anyone here whose WUs fail on a stock-clocked CPU and/or on an AMD CPU?

One has to know that the optimized application stresses the SSE units quite hard. That applies even more to the 0.20 version, as its speedup compared to 0.19 results solely from pushing more calculations through those poor units; no operations were shaved off compared to 0.19. Furthermore, the L1 cache is also stressed more with 0.20.

As the instruction mix varies between the different types of WUs, it could very well be that systems at the edge of stability are able to compute the _3s WUs with more emphasis on exponential functions (besides the usual additions and multiplications), but fail at the shorter _2s WUs with a higher density of square roots and divisions. Maybe those instructions are a first indicator of hitting the clock speed limit for a given CPU.
ID: 30973 · Rating: 0
STE\/E

Joined: 29 Aug 07
Posts: 486
Credit: 576,548,171
RAC: 0
Message 30974 - Posted: 15 Sep 2009, 18:30:42 UTC
Last modified: 15 Sep 2009, 18:43:00 UTC

All my Boxes are running @ Stock Speed for the CPUs and the GPUs, trying to save some Electrical costs since I have to pay for the Electricity they use. I noticed I had a few that were marked invalid running the 0.20 Application. The Forum is Dog slow right now but I'll try & find a few of them if I can.

PS: The only 3 I could find are here ...
ID: 30974 · Rating: 0
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 30975 - Posted: 15 Sep 2009, 18:46:05 UTC - in response to Message 30974.  
Last modified: 15 Sep 2009, 18:47:09 UTC

All my Boxes are running @ Stock Speed for the CPUs and the GPUs, trying to save some Electrical costs since I have to pay for the Electricity they use. I noticed I had a few that were marked invalid running the 0.20 Application. The Forum is Dog slow right now but I'll try & find a few of them if I can.

As I see it, all your computers run the ATI app and not the CPU application. The invalid WUs you saw yesterday were caused by some broken WUs sent by the server. This appears to be fixed now. I didn't have a single invalid WU on my computers in the last few hours (the button for showing only invalid WUs is quite a nice add-on).

But some people using the CPU app are still complaining about invalid WUs, and some get them irrespective of the app version used (i.e. 0.19 or 0.20 for CPU). That looks highly suspicious to me, especially as everyone complaining so far has overclocked their CPU.
ID: 30975 · Rating: 0
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 30981 - Posted: 15 Sep 2009, 19:21:05 UTC - in response to Message 30975.  

What's strange is around 99.9% of the WUs reported are fine. It just seems that some hosts are having issues. I'm wondering if there's some weird application issue. I haven't seen any stock applications fail yet.
ID: 30981 · Rating: 0
John Clark

Joined: 4 Oct 08
Posts: 1734
Credit: 64,228,409
RAC: 0
Message 30983 - Posted: 15 Sep 2009, 19:30:41 UTC
Last modified: 15 Sep 2009, 19:34:34 UTC

Looking at /Your account/Your Computers/Tasks for each rig/ I see -

1. My P3 using CPU runs the results OK, but it takes 10 hours to do a WU.

2. My Penryn quad does a WU in 28 minutes and has problems with _82_2s_6_ not validating. All other WUs, including _82_3s_6_, are OK. It's just the _82_2s_6_ with the trouble, and these seem to form about 30% of what work I get.
3. The two machines with ATI GPUs are crunching and validating all WUs, including the _82_2s_6_ ones.

So, the problem, as defined here, is with those WUs being picked up on CPUs.
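A pattern like this (one WU type failing, the others fine) can be checked across a whole task list by tallying invalid results per WU type. The following is only a minimal Python sketch: it assumes the stream type (`2s`, `3s`) can be pulled out of the WU name with a regular expression, and the function name, the `(name, state)` tuple layout, and the sample task names are all made up for illustration. None of it is a BOINC or MilkyWay@home API.

```python
import re
from collections import defaultdict


def invalid_rate_by_type(tasks):
    """Tally (invalid, total) per WU type for a list of (wu_name, state) pairs.

    The WU type is assumed to be the first '_<digits>s_' token in the name,
    e.g. '2s' in 'de_constrainted_82_2s_6_...'. Names without such a token
    are grouped under 'unknown'.
    """
    counts = defaultdict(lambda: [0, 0])  # type -> [invalid, total]
    for name, state in tasks:
        match = re.search(r"_(\d+s)_", name)
        wu_type = match.group(1) if match else "unknown"
        counts[wu_type][1] += 1
        if state == "invalid":
            counts[wu_type][0] += 1
    return {wu_type: tuple(pair) for wu_type, pair in counts.items()}


# Hypothetical sample data, NOT taken from any real host.
sample = [
    ("de_constrainted_82_2s_6_0001_0", "invalid"),
    ("de_constrainted_82_2s_6_0002_0", "invalid"),
    ("de_constrainted_82_3s_6_0003_0", "valid"),
    ("de_constrainted_82_3s_6_0004_0", "valid"),
]

for wu_type, (bad, total) in sorted(invalid_rate_by_type(sample).items()):
    print(f"{wu_type}: {bad}/{total} invalid")
```

If the invalid results cluster in one type on one host but not on others, that narrows the search to either that host or something specific to how that WU type is handled.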

Just adding -

3 rigs are Win XP Pro and the dual P3 is Win2K pro. All use BOINC Manager 6.4.7 and -

(a) The CPU rigs run the opti client 0.19 SSE version and the opti client 0.19 SSSE3.
(b) The two GPU rigs run Gispel's new 0.20 client.
Go away, I was asleep


ID: 30983 · Rating: 0
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 30984 - Posted: 15 Sep 2009, 19:37:29 UTC - in response to Message 30981.  
Last modified: 15 Sep 2009, 19:55:06 UTC

What's strange is around 99.9% of the WUs reported are fine. It just seems that some hosts are having issues. I'm wondering if there's some weird application issue. I haven't seen any stock applications fail yet.

Yes, overclocked hosts have issues. Someone using the stock app is not greedy enough to also overclock their CPU ;)
And as I explained, the optimized apps tax a system much more, as they use more of the chip's execution units within those vectorized loops.
ID: 30984 · Rating: 0
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 30986 - Posted: 15 Sep 2009, 19:45:51 UTC - in response to Message 30983.  
Last modified: 15 Sep 2009, 19:51:36 UTC

1. My P3 using CPU runs the results OK, but it takes 10 hours to do a WU.

2. My Penryn quad does a WU in 28 minutes and has problems with _82_2s_6_ not validating. All other WUs, including _82_3s_6_, are OK. It's just the _82_2s_6_ with the trouble, and these seem to form about 30% of what work I get.

Yes, and that specific Penryn is overclocked from 3GHz to 3.8GHz. Maybe that is a tiny bit above the margins? Have you tried it at stock clocks?
As I demonstrated above, the different app versions are calculating the exact same results on a stable system also for this WU type. It is really strange that all those WUs don't validate from your computer.

3. The two machines with ATI GPUs are crunching and validating all WUs, including the _82_2s_6_ ones.

Obviously they run stable.

So if it is not BOINC 6.10.4 mixing up the input and/or result files for some percentage of the WUs (but it would do so independent of the type), it is very probable that you are experiencing a stability problem on the Penryn.

Edit:
This is from an invalidated WU of your Penryn. The stock clock (3.00GHz) and the actual clock frequency (3.79987 GHz) both appear in the CPU line:

<core_client_version>6.4.7</core_client_version>
<![CDATA[
<stderr_txt>
Running Milkyway@home version 0.19 by Gipsel
CPU: Intel(R) Core(TM)2 Extreme CPU X9650 @ 3.00GHz (4 cores/threads) 3.79987 GHz (366ms)

WU completed. It took 1197.63 seconds CPU time and 1198.46 seconds wall clock time @ 3.79988 GHz.

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 8.49423808039293
Granted credit 0
application version 0.19
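The two clock figures in that CPU line can be pulled out mechanically to see how far a host runs above its nominal clock. This is a rough Python sketch, assuming the line always has the shape shown above (nominal clock as `@ <x>GHz` in the CPU name, measured clock as `<y> GHz` after the closing parenthesis); the function name is made up, and other app versions may format the line differently.

```python
import re

# CPU line copied from the stderr output quoted above.
CPU_LINE = ("CPU: Intel(R) Core(TM)2 Extreme CPU X9650 @ 3.00GHz "
            "(4 cores/threads) 3.79987 GHz (366ms)")


def clock_ratio(cpu_line):
    """Return (nominal_ghz, measured_ghz, ratio), or None if parsing fails.

    Assumes the nominal clock is written as '@ <x>GHz' and the measured
    clock follows later in the line as ') <y> GHz'.
    """
    nominal = re.search(r"@\s*([\d.]+)\s*GHz", cpu_line)
    if nominal is None:
        return None
    # Only look at the text after the nominal clock for the measured value.
    measured = re.search(r"\)\s*([\d.]+)\s*GHz", cpu_line[nominal.end():])
    if measured is None:
        return None
    nominal_ghz = float(nominal.group(1))
    measured_ghz = float(measured.group(1))
    return nominal_ghz, measured_ghz, measured_ghz / nominal_ghz


parsed = clock_ratio(CPU_LINE)
if parsed is not None:
    nominal_ghz, measured_ghz, ratio = parsed
    print(f"nominal {nominal_ghz} GHz, measured {measured_ghz} GHz, "
          f"overclock {100 * (ratio - 1):.1f}%")
```

For the line above this works out to roughly a 27% overclock, consistent with the 3GHz to 3.8GHz figure mentioned earlier in the post.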
ID: 30986 · Rating: 0
STE\/E

Joined: 29 Aug 07
Posts: 486
Credit: 576,548,171
RAC: 0
Message 30989 - Posted: 15 Sep 2009, 19:51:03 UTC - in response to Message 30975.  

All my Boxes are running @ Stock Speed for the CPUs and the GPUs, trying to save some Electrical costs since I have to pay for the Electricity they use. I noticed I had a few that were marked invalid running the 0.20 Application. The Forum is Dog slow right now but I'll try & find a few of them if I can.

As I see it, all your computers run the ATI app and not the CPU application. The invalid WUs you saw yesterday were caused by some broken WUs sent by the server. This appears to be fixed now. I didn't have a single invalid WU on my computers in the last few hours (the button for showing only invalid WUs is quite a nice add-on).

But some people using the CPU app are still complaining about invalid WUs, and some get them irrespective of the app version used (i.e. 0.19 or 0.20 for CPU). That looks highly suspicious to me, especially as everyone complaining so far has overclocked their CPU.


Oops, forgot to really read what you were saying again, I'll go stand in the Corner with John ... :P ... ;)
ID: 30989 · Rating: 0
Profile [Russia] michs

Joined: 16 Oct 08
Posts: 18
Credit: 164,409,593
RAC: 0
Message 31001 - Posted: 15 Sep 2009, 20:35:24 UTC - in response to Message 30986.  
Last modified: 15 Sep 2009, 20:38:50 UTC

ID: 31001 · Rating: 0
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 31008 - Posted: 15 Sep 2009, 20:56:36 UTC - in response to Message 31001.  

I downclocked my host and still have errors.
Error at stock speed.
[..]
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=105916938
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=105916937

@ Travis:

Can you extract the parameters and results for those WUs from the database and post them here or send them to me by PM?

WUs:
ps_constrainted_82_2s_6_1044426_1253039459_0
ps_constrainted_82_2s_6_1044425_1253039459_0

I would like to run some tests on those WUs.

PS:
The first linked WU was started with the overclocked CPU and only finished from a checkpoint at stock clock, so I would disregard it.
ID: 31008 · Rating: 0
Profile [Russia] michs

Joined: 16 Oct 08
Posts: 18
Credit: 164,409,593
RAC: 0
Message 31011 - Posted: 15 Sep 2009, 21:14:03 UTC - in response to Message 31008.  

de_constrainted_82_2s_6_search_parameters_1078600_1253041937

de_constrainted_82_2s_6
parameters [14]: 0.424510724378298 5.459576507839513 -0.673558790328050 -50.000000000000000 22.700048295567910 1.362269043896991 -0.990224106288773 6.870676968365467 -0.439046188333093 27.820005081189180 23.599459349958643 0.654402464096181 1.168527166633252 7.455713423364958
metadata: i: 30, redundancy

http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=105952242

de_constrainted_82_2s_6_search_parameters_1078599_1253041937
de_constrainted_82_2s_6
parameters [14]: 0.426977971424483 5.503659154387430 -0.679553731990973 -50.000000000000000 22.669012398749622 1.376720787932886 -0.990254959141226 6.860532157700593 -0.434329137268066 28.048275509612360 23.600000000000001 0.653267504395833 1.184269744817431 7.339289461433097
metadata: i: 33, redundancy

http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=105952241
ID: 31011 · Rating: 0
John Clark

Joined: 4 Oct 08
Posts: 1734
Credit: 64,228,409
RAC: 0
Message 31018 - Posted: 15 Sep 2009, 23:02:50 UTC
Last modified: 15 Sep 2009, 23:31:50 UTC

Yes Cluster Physik.

That host has been clocked at 3.8GHz for nearly 1 year, after I down clocked it from 4.05GHz on air.

I will take your advice and down clock it to 3.5GHz in a short time. If the lower clock still results in invalid results on the _82_2s_6_ WUs then I will return the clock to stock.

Now downclocked by 300MHz to 3.5GHz.

Let's see how the results pan out in the morning on the _82_2s_6_ WUs collected and crunched during the night. I am off to sleep now.
Go away, I was asleep


ID: 31018 · Rating: 0
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 31019 - Posted: 15 Sep 2009, 23:20:11 UTC - in response to Message 31011.  
Last modified: 15 Sep 2009, 23:34:37 UTC

de_constrainted_82_2s_6_search_parameters_1078600_1253041937
[..]
de_constrainted_82_2s_6_search_parameters_1078599_1253041937
[..]

Thanks!

I will run those WUs with different apps (GPU as well as CPU) on my development box to see if there are any issues. If Travis reacts fast enough (I've sent him a PM) we may also get your (invalidated) results for those WUs, if Travis is able to extract it from the database (I hope it gets saved even if it was declared as invalid). Would be helpful to have something to compare the results to.

Edit:
GPU is through and I've found one possible reason already, even without running it on the CPU. There may be some combination of a glitch in the WU generation, some sloppy code in the app for reading the parameter file (also present in the stock app, nothing changed there from my side) and a buffer overflow problem in the validator (kicking in only for CPU units, as GPUs don't show this behaviour).
What's the problem? If the linefeed is missing after the metadata (WU generator glitch), there is nothing defining the end of the metadata string, so basically it is filled with a lot of junk after the real metadata (sloppy code for the parameter reading, it should stop at the file end). That junk is then also output in the results file, where it may have the exact same size as the buffer used for reading it (i.e. one character too many in it), which is then a problem for the validator reading the application signature.

But if that were true, I wonder why no GPU has problems with invalid WUs.
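To make the suspected failure mode concrete, here is a small Python sketch of reading the metadata line so that a missing trailing linefeed does not matter and an over-long value is rejected instead of being passed on. This is only an illustration under assumptions: the `metadata:` line layout is taken from the parameter dumps in this thread, while `MAX_METADATA_LEN` and the function name are hypothetical and say nothing about the real app or validator code.

```python
MAX_METADATA_LEN = 128  # hypothetical buffer size, not the real validator's


def read_metadata(text, max_len=MAX_METADATA_LEN):
    """Return the metadata value from parameter-file text, or None if absent.

    str.splitlines() ends the last line at end-of-input, so the result is
    the same whether or not the file finishes with a linefeed. A value that
    would not fit a buffer of max_len characters (including a terminator)
    raises ValueError rather than being truncated or overrunning.
    """
    for line in text.splitlines():
        if line.startswith("metadata:"):
            value = line[len("metadata:"):].strip()
            if len(value) >= max_len:
                raise ValueError("metadata too long for buffer")
            return value
    return None


with_linefeed = "parameters [2]: 0.1 0.2\nmetadata: i: 30, redundancy\n"
without_linefeed = "parameters [2]: 0.1 0.2\nmetadata: i: 30, redundancy"

print(read_metadata(with_linefeed))
print(read_metadata(without_linefeed))
```

The point of the length check is that the failure becomes loud at read time, instead of showing up later as a result file the validator cannot parse.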

Results with the ATI application:

de_constrainted_82_2s_6_search_parameters_1078600_1253041937
de_constrainted_82_2s_6
parameters [14]: 0.424510724378298 5.459576507839513 -0.673558790328050 -50.000000000000000 22.700048295567910 1.362269043896991 -0.990224106288773 6.870676968365467 -0.439046188333093 27.820005081189180 23.599459349958643 0.654402464096181 1.168527166633252 7.455713423364958
metadata: i: 30, redundancy
fitness: -2.975430079432712
Gipsel_GPU_CAL_0.20_x64: 0.20


de_constrainted_82_2s_6_search_parameters_1078599_1253041937
de_constrainted_82_2s_6
parameters [14]: 0.426977971424483 5.503659154387430 -0.679553731990973 -50.000000000000000 22.669012398749622 1.376720787932886 -0.990254959141226 6.860532157700593 -0.434329137268066 28.048275509612360 23.600000000000001 0.653267504395833 1.184269744817431 7.339289461433097
metadata: i: 33, redundancy
fitness: -2.975427308602937
Gipsel_GPU_CAL_0.20_x64: 0.20
ID: 31019 · Rating: 0
John R. @ SETI.USA

Joined: 1 Jan 09
Posts: 15
Credit: 85,816,654
RAC: 0
Message 31020 - Posted: 15 Sep 2009, 23:37:21 UTC

If you need any more invalid workunits, just look up the stats for any of my 3 quads.

Yes, they are all overclocked......no, I don't believe it is the O/C that is at fault.

They have been rock steady for months......and they share time with SETI and aren't having any trouble with W/Us there.


And CP, you're the man I believe can convince Travis to fix what they broke.

And probably show him how to fix it too............
ID: 31020 · Rating: 0
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 31022 - Posted: 16 Sep 2009, 0:00:14 UTC - in response to Message 31019.  
Last modified: 16 Sep 2009, 0:16:06 UTC

Results with the ATI application:

de_constrainted_82_2s_6_search_parameters_1078600_1253041937
de_constrainted_82_2s_6
parameters [14]: 0.424510724378298 5.459576507839513 -0.673558790328050 -50.000000000000000 22.700048295567910 1.362269043896991 -0.990224106288773 6.870676968365467 -0.439046188333093 27.820005081189180 23.599459349958643 0.654402464096181 1.168527166633252 7.455713423364958
metadata: i: 30, redundancy
fitness: -2.975430079432712
Gipsel_GPU_CAL_0.20_x64: 0.20

Hmm, the result with the Win64_SSE3 application (same as michs used, turned invalid for him) is exactly the same as on my GPU (which doesn't show any invalid WUs):

de_constrainted_82_2s_6_search_parameters_1078600_1253041937
de_constrainted_82_2s_6
parameters [14]: 0.424510724378298 5.459576507839513 -0.673558790328050 -50.000000000000000 22.700048295567910 1.362269043896991 -0.990224106288773 6.870676968365467 -0.439046188333093 27.820005081189180 23.599459349958643 0.654402464096181 1.168527166633252 7.455713423364958
metadata: i: 30, redundancy
fitness: -2.975430079432712
Gipsel_0.20_x64_SSE3: 0.20


Let's hope Travis can tell us the content of michs' result file. Other than that I have to state that the app he uses calculates correctly and arrives at exactly the same output as on a GPU. I can't find anything wrong with the applications for those invalid WUs. Guess the second one will show the same result (but I will run it either way).
ID: 31022 · Rating: 0
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 31024 - Posted: 16 Sep 2009, 0:40:40 UTC - in response to Message 31019.  
Last modified: 16 Sep 2009, 0:50:52 UTC

As expected, the second WU also returns the same result irrespective of the platform it is crunched on:

0.20 ATI version (gets validated)

de_constrainted_82_2s_6_search_parameters_1078599_1253041937
de_constrainted_82_2s_6
parameters [14]: 0.426977971424483 5.503659154387430 -0.679553731990973 -50.000000000000000 22.669012398749622 1.376720787932886 -0.990254959141226 6.860532157700593 -0.434329137268066 28.048275509612360 23.600000000000001 0.653267504395833 1.184269744817431 7.339289461433097
metadata: i: 33, redundancy
fitness: -2.975427308602937
Gipsel_GPU_CAL_0.20_x64: 0.20

result with 0.20 Win64_SSE3 version (michs got invalid results with the same app for this exact WU)

de_constrainted_82_2s_6_search_parameters_1078599_1253041937
de_constrainted_82_2s_6
parameters [14]: 0.426977971424483 5.503659154387430 -0.679553731990973 -50.000000000000000 22.669012398749622 1.376720787932886 -0.990254959141226 6.860532157700593 -0.434329137268066 28.048275509612360 23.600000000000001 0.653267504395833 1.184269744817431 7.339289461433097
metadata: i: 33, redundancy
fitness: -2.975427308602937
Gipsel_0.20_x64_SSE3: 0.20


If Travis comes up with the results those two WUs actually returned when both were calculated by michs, we will see if it is a very strange bug of the validator (marking only CPU WUs invalid) or if he returned different values for some reason. But what I can say with some confidence is that the apps all calculate the same. It makes no sense that GPU WUs validate and CPU WUs don't just because they were crunched on a different platform but deliver the exact same results.
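The by-eye comparison of result files above can also be done in a few lines of Python. This is a sketch under assumptions: it treats each result as `key: value` lines in the layout posted here, compares only the `fitness` field, and uses an arbitrary relative tolerance. The function names and the tolerance are made up, and the real validator's criterion is not known from this thread.

```python
import math


def parse_result(text):
    """Split 'key: value' lines into a dict; lines without ':' are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def same_fitness(result_a, result_b, rel_tol=1e-12):
    """True if both results carry a fitness value agreeing within rel_tol."""
    a = parse_result(result_a).get("fitness")
    b = parse_result(result_b).get("fitness")
    if a is None or b is None:
        return False
    return math.isclose(float(a), float(b), rel_tol=rel_tol)


# Tail of the two outputs for WU ..._1078599_... as posted above.
gpu_result = """metadata: i: 33, redundancy
fitness: -2.975427308602937
Gipsel_GPU_CAL_0.20_x64: 0.20
"""
cpu_result = """metadata: i: 33, redundancy
fitness: -2.975427308602937
Gipsel_0.20_x64_SSE3: 0.20
"""

print(same_fitness(gpu_result, cpu_result))
```

Only the application signature line differs between the two outputs, which is why a platform-dependent problem in how that line is read is a natural suspect.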
ID: 31024 · Rating: 0
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 31035 - Posted: 16 Sep 2009, 3:22:04 UTC - in response to Message 31024.  

What's strange is all of michs' and John's WUs seem to be validating correctly. I'm going to try and see if I can catch any more bad workunits.
ID: 31035 · Rating: 0
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 31036 - Posted: 16 Sep 2009, 3:32:54 UTC - in response to Message 31035.  
Last modified: 16 Sep 2009, 3:35:20 UTC

What's strange is all of michs' and John's WUs seem to be validating correctly. I'm going to try and see if I can catch any more bad workunits.

But only the _3s WUs or on GPUs. Look at the tasks of John's computer here or michs' one here! Every _2s_6 WU is marked as invalid there.

It would be nice to know the results they sent to the server and compare them with offline generated results on machines known to produce validating results. Either their machines suddenly return faulty data for some reason or the validator makes some weird choices based on the platform the WUs were crunched on.
ID: 31036 · Rating: 0
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 31037 - Posted: 16 Sep 2009, 3:37:45 UTC - in response to Message 31035.  

I've made an update to the validator, let me know if it helped anything.
ID: 31037 · Rating: 0