Broken WUs
Author | Message |
---|---|
Send message Joined: 24 Apr 08 Posts: 3 Credit: 44,584,709 RAC: 0 |
I still have many invalid WUs. This is bad for me: 30% of my WUs are invalid, like this one: http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=103466262 :( I can compute whatever, but I want credit for it :P |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
Just looking through the linked invalid WUs, one thing caught my eye. All of them were calculated on overclocked Intel CPUs (some of them quite heavily overclocked). Just coincidence? Is there anyone here whose WUs fail on a stock-clocked CPU and/or on an AMD CPU? One has to know that the optimized application stresses the SSE units quite hard. That applies even more to the 0.20 version, as the speedup compared to 0.19 results solely from pushing more calculations through those poor units; no operations were shaved off compared to 0.19. Furthermore, the L1 cache is also stressed more with 0.20. As the instruction mix varies between the different types of WUs, it could very well be that systems at the edge of stability are able to compute the _3s WUs, with more emphasis on exponential functions (besides the usual additions and multiplications), but fail at the shorter _2s WUs with a higher density of square roots and divisions. Maybe those instructions are a first indicator of hitting the clock speed limit for a given CPU. |
Send message Joined: 29 Aug 07 Posts: 486 Credit: 576,548,171 RAC: 0 |
All my boxes are running at stock speed for the CPUs and the GPUs, trying to save some electrical costs since I have to pay for the electricity they use. I noticed I had a few that were marked invalid running the 0.20 application. The forum is dog slow right now but I'll try to find a few of them if I can. PS: The only 3 I could find are here ... |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
All my boxes are running at stock speed for the CPUs and the GPUs, trying to save some electrical costs since I have to pay for the electricity they use. I noticed I had a few that were marked invalid running the 0.20 application. The forum is dog slow right now but I'll try to find a few of them if I can. As I see it, all your computers run the ATI app and not the CPU application. The invalid WUs you saw yesterday were caused by some broken WUs sent by the server. This appears to be fixed now. I haven't had a single invalid WU on my computers in the last few hours (the button for showing only invalid WUs is quite a nice addon). But some people using the CPU app are still complaining about invalid WUs, and some get them irrespective of the app version used (i.e. 0.19 or 0.20 for CPU). That looks highly suspicious to me, especially as everyone complaining so far has overclocked their CPU. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
What's strange is around 99.9% of the WUs reported are fine. It just seems that some hosts are having issues. I'm wondering if there's some weird application issue. I haven't seen any stock applications fail yet. |
Send message Joined: 4 Oct 08 Posts: 1734 Credit: 64,228,409 RAC: 0 |
Looking at /Your account/Your Computers/Tasks for each rig/ I see - 1. My P3 using CPU runs the results OK, but it takes 10 hours to do a WU. 2. My Penryn quad does a WU in 28 minutes and has problems with _82_2s_6_ not validating. All other WUs, including _82_3s_6_, are OK. It's just the _82_2s_6_ with the trouble, and these seem to form about 30% of the work I get. 3. The two machines with ATI GPUs are crunching and validating all WUs, including the _82_2s_6_ ones. So, the problem, as defined here, is with those WUs being picked up on CPUs. Just adding - 3 rigs are Win XP Pro and the dual P3 is Win2K Pro. All use BOINC Manager 6.4.7 and - (a) The CPU rigs run the opti client 0.19 SSE version and the opti client 0.19 SSSE3. (b) The two GPU rigs run Gispel's new 0.20 client. Go away, I was asleep |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
What's strange is around 99.9% of the WUs reported are fine. It just seems that some hosts are having issues. I'm wondering if there's some weird application issue. I haven't seen any stock applications fail yet. Yes, overclocked hosts have issues. Someone using the stock app is not greedy enough to also overclock their CPU ;) And as I explained, the optimized apps tax a system much more, as they use more of the chip's execution units within those vectorized loops. |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
1. My P3 using CPU runs the results OK, but it takes 10 hours to do a WU. Yes, and that specific Penryn is overclocked from 3GHz to 3.8GHz. Maybe that is a tiny bit above the margins? Have you tried it at stock clocks? As I demonstrated above, the different app versions calculate the exact same results on a stable system for this WU type as well. It is really strange that none of those WUs from your computer validate. 3. The two machines with ATI GPUs are crunching and validating all WUs, including the _82_2s_6_ ones. Obviously they run stable. So if it is not BOINC 6.10.4 mixing up the input and/or result files for some percentage of the WUs (but it does so independent of the type), it is very probable that you are experiencing a stability problem on the Penryn. Edit: This is from an invalidated WU of your Penryn; black and bold is the stock clock, red and bold the actual clock frequency: <core_client_version>6.4.7</core_client_version> |
Send message Joined: 29 Aug 07 Posts: 486 Credit: 576,548,171 RAC: 0 |
All my boxes are running at stock speed for the CPUs and the GPUs, trying to save some electrical costs since I have to pay for the electricity they use. I noticed I had a few that were marked invalid running the 0.20 application. The forum is dog slow right now but I'll try to find a few of them if I can. Oops, forgot to really read what you were saying again. I'll go stand in the corner with John ... :P ... ;) |
Send message Joined: 16 Oct 08 Posts: 18 Credit: 164,409,593 RAC: 0 |
I downclocked my host and still have errors. Errors at stock speed: http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=105882572 http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=105916938 http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=105916937 |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
I downclocked my host and still have errors. @ Travis: Can you extract the parameters and results for those WUs from the database and post them here or send them to me by PM? WUs: ps_constrainted_82_2s_6_1044426_1253039459_0 ps_constrainted_82_2s_6_1044425_1253039459_0 I would like to run some tests on those WUs. PS: The first linked WU was started with the overclocked CPU and only finished from a checkpoint at stock clock, so I would disregard it. |
Send message Joined: 16 Oct 08 Posts: 18 Credit: 164,409,593 RAC: 0 |
de_constrainted_82_2s_6_search_parameters_1078600_1253041937 de_constrainted_82_2s_6 parameters [14]: 0.424510724378298 5.459576507839513 -0.673558790328050 -50.000000000000000 22.700048295567910 1.362269043896991 -0.990224106288773 6.870676968365467 -0.439046188333093 27.820005081189180 23.599459349958643 0.654402464096181 1.168527166633252 7.455713423364958 metadata: i: 30, redundancy http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=105952242 de_constrainted_82_2s_6_search_parameters_1078599_1253041937 de_constrainted_82_2s_6 parameters [14]: 0.426977971424483 5.503659154387430 -0.679553731990973 -50.000000000000000 22.669012398749622 1.376720787932886 -0.990254959141226 6.860532157700593 -0.434329137268066 28.048275509612360 23.600000000000001 0.653267504395833 1.184269744817431 7.339289461433097 metadata: i: 33, redundancy http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=105952241 |
Send message Joined: 4 Oct 08 Posts: 1734 Credit: 64,228,409 RAC: 0 |
Yes Cluster Physik. That host has been clocked at 3.8GHz for nearly 1 year, after I down clocked it from 4.05GHz on air. I will take your advice and down clock it to 3.5GHz in a short time. If the lower clock still results in invalid results on the _82_2s_6_ WUs then I will return the clock to stock. Now downclocked by 300MHz to 3.5GHz. Let's see how the results pan out in the morning on the _82_2s_6_ WUs collected and crunched during the night. I am off to sleep now. Go away, I was asleep |
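For scale, the clock speeds mentioned for this Penryn can be expressed as a percentage over the 3 GHz stock clock quoted earlier in the thread. This is only a small arithmetic sketch; the function name is made up for illustration.

```python
def overclock_percent(stock_ghz, actual_ghz):
    """Return how far actual_ghz is above stock_ghz, in percent."""
    return (actual_ghz - stock_ghz) / stock_ghz * 100.0


# Stock clock of the Penryn as stated in the thread (3 GHz).
STOCK_GHZ = 3.0

# Clocks mentioned in the thread: former 4.05 GHz, 3.8 GHz, and the new 3.5 GHz.
for clock in (4.05, 3.8, 3.5):
    percent = overclock_percent(STOCK_GHZ, clock)
    print(f"{clock:.2f} GHz is {percent:.1f}% over stock")
```

Even after the downclock to 3.5 GHz the CPU still runs roughly 17% above stock, so a clean result at 3.5 GHz would not by itself rule out a stability problem.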
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
de_constrainted_82_2s_6_search_parameters_1078600_1253041937 Thanks! I will run those WUs with different apps (GPU as well as CPU) on my development box to see if there are any issues. If Travis reacts fast enough (I've sent him a PM) we may also get your (invalidated) results for those WUs, if Travis is able to extract them from the database (I hope they get saved even if they were declared invalid). It would be helpful to have something to compare the results to. Edit: The GPU run is through and I've already found one possible reason, even without running it on the CPU. There may be some combination of a glitch in the WU generation, some sloppy code in the app for reading the parameter file (also present in the stock app, nothing changed there from my side) and a buffer overflow problem in the validator (kicking in only for CPU units, as GPUs don't show this behaviour). What's the problem? If the linefeed is missing after the metadata (WU generator glitch), there is nothing defining the end of the metadata string, so basically it is filled with a lot of junk after the real metadata (sloppy code for the parameter reading; it should stop at the end of the file). That junk is then also written to the results file, where it may have exactly the same size as the buffer used for reading it (i.e. one character too many), which is then a problem for the validator reading the application signature. But if that were true, I wonder why no GPU has problems with invalid WUs. Results with the ATI application: de_constrainted_82_2s_6_search_parameters_1078600_1253041937 de_constrainted_82_2s_6 de_constrainted_82_2s_6_search_parameters_1078599_1253041937 de_constrainted_82_2s_6 |
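The suspected failure mode in the post above (a metadata line with no trailing linefeed, read without a clear end and then passed on at a length that no longer fits the reader's buffer) can be illustrated with a minimal sketch. This is not the project's actual code, which is C and is not shown in the thread; `read_metadata` and `MAX_METADATA_LEN` are hypothetical names, and the limit of 128 characters is an assumption. The idea is simply that the reader stops at a newline or at the end of the file and rejects anything longer than the buffer instead of passing junk on.

```python
import io

# Hypothetical limit; the real buffer size is not given in the thread.
MAX_METADATA_LEN = 128


def read_metadata(stream, max_len=MAX_METADATA_LEN):
    """Read one 'metadata: ...' line from a text stream.

    The read ends at a newline or at end of file, so a missing trailing
    linefeed does not cause anything beyond the file to be picked up.
    Values longer than max_len are rejected rather than truncated.
    """
    raw = stream.readline()  # stops at '\n' or at end of file
    text = raw.rstrip("\r\n")
    prefix = "metadata:"
    if not text.startswith(prefix):
        raise ValueError("expected a line starting with 'metadata:'")
    value = text[len(prefix):].strip()
    if len(value) > max_len:
        raise ValueError("metadata does not fit the buffer")
    return value


# The same metadata with and without the trailing linefeed reads identically.
with_newline = read_metadata(io.StringIO("metadata: i: 30, redundancy\n"))
without_newline = read_metadata(io.StringIO("metadata: i: 30, redundancy"))
print(with_newline == without_newline)
```

The sample string is the metadata of one of the WUs posted above. A validator written along the same lines would apply the same length check before copying the string into its own buffer.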
Send message Joined: 1 Jan 09 Posts: 15 Credit: 85,816,654 RAC: 0 |
If you need any more invalid workunits, just look up the stats for any of my 3 quads. Yes, they are all overclocked... no, I don't believe it is the O/C that is at fault. They have been rock steady for months... and they share time with SETI and aren't having any trouble with WUs there. And CP, you're the man I believe can convince Travis to fix what they broke. And probably show him how to fix it too... |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
Results with the ATI application: Hmm, the results with the Win64_SSE3 application (the same one michs used, which turned invalid for him) are exactly the same as on my GPU (which doesn't show any invalid WUs): de_constrainted_82_2s_6_search_parameters_1078600_1253041937 de_constrainted_82_2s_6 Let's hope Travis can tell us the content of michs' result file. Other than that, I have to state that the app he uses calculates correctly and arrives at exactly the same output as on a GPU. I can't find anything wrong with the applications for those invalid WUs. I guess the second one will show the same result (but I will run it either way). |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
As expected, the second WU also returns the same result irrespective of the platform it is crunched on: 0.20 ATI version (gets validated) result with 0.20 Win64_SSE3 version (michs got invalid results with the same app for this exact WU) de_constrainted_82_2s_6_search_parameters_1078599_1253041937 de_constrainted_82_2s_6 If Travis comes up with the results those two WUs actually returned when both were calculated by michs, we will see if it is a very strange bug of the validator (marking only CPU WUs invalid) or if he returned different values for some reason. But what I can say with some confidence is that the apps all calculate the same. It makes no sense that GPU WUs validate and CPU WUs don't just because they were crunched on a different platform but deliver the exact same results. |
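The kind of offline cross-check described above, comparing the numbers returned for one WU on two platforms, can be sketched as follows. The validator's real acceptance criterion is not stated in the thread, so the relative tolerance used here is an assumption and `results_match` is a hypothetical helper. Since the actual result values are not reproduced in the thread, the sample numbers are the first three parameters of the two WUs posted earlier, used purely as stand-in data.

```python
import math


def results_match(a, b, rel_tol=1e-12):
    """Return True if two result vectors agree element by element.

    rel_tol is an assumed tolerance, not the project's actual criterion.
    """
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b))


# Stand-in data: leading parameters of WU ..._1078600 and WU ..._1078599.
gpu_values = [0.424510724378298, 5.459576507839513, -0.673558790328050]
cpu_values = [0.424510724378298, 5.459576507839513, -0.673558790328050]
other_wu = [0.426977971424483, 5.503659154387430, -0.679553731990973]

print(results_match(gpu_values, cpu_values))  # identical vectors
print(results_match(gpu_values, other_wu))    # vectors from a different WU
```

If two platforms pass a check like this for the same WU while only one of them is marked invalid, the difference has to come from somewhere other than the computed numbers, which is the argument made in the post.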
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
What's strange is that all of michs' and John's WUs seem to be validating correctly. I'm going to try and see if I can catch any more bad workunits. |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
What's strange is that all of michs' and John's WUs seem to be validating correctly. I'm going to try and see if I can catch any more bad workunits. But only the _3s WUs or on GPUs. Look at the tasks of John's computer here or michs' one here! Every _2s_6 WU is marked as invalid there. It would be nice to know the results they sent to the server and compare them with offline-generated results from machines known to produce validating results. Either their machines suddenly return faulty data for some reason or the validator makes some weird choices based on the platform the WUs were crunched on. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
I've made an update to the validator, let me know if it helped anything. |