Welcome to MilkyWay@home

Broken WUs

Message boards : Number crunching : Broken WUs

Cluster Physik
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 30785 - Posted: 14 Sep 2009, 8:49:42 UTC
Last modified: 14 Sep 2009, 8:57:32 UTC

I just had a look at some gs_constrainted_82_2s_4 WUs. The search parameters appear to be messed up, causing invalid WUs.

gs_constrainted_82_2s_4
parameters [14]: 0.618133560370761 0.000000000000000 0.000000000000000 0.000000000000000 0.000000000000000 0.000000000000000 0.000000000000000 0.000000000000000 0.000000000000000 0.000000000000000 0.000000000000000 0.000000000000000 0.000000000000000 0.000000000000000
metadata: fitness: -1.#IND00000000000, redundancy


There may still be issues with the new apps on new WU types (the apps were tested, but on different WU types). But with all those bad WUs flying around, it is hard to tell whether there are any.
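For anyone who wants to check their own files, here is a minimal Python sketch that reads the three-line layout shown above and flags parameter sets that are mostly zeros. The layout is inferred from the posted example only, and the function names and the 50% threshold are assumptions made for illustration, not anything taken from the project's code.

```python
def parse_search_parameters(text):
    """Parse 'name / parameters [N]: ... / metadata: ...' into a dict.

    The layout is inferred from the example posted in this thread,
    not from an official file specification.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    name = lines[0]
    header, _, values = lines[1].partition(":")
    declared = int(header[header.index("[") + 1 : header.index("]")])
    params = [float(v) for v in values.split()]
    if len(params) != declared:
        raise ValueError(f"expected {declared} parameters, got {len(params)}")
    metadata = lines[2].partition(":")[2].strip() if len(lines) > 2 else ""
    return {"name": name, "parameters": params, "metadata": metadata}


def looks_degenerate(params, max_zero_fraction=0.5):
    """Heuristic (an assumption): most values being exactly 0 is suspicious."""
    zeros = sum(1 for p in params if p == 0.0)
    return zeros / len(params) > max_zero_fraction


# The failing WU from this post: one real value followed by 13 zeros.
bad_wu = (
    "gs_constrainted_82_2s_4\n"
    "parameters [14]: 0.618133560370761 "
    + " ".join(["0.000000000000000"] * 13)
    + "\nmetadata: fitness: -1.#IND00000000000, redundancy\n"
)

wu = parse_search_parameters(bad_wu)
print(wu["name"], "degenerate:", looks_degenerate(wu["parameters"]))
```

A check like this only says that a file looks odd. It cannot tell whether the zeros came from the server or from somewhere else.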
ID: 30785 · Rating: 0
John Clark
Joined: 4 Oct 08
Posts: 1734
Credit: 64,228,409
RAC: 0
Message 30786 - Posted: 14 Sep 2009, 9:12:40 UTC - in response to Message 30785.  




Cluster

I think there is more than that to the invalid WU results. I started a thread here on the same subject (I think). But I have swapped back to 0.19 and this is also giving more than 70% invalid results (all CPU work). My 3850 GPU seems to be OK on 0.20.
Go away, I was asleep


ID: 30786 · Rating: 0
Cluster Physik
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 30788 - Posted: 14 Sep 2009, 9:26:03 UTC - in response to Message 30786.  

I think there is more than that to the invalid WU results. I started a thread here on the same subject (I think). But I have swapped back to 0.19 and this is also giving more than 70% invalid results (all CPU work). My 3850 GPU seems to be OK on 0.20.

So apparently it is nothing introduced with 0.20.
The GPU app was designed in some parts to be more fault tolerant than the CPU apps. If some calculations don't make sense because the input parameters are all zero (like in the example above), it is less likely to come up with an infinite result during some intermediate step. But the results are crap either way; they just might pass the validator more easily.
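To illustrate the general effect (this is not the actual MilkyWay@home calculation, just a toy with made-up function names): an unguarded intermediate step can turn zero inputs into infinities and then into NaN, while a clamped version stays finite and still returns a meaningless number.

```python
import math


def naive_log_ratio(a, b):
    """Unguarded: log of 0 is treated as -inf, and (-inf) - (-inf) is NaN."""
    log_a = math.log(a) if a > 0 else -math.inf
    log_b = math.log(b) if b > 0 else -math.inf
    return log_a - log_b


def guarded_log_ratio(a, b, floor=1e-300):
    """Guarded: clamp the inputs so every intermediate value stays finite."""
    return math.log(max(a, floor)) - math.log(max(b, floor))


print(naive_log_ratio(0.0, 0.0))    # nan
print(guarded_log_ratio(0.0, 0.0))  # 0.0
```

The guarded version returns a finite value that a validator could accept, but with all-zero inputs it carries no more information than the NaN does.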
ID: 30788 · Rating: 0
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 30818 - Posted: 14 Sep 2009, 13:21:53 UTC - in response to Message 30788.  

I'll be keeping a close eye on the server today to see if anything weird like this happens. The astronomers were trying some new things with the new search parameters so there might be some kind of error there.
ID: 30818 · Rating: 0
HDL
Joined: 6 Mar 09
Posts: 1
Credit: 37,785,089
RAC: 0
Message 30837 - Posted: 14 Sep 2009, 15:59:52 UTC - in response to Message 30788.  
Last modified: 14 Sep 2009, 16:03:55 UTC

I have many invalidated results from both the 0.19 and 0.20 applications.

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=91486&offset=0&show_names=0&state=4
ID: 30837 · Rating: 0
[Russia] michs
Joined: 16 Oct 08
Posts: 18
Credit: 164,409,593
RAC: 0
Message 30849 - Posted: 14 Sep 2009, 17:51:06 UTC

Still getting invalid results from 2s WUs.
ID: 30849 · Rating: 0
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 30869 - Posted: 14 Sep 2009, 19:55:48 UTC - in response to Message 30849.  

I am seeing some invalid results coming in with the 2s WUs. Unfortunately these aren't reporting what application they're coming from so that's making the problem a bit hard to track down.

I think there might be some kind of math issue in the application (I don't know whether this is happening with the stock application, which is part of the problem) that makes it return NaN for the fitness. That in turn screws up the reported parameters and fitness, and prevents the application from reporting its version correctly.

Either that or the server isn't reading the version correctly.

I'm pretty sure the parameter sets the server is generating are valid.
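As a side note on why a NaN fitness can trip up whatever reads the result: older Microsoft C runtimes print NaN and infinity as text like "-1.#IND" or "1.#INF", and a strict number parser rejects that text. The sketch below is hypothetical Python, not the project's validator; the helper names are made up. It shows the strict parse failing and a tolerant parse that maps those tokens to NaN or infinity so the result can be rejected cleanly.

```python
import math


def parse_fitness(token):
    """Parse a fitness token, accepting Windows-style NaN/inf spellings."""
    t = token.strip().rstrip(",")
    if "#IND" in t or "#QNAN" in t or "#SNAN" in t:
        return math.nan
    if "#INF" in t:
        return -math.inf if t.startswith("-") else math.inf
    return float(t)


def is_usable_fitness(value):
    """A fitness is only worth validating if it is a finite number."""
    return math.isfinite(value)


token = "-1.#IND00000000000"  # as reported in the first post of this thread

try:
    float(token)
except ValueError as err:
    print("strict parse failed:", err)

fitness = parse_fitness(token)
print("usable:", is_usable_fitness(fitness))
```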
ID: 30869 · Rating: 0
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 30871 - Posted: 14 Sep 2009, 20:02:42 UTC - in response to Message 30869.  

Can anyone give me the input search parameters file for a bad workunit, not what the result/stderr files are reporting?

I think there's some kind of weird numerical error with the new bounds of the workunits.
ID: 30871 · Rating: 0
Cluster Physik
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 30873 - Posted: 14 Sep 2009, 20:08:07 UTC - in response to Message 30871.  

Can anyone give me the input search parameters file for a bad workunit, not what the result/stderr files are reporting?

I think there's some kind of weird numerical error with the new bounds of the workunits.

Just look at the very first post in this thread. That is the content of the search parameter file for a failing WU, straight from my computer.
ID: 30873 · Rating: 0
Cluster Physik
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 30875 - Posted: 14 Sep 2009, 20:11:47 UTC - in response to Message 30869.  

I am seeing some invalid results coming in with the 2s WUs. Unfortunately these aren't reporting what application they're coming from so that's making the problem a bit hard to track down.

They are possibly coming from all versions. At least the 0.19 as well as the new 0.20 apps can't do anything meaningful with the parameters of those _2s WUs. And I really doubt the stock app would fare much better with the search parameters from the first post in the thread.
ID: 30875 · Rating: 0
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 30877 - Posted: 14 Sep 2009, 20:20:19 UTC - in response to Message 30875.  

They are possibly coming from all versions. At least the 0.19 as well as the new 0.20 apps can't do anything meaningful with the parameters of those _2s WUs. And I really doubt the stock app would fare much better with the search parameters from the first post in the thread.


I think I'm seeing the problem now. For some reason it looks like some of the search data is getting corrupted on the server... maybe a memory leak, I'm not sure.
ID: 30877 · Rating: 0
Cluster Physik
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 30894 - Posted: 14 Sep 2009, 23:09:38 UTC - in response to Message 30877.  
Last modified: 14 Sep 2009, 23:11:34 UTC

I think I'm seeing the problem now. For some reason it looks like some of the search data is getting corrupted on the server... maybe a memory leak, I'm not sure.

But it could also point to a problem with the validator reading the results. The peculiar thing I found about the failing WUs, i.e. those which are marked as invalid even though the search_parameters look okay (unlike the wrong parameters posted in the first message), is a very long metadata string.
While the buffer in the app holding this data is large enough to handle it, that may not be the case for the validator reading the result files. Maybe there is some kind of buffer overflow which messes up the reading of the application version string that follows (you mentioned there is a problem with this, afaik). Here is an example of an 82_2s_6 WU where I've seen this:

ps_constrainted_82_2s_6
parameters [14]: 0.690988920104987 30.000000000000000 -13.493293342892592 -12.643711320738422 20.569658671132817 3.141592653500000 0.856461973355750 12.795320726117398 -20.000000000000000 54.063019691397436 22.918248907707902 -1.842375307008260 1.129439573027266 1.693837140189638
metadata: p: 13, v: 0.09515176000142636092 10.58387193219777344666 -3.57425810799281862273 -8.68449978103880759761 -6.03034132886718321487 0.71988111405316512759 1.62822698278845301445 0.43824977630261663375 0.00000000000000000000 30.25906513407068842980 0.87354012086618815225 -0.42185784768854150961 1.95078927209604202631 -11.86238389186052444302


"Normal" WUs have very short metadata (like "i:26, redundancy" or something as short as this and not a complete parameter set), so this may be a problem.
ID: 30894 · Rating: 0
Pavysiu
Joined: 23 Aug 09
Posts: 1
Credit: 5,489,712
RAC: 0
Message 30930 - Posted: 15 Sep 2009, 5:23:21 UTC

Limit of 1500 s (25 min). The problem shows up with the CPU both overclocked and downclocked.
ID: 30930 · Rating: 0
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 30931 - Posted: 15 Sep 2009, 6:12:26 UTC - in response to Message 30894.  

But it could also point to a problem with the validator reading the results. The peculiar thing I found about the failing WUs, i.e. those which are marked as invalid even though the search_parameters look okay (unlike the wrong parameters posted in the first message), is a very long metadata string.
While the buffer in the app holding this data is large enough to handle it, that may not be the case for the validator reading the result files. Maybe there is some kind of buffer overflow which messes up the reading of the application version string that follows (you mentioned there is a problem with this, afaik). Here is an example of an 82_2s_6 WU where I've seen this:

ps_constrainted_82_2s_6
parameters [14]: 0.690988920104987 30.000000000000000 -13.493293342892592 -12.643711320738422 20.569658671132817 3.141592653500000 0.856461973355750 12.795320726117398 -20.000000000000000 54.063019691397436 22.918248907707902 -1.842375307008260 1.129439573027266 1.693837140189638
metadata: p: 13, v: 0.09515176000142636092 10.58387193219777344666 -3.57425810799281862273 -8.68449978103880759761 -6.03034132886718321487 0.71988111405316512759 1.62822698278845301445 0.43824977630261663375 0.00000000000000000000 30.25906513407068842980 0.87354012086618815225 -0.42185784768854150961 1.95078927209604202631 -11.86238389186052444302


"Normal" WUs have very short metadata (like "i:26, redundancy" or something as short as this and not a complete parameter set), so this may be a problem.


That's what a particle swarm WU should look like. The metadata buffer size is 2048 so it should be more than enough to cover that. The smaller metadata is from genetic search and differential evolution.
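A quick way to sanity-check the buffer argument is to measure the particle swarm metadata line quoted earlier in the thread against the 2048 figure. This is a rough Python sketch; reserving one byte for a C-style terminator is an assumption about how the reader works, and the helper name is made up.

```python
METADATA_BUFFER_SIZE = 2048  # buffer size quoted in this post

# Metadata line of the ps_constrainted_82_2s_6 WU posted earlier in the thread.
metadata_line = (
    "metadata: p: 13, v: "
    "0.09515176000142636092 10.58387193219777344666 "
    "-3.57425810799281862273 -8.68449978103880759761 "
    "-6.03034132886718321487 0.71988111405316512759 "
    "1.62822698278845301445 0.43824977630261663375 "
    "0.00000000000000000000 30.25906513407068842980 "
    "0.87354012086618815225 -0.42185784768854150961 "
    "1.95078927209604202631 -11.86238389186052444302"
)


def fits_in_buffer(line, buffer_size=METADATA_BUFFER_SIZE):
    """True if the line plus one terminating byte fits in the buffer."""
    return len(line.encode("ascii")) + 1 <= buffer_size


print(len(metadata_line), "bytes, fits:", fits_in_buffer(metadata_line))
```

With 14 values at roughly two dozen characters each, the line stays a long way below 2048 bytes, which is consistent with the buffer on that side not being the culprit.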

I'm putting in some code to check if the server is somehow sending out bad WUs. I already have a lot of checks in there to make sure everything is within bounds... so I'm not quite sure what the issue is yet.
ID: 30931 · Rating: 0
TomaszPawel
Joined: 9 Nov 08
Posts: 41
Credit: 92,786,635
RAC: 0
Message 30933 - Posted: 15 Sep 2009, 6:43:27 UTC - in response to Message 30931.  

ID: 30933 · Rating: 0
Glenn Rogers
Joined: 4 Jul 08
Posts: 165
Credit: 364,966
RAC: 0
Message 30936 - Posted: 15 Sep 2009, 9:23:38 UTC

I've been using the ver 0.20 app for about 24 hrs now and I haven't seen any invalid WUs yet, but that could change. Will keep an eye on it and let you folks know if anything changes.
ID: 30936 · Rating: 0
verstapp
Joined: 26 Jan 09
Posts: 589
Credit: 497,834,261
RAC: 0
Message 30939 - Posted: 15 Sep 2009, 11:12:34 UTC

No such problems here either.
Cheers,

PeterV

.
ID: 30939 · Rating: 0
Traumtänzer
Joined: 6 May 08
Posts: 1
Credit: 16,952,704
RAC: 0
Message 30940 - Posted: 15 Sep 2009, 11:21:01 UTC

Anyway,
still having trouble, also with _6 WUs.
See:
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=103258426
ID: 30940 · Rating: 0
John Clark
Joined: 4 Oct 08
Posts: 1734
Credit: 64,228,409
RAC: 0
Message 30946 - Posted: 15 Sep 2009, 12:13:09 UTC
Last modified: 15 Sep 2009, 12:25:18 UTC

Still getting more than 50% of the work on this host (32120) posted as invalid. This is only using CPUs (a quad), while the old quad (64209) uses a 4850 without producing any invalid results.

Travis

Your request under your Home Page post "Work Unit errors Part II" still seems to be giving problems, as you can see from the first link in this post.
Go away, I was asleep


ID: 30946 · Rating: 0
Cluster Physik
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 30961 - Posted: 15 Sep 2009, 15:30:22 UTC
Last modified: 15 Sep 2009, 15:33:13 UTC

@Travis:

I think the really strange thing is that the GPU apps appear to get the 82_2s_6 WUs validated without any problems now, while the validation of CPU results still has quite a few issues (0.19 as well as 0.20 are affected). So the changes to the validator did help, but only for the GPU apps as far as I can see.

I was trying to find a reason for this, and I have to admit I can't find one. I've pulled some random 82_2s_6 WUs from my computer and crunched them both on the GPU (which validates and gives credit) and on the CPU (where people can't get them validated). At least on my system, CPU and GPU give the exact same reasonable-looking result file (apart from the application signature), as promised for 0.20. So I really have no idea what the reason for the validation issues is.

Can someone save the search_parameters file of an 82_2s_6 WU that fails to validate on their CPU and post it here? I can give instructions on how to run the app offline with that WU, so one can analyze the result file as well as the checkpoint file when run on that specific CPU and compare it with the output of a GPU or of CPUs on other systems.

As an example, I've taken the file "de_constrainted_82_2s_6_search_parameters_794609_1253021251":

de_constrainted_82_2s_6
parameters [14]: 0.407141940752653 7.133047922264693 -0.787640701136234 -49.917644985287808 22.229337470866646 1.338302021799693 -1.008627028233573 6.374012226312546 -0.495391293332432 26.857345597800879 23.600000000000001 0.649729058771463 1.078751219607000 7.248185836949247
metadata: i: 2

I calculated it with the 0.20 Win64_SSE3 CPU application (failing to validate on those WU types for a lot of people as well as with the 0.19 version) on a PhenomX4. The result was:

de_constrainted_82_2s_6
parameters [14]: 0.407141940752653 7.133047922264693 -0.787640701136234 -49.917644985287808 22.229337470866646 1.338302021799693 -1.008627028233573 6.374012226312546 -0.495391293332432 26.857345597800879 23.600000000000001 0.649729058771463 1.078751219607000 7.248185836949247
metadata: i: 2
fitness: -2.975754770268507
Gipsel_0.20_x64_SSE3: 0.20

I calculated the exact same WU with the 0.20 ATI_Win64 app (which actually sent this result to the server and it validated without problems) and got this:

de_constrainted_82_2s_6
parameters [14]: 0.407141940752653 7.133047922264693 -0.787640701136234 -49.917644985287808 22.229337470866646 1.338302021799693 -1.008627028233573 6.374012226312546 -0.495391293332432 26.857345597800879 23.600000000000001 0.649729058771463 1.078751219607000 7.248185836949247
metadata: i: 2
fitness: -2.975754770268507
Gipsel_GPU_CAL_0.20_x64: 0.20

As I said, the results are identical apart from the application string, and I don't see the slightest reason why the CPU result should not be validated.
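If anyone wants to repeat this comparison on their own result files, a small Python sketch like the following lists the lines that differ between two results. It assumes the five-line layout shown above (name, parameters, metadata, fitness, application signature); the function name is made up and this is not how the validator compares results.

```python
BODY = (
    "de_constrainted_82_2s_6\n"
    "parameters [14]: 0.407141940752653 7.133047922264693 -0.787640701136234 "
    "-49.917644985287808 22.229337470866646 1.338302021799693 "
    "-1.008627028233573 6.374012226312546 -0.495391293332432 "
    "26.857345597800879 23.600000000000001 0.649729058771463 "
    "1.078751219607000 7.248185836949247\n"
    "metadata: i: 2\n"
    "fitness: -2.975754770268507\n"
)
# The two outputs posted above share BODY and differ only in the last line.
cpu_result = BODY + "Gipsel_0.20_x64_SSE3: 0.20\n"
gpu_result = BODY + "Gipsel_GPU_CAL_0.20_x64: 0.20\n"


def differing_lines(a, b):
    """Return (line_number, line_a, line_b) for every line that differs."""
    la = [ln.strip() for ln in a.strip().splitlines()]
    lb = [ln.strip() for ln in b.strip().splitlines()]
    diffs = [
        (i, x, y) for i, (x, y) in enumerate(zip(la, lb), start=1) if x != y
    ]
    if len(la) != len(lb):
        # zip stops at the shorter file, so report the mismatch explicitly.
        n = min(len(la), len(lb)) + 1
        diffs.append((n, "<length mismatch>", "<length mismatch>"))
    return diffs


for number, cpu_line, gpu_line in differing_lines(cpu_result, gpu_result):
    print(f"line {number}: {cpu_line!r} != {gpu_line!r}")
```

For the two outputs above, only the signature line is reported, so any validation difference would have to come from how that line (or the file as a whole) is read.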
ID: 30961 · Rating: 0