Welcome to MilkyWay@home

testing new validator


krahulik
Joined: 7 Nov 08
Posts: 14
Credit: 180,768,799
RAC: 0
Message 38157 - Posted: 5 Apr 2010, 23:45:10 UTC

This WU:
3x 58xx -> valid
1x 48xx/47xx -> invalid

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38158 - Posted: 5 Apr 2010, 23:52:38 UTC - in response to Message 38155.  

Is it possible for you to compile a test case for those of us who have both 48xx and 58xx cards so we can run them and see what is really doing the good work.

At Lunatics we have a bench suite that allows us to test new builds of the Astropulse apps.



I'll run some sample WUs standalone on my laptop tonight so I can be sure of the fitness. I'll put out the input files and the expected output when they're done.

Furlozza
Joined: 7 Feb 09
Posts: 9
Credit: 25,983,618
RAC: 0
Message 38169 - Posted: 6 Apr 2010, 3:26:19 UTC - in response to Message 38158.  
Last modified: 6 Apr 2010, 3:30:28 UTC

Am running down MW wus until problem solved.

*putting on devil's advocate hat*

Could the problem not be that in the 58x0 series, an instruction has been included in cards that actually makes them more accurate?

*removing hat*

EDIT: make that DELETING WUs
You should see the world from my eyes.

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38173 - Posted: 6 Apr 2010, 4:20:43 UTC - in response to Message 38169.  

Am running down MW wus until problem solved.

*putting on devil's advocate hat*

Could the problem not be that in the 58x0 series, an instruction has been included in cards that actually makes them more accurate?

*removing hat*

EDIT: make that DELETING WUs


Well, unless the 58x0 series is more accurate than CPUs (which I doubt), they're the culprit.

From an email Anthony just sent me:


(this is a 2 stream workunit with sgr coordinates)

(uses hardcoded values in atSurveyGeometry.c)
-2.558875331749281 v0.19 CPU application (SSE3)
-2.558875331749119 v0.20 apps (ati 48xx)
-2.558875331749284 v0.18 optimized
-2.558875331749081 nvidia on boinc

-2.558875355118770 (not v0.20) ati on boinc 58xx

(use computed values in atSurveyGeometry.c)
-2.558875329826787 cpu (from repository)
-2.558875329826697 nvidia (old version, circa oct 2009)
-2.558875329826689 nvidia (new unreleased version)

the 58x0 series just isn't matching up to anything we have.
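
To make the size of the mismatch concrete, here is a minimal sketch that compares the hardcoded-geometry values quoted above against the v0.19 CPU result. The relative tolerance of 1e-12 is an assumption chosen for illustration; the thread does not state the threshold the project's validator actually uses.

```python
# Likelihood values quoted in the email above (hardcoded-geometry run).
results = {
    "v0.19 CPU (SSE3)": -2.558875331749281,
    "v0.20 ATI 48xx": -2.558875331749119,
    "v0.18 optimized": -2.558875331749284,
    "nvidia on boinc": -2.558875331749081,
    "ATI 58xx on boinc": -2.558875355118770,
}

REFERENCE = "v0.19 CPU (SSE3)"
TOLERANCE = 1e-12  # assumed relative tolerance, NOT the project's real threshold


def relative_error(value, reference):
    """Relative difference between a result and the reference value."""
    return abs(value - reference) / abs(reference)


def outliers(values, reference_key, tolerance):
    """Names of results whose relative error exceeds the tolerance."""
    ref = values[reference_key]
    return [name for name, v in values.items()
            if relative_error(v, ref) > tolerance]


# Print the relative error of every result, then the ones that fail.
for name, value in results.items():
    print(f"{name:20s} {relative_error(value, results[REFERENCE]):.3e}")
print("outside tolerance:", outliers(results, REFERENCE, TOLERANCE))
```

With these numbers the CPU, 48xx and nvidia results agree with each other to roughly 13 significant digits, while the 58xx result already differs in the 9th.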


Brian Priebe
Joined: 27 Nov 09
Posts: 108
Credit: 430,760,953
RAC: 0
Message 38176 - Posted: 6 Apr 2010, 7:48:05 UTC - in response to Message 38169.  
Last modified: 6 Apr 2010, 7:51:33 UTC

<removed>

Simplex0
Joined: 11 Nov 07
Posts: 232
Credit: 178,229,009
RAC: 0
Message 38291 - Posted: 7 Apr 2010, 10:10:31 UTC

A few questions.

1. The 58xx cards have been around for 6 months. Does this mean that all the results during that time can have an error twice as big as it was supposed to be?

2. I am using 4870 cards to crunch and have something like 150 results that are marked as 'Invalid', caused by 58xx cards I assume. Will I get credit granted for those later?

3. If this has been going on for 6 months and you have spotted the problem just recently, you obviously have a serious problem with your validation method. Do you have a strategy now to prevent it from happening again?

Despite these problems I still think you guys at Milkyway@home are no. 1 in BOINC.

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38325 - Posted: 7 Apr 2010, 21:25:44 UTC - in response to Message 38291.  

A few questions.

1. The 58xx cards have been around for 6 months. Does this mean that all the results during that time can have an error twice as big as it was supposed to be?


I think this was maybe a more recent change? At any rate the results we've been getting for the searches have always been validated -- it's just that the issue didn't show up as much because we were not validating the vast majority of the workunits; we were just validating the ones which improved the searches we were doing. So while they had the error it didn't affect our results very much at all. The reason it's been a big deal lately is that, in order to fix scripting and single precision app issues, we started validating most workunits (even those that didn't improve our searches). So before we were only validating 2-5% of WUs, now we're validating 50-75%.


2. I am using 4870 cards to crunch and have something like 150 results that are marked as 'Invalid', caused by 58xx cards I assume. Will I get credit granted for those later?


Right now my focus is on trying to get the server running stably again and upgrading to the new application. I'm not sure if I'm going to have time to go through the database manually and fix everyone's lost credit. Most of these workunits have also been purged from the database by now, so there's really no good way to update and grant lost credit. I think it's just something everyone is going to have to live with and I apologize for that.


3. If this has been going on for 6 months and you have spotted the problem just recently, you obviously have a serious problem with your validation method. Do you have a strategy now to prevent it from happening again?


Well the real issue here was that we went from doing nearly no validation (we were only validating a minority of results which actually improved our search populations), to doing a lot more validation which made the problem really apparent -- so I guess the swap was a good thing :) On our end, we don't really need this extra validation because results which don't improve our search populations aren't particularly important, other than to weed out bad applications (which in this case we were unlucky enough to have one). But at any rate, I think with the stricter validation we have in place now, this kind of thing shouldn't happen again.


Despite these problems I still think you guys at Milkyway@home are no. 1 in BOINC.


Glad after all of this we aren't totally hated here :)

Brian Priebe
Joined: 27 Nov 09
Posts: 108
Credit: 430,760,953
RAC: 0
Message 38336 - Posted: 7 Apr 2010, 22:49:43 UTC - in response to Message 38325.  
Last modified: 7 Apr 2010, 22:50:52 UTC

But at any rate, I think with the more strict validation we have in place now, this kind of thing shouldn't happen again.
For a possible counter-example, check Workunit 90623954. Two anonymous platforms sporting versions 0.20b and 0.22 out-quorumed an HD5870 running version 0.23. All of the results were from ATI Cypress boards (HD5870 and HD5850 apparently).

Shouldn't 5xxx-series GPU results from applications prior to 0.23 be automatically discarded?

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38341 - Posted: 7 Apr 2010, 23:14:50 UTC - in response to Message 38336.  


Shouldn't 5xxx-series GPU results from applications prior to 0.23 be automatically discarded?


They should... but that takes a couple of extra database queries per workunit, and the server is crashing enough as it is. I had the check in there for a while and the server couldn't keep up with it.
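
For illustration only, the check being discussed might look like the sketch below. The function names and the GPU model strings are hypothetical and this is not the project's validator code; the only facts taken from the thread are the 5xxx series and the 0.23 cutoff. The expensive part on the real server would be the database lookups for each result's host GPU and app version, which this sketch simply takes as arguments.

```python
import re

# First application version treated as trustworthy on 5xxx cards (from the thread).
MIN_FIXED_VERSION = (0, 23)


def parse_version(text):
    """Turn a version string such as '0.20b' or '0.23' into (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def is_5xxx_series(gpu_model):
    """True for model strings such as 'HD5870' or 'ATI Radeon HD 5850'."""
    return re.search(r"\bHD\s?5\d{3}\b", gpu_model, re.IGNORECASE) is not None


def should_discard(gpu_model, app_version):
    """Discard results from 5xxx-series cards running an app older than 0.23."""
    return (is_5xxx_series(gpu_model)
            and parse_version(app_version) < MIN_FIXED_VERSION)


print(should_discard("HD5870", "0.20b"))  # old app on a 5xxx card
print(should_discard("HD5870", "0.23"))   # fixed app on a 5xxx card
print(should_discard("HD4870", "0.20b"))  # 48xx cards are not affected
```

The comparison itself is cheap; the cost Travis describes comes from fetching the two inputs for every workunit.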

Gary Roberts
Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,499
RAC: 0
Message 38344 - Posted: 7 Apr 2010, 23:31:31 UTC - in response to Message 38336.  

Two anonymous platforms sporting versions 0.20b and 0.22 out-quorumed an HD5870 running version 0.23.

There are probably a significant number of people running AP and not paying close attention to the boards. Anybody noticing cases of 5800 series cards still running the wrong app should send a PM to the owner (if possible) since that will give them an email as well. Hopefully they are monitoring their email a bit more closely.

Cheers,
Gary.

Brian Priebe
Joined: 27 Nov 09
Posts: 108
Credit: 430,760,953
RAC: 0
Message 38353 - Posted: 8 Apr 2010, 0:58:36 UTC - in response to Message 38344.  

Anybody noticing cases of 5800 series cards still running the wrong app should send a PM to the owner (if possible) since that will give them an email as well.
I personally think it would be more appropriate if RPI were sending out these e-mails but...

...I've notified 4 other owners as requested.

banditwolf
Joined: 12 Nov 07
Posts: 2425
Credit: 524,164
RAC: 0
Message 38356 - Posted: 8 Apr 2010, 2:08:24 UTC - in response to Message 38353.  

Anybody noticing cases of 5800 series cards still running the wrong app should send a PM to the owner (if possible) since that will give them an email as well.
I personally think it would be more appropriate if RPI were sending out these e-mails but...

Doing a mass email to everyone should be easy. Just the ones who don't want emails from the project would be left out, then individual emails.
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Simplex0
Joined: 11 Nov 07
Posts: 232
Credit: 178,229,009
RAC: 0
Message 38362 - Posted: 8 Apr 2010, 9:00:39 UTC - in response to Message 38341.  


Shouldn't 5xxx-series GPU results from applications prior to 0.23 be automatically discarded?


They should... but that takes a couple of extra database queries per workunit, and the server is crashing enough as it is. I had the check in there for a while and the server couldn't keep up with it.



Until that problem is sorted out I will run Folding@home instead.
Hope you will have this fixed soon.

Chris S
Joined: 20 Sep 08
Posts: 1391
Credit: 203,560,157
RAC: 0
Message 38627 - Posted: 12 Apr 2010, 11:26:39 UTC

I have unfortunately had to swap 7 machines running various 3850, 4850, and 4870 cards on to another project as each one was producing 90% computation errors or work not validated. I'll check back in a few days and see if this is still continuing.
Don't drink water, that's the stuff that rusts pipes

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38643 - Posted: 12 Apr 2010, 19:14:31 UTC - in response to Message 38627.  

I have unfortunately had to swap 7 machines running various 3850, 4850, and 4870 cards on to another project as each one was producing 90% computation errors or work not validated. I'll check back in a few days and see if this is still continuing.


Did your machines upgrade to the correct application (0.23) and are they running the right brook32/64.dll?

If they're giving that many errors it's probably because they're using the wrong application.

Tomasz R. Gwiazda
Joined: 23 Mar 09
Posts: 13
Credit: 100,032,796
RAC: 0
Message 38646 - Posted: 12 Apr 2010, 19:37:08 UTC - in response to Message 38643.  

hi Travis

I've got a question. My 3 hosts (2x5870, 5870, 4850) were disconnected from MW, and after you published the new app 0.23 they reconnected.
Everything works fine.
But from time to time I receive some WUs for app 0.19, even when I manually delete the app from the HDD.
Is that supposed to happen?

Join us at www.boincatpoland.org

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38648 - Posted: 12 Apr 2010, 20:00:30 UTC - in response to Message 38646.  

hi Travis

I've got a question. My 3 hosts (2x5870, 5870, 4850) were disconnected from MW, and after you published the new app 0.23 they reconnected.
Everything works fine.
But from time to time I receive some WUs for app 0.19, even when I manually delete the app from the HDD.
Is that supposed to happen?


If you're running windows, for CPU the highest app version is still 0.19. So if it's running 0.19 on the CPU that's not a problem.

Tomasz R. Gwiazda
Joined: 23 Mar 09
Posts: 13
Credit: 100,032,796
RAC: 0
Message 38649 - Posted: 12 Apr 2010, 20:08:01 UTC - in response to Message 38648.  

Yes, I run Win7 and XP,
but I'm not using the CPU at all (never used it for MW).
Hmm, maybe it's the "fault" of the MW preferences:

Use CPU
(enforced by 6.10+ clients)

Join us at www.boincatpoland.org

Chris S
Joined: 20 Sep 08
Posts: 1391
Credit: 203,560,157
RAC: 0
Message 38671 - Posted: 13 Apr 2010, 9:40:00 UTC - in response to Message 38643.  

Did your machines upgrade to the correct application (0.23) and are they running the right brook32/64.dll?

If they're giving that many errors it's probably because they're using the wrong application.


Thanks for the response. I've had a bit of a change round and they seem OK now, so they are back crunching for MW after many days of lost work! See my post here. http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=1679&nowrap=true#38670
Don't drink water, that's the stuff that rusts pipes

Crunch3r
Volunteer developer
Joined: 17 Feb 08
Posts: 363
Credit: 258,227,990
RAC: 0
Message 38677 - Posted: 13 Apr 2010, 11:30:29 UTC - in response to Message 38648.  
Last modified: 13 Apr 2010, 11:32:58 UTC

hi Travis

I've got a question. My 3 hosts (2x5870, 5870, 4850) were disconnected from MW, and after you published the new app 0.23 they reconnected.
Everything works fine.
But from time to time I receive some WUs for app 0.19, even when I manually delete the app from the HDD.
Is that supposed to happen?


If you're running windows, for CPU the highest app version is still 0.19. So if it's running 0.19 on the CPU that's not a problem.


I think I know what he's talking about.
Yesterday, for some reason, the same thing happened here on two machines. Although they run on anonymous platform with only the GPU app specified, the server sent some tasks assigned to the CPU app (0.19), which was neither selected in the prefs (Don't use CPU) nor specified in the app_info.xml.

That shouldn't have happened at all.

The BOINC client started all of those tasks at once using the GPU app (I checked that in the task manager) and labeled them as CPU app in the BOINC manager.

So my 8 core machine had 2 GPU apps running (as specified in the app_info.xml) and another 8 active tasks showing up as CPU tasks in the manager, although it actually used the GPU app for them.

The result was that the V8 slowed down quite a bit and the other one locked up completely.

EDIT
Same thing has happened on Collatz -> http://boinc.thesonntags.com/collatz/forum_thread.php?id=370

So I think it's a server-side bug, where it doesn't honor the prefs or the apps specified in an app_info.xml.

Join Support science! Joinc Team BOINC United now!
©2022 Astroinformatics Group