Compute Errors

Message boards : Number crunching : Compute Errors

Author Message
Profile The Gas Giant
Avatar
Send message
Joined: 24 Dec 07
Posts: 1947
Credit: 240,865,573
RAC: 0
Message 24158 - Posted: 4 Jun 2009 | 18:00:25 UTC

It appears there are some more faulty WUs out there....

ps_sgr_208_3s_1_20922_1244137807

Plus others...

Profile [FVG] bax
Avatar
Send message
Joined: 7 Mar 09
Posts: 8
Credit: 140,903,170
RAC: 0
Message 24160 - Posted: 4 Jun 2009 | 18:09:07 UTC - in response to Message 24158.
Last modified: 4 Jun 2009 | 18:17:06 UTC

It seems to me that:

only "3s" type WUs are involved!

The GPU app ends in "compute error", but the CPU app works well.

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=72202438

Profile Kalessin
Avatar
Send message
Joined: 10 Nov 07
Posts: 42
Credit: 27,012,695
RAC: 0
Message 24161 - Posted: 4 Jun 2009 | 18:25:49 UTC
Last modified: 4 Jun 2009 | 18:31:54 UTC

At the moment I have only 2s running. 1s and 3s are producing a complete standstill on both of my GPUs.

OK, the 1s have started working again; it just took them a while to recover after I terminated the 3s.

Grrrr.
____________
Dragons can fly because they don't fit into pirate ships!

Profile [XTBA>XTC] ZeuZ
Send message
Joined: 27 Dec 07
Posts: 14
Credit: 5,089,974
RAC: 0
Message 24162 - Posted: 4 Jun 2009 | 18:31:12 UTC

Same problem here

Profile [FVG] bax
Avatar
Send message
Joined: 7 Mar 09
Posts: 8
Credit: 140,903,170
RAC: 0
Message 24163 - Posted: 4 Jun 2009 | 18:36:44 UTC - in response to Message 24161.

...1s ... complete standstill


Only the 1s are at a standstill here (GPU)

Profile Kalessin
Avatar
Send message
Joined: 10 Nov 07
Posts: 42
Credit: 27,012,695
RAC: 0
Message 24165 - Posted: 4 Jun 2009 | 18:44:00 UTC

OK, just had a closer look.
The standstill here is caused by the "3s", but once everything has stalled, the "1s" and "2s" need a complete restart of BOINC to get going again.
____________
Dragons can fly because they don't fit into pirate ships!

Profile [P3D] Crashtest
Send message
Joined: 8 Jan 09
Posts: 58
Credit: 16,804,661
RAC: 12
Message 24166 - Posted: 4 Jun 2009 | 18:50:34 UTC - in response to Message 24165.

yes - got the "3s" problem too (Gipsel 0.19e GPU App)

"1s" and "2s" ok

Profile borandi
Avatar
Send message
Joined: 21 Feb 09
Posts: 180
Credit: 26,221,261
RAC: 0
Message 24167 - Posted: 4 Jun 2009 | 18:51:53 UTC

I thought I had a haul when it said '48 new tasks'... =P
____________

Profile borandi
Avatar
Send message
Joined: 21 Feb 09
Posts: 180
Credit: 26,221,261
RAC: 0
Message 24172 - Posted: 4 Jun 2009 | 19:28:08 UTC

6/4/2009 7:58:26 PM|Milkyway@home|Scheduler request completed: got 5 new tasks
6/4/2009 8:07:51 PM|Milkyway@home|Scheduler request completed: got 19 new tasks
6/4/2009 8:14:51 PM|Milkyway@home|Scheduler request completed: got 21 new tasks
6/4/2009 8:16:03 PM|Milkyway@home|Scheduler request completed: got 16 new tasks
6/4/2009 8:20:46 PM|Milkyway@home|Scheduler request completed: got 16 new tasks
6/4/2009 8:24:14 PM|Milkyway@home|Scheduler request completed: got 7 new tasks

Seems like the s3 WUs are shutting down the automated machines? I'm getting only s1 and s2 at the minute after a small batch of s3
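
A minimal sketch for tallying a log excerpt like the one above. It pulls the per-request counts out of the "Scheduler request completed" lines. The function name `count_new_tasks` is made up for illustration, and the regular expression assumes the message wording shown in this excerpt; other BOINC client versions may phrase it differently.

```python
import re

# Assumes the message wording seen in the excerpt above.
NEW_TASKS = re.compile(r"Scheduler request completed: got (\d+) new tasks")

def count_new_tasks(log_text):
    """Return the per-request task counts found in a message-log excerpt."""
    return [int(m.group(1)) for m in NEW_TASKS.finditer(log_text)]

LOG = """\
6/4/2009 7:58:26 PM|Milkyway@home|Scheduler request completed: got 5 new tasks
6/4/2009 8:07:51 PM|Milkyway@home|Scheduler request completed: got 19 new tasks
6/4/2009 8:14:51 PM|Milkyway@home|Scheduler request completed: got 21 new tasks
6/4/2009 8:16:03 PM|Milkyway@home|Scheduler request completed: got 16 new tasks
6/4/2009 8:20:46 PM|Milkyway@home|Scheduler request completed: got 16 new tasks
6/4/2009 8:24:14 PM|Milkyway@home|Scheduler request completed: got 7 new tasks
"""

counts = count_new_tasks(LOG)
print(counts, sum(counts))  # [5, 19, 21, 16, 16, 7] 84
```

That is 84 tasks handed out in under half an hour, which fits a host that keeps erroring out work and asking for more.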
____________

Profile [FVG] bax
Avatar
Send message
Joined: 7 Mar 09
Posts: 8
Credit: 140,903,170
RAC: 0
Message 24173 - Posted: 4 Jun 2009 | 19:34:18 UTC - in response to Message 24172.

..Seems like the s3 WUs are shutting down the automated machines?


we have the first WU virus ;-)

hi hi

Lord Tedric
Avatar
Send message
Joined: 9 Nov 07
Posts: 151
Credit: 8,391,608
RAC: 0
Message 24174 - Posted: 4 Jun 2009 | 19:36:52 UTC
Last modified: 4 Jun 2009 | 19:48:42 UTC

Same here, WUs failing to run and then causing errors.

ps_sgr_208_3s_2_9411_1244143134_0
ps_sgr_208_3s_2_9410_1244143134_0
ps_sgr_208_3s_2_9409_1244143134_0
ps_sgr_208_3s_2_9408_1244143134_0
ps_sgr_208_3s_2_9407_1244143134_0
ps_sgr_208_3s_2_9386_1244143134_0
ps_sgr_208_3s_2_9385_1244143134_0
ps_sgr_208_3s_2_9384_1244143134_0

<core_client_version>6.6.28</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>

Also, they seem to use a lot of processor power for a WU running on an ATI card,
and once downloaded they fail to start processing; shutting down and restarting BOINC does not remedy this either.
I've started to abort these units, but the next batch of downloads just sends more of the same!
____________

Localizer
Send message
Joined: 28 Jan 08
Posts: 39
Credit: 379,928,803
RAC: 1
Message 24176 - Posted: 4 Jun 2009 | 19:51:00 UTC - in response to Message 24174.

........... just seems to be hanging WUs with a side order of computation errors at the moment on the ps_sgr_208_xs series of WUs.

Profile borandi
Avatar
Send message
Joined: 21 Feb 09
Posts: 180
Credit: 26,221,261
RAC: 0
Message 24179 - Posted: 4 Jun 2009 | 20:04:16 UTC

Now,

On one of my ATi clients, I have to cancel the s3s, but s1 and s2 work fine.
On another ATi client, s3 WUs come out as comp errors, but s1 and s2 process and are marked invalid.
____________

Lord Tedric
Avatar
Send message
Joined: 9 Nov 07
Posts: 151
Credit: 8,391,608
RAC: 0
Message 24181 - Posted: 4 Jun 2009 | 20:08:30 UTC - in response to Message 24179.

I'm getting errors on s1, s2 & s3's

I'm having to micromanage at the moment, as none are completing and therefore none are being returned.....so unless I abort them I don't get any new units.
the circle of life...............
____________

Profile Berserk_Tux
Avatar
Send message
Joined: 2 Jan 08
Posts: 79
Credit: 365,471,675
RAC: 0
Message 24184 - Posted: 4 Jun 2009 | 20:12:36 UTC - in response to Message 24176.
Last modified: 4 Jun 2009 | 20:16:44 UTC

I have the same problems. All my ATIs hang and error out :-(
____________

boosted
Send message
Joined: 4 Feb 08
Posts: 116
Credit: 17,263,566
RAC: 0
Message 24185 - Posted: 4 Jun 2009 | 20:13:40 UTC

I have had about 60 or so errors or freezes on the 3s units.
____________

Profile Labbie
Avatar
Send message
Joined: 29 Aug 07
Posts: 327
Credit: 116,463,193
RAC: 0
Message 24186 - Posted: 4 Jun 2009 | 20:18:14 UTC

My 2 ATI machines error immediately on the 3s WUs, my CPU machine finishes these just fine.
____________

Calm Chaos Forum...Join Calm Chaos Now

Profile [FVG] bax
Avatar
Send message
Joined: 7 Mar 09
Posts: 8
Credit: 140,903,170
RAC: 0
Message 24188 - Posted: 4 Jun 2009 | 20:23:28 UTC - in response to Message 24165.

OK, just had a closer look.
The standstill here is caused by the "3s", but once everything has stalled, the "1s" and "2s" need a complete restart of BOINC to get going again.


I think this is the correct diagnosis for ATI GPU clients

if I'm lucky, I erase all "3s" WUs before BOINC starts crunching them. In this way I save "1s" and "2s" without restarting BOINC

Lord Tedric
Avatar
Send message
Joined: 9 Nov 07
Posts: 151
Credit: 8,391,608
RAC: 0
Message 24190 - Posted: 4 Jun 2009 | 20:33:21 UTC - in response to Message 24188.
Last modified: 4 Jun 2009 | 20:45:56 UTC

OK, just had a closer look.
The standstill here is caused by the "3s", but once everything has stalled, the "1s" and "2s" need a complete restart of BOINC to get going again.


I think this is the correct diagnosis for ATI GPU clients

if I'm lucky, I erase all "3s" WUs before BOINC starts crunching them. In this way I save "1s" and "2s" without restarting BOINC


Tried this approach; it does not work for me................

Just checked my Task Manager: though I have no WUs currently downloaded or running, it shows multiple instances of astronomy_0.19_ATI_SSE2e.exe running in the background?
____________

Profile [FVG] bax
Avatar
Send message
Joined: 7 Mar 09
Posts: 8
Credit: 140,903,170
RAC: 0
Message 24191 - Posted: 4 Jun 2009 | 20:47:02 UTC - in response to Message 24190.

Just checked my Task Manager: though I have no WUs currently downloaded or running, it shows astronomy_0.19_ATI_SSE2e.exe running in the background?


My Task Manager does not show this ;-)

Lord Tedric
Avatar
Send message
Joined: 9 Nov 07
Posts: 151
Credit: 8,391,608
RAC: 0
Message 24195 - Posted: 4 Jun 2009 | 20:57:15 UTC
Last modified: 4 Jun 2009 | 20:59:15 UTC

I have re-installed the CPU version of the opt. app and can confirm that these same units work as expected on the CPU. As I will be unable to micromanage throughout the night, I will run the units this way until tomorrow, when I will try again; hopefully it will be sorted by then.
____________

Takumo
Send message
Joined: 5 Apr 09
Posts: 2
Credit: 4,513,838
RAC: 0
Message 24196 - Posted: 4 Jun 2009 | 21:08:42 UTC
Last modified: 4 Jun 2009 | 21:09:50 UTC

I can confirm that "3s" cause errors on my machine too. They are hanging without any output. Restarting BOINC causes "computation error". "2s" and "1s" are crunched as ever.

ATI 48700
Vista Ultimate SP1
Catalyst 9.1
astronomy_0.19_ati_SSE2e

Profile The Gas Giant
Avatar
Send message
Joined: 24 Dec 07
Posts: 1947
Credit: 240,865,573
RAC: 0
Message 24197 - Posted: 4 Jun 2009 | 21:09:20 UTC

I've had the 2s WUs validate OK on my 4870.

Dr Dan
Avatar
Send message
Joined: 17 Mar 08
Posts: 165
Credit: 410,224,363
RAC: 0
Message 24200 - Posted: 4 Jun 2009 | 21:35:16 UTC
Last modified: 4 Jun 2009 | 21:35:59 UTC

Well, just for the record, I have 14 HD 4870s that are all hung up and erroring out. Someone please push the oops button and fix them...

Thanks..

DD,
____________

Kokomiko
Avatar
Send message
Joined: 27 Sep 07
Posts: 8
Credit: 25,779,225
RAC: 0
Message 24201 - Posted: 4 Jun 2009 | 21:41:39 UTC

All WUs of the ....3s.... lot got errors on my GPU (HD4870). If I cancel them, the .....2s.... will not be calculated until I restart the whole PC; a restart of BOINC isn't enough. What has changed?

Cluster Physik
Send message
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 24203 - Posted: 4 Jun 2009 | 21:49:10 UTC

Sorry guys!

I have only just seen this mess, but it is already midnight here, so I can't do much about it before tomorrow.
What has changed with the ps_208_s3 WUs compared to all earlier ones is that these WUs do computations on three streams (as opposed to only one or two in all previous WUs). The GPU app was supposed to handle that (Travis told me that three streams would probably be coming as early as February or so), but as there were no appropriate WUs it was never tested. Obviously there is a bug in the GPU app (as the CPU app is working). So I need to find the bug, fix it and distribute a new version of the GPU application.

I hope it will be ready tomorrow.

Profile borandi
Avatar
Send message
Joined: 21 Feb 09
Posts: 180
Credit: 26,221,261
RAC: 0
Message 24204 - Posted: 4 Jun 2009 | 21:50:09 UTC
Last modified: 4 Jun 2009 | 21:51:10 UTC

If you abort the 3s, and the 2s or 1s don't start up, rather than restart your computer:

Close BOINC
Open Task Manager
End all Astronomy sse2 processes
Reopen BOINC

The WUs start crunching again
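
For anyone who would rather script the "end all Astronomy processes" step, here is a rough sketch under a few assumptions: Windows, a snapshot taken with `tasklist /FO CSV /NH`, and leftover processes whose image name starts with `astronomy_` (as in the astronomy_0.19_ATI_SSE2e.exe reported earlier in this thread). The helper names and the sample snapshot are hypothetical. The sketch only builds the `taskkill` commands; running them, for example with `subprocess.run`, is left to the reader and should only be done after BOINC has been closed.

```python
import csv
import io

def stale_astronomy_processes(tasklist_csv, prefix="astronomy_"):
    """Return image names from `tasklist /FO CSV /NH` output that start with prefix."""
    names = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Column 0 of the CSV output is the image name.
        if row and row[0].lower().startswith(prefix):
            names.append(row[0])
    return names

def taskkill_commands(image_names):
    """Build one `taskkill /F /IM <name>` argument list per distinct image name."""
    return [["taskkill", "/F", "/IM", name] for name in sorted(set(image_names))]

# Hypothetical snapshot: BOINC itself plus two leftover GPU app instances.
SAMPLE = (
    '"boinc.exe","1234","Console","1","10,000 K"\n'
    '"astronomy_0.19_ATI_SSE2e.exe","2345","Console","1","50,000 K"\n'
    '"astronomy_0.19_ATI_SSE2e.exe","3456","Console","1","50,000 K"\n'
)

for command in taskkill_commands(stale_astronomy_processes(SAMPLE)):
    print(" ".join(command))
```

Killing by image name ends every instance at once, so a single command covers the multiple hung copies some people are seeing.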

Sorry guys!

I have only just seen this mess, but it is already midnight here, so I can't do much about it before tomorrow.
What has changed with the ps_208_s3 WUs compared to all earlier ones is that these WUs do computations on three streams (as opposed to only one or two in all previous WUs). The GPU app was supposed to handle that (Travis told me that three streams would probably be coming as early as February or so), but as there were no appropriate WUs it was never tested. Obviously there is a bug in the GPU app (as the CPU app is working). So I need to find the bug, fix it and distribute a new version of the GPU application.

I hope it will be ready tomorrow.


Thanks for the update :)
____________

Kokomiko
Avatar
Send message
Joined: 27 Sep 07
Posts: 8
Credit: 25,779,225
RAC: 0
Message 24205 - Posted: 4 Jun 2009 | 21:56:19 UTC - in response to Message 24204.


The plural of CPU is CPUs
The plural of GPU is GPUs

THERE IS NO APOSTROPHE UNLESS YOU ARE DESCRIBING A FEATURE!!!


... but you know: "Bad English is the language of the science ..." :D

Profile verstapp
Avatar
Send message
Joined: 26 Jan 09
Posts: 585
Credit: 464,286,454
RAC: 730
Message 24207 - Posted: 4 Jun 2009 | 22:06:10 UTC

>"Bad English is the language of the science ..."
No, bad english is the language of the net. :)
____________
Cheers,

PeterV

.

Profile borandi
Avatar
Send message
Joined: 21 Feb 09
Posts: 180
Credit: 26,221,261
RAC: 0
Message 24208 - Posted: 4 Jun 2009 | 22:13:52 UTC
Last modified: 4 Jun 2009 | 22:16:57 UTC

<offtopic> Meh, I was fed up with people and businesses putting apostrophes in CPUs, GPUs, CDs, TVs, DVDs and PCs, so I decided to make a Facebook fan page about Spelling, Punctuation and Apostrophe Use. Not too bad getting 10k fans in a month - I'm not the only soldier in the crusade against bad English. Not bad for a computational chemist who actually dropped out of English A-Level during Wuthering Heights.

Also, science rocks my socks.
</offtopic>
____________

Lord Tedric
Avatar
Send message
Joined: 9 Nov 07
Posts: 151
Credit: 8,391,608
RAC: 0
Message 24213 - Posted: 4 Jun 2009 | 22:42:18 UTC - in response to Message 24204.

If you abort the 3s, and the 2s or 1s don't start up, rather than restart your computer:

Close BOINC
Open Task Manager
End all Astronomy sse2 processes
Reopen BOINC

The WUs start crunching again

Sorry guys!

I have only just seen this mess, but it is already midnight here, so I can't do much about it before tomorrow.
What has changed with the ps_208_s3 WUs compared to all earlier ones is that these WUs do computations on three streams (as opposed to only one or two in all previous WUs). The GPU app was supposed to handle that (Travis told me that three streams would probably be coming as early as February or so), but as there were no appropriate WUs it was never tested. Obviously there is a bug in the GPU app (as the CPU app is working). So I need to find the bug, fix it and distribute a new version of the GPU application.

I hope it will be ready tomorrow.


Thanks for the update :)


This would entail (I've used this word before) micromanagement. If it wasn't almost midnight here I would do that; it's going to affect a lot of people over the next few hours.
Solution: temporarily revert to the CPU app until CP sorts out the problem!
____________

Profile banditwolf
Avatar
Send message
Joined: 12 Nov 07
Posts: 2425
Credit: 295,133
RAC: 0
Message 24214 - Posted: 4 Jun 2009 | 22:49:48 UTC

I have 6 208_3s_2's now which seem to be running ok so far with ~15-30% increase in crunch time.
____________
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Profile [AF>Occitania>Lengadocian] F5LCU
Send message
Joined: 30 Mar 08
Posts: 25
Credit: 65,632,132
RAC: 49,189
Message 24225 - Posted: 5 Jun 2009 | 5:46:45 UTC

Hello

Are 3s WUs a test (either voluntary or involuntary) to encourage ATI GPU crunchers to wait for milky_GPU applications?

I switched yesterday to ubuntu where the CPU optimized application runs well without errors.

Profile [AF>Occitania] Meteore31
Send message
Joined: 11 Dec 07
Posts: 11
Credit: 95,010,859
RAC: 1,450
Message 24229 - Posted: 5 Jun 2009 | 7:37:20 UTC

On Vista64 and ATI HD4870 it doesn't run correctly, there are errors... snif

Profile verstapp
Avatar
Send message
Joined: 26 Jan 09
Posts: 585
Credit: 464,286,454
RAC: 730
Message 24230 - Posted: 5 Jun 2009 | 8:07:20 UTC

Having just got home from work [GMT+10] and then spent 15 minutes aborting stalled 3s WUs [on 3 PCs] and restarting BOINC to get things going again, might I suggest that until the 3s problem is sorted the project stop serving up 3s WUs. Of course, if the project can detect which PCs are or are not running Cluster's GPU client [individual PC RACs could be a hint], then problem solved: just send 3s WUs to CPU-only PCs and 1s and 2s to the GPUs.

Someone may be ahead of me here - the last couple of batches of WUs that I got contained only 1s and 2s WUs.
____________
Cheers,

PeterV

.

Localizer
Send message
Joined: 28 Jan 08
Posts: 39
Credit: 379,928,803
RAC: 1
Message 24234 - Posted: 5 Jun 2009 | 9:21:30 UTC

............. I'm still getting the 3s in mixed batches.

Profile [AF>EDLS] frederic abussan
Avatar
Send message
Joined: 30 Nov 07
Posts: 9
Credit: 165,873,750
RAC: 1
Message 24236 - Posted: 5 Jun 2009 | 9:38:26 UTC - in response to Message 24234.

I have 4 GPUs running in circles for nothing because I am not nearby to deal with this problem; the computers are far away from me. Why not stop sending GPU WUs on the CPU site? Problem solved while waiting for the GPU site to open, lol.

Profile Berserk_Tux
Avatar
Send message
Joined: 2 Jan 08
Posts: 79
Credit: 365,471,675
RAC: 0
Message 24237 - Posted: 5 Jun 2009 | 9:53:05 UTC - in response to Message 24236.

Please stop sending out 3s WUs.

____________

Profile [P3D] Crashtest
Send message
Joined: 8 Jan 09
Posts: 58
Credit: 16,804,661
RAC: 12
Message 24238 - Posted: 5 Jun 2009 | 10:00:57 UTC - in response to Message 24237.

Yea - stop sending "3s" WU for about 2 days - Gipsel is working on it !

I got 48 new WUs at once - all "3s" - 60s later - all crashed ...

The Milkyway-Project-Team is wasting WUs ... !

Takumo
Send message
Joined: 5 Apr 09
Posts: 2
Credit: 4,513,838
RAC: 0
Message 24239 - Posted: 5 Jun 2009 | 10:05:35 UTC

Still the same ... "3s" still coming ...

Profile Neil Polson
Avatar
Send message
Joined: 31 Dec 08
Posts: 9
Credit: 1,332,776
RAC: 0
Message 24241 - Posted: 5 Jun 2009 | 10:11:57 UTC

Just had this 3s invalidate on my P4 CPU. So it's not just a GPU problem, it seems.
____________

Profile Simplex0
Avatar
Send message
Joined: 11 Nov 07
Posts: 232
Credit: 178,221,048
RAC: 0
Message 24242 - Posted: 5 Jun 2009 | 10:21:12 UTC

And same here. I was thinking that it was related to my installation of Folding@home so I removed it, too bad. This project used to be great but now it's just crap with an alpha status that seems to go on forever.

Profile [TiDC] Anlupa
Send message
Joined: 17 Nov 08
Posts: 2
Credit: 33,096,279
RAC: 2,232
Message 24243 - Posted: 5 Jun 2009 | 10:32:36 UTC

Hi everyone!
I've got the same problem with my ATI 4850,
with any version of the application.

Profile Phil
Avatar
Send message
Joined: 13 Feb 08
Posts: 1124
Credit: 46,740
RAC: 0
Message 24245 - Posted: 5 Jun 2009 | 10:38:49 UTC - in response to Message 24242.

And same here. I was thinking that it was related to my installation of Folding@home so I removed it, too bad. This project used to be great but now it's just crap with an alpha status that seems to go on forever.

The work runs fine with the stock application.
When people choose the anonymous platform, unforeseen things happen. Be patient, it'll get fixed.

Profile verstapp
Avatar
Send message
Joined: 26 Jan 09
Posts: 585
Credit: 464,286,454
RAC: 730
Message 24247 - Posted: 5 Jun 2009 | 10:47:55 UTC

But why wasn't it fixed yesterday! :)
____________
Cheers,

PeterV

.

Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 24250 - Posted: 5 Jun 2009 | 11:56:30 UTC

I've aborted 40-50 of the 3s units this morning to go along with the 100 or so of them last night. All GPU crunching grinds to a halt when these hang up. Please stop sending these things out until there is a fix. What happened to testing the work before sending it out to everybody?
____________

4870 GPU
4870 GPU

Profile Berserk_Tux
Avatar
Send message
Joined: 2 Jan 08
Posts: 79
Credit: 365,471,675
RAC: 0
Message 24251 - Posted: 5 Jun 2009 | 12:09:31 UTC - in response to Message 24250.

Oh my god, is anybody home here? Stop sending out 3s WUs now!!!!!

____________

John Clark
Send message
Joined: 4 Oct 08
Posts: 1613
Credit: 62,010,297
RAC: 27,556
Message 24252 - Posted: 5 Jun 2009 | 12:10:48 UTC

Like many others posting in to this thread I have a stalled ATI HD3850 due, I surmise, to the problem ps_sgr_208_3s_etc WUs.

Interestingly, the HD3850 seems to be getting lots of work, but cannot crunch it, as the _3s_ WUs cause the system to stall for some reason or other. When the GPU does crunch, the task ends up terminating as a compute error.

I am going to detach and reattach to allow Milkyway to do CPU crunching as I noticed there is no problem with these _3s_ WUs on them. I can continue this way, at 10% of the GPU RAC, until these _3s_ WUs stop.

I hope Travis or Dave can see a way to sort this problem out in this project soon. It has been at least 2 days now living (or not) with this problem.

:((
____________
Go away, I was asleep


Profile banditwolf
Avatar
Send message
Joined: 12 Nov 07
Posts: 2425
Credit: 295,133
RAC: 0
Message 24255 - Posted: 5 Jun 2009 | 12:48:14 UTC

I didn't have any problems with the six 3s_2 I got. Is it only the 3s_1 giving problems? Could be some still lingering that need to be cancelled.
____________
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Profile verstapp
Avatar
Send message
Joined: 26 Jan 09
Posts: 585
Credit: 464,286,454
RAC: 730
Message 24257 - Posted: 5 Jun 2009 | 13:24:01 UTC

The current workaround is to abort all the _3s_ WUs, close and restart BOINC, and hope you get some _1s_ or _2s_ WUs next time. It's horribly manual and requires keeping an eye on BOINC, which explains all the requests/demands in this thread for Travis to fix it.
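
To take some of the guesswork out of picking which tasks to abort, a small sketch that sorts WU names by stream count. It assumes only what the names posted in this thread suggest: they start with `ps_sgr_`, followed by a number, followed by a `<N>s_` token that gives the number of streams. The function names are hypothetical, and the 2s name in the sample is made up; the 3s name is one reported above.

```python
import re

# Assumed layout: ps_sgr_<number>_<N>s_...; only the stream token is interpreted.
STREAMS = re.compile(r"^ps_sgr_\d+_(\d+)s_")

def stream_count(wu_name):
    """Return the stream count encoded in a WU name, or None if it does not match."""
    match = STREAMS.match(wu_name)
    return int(match.group(1)) if match else None

def tasks_to_abort(wu_names, bad_streams=3):
    """Pick the WUs whose stream count is the one the GPU app chokes on."""
    return [name for name in wu_names if stream_count(name) == bad_streams]

names = [
    "ps_sgr_208_3s_2_9411_1244143134_0",  # reported in this thread
    "ps_sgr_208_2s_2_100_1244143134_0",   # hypothetical 2s example
]
print(tasks_to_abort(names))  # ['ps_sgr_208_3s_2_9411_1244143134_0']
```

This only identifies the candidates; the aborting itself still has to be done in BOINC.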
____________
Cheers,

PeterV

.

Temujin
Send message
Joined: 12 Oct 07
Posts: 77
Credit: 404,471,187
RAC: 0
Message 24258 - Posted: 5 Jun 2009 | 13:36:57 UTC - in response to Message 24257.

The current workaround is to abort all the _3s_ WUs, close and restart BOINC, and hope you get some _1s_ or _2s_ WUs next time. It's horribly manual and requires keeping an eye on BOINC, which explains all the requests/demands in this thread for Travis to fix it.

Come on guys, deep breaths :)
The problem isn't with the workunits, they're fine, so Travis can't fix it.
It's a bug in Cluster Physik's GPU application, as he has already pointed out.
Give him a chance and he'll sort it.

Profile [KWSN]John Galt 007
Avatar
Send message
Joined: 12 Dec 08
Posts: 56
Credit: 136,122,081
RAC: 0
Message 24259 - Posted: 5 Jun 2009 | 13:44:25 UTC - in response to Message 24258.

The current workaround is to abort all the _3s_ WUs, close and restart BOINC, and hope you get some _1s_ or _2s_ WUs next time. It's horribly manual and requires keeping an eye on BOINC, which explains all the requests/demands in this thread for Travis to fix it.

Come on guys, deep breaths :)
The problem isn't with the workunits, they're fine, so Travis can't fix it.
It's a bug in Cluster Physik's GPU application, as he has already pointed out.
Give him a chance and he'll sort it.


The real problem is that if a WU gets sent out to 2 GPU clients, and both abort it, the WU dies from too many errors, so the project suffers.
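
A toy model of that concern, as a sketch only. It assumes the error limit of two mentioned above and treats the first successful result as finishing the WU, which ignores validation and quorum. The function name `workunit_fate` and the outcome labels are made up for illustration and are not the project's server code.

```python
def workunit_fate(outcomes, max_errors=2):
    """Walk the returned results in order and report what happens to the WU.

    Any outcome other than "success" is counted as an error (an abort included).
    """
    errors = 0
    for outcome in outcomes:
        if outcome == "success":
            return "completed"
        errors += 1
        if errors >= max_errors:
            return "too many errors"
    return "still in progress"

print(workunit_fate(["error", "error"]))    # two GPU hosts abort it: too many errors
print(workunit_fate(["error", "success"]))  # a CPU host picks it up: completed
```

Under these assumptions a WU only survives if a working host gets it before the error count is used up.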

Just my $0.02.....
____________
Click to help Seti City.




Profile banditwolf
Avatar
Send message
Joined: 12 Nov 07
Posts: 2425
Credit: 295,133
RAC: 0
Message 24261 - Posted: 5 Jun 2009 | 14:04:58 UTC - in response to Message 24258.


It's a bug in Cluster Physik's GPU application, as he has already pointed out.

Didn't catch that when I read through. I only have a cpu, so no problems.
____________
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Localizer
Send message
Joined: 28 Jan 08
Posts: 39
Credit: 379,928,803
RAC: 1
Message 24262 - Posted: 5 Jun 2009 | 14:05:25 UTC
Last modified: 5 Jun 2009 | 14:05:59 UTC

Hi John - agreed. However, GPU users cannot choose their WUs, and once downloaded these WUs can't be run, so aborting them or leaving them stalled are the only options. Granted, I'm sure that CP will get it sorted, but the project may be best off suspending the generation of this type of WU until it is addressed ......... it can't be good for the returned data to have so many aborted WUs or project resets.

John Vickers
Volunteer moderator
Project developer
Project scientist
Avatar
Send message
Joined: 11 May 09
Posts: 30
Credit: 81,093
RAC: 0
Message 24263 - Posted: 5 Jun 2009 | 14:10:55 UTC
Last modified: 5 Jun 2009 | 14:26:44 UTC

Hello MW@Home,

These *_3s_* runs are 3 stream runs that I started. I will tell Travis that there is a problem with them on the GPUs but not the CPUs and abort said run asap.

Sorry for the inconvenience,
John Vickers

Profile [KWSN]John Galt 007
Avatar
Send message
Joined: 12 Dec 08
Posts: 56
Credit: 136,122,081
RAC: 0
Message 24266 - Posted: 5 Jun 2009 | 14:23:44 UTC - in response to Message 24263.

Hello MW@Home,

These *_3s_* runs are 3 stream runs that I started. I will tell Travis that there is a problem with them on the GPUs but not the CPUs and abort said run asap.

Sorry for the inconvenience,
John Vickers


No probs...

Once CP gets the ATI app sorted out, we will burn thru these like nothing...

And thanks for posting...
____________
Click to help Seti City.




Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 24268 - Posted: 5 Jun 2009 | 14:25:28 UTC - in response to Message 24266.

Looks like the searches are stopped, we'll not do 3 stream runs until the ATI code is fixed :)
____________

Profile Kevint
Avatar
Send message
Joined: 22 Nov 07
Posts: 285
Credit: 1,076,786,368
RAC: 0
Message 24269 - Posted: 5 Jun 2009 | 14:26:44 UTC - in response to Message 24268.

Looks like the searches are stopped, we'll not do 3 stream runs until the ATI code is fixed :)



Ahhhh,,

Just as I was really starting to enjoy them :)...



____________
.

Profile Scarecrow
Avatar
Send message
Joined: 30 Jan 09
Posts: 56
Credit: 73,625
RAC: 0
Message 24270 - Posted: 5 Jun 2009 | 14:46:21 UTC


"booger"



_________________
*** BOFH excuse #309:
firewall needs cooling

Lord Tedric
Avatar
Send message
Joined: 9 Nov 07
Posts: 151
Credit: 8,391,608
RAC: 0
Message 24276 - Posted: 5 Jun 2009 | 16:29:18 UTC

No 3s to be had, everything seems ok at present.
____________

John Clark
Send message
Joined: 4 Oct 08
Posts: 1613
Credit: 62,010,297
RAC: 27,556
Message 24277 - Posted: 5 Jun 2009 | 16:50:03 UTC

I detached and reattached my PC, with the AGP HD3850, to run the WUs on its CPUs.

I now have 12 _3s_2_ WUs to crunch. So, there may be a few of these around still.

I only moved to this about an hour ago, so when this picks up new work I will see whether I only get the _1s_2s, the _2s_1s and the _2s_2s first before moving back to the GPU.
____________
Go away, I was asleep


Profile [P3D] Crashtest
Send message
Joined: 8 Jan 09
Posts: 58
Credit: 16,804,661
RAC: 12
Message 24278 - Posted: 5 Jun 2009 | 17:17:57 UTC

Gipsel got it fixed with version 0.19f:

http://www.file-upload.net/download-1684038/Milkyway_0.19f_ATI.zip.html


Vielen Dank Gipsel !

Divide Overflow
Avatar
Send message
Joined: 16 Feb 09
Posts: 109
Credit: 11,089,510
RAC: 0
Message 24279 - Posted: 5 Jun 2009 | 17:17:59 UTC

Unrelated to the 3s workunits, I've noticed some odd behavior since updating my BOINC client to version 6.6.31. When I get some work from here on the ATI app it sits at ready to start instead of beginning immediately as it used to. I have to pause / resume the other project in progress for it to begin working GPU tasks in parallel with other CPU project work. Re-installing the ATI app did not fix the problem. I'll watch how it behaves for a little while and may have to roll back to an earlier BOINC version. I've left the supplied app_info.xml as default and it's worked fine in the past. Perhaps it may be necessary to specify some values now.

Cluster Physik
Send message
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 24281 - Posted: 5 Jun 2009 | 17:40:23 UTC - in response to Message 24278.

Gipsel got it fixed with version 0.19f

I will make a link out of it.
Version 0.19f of the ATI GPU application is ready for download. Besides the fix for the three-stream WUs, it now reports the GPU time as WU time.

It was a really absurd bug: a missing line break in the GPU assembly (I edited the assembly after prototyping it in a high-level language). Because the offline tools compiled it without a problem (the compiler embedded in the graphics driver is obviously pickier), it was a bit hard to track down in about 40kB of assembly. But it works now.

Profile Simplex0
Avatar
Send message
Joined: 11 Nov 07
Posts: 232
Credit: 178,221,048
RAC: 0
Message 24284 - Posted: 5 Jun 2009 | 17:57:48 UTC - in response to Message 24278.

Gipsel got it fixed with version 0.19f:

http://www.file-upload.net/download-1684038/Milkyway_0.19f_ATI.zip.html


Vielen Dank Gipsel !



Yes! thank you.
You guys are just great and second to none in the BOINC community.
Apparently you and the guys at Stanford are the only ones that know how to make programs that can put the ATI card to good use.

Profile banditwolf
Avatar
Send message
Joined: 12 Nov 07
Posts: 2425
Credit: 295,133
RAC: 0
Message 24287 - Posted: 5 Jun 2009 | 18:06:25 UTC - in response to Message 24270.


"booger"



:p that run of comics is great!


____________
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Profile Phil
Avatar
Send message
Joined: 13 Feb 08
Posts: 1124
Credit: 46,740
RAC: 0
Message 24290 - Posted: 5 Jun 2009 | 18:37:22 UTC - in response to Message 24284.

Gipsel got it fixed with version 0.19f:

http://www.file-upload.net/download-1684038/Milkyway_0.19f_ATI.zip.html


Vielen Dank Gipsel !



Yes! thank you.
You guys are just great and second to none in the BOINC community.
Apparently you and the guys at Stanford are the only ones that know how to make programs that can put the ATI card to good use.

Aah. Not crap anymore then?

Bill592
Avatar
Send message
Joined: 19 May 09
Posts: 30
Credit: 1,062,540
RAC: 0
Message 24305 - Posted: 5 Jun 2009 | 21:40:33 UTC - in response to Message 24281.

Gipsel got it fixed with version 0.19f

I will make a link out of it.
Version 0.19f of the ATI GPU application is ready for download. Besides the fix for the three-stream WUs, it now reports the GPU time as WU time.

It was a really absurd bug, a missing line break in the GPU assembly.




THANK YOU ! Cluster Physik, you are doing a Great job with the ATI Apps !

I wish my other projects like Einstein would implement ATI instead of this
focus on Cuda only nonsense.

Bill

Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 24315 - Posted: 5 Jun 2009 | 23:19:10 UTC
Last modified: 5 Jun 2009 | 23:21:30 UTC

I just had an odd one. I updated to the new version. It finished OK but got 0 credit. Here it is. The odd thing is that it was a ps_sgr_208_2s_2 unit.
____________

4870 GPU
4870 GPU

Profile Labbie
Avatar
Send message
Joined: 29 Aug 07
Posts: 327
Credit: 116,463,193
RAC: 0
Message 24318 - Posted: 5 Jun 2009 | 23:39:54 UTC - in response to Message 24315.

I just had an odd one. I updated to the new version. It finished OK but got 0 credit. Here it is. The odd thing is that it was a ps_sgr_208_2s_2 unit.


If you look in the result, it shows it as Invalid. Why, I don't know.


____________

Calm Chaos Forum...Join Calm Chaos Now

Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 24319 - Posted: 5 Jun 2009 | 23:42:08 UTC - in response to Message 24318.
Last modified: 5 Jun 2009 | 23:42:53 UTC

Me neither. I'm keeping an eye on it for a while to see if any others do the same. That's the first invalid unit I've had here.
____________

4870 GPU
4870 GPU

Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 24331 - Posted: 6 Jun 2009 | 0:23:01 UTC
Last modified: 6 Jun 2009 | 0:23:23 UTC

I've just had another one on a different system. Here it is. I'm not sure what's happening. Both are very stable systems.
____________

4870 GPU
4870 GPU

Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 24347 - Posted: 6 Jun 2009 | 1:18:16 UTC

Here's another
____________

4870 GPU
4870 GPU

Profile Labbie
Avatar
Send message
Joined: 29 Aug 07
Posts: 327
Credit: 116,463,193
RAC: 0
Message 24350 - Posted: 6 Jun 2009 | 1:34:18 UTC

Someone mentioned that the new 0.19f app runs the GPU a little harder and a little hotter. Are your fans clean?


____________

Calm Chaos Forum...Join Calm Chaos Now

Profile Simplex0
Avatar
Send message
Joined: 11 Nov 07
Posts: 232
Credit: 178,221,048
RAC: 0
Message 24358 - Posted: 6 Jun 2009 | 4:50:40 UTC - in response to Message 24290.

Aah. Not crap anymore then?


That was related to the new WUs and the fact that 2 months have passed
since the "Almost there!" message on the 'Milkyway@Home for GPUs' page,
where the GPU application and the discussions about it are supposed to take place.

John Clark
Send message
Joined: 4 Oct 08
Posts: 1613
Credit: 62,010,297
RAC: 27,556
Message 24365 - Posted: 6 Jun 2009 | 8:12:29 UTC - in response to Message 24350.

Someone mentioned that the new 0.19f app runs the GPU a little harder and a little hotter. Are your fans clean?



Yes! Very much so.

Last week I was suffering from overheating, so I stripped the PCs down, took them outside, and cleaned all the machines and their dust bunnies with a compressor. The only PC with a MW-compliant GPU had the GPU removed from the socket and cleaned separately.

Interestingly, the PC is running just as heavily loaded now (98%), but the GPU temperature has dropped back to normal (65C from 72C) and the GPU fan load has dropped to 47% from 55%.

Perhaps it's the cooler weather.
____________
Go away, I was asleep


Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 24383 - Posted: 6 Jun 2009 | 14:41:43 UTC - in response to Message 24350.
Last modified: 6 Jun 2009 | 15:07:34 UTC

Yes, the fans are clean. One is a recent build and the other was just changed over into a new case and everything was cleaned before installing. Under load the cards are reporting temps of ~75C. Neither card is overclocked. Both systems are in Antec900 cases. Plenty of airflow there :-).
I had several more return as invalid overnight. I didn't do a reboot after going to 19f yesterday at 2230. I just rebooted both systems at 1400 UTC today. I'm going to watch them for a bit and see if that makes any difference.

<edit>
Reboot made no difference. Still getting the occasional invalid result. Is anybody else having the same problem? I might suspect it was a computer issue if it wasn't happening on both systems.
____________

4870 GPU
4870 GPU

John Clark
Send message
Joined: 4 Oct 08
Posts: 1613
Credit: 62,010,297
RAC: 27,556
Message 24389 - Posted: 6 Jun 2009 | 15:47:58 UTC - in response to Message 24383.

Yes, the fans are clean. One is a recent build and the other was just changed over into a new case and everything was cleaned before installing. Under load the cards are reporting temps of ~75C. Neither card is overclocked. Both systems are in Antec900 cases. Plenty of airflow there :-).
I had several more return as invalid overnight. I didn't do a reboot after going to 19f yesterday at 2230. I just rebooted both systems at 1400 UTC today. I'm going to watch them for a bit and see if that makes any difference.

<edit>
Reboot made no difference. Still getting the occasional invalid result. Is anybody else having the same problem? I might suspect it was a computer issue if it wasn't happening on both systems.


KWSN imcrazynow

You made me look at my only GPU-driven rig, and I am getting the odd invalid result as well. Three, to be precise, since UK lunchtime (13:30, UTC+1). The number of successful WUs between one invalid result and the next seems to be shrinking.

I wonder what it is?
____________
Go away, I was asleep


Profile Labbie
Avatar
Send message
Joined: 29 Aug 07
Posts: 327
Credit: 116,463,193
RAC: 0
Message 24390 - Posted: 6 Jun 2009 | 17:27:33 UTC

I didn't have any last night, but now I just looked and have 6 out of almost 600 results on two machines.


____________

Calm Chaos Forum...Join Calm Chaos Now

Profile caferace
Avatar
Send message
Joined: 4 Aug 08
Posts: 46
Credit: 8,255,900
RAC: 0
Message 24399 - Posted: 6 Jun 2009 | 20:18:31 UTC

I've had a few between my two GPU boxes. But *maybe* 1 in 50 or less. Example:

Task ID 74360632
Name ps_sgr_235_2s_1_674470_1244317010_0
Workunit 73253956
Created 6 Jun 2009 19:36:53 UTC
Sent 6 Jun 2009 19:39:25 UTC
Received 6 Jun 2009 19:52:15 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 24918
Report deadline 9 Jun 2009 19:39:25 UTC
CPU time 84.34375
stderr out

<core_client_version>6.4.6</core_client_version>
<![CDATA[
<stderr_txt>
Running Milkyway@home ATI GPU application version 0.19f by Gipsel
CPU: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (4 cores/threads) 2.40002 GHz (259ms)

CAL Runtime: 1.3.145
Found 1 CAL device

Device 0: ATI Radeon HD 3800 (RV670) 512 MB local RAM (remote 28 MB cached + 512 MB uncached)
GPU core clock: 669 MHz, memory clock: 829 MHz
320 shader units organized in 4 SIMDs with 16 VLIW units (5-issue), wavefront size 64 threads
supporting double precision

3 WUs already running on GPU 0
No free GPU! Waiting ... 461.688 seconds.
Starting WU on GPU 0

main integral, 160 iterations
predicted runtime per iteration is 408 ms (33.3333 ms are allowed), dividing each iteration in 13 parts
borders of the domains at 0 128 248 376 496 616 744 864 984 1112 1232 1360 1480 1600
Calculated about 3.70012e+012 floatingpoint ops on GPU, 6.34181e+007 on FPU. Approximate GPU time 84.3438 seconds.

probability calculation (stars)
Calculated about 1.20373e+009 floatingpoint ops on FPU.

WU completed.
CPU time: 19.9844 seconds, GPU time: 84.3438 seconds, wall clock time: 716.234 seconds, CPU frequency: 2.40012 GHz

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 0.364573240852738
Granted credit 0
application version 0.19
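
A side note on the "dividing each iteration in 13 parts" line in the stderr above: the count is consistent with splitting the predicted per-iteration runtime into slices no longer than the allowed time. The following is only a minimal sketch under that assumption; the function name is made up for illustration, and nothing in this thread confirms that the app computes it this way.

```python
import math


def parts_per_iteration(predicted_ms, allowed_ms):
    """Smallest number of equal slices so that each slice fits in allowed_ms.

    Assumption: parts = ceil(predicted / allowed). This is a guess that
    merely agrees with the stderr output quoted above.
    """
    return max(1, math.ceil(predicted_ms / allowed_ms))


# Values from the HD 3800 log: 408 ms predicted, 33.3333 ms allowed.
print(parts_per_iteration(408, 33.3333))  # prints 13
```

The sketch does not try to reproduce the "borders of the domains" values, since the rounding rule behind them is not stated anywhere in the log.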

-----

-jim

Profile [KWSN]John Galt 007
Avatar
Send message
Joined: 12 Dec 08
Posts: 56
Credit: 136,122,081
RAC: 0
Message 24400 - Posted: 6 Jun 2009 | 20:29:07 UTC

I have had a few over 4 different 3850 cards, but am not really worrying about it...Probably over 99% have completed successfully...
____________
Click to help Seti City.




John Clark
Send message
Joined: 4 Oct 08
Posts: 1613
Credit: 62,010,297
RAC: 27,556
Message 24415 - Posted: 6 Jun 2009 | 23:51:57 UTC

CPUs are having invalid results as well. On my old dual P3, about 1 in 4 of the results returned here is invalid.
____________
Go away, I was asleep


Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 24417 - Posted: 6 Jun 2009 | 23:56:53 UTC
Last modified: 6 Jun 2009 | 23:59:33 UTC

I'm still getting quite a few. Maybe around 5%, which is still a lot as far as the project goes. Somebody may want to look at why it's happening. On the grand scale, that could be quite a lot of work that goes missing if it's not corrected. Insta purge isn't helping any. If they could see why they're failing, it would help.
____________

4870 GPU
4870 GPU

Profile Labbie
Avatar
Send message
Joined: 29 Aug 07
Posts: 327
Credit: 116,463,193
RAC: 0
Message 24423 - Posted: 7 Jun 2009 | 1:14:12 UTC

I'm now showing 7 on my CPU machine that were not there this morning.


____________

Calm Chaos Forum...Join Calm Chaos Now

Divide Overflow
Avatar
Send message
Joined: 16 Feb 09
Posts: 109
Credit: 11,089,510
RAC: 0
Message 24438 - Posted: 7 Jun 2009 | 3:50:23 UTC
Last modified: 7 Jun 2009 | 3:50:55 UTC

If the rare invalid result is occurring on both CPU and GPU crunched tasks, perhaps there is still a fundamental issue with some of the new work we're getting.

I see a small percentage of my work report as invalid as well.

Profile The Gas Giant
Avatar
Send message
Joined: 24 Dec 07
Posts: 1947
Credit: 240,865,573
RAC: 0
Message 24439 - Posted: 7 Jun 2009 | 5:10:14 UTC

I'm seeing approx 10% failure rate on my 4870.

Profile Neil Polson
Avatar
Send message
Joined: 31 Dec 08
Posts: 9
Credit: 1,332,776
RAC: 0
Message 24442 - Posted: 7 Jun 2009 | 5:39:23 UTC
Last modified: 7 Jun 2009 | 5:44:15 UTC

It seems overnight all my 2s_2 have been marked as invalid (Have had 5 of them). All on cpu. No problems with the other searches.
____________

Profile The Gas Giant
Avatar
Send message
Joined: 24 Dec 07
Posts: 1947
Credit: 240,865,573
RAC: 0
Message 24444 - Posted: 7 Jun 2009 | 6:33:57 UTC

My work E6850 (cpu) has about a 5% to 10% error rate.

John Clark
Send message
Joined: 4 Oct 08
Posts: 1613
Credit: 62,010,297
RAC: 27,556
Message 24449 - Posted: 7 Jun 2009 | 8:23:36 UTC

The error rate on my 3850 seems to have disappeared, but risen on the CPUs to about 5%.
____________
Go away, I was asleep


Chris S
Avatar
Send message
Joined: 20 Sep 08
Posts: 1357
Credit: 173,075,472
RAC: 9
Message 24451 - Posted: 7 Jun 2009 | 8:41:39 UTC

My card has not had any errors since updating to 0.19f, but my CPUs are hardly getting any work at all, so I can't comment on them. The last 2s unit that errored out resulted in 3 screens' worth of debugging information in the output file. If anyone wants it, I'll pass it on. Hopefully when the GPU project is able to start up, all this will go away.

Profile The Gas Giant
Avatar
Send message
Joined: 24 Dec 07
Posts: 1947
Credit: 240,865,573
RAC: 0
Message 24457 - Posted: 7 Jun 2009 | 9:40:14 UTC
Last modified: 7 Jun 2009 | 10:01:54 UTC

Just a few of the invalid wu's from my 4870

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73584677
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73571669
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73544740
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73544745
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73571652
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73571666
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73571667
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73544733
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73501652
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73501661


And a couple from my cpu

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73422247
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=73422257

Profile Bruce
Avatar
Send message
Joined: 28 Apr 08
Posts: 1415
Credit: 2,716,428
RAC: 0
Message 24458 - Posted: 7 Jun 2009 | 9:48:53 UTC
Last modified: 7 Jun 2009 | 10:05:43 UTC

Yep I've got a few invalid Wu's too, all are 2_2's.
;-(
____________

Seejay
Avatar
Send message
Joined: 22 Dec 07
Posts: 51
Credit: 2,405,016
RAC: 0
Message 24462 - Posted: 7 Jun 2009 | 11:14:57 UTC - in response to Message 24458.

Yep I've got a few invalid Wu's too, all are 2_2's.
;-(


Me too - all CPU - all 2_2s.

____________
Seejay **Proud Member and Founder of BOINC Team Allprojectstats.com**

John Vickers
Volunteer moderator
Project developer
Project scientist
Avatar
Send message
Joined: 11 May 09
Posts: 30
Credit: 81,093
RAC: 0
Message 24563 - Posted: 8 Jun 2009 | 14:05:58 UTC
Last modified: 8 Jun 2009 | 14:50:52 UTC

Hello MW@Home,

Can you please tell me if these _2s_2 runs that were returning errors are all "ps_sgr_208_2s_2", all "ps_sgr_235_2s_1" or a mix of both?

Thank You,
John Vickers

Edit: typo: "ps_sgr_235_2s_2" - > "ps_sgr_235_2s_1"

Brian Silvers
Send message
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 24564 - Posted: 8 Jun 2009 | 14:12:47 UTC - in response to Message 24563.

Hello MW@Home,

Can you please tell me if these _2s_2 runs that were returning errors are all "ps_sgr_208_2s_2", all "ps_sgr_235_2s_2" or a mix of both?

Thank You,
John Vickers


It would help somewhat if the purge delay could be increased some more, that way there would be more results for people to look through to find an answer to your question. Alternatively, you could write a SQL query to run against the results database periodically to gather the data from your side.
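
To make the SQL suggestion concrete, here is a rough sketch of such a periodic query, run against an in-memory SQLite table. Everything about the schema is an assumption for illustration: the table name `result`, the columns `name` and `validate_state`, and the text values 'valid'/'invalid' are hypothetical stand-ins, not the project's actual database layout. The sample rows reuse task names that appear in this thread.

```python
import sqlite3


def invalid_by_search(conn):
    """Return (search, total, invalid) rows, grouped by the name prefix.

    Assumption: the first 15 characters of a task name (for example
    'ps_sgr_208_2s_2') identify the search it belongs to.
    """
    query = """
        SELECT substr(name, 1, 15) AS search,
               COUNT(*) AS total,
               SUM(validate_state = 'invalid') AS invalid
        FROM result
        GROUP BY search
        ORDER BY search
    """
    return conn.execute(query).fetchall()


# Hypothetical, heavily simplified stand-in for the real results table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (name TEXT, validate_state TEXT)")
conn.executemany(
    "INSERT INTO result VALUES (?, ?)",
    [
        ("ps_sgr_208_2s_2_1637695_1244467419_0", "invalid"),
        ("ps_sgr_208_2s_2_1615702_1244463950_0", "invalid"),
        ("ps_sgr_235_2s_1_1572923_1244457187", "invalid"),
        ("ps_sgr_235_2s_6_1603847_1244722735_0", "valid"),
    ],
)

for row in invalid_by_search(conn):
    print(row)
```

Running something like this on the server side before results are purged would answer the "which searches fail" question without depending on what volunteers can still see in their result lists.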

Profile Crunch3r
Volunteer developer
Avatar
Send message
Joined: 17 Feb 08
Posts: 358
Credit: 256,958,531
RAC: 3,213
Message 24566 - Posted: 8 Jun 2009 | 14:18:43 UTC - in response to Message 24563.

Hello MW@Home,

Can you please tell me if these _2s_2 runs that were returning errors are all "ps_sgr_208_2s_2", all "ps_sgr_235_2s_2" or a mix of both?

Thank You,
John Vickers


From what i've seen only "ps_sgr_208_2s_2" cause errors.
____________

Join BOINC United now!

Cluster Physik
Send message
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 24568 - Posted: 8 Jun 2009 | 14:25:38 UTC - in response to Message 24563.
Last modified: 8 Jun 2009 | 14:27:44 UTC

Can you please tell me if these _2s_2 runs that were returning errors are all "ps_sgr_208_2s_2", all "ps_sgr_235_2s_2" or a mix of both?

Most of the invalid WUs are ps_sgr_208_2s_2 ones, with a few ps_sgr_235_2s_1 amongst them (ratio 15:1 or so). I haven't found a single failed ps_sgr_235_2s_2 WU (in about 900 I looked at).

I would think the outlier detection may be too picky for some of the WUs, meaning that some correct results also get rejected.

Profile krahulik
Send message
Joined: 7 Nov 08
Posts: 14
Credit: 179,303,710
RAC: 3
Message 24584 - Posted: 8 Jun 2009 | 16:51:53 UTC

Hello MW@Home,
Can you please tell me if these _2s_2 runs that were returning errors are all "ps_sgr_208_2s_2", all "ps_sgr_235_2s_1" or a mix of both?
Thank You,
John Vickers
Edit: typo: "ps_sgr_235_2s_2" - > "ps_sgr_235_2s_1"

In my case, the invalid WUs (the ones returning errors) are ps_sgr_235_2s_1 and, in most cases, ps_sgr_208_2s_2.

Also, no failed ps_sgr_235_2s_2 found.

Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 24587 - Posted: 8 Jun 2009 | 17:43:47 UTC

These are the ones I could get before insta purge took care of them.

Host 39176 GPU
ps_sgr_208_2s_2_1637698_1244467419
ps_sgr_208_2s_2_1637697_1244467419
ps_sgr_208_2s_2_1637695_1244467419_0
ps_sgr_208_2s_2_1624691_1244465356_0
ps_sgr_208_2s_2_1615716_1244463950
ps_sgr_208_2s_2_1615702_1244463950_0

Host 60779 GPU
ps_sgr_208_2s_2_1641829_1244468066
ps_sgr_208_2s_2_1628695_1244465990
ps_sgr_208_2s_2_1628692_1244465990
ps_sgr_208_2s_2_1623219_1244465117
ps_sgr_235_2s_1_1572923_1244457187
ps_sgr_208_2s_2_308822_1244201732

Host 39247 CPU
ps_sgr_208_2s_2_1598226_1244461206
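
For anyone who wants to tally lists like the one above by search, here is a small sketch. It assumes the search name is everything up to and including the `_<n>s_<m>` part of the task name; the regular expression and the function name are illustrative only.

```python
import re
from collections import Counter

# Assumed naming pattern: ps_sgr_<stripe>_<streams>s_<run>_<id>_<timestamp>[_<n>]
SEARCH_RE = re.compile(r"^(ps_sgr_\d+_\d+s_\d+)_")


def tally_searches(names):
    """Count task names per search prefix; unmatched names go to 'unknown'."""
    counts = Counter()
    for name in names:
        match = SEARCH_RE.match(name)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# The invalid task names listed in this post (all three hosts).
invalid_names = """
ps_sgr_208_2s_2_1637698_1244467419
ps_sgr_208_2s_2_1637697_1244467419
ps_sgr_208_2s_2_1637695_1244467419_0
ps_sgr_208_2s_2_1624691_1244465356_0
ps_sgr_208_2s_2_1615716_1244463950
ps_sgr_208_2s_2_1615702_1244463950_0
ps_sgr_208_2s_2_1641829_1244468066
ps_sgr_208_2s_2_1628695_1244465990
ps_sgr_208_2s_2_1628692_1244465990
ps_sgr_208_2s_2_1623219_1244465117
ps_sgr_235_2s_1_1572923_1244457187
ps_sgr_208_2s_2_308822_1244201732
ps_sgr_208_2s_2_1598226_1244461206
""".split()

for search, count in tally_searches(invalid_names).most_common():
    print(search, count)
```

For this list the tally comes out at 12 ps_sgr_208_2s_2 against 1 ps_sgr_235_2s_1.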
____________

4870 GPU
4870 GPU

Profile [KWSN]John Galt 007
Avatar
Send message
Joined: 12 Dec 08
Posts: 56
Credit: 136,122,081
RAC: 0
Message 24591 - Posted: 8 Jun 2009 | 18:37:54 UTC - in response to Message 24587.

These are the ones I could get before insta purge took care of them.

Host 39176 GPU
ps_sgr_208_2s_2_1637698_1244467419
ps_sgr_208_2s_2_1637697_1244467419
ps_sgr_208_2s_2_1637695_1244467419_0
ps_sgr_208_2s_2_1624691_1244465356_0
ps_sgr_208_2s_2_1615716_1244463950
ps_sgr_208_2s_2_1615702_1244463950_0

Host 60779 GPU
ps_sgr_208_2s_2_1641829_1244468066
ps_sgr_208_2s_2_1628695_1244465990
ps_sgr_208_2s_2_1628692_1244465990
ps_sgr_208_2s_2_1623219_1244465117
ps_sgr_235_2s_1_1572923_1244457187
ps_sgr_208_2s_2_308822_1244201732

Host 39247 CPU
ps_sgr_208_2s_2_1598226_1244461206


All but one of my 0 credits have come on the 208_2s_2 WUs as well...too bad instapurge will get them shortly...I didn't see any in the most recent results...

____________
Click to help Seti City.




Profile [AF>HFR>RR] ThierryH
Send message
Joined: 2 Jan 08
Posts: 23
Credit: 489,580,124
RAC: 144,087
Message 24593 - Posted: 8 Jun 2009 | 19:09:30 UTC - in response to Message 24268.

Looks like the searches are stopped, we'll not do 3 stream runs until the ATI code is fixed :)


Thanks Travis. It was a very good decision.
Now, Cluster Physik gave us fixed code 3 days ago. Perhaps it's time to restart the 3 stream runs. This kind of WU takes longer to calculate than the others. It could give more work for everyone.

Thank you,
Thierry.
____________

Profile Yankton
Send message
Joined: 28 Sep 08
Posts: 1
Credit: 136,165
RAC: 0
Message 24725 - Posted: 9 Jun 2009 | 21:08:38 UTC

I'm getting a lot of SIGSEGV errors on 2s-4 and 2s-6.

Is this a related problem?

Profile Crunch3r
Volunteer developer
Avatar
Send message
Joined: 17 Feb 08
Posts: 358
Credit: 256,958,531
RAC: 3,213
Message 24726 - Posted: 9 Jun 2009 | 21:18:25 UTC - in response to Message 24725.
Last modified: 9 Jun 2009 | 21:19:23 UTC

I'm getting a lot of SIGSEGV errors on 2s-4 and 2s-6.

Is this a related problem?


No. It's Ubuntu causing it. Get a proper linux distribution or downgrade to 8.xx.
____________

Join BOINC United now!

Profile verstapp
Avatar
Send message
Joined: 26 Jan 09
Posts: 585
Credit: 464,286,454
RAC: 730
Message 24728 - Posted: 9 Jun 2009 | 21:28:15 UTC

_3s_ WUs crunching madly with Cluster's v.0.19f.
____________
Cheers,

PeterV

.

Profile Phil
Avatar
Send message
Joined: 13 Feb 08
Posts: 1124
Credit: 46,740
RAC: 0
Message 24731 - Posted: 9 Jun 2009 | 21:33:15 UTC - in response to Message 24726.
Last modified: 9 Jun 2009 | 21:34:36 UTC

I'm getting a lot of SIGSEGV errors on 2s-4 and 2s-6.

Is this a related problem?


No. It's Ubuntu causing it. Get a proper linux distribution or downgrade to 8.xx.

Ha, that was the conclusion I came to as well ;-)
It runs einstein okay but practically nothing else.

Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 24795 - Posted: 10 Jun 2009 | 11:58:38 UTC
Last modified: 10 Jun 2009 | 12:18:24 UTC

Something caused both of my GPU crunchers to freeze up around 5:30 UTC. I say that time because that was when the last result remaining on my systems was sent out to me. I have no idea what it was. That was 12:30 AM my time. I just got up this AM and found it. I don't see any errors, but insta purge probably took the units away before I could see them. Might be something worth looking into.

<edit>
It was only Milkyway that froze. I'm also running Prime Grid on both systems. It was still running.
<edit 2> It just froze up on one machine again. It was running a ps_sgr_208_3s_6, a ps_sgr_210_3s_5 and a ps_sgr_235_2s_6 when it froze. This is definitely worth looking into.
<edit 3> It was either the 210_3 or the 235_2 that locked it up.
____________

4870 GPU
4870 GPU

John Vickers
Volunteer moderator
Project developer
Project scientist
Avatar
Send message
Joined: 11 May 09
Posts: 30
Credit: 81,093
RAC: 0
Message 24912 - Posted: 11 Jun 2009 | 2:10:13 UTC
Last modified: 11 Jun 2009 | 2:11:07 UTC

KWSN,

Are you using the code recently released by Cluster Physik ( http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=886#24282 ) that included a fix for 3 stream runs on ATI GPUs? This is most likely the issue if it's the *_3s_* runs crashing and only on GPU.

Thanks,
John Vickers

Lord Tedric
Avatar
Send message
Joined: 9 Nov 07
Posts: 151
Credit: 8,391,608
RAC: 0
Message 24924 - Posted: 11 Jun 2009 | 7:00:06 UTC

I'm not so much getting a system freeze as a GPU reset! These seem to be happening with the 3s, specifically the '3s 6'. I have seen this happen with my own eyes. I haven't noticed it on '3s 5', but will be watching. It's difficult to catch, as it has only happened on three separate occasions in the last 36 hours.
____________

Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 24950 - Posted: 11 Jun 2009 | 12:14:51 UTC

Yes, I am using the latest version, 0.19f. I was able to play around with it a little more last night. It seems that absolutely everything is running at high priority for some reason. I made no changes to my BOINC preferences either. I also found that if I suspend my other project (Prime Grid), everything starts back up. That will, however, have a negative impact on PG. None of this started happening until a recent Windows update. I'm very much open to suggestions on how to correct it. I thought I might try to reinstall 19f as soon as I get a chance, in case something got messed up with the update. If that doesn't work, maybe reinstalling BOINC. The two systems are running 6.4.7
____________

4870 GPU
4870 GPU

Profile [KWSN]John Galt 007
Avatar
Send message
Joined: 12 Dec 08
Posts: 56
Credit: 136,122,081
RAC: 0
Message 24971 - Posted: 11 Jun 2009 | 13:56:45 UTC - in response to Message 24950.

Yes, I am using the latest version, 0.19f. I was able to play around with it a little more last night. It seems that absolutely everything is running at high priority for some reason. I made no changes to my BOINC preferences either. I also found that if I suspend my other project (Prime Grid), everything starts back up. That will, however, have a negative impact on PG. None of this started happening until a recent Windows update. I'm very much open to suggestions on how to correct it. I thought I might try to reinstall 19f as soon as I get a chance, in case something got messed up with the update. If that doesn't work, maybe reinstalling BOINC. The two systems are running 6.4.7


I have seen that with my 4850 in my i7. It seems like the WU hangs at some point, either from the CPU getting overloaded (all MW tasks 'running' but only 3 crunching) or BOINC trying to do task switching.
____________
Click to help Seti City.




JAMC
Send message
Joined: 9 Sep 08
Posts: 96
Credit: 336,443,946
RAC: 0
Message 24976 - Posted: 11 Jun 2009 | 14:34:29 UTC

I am getting the 0.19f GPU lock up as well on 2 different pc's this am. MW GPU app locks up but the SETI AP CPU keeps going without problems. Stop BOINC and start BOINC again gets the GPU app going again. I am also getting High Priority running on GPU WU's as well.

Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 25004 - Posted: 11 Jun 2009 | 17:24:11 UTC
Last modified: 11 Jun 2009 | 17:26:07 UTC

Look at this one. Note the GPU time and the wall clock time.
<EDIT>
I would definitely call that a hang!

Task ID 77725727
Name ps_sgr_235_2s_6_1603847_1244722735_0
Workunit 76446733
Created 11 Jun 2009 12:18:59 UTC
Sent 11 Jun 2009 12:20:07 UTC
Received 11 Jun 2009 15:50:10 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 39176
Report deadline 14 Jun 2009 12:20:07 UTC
CPU time 11878.53
stderr out <core_client_version>6.4.7</core_client_version>
<![CDATA[
<stderr_txt>
Running Milkyway@home ATI GPU application version 0.19f by Gipsel
CPU: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (4 cores/threads) 3.54598 GHz (227ms)

CAL Runtime: 1.3.145
Found 1 CAL device

Device 0: ATI Radeon HD 4800 (RV770) 1024 MB local RAM (remote 28 MB cached + 1024 MB uncached)
GPU core clock: 750 MHz, memory clock: 900 MHz
800 shader units organized in 10 SIMDs with 16 VLIW units (5-issue), wavefront size 64 threads
supporting double precision

3 WUs already running on GPU 0
No free GPU! Waiting ... 93.7969 seconds.
Starting WU on GPU 0

main integral, 160 iterations
predicted runtime per iteration is 145 ms (33.3333 ms are allowed), dividing each iteration in 5 parts
borders of the domains at 0 320 640 960 1280 1600
Calculated about 3.70012e+012 floatingpoint ops on GPU, 6.34181e+007 on FPU. Approximate GPU time 11878.5 seconds.

probability calculation (stars)
Calculated about 1.20373e+009 floatingpoint ops on FPU.

WU completed.
CPU time: 1.35938 seconds, GPU time: 11878.5 seconds, wall clock time: 12032.2 seconds, CPU frequency: 3.546 GHz

</stderr_txt>
]]>

Validate state Valid
Claimed credit 82.9045335151938
Granted credit 27.75994
application version 0.19
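
A quick way to spot this kind of hang from the stderr text alone is to compare the reported GPU time with the runtime implied by the app's own prediction (iterations times predicted time per iteration). The sketch below is only an illustration: the function name is made up, and the regular expressions assume the exact wording of the 0.19f output shown above.

```python
import re


def hang_ratio(stderr_text):
    """Reported GPU time divided by (iterations * predicted time per iteration)."""
    iterations = int(
        re.search(r"main integral, (\d+) iterations", stderr_text).group(1)
    )
    per_iteration_ms = float(
        re.search(r"predicted runtime per iteration is ([\d.]+) ms", stderr_text).group(1)
    )
    gpu_seconds = float(
        re.search(r"GPU time: ([\d.]+) seconds", stderr_text).group(1)
    )
    expected_seconds = iterations * per_iteration_ms / 1000.0
    return gpu_seconds / expected_seconds


# Relevant lines copied from the task report above.
sample = """main integral, 160 iterations
predicted runtime per iteration is 145 ms (33.3333 ms are allowed), dividing each iteration in 5 parts
CPU time: 1.35938 seconds, GPU time: 11878.5 seconds, wall clock time: 12032.2 seconds, CPU frequency: 3.546 GHz
"""

print(round(hang_ratio(sample)))  # prints 512
```

The prediction works out to 160 x 145 ms = 23.2 seconds, so the reported 11878.5 seconds is roughly 500 times longer than expected. What ratio should count as a "hang" is a judgment call; the sketch only reports the number.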
____________

4870 GPU
4870 GPU

Profile Kevint
Avatar
Send message
Joined: 22 Nov 07
Posts: 285
Credit: 1,076,786,368
RAC: 0
Message 25009 - Posted: 11 Jun 2009 | 17:59:36 UTC
Last modified: 11 Jun 2009 | 18:01:01 UTC

Imcrazy,

This happens quite a bit on hosts that are shared with other projects.

It seems that the shorter the other projects' WUs, the more MW hangs.

I believe it has something to do with the way BOINC handles debt.

I think you mentioned you are also crunching Prime Grid and Aqua. The shorter WUs will suspend your MW WUs until your short-term and long-term debt is cleared.

The new multi-threaded Aqua can play havoc with the ATI app, since an Aqua WU now wants to use multiple CPUs and will occasionally put MW in suspend mode.



To test this, when you see MW hung up, just suspend the other projects, MW should take off and start crunching again without having to reset or reboot your box
____________
.

Profile [KWSN]John Galt 007
Avatar
Send message
Joined: 12 Dec 08
Posts: 56
Credit: 136,122,081
RAC: 0
Message 25011 - Posted: 11 Jun 2009 | 18:12:26 UTC - in response to Message 25009.

Imcrazy,

This happens quite a bit on hosts that are shared with other projects.

It seems that the shorter the other projects' WUs, the more MW hangs.

I believe it has something to do with the way BOINC handles debt.

I think you mentioned you are also crunching Prime Grid and Aqua. The shorter WUs will suspend your MW WUs until your short-term and long-term debt is cleared.

The new multi-threaded Aqua can play havoc with the ATI app, since an Aqua WU now wants to use multiple CPUs and will occasionally put MW in suspend mode.



To test this, when you see MW hung up, just suspend the other projects, MW should take off and start crunching again without having to reset or reboot your box


Thanks, Kevin...a good explanation, since I am running PG on my i7 with the GPU doing MW, and I see the PSP sieve WUs jumping into EDF mode, even though the due date is 7 days off and I have a .5 day cache.
____________
Click to help Seti City.




Profile Spankinmonkee [TopGun] Division SETI.USA
Avatar
Send message
Joined: 22 Mar 08
Posts: 38
Credit: 48,760,692
RAC: 0
Message 25031 - Posted: 11 Jun 2009 | 18:48:30 UTC
Last modified: 11 Jun 2009 | 18:48:56 UTC

I've had 3 systems hang up also...and I'm not running any other project.

Do you think the shorter WU that Travis took care of was causing this?

Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 25054 - Posted: 11 Jun 2009 | 20:11:10 UTC - in response to Message 25009.

The only other project that I have running on that system is Prime Grid. After the Windows update it seems like everything, PG and MW, went into high priority. It did that on both quads. Last night I reset my debts to 0 on one system using BOINC DV. I'll try that on the other (the one this long-winded WU came from) tonight.
____________

4870 GPU
4870 GPU

Profile KWSN imcrazynow
Avatar
Send message
Joined: 22 Nov 08
Posts: 136
Credit: 220,420,807
RAC: 47,088
Message 25853 - Posted: 18 Jun 2009 | 0:12:51 UTC - in response to Message 25009.

Thanks for the info! I did notice that I could suspend the Prime Grid PSP Sieve units and MW would immediately start back up and run for an extended period. Now it's happening with Prime Grid PSP LLR units (Prime Grid Challenge). Those are not exactly short-running tasks, at 30+ hours each. I'm only running 3 at a time to leave 1 core free for MW. I did start a new thread, "Hanging Work Units". Please look there for any new developments or suggestions.
____________

4870 GPU
4870 GPU

PeteS
Send message
Joined: 19 Mar 09
Posts: 27
Credit: 117,633,230
RAC: 0
Message 26680 - Posted: 29 Jun 2009 | 6:30:11 UTC

Can you people using 3850 tell what settings you are using in app_info.xml? I have tried many different settings, but still get VPU Recovery events..

I am running on Win7 64bit, ATI 0.19f and BOINC 6.6.36

Everything is fine and the WU's are processed at peak efficiency IF I don't do anything on the computer. But if it is used normally (watch videos, browse web with Firefox etc) then I constantly get blank screen+VPU recovery+jammed WU that I have to either kill with task manager or restart BOINC. I don't seem to have this problem on my 48xx series cards only the 3850.

I get somewhat better functionality (still problems but less) with f60 w1.7 n1, but then GPU utilization is only 50-60%.

Haksu
Send message
Joined: 28 Nov 08
Posts: 4
Credit: 17,795,128
RAC: 0
Message 26683 - Posted: 29 Jun 2009 | 7:34:32 UTC - in response to Message 26680.

Hi
This might be of only a little help, as both of my PCs with a 3850 are old and dedicated to crunching, but anyway..
I have two AGP 3850s on MW, one in an AMD Athlon box and the other in an Intel Celeron box, both running XP Home.
Both are running on standard settings except n2, as the cards are 512 MB and the motherboards have only 256 MB each.
Basic functionality (web etc.) is OK, even though, as I said, these machines are dedicated to MW. They were pretty much taken back into use once I noticed that I could fit an AGP 3850 into them and do some crunching.



Copyright © 2013 AstroInformatics Group