Welcome to MilkyWay@home

Sudden mass of WU's finishing with Computation Error


Aaron Nancarrow

Joined: 21 Jun 09
Posts: 4
Credit: 3,915,585
RAC: 0
Message 33905 - Posted: 28 Nov 2009, 22:22:38 UTC

Ok, so I checked my system this morning, and found that overnight, a serious amount of WU's for MilkyWay had resulted in Computation Error, all with 0s elapsed.

Anyone else experiencing this?

I'm running BOINC 6.10.18
MilkyWay app is 0.21 (cuda23)
Running NVidia driver 191.07 - haven't had time to update to 195.62 as yet
OS: Win7 64bit
ID: 33905 · Rating: 0

David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 33912 - Posted: 29 Nov 2009, 1:21:52 UTC - in response to Message 33905.  

Ok, so I checked my system this morning, and found that overnight, a serious amount of WU's for MilkyWay had resulted in Computation Error, all with 0s elapsed.

Anyone else experiencing this?

I'm running BOINC 6.10.18
MilkyWay app is 0.21 (cuda23)
Running NVidia driver 191.07 - haven't had time to update to 195.62 as yet
OS: Win7 64bit


Happened to me the other day. I reverted to Vista and the problem went away.

PS: I tried the 195.62 driver and that just caused VDU crashes all the time. So much so that out of 48 WU's only 10 did NOT error.
ID: 33912 · Rating: 0

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 33914 - Posted: 29 Nov 2009, 1:32:50 UTC

Not a mass, but I also had four of them on my GTX260 with 191.07 on Friday afternoon (UTC). OS is WinXP32 SP2 with BOINC 6.10.17. All errors were 0x1, invalid function.
ID: 33914 · Rating: 0

JockMacMad TSBT
Joined: 28 Jan 09
Posts: 31
Credit: 85,059,711
RAC: 0
Message 33919 - Posted: 29 Nov 2009, 2:37:13 UTC
Last modified: 29 Nov 2009, 2:38:39 UTC

Yes I am.

Only on one machine, though, but every unit dies. Reinstalled every ATI driver, from the 8.6x driver to Catalyst 9.4 through 9.11, to no avail.

Redownloaded the apps. Checked for the right version, i.e. ATI or AMD.

For me it's only the machine with ATI and nVidia in it, and crappy cards, but it's driving me mad.

02:36 am and I am still here fighting it.

Oh, and the Collatz optimised app is fine, grrr
ID: 33919 · Rating: 0

Aaron Nancarrow
Joined: 21 Jun 09
Posts: 4
Credit: 3,915,585
RAC: 0
Message 33984 - Posted: 30 Nov 2009, 8:42:11 UTC

Still happening - every single WU - must be almost 1000 in total by now. Gonna hafta suspend this project 'til I figure out what's going down.
ID: 33984 · Rating: 0

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 34017 - Posted: 30 Nov 2009, 19:24:39 UTC - in response to Message 33984.  

Still happening - every single WU - must be almost 1000 in total by now. Gonna hafta suspend this project 'til I figure out what's going down.


When did this start? With the recent batch of new workunits? What GPU are you using?
ID: 34017 · Rating: 0

banditwolf
Joined: 12 Nov 07
Posts: 2425
Credit: 524,164
RAC: 0
Message 34018 - Posted: 30 Nov 2009, 19:36:40 UTC

@ Aaron: Your result gives this:
Can't get shared memory segment name: shmget() failed
I remember this problem, but can't remember the solution.



@ Travis: If you look at his first message, it is from a day ago, so it wouldn't be the new tasks.
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.
ID: 34018 · Rating: 0

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 34019 - Posted: 30 Nov 2009, 19:40:15 UTC - in response to Message 34018.  


@ Travis: If you look at his first message, it is from a day ago, so it wouldn't be the new tasks.


Yeah I just saw that. Not getting shared memory sounds like a problem with the BOINC client.
ID: 34019 · Rating: 0

redgoldendragon
Joined: 18 Sep 09
Posts: 1
Credit: 694,971
RAC: 0
Message 34033 - Posted: 30 Nov 2009, 21:53:37 UTC

Good evening guys (and girls).

Starting with the increased size, all new workunits end up as invalid.
In the meantime I have stopped the project.
Any clues?


Thanks a lot.
ID: 34033 · Rating: 0

Paul D. Buck
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 34056 - Posted: 1 Dec 2009, 8:40:11 UTC
Last modified: 1 Dec 2009, 8:44:34 UTC

Well, I cannot see any difference between my tasks that complete successfully and those that fail. I put the one system into NNT ... GTX295 cards are what I see the failures on. Just checked, my dual GTX260 system is also showing a high failure rate.

Most interesting is that the system with a GTX280 is not showing a single failure.

My ATI system is running off Collatz tasks so can't check there till tomorrow sometime.

The tasks complete successfully but fail validation. Most interestingly I have had some tasks validate...

Not enough information for me to tell you more ...
ID: 34056 · Rating: 0

verstapp
Joined: 26 Jan 09
Posts: 589
Credit: 497,834,261
RAC: 0
Message 34057 - Posted: 1 Dec 2009, 10:36:51 UTC

6 x 4870s here - no invalid tasks.
I think the Nv in your brand name is the problem. :)
Cheers,

PeterV

.
ID: 34057 · Rating: 0

Paul D. Buck
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 34061 - Posted: 1 Dec 2009, 13:34:59 UTC - in response to Message 34057.  

6 x 4870s here - no invalid tasks.
I think the Nv in your brand name is the problem. :)

Doubt it ...

Tried running some of the ATI tasks I have on hand but they all seem to be the short ones. So, no comparison.

My GTX280 card seems to be running the long tasks just fine. Not sure why the 295 and 260 cards are having problems unless it has to do with multiple GPUs ... not sure why that would crop up now ... worked fine before ...
ID: 34061 · Rating: 0

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34065 - Posted: 1 Dec 2009, 17:02:01 UTC
Last modified: 1 Dec 2009, 17:05:12 UTC

All new WUs seem to finish invalid on both of my cuda machines. Have no ATI here.

Machine 1: intel Q9650 WinXP SP2 GTX260 191.07 BOINC 6.10.17
Machine 2: intel Pentium D Win2003 SP2 GTX260 191.07 BOINC 6.10.17

Project degraded to NNT until this is fixed.

Oh boy, this reminds me of the song Flakes from Frank Zappa ;-)
ID: 34065 · Rating: 0

JAMC
Joined: 9 Sep 08
Posts: 96
Credit: 336,443,946
RAC: 0
Message 34066 - Posted: 1 Dec 2009, 17:26:21 UTC

I have had no errors/invalid tasks yet with my W7 64 bit, Q6600, NVIDIA GeForce GTX 260 (896MB) driver: 19107...
ID: 34066 · Rating: 0

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34067 - Posted: 1 Dec 2009, 17:30:34 UTC - in response to Message 34066.  

I have had no errors/invalid tasks yet with my W7 64 bit, Q6600, NVIDIA GeForce GTX 260 (896MB) driver: 19107...


Hidden computers are always very helpful for diagnostics. How big is your cache? Maybe you're still crunching older WUs. When did you get them?
ID: 34067 · Rating: 0

JAMC
Joined: 9 Sep 08
Posts: 96
Credit: 336,443,946
RAC: 0
Message 34068 - Posted: 1 Dec 2009, 18:18:51 UTC - in response to Message 34067.  

I'll get right back to you on that...
ID: 34068 · Rating: 0

salyavin
Joined: 13 Nov 07
Posts: 1
Credit: 11,688,539
RAC: 5,133
Message 34073 - Posted: 1 Dec 2009, 22:09:32 UTC - in response to Message 34065.  

Another problem machine
AMD Phenom II X4 940 Processor Linux 64 bit Nvidia GTX 285 Boinc 6.10.18
ID: 34073 · Rating: 0

banditwolf
Joined: 12 Nov 07
Posts: 2425
Credit: 524,164
RAC: 0
Message 34081 - Posted: 2 Dec 2009, 0:20:18 UTC - in response to Message 34073.  

Another problem machine
AMD Phenom II X4 940 Processor Linux 64 bit Nvidia GTX 285 Boinc 6.10.18

Gives this message:
<core_client_version>6.10.18</core_client_version>
<![CDATA[
<stderr_txt>
Device index specified on the command line was 0
Looking for a Double Precision capable NVIDIA GPU
The device GeForce GTX 285 specified on the command line can be used
called boinc_finish
</stderr_txt>

Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.
ID: 34081 · Rating: 0

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34108 - Posted: 2 Dec 2009, 19:16:07 UTC

OK, now that I've tried some more things like project reset, reboot, detach and reattach, I'm completely stumped about what to do next. It seems that I can no longer deliver valid results.

I browsed through the user stats, and it's clear that this is not a global problem. All of the top users seem to have no problems crunching the new WUs, and it's independent of BOINC client version or GPU type.

My GPUs are almost new. One has been crunching for two weeks, the other has been running since the beginning of November, and it's very unlikely that both cards would die in the same second. The fact that I'm not the only one having this problem makes it even less likely that all those GPUs are malfunctioning.

Other projects like SETI and Collatz are running fine on both machines, so I assume the GPUs are all fine.

So what's going on here? Due to excessive DB purging here at MW there is not much of a history, but I'm almost sure that it all began when the WU size was increased. And the only thing that changed on both machines was the antivirus pattern files.

Here are the machines again:

Machine 1: intel Q9650, WinXP SP2, GTX260 191.07, BOINC 6.10.17
Machine 2: intel Pentium D, Win2003 SP2, GTX260 191.07, BOINC 6.10.17

Please help, any suggestion will be appreciated.
ID: 34108 · Rating: 0

Crunch3r
Volunteer developer
Joined: 17 Feb 08
Posts: 363
Credit: 258,227,990
RAC: 0
Message 34109 - Posted: 2 Dec 2009, 19:22:50 UTC - in response to Message 34108.  



So what's going on here?
Please help, any suggestion will be appreciated.


The only thing I can think of is that, now that the WU size has increased 'dramatically', the CUDA app is reaching the maximum time that a CUDA kernel is allowed to run and therefore crashes.

One could try dividing the 'domain size' to prevent that.
Anyway, that's just a guess without having a look at the CUDA code at all.

Might be something totally different. Only Anthony will know for sure.
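
To illustrate the 'dividing the domain size' idea: a minimal sketch, in plain Python rather than CUDA, of splitting one long-running integration into several short launches so that no single launch runs past a time limit. The function names, the chunk size and the integrand are hypothetical and are not taken from the MilkyWay@home source; each loop iteration merely stands in for one kernel launch.

```python
from typing import Callable, Iterator, Tuple


def split_domain(total_points: int, max_points_per_launch: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) index ranges that together cover the domain."""
    if total_points < 0 or max_points_per_launch <= 0:
        raise ValueError("total_points must be >= 0 and max_points_per_launch > 0")
    for start in range(0, total_points, max_points_per_launch):
        yield start, min(start + max_points_per_launch, total_points)


def integrate_in_chunks(
    f: Callable[[int], float],
    total_points: int,
    max_points_per_launch: int,
) -> float:
    """Sum f over the whole domain, one bounded chunk at a time."""
    total = 0.0
    for start, stop in split_domain(total_points, max_points_per_launch):
        # Hypothetical stand-in for one short kernel launch over [start, stop).
        total += sum(f(i) for i in range(start, stop))
    return total


if __name__ == "__main__":
    print(list(split_domain(10, 4)))
    print(integrate_in_chunks(lambda i: i * i, 10, 4))
```

With 10 points and at most 4 per launch, the work is split into the ranges (0, 4), (4, 8) and (8, 10), and the chunked sum matches what a single pass over all 10 points would give. Whether the real app could be split this way depends on its CUDA code, which this sketch does not model.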




Join Support science! Join Team BOINC United now!
ID: 34109 · Rating: 0

©2019 Astroinformatics Group