Welcome to MilkyWay@home

All WUs Failing?

Message boards : Number crunching : All WUs Failing?

pippen
Joined: 26 Oct 11
Posts: 7
Credit: 10,537,126
RAC: 0
Message 56568 - Posted: 19 Dec 2012, 16:15:37 UTC

Hello,

I have had no real problems for many months, but as of the 18th of December at some point, every work unit for MW is failing, all at about 14 seconds into the run ("computational error"). Thoughts or suggestions? I run 2 other projects that are showing no issues. I do have an AMD/ATI GPU, but MW does not use it because it does not support double precision. The failure rate has been 100% on these.

The system is a Windows 7, i7-based machine with plenty of memory and disk space available. I did a project refresh, and the new tasks failed just like the others.

Thanks! -Mike

microchip
Joined: 25 Feb 09
Posts: 82
Credit: 15,824,247
RAC: 0
Message 56569 - Posted: 19 Dec 2012, 17:33:05 UTC

First thing I'd check is whether you have sufficient cooling, and I'd also do a system-wide check, i.e., check RAM, disk, GPU, and CPU for errors. There are diagnostic programs that can do that, but I can't recommend one as I don't use Windows.

If all comes out clean, then I'd suspect problems on M@H's side of things.
Team Belgium

John G
Joined: 1 Apr 10
Posts: 49
Credit: 171,863,025
RAC: 0
Message 56570 - Posted: 19 Dec 2012, 17:55:37 UTC

Milkyway needs a double-precision card; the one you are running, I think, is not!

Regards

Link
Joined: 19 Jul 10
Posts: 595
Credit: 18,968,541
RAC: 5,799
Message 56572 - Posted: 19 Dec 2012, 19:44:06 UTC - in response to Message 56570.  

> Milkyway needs a double-precision card; the one you are running, I think, is not!

The errors occur on the CPU and that has DP for sure. So that's not the reason.

Looking into the std_err... looks like some permissions issue. I'd try excluding the BOINC data directory from antivirus scanning; if that does not help, reinstall BOINC (just install it once again over the old installation).

pippen
Joined: 26 Oct 11
Posts: 7
Credit: 10,537,126
RAC: 0
Message 56582 - Posted: 20 Dec 2012, 15:42:28 UTC - in response to Message 56572.  

> > Milkyway needs a double-precision card; the one you are running, I think, is not!

> The errors occur on the CPU and that has DP for sure. So that's not the reason.

> Looking into the std_err... looks like some permissions issue. I'd try excluding the BOINC data directory from antivirus scanning; if that does not help, reinstall BOINC (just install it once again over the old installation).


Yes, I always get the message at startup from MW@H that my GPU does not support double precision and can't be used, so MW does not use it and just runs on the CPUs, as you indicate.

I'll check permissions and AV, though I have done nothing new (at least deliberately) and certainly have not had any popups. Since every run is aborting within 15 seconds (and just on this project), I was thinking I might need to check/purge the temp files. Where does MW@H store those in Windows 7? Any I should NOT delete? Is there a more detailed log file I can check?

Again, things have been processing just fine up until the 18th.

Thanks! -Mike

pippen
Joined: 26 Oct 11
Posts: 7
Credit: 10,537,126
RAC: 0
Message 56584 - Posted: 20 Dec 2012, 15:50:14 UTC - in response to Message 56569.  

> First thing I'd check is whether you have sufficient cooling, and I'd also do a system-wide check, i.e., check RAM, disk, GPU, and CPU for errors. There are diagnostic programs that can do that, but I can't recommend one as I don't use Windows.

> If all comes out clean, then I'd suspect problems on M@H's side of things.


I run temp and process monitors all the time, and nothing is showing any issues. SETI and WCG are both running fine, so I lean against a hardware issue per se. Since it fails at the start of the run, 100% of the time, anything hardware-related would more likely be showing in the other programs as well. I'll watch these things, and if I don't see anything else to explain it, I'll track down some diagnostic programs and run them. I don't overclock or anything (gave that stuff up in the late 80's and early 90's after seeing how badly that worked in retrospect), and in general I try not to stress the system.

Thanks!!

-Mike

Link
Joined: 19 Jul 10
Posts: 595
Credit: 18,968,541
RAC: 5,799
Message 56586 - Posted: 20 Dec 2012, 22:08:53 UTC - in response to Message 56582.  

> Since every run is aborting within 15 seconds (and just on this project), I was thinking I might need to check/purge the temp files. Where does MW@H store those in Windows 7? Any I should NOT delete? Is there a more detailed log file I can check?

Usually you should not need to delete anything by hand, but you can check if there's something "dead" in the slots directory (with a standard installation on Win7 this should be in C:\ProgramData\BOINC), i.e. check that all current slot folders are in use. If there is something in there that's not in use, especially something that has not been in use since the 18th, you can probably delete it. It's quite easy to spot, because you should have as many slots as WUs running (empty slot dirs do not count, they are "OK"). But if you have, for example, 2 WUs running and 3 slot dirs with files in them, that's one too many.
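The slot check described above could be sketched in Python roughly as follows. This is only an illustration, not a BOINC tool: the function names are made up, the data directory is assumed to be the standard Win7 location mentioned above, and the number of running WUs has to be read off the BOINC Manager by hand. The script only lists directories and deletes nothing.

```python
from pathlib import Path


def non_empty_slots(data_dir):
    """Return the names of slot directories that contain at least one entry.

    Assumes the slot folders live in a "slots" subdirectory of the BOINC
    data directory. Empty slot directories are skipped, since they are fine.
    """
    slots = Path(data_dir) / "slots"
    if not slots.is_dir():
        return []
    names = [d.name for d in slots.iterdir() if d.is_dir() and any(d.iterdir())]
    # Sort so that "2" comes before "10" (slot folders are usually numbered).
    return sorted(names, key=lambda n: (len(n), n))


def surplus_slots(data_dir, running_tasks):
    """Return how many populated slot dirs exceed the number of running WUs."""
    return max(0, len(non_empty_slots(data_dir)) - running_tasks)


if __name__ == "__main__":
    # Assumed default data directory; adjust if BOINC was installed elsewhere.
    data_dir = r"C:\ProgramData\BOINC"
    print("Slot dirs with files:", non_empty_slots(data_dir))
```

With 2 WUs running and 3 populated slot dirs, `surplus_slots(data_dir, 2)` would report 1, which matches the "one too many" case above. Which of the populated slots is the stale one still has to be judged by looking at the file dates.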

Anything else you should NOT delete... for now.

There are no more detailed logs, the std_err is the only log from the application.

clockedover
Joined: 26 Feb 10
Posts: 1
Credit: 18,801,942
RAC: 56
Message 56615 - Posted: 22 Dec 2012, 23:45:56 UTC

Try running a "repair" with the BOINC client installation executable; it may fix some problems.

pippen
Joined: 26 Oct 11
Posts: 7
Credit: 10,537,126
RAC: 0
Message 56619 - Posted: 23 Dec 2012, 23:43:56 UTC - in response to Message 56572.  

> > Milkyway needs a double-precision card; the one you are running, I think, is not!

> The errors occur on the CPU and that has DP for sure. So that's not the reason.

> Looking into the std_err... looks like some permissions issue. I'd try excluding the BOINC data directory from antivirus scanning; if that does not help, reinstall BOINC (just install it once again over the old installation).


The permission issue tipped me off... My McAfee Access Protection service had gone a bit bonkers, and I forget that turning off "real time protection" does not turn that off. My Windows Update even failed with the same "permission" error, and I saw the "shield up" thing in the McAfee icon, so I checked those logs. It looks like it was blocking access to something to do with the ATI driver.

Disabled Access Protection for a while and everything started running fine. Finished the patches, rebooted and did a bit of cleanup on my temp dirs and it has been running fine ever since. Thanks to everyone for the help!!

-Mike

©2024 Astroinformatics Group