All WUs Failing?
Author | Message |
---|---|
Joined: 26 Oct 11 Posts: 7 Credit: 10,537,126 RAC: 0 |
Hello, I have had no real problems for many months, but at some point on the 18th of December every work unit for MW started failing, all at about 14 seconds into the run ("computational error"). Thoughts or suggestions? I run 2 other projects that are showing no issues. I do have an AMD/ATI GPU, but MW does not use it because it does not support double precision. The failure rate has been 100% on these. The system is a Windows 7, i7-based machine with plenty of memory and disk space available. I did a project refresh, and the new tasks failed just like the others. Thanks! -Mike |
Joined: 25 Feb 09 Posts: 82 Credit: 15,824,247 RAC: 0 |
The first thing I'd check is whether you have sufficient cooling, and I'd also do a system-wide check, i.e., check RAM, disk, GPU, and CPU for errors. There are diagnostic programs that can do that, but I can't recommend one as I don't use Windows. If all comes out clean, then I'd suspect problems on M@H's side of things. Team Belgium |
Joined: 1 Apr 10 Posts: 49 Credit: 171,863,025 RAC: 0 |
MilkyWay needs a double-precision card, and I think the one you are running is not! Regards |
Joined: 19 Jul 10 Posts: 623 Credit: 19,255,064 RAC: 8 |
"Milkyway needs a double precision card the one you are running I think is not !" The errors occur on the CPU, and that has DP for sure, so that's not the reason. Looking into the std_err... it looks like some permissions issue. I'd try excluding the BOINC data directory from antivirus scanning; if that does not help, reinstall BOINC (just install it once again over the old installation). |
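One rough way to test the permissions theory is to try creating a temporary file in the BOINC data directory. The Python sketch below is only an illustration: the function name `can_write` is made up here, and the path `C:\ProgramData\BOINC` is assumed to be the standard Windows 7 location. A successful write does not rule out a security tool blocking some other operation, so treat a `True` result as a hint, not proof.

```python
import tempfile


def can_write(directory):
    """Return True if a temporary file can be created and written in directory."""
    try:
        # The file is removed automatically when the with-block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"probe")
        return True
    except OSError:
        # Covers PermissionError as well as a missing directory.
        return False


if __name__ == "__main__":
    # Assumed default data directory; adjust if BOINC was installed elsewhere.
    data_dir = r"C:\ProgramData\BOINC"
    print("Writable:", can_write(data_dir))
```

Run it under the same Windows account that BOINC uses, since a different account may have different access rights.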
Joined: 26 Oct 11 Posts: 7 Credit: 10,537,126 RAC: 0 |
"Milkyway needs a double precision card the one you are running I think is not !" Yes, I always get the message at startup from MW@H that my GPU does not support double precision and can't be used, so MW does not use it and just runs on the CPUs, as you indicate. I'll check permissions and AV, though I have done nothing new (at least deliberately) and certainly have not had any popups. Since every run is aborting within 15 seconds (and just on this project), I was thinking I might need to check/purge the temp files. Where does MW@H store those in Windows 7? Are there any I should NOT delete? Is there a more detailed log file I can check? Again, things had been processing just fine up until the 18th. Thanks! -Mike |
Joined: 26 Oct 11 Posts: 7 Credit: 10,537,126 RAC: 0 |
"First thing I'd check is if you have sufficient cooling and also do a system wide check, ie, check RAM, disk, GPU, CPU for errors. There are diagnostics programs that can do that but I can't recommend one as I don't use Windows" I run temp and process monitors all the time, and nothing is showing any issues. SETI and WCG are both running fine, so I lean against a hardware issue per se. Since it fails at the start of the run, 100% of the time, anything hardware related would more likely be showing in the other programs as well. I'll watch these things, and if I don't see anything else to explain it, I'll track some diagnostics down and run them. I don't overclock or anything (gave that stuff up in the late 80's and early 90's after seeing how badly that worked in retrospect), and in general I try not to stress the system. Thanks!! -Mike |
Joined: 19 Jul 10 Posts: 623 Credit: 19,255,064 RAC: 8 |
"Since every run is aborting within 15 seconds (and just on this project), I was thinking I might need to check/purge the temp files. Where does MW@H store those in Windows 7? Any I should NOT delete? Is there a more detailed log file I can check?" Usually you should not need to delete anything by hand, but you can check whether there's something "dead" in the slots directory (with a standard installation on Win7 this should be in C:\ProgramData\BOINC), i.e. check that all current slot folders are in use. If there is something in there that's not in use, especially if it has not been in use since the 18th, you can probably delete it. It's quite easy to spot, because you should have as many slots as WUs running (empty slot dirs do not count, they are "OK"). But if you have, for example, 2 WUs running and 3 slot dirs with files in them, that's one too many. Anything else you should NOT delete... for now. There are no more detailed logs; the std_err is the only log from the application. |
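The slot check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `nonempty_slot_dirs` and the `running_tasks` value are hypothetical, and the path assumes the standard Windows 7 data directory with its `slots` subfolder. The script only lists directories and deletes nothing.

```python
from pathlib import Path


def nonempty_slot_dirs(slots_root):
    """Return names of slot directories under slots_root that contain at least one entry."""
    root = Path(slots_root)
    if not root.is_dir():
        return []
    result = []
    for entry in sorted(root.iterdir()):
        # Empty slot directories are fine and are skipped.
        if entry.is_dir() and any(entry.iterdir()):
            result.append(entry.name)
    return result


if __name__ == "__main__":
    # Assumed default location; adjust if BOINC was installed elsewhere.
    slots = Path(r"C:\ProgramData\BOINC") / "slots"
    running_tasks = 2  # hypothetical: number of WUs BOINC Manager shows as running
    busy = nonempty_slot_dirs(slots)
    print("Non-empty slot directories:", busy)
    if len(busy) > running_tasks:
        print("More non-empty slots than running tasks; inspect the extras by hand.")
```

If the count of non-empty slots exceeds the number of running WUs, look at the file dates in the extra folders before deciding whether anything is stale.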
Joined: 26 Feb 10 Posts: 1 Credit: 18,803,640 RAC: 1 |
Try running a "repair" with the BOINC client installation executable; it may fix some problems. |
Joined: 26 Oct 11 Posts: 7 Credit: 10,537,126 RAC: 0 |
"Milkyway needs a double precision card the one you are running I think is not !" The permission issue tipped me off... My McAfee Access Protection service had gone a bit bonkers, and I forgot that turning off "real time protection" does not turn that off. My Windows Update even failed with the same "permission" error, and I saw the "shield up" thing in the McAfee icon, so I checked those logs. It looks like it was blocking access to something to do with the ATI driver. I disabled Access Protection for a while and everything started running fine. I finished the patches, rebooted, and did a bit of cleanup on my temp dirs, and it has been running fine ever since. Thanks to everyone for the help!! -Mike |