Scheduled Server Maintenance Concluded

Author	Message
Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 66364 - Posted: 4 May 2017, 13:21:56 UTC - in response to Message 66363.
Half of my 140 units are working fine. It's no big deal if they fail after 2 seconds, it's only wasted 2 seconds of my computer's time. With multi-GPU setups, there will be so many failed tasks that the client tries to "protect" the network by postponing the communication with the server (the idea is to stop faulty machines from thrashing thousands of workunits, but it has backfired on us in this case).

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,129,960 RAC: 1,448	Message 66365 - Posted: 4 May 2017, 13:24:51 UTC - in response to Message 66364.
Not happening with me, only 50% of the CPU tasks are failing. So it always has plenty to do until the next update. Most of the GPU ones I'm getting are 146 which work fine. Just checked - 3 of the last 31 GPU tasks were 140 and failed. 28 were 146 and all worked.

Jake Weiss Volunteer moderator Project developer Project tester Project scientist Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0	Message 66367 - Posted: 4 May 2017, 14:26:21 UTC
GIPICS, I've had issues using the WU cancelling feature in the past, especially when trying to cancel 100,000+ workunits. I let the server queue run for a full day before putting up the new runs. This shrank the queue drastically. I expect they should all be cleared out soon.
Everyone, sorry for the inconvenience caused as the work units clear the queue. I tried to mitigate the issue beforehand since I knew it would be a problem. Luckily, the new application seems to be working as intended so everything should be running well once the work units are cleared.
Jake

Leo J Keller II Joined: 25 Nov 16 Posts: 2 Credit: 3,962,420 RAC: 0	Message 66368 - Posted: 4 May 2017, 20:58:05 UTC
I have been unable to download any work units today and am totally out of MilkyWay@Home work units at this point. I am running an iMac with MacOS 10.12.4. My other BOINC projects are operating fine. Do I need to update something?

aad Joined: 30 Mar 09 Posts: 63 Credit: 621,678,650 RAC: 1,177	Message 66369 - Posted: 4 May 2017, 22:25:19 UTC - in response to Message 66368. Last modified: 4 May 2017, 22:30:30 UTC
Quote: I have been unable to download any work units today and am totally out of MilkyWay@Home work units at this point. I am running an iMac with MacOS 10.12.4. My other BOINC projects are operating fine. Do I need to update something?
As you can see here: http://milkyway.cs.rpi.edu/milkyway/apps.php there is no new application for the Mac yet. http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4129

GIPICS Joined: 24 Apr 17 Posts: 8 Credit: 77,149,813 RAC: 0	Message 66371 - Posted: 4 May 2017, 23:21:12 UTC
It's still a massacre. A lot of faulty WUs. Unattended hosts get their communication deferred, postponed for 24h. This is a drama...

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,129,960 RAC: 1,448	Message 66372 - Posted: 5 May 2017, 0:30:26 UTC - in response to Message 66371.
Why are people getting so upset? My computers are working fine (Windows 10, Intel CPUs, AMD graphics). If you don't have enough work, sign up for a backup project.

GIPICS Joined: 24 Apr 17 Posts: 8 Credit: 77,149,813 RAC: 0	Message 66373 - Posted: 5 May 2017, 1:04:35 UTC
Because unattended hosts sometimes get a BSOD or, more often, just get stuck for 24h waiting for the countdown to end before doing a new update to request WUs. But it's not about getting upset, it's about wasting time.

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,129,960 RAC: 1,448	Message 66375 - Posted: 5 May 2017, 9:07:51 UTC - in response to Message 66373.
I've never had a BSOD from a broken WU, there must be something else wrong with your machine. And what is this 24 h wait? My computers ask for more WU whenever they need it.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 66376 - Posted: 5 May 2017, 9:52:22 UTC - in response to Message 66375.
Quote: I've never had a BSOD from a broken WU, there must be something else wrong with your machine. And what is this 24 h wait? My computers ask for more WU whenever they need it.
BSODs happen, with or without BOINC, even on perfectly fine machines. Your computers have about 1000 tasks altogether. But a single PC with 3 or 4 powerful GPUs will have 30k+ tasks and many of them will end in computational errors, because they are incompatible with the new app. With so many computational errors, communication with the server gets deferred. It's by design, because servers normally don't expect so many thrashed workunits, to put it simply.

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,129,960 RAC: 1,448	Message 66377 - Posted: 5 May 2017, 10:01:22 UTC - in response to Message 66376.
If you're getting a BSOD, you should run memtest on a triple scan, most likely you have faulty memory. I never ever get a BSOD doing anything. Try increasing your queue size or adding a backup project on priority 0 (it will only run if you run out of Milkyway tasks).

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 66378 - Posted: 5 May 2017, 10:48:20 UTC - in response to Message 66377.
Quote: If you're getting a BSOD, you should run memtest on a triple scan, most likely you have faulty memory. I never ever get a BSOD doing anything. Try increasing your queue size or adding a backup project on priority 0 (it will only run if you run out of Milkyway tasks).
Things are not that simple and you are not even trying to comprehend the problem here. Milkyway allows only 80 tasks per GPU, so it's impossible to have a large queue (as you have suggested). 80 tasks aren't enough for even a full hour with a Tahiti GPU, therefore regular communication with the server is extremely important, to be able to obtain new tasks often enough. Of course, I have backup projects. But Tahiti GPUs are strong only in FP64 nowadays, therefore they excel only in Milkyway. My Gridcoin magnitude and rewards are decreased whenever backup projects kick in. It's simply inefficient to use Tahitis for FP32 projects today and only PrimeGrid is a viable FP64 alternative, however only for very long WUs which take days to crunch.
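As a rough check on the queue arithmetic in the post above, here is a minimal sketch. It assumes tasks run one at a time on a GPU, and the 40-second runtime is a hypothetical figure chosen for illustration, not a measured Tahiti number. Only the 80-task limit and the one-hour window come from the thread.

```python
def queue_runtime_minutes(tasks: int, seconds_per_task: float) -> float:
    """Minutes one GPU needs to finish `tasks` tasks run back to back."""
    return tasks * seconds_per_task / 60.0


def break_even_seconds_per_task(tasks: int, window_seconds: float) -> float:
    """Average task runtime at which `tasks` tasks last exactly `window_seconds`."""
    return window_seconds / tasks


TASK_LIMIT_PER_GPU = 80  # per-GPU limit stated in the post above

# If 80 tasks do not last an hour, the average task must finish faster than this.
break_even = break_even_seconds_per_task(TASK_LIMIT_PER_GPU, 3600)

# Hypothetical runtime, for illustration only (not a measured figure).
assumed_seconds_per_task = 40.0
drain_minutes = queue_runtime_minutes(TASK_LIMIT_PER_GPU, assumed_seconds_per_task)

print(f"break-even runtime: {break_even:.0f} s per task")
print(f"queue lasts {drain_minutes:.1f} min at {assumed_seconds_per_task:.0f} s per task")
```

In other words, the claim that 80 tasks last less than an hour implies an average runtime under 45 seconds per task, which is why a long gap between server contacts leaves such a GPU idle.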

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,129,960 RAC: 1,448	Message 66379 - Posted: 5 May 2017, 13:29:55 UTC - in response to Message 66378.
What's the time limit for re-contacting the Milkyway server? I've seen my machines contact them far more often than once an hour. And it's only a small proportion that fail. Anyway, just be patient as this is only a temporary problem! And if we're only getting 80 tasks per GPU, this would seem to indicate there isn't that much work to be done on this project. I really don't understand why some people need to have their machines running flat out 24/7. If it runs out of work to do sometimes, so be it. Get another project, or not. As for your BSOD, seriously you shouldn't get that no matter what any program does wrong. Only bad memory (or sometimes a bad graphics card) can cause a BSOD. What is the code on the BSOD?

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 66380 - Posted: 5 May 2017, 14:58:54 UTC - in response to Message 66379.
Quote: What's the time limit for re-contacting the Milkyway server? I've seen my machines contact them far more often than once an hour. And it's only a small proportion that fail. Anyway, just be patient as this is only a temporary problem! And if we're only getting 80 tasks per GPU, this would seem to indicate there isn't that much work to be done on this project. I really don't understand why some people need to have their machines running flat out 24/7. If it runs out of work to do sometimes, so be it. Get another project, or not. As for your BSOD, seriously you shouldn't get that no matter what any program does wrong. Only bad memory (or sometimes a bad graphics card) can cause a BSOD. What is the code on the BSOD?
You still don't get it, just repeating the same irrelevant stuff over and over again. Normally, communication is deferred for 90 seconds after every Update. But, to explain it for the fourth time, plenty of computation errors (which we are getting now due to the new app version) are deferring communication for up to 24hrs. I hope it's clear now? As for BSODs, I am not getting any, that's just one of your superficial assumptions. And even if I did, I doubt I would ask you for help.
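The deferral behaviour described in the post above can be pictured with a toy model. The thread gives only two numbers: a normal 90-second deferral after an update and a ceiling of 24 hours when many tasks error out. How the delay grows between those two points is not stated, so the doubling rule below is purely an assumption for illustration and should not be read as the actual BOINC logic.

```python
BASE_DEFERRAL_S = 90            # normal deferral after an update, per the post above
MAX_DEFERRAL_S = 24 * 60 * 60   # ceiling reported in the thread (24 hours)


def deferral_seconds(consecutive_errors: int) -> int:
    """Toy model: double the deferral per consecutive error, capped at 24 h.

    The doubling rule is an assumption for illustration; it is not taken
    from the BOINC client or server source.
    """
    if consecutive_errors < 0:
        raise ValueError("consecutive_errors must be non-negative")
    return min(BASE_DEFERRAL_S * 2 ** consecutive_errors, MAX_DEFERRAL_S)


for errors in (0, 1, 5, 10):
    print(errors, deferral_seconds(errors))
```

Under this assumed rule the cap is reached after ten consecutive errors, which shows how a host holding thousands of incompatible tasks could end up at the 24-hour limit while a host with only a few failures never notices the mechanism.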

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,129,960 RAC: 1,448	Message 66381 - Posted: 5 May 2017, 15:08:53 UTC - in response to Message 66380.
Sorry, I didn't notice you said there's a longer deferral for errors. But I'm hardly getting any errors, only a small number of the older WUs fail, and the majority of those are CPU tasks, the GPU runs fine for 90% of tasks. If you're getting a lot more errors, perhaps there's a fault with your setup? Is it overclocked? Are you using significantly different GPUs to me which show up a bug in the new application? You said earlier "BSODs happen, with or without BOINC, even on perfectly fine machines." I was correcting you. BSOD is a hardware error.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 66382 - Posted: 5 May 2017, 15:40:01 UTC - in response to Message 66381.
Quote: You said earlier "BSODs happen, with or without BOINC, even on perfectly fine machines." I was correcting you. BSOD is a hardware error.
https://support.microsoft.com/en-us/help/17074/windows-7-resolving-stop-blue-screen-errors
"Stop errors (also sometimes called blue screen or black screen errors) can occur if a serious problem causes Windows 7 to shut down or restart unexpectedly. These errors can be caused by both hardware and software issues."

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,129,960 RAC: 1,448	Message 66383 - Posted: 5 May 2017, 15:44:30 UTC - in response to Message 66382.
Not my experience as a computer tech since 1997. I've stopped them all by replacing faulty memory, or the odd one had a dodgy GPU. Almost every software error will be caught by Windows (since about version 2000).

aad Joined: 30 Mar 09 Posts: 63 Credit: 621,678,650 RAC: 1,177	Message 66384 - Posted: 5 May 2017, 17:02:01 UTC - in response to Message 66383.
About 'how many errors' appear on one machine:
State: All (3620) · In progress (322) · Validation pending (0) · Validation inconclusive (439) · Valid (2639) · Invalid (13) · Error (207)
The errors here were all 1 second errors, same as all the wingmen! Notice the 'in progress (322)'; in fact this is a 2 GPU machine, with only one GPU crunching for Milkyway! 80/GPU... not in my case, luckily!

bluestang Joined: 13 Oct 16 Posts: 112 Credit: 1,174,293,644 RAC: 0	Message 66385 - Posted: 5 May 2017, 17:02:32 UTC - in response to Message 66383. Last modified: 5 May 2017, 17:03:21 UTC
Quote: Not my experience as a computer tech since 1997. I've stopped them all by replacing faulty memory, or the odd one had a dodgy GPU. Almost every software error will be caught by Windows (since about version 2000).
lol, you don't run enough WUs in a single day to even make a good observation anyways.

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,129,960 RAC: 1,448	Message 66386 - Posted: 5 May 2017, 17:06:59 UTC - in response to Message 66385.
Quote: lol, you don't run enough WUs in a single day to even make a good observation anyways.
I run three other projects as well, and Milkyway isn't the highest priority. Anyway, I currently pay for the electricity for the crunching, so I don't do as much as I used to. I did at one point have about 10 of the latest GPUs running 24/7.
Quote: About 'how many errors' appear on one machine: State: All (3620) · In progress (322) · Validation pending (0) · Validation inconclusive (439) · Valid (2639) · Invalid (13) · Error (207) The errors here were all 1 second errors, same as all the wingmen! Notice the 'in progress (322)'; in fact this is a 2 GPU machine, with only one GPU crunching for Milkyway! 80/GPU... not in my case, luckily!
So not that many as a percentage. Nothing to worry about.
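To put a number on "not that many as a percentage", the short sketch below recomputes the error share from the task counts quoted in the thread. The counts are copied from the post; the choice to also show the share among returned tasks (excluding those still in progress) is an added assumption about what a fair denominator might be.

```python
# Task counts for one host, copied from the state line quoted in the thread.
counts = {
    "in_progress": 322,
    "validation_pending": 0,
    "validation_inconclusive": 439,
    "valid": 2639,
    "invalid": 13,
    "error": 207,
}

total = sum(counts.values())              # should match "All (3620)"
returned = total - counts["in_progress"]  # tasks the host has already reported

error_share_all = counts["error"] / total
error_share_returned = counts["error"] / returned

print(f"errors / all tasks:      {error_share_all:.1%}")
print(f"errors / returned tasks: {error_share_returned:.1%}")
```

On these figures the error share is about 5.7% of all tasks, or about 6.3% of the tasks already returned.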