Scheduled Server Maintenance Concluded

Message boards : News : Scheduled Server Maintenance Concluded

Vortac
Joined: 22 Apr 09
Posts: 77
Credit: 1,051,203,004
RAC: 99,193

Message 66364 - Posted: 4 May 2017, 13:21:56 UTC - in response to Message 66363.

Half of my 140 units are working fine. It's no big deal if they fail after 2 seconds, it's only wasted 2 seconds of my computer's time.

With multi-GPU setups, there will be so many failed tasks that the client tries to "protect" the network by postponing the communication with the server (the idea is to stop faulty machines from thrashing thousands of workunits, but it has backfired on us in this case).

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 61

Message 66365 - Posted: 4 May 2017, 13:24:51 UTC - in response to Message 66364.

Not happening with me, only 50% of the CPU tasks are failing. So it always has plenty to do until the next update. Most of the GPU ones I'm getting are 146 which work fine. Just checked - 3 of the last 31 GPU tasks were 140 and failed. 28 were 146 and all worked.

Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 430
Credit: 7,615,331
RAC: 8,186

Message 66367 - Posted: 4 May 2017, 14:26:21 UTC

GIPICS,

I've had issues using the WU cancelling feature in the past, especially when trying to cancel 100,000+ workunits. I let the server queue run for a full day before putting up the new runs. This shrank the queue drastically. I expect they should all be cleared out soon.

Everyone,

Sorry for the inconvenience caused as the work units clear the queue. I tried to mitigate the issue beforehand since I knew it would be a problem. Luckily, the new application seems to be working as intended so everything should be running well once the work units are cleared.

Jake

Leo J Keller II
Joined: 25 Nov 16
Posts: 1
Credit: 474,298
RAC: 871

Message 66368 - Posted: 4 May 2017, 20:58:05 UTC

I have been unable to download any work units today and am totally out of MilkyWay@Home work units at this point. I am running an iMac with macOS 10.12.4. My other BOINC projects are operating fine. Do I need to update something?

aad
Joined: 30 Mar 09
Posts: 51
Credit: 242,155,069
RAC: 431,967

Message 66369 - Posted: 4 May 2017, 22:25:19 UTC - in response to Message 66368.
Last modified: 4 May 2017, 22:30:30 UTC

I have been unable to download any work units today and am totally out of MilkyWay@Home work units at this point. I am running an iMac with macOS 10.12.4. My other BOINC projects are operating fine. Do I need to update something?

As you can see here:
http://milkyway.cs.rpi.edu/milkyway/apps.php
there is no new application for the Mac yet.

http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4129

GIPICS
Joined: 24 Apr 17
Posts: 8
Credit: 47,456,536
RAC: 7,498

Message 66371 - Posted: 4 May 2017, 23:21:12 UTC

It is still a massacre.
There are a lot of faulty WUs.
Unattended hosts get their communication deferred, postponed for 24 h.

This is a drama...

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 61

Message 66372 - Posted: 5 May 2017, 0:30:26 UTC - in response to Message 66371.

Why are people getting so upset? My computers are working fine (Windows 10, Intel CPUs, AMD graphics). If you don't have enough work, sign up for a backup project.

GIPICS
Joined: 24 Apr 17
Posts: 8
Credit: 47,456,536
RAC: 7,498

Message 66373 - Posted: 5 May 2017, 1:04:35 UTC

Because unattended hosts sometimes get a BSOD or, more often, just get stuck for 24 h waiting for the countdown to end before doing a new update to request WUs.
But it is not about getting upset, it is about wasting time.

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 61

Message 66375 - Posted: 5 May 2017, 9:07:51 UTC - in response to Message 66373.

I've never had a BSOD from a broken WU, there must be something else wrong with your machine.

And what is this 24 h wait? My computers ask for more WU whenever they need it.

Vortac
Joined: 22 Apr 09
Posts: 77
Credit: 1,051,203,004
RAC: 99,193

Message 66376 - Posted: 5 May 2017, 9:52:22 UTC - in response to Message 66375.

I've never had a BSOD from a broken WU, there must be something else wrong with your machine.

And what is this 24 h wait? My computers ask for more WU whenever they need it.

BSODs happen, with or without BOINC, even on perfectly fine machines.

Your computers have about 1000 tasks all together. But a single PC with 3 or 4 powerful GPUs will have 30k+ tasks and many of them will end in computational errors, because they are incompatible with the new app. With so many computational errors, communication with the server gets deferred. It's by design, because servers normally don't expect so many thrashed workunits, to put it simply.

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 61

Message 66377 - Posted: 5 May 2017, 10:01:22 UTC - in response to Message 66376.

If you're getting a BSOD, you should run memtest on a triple scan, most likely you have faulty memory. I never ever get a BSOD doing anything.

Try increasing your queue size or adding a backup project on priority 0 (it will only run if you run out of Milkyway tasks).

Vortac
Joined: 22 Apr 09
Posts: 77
Credit: 1,051,203,004
RAC: 99,193

Message 66378 - Posted: 5 May 2017, 10:48:20 UTC - in response to Message 66377.

If you're getting a BSOD, you should run memtest on a triple scan, most likely you have faulty memory. I never ever get a BSOD doing anything.

Try increasing your queue size or adding a backup project on priority 0 (it will only run if you run out of Milkyway tasks).

Things are not that simple and you are not even trying to comprehend the problem here. Milkyway allows only 80 tasks per GPU, so it's impossible to have a large queue (as you have suggested). 80 tasks aren't enough for even a full hour with a Tahiti GPU, therefore regular communication with the server is extremely important, to be able to obtain new tasks often enough.

Of course, I have backup projects. But Tahiti GPUs are strong only in FP64 nowadays, therefore they excel only in Milkyway. My Gridcoin magnitude and rewards are decreased whenever backup projects kick in. It's simply inefficient to use Tahitis for FP32 projects today and only PrimeGrid is a viable FP64 alternative, however only for very long WUs which take days to crunch.
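
As a rough check on the 80-task cap described above, the arithmetic can be sketched in Python. The cap of 80 tasks per GPU and the "less than a full hour" figure come from the thread; the 30-second task time in the usage note is a purely hypothetical value chosen for illustration, and the function names are invented for this sketch.

```python
TASKS_PER_GPU = 80  # per-GPU task limit quoted in the thread


def max_task_seconds_to_last(hours: float, tasks: int = TASKS_PER_GPU) -> float:
    """Average seconds per task at which `tasks` tasks last exactly `hours`."""
    return hours * 3600 / tasks


def queue_duration_minutes(seconds_per_task: float, tasks: int = TASKS_PER_GPU) -> float:
    """Minutes a full queue lasts when tasks run one after another."""
    return tasks * seconds_per_task / 60


# If 80 tasks do not last an hour, the average task must take under 45 s.
print(max_task_seconds_to_last(1))

# Hypothetical 30 s task time: the queue is empty after 40 minutes.
print(queue_duration_minutes(30))
```

Under these assumptions a host that cannot reach the server for 24 hours would spend almost all of that time idle on this project, which is the concern raised in the post.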

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 61

Message 66379 - Posted: 5 May 2017, 13:29:55 UTC - in response to Message 66378.

What's the time limit for re-contacting the Milkyway server? I've seen my machines contact them far more often than once an hour. And it's only a small proportion that fail. Anyway, just be patient as this is only a temporary problem!

And if we're only getting 80 tasks per GPU, this would seem to indicate there isn't that much work to be done on this project.

I really don't understand why some people need to have their machines running flat out 24/7. If it runs out of work to do sometimes, so be it. Get another project, or not.

As for your BSOD, seriously you shouldn't get that no matter what any program does wrong. Only bad memory (or sometimes a bad graphics card) can cause a BSOD. What is the code on the BSOD?

Vortac
Joined: 22 Apr 09
Posts: 77
Credit: 1,051,203,004
RAC: 99,193

Message 66380 - Posted: 5 May 2017, 14:58:54 UTC - in response to Message 66379.

What's the time limit for re-contacting the Milkyway server? I've seen my machines contact them far more often than once an hour. And it's only a small proportion that fail. Anyway, just be patient as this is only a temporary problem!

And if we're only getting 80 tasks per GPU, this would seem to indicate there isn't that much work to be done on this project.

I really don't understand why some people need to have their machines running flat out 24/7. If it runs out of work to do sometimes, so be it. Get another project, or not.

As for your BSOD, seriously you shouldn't get that no matter what any program does wrong. Only bad memory (or sometimes a bad graphics card) can cause a BSOD. What is the code on the BSOD?

You still don't get it, just repeating the same irrelevant stuff over and over again. Normally, communication is deferred for 90 seconds after every Update. But, to explain it for the fourth time, plenty of computation errors (which we are getting now due to new app version) are deferring communication for up to 24hrs. I hope it's clear now?

As for BSODs, I am not getting any, that's just one of your superficial assumptions. And even if I did, I doubt I would ask you for help.
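
The deferral behaviour described above (about 90 seconds after a normal update, growing to as much as 24 hours after many computation errors) can be illustrated with a small Python sketch. Only the two endpoints, 90 seconds and 24 hours, come from the thread. The doubling rule, the function name, and the `factor` parameter are assumptions made for illustration and are not taken from the BOINC client's actual algorithm.

```python
NORMAL_DEFER_S = 90          # normal deferral after an update (from the thread)
MAX_DEFER_S = 24 * 60 * 60   # upper limit of 24 hours (from the thread)


def deferral_seconds(consecutive_errors: int, factor: float = 2.0) -> float:
    """Hypothetical capped backoff: multiply the delay by `factor` per error."""
    if consecutive_errors < 0:
        raise ValueError("consecutive_errors must be non-negative")
    delay = NORMAL_DEFER_S
    for _ in range(consecutive_errors):
        delay = min(delay * factor, MAX_DEFER_S)
        if delay >= MAX_DEFER_S:
            break  # already at the cap, further errors change nothing
    return delay


for errors in (0, 1, 5, 10):
    print(errors, deferral_seconds(errors))
```

With the assumed doubling rule, ten consecutive errors are already enough to reach the 24-hour cap, so a host with thousands of incompatible tasks would hit it almost immediately.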

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 61

Message 66381 - Posted: 5 May 2017, 15:08:53 UTC - in response to Message 66380.

Sorry I didn't notice you said there's a longer deferral for errors. But I'm hardly getting any errors, only a small amount of the older WUs fail, and the majority of those are CPU tasks, the GPU runs fine for 90% of tasks. If you're getting a lot more errors, perhaps there's a fault with your setup? Is it overclocked? Are you using significantly different GPUs to me which show up a bug in the new application?

You said earlier "BSODs happen, with or without BOINC, even on perfectly fine machines." I was correcting you. BSOD is a hardware error.

Vortac
Joined: 22 Apr 09
Posts: 77
Credit: 1,051,203,004
RAC: 99,193

Message 66382 - Posted: 5 May 2017, 15:40:01 UTC - in response to Message 66381.

You said earlier "BSODs happen, with or without BOINC, even on perfectly fine machines." I was correcting you. BSOD is a hardware error.

https://support.microsoft.com/en-us/help/17074/windows-7-resolving-stop-blue-screen-errors

"Stop errors (also sometimes called blue screen or black screen errors) can occur if a serious problem causes Windows 7 to shut down or restart unexpectedly. These errors can be caused by both hardware and software issues."

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 61

Message 66383 - Posted: 5 May 2017, 15:44:30 UTC - in response to Message 66382.

Not my experience as a computer tech since 1997. I've stopped them all by replacing faulty memory, or the odd one had a dodgy GPU. Almost every software error will be caught by Windows (since about version 2000).

aad
Joined: 30 Mar 09
Posts: 51
Credit: 242,155,069
RAC: 431,967

Message 66384 - Posted: 5 May 2017, 17:02:01 UTC - in response to Message 66383.

About 'how many errors' appear on one machine:
State: All (3620) · In progress (322) · Validation pending (0) · Validation inconclusive (439) · Valid (2639) · Invalid (13) · Error (207)

The errors here were all 1 second errors, same as all the wingmen!

Notice the 'in progress (322)':
In fact this is a 2 GPU machine, with only one GPU crunching for Milkyway!
80/GPU... not in my case, luckily!

bluestang
Joined: 13 Oct 16
Posts: 36
Credit: 67,046,505
RAC: 843

Message 66385 - Posted: 5 May 2017, 17:02:32 UTC - in response to Message 66383.
Last modified: 5 May 2017, 17:03:21 UTC

Not my experience as a computer tech since 1997. I've stopped them all by replacing faulty memory, or the odd one had a dodgy GPU. Almost every software error will be caught by Windows (since about version 2000).


lol, you don't run enough WUs in a single day to even make good observation anyways.

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 61

Message 66386 - Posted: 5 May 2017, 17:06:59 UTC - in response to Message 66385.

lol, you don't run enough WUs in a single day to even make good observation anyways.


I run three other projects as well, and Milkyway isn't the highest priority. Anyway, I currently pay for the electricity for the crunching, so I don't do as much as I used to. I did at one point have about 10 of the latest GPUs running 24/7.

About 'how many errors' appear on one machine:
State: All (3620) · In progress (322) · Validation pending (0) · Validation inconclusive (439) · Valid (2639) · Invalid (13) · Error (207)

The errors here were all 1 second errors, same as all the wingmen!

Notice the 'in progress (322)':
In fact this is a 2 GPU machine, with only one GPU crunching for Milkyway!
80/GPU... not in my case, luckily!


So not that many as a percentage. Nothing to worry about.
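
The percentage referred to here can be worked out from the task counts quoted above. The following Python sketch uses only those counts; the dictionary keys and variable names are invented for illustration.

```python
# Task counts as quoted from the host's task list in the thread.
counts = {
    "in_progress": 322,
    "validation_pending": 0,
    "validation_inconclusive": 439,
    "valid": 2639,
    "invalid": 13,
    "error": 207,
}

total = sum(counts.values())                 # matches the quoted "All (3620)"
finished = total - counts["in_progress"]     # tasks that have actually run

error_share_all = 100 * counts["error"] / total
error_share_finished = 100 * counts["error"] / finished

print(f"errors as share of all tasks:      {error_share_all:.1f}%")
print(f"errors as share of finished tasks: {error_share_finished:.1f}%")
```

On these numbers the error share is roughly 6% whichever denominator is used, which is consistent with the remark that it is small as a percentage.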


Copyright © 2017 AstroInformatics Group