Scheduled Server Maintenance Concluded
Cliff
Joined: 28 Nov 14
Posts: 45
Credit: 51,783,604
RAC: 118,996

Message 66390 - Posted: 6 May 2017, 12:54:06 UTC - in response to Message 66367.
Last modified: 6 May 2017, 12:54:46 UTC


Everyone,

Sorry for the inconvenience caused as the work units clear the queue. I tried to mitigate the issue beforehand since I knew it would be a problem. Luckily, the new application seems to be working as intended so everything should be running well once the work units are cleared.

Jake


Be nice if the server didn't keep on sending last gen WU again and again.
As resends that is...
I've started aborting any I see, but I'm just monitoring 1 rig with a quick look at the others now and then.
____________
Regards,
Cliff.
--
Been there Done That, still no Damn T-Shirt

John G
Joined: 1 Apr 10
Posts: 49
Credit: 171,863,025
RAC: 0

Message 66392 - Posted: 7 May 2017, 0:09:11 UTC

I have just been aborting them. No big deal!!

Regards

John

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 68

Message 66393 - Posted: 7 May 2017, 2:21:19 UTC - in response to Message 66392.

I have just been aborting them. No big deal!!

Regards

John


Indeed, people get too worked up about these things. Run more than one project, allow your computer to not always be running, whatever. It's hardly the end of the world if you're not calculating something 24/7. The project is being updated, some hiccups will arise, get used to it.

Vortac
Joined: 22 Apr 09
Posts: 77
Credit: 1,051,203,004
RAC: 110,008

Message 66396 - Posted: 7 May 2017, 20:58:21 UTC - in response to Message 66392.

I have just been aborting them. No big deal!!

How do you abort a task on an unattended machine?

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 68

Message 66397 - Posted: 7 May 2017, 21:11:21 UTC - in response to Message 66396.

I have just been aborting them. No big deal!!

How do you abort a task on an unattended machine?


They abort themselves.

Vortac
Joined: 22 Apr 09
Posts: 77
Credit: 1,051,203,004
RAC: 110,008

Message 66398 - Posted: 7 May 2017, 22:39:45 UTC - in response to Message 66397.

They abort themselves.

Ah, more of your erroneous assumptions. They don't abort themselves. They end in 'Computational error' (1 (0x1) Unknown error number). Which is completely different from 'Aborted' (203 (0xcb) EXIT_ABORTED_VIA_GUI). The important difference being that computation errors defer communication with the server and aborted tasks don't. So indeed, it's preferable to abort them, but that's not possible on unattended machines and I certainly don't have the whole day to abort deprecated Milkyway tasks.

Frankly, I am tired of correcting you all the time. Can't you research anything before you post?
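
For anyone who wants to script the distinction described above, here is a minimal sketch. It assumes the unattended machine runs a BOINC client that accepts remote GUI RPC connections and that the stock `boinccmd` tool is installed locally. The host name, password, project URL and task name are placeholders, not values from this thread; only the two exit codes are taken from the post. Whether a task aborted this way is reported with the same 203 code is not something the thread confirms.

```python
import subprocess
from typing import List

# Exit codes as quoted in the post above.
EXIT_COMPUTE_ERROR = 1       # reported as 'Computational error'
EXIT_ABORTED_VIA_GUI = 203   # 0xcb, reported as 'Aborted'


def describe_exit(code: int) -> str:
    """Map the two exit codes from the thread to a short label."""
    if code == EXIT_ABORTED_VIA_GUI:
        return "aborted"
    if code == EXIT_COMPUTE_ERROR:
        return "computational error"
    return "other"


def abort_command(host: str, password: str,
                  project_url: str, task_name: str) -> List[str]:
    """Build a boinccmd call that aborts one task on a remote client.

    All four arguments are supplied by the caller (hypothetical values here).
    """
    return [
        "boinccmd", "--host", host, "--passwd", password,
        "--task", project_url, task_name, "abort",
    ]


def abort_task(host: str, password: str,
               project_url: str, task_name: str) -> int:
    """Run the abort command and return boinccmd's own exit status."""
    cmd = abort_command(host, password, project_url, task_name)
    return subprocess.run(cmd, check=False).returncode
```

The command is built separately from the call that runs it so the argument list can be inspected or logged before anything is sent to a remote client.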

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 68

Message 66399 - Posted: 7 May 2017, 22:51:45 UTC - in response to Message 66398.

They abort themselves.

Ah, more of your erroneous assumptions. They don't abort themselves. They end in 'Computational error' (1 (0x1) Unknown error number). Which is completely different from 'Aborted' (203 (0xcb) EXIT_ABORTED_VIA_GUI). The important difference being that computation errors defer communication with the server and aborted tasks don't. So indeed, it's preferable to abort them, but that's not possible on unattended machines and I certainly don't have the whole day to abort deprecated Milkyway tasks.

Frankly, I am tired of correcting you all the time. Can't you research anything before you post?


I don't know what your problem is, but my computers work just fine. If things end in errors, they stop all by themselves. Then BOINC finds something else to do. Stop being so bloody rude; I'm getting sick of your arrogance.

Vortac
Joined: 22 Apr 09
Posts: 77
Credit: 1,051,203,004
RAC: 110,008

Message 66401 - Posted: 7 May 2017, 23:50:42 UTC - in response to Message 66399.

I don't know what your problem is, but my computers work just fine.

So you haven't noticed yet that this is the thread about the problems which have appeared after the latest Scheduled Server Maintenance? It's in the title of the thread, in case you have missed it. The computational errors mentioned here so often are the result of the latest Milkyway@home upgrades - Jake himself kindly invited us to provide some feedback on that. And your hosts are also producing such errors, check here for example:

https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=686691&offset=0&show_names=0&state=6&appid

So, you see the problem now? You are trying to provide some advice, but you don't even know what you are talking about (not even your own hosts). You should be grateful for this education, not angry.

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 68

Message 66404 - Posted: 8 May 2017, 14:06:47 UTC - in response to Message 66401.

I don't know what your problem is, but my computers work just fine.

So you haven't noticed yet that this is the thread about the problems which have appeared after the latest Scheduled Server Maintenance? It's in the title of the thread, in case you have missed it. The computational errors mentioned here so often are the result of the latest Milkyway@home upgrades - Jake himself kindly invited us to provide some feedback on that. And your hosts are also producing such errors, check here for example:

https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=686691&offset=0&show_names=0&state=6&appid

So, you see the problem now? You are trying to provide some advice, but you don't even know what you are talking about (not even your own hosts). You should be grateful for this education, not angry.


I've got 194 valid and only 17 in error, hardly cause for concern. It's not preventing me getting any more from the server, and if it did it would switch to Einstein, Universe, or SETI without my intervention.

And I have NEVER had a blue screen from any project, apart from a computer with dodgy RAM - I used to build them and used BOINC to do a 3 day burn in test. Any errors or excessively high temperatures and I checked the hardware.

It's you that keeps being abusive, I'm a calm person.

Vortac
Joined: 22 Apr 09
Posts: 77
Credit: 1,051,203,004
RAC: 110,008

Message 66405 - Posted: 8 May 2017, 14:50:27 UTC - in response to Message 66404.

I've got 194 valid and only 17 in error, hardly cause for concern. It's not preventing me getting any more from the server

OK. But, to explain it once again, with multiple GPUs and higher output, you would get so many computational errors that it would prevent you from getting more tasks from the server.

I guess it's the mechanism to prevent faulty machines from thrashing thousands of workunits. But in this case computational errors are not happening because of faulty hardware, but because of the recent switch to Milkyway version 1.46, which deprecated thousands of older tasks that cannot be computed successfully with the newer application. In normal circumstances, 17 computational errors against 194 valids would be an 8% failure rate: very high and a strong indication of hardware problems.
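
The 8% figure can be checked with a couple of lines. This is only a sketch of the arithmetic, taking the failure rate as errors divided by all finished tasks (errors plus valids); the function name is made up for illustration.

```python
def failure_rate(errors: int, valid: int) -> float:
    """Fraction of finished tasks that ended in error."""
    total = errors + valid
    if total == 0:
        raise ValueError("no finished tasks")
    return errors / total


# Numbers quoted in the thread: 17 errors, 194 valid results.
rate = failure_rate(17, 194)
print(f"{rate:.1%}")  # prints 8.1%
```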

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 68

Message 66406 - Posted: 8 May 2017, 15:00:26 UTC - in response to Message 66405.

I've got 194 valid and only 17 in error, hardly cause for concern. It's not preventing me getting any more from the server

OK. But, to explain it once again, with multiple GPUs and higher output, you would get so many computational errors that it would prevent you from getting more tasks from the server.

I guess it's the mechanism to prevent faulty machines from thrashing thousands of workunits. But in this case computational errors are not happening because of faulty hardware, but because of the recent switch to Milkyway version 1.46, which deprecated thousands of older tasks that cannot be computed successfully with the newer application. In normal circumstances, 17 computational errors against 194 valids would be an 8% failure rate: very high and a strong indication of hardware problems.


Is the server block not done on a percentage? If my 17 to 194 isn't triggering it, why should 170 to 1940?

Actually I can't be sure it hasn't triggered it on mine - is that recorded in my results data somewhere? I might not notice as Milkyway is set to 25% share at the moment, all I'd see is it using the other three projects.

Vortac
Joined: 22 Apr 09
Posts: 77
Credit: 1,051,203,004
RAC: 110,008

Message 66407 - Posted: 8 May 2017, 15:50:50 UTC - in response to Message 66406.

Is the server block not done on a percentage? If my 17 to 194 isn't triggering it, why should 170 to 1940?

That's a good question. Jake will know this for sure, but I believe the trigger is set on absolute numbers, not on a percentage. Because 1% error-rate on a machine with 8 powerful GPUs is still literally thousands of thrashed tasks, while on a machine with just one low-range GPU, 1% is a couple of tasks only. It makes more sense to use absolute numbers to protect the network from faulty machines.


Actually I can't be sure it hasn't triggered it on mine - is that recorded in my results data somewhere?

I think such events aren't shown in the Event Log, so the only way to spot it is to check your host's last contact time through your Milkyway account. If your host hasn't contacted the server for hours or more, communication might have been deferred due to computational errors. Or, if the machine is attended, check it through your BOINC Manager - the communication deferred timer is always shown on the Projects tab (under the Status column).

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 68

Message 66408 - Posted: 8 May 2017, 15:58:58 UTC - in response to Message 66407.

That's a good question. Jake will know this for sure, but I believe the trigger is set on absolute numbers, not on a percentage. Because 1% error-rate on a machine with 8 powerful GPUs is still literally thousands of thrashed tasks, while on a machine with just one low-range GPU, 1% is a couple of tasks only. It makes more sense to use absolute numbers to protect the network from faulty machines.


I disagree. Let's say I have 1 GPU, and you have 8 of the same model GPU in one machine. I give back 2 faulty tasks and 98 good ones. You give 16 faulty tasks and 784 good ones. We both have a 2% failure rate, and we're both giving 98% useful results. Why should your computer be treated any differently by the server? Yes, 8 times as many faulty tasks, but you're doing 8 times as many good ones too. Now if you had 8 single GPU machines, and one of them was giving all the faulty tasks, and the other 7 weren't, then I'd say the server should block your one computer.
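
The two positions can be put side by side in a small sketch using the 1-GPU and 8-GPU numbers above. The thresholds (10 faulty tasks, 5% error rate) are purely hypothetical; nothing in this thread says what the server actually uses, or whether it counts in absolute numbers or percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostStats:
    faulty: int
    good: int

    @property
    def error_rate(self) -> float:
        """Faulty tasks as a fraction of all returned tasks."""
        return self.faulty / (self.faulty + self.good)


def blocked_by_count(host: HostStats, max_faulty: int) -> bool:
    """Absolute trigger: block once the faulty count exceeds a fixed limit."""
    return host.faulty > max_faulty


def blocked_by_rate(host: HostStats, max_rate: float) -> bool:
    """Percentage trigger: block once the error rate exceeds a fixed fraction."""
    return host.error_rate > max_rate


one_gpu = HostStats(faulty=2, good=98)       # 2 of 100 returned tasks
eight_gpu = HostStats(faulty=16, good=784)   # 16 of 800 returned tasks

# Hypothetical limits, chosen only to show how the two rules differ.
MAX_FAULTY = 10
MAX_RATE = 0.05

for name, host in (("1 GPU", one_gpu), ("8 GPUs", eight_gpu)):
    print(name, f"{host.error_rate:.0%}",
          blocked_by_count(host, MAX_FAULTY),
          blocked_by_rate(host, MAX_RATE))
```

With these made-up limits the absolute rule blocks only the 8-GPU host, while the percentage rule treats both hosts the same, which is the disagreement in a nutshell.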

I think such events aren't shown in the Event Log, so the only way to spot it is to check your host's last contact time through your Milkyway account. If your host hasn't contacted the server for hours or more, communication might have been deferred due to computational errors. Or, if the machine is attended, check it through your BOINC Manager - the communication deferred timer is always shown on the Projects tab (under the Status column).


Can't really tell, as it flips between 4 projects. I can't see a notice in the messages saying "Message from Server - deferred for 1 day for faulty WUs" or similar.

Darrell
Joined: 28 Mar 09
Posts: 9
Credit: 16,106,467
RAC: 2,178

Message 66411 - Posted: 8 May 2017, 21:10:08 UTC - in response to Message 66327.

Hi Jake, this started this morning:

<search_application> milkyway_separation 1.46 Windows x86 double OpenCL </search_application>
Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 5 </number_WUs>
<number_params_per_WU> 21 </number_params_per_WU>
Number of parameters doesn't make sense

The tasks from the four runs you mentioned are all failing, while those not from those runs are completing.

None of the four task types work with number_params_per_WU set to 21; tasks that have this parameter set to 20 run OK.
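
To sort saved stderr output by this parameter, something like the following might do. It is a rough sketch that only looks for the `<number_params_per_WU>` tag and the "Number of parameters doesn't make sense" line shown in the log above; the function names are invented, and the 20-versus-21 pattern is just the observation from this post, not a documented rule.

```python
import re
from typing import Optional

# Matches the tag exactly as it appears in the stderr excerpt above.
PARAMS_RE = re.compile(
    r"<number_params_per_WU>\s*(\d+)\s*</number_params_per_WU>"
)
ERROR_LINE = "Number of parameters doesn't make sense"


def params_per_wu(stderr_text: str) -> Optional[int]:
    """Return the number_params_per_WU value, or None if the tag is absent."""
    match = PARAMS_RE.search(stderr_text)
    return int(match.group(1)) if match else None


def has_parameter_error(stderr_text: str) -> bool:
    """True if the log contains the parameter-count error line."""
    return ERROR_LINE in stderr_text


sample = """<number_WUs> 5 </number_WUs>
<number_params_per_WU> 21 </number_params_per_WU>
Number of parameters doesn't make sense
"""

print(params_per_wu(sample), has_parameter_error(sample))  # prints: 21 True
```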

Brickhead
Joined: 20 Mar 08
Posts: 108
Credit: 2,339,824,289
RAC: 1,659,655

Message 66412 - Posted: 8 May 2017, 21:32:17 UTC

Jake, what are these being generated from scratch (not quota-fill) today?

de_modfit_19_3s_146_bundle5_ModfitConstraintsWithDisk_1...

They all fail, and the names fit neither your list nor the old run.

Vortac
Joined: 22 Apr 09
Posts: 77
Credit: 1,051,203,004
RAC: 110,008

Message 66416 - Posted: 9 May 2017, 5:59:15 UTC - in response to Message 66408.

I can't see a notice in the messages saying "Message from Server - deferred for 1 day for faulty WUs" or similar.

Well, there are so many erroneous tasks now (apparently, a new batch that is failing has arrived) that you will almost surely get a 24-hour deferment too.

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 68

Message 66417 - Posted: 9 May 2017, 9:36:11 UTC - in response to Message 66416.
Last modified: 9 May 2017, 9:36:50 UTC

I can't see a notice in the messages saying "Message from Server - deferred for 1 day for faulty WUs" or similar.

Well, there are so many erroneous tasks now (apparently, a new batch that is failing has arrived) that you will almost surely get a 24-hour deferment too.


Yes I did notice a load of failures sitting on 3 of my machines this morning. I guess failures help them to fix programming problems, and they don't use much computing time. Having 4 projects helps with errors and server downtime etc. I've never had a computer sat idle. Presumably the new application will be more efficient or something, so it's worth experimenting with it.

LostInTennessee
Joined: 20 Jun 10
Posts: 5
Credit: 471,106,230
RAC: 454,849

Message 66421 - Posted: 9 May 2017, 13:47:01 UTC - in response to Message 66416.

I can't see a notice in the messages saying "Message from Server - deferred for 1 day for faulty WUs" or similar.

Well, there are so many erroneous tasks now (apparently, a new batch that is failing has arrived) that you will almost surely get a 24-hour deferment too.



Same problem here!

[VENETO] boboviz
Joined: 10 Feb 09
Posts: 28
Credit: 3,815,663
RAC: 29,003

Message 66428 - Posted: 11 May 2017, 6:31:30 UTC
Last modified: 11 May 2017, 6:31:39 UTC

A LOT of WUs are OK.
Some WUs crash after 2 seconds:

Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Switching to Parameter File 'astronomy_parameters.txt'
malloc failed: 2969687792 bytes
20:35:15 (7596): called boinc_finish(1)


Strange.
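
For scale, the failed allocation in the log above can be converted to GiB. This is just arithmetic on the number in the `malloc failed` line; it does not explain why the task asked for that much memory. The comparison with 2 GiB is included only because an earlier log in this thread names a "Windows x86" build, and whether that is related to the failure is an assumption, not something the thread establishes.

```python
GIB = 2 ** 30  # bytes per gibibyte


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes."""
    return n_bytes / GIB


requested = 2969687792  # from "malloc failed: 2969687792 bytes"

print(f"{to_gib(requested):.2f} GiB")  # prints 2.77 GiB
print(requested > 2 ** 31)             # prints True: above 2 GiB
```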

Peter Hucker
Joined: 5 Jul 11
Posts: 103
Credit: 15,265,491
RAC: 68

Message 66429 - Posted: 11 May 2017, 9:55:43 UTC - in response to Message 66428.

A LOT of WUs are OK.
Some WUs crash after 2 seconds:
Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Switching to Parameter File 'astronomy_parameters.txt'
malloc failed: 2969687792 bytes
20:35:15 (7596): called boinc_finish(1)


Strange.


I'm getting fewer and fewer. I think the problem is fixed; we're just using up the broken ones. On my main fast computer they are all succeeding now. Only one of the slower computers, which had some queued up, is still failing a few.

Copyright © 2017 AstroInformatics Group