Scheduled Server Maintenance Concluded

Author	Message
Cliff Joined: 28 Nov 14 Posts: 51 Credit: 86,696,721 RAC: 0	Message 66390 - Posted: 6 May 2017, 12:54:06 UTC - in response to Message 66367. Last modified: 6 May 2017, 12:54:46 UTC
> Everyone,
> Sorry for the inconvenience caused as the work units clear the queue. I tried to mitigate the issue beforehand since I knew it would be a problem. Luckily, the new application seems to be working as intended so everything should be running well once the work units are cleared.
> Jake
Be nice if the server didn't keep on sending last gen WU again and again. As resends that is... I've started aborting any I see, but I'm just monitoring 1 rig with a quick look at the others now and then.
Regards, Cliff.
-- Been there Done That, still no Damn T-Shirt

John G Joined: 1 Apr 10 Posts: 49 Credit: 171,863,025 RAC: 0	Message 66392 - Posted: 7 May 2017, 0:09:11 UTC
I have just been aborting them. No big deal!!
Regards John

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 378,327,049 RAC: 9,076	Message 66393 - Posted: 7 May 2017, 2:21:19 UTC - in response to Message 66392.
> I have just been aborting them. No big deal!!
> Regards John
Indeed, people get too worked up about these things. Run more than one project, allow your computer to not always be running, whatever. It's hardly the end of the world if you're not calculating something 24/7. The project is being updated, some hiccups will arise, get used to it.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 66396 - Posted: 7 May 2017, 20:58:21 UTC - in response to Message 66392.
> I have just been aborting them. No big deal!!
How do you abort a task on an unattended machine?

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 378,327,049 RAC: 9,076	Message 66397 - Posted: 7 May 2017, 21:11:21 UTC - in response to Message 66396.
>> I have just been aborting them. No big deal!!
> How do you abort a task on an unattended machine?
They abort themselves.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 66398 - Posted: 7 May 2017, 22:39:45 UTC - in response to Message 66397.
> They abort themselves.
Ah, more of your erroneous assumptions. They don't abort themselves. They end in 'Computational error' (1 (0x1) Unknown error number). Which is completely different from 'Aborted' (203 (0xcb) EXIT_ABORTED_VIA_GUI). The important difference being that computation errors defer communication with the server and aborted tasks don't. So indeed, it's preferable to abort them, but that's not possible on unattended machines and I certainly don't have the whole day to abort deprecated Milkyway tasks. Frankly, I am tired of correcting you all the time. Can't you research anything before you post?
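The two exit statuses quoted in the post above can be kept apart with a small lookup. This is only a sketch: the codes and labels are copied from the post, while the names `EXIT_STATUSES` and `describe_exit` are made up for illustration and are not part of BOINC.

```python
# Exit statuses exactly as quoted in the post; nothing else is assumed here.
EXIT_STATUSES = {
    1: "Computational error (Unknown error number)",
    203: "Aborted (EXIT_ABORTED_VIA_GUI)",
}


def describe_exit(code: int) -> str:
    """Format an exit code as decimal, hex, and the label quoted in the thread."""
    label = EXIT_STATUSES.get(code, "not covered in this thread")
    return f"{code} ({hex(code)}) {label}"


if __name__ == "__main__":
    for code in (1, 203):
        print(describe_exit(code))
```

Any other exit code falls through to the "not covered in this thread" label rather than being guessed at.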

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 378,327,049 RAC: 9,076	Message 66399 - Posted: 7 May 2017, 22:51:45 UTC - in response to Message 66398.
>> They abort themselves.
> Ah, more of your erroneous assumptions. They don't abort themselves. They end in 'Computational error' (1 (0x1) Unknown error number). Which is completely different from 'Aborted' (203 (0xcb) EXIT_ABORTED_VIA_GUI). The important difference being that computation errors defer communication with the server and aborted tasks don't. So indeed, it's preferable to abort them, but that's not possible on unattended machines and I certainly don't have the whole day to abort deprecated Milkyway tasks. Frankly, I am tired of correcting you all the time. Can't you research anything before you post?
I don't know what your problem is, but my computers work just fine. If things end in errors, they stop all by themselves. Then BOINC finds something else to do. Stop being so bloody rude; I'm getting sick of your arrogance.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 66401 - Posted: 7 May 2017, 23:50:42 UTC - in response to Message 66399.
> I don't know what your problem is, but my computers work just fine.
So you haven't noticed yet that this is the thread about the problems which have appeared after the latest Scheduled Server Maintenance? It's in the title of the thread, in case you have missed it. The computational errors mentioned here so often are the result of the latest Milkyway@home upgrades - Jake himself kindly invited us to provide some feedback on that. And your hosts are also producing such errors, check here for example: https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=686691&offset=0&show_names=0&state=6&appid
So, you see the problem now? You are trying to provide some advice, but you don't even know what you are talking about (not even your own hosts). You should be grateful for this education, not angry.

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 378,327,049 RAC: 9,076	Message 66404 - Posted: 8 May 2017, 14:06:47 UTC - in response to Message 66401.
>> I don't know what your problem is, but my computers work just fine.
> So you haven't noticed yet that this is the thread about the problems which have appeared after the latest Scheduled Server Maintenance? It's in the title of the thread, in case you have missed it. The computational errors mentioned here so often are the result of the latest Milkyway@home upgrades - Jake himself kindly invited us to provide some feedback on that. And your hosts are also producing such errors, check here for example: https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=686691&offset=0&show_names=0&state=6&appid
> So, you see the problem now? You are trying to provide some advice, but you don't even know what you are talking about (not even your own hosts). You should be grateful for this education, not angry.
I've got 194 valid and only 17 in error, hardly cause for concern. It's not preventing me getting any more from the server, and if it did it would switch to Einstein, Universe, or SETI without my intervention. And I have NEVER had a blue screen from any project, apart from a computer with dodgy RAM - I used to build them and used BOINC to do a 3-day burn-in test. Any errors or excessively high temperatures and I checked the hardware. It's you that keeps being abusive; I'm a calm person.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 66405 - Posted: 8 May 2017, 14:50:27 UTC - in response to Message 66404.
> I've got 194 valid and only 17 in error, hardly cause for concern. It's not preventing me getting any more from the server
OK. But, to explain it once again, with multiple GPUs and higher output, you would get so many computational errors that it would prevent you from getting more tasks from the server. I guess it's the mechanism to prevent faulty machines from thrashing thousands of workunits. But in this case computational errors are not happening because of faulty hardware, but because of the recent switch to Milkyway version 1.46, deprecating thousands of older tasks which cannot be computed successfully with the newer application. In normal circumstances, 17 computational errors against 194 valid results would be an 8% failure rate - very high and a strong indication of some hardware problems.
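The 8% figure in the post above follows from counting errors against all finished tasks (errors plus valid results). A minimal check of that arithmetic, with `error_rate` as an illustrative helper name rather than anything from BOINC:

```python
def error_rate(errors: int, valid: int) -> float:
    """Share of finished tasks that ended in error."""
    total = errors + valid
    if total == 0:
        raise ValueError("no finished tasks to compute a rate from")
    return errors / total


# Figures quoted in the thread: 17 errors and 194 valid results.
rate = error_rate(errors=17, valid=194)
print(f"{rate:.1%}")  # 17 / 211, which rounds to the 8% quoted above
```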

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 378,327,049 RAC: 9,076	Message 66406 - Posted: 8 May 2017, 15:00:26 UTC - in response to Message 66405.
>> I've got 194 valid and only 17 in error, hardly cause for concern. It's not preventing me getting any more from the server
> OK. But, to explain it once again, with multiple GPUs and higher output, you would get so many computational errors that it would prevent you from getting more tasks from the server. I guess it's the mechanism to prevent faulty machines from thrashing thousands of workunits. But in this case computational errors are not happening because of faulty hardware, but because of the recent switch to Milkyway version 1.46, deprecating thousands of older tasks which cannot be computed successfully with the newer application. In normal circumstances, 17 computational errors against 194 valid results would be an 8% failure rate - very high and a strong indication of some hardware problems.
Is the server block not done on a percentage? If my 17 to 194 isn't triggering it, why should 170 to 1940? Actually I can't be sure it hasn't triggered it on mine - is that recorded in my results data somewhere? I might not notice as Milkyway is set to 25% share at the moment, all I'd see is it using the other three projects.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 66407 - Posted: 8 May 2017, 15:50:50 UTC - in response to Message 66406.
> Is the server block not done on a percentage? If my 17 to 194 isn't triggering it, why should 170 to 1940?
That's a good question. Jake will know this for sure, but I believe the trigger is set on absolute numbers, not on a percentage. Because a 1% error rate on a machine with 8 powerful GPUs is still literally thousands of thrashed tasks, while on a machine with just one low-range GPU, 1% is a couple of tasks only. It makes more sense to use absolute numbers to protect the network from faulty machines.
> Actually I can't be sure it hasn't triggered it on mine - is that recorded in my results data somewhere?
I think such events aren't shown in the Event Log, therefore the only way to spot it is to check your host's last contact time through your Milkyway account. If your host hasn't contacted the server for hours or more, communication might have been deferred due to computational errors. Or, if the machine is attended, check it through your BOINC Manager - the communication deferred timer is always shown on the Projects tab (under the Status column).

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 378,327,049 RAC: 9,076	Message 66408 - Posted: 8 May 2017, 15:58:58 UTC - in response to Message 66407.
> That's a good question. Jake will know this for sure, but I believe the trigger is set on absolute numbers, not on a percentage. Because a 1% error rate on a machine with 8 powerful GPUs is still literally thousands of thrashed tasks, while on a machine with just one low-range GPU, 1% is a couple of tasks only. It makes more sense to use absolute numbers to protect the network from faulty machines.
I disagree. Let's say I have 1 GPU, and you have 8 of the same model GPU in one machine. I give back 2 faulty tasks and 98 good ones. You give 16 faulty tasks and 784 good ones. We both have a 2% failure rate, and we're both giving 98% useful results. Why should your computer be treated any differently by the server? Yes, 8 times as many faulty tasks, but you're doing 8 times as many good ones too. Now if you had 8 single GPU machines, and one of them was giving all the faulty tasks, and the other 7 weren't, then I'd say the server should block your one computer.
> I think such events aren't shown in the Event Log, therefore the only way to spot it is to check your host's last contact time through your Milkyway account. If your host hasn't contacted the server for hours or more, communication might have been deferred due to computational errors. Or, if the machine is attended, check it through your BOINC Manager - the communication deferred timer is always shown on the Projects tab (under the Status column).
Can't really tell, as it flips between 4 projects. I can't see a notice in the messages saying "Message from Server - deferred for 1 day for faulty WUs" or similar.
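The disagreement above comes down to which quantity a block would be keyed on. The sketch below runs the post's two example hosts through both kinds of rule. Nobody in the thread knows the real server policy, so both limits (`MAX_ERROR_RATE`, `MAX_ERROR_COUNT`) and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostStats:
    name: str
    errors: int
    valid: int

    @property
    def error_rate(self) -> float:
        """Errors as a share of all finished tasks."""
        return self.errors / (self.errors + self.valid)


# Hypothetical limits for illustration only; the real settings are not
# stated anywhere in the thread.
MAX_ERROR_RATE = 0.05
MAX_ERROR_COUNT = 10


def blocked_by_rate(host: HostStats) -> bool:
    """Percentage-style rule: block when the share of errors is too high."""
    return host.error_rate > MAX_ERROR_RATE


def blocked_by_count(host: HostStats) -> bool:
    """Absolute-style rule: block when the raw number of errors is too high."""
    return host.errors > MAX_ERROR_COUNT


# The two hosts from the post: same 2% failure rate, 8x the volume.
one_gpu = HostStats("one GPU", errors=2, valid=98)
eight_gpu = HostStats("eight GPUs", errors=16, valid=784)

if __name__ == "__main__":
    for host in (one_gpu, eight_gpu):
        print(host.name, f"{host.error_rate:.0%}",
              "rate rule:", blocked_by_rate(host),
              "count rule:", blocked_by_count(host))
```

With these made-up limits the rate rule treats both hosts the same, while the count rule blocks only the eight-GPU host, which is exactly the difference the two posters are arguing over.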

Darrell Joined: 28 Mar 09 Posts: 9 Credit: 16,162,511 RAC: 0	Message 66411 - Posted: 8 May 2017, 21:10:08 UTC - in response to Message 66327.
Hi Jake, this started this morning:
<search_application> milkyway_separation 1.46 Windows x86 double OpenCL </search_application>
Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 5 </number_WUs>
<number_params_per_WU> 21 </number_params_per_WU>
Number of parameters doesn't make sense
The tasks with the four you mentioned are all failing, while those not from those runs are completing. All four types of tasks do not like number_params_per_WU set to 21; tasks that have this parameter set to 20 run OK.
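The pattern reported above (failures with `number_params_per_WU` at 21, successes at 20) can be checked against saved task output with a short script. This is a sketch under assumptions: the tag name and sample text are copied from the post, the 21-versus-20 rule is one user's observation rather than a confirmed cause, and the function names are invented for illustration.

```python
import re
from typing import Optional

# Matches the tag exactly as it appears in the stderr output quoted above.
PARAMS_TAG = re.compile(
    r"<number_params_per_WU>\s*(\d+)\s*</number_params_per_WU>"
)


def params_per_wu(stderr_text: str) -> Optional[int]:
    """Return the number_params_per_WU value, or None if the tag is absent."""
    match = PARAMS_TAG.search(stderr_text)
    return int(match.group(1)) if match else None


def matches_reported_failure(stderr_text: str, failing_value: int = 21) -> bool:
    """True when the output carries the value reported as failing in the post."""
    return params_per_wu(stderr_text) == failing_value


# Excerpt of the output quoted in the post.
SAMPLE = """\
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 5 </number_WUs>
<number_params_per_WU> 21 </number_params_per_WU>
Number of parameters doesn't make sense
"""

if __name__ == "__main__":
    print(params_per_wu(SAMPLE), matches_reported_failure(SAMPLE))
```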

Brickhead Joined: 20 Mar 08 Posts: 108 Credit: 2,607,924,860 RAC: 0	Message 66412 - Posted: 8 May 2017, 21:32:17 UTC
Jake, what are these tasks that are being generated from scratch (not quota-fill) today?
de_modfit_19_3s_146_bundle5_ModfitConstraintsWithDisk_1...
They all fail, and the names fit neither your list nor the old run.

Vortac Joined: 22 Apr 09 Posts: 95 Credit: 4,808,181,963 RAC: 0	Message 66416 - Posted: 9 May 2017, 5:59:15 UTC - in response to Message 66408.
> I can't see a notice in the messages saying "Message from Server - deferred for 1 day for faulty WUs" or similar.
Well, there are so many erroneous tasks now (apparently, a new batch that is failing has arrived) that you will almost surely get a 24-hour deferral too.

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 378,327,049 RAC: 9,076	Message 66417 - Posted: 9 May 2017, 9:36:11 UTC - in response to Message 66416. Last modified: 9 May 2017, 9:36:50 UTC
>> I can't see a notice in the messages saying "Message from Server - deferred for 1 day for faulty WUs" or similar.
> Well, there are so many erroneous tasks now (apparently, a new batch that is failing has arrived) that you will almost surely get a 24-hour deferral too.
Yes, I did notice a load of failures sitting on 3 of my machines this morning. I guess failures help them to fix programming problems, and they don't use much computing time. Having 4 projects helps with errors and server downtime etc. I've never had a computer sat idle. Presumably the new application will be more efficient or something, so it's worth experimenting with it.

LostInTennessee Joined: 20 Jun 10 Posts: 5 Credit: 744,974,914 RAC: 0	Message 66421 - Posted: 9 May 2017, 13:47:01 UTC - in response to Message 66416.
>> I can't see a notice in the messages saying "Message from Server - deferred for 1 day for faulty WUs" or similar.
> Well, there are so many erroneous tasks now (apparently, a new batch that is failing has arrived) that you will almost surely get a 24-hour deferral too.
Same problem here!

[VENETO] boboviz Joined: 10 Feb 09 Posts: 52 Credit: 16,397,122 RAC: 180	Message 66428 - Posted: 11 May 2017, 6:31:30 UTC Last modified: 11 May 2017, 6:31:39 UTC
A LOT of WUs are OK.
Some WUs crash after 2 seconds:
Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Switching to Parameter File 'astronomy_parameters.txt'
malloc failed: 2969687792 bytes
20:35:15 (7596): called boinc_finish(1)
Strange.

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 378,327,049 RAC: 9,076	Message 66429 - Posted: 11 May 2017, 9:55:43 UTC - in response to Message 66428.
> A LOT of WUs are OK.
> Some WUs crash after 2 seconds:
> Reading preferences ended prematurely
> BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
> Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
> Switching to Parameter File 'astronomy_parameters.txt'
> malloc failed: 2969687792 bytes
> 20:35:15 (7596): called boinc_finish(1)
> Strange.
I'm getting fewer and fewer. I think the problem is fixed, we're just using up the broken ones. My main fast computer is getting them all succeeding now. Only one of the slower computers which had some queued up is failing a few.