New(er) Searches

Message boards : News : New(er) Searches

Author Message
Matthew
Volunteer moderator
Project developer
Project scientist
Joined: 6 May 09
Posts: 217
Credit: 6,856,375
RAC: 0

Message 56023 - Posted: 31 Oct 2012, 0:37:32 UTC

I've started some new searches:

ps_separation_15_3s_stndrd_1
ps_separation_22_3s_free0_1
ps_separation_22_3s_edge0_2
de_separation_15_3s_stndrd_1
de_separation_22_3s_free0_1
de_separation_22_3s_edge0_2

I found a few minor errors in the parameter files - I wouldn't have expected them to be significant, but with the weird errors that we are seeing, I figured it would be best to clean the files up and restart the runs. I also started a set of Stripe 15 runs, since these have been successful previously.

When the NaN/Inf errors occurred, we were seeing a strange exit status; I'm trying to track that down. More news soon.

Cheers,
Matthew N.

Link
Joined: 19 Jul 10
Posts: 327
Credit: 16,283,020
RAC: 0

Message 56029 - Posted: 31 Oct 2012, 17:10:21 UTC - in response to Message 56023.
Last modified: 31 Oct 2012, 17:20:52 UTC

Those seem to run OK, but the max. of 1 (or actually 2) errors is not enough: wuid=261886822.

I got a result that "looks OK" but needed confirmation from a wingman. Unfortunately, both wingmen do little more than generate computing errors, so the WU ended up as "completed, can't validate".
____________
.

GaryG
Joined: 29 Aug 12
Posts: 31
Credit: 40,781,945
RAC: 0

Message 56031 - Posted: 31 Oct 2012, 17:49:12 UTC

I also have two that completed but can't validate.

http://milkyway.cs.rpi.edu/milkyway/results.php?userid=461415&offset=0&show_names=0&state=4&appid=

Link
Joined: 19 Jul 10
Posts: 327
Credit: 16,283,020
RAC: 0

Message 56032 - Posted: 31 Oct 2012, 18:39:08 UTC - in response to Message 56031.

I also have two that completed but can't validate.

http://milkyway.cs.rpi.edu/milkyway/results.php?userid=461415&offset=0&show_names=0&state=4&appid=

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=464745&offset=0&show_names=0&state=4&appid= (the other url only you can open)

Yes, that's exactly the same thing: there are too many improperly working computers for the current max errors setting.
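
A rough sketch of why a low error limit kills otherwise healthy WUs. The parameter names `min_quorum` and `max_error_results` mirror BOINC's workunit settings, but the function, the outcome labels and the exact comparison rule are assumptions made for illustration, not the project's server code:

```python
def workunit_outcome(results, min_quorum=3, max_error_results=1):
    """Classify a workunit from its result states, in arrival order.

    results: list of strings, each "ok" or "error" (hypothetical labels).
    Returns "validated", "too many errors" or "pending".
    """
    ok = 0
    errors = 0
    for state in results:
        if state == "ok":
            ok += 1
        else:
            errors += 1
        # Assumed rule: the WU is given up once errors exceed the limit.
        if errors > max_error_results:
            return "too many errors"
        # Assumed rule: enough good results reach the quorum.
        if ok >= min_quorum:
            return "validated"
    return "pending"


# Two good results, then two broken hosts: the WU is lost with a limit of 1.
print(workunit_outcome(["ok", "ok", "error", "error"]))
# With a higher limit, a later good result can still complete the quorum.
print(workunit_outcome(["ok", "ok", "error", "error", "ok"], max_error_results=3))
```

In this toy model the two good results are wasted only because the limit is reached before a third working host gets the task.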
____________
.

astro-marwil
Joined: 3 Jul 12
Posts: 13
Credit: 7,601,982
RAC: 0

Message 56037 - Posted: 31 Oct 2012, 23:37:23 UTC - in response to Message 56023.

Hello!
All errors are still of type 0x1 (incorrect function), including on the new types of tasks.

Kind regards
Martin

ritterm
Joined: 16 Jun 08
Posts: 92
Credit: 365,629,434
RAC: 440

Message 56038 - Posted: 1 Nov 2012, 1:31:28 UTC

These new searches are looking much, much better, at least for me. My error rate is down to less than 20/day now, whereas it had consistently been between 60 and 80/day. Only 4 errors in the last 12 hours with the edge0/free0 runs.
____________

GaryG
Joined: 29 Aug 12
Posts: 31
Credit: 40,781,945
RAC: 0

Message 56039 - Posted: 1 Nov 2012, 2:02:46 UTC - in response to Message 56032.

I also have two that completed but can't validate.

http://milkyway.cs.rpi.edu/milkyway/results.php?userid=461415&offset=0&show_names=0&state=4&appid=

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=464745&offset=0&show_names=0&state=4&appid= (the other url only you can open)

Yes, that's exactly the same thing: there are too many improperly working computers for the current max errors setting.


Looks like they are probably bad in any case; see the following WU.

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=261364915
____________

Link
Joined: 19 Jul 10
Posts: 327
Credit: 16,283,020
RAC: 0

Message 56041 - Posted: 1 Nov 2012, 12:27:45 UTC - in response to Message 56037.

Hello!
All errors are still of type 0x1 (incorrect function), including on the new types of tasks.

???
All errors (except for one) are from the old separation_22_3s_edge/free_3 batches.
____________
.

Link
Joined: 19 Jul 10
Posts: 327
Credit: 16,283,020
RAC: 0

Message 56042 - Posted: 1 Nov 2012, 12:34:41 UTC - in response to Message 56039.
Last modified: 1 Nov 2012, 12:36:01 UTC

Looks like they are probably bad in any case; see the following WU.

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=261364915

No, I don't think so. The first two results look OK; however, they needed a third one as confirmation (many WUs need 3 results ATM). Unfortunately, hosts 144013 and 143124, to which _2 and _3 were assigned, are doing pretty much nothing but trashing WUs; look at their task lists.
____________
.

astro-marwil
Joined: 3 Jul 12
Posts: 13
Credit: 7,601,982
RAC: 0

Message 56044 - Posted: 1 Nov 2012, 16:55:14 UTC
Last modified: 1 Nov 2012, 16:56:37 UTC

Hello!
For me, about 3% of tasks still end in failure, and 16% cannot be validated immediately.
It may have become a bit better than before, but I'm not sure.
On the other hand, the total crunching power of 349 TFLOPS has been increasing since yesterday, which is a good sign. It has dropped by about 20% from 545 TFLOPS before this bad task series. This drop is due not only to the higher failure rate but also to the very high share of tasks that have to be verified, very often by more than one wingman. This extra crunching ties up a lot of crunching effort.

Kind regards and happy crunching
Martin

Link
Joined: 19 Jul 10
Posts: 327
Credit: 16,283,020
RAC: 0

Message 56046 - Posted: 1 Nov 2012, 19:34:07 UTC

Another WU killed by hosts with issues: 262295807.
____________
.

Matthew
Volunteer moderator
Project developer
Project scientist
Joined: 6 May 09
Posts: 217
Credit: 6,856,375
RAC: 0

Message 56050 - Posted: 1 Nov 2012, 21:00:10 UTC

Thanks for the feedback! I'm still looking into this; I hope to have it figured out tomorrow.

-Matthew N.

Link
Joined: 19 Jul 10
Posts: 327
Credit: 16,283,020
RAC: 0

Message 56051 - Posted: 1 Nov 2012, 22:08:54 UTC

One more. Apparently a user abort is also counted as an error that might indicate a bug in the WU.
____________
.

dskagcommunity
Joined: 26 Feb 11
Posts: 170
Credit: 183,085,176
RAC: 0

Message 56065 - Posted: 3 Nov 2012, 13:47:16 UTC
Last modified: 3 Nov 2012, 13:48:12 UTC

Over yesterday and today combined I had only 4 computing errors, so it seems things will be back to normal for me soon ^^

Oh yes, and good decision to double the computing time per WU :)
____________
DSKAG Austria Research Team: http://www.research.dskag.at



astro-marwil
Joined: 3 Jul 12
Posts: 13
Credit: 7,601,982
RAC: 0

Message 56071 - Posted: 3 Nov 2012, 21:10:08 UTC

Hello!
I haven't had any crunching errors for more than 3 days, but I still have lots of tasks that are not validated. But I see under "pending tasks" / WU ID that there are still a lot of failed tasks. The total number of pending tasks listed is shrinking as well, and the total crunching power is still increasing.

So you're on the right path! Hopefully!

Kind regards and happy crunching.
Martin

Ray_GTI-R
Joined: 5 Nov 10
Posts: 69
Credit: 15,061,882
RAC: 22

Message 56076 - Posted: 4 Nov 2012, 2:28:03 UTC
Last modified: 4 Nov 2012, 2:49:25 UTC

Is it me, or do all WUs suddenly require longer to crunch?

Check my stats: e.g., the ATI GPU takes 30% longer, and most CPU tasks require about the same extra, but an Android machine (ODROID-X) can take more than 60% longer: 46,362.69 seconds (12.87 hours) where it used to be 27,830.55 seconds (7.73 hours)???

Credits per WU appear to have increased, so maybe it all balances out in the end?
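
For anyone who wants to check the slowdown on their own hosts, the percentage can be worked out like this. It is only a small sketch using the two ODROID-X runtimes quoted above; the function name is made up for the example:

```python
def percent_longer(old_seconds, new_seconds):
    """Relative increase in runtime, in percent (hypothetical helper)."""
    return (new_seconds - old_seconds) / old_seconds * 100.0


# Runtimes quoted in the post for the ODROID-X, in seconds.
old_s = 27830.55
new_s = 46362.69
print(f"{percent_longer(old_s, new_s):.1f}% longer")
```

The same function works for GPU or CPU task times copied from the task list.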


____________

Link
Joined: 19 Jul 10
Posts: 327
Credit: 16,283,020
RAC: 0

Message 56078 - Posted: 4 Nov 2012, 9:00:17 UTC - in response to Message 56071.

But I see under "pending tasks" / WU ID that there are still a lot of failed tasks.

Pending tasks are not failed, they are just waiting for a wingman to confirm your result.
____________
.

astro-marwil
Send message
Joined: 3 Jul 12
Posts: 13
Credit: 7,601,982
RAC: 0

Message 56080 - Posted: 4 Nov 2012, 12:50:06 UTC - in response to Message 56078.
Last modified: 4 Nov 2012, 12:58:20 UTC

Hello!

Pending tasks are not failed, they are just waiting for a wingman to confirm your result.

That's right, but they also require extra crunching power and so reduce the possible progress of the project. So a high rate of not validated tasks is still a drawback. See also this message further down.

Kind regards and happy crunching.
Martin

Link
Joined: 19 Jul 10
Posts: 327
Credit: 16,283,020
RAC: 0

Message 56088 - Posted: 4 Nov 2012, 15:39:25 UTC - in response to Message 56080.

Pending tasks are not failed, they are just waiting for a wingman to confirm your result.

That's right, but they also require extra crunching power and so reduce the possible progress of the project. So a high rate of not validated tasks is still a drawback.

The extra crunching is used to ensure that only proper results get validated, see this post. Quality of the results is more important than pure throughput. Many projects send out the same WU to at least two computers by default; Milkyway still does that only if the validator is not sure about the first (and possibly also the second) result. It just happens more often now than it did in the past.
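
Roughly, the idea can be pictured like this. It is a minimal sketch under my own assumptions: the function name, the `trust_single` flag, the tolerance and the quorum of 2 are invented for the example and are not the actual Milkyway validator:

```python
import math


def validate(results, quorum=2, rel_tol=1e-9, trust_single=False):
    """Decide what to do with the results returned so far for one WU.

    results: list of floats reported by different hosts.
    trust_single: assumed flag meaning "the validator is sure about a lone result".
    Returns ("valid", value) or ("need_more", None).
    """
    # NaN/Inf results can never count towards validation.
    finite = [r for r in results if math.isfinite(r)]
    if trust_single and len(finite) == 1:
        return ("valid", finite[0])
    for a in finite:
        # Count the results (including a itself) that agree with a.
        agreeing = [b for b in finite if math.isclose(a, b, rel_tol=rel_tol)]
        if len(agreeing) >= quorum:
            return ("valid", a)
    # Not enough agreement yet: send the WU to another wingman.
    return ("need_more", None)


print(validate([1.25], trust_single=True))
print(validate([1.25]))
print(validate([1.25, float("nan"), 1.25]))
```

Every "need_more" answer costs one more full crunch of the same WU, which is the extra load being discussed here.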
____________
.

astro-marwil
Joined: 3 Jul 12
Posts: 13
Credit: 7,601,982
RAC: 0

Message 56097 - Posted: 5 Nov 2012, 0:01:13 UTC - in response to Message 56088.
Last modified: 5 Nov 2012, 0:04:34 UTC

Hello Link!

Pending tasks are not failed, they are just waiting for a wingman to confirm your result.


That's right, but they also require extra crunching power and so reduce the possible progress of the project. So a high rate of not validated tasks is still a drawback.


The extra crunching is used to ensure that only proper results get validated, ...


Well, I know about this. But the rate of extra validation runs is now about a factor of 4 to 5 higher than before, and that ties up crunching power. Only the software developers can decide whether this has to be accepted as unavoidable or whether it can be improved by some program code modification. Until a month or so ago it was better, and it would be good to get back to that situation. That is it.
And now I don't want to talk about this any more. It's not worth it.

Kind regards and happy crunching
Martin



Copyright © 2018 AstroInformatics Group