Welcome to MilkyWay@home

New Separation Runs

Profile Jeffery M. Thompson
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 23 Sep 12
Posts: 159
Credit: 16,977,106
RAC: 0
Message 59372 - Posted: 15 Jul 2013, 22:07:59 UTC

I have added the following runs to the currently running separation runs:

de_separation_83_DR_8_rev_3_2
de_separation_85_DR_8_rev_3_2


If you notice any issues with these runs please post to this thread.


Jeff

Profile Jeffery M. Thompson
Message 59379 - Posted: 16 Jul 2013, 15:09:58 UTC

I added the following run to the set:



ps_separation_82_DR_8_rev_3_1




Jeff

Profile Jeffery M. Thompson
Message 59391 - Posted: 18 Jul 2013, 15:26:29 UTC

I have added

de_separation_82_DR_8_rev_3_1

To these runs.


Jeff

Profile Jeffery M. Thompson
Message 59400 - Posted: 19 Jul 2013, 19:19:43 UTC

I have added

de_separation_81_DR_8_rev_3_1


to this set

Link
Joined: 19 Jul 10
Posts: 627
Credit: 19,303,095
RAC: 1,042
Message 59470 - Posted: 28 Jul 2013, 10:40:47 UTC

Not an issue with any of those runs, but since I see it quite often: you might want to reconsider what should be counted as a WU error and what should not.

For example WU 404295517 ended up with:

Task       Computer  Sent                       Time reported or deadline  Status                              Run time (sec)  CPU time (sec)  Credit  Application
532334131  484725    26 Jul 2013, 17:48:27 UTC  26 Jul 2013, 17:49:36 UTC  Aborted by user                     0.00            0.00            ---     MilkyWay@Home v1.02 (opencl_amd_ati)
532335417  293662    26 Jul 2013, 17:51:02 UTC  26 Jul 2013, 22:01:22 UTC  Completed, can't validate           873.83          6.18            0.00    MilkyWay@Home Anonymous platform (ATI GPU)
532461638  323514    26 Jul 2013, 22:02:59 UTC  26 Jul 2013, 22:04:05 UTC  Not started by deadline - canceled  0.00            0.00            ---     MilkyWay@Home v1.00
532462749  487766    26 Jul 2013, 22:05:43 UTC  27 Jul 2013, 11:00:43 UTC  Completed, can't validate           5,393.92        5,249.26        0.00    MilkyWay@Home v1.01
532809360  455984    27 Jul 2013, 11:02:02 UTC  27 Jul 2013, 11:16:11 UTC  Validate error                      242.37          197.93          ---     MilkyWay@Home Anonymous platform (CPU)


I think "Aborted by user" and "Not started by deadline - canceled" do not indicate an issue with a WU and should not be counted as errors.
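
As a rough illustration of the suggestion above, the sketch below counts errors for the example work unit twice: once treating every outcome that did not complete as an error, and once ignoring the two outcomes that say nothing about the WU itself. The names `count_errors` and `ignore`, and the rule that anything not starting with "Completed" counts as an error, are assumptions made for this example and not the project's actual server logic.

```python
# Outcomes copied from the task list for WU 404295517.
outcomes = [
    "Aborted by user",
    "Completed, can't validate",
    "Not started by deadline - canceled",
    "Completed, can't validate",
    "Validate error",
]

# Assumption: these outcomes reflect the user or the scheduler, not the WU.
NOT_A_WU_PROBLEM = frozenset({
    "Aborted by user",
    "Not started by deadline - canceled",
})


def count_errors(outcomes, ignore=frozenset()):
    """Count outcomes that did not complete, skipping any listed in `ignore`."""
    return sum(
        1
        for outcome in outcomes
        if not outcome.startswith("Completed") and outcome not in ignore
    )


print(count_errors(outcomes))                    # 3
print(count_errors(outcomes, NOT_A_WU_PROBLEM))  # 1
```

Under that assumed rule the example WU would carry one error instead of three.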

paris
Joined: 26 Apr 08
Posts: 87
Credit: 64,801,496
RAC: 0
Message 59472 - Posted: 28 Jul 2013, 13:49:36 UTC - in response to Message 59470.  

I agree. I just encountered one of those myself.
WU de_separation_85_DR_8_rev_3_2_1372784654_7235365


Plus SETI Classic = 21,082 WUs

Batman
Joined: 7 Mar 09
Posts: 1
Credit: 20,467,651
RAC: 0
Message 59484 - Posted: 31 Jul 2013, 1:22:50 UTC

Hi

All the similar tasks (as mentioned in the previous post) sent today end with 'computation error' exactly 1.08 minutes after starting. So far about 30 tasks.
None have worked.
All other tasks beforehand ok.
What to do?

Thanks

Steve

Profile Jeffery M. Thompson
Message 59485 - Posted: 31 Jul 2013, 1:30:35 UTC

Reboot.
The previous problems were with validation issues due to user aborts.
When a machine starts to have a run of errors like that, the first thing to do is reboot it.

If it continues after the reboot, which tasks are you getting the error for, and what is the stderr.txt output for the errors?

Profile Jeffery M. Thompson
Message 59486 - Posted: 31 Jul 2013, 1:42:31 UTC

On the Vista Machine

It appears the first bad workunit gets this

7.0.64

Incorrect function.
(0x1) - exit code 1 (0x1)


milkyway_separation 1.00 Windows x86 double
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file
Using SSE3 path
Failed to commit move of 'separation_checkpoint_tmp' to 'separation_checkpoint' (6704): It is too late to perform the requested operation, since the Transaction has already been aborted.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to update checkpoint file ('separation_checkpoint_tmp' to 'separation_checkpoint') (2): No such file or directory
Write checkpoint failed
04:55:30 (6020): called boinc_finish





Let me know if the reboot helps.
That particular work unit has made it through two other machines and your version of the app hasn't changed, so the problem is local to the machine.


Profile Jeffery M. Thompson
Message 59507 - Posted: 2 Aug 2013, 16:47:30 UTC

I have added the run

de_separation_80_DR_8_rev_3_2


Jeff

Link
Message 60544 - Posted: 10 Dec 2013, 12:09:53 UTC
Last modified: 10 Dec 2013, 12:14:08 UTC

I'm getting quite a few validate errors with the de_separation_DR8_82_Rev_4_2_2 runs. Not critical, just about 5 tasks a day; most of this run still validates fine for me.

Seems to be an issue with the 0.82 CAL application, which is the only one I can use with my ATI HD3850. One of my wingmen has the same issue on his HD5800, though I'm not sure why he is getting the CAL app assigned from the server. But it seems to be the application and not the GPU.

I don't know whether you can do something about it or not. I'm aware that the CAL app is actually not supported anymore, but I wanted to inform you anyway. Previous DR8_82 revisions ran fine.

Profile Jeffery M. Thompson
Message 60546 - Posted: 10 Dec 2013, 12:59:02 UTC

Looking into it. I will post follow-ups here.


Jeff Thompson

Len LE/GE
Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0
Message 60554 - Posted: 12 Dec 2013, 0:33:36 UTC

Workunit 471493607

A Cayman (cat 12.6+) and a Tahiti (cat 12.4) gpu, both running the old 0.82 app, are validated against each other and a Cypress (cat 12.3) result with app 1.02 is marked invalid. That Cypress is usually running with ZERO invalids, which makes me wonder if the proper result was marked as bad.

<search_application> milkyway_separation 1.02 Windows x86_64 double OpenCL </search_application>
<background_integral> 0.000250000203482 </background_integral>
<stream_integral> 126.473032247388670 424.189607956046640 0.000000000000000 </stream_integral>
<background_likelihood> -3.715093764065778 </background_likelihood>
<stream_only_likelihood> -14.211423036510251 -3.997919366742975 -235.464758791616960 </stream_only_likelihood>
<search_likelihood> -3.079346382417015 </search_likelihood>

<search_application> milkywayathome_client separation 0.82 Linux x86_64 double CAL++ </search_application>
<background_integral> 0.000250000203482 </background_integral>
<stream_integral> 126.473032247388687 424.189607956046700 0.000000000000000 </stream_integral>
<background_likelihood> -3.715093764065778 </background_likelihood>
<stream_only_likelihood> -14.211423036510251 -3.997919366742975 -4.106932733094966 </stream_only_likelihood>
<search_likelihood> -2.988016909917461 </search_likelihood>

<search_application> milkywayathome_client separation 0.82 Windows x86_64 double CAL++ </search_application>
<background_integral> 0.000250000203482 </background_integral>
<stream_integral> 126.473032247388690 424.189607956046640 0.000000000000000 </stream_integral>
<background_likelihood> -3.715093764065779 </background_likelihood>
<stream_only_likelihood> -14.211423036510251 -3.997919366742975 -4.106932733094965 </stream_only_likelihood>
<search_likelihood> -2.988016909917461 </search_likelihood>


The old CAL app should only be accepted for GPUs (e.g. HD3800) that can't run the newer OpenCL app.
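
To make the disagreement in the results above easier to see, here is a minimal sketch that parses two of the posted result blocks and lists the fields that differ beyond rounding noise. The function names `parse_result` and `differing_fields` and the tolerance `rel_tol=1e-12` are assumptions for illustration only; the tolerance the project's validator actually uses is not stated in this thread.

```python
import math
import re

# Matches lines such as "<tag> 1.0 2.0 </tag>".
TAG_LINE = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>")


def parse_result(text):
    """Map each numeric tag to a list of floats; non-numeric tags are skipped."""
    fields = {}
    for tag, body in TAG_LINE.findall(text):
        try:
            fields[tag] = [float(token) for token in body.split()]
        except ValueError:
            continue  # e.g. <search_application>, which holds text
    return fields


def differing_fields(a, b, rel_tol=1e-12):
    """Return (tag, index) pairs whose values differ by more than rel_tol."""
    diffs = []
    for tag in sorted(a.keys() & b.keys()):
        for index, (x, y) in enumerate(zip(a[tag], b[tag])):
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-15):
                diffs.append((tag, index))
    return diffs


opencl = parse_result("""
<search_application> milkyway_separation 1.02 Windows x86_64 double OpenCL </search_application>
<background_integral> 0.000250000203482 </background_integral>
<stream_integral> 126.473032247388670 424.189607956046640 0.000000000000000 </stream_integral>
<background_likelihood> -3.715093764065778 </background_likelihood>
<stream_only_likelihood> -14.211423036510251 -3.997919366742975 -235.464758791616960 </stream_only_likelihood>
<search_likelihood> -3.079346382417015 </search_likelihood>
""")

cal = parse_result("""
<background_integral> 0.000250000203482 </background_integral>
<stream_integral> 126.473032247388687 424.189607956046700 0.000000000000000 </stream_integral>
<background_likelihood> -3.715093764065778 </background_likelihood>
<stream_only_likelihood> -14.211423036510251 -3.997919366742975 -4.106932733094966 </stream_only_likelihood>
<search_likelihood> -2.988016909917461 </search_likelihood>
""")

print(differing_fields(opencl, cal))
# [('search_likelihood', 0), ('stream_only_likelihood', 2)]
```

With this tolerance, only the third stream_only_likelihood value and search_likelihood are flagged; the last-digit differences in stream_integral are treated as rounding.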


Link
Message 60561 - Posted: 13 Dec 2013, 9:42:30 UTC - in response to Message 60554.  
Last modified: 13 Dec 2013, 9:46:37 UTC

A Cayman (cat 12.6+) and a Tahiti (cat 12.4) gpu, both running the old 0.82 app, are validated against each other and a Cypress (cat 12.3) result with app 1.02 is marked invalid. That Cypress is usually running with ZERO invalids, which makes me wonder if the proper result was marked as bad.

Well, they first have to figure out which app is getting the result wrong. The fact that the CAL result is usually marked invalid does not mean that the result was actually wrong; it simply means that there are more OpenCL results coming in (just like in your case there were more CAL results for that WU).
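
The point about majorities can be sketched as follows. This is a deliberately simplified model, assuming results are compared on a single value (here search_likelihood) and that the largest group of agreeing results wins; the real validator's rules and tolerance are not described in this thread. The function name `validate_by_agreement`, the labels, and the second OpenCL value in the second example are hypothetical.

```python
import math


def validate_by_agreement(results, rel_tol=1e-12):
    """Split (label, value) results into the largest agreeing group and the rest.

    Assumption: labels are unique and a result joins the first group whose
    first member it matches within rel_tol.
    """
    groups = []
    for label, value in results:
        for group in groups:
            if math.isclose(value, group[0][1], rel_tol=rel_tol):
                group.append((label, value))
                break
        else:
            groups.append([(label, value)])
    winner = max(groups, key=len)
    valid = [label for label, _ in winner]
    invalid = [label for label, _ in results if label not in valid]
    return valid, invalid


# Values from the posted work unit: one OpenCL result, two CAL results.
mostly_cal = [
    ("opencl_1", -3.079346382417015),
    ("cal_1", -2.988016909917461),
    ("cal_2", -2.988016909917461),
]
print(validate_by_agreement(mostly_cal))
# (['cal_1', 'cal_2'], ['opencl_1'])

# Hypothetical WU with the same disagreement but two OpenCL hosts.
mostly_opencl = [
    ("opencl_1", -3.079346382417015),
    ("opencl_2", -3.079346382417015),
    ("cal_1", -2.988016909917461),
]
print(validate_by_agreement(mostly_opencl))
# (['opencl_1', 'opencl_2'], ['cal_1'])
```

In both cases the minority result is marked invalid, whichever application is actually computing the right number.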

Looking at the example you posted, the difference is only in the last number of stream_only_likelihood and in search_likelihood, and that's exactly what I also see on my invalid results. If I compare my valid results of this run with those that are marked invalid, I don't see any big difference in the numbers; in particular, I don't see that huge negative number in stream_only_likelihood which the OpenCL app gets. The weird thing about that number is that it's always (at least in all WUs I checked) very close to -235 in the OpenCL result. That's a bit strange: the numbers in my valid WUs are not even nearly that close to each other. They are simply somewhere in the range 0 to -100, most of them in the range 0 to -20.

So the question remains: which application is getting the result right? And that we can't answer from our end.

Link
Message 60562 - Posted: 13 Dec 2013, 10:49:05 UTC - in response to Message 60561.  

The weird thing with that number is that it's always (at least in all WUs I checked) very close to -235 in the OpenCL result. That's a bit strange, the numbers in my valid WUs are not even nearly that close to each other, they are simply somewhere in the range 0 to -100, most of them however in the range 0 to -20.

To make it more precise: while the numbers of OpenCL results of those "problematic" WUs are all in the range of 234.3-235.5, the CAL app gets 2.8-4.1 for those, all negative of course.

The applications agree as to where in those ranges the result is, i.e. if the OpenCL app gets a higher value, then the CAL app does too, as in your example with -235.5 and -4.1. Another combination, which I see on many WUs, is -234.8 and -2.9.

So something in those WUs makes one of the applications get the number way too high or way too low, depending on which one is right.

Len LE/GE
Message 60572 - Posted: 14 Dec 2013, 1:04:06 UTC

Good find on that pattern.

Usually I expect a far newer scientific app to be improved, but you are right that it has to be checked whether a new bug has found its way into the newer app. Before, I thought it was only a difference in the last few digits. The pattern you found gives a completely different picture.

If the old CAL app gives the wrong numbers, the validator needs to be fixed to catch those, and results from the old app coming in from newer (OpenCL-enabled) GPUs need to be rejected to reduce wrong validations.

If the newer app is the problem, we need a quick fix for this one.

Luckily the problematic results seem to be only a small number of all those results coming in.

Link
Message 60583 - Posted: 14 Dec 2013, 13:29:20 UTC

I hope to see some fix or at least some sort of information soon. I'm getting more and more invalids every day; yesterday I got 19 of those, worth over 2000 credits, so now 10% of my daily production is going in the bin. :(

PS: no, that's not about credits, I'd simply prefer to do something more useful for the project.

Profile Jeffery M. Thompson
Message 60585 - Posted: 14 Dec 2013, 17:50:04 UTC

These applications haven't changed. So the problem is in the data, and specifically in how the systems are processing an unconstrained stream.

The overall error rate for the application is not peaking or rising above that of a normal run.
So I am trying to find out why it is spiking for this one set of parameters and for a particular user base.

Right now I am not seeing a pattern that lets me home in on which part of the system is causing the bug.

The background and the third stream set should change with each other: if one is off, the other should be off also.

I am going to grab some invalids on different runs as well, to see whether the problem is also present in other data runs.


So I am looking into this and I will ping back a bit more on what we see.
But I have nothing conclusive to report right now. I brought down one run on the stripe in question and will bring down the other shortly. When they clear through, I want to see if the problem is on other runs also.

Jeff

Link
Message 60586 - Posted: 14 Dec 2013, 18:51:31 UTC - in response to Message 60585.  

Thanks for the update.

Profile Jeffery M. Thompson
Message 60587 - Posted: 14 Dec 2013, 18:54:12 UTC

Thanks for the info

There are two things I have to look at here. Right now I am focusing on the different values being returned; there is also the problem of the wrong app being assigned.



Jeff