New Separation Runs
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
I have added the following runs to the currently running separation runs:

de_separation_83_DR_8_rev_3_2
de_separation_85_DR_8_rev_3_2

If you notice any issues with these runs, please post to this thread.

Jeff
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
I added the following run to the set:

ps_separation_82_DR_8_rev_3_1

Jeff
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
I have added de_separation_82_DR_8_rev_3_1 to these runs.

Jeff
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
I have added de_separation_81_DR_8_rev_3_1 to this set.
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
Not an issue with any of those runs, but since I see it quite often: you might want to reconsider what should be counted as a WU error and what should not. For example, WU 404295517 ended up with (task | computer | sent | reported | status | run time | CPU time | credit | application):

532334131 | 484725 | 26 Jul 2013, 17:48:27 UTC | 26 Jul 2013, 17:49:36 UTC | Aborted by user | 0.00 | 0.00 | --- | MilkyWay@Home v1.02 (opencl_amd_ati)
532335417 | 293662 | 26 Jul 2013, 17:51:02 UTC | 26 Jul 2013, 22:01:22 UTC | Completed, can't validate | 873.83 | 6.18 | 0.00 | MilkyWay@Home Anonymous platform (ATI GPU)
532461638 | 323514 | 26 Jul 2013, 22:02:59 UTC | 26 Jul 2013, 22:04:05 UTC | Not started by deadline - canceled | 0.00 | 0.00 | --- | MilkyWay@Home v1.00
532462749 | 487766 | 26 Jul 2013, 22:05:43 UTC | 27 Jul 2013, 11:00:43 UTC | Completed, can't validate | 5,393.92 | 5,249.26 | 0.00 | MilkyWay@Home v1.01
532809360 | 455984 | 27 Jul 2013, 11:02:02 UTC | 27 Jul 2013, 11:16:11 UTC | Validate error | 242.37 | 197.93 | --- | MilkyWay@Home Anonymous platform (CPU)

I think "Aborted by user" and "Not started by deadline - canceled" do not indicate an issue with a WU and should not be counted as errors.
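For illustration only (this is not the project's server code), the distinction being suggested could look roughly like the sketch below; the outcome strings and the record layout are assumptions taken from the status text quoted above.

```python
# Hypothetical sketch: tally only outcomes that point to a real problem with
# the workunit itself. Outcome strings and the record layout are assumptions
# based on the task list quoted above, not the actual BOINC server schema.

NON_ERROR_OUTCOMES = {
    "Aborted by user",                     # client-side decision, not a bad WU
    "Not started by deadline - canceled",  # scheduling issue, not a bad WU
}

ERROR_OUTCOMES = {"Validate error", "Compute error"}

def count_wu_errors(results):
    """Count results that indicate a genuine failure of the workunit."""
    errors = 0
    for r in results:
        outcome = r["outcome"]
        if outcome in NON_ERROR_OUTCOMES:
            continue  # should not push the WU toward its error limit
        if outcome in ERROR_OUTCOMES:
            errors += 1
    return errors

# Abridged example built from the task list quoted above:
results = [
    {"outcome": "Aborted by user"},
    {"outcome": "Completed, can't validate"},
    {"outcome": "Not started by deadline - canceled"},
    {"outcome": "Completed, can't validate"},
    {"outcome": "Validate error"},
]
print(count_wu_errors(results))  # -> 1
```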
Joined: 26 Apr 08 Posts: 87 Credit: 64,801,496 RAC: 0
I agree. I just encountered one of those myself:

WU de_separation_85_DR_8_rev_3_2_1372784654_7235365

Plus SETI Classic = 21,082 WUs
Joined: 7 Mar 09 Posts: 1 Credit: 20,467,651 RAC: 0
Hi all,

The similar tasks [as mentioned in the previous post] sent today have 'computation error' after exactly 1.08 minutes of starting. So far about 30 tasks. None have worked. All other tasks beforehand were OK. What to do?

Thanks, Steve
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
Reboot. The previous problems were validation issues due to user aborts. When a machine starts to have a run of errors like that, the first thing to do is reboot it. If the errors continue after the reboot, which tasks are you getting the error for, and what is the stderr.txt output for those errors?
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
On the Vista machine it appears the first bad workunit gets this:

Incorrect function. (0x1) - exit code 1 (0x1)
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file
Using SSE3 path
Failed to commit move of 'separation_checkpoint_tmp' to 'separation_checkpoint' (6704): It is too late to perform the requested operation, since the Transaction has already been aborted.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to update checkpoint file ('separation_checkpoint_tmp' to 'separation_checkpoint') (2): No such file or directory
Write checkpoint failed
04:55:30 (6020): called boinc_finish

Let me know if the reboot helps. That particular workunit has made it through two other machines, and your version of the app hasn't changed, so it is local to the machine.
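For context on what that log describes: the client writes the new checkpoint to 'separation_checkpoint_tmp' and then moves it over 'separation_checkpoint', and the Transaction errors above suggest that the Windows transactional move is what fails. A minimal sketch of the general write-then-rename pattern, in Python rather than the client's actual C code and without the transactional path, might look like this:

```python
import os
import tempfile

def write_checkpoint(path, data):
    """Write checkpoint data to a temp file, then replace the old checkpoint.

    Illustrative sketch only: the real separation client is written in C, and
    on Windows it appears to use a transactional file move, which is what the
    (6704)/(6801) errors in the log above report as failing.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(prefix="separation_checkpoint_tmp", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # make sure the bytes reach the disk
        os.replace(tmp_path, path)   # atomic rename on the same filesystem
    except OSError:
        # Roughly the "Write checkpoint failed" case in the log above
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Example:
# write_checkpoint("separation_checkpoint", b"...serialized integral state...")
```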
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
I have added the run de_separation_80_DR_8_rev_3_2.

Jeff
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
I'm getting quite a lot of validate errors with the de_separation_DR8_82_Rev_4_2_2 runs. Not critical, just about 5 tasks a day; most of this run still validates fine for me. It seems to be an issue with the 0.82 CAL application, which is the only one I can use with my ATI HD3850. One of my wingmen has the same issue on his HD5800, though I'm not sure why he is getting the CAL app assigned by the server. But it seems to be the application and not the GPU. I don't know whether you can do something about it or not; I'm aware that the CAL app is actually not supported anymore, but wanted to inform you anyway. Previous DR8_82 revisions run fine.
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
Looking into it. I will post follow-ups here.

Jeff Thompson
Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0
Workunit 471493607

A Cayman (cat 12.6+) and a Tahiti (cat 12.4) GPU, both running the old 0.82 app, are validated against each other, and a Cypress (cat 12.3) result with app 1.02 is marked invalid. That Cypress usually runs with ZERO invalids, which makes me wonder if the proper result was marked as bad.

<search_application> milkyway_separation 1.02 Windows x86_64 double OpenCL </search_application>
<background_integral> 0.000250000203482 </background_integral>
<stream_integral> 126.473032247388670 424.189607956046640 0.000000000000000 </stream_integral>
<background_likelihood> -3.715093764065778 </background_likelihood>
<stream_only_likelihood> -14.211423036510251 -3.997919366742975 -235.464758791616960 </stream_only_likelihood>
<search_likelihood> -3.079346382417015 </search_likelihood>

<background_integral> 0.000250000203482 </background_integral>
<stream_integral> 126.473032247388687 424.189607956046700 0.000000000000000 </stream_integral>
<background_likelihood> -3.715093764065778 </background_likelihood>
<stream_only_likelihood> -14.211423036510251 -3.997919366742975 -4.106932733094966 </stream_only_likelihood>
<search_likelihood> -2.988016909917461 </search_likelihood>
<search_application> milkywayathome_client separation 0.82 Linux x86_64 double CAL++ </search_application>

<background_integral> 0.000250000203482 </background_integral>
<stream_integral> 126.473032247388690 424.189607956046640 0.000000000000000 </stream_integral>
<background_likelihood> -3.715093764065779 </background_likelihood>
<stream_only_likelihood> -14.211423036510251 -3.997919366742975 -4.106932733094965 </stream_only_likelihood>
<search_likelihood> -2.988016909917461 </search_likelihood>
<search_application> milkywayathome_client separation 0.82 Windows x86_64 double CAL++ </search_application>

The old CAL app should only get accepted for GPUs (e.g. the HD3800) that can't run the newer OpenCL app.
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
A Cayman (cat 12.6+) and a Tahiti (cat 12.4) GPU, both running the old 0.82 app, are validated against each other, and a Cypress (cat 12.3) result with app 1.02 is marked invalid. That Cypress usually runs with ZERO invalids, which makes me wonder if the proper result was marked as bad.

Well, they first have to figure out which app is getting the result wrong. The fact that the CAL result is usually marked invalid does not mean that the result was actually wrong; it simply means that there are more OpenCL results coming in (just like in your case there were more CAL results for that WU). Looking at the example you posted, the difference is only in the last number of stream_only_likelihood and in search_likelihood, and that's exactly what I also see on my invalid results. If I compare my valid results of this run with those that are marked invalid, I don't see any big difference in the numbers; in particular I don't see that huge negative number in stream_only_likelihood which the OpenCL app gets. The weird thing with that number is that it's always (at least in all WUs I checked) very close to -235 in the OpenCL result. That's a bit strange: the numbers in my valid WUs are not even nearly that close to each other, they are simply somewhere in the range 0 to -100, most of them however in the range 0 to -20. So the question remains: which application is getting the result right? And that we can't answer from our end.
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
The weird thing with that number is that it's always (at least in all WUs I checked) very close to -235 in the OpenCL result. That's a bit strange: the numbers in my valid WUs are not even nearly that close to each other, they are simply somewhere in the range 0 to -100, most of them however in the range 0 to -20.

To make it more precise: while the numbers in the OpenCL results of those "problematic" WUs are all in the range 234.3-235.5, the CAL app gets 2.8-4.1 for those, all negative of course. The applications agree as to where in those ranges the result falls, i.e. if the OpenCL app gets a higher value, then the CAL app does too, like in your example -235.5 and -4.1. -234.8 and -2.9 would be another combination which I see on many WUs. So something in those WUs makes one of the applications get the number way too high or way too low, depending on which one is right.
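To make the disagreement concrete, here is a hedged sketch of a tolerance check over the likelihood fields from the workunit quoted earlier; the field names, the 0.1% tolerance, and the comparison logic are illustrative assumptions, not the project's actual validator settings.

```python
# Illustrative sketch: compare two results' likelihood fields with a relative
# tolerance. Field names and the tolerance are assumptions, not the actual
# MilkyWay@Home validator configuration.
import math

def results_agree(a, b, rel_tol=1e-3):
    """Return True only if every compared field matches within the tolerance."""
    if not math.isclose(a["background_likelihood"], b["background_likelihood"], rel_tol=rel_tol):
        return False
    if not math.isclose(a["search_likelihood"], b["search_likelihood"], rel_tol=rel_tol):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol)
        for x, y in zip(a["stream_only_likelihood"], b["stream_only_likelihood"])
    )

# Values taken from the workunit 471493607 output quoted above:
opencl_1_02 = {
    "background_likelihood": -3.715093764065778,
    "stream_only_likelihood": [-14.211423036510251, -3.997919366742975, -235.464758791616960],
    "search_likelihood": -3.079346382417015,
}
cal_0_82 = {
    "background_likelihood": -3.715093764065779,
    "stream_only_likelihood": [-14.211423036510251, -3.997919366742975, -4.106932733094965],
    "search_likelihood": -2.988016909917461,
}
print(results_agree(opencl_1_02, cal_0_82))  # -> False: the third stream term differs by a factor of ~57
```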
Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0
Good find on that pattern. Usually I would expect a far newer scientific app to be more refined, but you are right that it has to be checked whether a new bug has found its way into the newer app. Before, I thought it was only a difference in the last few digits. With the pattern you found now, this gives a completely different picture. If the old CAL app gives the wrong numbers, the validator needs to be fixed to catch those, and results from the old app coming in from newer (OpenCL-enabled) GPUs need to be denied to reduce wrong validations. If the newer app is the problem, we need a quick fix for this one. Luckily the problematic results seem to be only a small fraction of all the results coming in.
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
I hope to see some fix, or at least some sort of information, soon. I'm getting more and more invalids every day; yesterday I got 19 of those, worth over 2,000 credits, so now 10% of my daily production is going to the bin. :(

PS: no, this is not about the credits, I'd simply prefer to do something more useful for the project.
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
These applications haven't changed, so the problem is in the data, and specifically in how the systems are processing an unconstrained stream. The error rate overall for the application is not peaking or rising past a normal run, so I am trying to find out why it is spiking for this one set of parameters and for a particular user base. Right now I am not seeing a pattern to home in on which part of the system is causing the bug. The background and the third stream set should change with each other: if one is off, the other should be off as well. I am also going to grab some invalids from different runs to see if the problem shows up in other data runs. So I am looking into this and will ping back with a bit more on what we see, but I have nothing conclusive to report right now. I brought down one run on the stripe in question and am waiting a bit to bring down the other shortly. When they clear through I want to see if the problem is on other runs also.

Jeff
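As a purely illustrative sketch of the kind of cross-run check described above (the record layout and sample values are assumptions for the example, not server data), one could tally invalid results by run and application to see whether the spike is confined to this stripe:

```python
# Illustrative sketch: group invalid results by run name and application to
# see whether the validation spike is confined to one run/app combination.
# The record layout and sample values are assumptions for the example.
from collections import Counter

def invalids_by_run_and_app(results):
    counts = Counter()
    for r in results:
        if r["validate_state"] == "invalid":
            counts[(r["run"], r["app"])] += 1
    return counts

sample = [
    {"run": "de_separation_DR8_82_Rev_4_2_2", "app": "1.02 OpenCL", "validate_state": "invalid"},
    {"run": "de_separation_DR8_82_Rev_4_2_2", "app": "0.82 CAL++",  "validate_state": "valid"},
    {"run": "de_separation_81_DR_8_rev_3_1",  "app": "1.02 OpenCL", "validate_state": "valid"},
]
print(invalids_by_run_and_app(sample))
# -> Counter({('de_separation_DR8_82_Rev_4_2_2', '1.02 OpenCL'): 1})
```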
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
Thanks for the update.
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
Thanks for the info. There are two things I have to look at here: I am focusing right now on the different values being returned, and there is also the problem of the wrong app being assigned.

Jeff