New Separation Runs
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
I have added the following runs to the currently running separation runs:

de_separation_83_DR_8_rev_3_2
de_separation_85_DR_8_rev_3_2

If you notice any issues with these runs, please post to this thread.

Jeff
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
I added the following run to the set:

ps_separation_82_DR_8_rev_3_1

Jeff
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
I have added de_separation_82_DR_8_rev_3_1 to these runs.

Jeff
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
I have added de_separation_81_DR_8_rev_3_1 to this set.
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
Not an issue with any of those runs, but since I see it quite often: you might want to reconsider what should be counted as a WU error and what should not. For example, WU 404295517 ended up with (task | computer | sent | reported | status | run time | CPU time | credit | application):

532334131 | 484725 | 26 Jul 2013, 17:48:27 UTC | 26 Jul 2013, 17:49:36 UTC | Aborted by user | 0.00 | 0.00 | --- | MilkyWay@Home v1.02 (opencl_amd_ati)
532335417 | 293662 | 26 Jul 2013, 17:51:02 UTC | 26 Jul 2013, 22:01:22 UTC | Completed, can't validate | 873.83 | 6.18 | 0.00 | MilkyWay@Home Anonymous platform (ATI GPU)
532461638 | 323514 | 26 Jul 2013, 22:02:59 UTC | 26 Jul 2013, 22:04:05 UTC | Not started by deadline - canceled | 0.00 | 0.00 | --- | MilkyWay@Home v1.00
532462749 | 487766 | 26 Jul 2013, 22:05:43 UTC | 27 Jul 2013, 11:00:43 UTC | Completed, can't validate | 5,393.92 | 5,249.26 | 0.00 | MilkyWay@Home v1.01
532809360 | 455984 | 27 Jul 2013, 11:02:02 UTC | 27 Jul 2013, 11:16:11 UTC | Validate error | 242.37 | 197.93 | --- | MilkyWay@Home Anonymous platform (CPU)

I think "Aborted by user" and "Not started by deadline - canceled" do not indicate an issue with a WU and should not be counted as errors.
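For illustration only (this is not the project's server code), the distinction being suggested could look roughly like the sketch below; the outcome strings and the record layout are assumptions taken from the status text quoted above.

```python
# Hypothetical sketch: tally only outcomes that point to a real problem with
# the workunit itself. Outcome strings and the record layout are assumptions
# based on the task list quoted above, not the actual BOINC server schema.

NON_ERROR_OUTCOMES = {
    "Aborted by user",                     # client-side decision, not a bad WU
    "Not started by deadline - canceled",  # scheduling issue, not a bad WU
}

ERROR_OUTCOMES = {"Validate error", "Compute error"}

def count_wu_errors(results):
    """Count results that indicate a genuine failure of the workunit."""
    errors = 0
    for r in results:
        outcome = r["outcome"]
        if outcome in NON_ERROR_OUTCOMES:
            continue  # should not push the WU toward its error limit
        if outcome in ERROR_OUTCOMES:
            errors += 1
    return errors

# Abridged example built from the task list quoted above:
results = [
    {"outcome": "Aborted by user"},
    {"outcome": "Completed, can't validate"},
    {"outcome": "Not started by deadline - canceled"},
    {"outcome": "Completed, can't validate"},
    {"outcome": "Validate error"},
]
print(count_wu_errors(results))  # -> 1
```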
Joined: 26 Apr 08 Posts: 87 Credit: 64,801,496 RAC: 0
I agree. I just encountered one of those myself:

WU de_separation_85_DR_8_rev_3_2_1372784654_7235365

Plus SETI Classic = 21,082 WUs
Joined: 7 Mar 09 Posts: 1 Credit: 20,467,651 RAC: 0
Hi all,

The similar tasks [as mentioned in the previous post] sent today have 'computation error' after exactly 1.08 minutes of starting. So far about 30 tasks. None have worked. All other tasks beforehand were OK. What to do?

Thanks, Steve
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
Reboot. The previous problems were validation issues due to user aborts. When a machine starts to have a run of errors like that, the first thing to do is reboot it. If the errors continue after the reboot, which tasks are you getting the error for, and what is the stderr.txt output for those errors?
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
On the Vista machine it appears the first bad workunit gets this:

Incorrect function. (0x1) - exit code 1 (0x1)
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file
Using SSE3 path
Failed to commit move of 'separation_checkpoint_tmp' to 'separation_checkpoint' (6704): It is too late to perform the requested operation, since the Transaction has already been aborted.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.
Failed to update checkpoint file ('separation_checkpoint_tmp' to 'separation_checkpoint') (2): No such file or directory
Write checkpoint failed
04:55:30 (6020): called boinc_finish

Let me know if the reboot helps. That particular workunit has made it through two other machines, and your version of the app hasn't changed, so it is local to the machine.
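For context on what that log describes: the client writes the new checkpoint to 'separation_checkpoint_tmp' and then moves it over 'separation_checkpoint', and the Transaction errors above suggest that the Windows transactional move is what fails. A minimal sketch of the general write-then-rename pattern, in Python rather than the client's actual C code and without the transactional path, might look like this:

```python
import os
import tempfile

def write_checkpoint(path, data):
    """Write checkpoint data to a temp file, then replace the old checkpoint.

    Illustrative sketch only: the real separation client is written in C, and
    on Windows it appears to use a transactional file move, which is what the
    (6704)/(6801) errors in the log above report as failing.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(prefix="separation_checkpoint_tmp", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # make sure the bytes reach the disk
        os.replace(tmp_path, path)   # atomic rename on the same filesystem
    except OSError:
        # Roughly the "Write checkpoint failed" case in the log above
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Example:
# write_checkpoint("separation_checkpoint", b"...serialized integral state...")
```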
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
I have added the run de_separation_80_DR_8_rev_3_2.

Jeff
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
I'm getting quite a lot of validate errors with the de_separation_DR8_82_Rev_4_2_2 runs. Not critical, just about 5 tasks a day; most of this run still validates fine for me. It seems to be an issue with the 0.82 CAL application, which is the only one I can use with my ATI HD3850. One of my wingmen has the same issue on his HD5800, though I'm not sure why he is getting the CAL app assigned by the server. But it seems to be the application and not the GPU. I don't know whether you can do something about it or not; I'm aware that the CAL app is actually not supported anymore, but wanted to inform you anyway. Previous DR8_82 revisions run fine.
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
Looking into it. I will post follow-ups here.

Jeff Thompson
Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0
Workunit 471493607

A Cayman (cat 12.6+) and a Tahiti (cat 12.4) GPU, both running the old 0.82 app, are validated against each other, and a Cypress (cat 12.3) result with app 1.02 is marked invalid. That Cypress usually runs with ZERO invalids, which makes me wonder if the proper result was marked as bad.

<search_application> milkyway_separation 1.02 Windows x86_64 double OpenCL </search_application>
<background_integral> 0.000250000203482 </background_integral>
<stream_integral> 126.473032247388670 424.189607956046640 0.000000000000000 </stream_integral>
<background_likelihood> -3.715093764065778 </background_likelihood>
<stream_only_likelihood> -14.211423036510251 -3.997919366742975 -235.464758791616960 </stream_only_likelihood>
<search_likelihood> -3.079346382417015 </search_likelihood>

<background_integral> 0.000250000203482 </background_integral>
<stream_integral> 126.473032247388687 424.189607956046700 0.000000000000000 </stream_integral>
<background_likelihood> -3.715093764065778 </background_likelihood>
<stream_only_likelihood> -14.211423036510251 -3.997919366742975 -4.106932733094966 </stream_only_likelihood>
<search_likelihood> -2.988016909917461 </search_likelihood>
<search_application> milkywayathome_client separation 0.82 Linux x86_64 double CAL++ </search_application>

<background_integral> 0.000250000203482 </background_integral>
<stream_integral> 126.473032247388690 424.189607956046640 0.000000000000000 </stream_integral>
<background_likelihood> -3.715093764065779 </background_likelihood>
<stream_only_likelihood> -14.211423036510251 -3.997919366742975 -4.106932733094965 </stream_only_likelihood>
<search_likelihood> -2.988016909917461 </search_likelihood>
<search_application> milkywayathome_client separation 0.82 Windows x86_64 double CAL++ </search_application>

The old CAL app should only get accepted for GPUs (e.g. the HD3800) that can't run the newer OpenCL app.
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
A Cayman (cat 12.6+) and a Tahiti (cat 12.4) GPU, both running the old 0.82 app, are validated against each other, and a Cypress (cat 12.3) result with app 1.02 is marked invalid. That Cypress usually runs with ZERO invalids, which makes me wonder if the proper result was marked as bad.

Well, they first have to figure out which app is getting the result wrong. The fact that the CAL result is usually marked invalid does not mean that the result was actually wrong; it simply means that there are more OpenCL results coming in (just like in your case there were more CAL results for that WU). Looking at the example you posted, the difference is only in the last number of stream_only_likelihood and in search_likelihood, and that's exactly what I also see on my invalid results. If I compare my valid results of this run with those that are marked invalid, I don't see any big difference in the numbers; in particular I don't see that huge negative number in stream_only_likelihood which the OpenCL app gets. The weird thing with that number is that it's always (at least in all WUs I checked) very close to -235 in the OpenCL result. That's a bit strange: the numbers in my valid WUs are not even nearly that close to each other, they are simply somewhere in the range 0 to -100, most of them however in the range 0 to -20. So the question remains: which application is getting the result right? And that we can't answer from our end.
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
The weird thing with that number is that it's always (at least in all WUs I checked) very close to -235 in the OpenCL result. That's a bit strange: the numbers in my valid WUs are not even nearly that close to each other, they are simply somewhere in the range 0 to -100, most of them however in the range 0 to -20.

To make it more precise: while the numbers in the OpenCL results of those "problematic" WUs are all in the range 234.3-235.5, the CAL app gets 2.8-4.1 for those, all negative of course. The applications agree as to where in those ranges the result falls, i.e. if the OpenCL app gets a higher value, then the CAL app does too, like in your example -235.5 and -4.1. -234.8 and -2.9 would be another combination which I see on many WUs. So something in those WUs makes one of the applications get the number way too high or way too low, depending on which one is right.
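To make the disagreement concrete, here is a hedged sketch of a tolerance check over the likelihood fields from the workunit quoted earlier; the field names, the 0.1% tolerance, and the comparison logic are illustrative assumptions, not the project's actual validator settings.

```python
# Illustrative sketch: compare two results' likelihood fields with a relative
# tolerance. Field names and the tolerance are assumptions, not the actual
# MilkyWay@Home validator configuration.
import math

def results_agree(a, b, rel_tol=1e-3):
    """Return True only if every compared field matches within the tolerance."""
    if not math.isclose(a["background_likelihood"], b["background_likelihood"], rel_tol=rel_tol):
        return False
    if not math.isclose(a["search_likelihood"], b["search_likelihood"], rel_tol=rel_tol):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol)
        for x, y in zip(a["stream_only_likelihood"], b["stream_only_likelihood"])
    )

# Values taken from the workunit 471493607 output quoted above:
opencl_1_02 = {
    "background_likelihood": -3.715093764065778,
    "stream_only_likelihood": [-14.211423036510251, -3.997919366742975, -235.464758791616960],
    "search_likelihood": -3.079346382417015,
}
cal_0_82 = {
    "background_likelihood": -3.715093764065779,
    "stream_only_likelihood": [-14.211423036510251, -3.997919366742975, -4.106932733094965],
    "search_likelihood": -2.988016909917461,
}
print(results_agree(opencl_1_02, cal_0_82))  # -> False: the third stream term differs by a factor of ~57
```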
Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0
Good find on that pattern. Usually I would expect a far newer scientific app to be more refined, but you are right that it has to be checked whether a new bug has found its way into the newer app. Before, I thought it was only a difference in the last few digits. With the pattern you found now, this gives a completely different picture. If the old CAL app gives the wrong numbers, the validator needs to be fixed to catch those, and results from the old app coming in from newer (OpenCL-enabled) GPUs need to be denied to reduce wrong validations. If the newer app is the problem, we need a quick fix for this one. Luckily the problematic results seem to be only a small fraction of all the results coming in.
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
I hope to see some fix, or at least some sort of information, soon. I'm getting more and more invalids every day; yesterday I got 19 of those, worth over 2,000 credits, so now 10% of my daily production is going to the bin. :(

PS: no, this is not about the credits, I'd simply prefer to do something more useful for the project.
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
These applications haven't changed, so the problem is in the data, and specifically in how the systems are processing an unconstrained stream. The error rate overall for the application is not peaking or rising past a normal run, so I am trying to find out why it is spiking for this one set of parameters and for a particular user base. Right now I am not seeing a pattern to home in on which part of the system is causing the bug. The background and the third stream set should change with each other: if one is off, the other should be off as well. I am also going to grab some invalids from different runs to see if the problem shows up in other data runs. So I am looking into this and will ping back with a bit more on what we see, but I have nothing conclusive to report right now. I brought down one run on the stripe in question and am waiting a bit to bring down the other shortly. When they clear through I want to see if the problem is on other runs also.

Jeff
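As a purely illustrative sketch of the kind of cross-run check described above (the record layout and sample values are assumptions for the example, not server data), one could tally invalid results by run and application to see whether the spike is confined to this stripe:

```python
# Illustrative sketch: group invalid results by run name and application to
# see whether the validation spike is confined to one run/app combination.
# The record layout and sample values are assumptions for the example.
from collections import Counter

def invalids_by_run_and_app(results):
    counts = Counter()
    for r in results:
        if r["validate_state"] == "invalid":
            counts[(r["run"], r["app"])] += 1
    return counts

sample = [
    {"run": "de_separation_DR8_82_Rev_4_2_2", "app": "1.02 OpenCL", "validate_state": "invalid"},
    {"run": "de_separation_DR8_82_Rev_4_2_2", "app": "0.82 CAL++",  "validate_state": "valid"},
    {"run": "de_separation_81_DR_8_rev_3_1",  "app": "1.02 OpenCL", "validate_state": "valid"},
]
print(invalids_by_run_and_app(sample))
# -> Counter({('de_separation_DR8_82_Rev_4_2_2', '1.02 OpenCL'): 1})
```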
Joined: 19 Jul 10 Posts: 627 Credit: 19,362,069 RAC: 3,556
Thanks for the update.
Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0
Thanks for the info. There are two things I have to look at here: I am focusing right now on the different values being returned, and there is also the problem of the wrong app being assigned.

Jeff