Nbody 1.04
Author | Message |
---|---|
Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
We inoculated the rest of the 30677n batch against disk scribbling, but unfortunately they all turned out to be as short as estimated. Got to do some payback to other projects for a while, but we'll fetch some more later today and inoculate them on receipt. Sorry, Lady Watson and I were busy helping Bernd catch a slippery (non-stick) file over at Einstein - missed this one. As you probably know by now, I don't think this will have worked. XP doesn't automatically refresh the view displayed in an Explorer window when the underlying disk status changes: I suspect you will just have had a stale view of something that is no longer there. IIRC, Vista and Win7 are better at auto-refresh - with XP, you have to keep pressing F5 (refresh) to see if the file size changes. OTOH, that gives you the chance to use the 'flicker' test - if something does change, it's more likely the corner of your eye will catch it. |
Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
The one you just posted about (we call it a '196') is a Maximum Disk Usage Exceeded (MDUE) one. Older BOINC CCs may refer to it as a '177'. IOW, a different error code, but basically the same thing. The older clients actually call it a '-177'. As well as extending the number range, David moved some of the cases from below the axis (negative 'error' - reserved mainly for crashes and program coding errors) to above the axis (positive 'exit status' - more to do with scheduling and operational issues). |
Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
Meanwhile, here's a genuine new exit status for us: 197 (Maximum elapsed time exceeded) for task 383317771. That would have been an error -177 under an old client too - resource limit exceeded - but the different status number makes it clear that we're talking about a different resource. Fortunately, the good Doctor managed to capture some clinical samples. The WU was issued with <rsc_fpops_est>232567000.000000</rsc_fpops_est>; the final report was <final_cpu_time>25.911770</final_cpu_time>; and I'm using (in app_info) <flops>91244303218.731995</flops>. Dividing out the numbers (the units are compatible), that gives an initial runtime estimate of 2.55 milliseconds, and a maximum allowed time of 25.5 seconds. Even with this project's incredibly generous limit of allowing the task to run for 10,000 times longer than its initial estimate (the BOINC default, used on most projects I've checked, is 10x), the bound was too low, and BOINC killed the task on schedule. The flops value I'm using was derived from the APR the server had calculated from the first 14 completed tasks under anonymous platform: that APR has now dropped to 29.123541063248 (* 10^9), so obviously I have some editing to do. If my value had matched the value the server used to calculate <rsc_fpops_bound>, I should have been allowed 79.85 seconds, and I might have got away with it (as my wingmate did). Unfortunately, the edit will have to wait. I'm now running de_nbody_100K_104_1_1356215205_100871_1 (anyone else seen a 100K?): the remaining time of 65h:30 is clearly an underestimate, with the task barely past 4.2% progress after 5 hours elapsed time. Tuesday morning, you reckon? |
Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0 |
"Sorry, Lady Watson and I were busy helping Bernd catch a slippery (non-stick) file over at Einstein - missed this one." Yep, I know about the problem with view refreshes on XP-32, but XPP-64 is actually based on early 'Longhorn', and thus is better in that regard. Either way, my main point is that, because Windows wouldn't allow BOINC to clean up the slot folder, I have actual copies of everything that was there just prior to the fault. I was wondering if you thought they might be useful from a forensics POV. The Project Team might be able to make use of them. Also, roger that on the new error discovery, and yes, the one currently running on the test rig is a 100K (8:55:00 elapsed and running). I noted, though, that mine isn't 'fresh': it's a resend of one that timed out at its deadline, originally issued on Jan 5th. |
Joined: 5 Nov 12 Posts: 3 Credit: 6,378,981 RAC: 0 |
I will stick with crunching everything that's in the queue :D Thanks for the reply! |
Joined: 10 Oct 12 Posts: 1 Credit: 2,939,743 RAC: 947 |
As a datapoint for you devs: on my laptop, an Intel i7 running 64-bit Win7, N-Body 1.04 always fails within a second of starting, it seems to me. http://imgur.com/jPQroFj |
Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
Here's another datapoint: task de_nbody_100K_104_1_1356215205_100871_1 finished after 532,816 seconds run time, 1,516,005 seconds CPU time. [That's under my 3-thread MT plan class, via anonymous platform] The task was interrupted by a totally unrelated BSOD on the machine, but recovered successfully from checkpoint and completed without suffering a 374 heap corruption error. That makes me wonder if different task types follow (and hence checkpoint) different processing paths, and the memory reload problem only applies to certain cases? If so, the fact that this one says "Number of particles in bins is very small compared to total. (1 << 100000). Skipping distance calculation" might be a clue. Too late to do any research on this run, but if 374 re-appears next time, we could try and look for processing information like this in wingmate reports for tasks which crash on exit (presumably before this final line is added to std_err). Talking of wingmates, I wonder how long it'll be before two more come along with 2½ weeks of fast CPU time to validate this task? |
Joined: 8 Aug 12 Posts: 9 Credit: 156,273 RAC: 0 |
Is it ktm32.dll or ktmw32.dll? The error message says the first but your message board link has the second. James Joined MilkyWay@Home in 2012 Online since ArpNET days First activity on Honeywell 1648 Series Mainframe in 1975 at age 12. |
Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
"Is it ktm32.dll or ktmw32.dll? The error message says the first but your message board link has the second." ktmw32.dll is a Microsoft Windows support file. ktm32.dll is a typing error. |
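
For anyone who wants to re-check the runtime arithmetic in the exit status 197 post above, here is a rough sketch in Python. It only reproduces the division described in that post: estimate = rsc_fpops_est / flops, and maximum allowed time = 10,000 x estimate (this project's multiplier as stated there). The function and constant names are made up for illustration and are not BOINC identifiers, and the sketch compares the bound against the reported CPU time the same way the post does.

```python
# Values copied from the post about task 383317771.
RSC_FPOPS_EST = 232_567_000.0            # <rsc_fpops_est> from the WU
FINAL_CPU_TIME = 25.911770               # <final_cpu_time>, in seconds
APP_INFO_FLOPS = 91_244_303_218.731995   # <flops> set in app_info
SERVER_APR_FLOPS = 29.123541063248e9     # server APR quoted in the post (* 10^9)
BOUND_MULTIPLIER = 10_000                # this project's limit, per the post


def estimated_runtime(fpops_est: float, flops: float) -> float:
    """Initial runtime estimate in seconds (hypothetical helper)."""
    return fpops_est / flops


def max_allowed_runtime(fpops_est: float, flops: float,
                        multiplier: float = BOUND_MULTIPLIER) -> float:
    """Time limit in seconds, assuming bound = multiplier * estimate."""
    return multiplier * estimated_runtime(fpops_est, flops)


if __name__ == "__main__":
    est = estimated_runtime(RSC_FPOPS_EST, APP_INFO_FLOPS)
    limit = max_allowed_runtime(RSC_FPOPS_EST, APP_INFO_FLOPS)
    limit_apr = max_allowed_runtime(RSC_FPOPS_EST, SERVER_APR_FLOPS)

    print(f"estimate with app_info flops: {est * 1000:.2f} ms")
    print(f"limit with app_info flops:    {limit:.2f} s")
    print(f"limit with server APR:        {limit_apr:.2f} s")
    print(f"reported time over limit?     {FINAL_CPU_TIME > limit}")
```

The point of the comparison is that an overstated flops value in app_info shrinks the limit: the same task gets roughly three times as long once the lower server APR is used instead.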
©2024 Astroinformatics Group