Nbody 1.04

Author	Message
Richard Haselgrove Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0	Message 56871 - Posted: 13 Jan 2013, 19:29:02 UTC - in response to Message 56869.
"I was figuring the discrepancy in exit codes was due to mine having a tighter limit on disk usage than the wingman did, or something along those lines."
No, it was the different version of the BOINC core client in use. Your v6.12.34 sends back exit codes which are compatible with the web display code running here. The codes were revised for his v7.0.28 client, and actually give more useful (more specific) data - such as breaking down your RSC_LIMIT into specifically a DISK_LIMIT (or other limit as appropriate). But although David rationalised the client codes, he didn't re-synchronise the client and web codes until Crystal Pellet and I worked out what was going wrong, in that BOINC thread I linked.

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 56872 - Posted: 13 Jan 2013, 19:39:25 UTC - in response to Message 56871.
Ah yes, I see now! Drat, the sad verse is now you've convinced me I'm going to have to spend some time reviewing the stuff on the main BOINC site and mail lists.... Wait a second.... Maybe I can schedule a root canal, instead! :-D

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 56873 - Posted: 13 Jan 2013, 21:12:31 UTC - in response to Message 56870. Last modified: 13 Jan 2013, 21:15:45 UTC
"<snip>But if the initial estimates are as bad as we've seen recently, Milkyway tasks will be scheduled to run immediately in 'High Priority' as soon as the download completes. That will undoubtedly be interpreted by (multi-project) volunteers as a project decision, and I fear that the 'you're taking over my computer' criticism will be directed at the project admins, rather than the 'boinc central' programmers who wrote the main server code."
LOL... Worse, it's a No-Win-Scenario as well. If they go with the Run Shared paradigm like my unregulated XPP rig does.... Same outcome from the peanut gallery. That's regardless of the fact that that way would take advantage of MT in such a way that MT isn't just a useless 'Gee Whiz' feature but just goes quietly along doing its main thing (which is to speed up nBody), while at the same time not making a big deal about it to the uninformed and/or OCD users (and the major fracas that goes along with that), and still manages to have about the same net effect on overall host performance when you take time metrics longer than the last few hours as the basis! ;-)
Pardon me.... What's that called?? OH, Yeah.... The right choice! :-D

Richard Haselgrove Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0	Message 56889 - Posted: 15 Jan 2013, 12:33:00 UTC
R.I.P.
We regret to have to inform readers of the sad demise of ps_nbody_orphan_real_CHISQ_1356215205_306270_0, who passed away at the ripe old age of nearly 26 hours (elapsed - 65 hours CPU), from terminal "Maximum disk usage exceeded" syndrome.
Her sister de_nbody_orphan_real_CHISQ_1356215205_306777_0 is under close observation with the doctor in attendance, but we fear that she is also suffering from an abnormal growth on the 'nbody_checkpoint' file. The patient's condition seems to have stabilised at 31,670 KB for the time being, but we fear that this is a late-onset mutation in the workunit DNA, and owing to the patient's poor reaction to anaesthetic, we may not be able to operate in time to save her.
If the condition does turn out to be inherited, we intend to try an experimental course of steroids injected directly into the <rsc_disk_bound>52428800.000000</rsc_disk_bound> region, in an attempt to save the remaining siblings.
Further medical bulletins will be issued as and when any of the patients respond to treatment.
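For readers who want the numbers behind the bulletin, here is a minimal Python sketch that compares the reported checkpoint size with the quoted disk bound. It assumes "KB" means 1024 bytes and that the bound is a byte count; it only looks at the single checkpoint file, whereas the actual limit presumably applies to everything the task writes in its slot directory.

```python
import xml.etree.ElementTree as ET

# The element exactly as quoted in the post above.
fragment = "<rsc_disk_bound>52428800.000000</rsc_disk_bound>"

# Assumption: the value is a size in bytes.
limit_bytes = float(ET.fromstring(fragment).text)

# Assumption: the reported 31,670 KB uses 1 KB = 1024 bytes.
checkpoint_bytes = 31_670 * 1024

headroom_bytes = limit_bytes - checkpoint_bytes
fraction_used = checkpoint_bytes / limit_bytes

print(f"limit:      {limit_bytes / 2**20:.1f} MiB")
print(f"checkpoint: {checkpoint_bytes / 2**20:.1f} MiB")
print(f"headroom:   {headroom_bytes / 2**20:.1f} MiB ({fraction_used:.0%} used)")
```

On those assumptions the checkpoint alone takes roughly three fifths of the allowance, so a further growth spurt of similar size would be enough to trip the limit.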

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 56891 - Posted: 15 Jan 2013, 12:39:02 UTC - in response to Message 56889.
Sorry for your loss.... Unfortunately, I have to run right now, but good luck with the new treatment plan.

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 56895 - Posted: 15 Jan 2013, 17:18:12 UTC - in response to Message 56889. Last modified: 15 Jan 2013, 17:24:03 UTC
Hmmmm... I got back from my errands a bit ago, and due to the recent nBody Code Blue event you had, I just made the rounds of the ones running on my XPP-64 host.
Currently, I have two running concurrently. One is a sub type 112_2013, and the other is a CHISQ. The status report on the CheckPoint (CP) file for the 112_ is in the 20+ MB range, but I'm already pretty sure it's fatally wounded anyway due to an unavoidable restart of the host itself (374 syndrome). The other one is a CHISQ, but at this point appears to be clean of 374, and its CP file was just under 10 MB when I looked.
Since the 112_ is most likely dead already, I figured there wasn't much to lose by doing some exploratory surgery on it. So I took a look at just what's in the CP file.
It seems at first glance to be a disk copy of the process memory at the time the checkpoint is done. I don't know at this point if this is relevant or not, but if it is, it might just dovetail with my earlier comment about 'scribbling' to the disk at certain times. Right now I'm baking my noodle a bit to figure out a relatively simple (and safe for the patient) way of proving or disproving this. In any event, I think your proposed treatment will help in this regard as well.

Richard Haselgrove Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0	Message 56896 - Posted: 15 Jan 2013, 18:17:12 UTC - in response to Message 56895.
This morning's patient is still alive and well at 37.5% progress: there has been no change in the size of the checkpoint file, and nothing else has been scribbled into the slot directory to increase disk usage.
[I can't think that scribblings anywhere else on the disk - which a BOINC task wouldn't be permitted to do anyway, except in the project directory - would trigger a fatal case of -177/196 for one specific MW task]
Unfortunately, the doctor's surgery hours have ended for today, and I expect that any problems later in the run will happen either while I'm out to dinner, or later when I'm asleep in bed. So I've set a cron job to log the size of the checkpoint file every ten minutes, so we can see if it changes at any point during the run. If there is a crash, but the checkpoint file turns out not to be involved, I can log the entire contents of the slot directory during the next run. And if that doesn't catch anything, then we can decrease the logging interval - and so on.
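The logging approach described above could be sketched in Python roughly as follows. This is not the script actually used in the thread: the function names, the tab-separated log format, and the example paths are all hypothetical. Only the 'nbody_checkpoint' file name and the ten-minute interval come from the posts.

```python
import os
from datetime import datetime, timezone


def directory_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file may vanish between listing and stat; skip it.
                pass
    return total


def log_once(slot_dir, log_path, checkpoint_name="nbody_checkpoint"):
    """Append one line: UTC timestamp, checkpoint size, whole-slot size.

    A checkpoint size of -1 means the file did not exist at that moment.
    """
    checkpoint = os.path.join(slot_dir, checkpoint_name)
    cp_size = os.path.getsize(checkpoint) if os.path.isfile(checkpoint) else -1
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    line = f"{stamp}\t{cp_size}\t{directory_size(slot_dir)}\n"
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(line)
    return line


# Hypothetical usage: call log_once("/path/to/BOINC/slots/7", "/tmp/nbody_sizes.log")
# from a small script that the scheduler (cron or similar) runs every ten minutes.
```

Recording the whole-slot total alongside the checkpoint size covers the fallback mentioned in the post, where the checkpoint file turns out not to be the culprit, without needing a second run.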

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 56897 - Posted: 15 Jan 2013, 19:02:42 UTC - in response to Message 56896.
Agreed, Windows (or any modern OS for that matter) would definitely not permit any user app to just haphazardly write to disk anywhere it felt like, but would allow it to write junk into its own slot directory until a limit set by the app and/or specified by the user was reached, or ultimately a 'master' BOINC resource limit was exceeded. It was the first case I was thinking of here when referring to scribbling.
I would imagine the action of last resort in a 'perfect storm' type BOINC malfunction is that Windows would finally say, "Guess what BOINC, you're going to have to die! I can't figure out what you're doing or asking me to do for you, and worse, you have become toxic to everyone's health (including my own)! Don't go away mad... Just go away..." POOF!! :-)

Uncle Joined: 26 May 11 Posts: 4 Credit: 1,990,725 RAC: 0	Message 56909 - Posted: 16 Jan 2013, 6:41:29 UTC
Good day to all! My system is Linux openSUSE 12.2. I found this in the logs of my BOINC 7 client. Maybe it will help us all to improve the project:
*** glibc detected *** ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.04_x86_64-pc-linux-gnu__mt: double free or corruption (!prev): 0x000000000276e060 ***
======= Backtrace: =========
[0x4ee6a2]
[0x40be7c]
[0x404e2a]
[0x401f01]
[0x4d34e4]
[0x403011]
======= Memory map: ========
00400000-0065d000 r-xp 00000000 08:07 132321 /home/kilvador/Загрузки/Berkeley/BOINC/projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.04_x86_64-pc-linux-gnu__mt
0085c000-00860000 rw-p 0025c000 08:07 132321 /home/kilvador/Загрузки/Berkeley/BOINC/projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.04_x86_64-pc-linux-gnu__mt
00860000-00896000 rw-p 00000000 00:00 0
0276a000-03352000 rw-p 00000000 00:00 0 [heap]
7f5df321b000-7f5df36b3000 rw-s 00000000 08:07 132000 /home/kilvador/Загрузки/Berkeley/BOINC/slots/7/boinc_milkyway_nbody_7
7f5df36b3000-7f5df36b4000 rw-p 00000000 00:00 0
7f5df39c1000-7f5df39c2000 rw-p 00000000 00:00 0
7f5df39c2000-7f5df39c3000 ---p 00000000 00:00 0
7f5df39c3000-7f5df39c6000 rw-p 00000000 00:00 0 [stack:2405]
7f5df39c6000-7f5df39c8000 rw-s 00000000 08:07 131986 /home/kilvador/Загрузки/Berkeley/BOINC/slots/7/boinc_mmap_file
7fff6cad5000-7fff6caf6000 rw-p 00000000 00:00 0 [stack]
7fff6cb24000-7fff6cb25000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

Richard Haselgrove Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0	Message 56912 - Posted: 16 Jan 2013, 11:10:09 UTC
Well, how's about that, then? The patient survived the night, and has just reported in at 103,076 seconds elapsed time, 278,691 seconds CPU. The Doctor still has the emergency room quarantined, so it'll be a little while before we can inoculate the siblings - but there was no sign of checkpoint bloat when I checked first thing this morning.
The inner Gollum in me also notes that the task was awarded 35,081 credits - mine, all mine, my lovelies. This host is rated at 41 credits/core/hour on the original cobblestone benchmark*time definition, or 3,174 credits for the reported CPU time. The original definition certainly undervalues modern SIMD processors (which can do much more productive work per clock cycle than the benchmark suggests), but I'd put the ratio at nearer 2x than 10x.
The last time I ran an MT test task with bad runtime estimates (AQUA), we had a runaway credit inflation episode - unfortunately, the whole project closed down before we could get to the bottom of it. Anybody else see credit exploding here?
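The credit comparison in the post above can be reproduced with a few lines of Python. All figures are taken from the post; the variable names are just for illustration.

```python
# Figures quoted in the post.
credits_per_core_hour = 41      # host rating on the benchmark*time definition
cpu_seconds = 278_691           # reported CPU time
elapsed_seconds = 103_076       # reported elapsed (wall-clock) time
awarded_credits = 35_081        # credit actually granted

# Credit the benchmark*time definition would give for the CPU time used.
benchmark_credits = credits_per_core_hour * cpu_seconds / 3600

# How far the award exceeds that figure.
inflation_ratio = awarded_credits / benchmark_credits

# Average number of cores kept busy by the multithreaded run.
average_cores = cpu_seconds / elapsed_seconds

print(f"benchmark*time credit: {benchmark_credits:,.0f}")
print(f"awarded / benchmark:   {inflation_ratio:.1f}x")
print(f"average cores in use:  {average_cores:.2f}")
```

This gives about 3,174 credits on the benchmark definition, so the award is roughly 11 times that figure, well beyond the 2x the post considers plausible for SIMD-capable processors.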

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 56914 - Posted: 16 Jan 2013, 11:32:17 UTC - in response to Message 56912. Last modified: 16 Jan 2013, 11:49:12 UTC
LOL... Sweet payoff! Especially since you tweaked to run MT, you didn't have to burn 278K secs elapsed to crunch it! Gotta love parallel processing in that regard! ;-)
<edit> Well, the 112_2013 finished and was pretty tasty too, though not quite as good as yours was! The interesting thing was, as you can see, there is no doubt it restarted from checkpoint and did not 374. The sad verse is it leaves the root cause of the heap problem a real mystery at this point. :-(

Uncle Joined: 26 May 11 Posts: 4 Credit: 1,990,725 RAC: 0	Message 56915 - Posted: 16 Jan 2013, 11:58:53 UTC - in response to Message 56914. Last modified: 16 Jan 2013, 12:02:59 UTC
It's LoL )) And sad at the same time ((( What can be done to prevent the problem in future?

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 56916 - Posted: 16 Jan 2013, 12:05:54 UTC - in response to Message 56915. Last modified: 16 Jan 2013, 12:14:59 UTC
I'm not sure about that at the moment. I think the first step now is to accelerate my plans to build a new high power gaming/engineering workstation. In the meantime, I guess I need to load up some more powerful software tools on this rig so I can dig deeper (and hopefully safer) into what's going on with nBody.
<edit> In the meantime I stick by the earlier recommendation to leave the apps in memory when suspended. It doesn't hurt anything in most cases, and was definitely shown to help reduce the occurrence of 374 syndrome on Windows hosts.
As a side note, now that you've found News and Number Crunching, hopefully you can get some more specific help with your Linux difficulties. Unfortunately, *nix type OSes aren't my strong suit anymore. :-( Too much time spent in the MS playpen, I guess! ;-)

Richard Haselgrove Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0	Message 56917 - Posted: 16 Jan 2013, 12:24:43 UTC
We inoculated the rest of the 30677n batch against disk scribbling, but unfortunately they all turned out to be as short as estimated. Got to do some payback to other projects for a while, but we'll fetch some more later today and inoculate them on receipt.

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 56920 - Posted: 16 Jan 2013, 12:37:29 UTC - in response to Message 56917. Last modified: 16 Jan 2013, 12:46:47 UTC
I should have mentioned before, as part of this experiment I didn't do the inoculation on mine as a control. I'm planning to leave mine exposed for the same reason.
<Sigh> I think I need a new battery in my wireless keyboard. Now I just have to remember where I put them since the recent round of major housecleaning! ;-)
<edit> CHISQ completed with a similar positive outcome. No restarts on that one though.

Swedis Joined: 5 Nov 12 Posts: 3 Credit: 6,378,981 RAC: 0	Message 56925 - Posted: 17 Jan 2013, 0:26:55 UTC Last modified: 17 Jan 2013, 0:55:03 UTC
Okay, this is certainly a personal record! Should I even keep on going with these giants or skip until next update?
<Edit> Nvm, got a computation error half an hour later :)

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 56927 - Posted: 17 Jan 2013, 12:40:09 UTC - in response to Message 56917. Last modified: 17 Jan 2013, 12:53:45 UTC
I don't know if this will be any help or not, but I had a CHISQ fault out on MDUE, which was what I was expecting since all the wingmen had as well.
The really interesting thing I discovered (but should have realized earlier if I had thought more about it) is that I had the slot directory open on the desktop to make it easier to pop in and take a look from time to time, but then wandered off for a bit and didn't close Windows Explorer. When I came back the task had already faulted out and been reported, so I didn't see it happen. However, since the slot folder had been open on the desktop, Windows didn't allow BOINC to delete the files in it when it tried to clean up the mess. Therefore, I still have a copy of all the files for the task, which I would think should be current right up to the point just before it failed.
Thoughts?

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 56928 - Posted: 17 Jan 2013, 12:50:04 UTC - in response to Message 56925. Last modified: 17 Jan 2013, 12:51:04 UTC
That's hard to say for sure. You have to look at it on a case by case basis. However, if you have taken the precautions mentioned here, and it does complete, and you are considered 'reliable' by the validator already, or get the right wingmen.... The payoff can be pretty good on the long ones for being a guinea pig. ;-) Your call. :-D

Mr6686 Joined: 15 Oct 12 Posts: 3 Credit: 18,270,747 RAC: 0	Message 56930 - Posted: 17 Jan 2013, 15:48:50 UTC
New error for me (196 (0xc4) Unknown error number) with WU 296867011. It said "- Unhandled Exception Record - Reason: Breakpoint Encountered (0x80000003) at address 0x000007FEFDE43C72" then the BOINC Windows Runtime Debugger ran.

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 56931 - Posted: 17 Jan 2013, 16:13:16 UTC - in response to Message 56930. Last modified: 17 Jan 2013, 16:14:59 UTC
Thanks for the input. I took a look at both of the nBody failures your host has showing. The one you just posted about (we call it a '196') is a Maximum Disk Usage Exceeded (MDUE) one. Older BOINC CCs may refer to it as a '177'. IOW, a different error code, but basically the same thing.
The earlier one was what we call a 374, and refers to a memory heap corruption problem.
Both of these have been seen before, and the best working theory at this point is that both point to problems when the app tries to checkpoint and/or restart from the checkpoint file (as opposed to resuming from memory).
HTH