nbody fpops calculation updated

Author	Message
Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 59180 - Posted: 1 Jul 2013, 2:44:34 UTC It looks like our fpops calculation for the nbody simulation was way too high and was causing problems. I've updated the work generator, so newly generated workunits should have an estimated fpops approximately 100x lower. Let us know how they're working. --Travis

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 59184 - Posted: 1 Jul 2013, 13:58:37 UTC Were you talking about problems like this?

Jeffery M. Thompson Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0	Message 59186 - Posted: 1 Jul 2013, 15:52:26 UTC These work units appear to predate Travis's changes. The errors appear to be the stack overflow you reported earlier; we are currently working on getting a patch out for it. The change in fpops_est should change the estimated time we send out on new work units. This applies only to work units generated since Travis made the change, so those work units will be dated July 1st UTC. Jeff
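
As a rough illustration of why lowering fpops_est shortens the estimated time on new work units: to first order, a runtime estimate is the estimated operation count divided by the host's effective speed. The sketch below assumes only that simple ratio. The function name and every number in it are hypothetical, and a real client applies further corrections on top of this.

```python
def estimated_runtime_seconds(fpops_est: float, host_flops: float) -> float:
    """First-order runtime estimate: operation count / sustained FLOPS.

    Both arguments are illustrative; this is not the project's actual code.
    """
    if host_flops <= 0:
        raise ValueError("host_flops must be positive")
    return fpops_est / host_flops


# Hypothetical values, chosen only to show the effect of a 100x reduction.
OLD_FPOPS_EST = 1.0e15                  # assumed estimate before the change
NEW_FPOPS_EST = OLD_FPOPS_EST / 100.0   # roughly 100x lower, as announced
HOST_FLOPS = 3.0e9                      # assumed sustained speed of one host

if __name__ == "__main__":
    old = estimated_runtime_seconds(OLD_FPOPS_EST, HOST_FLOPS)
    new = estimated_runtime_seconds(NEW_FPOPS_EST, HOST_FLOPS)
    print(f"old estimate: {old / 3600:.1f} h, new estimate: {new / 60:.1f} min")
```

Because the estimate scales linearly with fpops_est, a 100x reduction in the operation count gives a 100x shorter estimated time on the same host.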

jdzukley Joined: 26 May 11 Posts: 32 Credit: 45,660,273 RAC: 8	Message 59194 - Posted: 2 Jul 2013, 13:53:15 UTC Last modified: 2 Jul 2013, 13:56:19 UTC Thanks for the update on the estimated time to completion. Per my observations, the estimates are much improved. The only variance is that some tasks take 5-6 times longer than estimated. This is more or less insignificant, as the original estimate may be 2.5 minutes, with actual run times of 15 minutes. I have a question based on observing total installed CPU processor efficiencies. I am basing my comments on a computer that has 12 cores. Most MT tasks run at an average of about 80% total efficiency, as observed using the Windows Resource Manager. Logic therefore suggests that it would be better to run 12 separate nbody tasks at about 100% rather than a single MT task at 80%. Consider that I also have 2 GPU cards installed, and with those cards running at GPU loads of around 95% (as observed with GPU-Z), it is rare that the total CPU load on the machine gets above 85%. I look forward to any discussion about MT tasks versus single-threaded tasks. I suppose the answer is to compare run time averages versus CPU seconds used. Is there any current review of these results?

Jeffery M. Thompson Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0	Message 59199 - Posted: 2 Jul 2013, 18:55:09 UTC On the first part, the estimate being off: we changed the calculation to get a better order-of-magnitude estimate. We have the data to fine-tune this; the hard part is getting it mostly right, since the input parameters can vary over a large range. We expect to go through one or two more iterations, each a finer-grained improvement. We hope our estimates are now closer and are helping with other scheduling issues. As for the second part, some work units generate more force equations than others. When there are many force equations, MT will be better. So not all work units are the same, and in aggregate it does appear that MT is better than non-MT, though we are looking at optimizing how we set up initial conditions to improve efficiency. Jeff
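
The comparison jdzukley suggests (run time versus CPU seconds used) can be sketched as below. This is only bookkeeping under loud assumptions: both helper functions are hypothetical, the 15-minute and 80% figures are taken from the post purely as sample inputs, and the single-threaded case assumes the same total CPU work with no threading overhead or memory contention. It does not account for the per-work-unit differences Jeff describes, so it cannot settle which mode is better in practice.

```python
def parallel_efficiency(cpu_seconds: float, wall_seconds: float, n_cores: int) -> float:
    """Fraction of the available core time a multithreaded task actually used."""
    return cpu_seconds / (wall_seconds * n_cores)


def tasks_per_hour(wall_seconds: float, concurrent_tasks: int) -> float:
    """Throughput when `concurrent_tasks` tasks each take `wall_seconds` to finish."""
    return concurrent_tasks * 3600.0 / wall_seconds


N_CORES = 12

# Sample MT task: 15 minutes of wall time at 80% average utilization of 12 cores.
MT_WALL = 15 * 60.0
MT_CPU = MT_WALL * N_CORES * 0.80

# Assumption: a single-threaded run needs the same CPU seconds on one core.
ST_WALL = MT_CPU

if __name__ == "__main__":
    print("MT efficiency:", parallel_efficiency(MT_CPU, MT_WALL, N_CORES))
    print("MT tasks/hour:", tasks_per_hour(MT_WALL, 1))
    print("12x single-threaded tasks/hour:", tasks_per_hour(ST_WALL, N_CORES))
```

With real data, the inputs would be the run time and CPU time reported for completed tasks, averaged over many work units, since individual work units differ in how well they parallelize.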

DJStarfox Joined: 29 Sep 10 Posts: 54 Credit: 1,431,288 RAC: 7	Message 59229 - Posted: 5 Jul 2013, 4:01:39 UTC - in response to Message 59180. Last modified: 5 Jul 2013, 4:04:40 UTC Does this explain why I was given so much credit for a few workunits earlier? I posted about it, but no one has shed any light on the observation. http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3277 NOTE: All those WU have been deleted from the server. However, if you look at my credit, you'll see my MW credits more than doubled in 1 day. This was due to completing a single WU. http://www.allprojectstats.com/showuser.php?projekt=61&id=126741 Edit: I'm also reading about credit system problems from June. Perhaps something is going on with the credit system? http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3294

Jeffery M. Thompson Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0	Message 59230 - Posted: 5 Jul 2013, 5:56:10 UTC Last modified: 5 Jul 2013, 12:22:11 UTC The work units you posted earlier were from before we changed the calculation, so they would have had the same calculation as before. The change happened on July 1st UTC, which was around 8 pm on June 30 EST. So only work units created after that would have a different estimate of fpops. As for the credit system, we were made aware of the 0.64 credit work units and we are looking at why this is happening, so we don't have an answer yet. We are looking at examples from before the fpops estimate change, and we are also trying to monitor the current work units.

kurt Joined: 17 Nov 10 Posts: 12 Credit: 483,196,608 RAC: 3,627	Message 59233 - Posted: 5 Jul 2013, 13:33:25 UTC I still get computation errors on one of my computers, which is running two 6950s. However, the other, running a single 7850, has never had a problem. Any ideas? Kurt

Jeffery M. Thompson Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0	Message 59235 - Posted: 5 Jul 2013, 13:48:24 UTC If you could post a work unit or two, we could look at the stderr returned. Most likely it is the stack overflow, for which we have a patch in testing that we are trying to get out ASAP. It could just be that the two machines differ in how much resource is available before the issue is exposed. With more details we can be more definitive. Jeff

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 59238 - Posted: 5 Jul 2013, 14:36:10 UTC - in response to Message 59235. Hmmm... I've been looking over Kurt's host, and his is looking more like a driver kernel mode GPF to me.