
Why is it so hard to get work?

Message boards : Number crunching : Why is it so hard to get work?

The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 23485 - Posted: 27 May 2009, 10:18:38 UTC - in response to Message 23478.  

This is what I remember Travis posting....


Quote:

To be honest, before people started using scripts to hammer the server, we were getting around 9-11 workunits a second. Now we're seeing around 6-7 workunits a second.


Doesn't look like less to me...


Uhhh... in what reality is 6-7 a second not LESS than 9-11 a second? The only thing I am not sure of with regard to that statement is whether he was talking about outbound or inbound. In either case, it is less...

IF 9-11 inbound before AND 6-7 inbound after, then the amount of science being returned to the project is LESS after than before.

IF 9-11 outbound before AND 6-7 outbound after, then the amount of science being sent out from the project to participants is LESS after than before.

The ONLY way for there NOT to be LESS is if the meaning was really:

IF it took 9-11 seconds to send each task out to a participant before, but only 6-7 seconds after.

That is the only way a lower number equates to a faster turnaround: the statement would have to mean there was a smaller delay in sending out tasks after the scripts came into more general use than before. That is a considerably different meaning from how most people would read what was said, so while it is possible that was the meaning, it is highly unlikely. Perhaps you are being thrown off by the three words "to be honest", taking them to mean that there is disagreement with the idea that the number is less...?

There is no doubt that the amount of science being completed IS less now than before the outage; however, this is NOT due to scripts being used. Something changed on the server side. Before the outage, 9 to 11 wu per second were being transferred to shared memory, and after the outage only 6-7 wu are being transferred to shared memory. It was plain to see on BOINCstats at the time.

If the use of scripts were causing the problems, then we'd expect to see a further drop-off in wu's being transferred as more people use them.

Can we please stop perpetuating the myth that the use of scripts is causing fewer wu's to be available overall? Their use is most likely why the people not using them are not able to get as much work as they were before, but with around 100 new people joining every day, there'd be fewer to go around in any case..... ;)

[AF>DoJ] supersonic
Joined: 5 Mar 09
Posts: 19
Credit: 102,651,985
RAC: 0
Message 23487 - Posted: 27 May 2009, 11:10:13 UTC - in response to Message 23443.  

As I pointed out in this message and to Travis...

To stop the scripters hitting the project so hard, you could increase the minimum time between host contacts at the server end. LHC@home increased theirs to just over 15 minutes.... maybe you could try 2 minutes and see what happens. I believe it is a simple server-side setting.



I agree with this 15-minute minimum time between host contacts...

As a script user, if I decide to stop running the script, the whole project wins a little and I lose a lot (compared to those still running the script).

It would be better if every scripter stopped, so that we'd all win.

But that won't happen, because we're greedy humans.

That's why I won't stop running the script.

But since I can see it hammers the server, an easy solution would be to set a 15-minute (or whatever) minimum time between host contacts.
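
If I remember the BOINC server options correctly, that would just be one line in the project's config.xml, something roughly like this (the option name is from memory and 900 seconds is only the 15-minute example, so the admins should check the server docs before trusting it):

```xml
<!-- rough sketch of a project config.xml fragment; option name quoted from
     memory, and 900 s is just the 15-minute example mentioned above -->
<config>
  <!-- minimum number of seconds between work-sending contacts from one host -->
  <min_sendwork_interval>900</min_sendwork_interval>
</config>
```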

...

Brian Silvers' analogy with grocery behavior is very accurate.

We are living through a time of starvation.

Everybody wants bread, but there are very few baguettes available.

A fight is bound to break out if no rule is established to limit the maximum number of baguettes a good mother can bring back to feed her hungry children.
Brian Silvers
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 23491 - Posted: 27 May 2009, 11:49:13 UTC - in response to Message 23482.  
Last modified: 27 May 2009, 11:56:09 UTC

Barring some disaster, it should be 2-4 weeks. At that point, I can sincerely say that I wish all of the GPU folks well with their future endeavors... ;-)


Oh, I take it that until that point you wouldn't be sincere in wishing the GPU folks well <g>.



While I don't wish them ill will, I do have to admit that it's hard to well up a tear when someone is complaining about going without work for a few hours out of the day, versus those of us who don't get any tasks from here for a couple of days straight and, instead of banging on things to try to get something, just go get work from somewhere else.
Brian Silvers
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 23492 - Posted: 27 May 2009, 11:54:37 UTC - in response to Message 23485.  


There is no doubt that the amount of science being completed IS less now than before the outage; however, this is NOT due to scripts being used. Something changed on the server side. Before the outage, 9 to 11 wu per second were being transferred to shared memory, and after the outage only 6-7 wu are being transferred to shared memory. It was plain to see on BOINCstats at the time.


Speaking as a system admin: you can't glean that much from that graph. People often trot those graphs out when there's been a credit change, usually a reduction, to beat the admins with... saying "SEE, EVEN AN IDIOT CAN SEE THAT YOU'RE DRIVING PEOPLE AWAY". Without the raw metrics, you're guessing...

Additionally, you're putting your own spin on what Travis said. I didn't read it that way at all. I'm seeking clarification.

Can we please stop perpetuating the myth that the use of scripts is causing fewer wu's to be available overall? Their use is most likely why the people not using them are not able to get as much work as they were before, but with around 100 new people joining every day, there'd be fewer to go around in any case..... ;)


You'd need to take that up with Travis, since the way I read what he said, the point was that since people have been hammering the server, project productivity has gone down.
Alinator
Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 23501 - Posted: 27 May 2009, 15:59:09 UTC - in response to Message 23485.  
Last modified: 27 May 2009, 16:08:05 UTC

This is what I remember Travis posting....


Quote:

To be honest, before people started using scripts to hammer the server, we were getting around 9-11 workunits a second. Now we're seeing around 6-7 workunits a second.


Doesn't look like less to me...


<snip intervening reply>

There is no doubt that the amount of science being completed IS less now than before the outage; however, this is NOT due to scripts being used. Something changed on the server side. Before the outage, 9 to 11 wu per second were being transferred to shared memory, and after the outage only 6-7 wu are being transferred to shared memory. It was plain to see on BOINCstats at the time.

If the use of scripts were causing the problems, then we'd expect to see a further drop-off in wu's being transferred as more people use them.

Can we please stop perpetuating the myth that the use of scripts is causing fewer wu's to be available overall? Their use is most likely why the people not using them are not able to get as much work as they were before, but with around 100 new people joining every day, there'd be fewer to go around in any case..... ;)



Whoa!! Stop the presses....

How are you coming up with this analysis from what Travis said?

He's not talking about the rate that work is being transferred into the shared memory buffer (IMO). He's talking about the rate that tasks are being returned to the project.

The only irrefutable statements in your argument have been that the use of scripting (to jump to the head of the line) does not affect the sum total of all work available, and that the overall daily throughput is down (by 30% from the peak, in credit terms).

In fact, it looks to me like you drew your conclusions first, and then went looking for proof rather than the other way around.

Now, there are a couple of factors which haven't been considered here that could have a major impact on overall project throughput.

The first is they don't seem to be running as many different searches concurrently as they did before. This means there is a smaller pool of potential candidates to generate new work for.

The second is they all seem to be particle swarm searches.

What's this have to do with anything you might ask?

If you recall, the PS searches are genetic algorithm simulations, which it has been stated many times in the past are highly 'iteration'-sensitive. You cannot be generating work for the next round of calculation before you have sufficient returned data from the current round, or you risk having the simulation 'wander off' down an 'undesirable' (read that as wrong) path. IOW, if you get too far ahead of yourself, you run the risk of getting doo-doo back.

This means that if you have a battalion of very fast machines grabbing all the work they can, every chance they get, the odds that you will reach a point where work generation must stall temporarily are greatly increased. The reason is that there are still far more 'conventional' hosts out there, and they still grab a significant amount of work, even though they might often get shut out on a request or even run out of work for periods of time.

Anecdotal evidence from my hosts tends to support this. I have observed that on some occasions a host will get its full request of work, up to the session limit, right away, while at other times it gets one or two at a shot, regardless of how many times you 'pop the update button' as soon as the current 7-second delay expires.

This tells me that the condition at those times is not that the project can't move tasks into the buffer fast enough, but that it is having 'trouble' finding a candidate to generate work for, and there just isn't anything to send.
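
To make that gating concrete, here is a toy sketch (my own scribble, definitely not the project's actual code, and every name and number in it is made up) of a search that cannot start round N+1 until most of round N has been reported back:

```python
# Toy model (not MilkyWay@home code) of an iteration-bound search:
# work for round N+1 is only generated once enough results from
# round N have come back, so requests can go away empty in between.
import random

POPULATION = 200   # candidate parameter sets per round (made-up number)
REQUIRED = 160     # results needed back before the next round can start

def simulate(requests=5000, p_return=0.3):
    rnd = 0
    unsent = POPULATION   # tasks of the current round not yet handed out
    in_flight = 0         # handed out, not yet reported back
    returned = 0          # reported back for the current round
    shut_out = 0          # work requests that got nothing

    for _ in range(requests):
        # maybe one previously sent task comes back this tick
        if in_flight and random.random() < p_return:
            in_flight -= 1
            returned += 1
            if returned >= REQUIRED:
                # enough data: generate the next round
                # (stragglers from the old round are simply ignored in this toy)
                rnd += 1
                unsent, in_flight, returned = POPULATION, 0, 0
        # a host asks for work
        if unsent:
            unsent -= 1
            in_flight += 1
        else:
            shut_out += 1   # "no work to send", even though the search isn't done

    return rnd, shut_out

if __name__ == "__main__":
    rounds, empty = simulate()
    print(f"rounds completed: {rounds}, requests sent away empty: {empty}")
```

The numbers don't matter; the point is that the faster the hosts drain the unsent pool, the more requests land in the shut-out branch while the round waits on returns.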

This might explain in large part why historically simulations like this have been developed and run on homogeneous supercomputers or clusters, like Blue Gene for example. I seem to recall that one of the science goals was to figure out if it was even worth the trouble of doing this on the loosely coupled, heterogeneous platform that is known as BOINC (or public DC in general).

I'm pretty sure that one of the findings from this is going to be: "Yes, it works fairly well. However, a main consideration in setting up a run is that you need to evaluate whether you should band-limit the range of host performance to the type of work being done, in order to achieve optimum task throughput and host utilization."

Alinator
AriZonaMoon*
Joined: 29 Sep 08
Posts: 1618
Credit: 46,511,893
RAC: 0
Message 23503 - Posted: 27 May 2009, 16:37:15 UTC - in response to Message 23484.  
Last modified: 27 May 2009, 16:38:34 UTC

Oh my god. :( Close the thread please!



;-) It's like a horror movie, isn't it? We hate it but we look at it anyway.
Still, even horror movies can get to be too much... day in and day out. ;p
I think I should stay away from reading this thread. It gives such a bad
feeling..... There's probably something wrong with me. I just don't fit in. :p
Better to go back to other areas and have fun. ;-D
Brian Silvers
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 23509 - Posted: 27 May 2009, 17:34:25 UTC - in response to Message 23501.  

He's not talking about the rate that work is being transferred into the shared memory buffer (IMO). He's talking about the rate that tasks are being returned to the project.


Actually, I'm not sure what he was talking about. There are two main possibilities... either the amount coming back in or the amount going out from the scheduler and file-transfer servers. It's confusing because of his choice of words, specifically "getting". That word can mean the amount being received back, or it could refer to average outbound rates. I'd really like him to clarify which direction he was talking about...



The second is they all seem to be particle swarm searches.

What's this have to do with anything you might ask?


Honestly, you're giving a great explanation below as to why hammering on the server can be counter-productive. I hope people take the time to read it and think on it...


If you recall, the PS searches are genetic algorithm simulations, which it has been stated many times in the past are highly 'iteration'-sensitive. You cannot be generating work for the next round of calculation before you have sufficient returned data from the current round, or you risk having the simulation 'wander off' down an 'undesirable' (read that as wrong) path. IOW, if you get too far ahead of yourself, you run the risk of getting doo-doo back.

This means that if you have a battalion of very fast machines grabbing all the work they can, every chance they get, the odds that you will reach a point where work generation must stall temporarily are greatly increased. The reason is that there are still far more 'conventional' hosts out there, and they still grab a significant amount of work, even though they might often get shut out on a request or even run out of work for periods of time.

Anecdotal evidence from my hosts tends to support this. I have observed that on some occasions a host will get its full request of work, up to the session limit, right away, while at other times it gets one or two at a shot, regardless of how many times you 'pop the update button' as soon as the current 7-second delay expires.

This tells me that the condition at those times is not that the project can't move tasks into the buffer fast enough, but that it is having 'trouble' finding a candidate to generate work for, and there just isn't anything to send.


Again, I hope people read through that...and think on it.


I'm pretty sure that one of the findings from this is going to be: "Yes, it works fairly well. However, a main consideration in setting up a run is that you need to evaluate whether you should band-limit the range of host performance to the type of work being done, in order to achieve optimum task throughput and host utilization."


In other words, establish proper quotas and controls so as not to let the end-users overrun you. (Sorry to be blunt about it, but that's what's going on)...
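
For what it's worth, I believe those controls already exist as plain settings in the project's config.xml on the server side; from memory they look roughly like this (option names and numbers here are only illustrative, not recommendations):

```xml
<!-- sketch of per-host limits in a BOINC project config.xml;
     option names quoted from memory, values are only placeholders -->
<config>
  <max_wus_to_send>10</max_wus_to_send>          <!-- cap per scheduler request -->
  <max_wus_in_progress>20</max_wus_in_progress>  <!-- cap on unreturned tasks per host/CPU -->
  <daily_result_quota>100</daily_result_quota>   <!-- daily cap, scaled by CPU count if I recall -->
</config>
```
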
Alinator
Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 23510 - Posted: 27 May 2009, 18:25:16 UTC - in response to Message 23472.  

<snip>

The thing is, making those 'exclusion' changes immediately upon the release of the project-generated CUDA application for the GPU project would leave a bunch of ATI GPU folks homeless for an intervening period of days to weeks. I don't think that is the sequence Travis has in mind (I could be wrong, though).



AHHHH....

I see your drift now, sorry. A lot of stuff to absorb quickly in this thread! ;-)

I agree; there should at least be a period before a full 'pilot' test run in which the ATi guys have a chance to get something ready for them on the AP.

It would be unfair and unwise to launch GPU fully and not have them included.

Alinator
Bill & Patsy
Joined: 7 Jul 08
Posts: 47
Credit: 13,629,944
RAC: 0
Message 23511 - Posted: 27 May 2009, 19:03:31 UTC

Interesting that there seems to be a general consensus on what's "fair" to the GPU folks vis-à-vis the transition, yet (as is evident from the continuing argument) a lingering wish by some that using scripts to push to the front of the line (collateral damage and all) could also be "fair". Wouldn't we agree that the former reflects regard for the effects of one's actions on fellow participants (good), while the latter reflects disregard for those effects (you decide)?

--Bill

The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 23512 - Posted: 27 May 2009, 20:06:11 UTC

Travis said in message 23346:

Actually, the problem isn't that work isn't being generated fast enough, it's that while there's available work, the server can't move it into shared memory fast enough to keep up with work requests (which get work fed from shared memory).

I'm really hoping to put this all behind us once we get milkyway_gpu up and running (which is getting really close). Probably another week or two.

I know it's been a long time, but instead of working on half-assed semi-fixes that probably wouldn't work anyways, we decided to take the extra time and effort into making a real fix -- splitting up the project into CPU and GPU versions where we can have correspondingly sized workunits.

We really do appreciate everyone sticking with us through all of this. There is at least a light at the end of the tunnel now :)


Travis said in message 23369:

To be honest, before people started using scripts to hammer the server, we were getting around 9-11 workunits a second. Now we're seeing around 6-7 workunits a second.

OK, OK. I put 1 and 1 together to get what I said - which I'm pretty sure equals 2.

It is irrefutable that immediately prior to the outage more work was being done than immediately after. Now, that can't be due to a sudden increase in the use of scripts during the outage. Something else had to have happened. In any case, it's a futile debate and we can just wait for the light at the end of the tunnel, which is slowly getting larger.
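
To picture what "can't move it into shared memory fast enough" means, here's a little toy of my own (not the real BOINC feeder, and all the rates are invented): a slow producer filling a fixed-size buffer while requests drain it faster.

```python
# Toy producer/consumer model (mine, not the real BOINC feeder) of the
# situation Travis describes: plenty of work exists in the database, but
# the process copying it into a fixed-size shared-memory buffer is slower
# than the rate at which hosts ask for it.
from collections import deque

BUFFER_SLOTS = 100   # size of the shared-memory array (invented)
FEED_RATE = 7        # workunits the "feeder" can copy in per second
REQUEST_RATE = 10    # work requests arriving per second

def simulate(seconds=60):
    backlog = 1_000_000                   # ready workunits sitting in the database
    buffer = deque(maxlen=BUFFER_SLOTS)   # the shared-memory work cache
    unsatisfied = 0

    for _ in range(seconds):
        # the feeder copies a handful of workunits into shared memory
        for _ in range(min(FEED_RATE, backlog)):
            if len(buffer) < BUFFER_SLOTS:
                buffer.append("workunit")
                backlog -= 1
        # the scheduler answers this second's requests out of the buffer
        for _ in range(REQUEST_RATE):
            if buffer:
                buffer.popleft()
            else:
                unsatisfied += 1          # "no work available" despite the backlog
    return unsatisfied

if __name__ == "__main__":
    print("requests sent away empty in one minute:", simulate())
```

With these made-up rates, roughly a third of the requests go away empty even though the backlog never runs out, which is a different failure mode from "no work is being generated".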

Live long and BOINC.


Blurf
Volunteer moderator
Project administrator
Joined: 13 Mar 08
Posts: 804
Credit: 26,380,161
RAC: 0
Message 23520 - Posted: 27 May 2009, 21:23:25 UTC

This is being locked due to various complaints.
