new workunit queue size (6)

Author	Message
Kevint Joined: 22 Nov 07 Posts: 285 Credit: 1,076,786,368 RAC: 0	Message 13211 - Posted: 28 Feb 2009, 3:09:08 UTC Last modified: 28 Feb 2009, 3:13:57 UTC Travis, as long as you think your server can handle the requests... This is just one machine with a GPU. You will notice it sending a scheduler request roughly every 15 seconds and getting one task each time. It is going to destroy your bandwidth and may cause some issues with the RPC calls on your side. I am not sure about this, but it is something to consider.
2/27/2009 19:59:30|Milkyway@home|Sending scheduler request: Requested by user. Requesting 2488366 seconds of work, reporting 1 completed tasks
2/27/2009 19:59:35|Milkyway@home|Scheduler request completed: got 1 new tasks
2/27/2009 19:59:46|Milkyway@home|Sending scheduler request: To report completed tasks. Requesting 2488191 seconds of work, reporting 1 completed tasks
2/27/2009 19:59:51|Milkyway@home|Scheduler request completed: got 1 new tasks
2/27/2009 20:00:02|Milkyway@home|Sending scheduler request: To report completed tasks. Requesting 2488079 seconds of work, reporting 1 completed tasks
2/27/2009 20:00:07|Milkyway@home|Scheduler request completed: got 1 new tasks
2/27/2009 20:00:17|Milkyway@home|Sending scheduler request: To report completed tasks. Requesting 2487999 seconds of work, reporting 1 completed tasks
2/27/2009 20:00:22|Milkyway@home|Scheduler request completed: got 1 new tasks
Guess it is time to open up another project to run as well.
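A quick way to put a number on the request rate Kevint describes is to diff the timestamps of the "Sending scheduler request" lines in his log. This is a minimal Python sketch: the timestamps are copied from the log, and the per-hour figure assumes the host keeps up the same pace, which is an extrapolation, not a measurement.

```python
from datetime import datetime

# Timestamps of the "Sending scheduler request" lines in Kevint's log.
REQUEST_TIMES = [
    "2/27/2009 19:59:30",
    "2/27/2009 19:59:46",
    "2/27/2009 20:00:02",
    "2/27/2009 20:00:17",
]

def request_intervals(stamps):
    """Return the gap in seconds between consecutive scheduler requests."""
    times = [datetime.strptime(s, "%m/%d/%Y %H:%M:%S") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

gaps = request_intervals(REQUEST_TIMES)
mean_gap = sum(gaps) / len(gaps)
print(gaps)                    # → [16.0, 16.0, 15.0]
print(round(3600 / mean_gap))  # extrapolated requests per hour → 230
```

At that pace a single GPU host would send a couple of hundred scheduler requests per hour, each one returning a single task.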

gomeyer Joined: 26 Sep 08 Posts: 12 Credit: 1,228,382 RAC: 0	Message 13212 - Posted: 28 Feb 2009, 3:10:53 UTC Do you have the option to limit DLs to perhaps 10 at once but keep the total limit at 20 per core? That should ease the hits on the scheduler, yet still leave us with a better comfort level if we do manage to get a full quota, and let us get a partial load to hold us over in the meantime.
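gomeyer's suggestion amounts to two separate limits: a cap on how many tasks one scheduler request may hand out, and the existing cap on how many tasks a host may hold per core. Below is a minimal Python sketch of that rule. The function names are hypothetical, it assumes the "10 at once" applies per request per host, and it is not Milkyway@home's or BOINC's actual scheduler code.

```python
def tasks_to_send(in_progress, ncpus, per_rpc_cap=10, per_core_limit=20):
    """Hypothetical rule: grant at most per_rpc_cap tasks per scheduler
    request, without letting the host exceed per_core_limit tasks per core."""
    room = per_core_limit * ncpus - in_progress
    return max(0, min(per_rpc_cap, room))


def fill_from_empty(ncpus, per_rpc_cap=10, per_core_limit=20):
    """Grants a host receives on successive requests, starting with an empty
    queue and assuming (for simplicity) that no tasks finish meanwhile."""
    grants, held = [], 0
    while True:
        n = tasks_to_send(held, ncpus, per_rpc_cap, per_core_limit)
        if n == 0:
            return grants
        grants.append(n)
        held += n


print(fill_from_empty(ncpus=4))                 # eight grants of 10 tasks each
print(tasks_to_send(in_progress=75, ncpus=4))   # only 5 slots left → 5
```

Setting the per-request cap above the total limit, e.g. `fill_from_empty(ncpus=4, per_rpc_cap=1000)`, collapses this back to a plain per-core limit where one request fills the whole queue.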

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 13215 - Posted: 28 Feb 2009, 3:13:41 UTC - in response to Message 13210. "I don't know how many GPUs are already in use on this project, but with more likely to be used soon I would think it a good idea to increase the size of each WU, so there will not need to be so many requests." I'd rather have a real fix for the problem, because if we do this, the same problem will just show up again when we get more users and start seeing the same number of workunit requests...

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 13216 - Posted: 28 Feb 2009, 3:14:34 UTC - in response to Message 13212. "Do you have the option to limit DLs to perhaps 10 at once but keep the total limit at 20 per core? ..." I was thinking about doing something like this. I'm going to let the 6 WU queue run for a while first and see if it helps anything.

Bob in FL Joined: 19 Jul 08 Posts: 5 Credit: 2,547,855 RAC: 0	Message 13218 - Posted: 28 Feb 2009, 3:35:50 UTC Up until yesterday my quad had very rarely run out of work. Now, in the last 24 hours, it has run completely dry probably five times or more, and all I get is the same as everyone else: "Scheduler request completed: got 0 new tasks"

Arion Joined: 10 Aug 08 Posts: 218 Credit: 41,846,854 RAC: 0	Message 13219 - Posted: 28 Feb 2009, 3:41:17 UTC - in response to Message 13216. "I was thinking about doing something like this. I'm going to let the 6 WU queue run for a while first and see if it helps anything." I haven't been able to keep my caches full (set for 0.1 days) for a while now. Cutting back to 6 per core is just going to make it beg even more. It seems to me this would cause more problems for the server, rather than hosts making fewer requests at longer intervals. So I'll help out: I'm setting all systems to pull from Einstein when the server here won't honor requests for work. Not a complaint, as I know you've got problems, but I would rather have the computers doing something than sitting idle for long stretches. Maybe this will take some of the load off. It probably wouldn't hurt if others did the same thing over the weekend until you get this worked out. (But then again, if more people were doing that, you would probably think 6 is enough.) <smile>

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 13221 - Posted: 28 Feb 2009, 3:44:28 UTC - in response to Message 13218. "Up until yesterday my quad had very rarely run out of work. ..." I've been debugging the new code for the assimilator/validator, which lets us do some validation of workunits to keep our searches from getting screwed up by invalid results. This has caused the server to crash quite a few times this evening, which may account for a lot of the lack of work. The assimilator/validator doesn't seem to be crashing anymore (fingers crossed), so work availability should be better from here on out.

jedirock Joined: 8 Nov 08 Posts: 178 Credit: 6,140,854 RAC: 0	Message 13223 - Posted: 28 Feb 2009, 4:48:30 UTC - in response to Message 13221. "The assimilator/validator doesn't seem to be crashing anymore (fingers crossed), so work availability should be better from here on out." My quad seems to be staying full so far. It's running the GPU app, 0.19.

Glenn Rogers Joined: 4 Jul 08 Posts: 165 Credit: 364,966 RAC: 0	Message 13224 - Posted: 28 Feb 2009, 4:59:13 UTC G'day all. I just saw this result in my task list and am wondering why:
Task ID: 12690510
Name: ps_s82_10_394_1235776518_0
Workunit: 12383039
Created: 27 Feb 2009 23:15:21 UTC
Sent: 27 Feb 2009 23:15:54 UTC
Received: 28 Feb 2009 4:51:43 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x0)
Computer ID: 21898
Report deadline: 2 Mar 2009 23:15:54 UTC
CPU time: 1027.891
stderr out:
<core_client_version>6.4.6</core_client_version>
<![CDATA[
<stderr_txt>
Running Milkyway@home version 0.19 by Gipsel
CPU: Genuine Intel(R) CPU T2300 @ 1.66GHz (2 cores/threads) 1.66677 GHz (917ms)
WU completed.
It took 1027.89 seconds CPU time and 1038.24 seconds wall clock time @ 1.66678 GHz.
</stderr_txt>
]]>
Validate state: Invalid
Claimed credit: 2.83954728750274
Granted credit: 0
Application version: 0.19
Every other task has been validated correctly, so I don't know why this one is the odd one out...

BarryAZ Joined: 1 Sep 08 Posts: 520 Credit: 302,538,504 RAC: 0	Message 13225 - Posted: 28 Feb 2009, 6:09:49 UTC - in response to Message 13202. What lowering to 6 does is temporarily (say for maybe a couple of hours) reduce the false (success / 0 new work) messages by replacing them with 'met your CPU limit of 6'. Then, when the completed work drops the queue back from 20 to 6, the same problem pops up again, but more frequently. Now one needs to hit the server for more work almost continuously, since it may take about 15 minutes of pounding on the server to get 45 minutes of work, by which time two more work units have completed. I'm still trying to figure out how one could write a 'pound the server continuously' script <rueful smile>. "I'm going to lower the number to 6 and see if that helps."
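To see why a smaller cap hurts faster hosts more, it helps to convert a queue cap into the time a device can keep working from a full queue if no new work arrives. The runtimes below are assumptions loosely taken from this thread (about 1028 s for the CPU task Glenn posted, and roughly one task per 15 s for Kevint's GPU host, whose log reports one completed task per request), not measured averages. The sketch also treats each figure as a single device draining `cap` tasks on its own, ignoring that the per-CPU limit is multiplied by the host's core count.

```python
def buffered_seconds(queue_cap, seconds_per_wu):
    """Seconds one device can keep crunching from a full queue of
    queue_cap tasks if the scheduler sends nothing new."""
    return queue_cap * seconds_per_wu

# Assumed per-task runtimes, loosely taken from this thread.
CPU_WU_SECONDS = 1028  # the T2300 task Glenn posted (1027.89 s CPU time)
GPU_WU_SECONDS = 15    # Kevint's GPU host finished roughly one task per 15 s

for cap in (20, 6):
    cpu_hours = buffered_seconds(cap, CPU_WU_SECONDS) / 3600
    gpu_minutes = buffered_seconds(cap, GPU_WU_SECONDS) / 60
    print(f"cap {cap:2d}: CPU core ~{cpu_hours:.1f} h, GPU ~{gpu_minutes:.1f} min")
# cap 20: CPU core ~5.7 h, GPU ~5.0 min
# cap  6: CPU core ~1.7 h, GPU ~1.5 min
```

Under these assumptions a CPU core still has over an hour of slack at a cap of 6, while a GPU host has only minutes, which fits the reports of GPU hosts having to contact the server almost continuously.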

BarryAZ Joined: 1 Sep 08 Posts: 520 Credit: 302,538,504 RAC: 0	Message 13226 - Posted: 28 Feb 2009, 6:10:54 UTC - in response to Message 13219. Yup, I think you are spot on here. "I haven't been able to keep my caches full (set for 0.1 days) for a while now. Cutting back to 6 per core is just going to make it beg even more. ..."

mscharmack Joined: 4 Dec 07 Posts: 45 Credit: 1,257,904 RAC: 0	Message 13227 - Posted: 28 Feb 2009, 6:24:52 UTC Last modified: 28 Feb 2009, 6:42:29 UTC Feed the beast. Limiting the workunit queue size to 6 will not save the beast from starvation. Feed the beast. Holy mackerel! Call headquarters. Get the lieutenant.

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 13228 - Posted: 28 Feb 2009, 6:41:46 UTC - in response to Message 13224. "G'day all. I just saw this result in my task list and am wondering why... Every other task has been validated correctly, so I don't know why this one is the odd one out..." Hmmm... Hard to say, but assuming it wasn't just a coincidence, the most likely reason is that the backend 'lost' the output file for some reason (due to the troubleshooting at MW). If it doesn't happen again, I wouldn't worry about it. Alinator

m4rtyn Joined: 16 Jan 08 Posts: 18 Credit: 4,111,257 RAC: 0	Message 13229 - Posted: 28 Feb 2009, 7:18:57 UTC Sorry, but reducing the wu limit to 6 has only made things worse. m4rtyn

Riil Joined: 10 Feb 09 Posts: 13 Credit: 1,704,492 RAC: 0	Message 13234 - Posted: 28 Feb 2009, 8:26:32 UTC Last modified: 28 Feb 2009, 8:42:00 UTC
2009-02-28 09:22:33|Milkyway@home|Message from server: No work sent
2009-02-28 09:22:33|Milkyway@home|Message from server: (reached per-CPU limit of 6 tasks)
I see this message more often now than the one about getting 0 new tasks, so in my opinion it's getting better. However, I see that more WUs are being granted 0 credit: as far as I can tell, about 10% of my new WUs get 0 credit.

GalaxyIce Joined: 6 Apr 08 Posts: 2018 Credit: 100,142,856 RAC: 0	Message 13236 - Posted: 28 Feb 2009, 8:50:39 UTC - in response to Message 13234. "I see that more WUs are being granted 0 credit." I'm seeing WUs claiming zero credit and being awarded the credit they jolly well deserve. (OK, it's some GPU tasks that show zero crunching time, but I've timed some of them with a stopwatch and they do actually take a few seconds. I mean, I wouldn't even see them if they took zero seconds, would I?)

etrecords Joined: 15 May 08 Posts: 7 Credit: 126,077,128 RAC: 0	Message 13238 - Posted: 28 Feb 2009, 9:18:07 UTC At the moment I also see workunits that are getting 0 credit. I did not see this before tonight. Now it seems to happen with about 10% of the WUs.

etrecords Joined: 15 May 08 Posts: 7 Credit: 126,077,128 RAC: 0	Message 13239 - Posted: 28 Feb 2009, 9:21:17 UTC I forgot to add: for me it seems to work more smoothly. I get fewer WUs, but the system is running most of the time now. The "reached per-CPU limit" message occurs regularly, but that is the current setting. It means that my server holds the maximum amount of work it is allowed at that moment, so I think this has made the situation more stable.

GalaxyIce Joined: 6 Apr 08 Posts: 2018 Credit: 100,142,856 RAC: 0	Message 13241 - Posted: 28 Feb 2009, 9:39:24 UTC - in response to Message 13239. "I forgot to add: for me it seems to work more smoothly. I get fewer WUs, but the system is running most of the time now. ..." Hallelujah, someone's happy ;)

John Clark Joined: 4 Oct 08 Posts: 1734 Credit: 64,228,409 RAC: 0	Message 13243 - Posted: 28 Feb 2009, 9:44:51 UTC Last modified: 28 Feb 2009, 9:45:19 UTC Compared to yesterday, my rigs (CPU only) seem to be running well and staying fed. Yesterday, with the cache at 20 per CPU, I was in the same position as everyone else: zilch.