had some corruption in the searches
Message boards : News : had some corruption in the searches

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0

Message 41686 - Posted: 23 Aug 2010, 2:07:55 UTC

I finally figured out why no work was being generated even though both assimilators were up and running. Work should be flowing again now.

Next up, I'm incorporating the new assimilators into the BOINC stop/start scripts. Once that's done, they should show up on the server status page and be restarted automatically after a server restart, which should help a bit with the work outages.
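For reference, BOINC projects typically register daemons like assimilators in the project's config.xml; the bin/start and bin/stop scripts and the server status page both read those entries. A rough sketch of what adding the new assimilators might look like (the command names and arguments below are hypothetical, not the project's actual ones):

```xml
<config>
  <daemons>
    <!-- hypothetical entries: one assimilator per application -->
    <daemon>
      <cmd>milkyway_assimilator -app milkyway -d 3</cmd>
    </daemon>
    <daemon>
      <cmd>nbody_assimilator -app milkyway_nbody -d 3</cmd>
    </daemon>
  </daemons>
</config>
```

Once a daemon is listed here, bin/start launches it, bin/stop shuts it down, and the server status page reports whether it is running.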
____________

Profile mdhittle*
Joined: 25 Jun 10
Posts: 284
Credit: 260,490,091
RAC: 0

Message 41698 - Posted: 23 Aug 2010, 14:53:54 UTC - in response to Message 41686.

> I finally figured out why no work was being generated even though both assimilators were up and running. Work should be flowing again now.
>
> Next up, I'm incorporating the new assimilators into the BOINC stop/start scripts. Once that's done, they should show up on the server status page and be restarted automatically after a server restart, which should help a bit with the work outages.


Travis, this is a good start in getting things running smoothly again.

Right now, I am only able to download one workunit every 61 seconds for 12 CPUs and 4 GPUs. I should be able to fill my cache to 72 workunits, but that doesn't happen anymore.

Profile Werkstatt
Joined: 19 Feb 08
Posts: 350
Credit: 123,760,875
RAC: 1,243

Message 41704 - Posted: 23 Aug 2010, 22:18:18 UTC - in response to Message 41698.


> Right now, I am only able to download 1 workunit at a time every 61 seconds for 12 CPUs and 4 GPUs. I should be able to fill my cache to 72 workunits, but that doesn't happen anymore.


Do you use an app_info.xml?
When I use one, I also get the WUs only one at a time, with a cache of 3 or 4 WUs. When I start BOINC without an app_info.xml, MW runs normally.

Alexander

Profile mdhittle*
Joined: 25 Jun 10
Posts: 284
Credit: 260,490,091
RAC: 0

Message 41705 - Posted: 23 Aug 2010, 22:40:10 UTC - in response to Message 41704.


>> Right now, I am only able to download 1 workunit at a time every 61 seconds for 12 CPUs and 4 GPUs. I should be able to fill my cache to 72 workunits, but that doesn't happen anymore.
>
> Do you use an app_info.xml?
> When I use one, I also get the WUs only one at a time, with a cache of 3 or 4 WUs. When I start BOINC without an app_info.xml, MW runs normally.
>
> Alexander


I have tried it both ways, and haven't had any success either way.

Profile mdhittle*
Joined: 25 Jun 10
Posts: 284
Credit: 260,490,091
RAC: 0

Message 41706 - Posted: 23 Aug 2010, 23:05:06 UTC

Since the N-Body workunits started being issued over the weekend, I have only been able to get one regular workunit at a time. I have 12 cores and 4 ATI GPUs and should be able to build up a cache of 72 workunits, but I can't get enough workunits to keep even 2 GPUs busy part of the time. Here is a printout of the sched_op_debug output showing that I am requesting over 600,000 seconds of work and getting only 1 workunit in response.

I have checked my debt, and it was OK; just in case, I reset it to zero, with no effect. I have tried running with and without an app_info file; no help.

Any help you can give in debugging this would be a great help.

8/23/2010 6:54:46 PM Milkyway@home Reporting 1 completed tasks, requesting new tasks for GPU
8/23/2010 6:54:46 PM Milkyway@home [sched_op_debug] CPU work request: 0.00 seconds; 0.00 CPUs
8/23/2010 6:54:46 PM Milkyway@home [sched_op_debug] ATI GPU work request: 6863331.47 seconds; 3.00 GPUs
8/23/2010 6:54:48 PM Milkyway@home Scheduler request completed: got 1 new tasks
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] Server version 611
8/23/2010 6:54:48 PM Milkyway@home Project requested delay of 61 seconds
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] estimated total CPU job duration: 0 seconds
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] estimated total ATI GPU job duration: 470 seconds
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] handle_scheduler_reply(): got ack for result de_16_3s_2_147968_1282603700_0
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] Deferring communication for 1 min 1 sec
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] Reason: requested by project

Profile Gary Roberts
Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,272
RAC: 0

Message 41709 - Posted: 23 Aug 2010, 23:58:29 UTC - in response to Message 41706.

> 8/23/2010 6:54:46 PM Milkyway@home [sched_op_debug] ATI GPU work request: 6863331.47 seconds; 3.00 GPUs


Look more closely. Your request is for nearly 7M secs of work and not 686K secs. Since a full week (the deadline) is just over 600K secs, it's not really surprising that the scheduler doesn't want to send you much work. How do you get your client to actually ask for 6863331.47 seconds of work for 3 GPUs? That is actually over 26 days of work per GPU. How are you able to set your cache that high?
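A quick sanity check of those numbers (a back-of-the-envelope sketch, nothing project-specific):

```python
# Convert the logged ATI GPU work request into days.
request_seconds = 6863331.47   # from the sched_op_debug log above
gpus = 3.0
seconds_per_day = 86400.0

total_days = request_seconds / seconds_per_day
days_per_gpu = total_days / gpus

print(round(total_days, 1))    # 79.4 days requested in total
print(round(days_per_gpu, 1))  # 26.5 days per GPU, vs. a 7-day deadline
```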

____________
Cheers,
Gary.

Profile mdhittle*
Joined: 25 Jun 10
Posts: 284
Credit: 260,490,091
RAC: 0

Message 41710 - Posted: 24 Aug 2010, 0:06:59 UTC - in response to Message 41709.
Last modified: 24 Aug 2010, 0:17:38 UTC

> 8/23/2010 6:54:46 PM Milkyway@home [sched_op_debug] ATI GPU work request: 6863331.47 seconds; 3.00 GPUs
>
> Look more closely. Your request is for nearly 7M secs of work and not 686K secs. Since a full week (the deadline) is just over 600K secs, it's not really surprising that the scheduler doesn't want to send you much work. How do you get your client to actually ask for 6863331.47 seconds of work for 3 GPUs? That is actually over 26 days of work per GPU. How are you able to set your cache that high?


I am not sure how BOINC figures out how much work to request. Regardless of the number, two facts remain.

1. I haven't changed anything; that is the same amount it has always requested for 12 CPUs, 4 GPUs, and processing 4 workunits every 86 seconds. Right now my times are a bit slower because I am running Collatz on 3 of the GPUs, which are overclocked and tuned for Collatz instead of MW.

2. I am only getting one workunit at a time.

Profile Gary Roberts
Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,272
RAC: 0

Message 41721 - Posted: 24 Aug 2010, 14:46:35 UTC - in response to Message 41710.

> I am not sure how BOINC figures out how much work to request.

The client part of BOINC requests work based on the values you set for two user-controlled preferences: 'connect to the internet every X.XX days' and 'maintain enough work for an additional Y days'. I think the max for each is 10 days, so it is theoretically possible to ask for a total of 20 days of work. Of course, with a deadline of just 7 days, the theoretical max is irrelevant.
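As a rough sketch (the real client logic is more involved, and the names here are illustrative, not BOINC's actual variables), the size of a work request scales with those two preferences, capped by the deadline:

```python
def work_request_seconds(connect_every_days, extra_days, n_instances,
                         deadline_days=7.0):
    """Rough sketch of how many seconds of work a client might request.

    connect_every_days / extra_days mirror the two user preferences;
    the project deadline caps what can sensibly be asked for.
    Illustrative only -- not BOINC's actual work-fetch algorithm.
    """
    wanted_days = connect_every_days + extra_days
    capped_days = min(wanted_days, deadline_days)
    return capped_days * 86400.0 * n_instances

# With both preferences at their 10-day maximum and 3 GPUs, an uncapped
# request would be 20 days per GPU; the 7-day deadline caps it here.
print(work_request_seconds(10.0, 10.0, 3))  # 1814400.0 (3 GPUs x 7 days)
```

By this reasoning, any sane request for 3 GPUs tops out around 1.8M seconds, which is why a 6.8M-second request looks like a client bug.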

The crazy part is that the log snippet you posted shows a request for 79.4 days of work spread over 3 GPUs, so there must be some weird bug(s) in the BOINC client you are using that generate such an impossible work request. BTW, you mention 12 CPUs and 4 GPUs, but your hosts are hidden, so it's not possible to see what BOINC thinks about your host (or hosts). Also, you support quite a number of projects; how many of them are competing with MW for work?

The server part of BOINC is bound to reject a 79.4-day request because it figures the work could not possibly all be returned within the deadline. A critical part of the 'thinking' at the server end depends on the first of the two preferences: if you set a large value there, the scheduler has to allow for the possibility that the client really may not contact the server again (to return any completed results) for that many days.

The 79.4-day request suggests that perhaps you have large values for both preferences. If so, it would be very interesting to experiment with something more reasonable, like 0.01/1, 0.01/2 or even 0.01/3, and see what happens. The first preference should always reflect reality: use a very low value, or even zero, if you have an 'always on' internet connection. And don't go overboard with the 'extra days' preference, or you risk triggering BOINC bugs that cause the scheduler to make weird decisions just like you are seeing.

I don't know if any of this is relevant to your situation or not. It shouldn't be too hard to do a few experiments with preferences and see what happens. In the end it may just be that BOINC simply cannot handle the mix you are throwing at it. Since GPU processing was tacked onto BOINC as an afterthought, it's probably going to take quite a while yet for all the issues to get sorted out.


____________
Cheers,
Gary.

Profile mdhittle*
Joined: 25 Jun 10
Posts: 284
Credit: 260,490,091
RAC: 0

Message 41723 - Posted: 24 Aug 2010, 15:29:46 UTC - in response to Message 41721.
Last modified: 24 Aug 2010, 15:39:26 UTC

This is what Milkyway thinks of my computer:

GenuineIntel Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz [Family 6 Model 44 Stepping 2] (12 processors)
[4] CAL ATI Radeon HD5800 series (Cypress) (1024MB) driver: 1.4.737
Microsoft Windows 7 x64 Edition, (06.01.7600.00)
BOINC version 6.10.58

On this system, I am only running MilkyWay, Collatz as a backup (0% resource share), Aqua, FreeHal, and WuProp. This is the same as it has always been; nothing on this system has changed from before August 21 to after.

But if you look at FreeDC's stats for MilkyWay, you will notice that the overall daily credit has dropped from an average of around 99,000,000 to around 30,000,000 since August 21. August 21 just happens to be the day the N-Body workunits were released into the wild, and the flow of regular workunits slowed to a trickle. There is a slight blip on Monday, when the validator was restarted and the backlog of workunits was validated, causing an increase for that day.

http://stats.free-dc.org/stats.php?page=proj&proj=mil






Copyright © 2018 AstroInformatics Group