Welcome to MilkyWay@home

Posts by Mad_Max

1) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 69398)
Posted 23 Dec 2019 by Mad_Max
Post:

I added another paragraph in my last message after you replied. I'd like someone to explain where this extra timer comes into play. Normally my client connects every 1.5 minutes, and tries to report completed tasks and get more work. Why is my client then just dozing off for TEN minutes instead of 1.5 when the work runs out? It's contacting the server for the purpose of getting new work, NOT to report work. So it should still have the same desire for new work as it did before. So I should only have it idle for 1.5 minutes, not 10!! What is going on here?!


I already explained this here a few months ago:
- The 1.5 min backoff timer is set by the server after each communication. That is OK - it is protection from spamming/DDoS.
- The 10 min backoff timer is set by the BOINC client when a request for new work fails. This is expected behavior chosen by the BOINC programmers, and it works on the client side, so it can be overridden by the user (manually or by script).
- The request for new work fails because of errors in the MW server software, which cannot correctly handle a combined request (reporting completed work + requesting new work in one request). The server sends new work only if the client does NOT report completed work in the same communication. This error is the root of the problem, and it also triggers the 10 min backoff timer in the client.
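
The interplay of the two timers in the list above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not BOINC's actual logic: the function name is made up for illustration, and the 600 s value is a round stand-in for the "10 min" client backoff (the real backoff varies from request to request).

```python
SERVER_DELAY_S = 91      # delay requested by the MW server after each contact
CLIENT_BACKOFF_S = 600   # assumed round value for the ~10 min client-side backoff


def seconds_until_next_work_request(got_work: bool) -> int:
    """Return how long the client waits before it asks for work again.

    Toy model: after a successful fetch only the server delay applies;
    after a failed fetch the client-side backoff applies as well, and the
    client waits for whichever of the two is longer.
    """
    if got_work:
        return SERVER_DELAY_S
    return max(SERVER_DELAY_S, CLIENT_BACKOFF_S)


print(seconds_until_next_work_request(True))   # 91
print(seconds_until_next_work_request(False))  # 600
```

In this model a forced update (manual or scripted) corresponds to ignoring the client-side backoff, which is why it can shorten the idle time.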
2) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 68932)
Posted 28 Jul 2019 by Mad_Max
Post:
some thoughts on fixing the problem with fast GPUs running out of data.
Seti is not asking for new data after every "upload" and it seems an "upload" is not the same as reporting. In any event, seti asks for data much less frequently than milkyway. I am guessing that milkyway, on fast gpus, is asking for more data BEFORE the timeout "you asked for data too soon" and that is why no data is ever sent till 10 or more minutes after the very last reported tasks.

Milkyway needs to STOP asking for data after each upload or better, get help from the SETI folks on how they implemented their buffering.


No, that is not the case - other users and I already checked it long ago.
The server-side timeout on the MW servers is only 1.5 min (91 sec), and the BOINC client always waits at least this long before the next request. You can see the "Not sending work - last request too recent" error only after forced updates (manual or via cmd/script).

About the 10 min: this is another internal CLIENT-side (not server-side) timeout. If the BOINC client gets an error while requesting work from the server, it waits an additional ~10 min before the next request, even if the server did not ask for any delay/timeout.
But this is not the cause of the error - it is its consequence. It makes the problem a little worse (by increasing idle time) but has nothing to do with the reason/source of the error itself.

And about SETI: what you described is standard BOINC client behavior on any project when it has enough work in the queue. It contacts the server to report completed tasks and get more work only about once per hour, or even less frequently. The only reason the client sends requests so often at MW is that it cannot get enough work into the queue, because the server fails on all combined requests.
After some failed requests the queue begins to run dry, and the client begins asking for work more often, trying to fill it up.
But again, that is not the cause/source of the error. It is a consequence too.

1 - normal work
2 - requests for new work fail due to some error in processing combined requests (reporting completed WUs + requesting new WUs in the same request to the server)
3 - the local work cache on the client runs low, because the client successfully reports completed work but cannot get any new work
4 - the client begins sending requests to the server more often due to the low work cache, but they all still fail
5 - the work cache is completely empty, and all completed WUs have already been reported to the server
6 - a work request finally succeeds (a work request without reporting completed work works just fine) and brings a lot of new WUs
7 - go to 1
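
The cycle above can be reproduced with a very small simulation. This is a sketch under simplifying assumptions that are mine, not measured behavior: one task finishes per tick while there is work, the client contacts the server on every tick, and the server rule is reduced to "a combined request gets nothing, a pure work request gets a full batch". The function and parameter names are hypothetical.

```python
def simulate(queue_size: int = 4, batch: int = 4, ticks: int = 12):
    """Return a list of (tick, tasks_in_queue, tasks_received) tuples."""
    queue = queue_size          # tasks waiting or running locally
    history = []
    for tick in range(ticks):
        # Assumption: exactly one task finishes per tick while work remains.
        finished = 1 if queue > 0 else 0
        queue -= finished
        # A request that also reports finished work gets no new tasks;
        # a pure work request (nothing to report) is served normally.
        received = 0 if finished else batch
        queue += received
        history.append((tick, queue, received))
    return history


for tick, queue, received in simulate():
    print(tick, queue, received)
```

With these defaults new work arrives only on the ticks that follow a completely empty queue, which matches steps 5 and 6 of the list.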
3) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 68893)
Posted 13 Jul 2019 by Mad_Max
Post:
Yes, such a workaround (not really a fix) works fine.

I used a slightly different cmd file after I identified and described the source of the problem here: https://boinc.berkeley.edu/forum_thread.php?id=12918&postid=91355#91355
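:start
rem Wait 120 seconds between forced updates
timeout 120
rem Force a scheduler request to the MilkyWay@home project
boinccmd.exe --project milkyway.cs.rpi.edu/milkyway update
goto start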

On machines that are not very fast it completely eliminates idle time with no tasks.
On very fast machines it reduces idle time to ~2 minutes at most and ~1 min on average (provided there are no delays from other causes such as internet connection problems or server outages, of course).
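
For anyone not on Windows, the same loop as the cmd file can be sketched in Python. It uses the same `boinccmd --project ... update` call and the same 120 second pause as the cmd file; the function names and the `cycles`, `sleep` and `run` parameters are my own additions for illustration and testing, and the sketch assumes `boinccmd` is on the PATH.

```python
import subprocess
import time
from typing import Callable, List, Optional

PROJECT_URL = "milkyway.cs.rpi.edu/milkyway"  # same project URL as in the cmd file
INTERVAL_S = 120                               # same pause as "timeout 120"


def build_update_command(boinccmd: str = "boinccmd") -> List[str]:
    """Build the boinccmd call that forces a scheduler request for the project."""
    return [boinccmd, "--project", PROJECT_URL, "update"]


def force_updates(cycles: Optional[int] = None,
                  sleep: Callable[[float], None] = time.sleep,
                  run: Callable[..., object] = subprocess.run) -> None:
    """Sleep, then force an update; repeat `cycles` times (None = forever)."""
    done = 0
    while cycles is None or done < cycles:
        sleep(INTERVAL_S)
        # check=False: a failed update is simply retried on the next cycle.
        run(build_update_command(), check=False)
        done += 1
```

Calling `force_updates()` with no arguments loops until interrupted, like the cmd file.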

P.S.
If the project staff ever want to fix this problem for good, here is a hint on where to start: the bug occurs only on combined requests (reporting completed tasks + requesting new ones), while pure work requests work fine.
That is why it mostly affects fast computers (they almost always have some finished tasks in the queue) while it is rarely seen on slow ones.
It is also why forced manual updates usually work OK while automatic requests fail: if the BOINC client has nothing to report yet, the server gives it new work. If the client reports something completed, the server does not give it new work.
4) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 68614)
Posted 28 Apr 2019 by Mad_Max
Post:
If and when #3076 makes it into the client master, the problems with work fetch should clear up. Richard and I signed off on it. Now just have to have the code reviewers give it their blessing and then have the PR merged into master.


No. The fix you mention is only for users who use the max concurrent option to limit how many tasks from one project can run simultaneously.
There is a bug (or at least non-optimal behavior) where the BOINC client does not request new work if it already has enough tasks to fill this limit.

E.g. a user sets the max concurrent option to 6 (at most 6 tasks can run simultaneously) and a work buffer of 1 day (e.g. 100 tasks), but BOINC does not fill the buffer if it already has 6 or more tasks, while the correct value would be 100 (or whatever is equivalent to 1 day of work).
This patch should fix that.

But right now at MW we have a different issue: BOINC cannot get new work regardless of the "max concurrent" setting:
- it hits users who do not use this option at all
- the client gets new work only after the work buffer is completely empty or if the user makes a manual update, not when the work buffer falls below "max concurrent" as described in the patch description and discussion

One possible temp solution is for those who are having issues with getting the client to request tasks when they have run dry is to backlevel to BOINC 7.12.1. That client would not have all the recent changes that disrupted work fetch that is occurring in BOINC 7.14.2.

You are wrong again here. This problem with getting work from MW exists with very old BOINC clients too. I saw it even in v. 7.6.22 - it is all the same as in the latest stable 7.14.2.

So there are two different, unrelated problems with the work fetch/scheduler. The patch you point to will not fix this one, and rolling back to a previous version of the client will not help either.
There are a lot of bugs and very strange logic in the BOINC work scheduler - it has been known for years as the weakest, buggiest part of the code.
5) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 68613)
Posted 27 Apr 2019 by Mad_Max
Post:
My log is not as detailed as the ones I read here. There must be some diagnostic setting I am not using.
Yes. I have used additional debug options in cc_config.xml to get more details in the log about work requests. Here they are:
<cc_config>
<log_flags>
	<work_fetch_debug>1</work_fetch_debug>
	<sched_op_debug>1</sched_op_debug>
</log_flags>
</cc_config>

They write a LOT to the logs, so they are turned off by default.
Question: Is this a problem that fixes itself after some time passes? If so, about how many minutes maximum of idle time?
Yes, it fixes itself, but only after the queue is completely empty. Sometimes that happens immediately after the last task in the queue completes; sometimes up to 10 min pass after the last task finishes. It depends on the backoff timer. The very first request after the last task will be successful. But if a backoff timer is still counting down (due to previous failed requests), BOINC will wait before requesting new work again, even with an empty work queue.
6) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 68567)
Posted 23 Apr 2019 by Mad_Max
Post:
As expected, the latest BOINC client did not change anything. The problem is still here.
More logs: when the queue is almost empty, BOINC starts counting the GPU like this (0.75 GPU) and still does not get any work:
23/04/2019 09:45:45 | Milkyway@Home | [work_fetch] set_request() for AMD/ATI GPU: ninst 2 nused_total 0.25 nidle_now 0.75 fetch share 1.00 req_inst 0.75 req_secs 74882.47
23/04/2019 09:45:45 | Milkyway@Home | [sched_op] Starting scheduler request
23/04/2019 09:45:45 | Milkyway@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) AMD/ATI GPU (74882.47 sec, 0.75 inst)
23/04/2019 09:45:46 | Milkyway@Home | Sending scheduler request: To fetch work.
23/04/2019 09:45:46 | Milkyway@Home | Reporting 2 completed tasks
23/04/2019 09:45:46 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
23/04/2019 09:45:46 | Milkyway@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
23/04/2019 09:45:46 | Milkyway@Home | [sched_op] AMD/ATI GPU work request: 74882.47 seconds; 0.75 devices
23/04/2019 09:45:48 | Milkyway@Home | Scheduler request completed: got 0 new tasks
23/04/2019 09:45:48 | Milkyway@Home | [sched_op] Server version 713
23/04/2019 09:45:48 | Milkyway@Home | Project requested delay of 91 seconds
23/04/2019 09:45:48 | Milkyway@Home | [work_fetch] backing off AMD/ATI GPU 884 sec
23/04/2019 09:45:48 | Milkyway@Home | [sched_op] Deferring communication for 00:01:31


But after the work queue is totally empty, it asks for work for a whole GPU (1.00 devices) and finally gets a lot of work - 68 tasks in this example:
23/04/2019 09:57:40 | Milkyway@Home | [work_fetch] set_request() for AMD/ATI GPU: ninst 2 nused_total 0.00 nidle_now 1.00 fetch share 1.00 req_inst 1.00 req_secs 75450.52
23/04/2019 09:57:40 | Milkyway@Home | [sched_op] Starting scheduler request
23/04/2019 09:57:40 | Milkyway@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) AMD/ATI GPU (75450.52 sec, 1.00 inst)
23/04/2019 09:57:40 | Milkyway@Home | Sending scheduler request: To fetch work.
23/04/2019 09:57:40 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
23/04/2019 09:57:40 | Milkyway@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
23/04/2019 09:57:40 | Milkyway@Home | [sched_op] AMD/ATI GPU work request: 75450.52 seconds; 1.00 devices
23/04/2019 09:57:42 | Milkyway@Home | Scheduler request completed: got 68 new tasks
23/04/2019 09:57:42 | Milkyway@Home | [sched_op] Server version 713
23/04/2019 09:57:42 | Milkyway@Home | Project requested delay of 91 seconds
23/04/2019 09:57:42 | Milkyway@Home | [sched_op] estimated total CPU task duration: 0 seconds
23/04/2019 09:57:42 | Milkyway@Home | [sched_op] estimated total AMD/ATI GPU task duration: 50593 seconds
23/04/2019 09:57:42 | Milkyway@Home | [sched_op] Deferring communication for 00:01:31
23/04/2019 09:57:42 | Milkyway@Home | [sched_op] Reason: requested by project
23/04/2019 09:57:42 |  | [work_fetch] Request work fetch: RPC complete
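
To check a longer log for the same pattern, the two kinds of request can be told apart automatically. Below is a minimal Python sketch that pairs the "Reporting N completed tasks" line with the "got N new tasks" line of each scheduler request; the function name is hypothetical, and the parsing relies only on the message texts visible in the logs above.

```python
import re

REPORT_RE = re.compile(r"Reporting (\d+) completed tasks?")
GOT_RE = re.compile(r"Scheduler request completed: got (\d+) new tasks?")


def summarize_requests(log_text: str):
    """Return a list of (reported, received) pairs, one per scheduler request."""
    results = []
    reported = 0
    for line in log_text.splitlines():
        if "Sending scheduler request" in line:
            reported = 0                      # a new request starts
        match = REPORT_RE.search(line)
        if match:
            reported = int(match.group(1))
        match = GOT_RE.search(line)
        if match:
            results.append((reported, int(match.group(1))))
            reported = 0
    return results


sample = """\
23/04/2019 09:45:46 | Milkyway@Home | Sending scheduler request: To fetch work.
23/04/2019 09:45:46 | Milkyway@Home | Reporting 2 completed tasks
23/04/2019 09:45:48 | Milkyway@Home | Scheduler request completed: got 0 new tasks
23/04/2019 09:57:40 | Milkyway@Home | Sending scheduler request: To fetch work.
23/04/2019 09:57:42 | Milkyway@Home | Scheduler request completed: got 68 new tasks
"""
print(summarize_requests(sample))  # [(2, 0), (0, 68)]
```

A pair with a non-zero first number and a zero second number is a request that reported work and received none.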
7) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 68566)
Posted 23 Apr 2019 by Mad_Max
Post:
Yep, my BOINC client also thinks it has no GPU when it requests work for the GPU. Of course it gets nothing.
23/04/2019 07:31:24 | Milkyway@Home | [sched_op] Starting scheduler request
23/04/2019 07:31:24 | Milkyway@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) AMD/ATI GPU (60773.27 sec, 0.00 inst)
23/04/2019 07:31:24 | Milkyway@Home | Sending scheduler request: To fetch work.
23/04/2019 07:31:24 | Milkyway@Home | Reporting 1 completed tasks
23/04/2019 07:31:24 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
23/04/2019 07:31:24 | Milkyway@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
23/04/2019 07:31:24 | Milkyway@Home | [sched_op] AMD/ATI GPU work request: 60773.27 seconds; 0.00 devices
23/04/2019 07:31:26 | Milkyway@Home | Scheduler request completed: got 0 new tasks
23/04/2019 07:31:26 | Milkyway@Home | [sched_op] Server version 713
23/04/2019 07:31:26 | Milkyway@Home | Project requested delay of 91 seconds
23/04/2019 07:31:26 | Milkyway@Home | [sched_op] handle_scheduler_reply(): got ack for task de_modfit_80_bundle4_4s_south4s_0_1555431910_2294680_0
23/04/2019 07:31:26 | Milkyway@Home | [work_fetch] backing off AMD/ATI GPU 524 sec


I will try updating to the latest BOINC client to check whether this is fixed already or not (probably not)...
8) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 68565)
Posted 23 Apr 2019 by Mad_Max
Post:
Well, now I have caught the bug with no new WUs until the queue is completely empty (or a manual request/update).

I had never had it before.

The only thing I have changed on my side is that I set the <exclude_gpu> option in cc_config.
I have 2 GPUs in the computer and tried to dedicate them to separate projects: one GPU runs MW@H exclusively while the second runs E@H exclusively,
as a mix of MW and Einstein tasks on the same GPU performs pretty badly.

Two <exclude_gpu> options in cc_config did the trick, but immediately afterwards the problem with getting enough MW WUs popped up.

So it may be not a server bug/issue but a bug in the BOINC client work scheduler when <exclude_gpu> is used.
E.g. BOINC tries to fetch WUs for the GPU that is excluded from the project while "forgetting" that it has another GPU which is allowed to run this project.

Debug entries like
11/04/2019 18:18:06 | Milkyway@Home | [sched_op] NVIDIA GPU work request: 127884.51 seconds; 0.00 devices
reported above also suggest this: BOINC wants a lot of work for the GPU but is stupid enough to report that it has no GPU available. So the server does not send any work for such a strange request.

So I am curious: do other people with such problems also use <exclude_gpu> or <ignore_XXX_dev>?
9) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 68466)
Posted 1 Apr 2019 by Mad_Max
Post:
P.S.

Actually <next_rpc_delay> is NOT a server-side delay. That one has another name (I forgot it).
<next_rpc_delay> is also a client-side delay, but this time not a min delay (do not contact the server until this time has passed) but a max delay (DO contact the server after this time has passed even if there is no need for it, e.g. nothing to report and no need to ask for new tasks).

Is it really needed? This option forces ALL attached clients to contact the server every 10 min even if the client is not actually working for the project at the moment. For example, here is a log snippet from one of my computers where MW is set as a backup project (low priority):
....................................
01/04/2019 15:54:30 | Milkyway@Home | Sending scheduler request: Requested by project.
01/04/2019 15:54:30 | Milkyway@Home | Not requesting tasks: don't need (CPU: job cache full; AMD/ATI GPU: job cache full)
01/04/2019 15:54:32 | Milkyway@Home | Scheduler request completed
01/04/2019 16:04:37 | Milkyway@Home | Sending scheduler request: Requested by project.
01/04/2019 16:04:37 | Milkyway@Home | Not requesting tasks: don't need (CPU: job cache full; AMD/ATI GPU: job cache full)
01/04/2019 16:04:40 | Milkyway@Home | Scheduler request completed
01/04/2019 16:14:44 | Milkyway@Home | Sending scheduler request: Requested by project.
01/04/2019 16:14:44 | Milkyway@Home | Not requesting tasks: don't need (CPU: job cache full; AMD/ATI GPU: job cache full)
01/04/2019 16:14:46 | Milkyway@Home | Scheduler request completed
01/04/2019 16:24:52 | Milkyway@Home | Sending scheduler request: Requested by project.
01/04/2019 16:24:52 | Milkyway@Home | Not requesting tasks: don't need (CPU: job cache full; AMD/ATI GPU: job cache full)
01/04/2019 16:24:54 | Milkyway@Home | Scheduler request completed
01/04/2019 16:34:56 | Milkyway@Home | Sending scheduler request: Requested by project.
01/04/2019 16:34:56 | Milkyway@Home | Not requesting tasks: don't need (CPU: job cache full; AMD/ATI GPU: job cache full)
01/04/2019 16:34:58 | Milkyway@Home | Scheduler request completed
01/04/2019 16:45:03 | Milkyway@Home | Sending scheduler request: Requested by project.
01/04/2019 16:45:03 | Milkyway@Home | Not requesting tasks: don't need (CPU: job cache full; AMD/ATI GPU: job cache full)
01/04/2019 16:45:05 | Milkyway@Home | Scheduler request completed
01/04/2019 16:55:06 | Milkyway@Home | Sending scheduler request: Requested by project.
01/04/2019 16:55:06 | Milkyway@Home | Not requesting tasks: don't need (CPU: job cache full; AMD/ATI GPU: job cache full)
01/04/2019 16:55:09 | Milkyway@Home | Scheduler request completed
01/04/2019 17:05:14 | Milkyway@Home | Sending scheduler request: Requested by project.
...........................................
and so on, every ~10 min.
It keeps hammering the server with useless requests. Usually this is useful only for specific purposes like canceling WUs in progress from the server side. The server cannot contact the client directly, so instead it asks the client to "check in" every X min/hours for possible new instructions. But even for this case a few hours is usually enough. Every 10 min is overkill.
10) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 68465)
Posted 1 Apr 2019 by Mad_Max
Post:
I think mmonnin has found the problem for fast clients. The rpc_delay is much too long for fast turnaround clients. They can exhaust all 200 tasks in the ten minute span before next connection.

This is a server side parameter that project staff can alter. Seti has a 303 second rpc_delay which is borderline too long for fast clients also.


200 computed tasks in less than 10 minutes? That is not possible even for the fastest machines. The very best computers, with a few modern powerful GPUs working in parallel and dedicated to MW alone, can do 200 tasks "only" in ~20-40 min.

The real problem is not rpc_delay by itself but the mismatch between <next_rpc_delay> = 600 sec and <request_delay> = 91 sec.
The server currently asks the client to wait at least 91 sec before the next request, but "bans" it (does not give new tasks and resets the countdown timer back to 600 sec) if less than 600 sec have passed since the previous request.
Fast computers report completed tasks and request new ones every few minutes (they think it is OK to do so because the server asks them to wait only 91 sec, so sending a new request after 120 sec, for example, looks OK) and get "banned" every time because less than 600 sec have passed since the latest request.

With <request_delay> > <next_rpc_delay> there would be no such problems. So INCREASING <request_delay> (shown as the "communication deferred" countdown timer in the project status of the BOINC client) to >600 sec should fix this problem. It would also reduce server load significantly.
Or decrease the server-side timeout.
In either case the client-side delay should be bigger than or at least equal to the server-side delay/timeout to avoid false detection of fast clients as "misbehaving" ones.
11) Message boards : News : Workunit Credit Issues - Fix Implemented (Message 68297)
Posted 20 Mar 2019 by Mad_Max
Post:
I think it is better to just leave it as it is. As far as I can see, not many WUs were affected before the fix kicked in, so it is not a big deal.

P.S.
The site works much better now! It was so SLOW and laggy before the upgrade.
12) Message boards : News : Scheduled Maintenance Concluded (Message 65758)
Posted 13 Nov 2016 by Mad_Max
Post:
Yes, I also try to keep 1 CPU core idle to get max performance from GPU apps. But this is exactly what the current MW app disrupts: BOINC thinks the MW app uses only ~0.5 CPU core per GPU task, so it starts 2 MW GPU tasks on one CPU core. But in reality they now use a full CPU core each, so all CPU cores run at 100% load, including the 1 core I reserved to support GPU work for other BOINC projects. This results in a huge slowdown in processing GPU tasks of other BOINC projects.

It can be temporarily fixed by modifying the app_config file (to tell BOINC that MW now uses a full CPU core per task), but it is the app that needs to be fixed, not users' configs. In a normal situation MW uses less than 0.5 CPU core (it jumps to 100% sometimes, but only for short periods).
13) Message boards : News : Scheduled Maintenance Concluded (Message 65752)
Posted 13 Nov 2016 by Mad_Max
Post:
I also confirm the problem with the 1.42 app: tasks run on the CPU only (100% load on one CPU core, 0% load on the GPU); it seems they are running in OpenCL emulation mode.
Also, the unexpectedly high CPU load severely disrupts the work of other BOINC projects and the BOINC work scheduler.

So I aborted all 1.42 WUs and set MW to "no new tasks" until it is fixed.

P.S.
Windows 7 x64, AMD GPUs (HD 7870/7850)



