Welcome to MilkyWay@home

Finally getting new tasks only seconds after running out. May not be worth the hassle.

jpmboy
Joined: 29 Apr 17
Posts: 33
Credit: 7,041,502,264
RAC: 0
Message 69357 - Posted: 18 Dec 2019, 15:52:02 UTC - in response to Message 69355.  

I'm trying... Here's the cc_config file as I have it - added the flags and options as you recommend. This is the config file in place when I posted the earlier eventlog. Check your PMs for a question regarding the other edits you recommended.
<cc_config>
<log_flags>
<file_xfer>1</file_xfer>
<sched_ops>1</sched_ops>
<task>1</task>
<app_msg_receive>0</app_msg_receive>
<app_msg_send>0</app_msg_send>
<async_file_debug>0</async_file_debug>
<benchmark_debug>1</benchmark_debug>
<checkpoint_debug>0</checkpoint_debug>
<coproc_debug>0</coproc_debug>
<cpu_sched>0</cpu_sched>
<cpu_sched_debug>0</cpu_sched_debug>
<cpu_sched_status>0</cpu_sched_status>
<dcf_debug>0</dcf_debug>
<disk_usage_debug>0</disk_usage_debug>
<file_xfer>1</file_xfer>
<file_xfer_debug>0</file_xfer_debug>
<gui_rpc_debug>0</gui_rpc_debug>
<heartbeat_debug>0</heartbeat_debug>
<http_debug>0</http_debug>
<http_xfer_debug>0</http_xfer_debug>
<idle_detection_debug>1</idle_detection_debug>
<mem_usage_debug>0</mem_usage_debug>
<mw_debug>0</mw_debug>
<network_status_debug>0</network_status_debug>
<notice_debug>0</notice_debug>
<poll_debug>0</poll_debug>
<priority_debug>0</priority_debug>
<proxy_debug>0</proxy_debug>
<rr_simulation>0</rr_simulation>
<rrsim_detail>0</rrsim_detail>
<sched_ops>1</sched_ops>
<sched_op_debug>1</sched_op_debug>
<scrsave_debug>0</scrsave_debug>
<slot_debug>0</slot_debug>
<state_debug>0</state_debug>
<statefile_debug>0</statefile_debug>
<suspend_debug>0</suspend_debug>
<task>0</task>
<task_debug>0</task_debug>
<time_debug>0</time_debug>
<trickle_debug>0</trickle_debug>
<unparsed_xml>0</unparsed_xml>
<work_fetch_debug>0</work_fetch_debug>
</log_flags>
<options>
<abort_jobs_on_exit>0</abort_jobs_on_exit>
<allow_multiple_clients>0</allow_multiple_clients>
<allow_remote_gui_rpc>1</allow_remote_gui_rpc>
<disallow_attach>0</disallow_attach>
<dont_check_file_sizes>0</dont_check_file_sizes>
<dont_contact_ref_site>0</dont_contact_ref_site>
<lower_client_priority>0</lower_client_priority>
<dont_suspend_nci>0</dont_suspend_nci>
<dont_use_vbox>0</dont_use_vbox>
<dont_use_wsl>0</dont_use_wsl>
<exit_after_finish>0</exit_after_finish>
<exit_before_start>0</exit_before_start>
<exit_when_idle>0</exit_when_idle>
<fetch_minimal_work>0</fetch_minimal_work>
<fetch_on_update>0</fetch_on_update>
<force_auth>default</force_auth>
<http_1_0>0</http_1_0>
<http_transfer_timeout>300</http_transfer_timeout>
<http_transfer_timeout_bps>10</http_transfer_timeout_bps>
<max_event_log_lines>2000</max_event_log_lines>
<max_file_xfers>20</max_file_xfers>
<max_file_xfers_per_project>20</max_file_xfers_per_project>
<max_stderr_file_size>0</max_stderr_file_size>
<max_stdout_file_size>0</max_stdout_file_size>
<max_tasks_reported>0</max_tasks_reported>
<mw_low_water_pct>1</mw_low_water_pct>
<mw_high_water_pct>16</mw_high_water_pct>
<mw_wait_interval>512</mw_wait_interval>
<ncpus>-1</ncpus>
<no_alt_platform>0</no_alt_platform>
<no_gpus>0</no_gpus>
<no_info_fetch>0</no_info_fetch>
<no_opencl>0</no_opencl>
<no_priority_change>0</no_priority_change>
<os_random_only>0</os_random_only>
<process_priority>-1</process_priority>
<process_priority_special>-1</process_priority_special>
<use_all_gpus>1</use_all_gpus>
<proxy_info>
<socks_server_name></socks_server_name>
<socks_server_port>80</socks_server_port>
<http_server_name></http_server_name>
<http_server_port>80</http_server_port>
<socks5_user_name></socks5_user_name>
<socks5_user_passwd></socks5_user_passwd>
<socks5_remote_dns>0</socks5_remote_dns>
<http_user_name></http_user_name>
<http_user_passwd></http_user_passwd>
<no_proxy></no_proxy>
<no_autodetect>0</no_autodetect>
</proxy_info>
<rec_half_life_days>10.000000</rec_half_life_days>
<report_results_immediately>0</report_results_immediately>
<run_apps_manually>0</run_apps_manually>
<save_stats_days>30</save_stats_days>
<skip_cpu_benchmarks>0</skip_cpu_benchmarks>
<simple_gui_only>0</simple_gui_only>
<start_delay>0.000000</start_delay>
<stderr_head>0</stderr_head>
<suppress_net_info>0</suppress_net_info>
<unsigned_apps_ok>0</unsigned_apps_ok>
<use_all_gpus>0</use_all_gpus>
<use_certs>0</use_certs>
<use_certs_only>0</use_certs_only>
<vbox_window>0</vbox_window>
</options>
</cc_config>

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69358 - Posted: 18 Dec 2019, 16:19:18 UTC - in response to Message 69357.  

I'm trying... Here's the cc_config file as I have


I have never seen a cc_config that big. I am surprised that boinc did not choke on it. I certainly did.

Not a network expert but if you are running a proxy then maybe slow network traffic is causing the problem.

jpmboy
Joined: 29 Apr 17
Posts: 33
Credit: 7,041,502,264
RAC: 0
Message 69359 - Posted: 18 Dec 2019, 16:54:18 UTC

For some reason, the MW server assigns a GPU count of "0" to a 3-GPU rig whenever there are tasks "Ready to start" in the list (and still only sends ~100 tasks per GPU).

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69361 - Posted: 18 Dec 2019, 17:47:32 UTC - in response to Message 69359.  

For some reason, the MW server assigns a GPU count of "0" to a 3-GPU rig whenever there are tasks "Ready to start" in the list (and still only sends ~100 tasks per GPU).


The "0" is assigned if under 91 seconds OR files are attached for upload.

Put the following between the <log_flags> tags in cc_config.xml:

<mw_debug>1</mw_debug>


If you are using 7.15.0, some messages should be printed confirming it is enabled. Turn off the sched_op flag and any others that clutter the event log, then zip the log up and PM it to me to look at, or post a link to the zip. One should see the effect of the delay.

The default delay is 256 seconds, a little over 4 minutes. At least that much time must go by before any files are uploaded. Changing it to 512 means a wait of 8+ minutes before seeing any difference.

If you turn off NNT and then turn it back on after the last task is reported, you should see 300 * 3 = 900 tasks download. Something is wrong if that does not happen. 7.14.666 will not help you, since spoofing the number of GPUs will not get anything extra on this project.

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69362 - Posted: 19 Dec 2019, 4:27:59 UTC - in response to Message 69359.  
Last modified: 19 Dec 2019, 4:29:50 UTC

I think I know what your problem is: you have reached MAX_TASK_LIMIT. I suspect that applies to all Milkyway tasks, both the n-body (CPU) and the NVidia ones. That is something that was discussed by mmennin and Dave over at boinc. Usually it is 48 tasks. I don't think that, at 32 cores, you have enough to handle the 3 NVidia cards and do the n-body.

12/18/2019 8:25:31 PM | Milkyway@Home | No tasks sent
12/18/2019 8:25:31 PM | Milkyway@Home | This computer has reached a MAX_TASK_LIMIT
…3 minutes later...
12/18/2019 8:27:04 PM | Milkyway@Home | This computer has reached a limit on tasks in progress


How many n-body tasks are you running? I see that your 7.15.0 is concurrently running n-body as well as NVidia.
I suggest you only run CPU tasks on projects that do NOT have GPU apps.

jpmboy
Joined: 29 Apr 17
Posts: 33
Credit: 7,041,502,264
RAC: 0
Message 69365 - Posted: 19 Dec 2019, 15:13:31 UTC - in response to Message 69362.  
Last modified: 19 Dec 2019, 15:15:00 UTC

Yeah, I can account for the MAX_TASK_LIMIT. (Again, it can only reach that limit if I manually snooze the GPUs BEFORE any completed tasks are ready to report.) But the issue is not whether a 36-thread CPU can handle the load. It does: each Titan V runs 6 tasks at 0.5 CPU each, completing them in 52-55 sec, and the N-body sims on 16C each complete in ~60-90 min, with the total CPU load under 70% on average. The problem is that the MW server will not send new GPU tasks to my client so long as any completed tasks are ready to report. Once GPU tasks download and start, there is never a 91 sec period where there are no completed tasks. This is not an issue for the CPU tasks, since they do not accumulate completed tasks every 91 sec.
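
To make the "never a 91 sec period" point concrete, here is a rough back-of-the-envelope sketch using only the figures quoted in this post (3 cards, 6 tasks per card, 52-55 sec per task). The function names are made up for illustration, and the "evenly staggered" case is an assumption about how the tasks line up, not something measured on the rig.

```python
# Back-of-the-envelope check of the 91-second claim, using figures from the post.
GPUS = 3                  # Titan V cards
TASKS_PER_GPU = 6         # concurrent tasks per card
TASK_SECONDS = (52, 55)   # reported run-time range per task
QUIET_WINDOW = 91         # seconds with nothing to report that the server seems to want


def mean_gap_seconds(task_seconds: float) -> float:
    """Average time between completions, ASSUMING task starts are evenly staggered."""
    return task_seconds / (GPUS * TASKS_PER_GPU)


def worst_case_gap_seconds(task_seconds: float) -> float:
    """Longest possible gap: every task starts and finishes at the same moment."""
    return task_seconds


for secs in TASK_SECONDS:
    print(
        f"{secs}s tasks: mean gap {mean_gap_seconds(secs):.1f}s, "
        f"worst-case gap {worst_case_gap_seconds(secs)}s, "
        f"quiet window ever reached: {worst_case_gap_seconds(secs) >= QUIET_WINDOW}"
    )
```

Even in the worst case a finished task shows up at least once per task length, which is well under 91 seconds, so while the GPUs stay busy there is always something waiting to report.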



Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69367 - Posted: 19 Dec 2019, 17:00:24 UTC - in response to Message 69365.  
Last modified: 19 Dec 2019, 17:06:12 UTC

The project has been aware of this problem for some time and I suspect if they knew how to fix it they would have by now.

Maybe the project can answer this question:

Does the MAX_TASK_LIMIT apply to the sum of the CPU and GPU tasks and what is its value?

I also notice that 2 * 16 + 0.5 * 18 = 41, which exceeds your CPU core count of 36.
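
The same sum written out as a small sketch, with each term labelled. The numbers come from this thread; the helper name is hypothetical, and the result is the nominal CPU reservation, not a measured load (the host's owner reports the actual load staying under 70-80%).

```python
# Nominal CPU demand on the 3 x Titan V host, using the figures quoted in the thread.
CPU_THREADS = 36          # threads on the host, per its owner
NBODY_TASKS = 2           # concurrent N-body tasks
THREADS_PER_NBODY = 16    # each N-body task is a 16C multithreaded task
GPU_TASKS = 3 * 6         # 3 cards x 6 tasks per card
CPU_PER_GPU_TASK = 0.5    # CPU reserved per GPU task


def cpu_demand() -> float:
    """Threads reserved on paper: N-body threads plus the CPU share of every GPU task."""
    return NBODY_TASKS * THREADS_PER_NBODY + GPU_TASKS * CPU_PER_GPU_TASK


demand = cpu_demand()
print(f"reserved {demand} of {CPU_THREADS} threads, over by {demand - CPU_THREADS}")
```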

There are other users here running 7.15.0 who have not mentioned a problem.

My suggestion is the same: Do not run CPU tasks on a project that has GPU work available.

I am also guessing that suspending the n-body will not make a difference, as I recall boinc will not download any new project data if any are suspended. I am not sure if that applies to GPU when CPU is suspended; it might. You might ask over at boinc, and maybe Richard might be able to answer that.

For sure, your core count of 41 is too high, plus you need to leave a few cores for the OS to work with.

I recommend aborting the n-body altogether. Your 3 Titans are doing well enough for the project. WCG is a good choice for CPU tasks.

jpmboy
Joined: 29 Apr 17
Posts: 33
Credit: 7,041,502,264
RAC: 0
Message 69368 - Posted: 19 Dec 2019, 19:03:55 UTC - in response to Message 69367.  

Yeah, I'm not having any problem with 7.15; I hope that's not a misunderstanding. 7.15 works fine, but for this issue it seems to work the same as 7.14 on my install. I'll let CPU tasks run out (disable both n-body and separation in preferences) and see if that solves the issue. I did not suspend N-body; I set web preferences to exclude this task type.
I have experienced the same "no new tasks while any are returned" issue before with CPU tasks disabled, but using 7.14 at the time, hence the reason for trying 7.15. And this same behavior occurs on another rig here running a Radeon VII, which completes tasks in under 1 min... so, again, no 91 sec period has zero completed tasks "Waiting to Report" :)
The core count might be the cause, and I'll test that (again), but I never see more than 80% CPU package load (ever). The MT tasks actually lower CPU usage when compared to Separation on the same number of threads. I hope this is the issue and that it fixes this "no GPU tasks when completed GPU tasks are queued for uploading" problem. That would be a voilà!
Correction to last post/PM: MT 16C tasks complete in as little as 13 min and as long as 80 min, irrespective of whether any GPU tasks are running.

MAX_TASK_LIMIT would seem to affect the number a client can hold, but not the number it could receive until hitting that number. I'm less concerned with whether I can load up to 800 or 737 than with, as you point out, the known problem of ZERO GPU tasks being added to reach that, or any, limit while one or more completed GPU tasks are "Ready to report".

Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 540,021,010
RAC: 86,774
Message 69369 - Posted: 19 Dec 2019, 19:27:32 UTC - in response to Message 69358.  

I have never seen a cc_config that big. I am surprised that boinc did not choke on it. I certainly did.

That is the stock cc_config.xml that is created and fully populated the first time you change the logging options of the Event Log in the Manager. BOINC ships with no cc_config.xml initially.

The simplest way to get a correctly formatted one with all the possible parameters is to change a logging flag, even just temporarily, and it writes out the cc_config.xml file.

jpmboy
Joined: 29 Apr 17
Posts: 33
Credit: 7,041,502,264
RAC: 0
Message 69370 - Posted: 19 Dec 2019, 19:42:59 UTC - in response to Message 69369.  
Last modified: 19 Dec 2019, 19:45:49 UTC

C'mon man. This is the cc_ file currently in use (only recent change is 512 to 256 to see if it does anything). Boincmgr reading it occurs once (unless you manually request additional reads). And of course it had many flags toggled "1" while trying to ID the actual root cause of the problem.

Guys - nothing is slowing down the actual processing of tasks... except that MW will not provide GPU tasks so long as any are READY TO REPORT. I hope that is clear now. *although it is a known bug*
This has nothing to do with the length of a cc_config file.

<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<mw_debug>1</mw_debug>
</log_flags>
<options>
<use_all_gpus>1</use_all_gpus>
<allow_remote_gui_rpc>1</allow_remote_gui_rpc>
<mw_low_water_pct>1</mw_low_water_pct>
<mw_high_water_pct>16</mw_high_water_pct>
<mw_wait_interval>256</mw_wait_interval>
</options>
</cc_config>

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69371 - Posted: 19 Dec 2019, 21:26:49 UTC - in response to Message 69369.  

I have never seen a cc_config that big. I am surprised that boinc did not choke on it. I certainly did.

That is the stock cc_config.xml that is created and fully populated the first time you change the logging options of the Event Log in the Manager. BOINC ships with no cc_config.xml initially.

The simplest way to get a correctly formatted one with all the possible parameters is to change a logging flag, even just temporarily, and it writes out the cc_config.xml file.

Yeah, but there are several duplicate entries in the cc file, which is strange.
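
For anyone who wants to check their own file for this, here is a minimal sketch that lists tags repeated within a section of a cc_config.xml. It uses only Python's standard library; the function name and the trimmed-down sample are made up for illustration. In the file posted at the top of this thread, file_xfer, sched_ops and task each appear twice under log_flags (task with two different values), and use_all_gpus appears twice under options, also with two different values. Which copy the client actually honors is not something this sketch can tell you.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_tags(xml_text: str) -> dict:
    """Map each top-level section (log_flags, options, ...) to the tags repeated inside it."""
    root = ET.fromstring(xml_text.strip())
    report = {}
    for section in root:
        # Count only the direct children of this section.
        counts = Counter(child.tag for child in section)
        repeated = sorted(tag for tag, n in counts.items() if n > 1)
        if repeated:
            report[section.tag] = repeated
    return report


# Trimmed-down sample showing the same kind of repeats as the posted file.
SAMPLE = """
<cc_config>
  <log_flags>
    <file_xfer>1</file_xfer>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <task>0</task>
  </log_flags>
  <options>
    <use_all_gpus>1</use_all_gpus>
    <use_all_gpus>0</use_all_gpus>
  </options>
</cc_config>
"""

print(duplicate_tags(SAMPLE))
```

To run it against a real file, read the file's contents into a string and pass that to duplicate_tags instead of SAMPLE.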

jpmboy
Joined: 29 Apr 17
Posts: 33
Credit: 7,041,502,264
RAC: 0
Message 69372 - Posted: 19 Dec 2019, 23:44:44 UTC - in response to Message 69371.  

In the one right below? Yeah - in the initial one, where I was just popping flags in at the request of... probably too much advice from several users. My fault.

jpmboy
Joined: 29 Apr 17
Posts: 33
Credit: 7,041,502,264
RAC: 0
Message 69373 - Posted: 19 Dec 2019, 23:50:24 UTC

BTW - is there any "formal" bug report system for Boinc or MW thru Boinc?
I'm not throwing compute power at this for 'coin, just for the community. My wife is an RPI grad and I gave many an organic chemistry seminar/lecture there decades ago.

jpmboy
Joined: 29 Apr 17
Posts: 33
Credit: 7,041,502,264
RAC: 0
Message 69374 - Posted: 20 Dec 2019, 2:33:19 UTC

Okay... so what I decided to do was to clean out the nvidia driver base and install the latest Studio driver. Using the cc file shown below and v7.15, it now seems to be fetching 20-30 tasks while at the same time sending ~180 completed tasks back to the MW server! Voilà. I'll let it run overnight and see if it holds up. It may (somehow) have been related to the 441.22 vs 441.66 drivers (both are the non-DCH Studio driver). A clean install of 441.66 may have been the fix. Crazy!
<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<mw_debug>1</mw_debug>
</log_flags>
<options>
<use_all_gpus>1</use_all_gpus>
<allow_remote_gui_rpc>1</allow_remote_gui_rpc>
<mw_low_water_pct>1</mw_low_water_pct>
<mw_high_water_pct>16</mw_high_water_pct>
<mw_wait_interval>512</mw_wait_interval>
</options>
</cc_config>
The R6Ax299 rig was the "offending" party. :)

jpmboy
Joined: 29 Apr 17
Posts: 33
Credit: 7,041,502,264
RAC: 0
Message 69376 - Posted: 20 Dec 2019, 15:13:23 UTC

Spoke too soon. After running overnight, the behavior went back to no new tasks when there are completed tasks "ready to report". It worked for a couple of hours, it seems. UGH!

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,543
RAC: 22,328
Message 69378 - Posted: 20 Dec 2019, 20:46:04 UTC - in response to Message 69376.  

Spoke too soon. After running overnight, the behavior went back to no new tasks when there are completed tasks "ready to report". It worked for a couple of hours, it seems. UGH!


This has been going on ever since MW last updated the Boinc server-side software. People have pleaded with them to contact the Developers, or other projects where it works just fine, directly for assistance, but they seemingly never have. This may well continue to be a problem until they upgrade again in a few years, and that's of course assuming that the Developers are watching all of this and can program in a fix without screwing up every other project.

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69379 - Posted: 20 Dec 2019, 20:50:43 UTC - in response to Message 69376.  
Last modified: 20 Dec 2019, 20:52:25 UTC

Spoke too soon. After running overnight, the behavior went back to no new tasks when there are completed tasks "ready to report". It worked for a couple of hours, it seems. UGH!


I still think the problem is the n-body hogging the CPU and hitting a limit on tasks in progress.
Use the Milkyway project settings to prevent their CPU tasks from downloading, and purge all the Milkyway CPU-bound ones you have. I noticed you had purged a number of them yesterday, but you still had them listed on the display image.

I think they have to be purged, not just suspended. If this does not work then you are on your own, as I cannot duplicate the problem with my 6 GPUs.

I once lost 900 tasks in a few seconds with a bad edit of the app_config.xml. It is not going to hurt the project. They can make up stuff faster than you or I can process it.

jpmboy
Joined: 29 Apr 17
Posts: 33
Credit: 7,041,502,264
RAC: 0
Message 69380 - Posted: 20 Dec 2019, 21:50:08 UTC - in response to Message 69379.  

I did disable the N-body sims and aborted any tasks I had "Ready to run". Same issue. Running separation and GPU tasks, the thread count never exceeds 32. There were a few instances where two 16C N-body tasks would start up if there were no GPU tasks running (which can be for 10-20 min before new tasks DL after the last is uploaded).
Check your PM. I'll link some more info trying the boinctasks (no mgr regedit thing). :)

jpmboy
Joined: 29 Apr 17
Posts: 33
Credit: 7,041,502,264
RAC: 0
Message 69381 - Posted: 20 Dec 2019, 21:54:00 UTC - in response to Message 69378.  

Yeah, it sure is a bug in the server-side software... maybe we can find a way to patch the issue from the client side. JStateson has been at this pretty creatively. Maybe his home-grown fix can work on this rig also.
Honestly, it's not an issue on my R290 or R295x2 rig, only on the Titan V and Radeon VII rigs. If we can "fix" this Titan V rig (3 GPUs) it should also work on the VII rig. Well, I hope anyway. :)

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,543
RAC: 22,328
Message 69383 - Posted: 20 Dec 2019, 23:20:23 UTC - in response to Message 69381.  

Yeah, it sure is a bug in the server-side software... maybe we can find a way to patch the issue from the client side. JStateson has been at this pretty creatively. Maybe his home-grown fix can work on this rig also.
Honestly, it's not an issue on my R290 or R295x2 rig, only on the Titan V and Radeon VII rigs. If we can "fix" this Titan V rig (3 GPUs) it should also work on the VII rig. Well, I hope anyway. :)


The problem is that thousands of people have this problem and they just aren't switching to a homemade version of Boinc to fix it. The Project needs to fix this for EVERYONE!!

©2024 Astroinformatics Group