Welcome to MilkyWay@home

Posts by JHMarshall

1) Message boards : Number crunching : Delay in getting new work units untill all work units have cleared (Message 69148)
Posted 2 Oct 2019 by JHMarshall
Post:
Apologies, I thought the problem had been resolved. I'm looking into the problem and will hopefully have a solution soon (or at least an explanation).

- Tom

Tom,

Thank you,

Joe
2) Message boards : Number crunching : Delay in getting new work units untill all work units have cleared (Message 69145)
Posted 30 Sep 2019 by JHMarshall
Post:
same problem with 7.16.3: One does not get any data for up to 15 minutes after running out, then getting 900 or so all downloaded at once.

report bunch of tasks and immediately ask for more and get nothing

[code]
9/30/2019 10:11:25 AM Starting BOINC client version 7.16.3 for windows_x86_64

107 Milkyway@Home 9/30/2019 10:15:44 AM Sending scheduler request: To fetch work.
108 Milkyway@Home 9/30/2019 10:15:44 AM Reporting 19 completed tasks
109 Milkyway@Home 9/30/2019 10:15:44 AM Requesting new tasks for AMD/ATI GPU
110 Milkyway@Home 9/30/2019 10:15:47 AM Scheduler request completed: got 0 new tasks
[/code]

Curious: Is it possible, in the app, to ask for more data before reporting or uploading? Can the app be built with VS2017 or later?


From the research I've done, the delay is normal operation for the BOINC client. The delay is designed to keep the client from continually
pestering a project when it has no work.

In our case, MW has work but fails to send it when the client requests it. This makes the client think the project has no work and the client
backs off request times.

The real problem with MW is exactly what you show in lines 107 to 110 in your log. The client asks for work and MW fails to send it.

This is exactly what I see in my logs.

Joe
3) Message boards : Number crunching : Delay in getting new work units untill all work units have cleared (Message 69138)
Posted 29 Sep 2019 by JHMarshall
Post:
OK … I finally found some information on the "resource backoff". This is normal client operation. If the client doesn't get work for a specific
resource (in our case the GPU) when it requests work, it stops asking for a certain time interval. This is the resource backoff time.

On all my systems, MW NEVER sends new tasks when reporting completed tasks. This results in the client setting a resource backoff.

The problem is more apparent with fast GPUs because they always have tasks to report at the 90 sec RPC backoff interval. Therefore, they are
always in a resource backoff situation until they no longer have tasks to report. The resource backoff seems to start at a value between 100 and 400 secs,
with an increment of 600 secs. If a computer takes several minutes to run a MW task, it takes hours to empty the cache and a 5 to 15 minute gap is not very
noticeable. On a system with very fast GPUs the cache can be emptied in 40 minutes or less. Then a 5 to 15 minute gap is an eternity and is very frustrating.

After reporting the last completed task(s) and failing to get new tasks, the client will not request new MW tasks until the resource backoff has counted down.
This is the gap we fill with "0 resource share" projects until the client is allowed to request new MW tasks.

A user update request clears the resource backoff. This is why a user update request after all tasks are complete and the RPC backoff time has
counted down refills the cache.

If MW could figure out why the server NEVER sends new tasks when the client requests them while reporting completed task(s), I think our issues would be resolved.
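A rough sketch of the behavior described above, as a toy model only. This is not BOINC's actual code: the class name `ToyClient`, its methods, and the single fixed 600 sec backoff are all assumptions made for illustration (the observed backoff starts somewhere between 100 and 400 secs and grows).

```python
from dataclasses import dataclass


@dataclass
class ToyClient:
    """Toy model of the resource backoff described in the post (hypothetical)."""
    backoff_len: float = 600.0    # assumed fixed resource backoff, in secs
    backoff_until: float = 0.0    # time at which work requests are allowed again

    def contact(self, now: float, server_sends_work: bool) -> str:
        """One scheduler contact at time `now` (secs)."""
        if now < self.backoff_until:
            # Still backed off: the client reports tasks but does not ask for work.
            return "report only (in resource backoff)"
        if server_sends_work:
            return "got work"
        # Asked for work, got nothing: start a new resource backoff.
        self.backoff_until = now + self.backoff_len
        return "asked, got 0 tasks -> backoff set"

    def user_update(self) -> None:
        """A manual update clears the resource backoff."""
        self.backoff_until = 0.0


client = ToyClient()
# Contacts every 90 secs (the RPC interval), with the server never sending work.
for t in range(0, 721, 90):
    print(t, client.contact(t, server_sends_work=False))

client.user_update()
print(client.contact(720, server_sends_work=True))
```

In this model a client that reports every 90 secs re-enters backoff as soon as the previous one expires, so it only gets work again after a manual update or after the last backoff runs out.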

Joe
4) Message boards : Number crunching : Delay in getting new work units untill all work units have cleared (Message 69136)
Posted 28 Sep 2019 by JHMarshall
Post:
We have been monitoring the situation, and it seems like the community has found fixes to some of the problems you are experiencing.

Jake said that the problem appeared to be some obscure BOINC setting somewhere, and had asked BOINC forums about it. It looks like this issue disappears in the new beta of a BOINC client, so they must have patched whatever was causing problems. When that is released, hopefully the problem will be resolved.

- Tom



Tom,

Sorry but I think you are incorrect. The community has not found fixes to the problem (delay in getting work). We use workarounds that process work from other projects
while MW sits on its butt. I've seen nothing that shows this is a client issue especially since it started after MW server changes. I run many projects and only one project
has this issue.

I can duplicate the problem on all my systems: slow GPUs, fast GPUs, Nvidia GPUs, and AMD GPUs. I have logs showing the strange MW behavior
and normal behavior from Einstein on the same system with identical settings. MW logs show a strange "resource backoff" of 10 minutes.
That backoff doesn't show up in Einstein logs.

MW really has two problems unique to MW:
1. Consistent failure to send new tasks when reporting a completed task.
2. Strange "resource backoff" that results in long delays in refilling the cache when all tasks are complete. Hence, we process other projects while waiting.

I have a commented log (62K text file) with my settings and how to duplicate the issue. I can send it to you via private message if you are interested. I don't
know what the limit is for text in a forum post. I could paste it in a forum post for all to see and analyze if a 62k post is allowed. I don't use any internet shares,
so I can't post a link. All my data is kept local.

I might be looking at the log incorrectly. But, I and maybe many others in the community would like to see some interest from the project in resolving this issue.
It's not just going to go away.

Pretty please,

Joe
5) Message boards : Number crunching : Delay in getting new work units untill all work units have cleared (Message 69128)
Posted 27 Sep 2019 by JHMarshall
Post:
Has anyone seen any response from a Milkyway administrator/developer/moderator on this issue since Jake left the project?

Joe
6) Message boards : Number crunching : Large surge of Invalid results and Validate errors on ALL machines (Message 68757)
Posted 19 May 2019 by JHMarshall
Post:
Ditto!! Validate errors on both Nvidia and AMD (ati) tasks.
7) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 68519)
Posted 11 Apr 2019 by JHMarshall
Post:
Vortac

Very interesting.

From your log this one worked:

11/04/2019 18:56:52 | Milkyway@Home | [sched_op] NVIDIA GPU work request: 129416.79 seconds; 1.00 devices
11/04/2019 18:56:56 | Milkyway@Home | Scheduler request completed: got 200 new tasks

and this one failed:

11/04/2019 18:18:06 | Milkyway@Home | [sched_op] NVIDIA GPU work request: 127884.51 seconds; 0.00 devices
11/04/2019 18:18:09 | Milkyway@Home | Scheduler request completed: got 0 new tasks

Notice the failed request had 0.00 devices. The question here is why 0 devices?
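To hunt for more of these in a full log, something like the sketch below could be used. It assumes the `[sched_op]` lines have exactly the format shown in the two excerpts above; the function name `zero_device_requests` and the regular expression are illustrative, not part of BOINC.

```python
import re

# Matches lines such as:
# 11/04/2019 18:18:06 | Milkyway@Home | [sched_op] NVIDIA GPU work request: 127884.51 seconds; 0.00 devices
WORK_REQ = re.compile(
    r"\[sched_op\] (?P<resource>.+?) work request: "
    r"(?P<seconds>[\d.]+) seconds; (?P<devices>[\d.]+) devices"
)


def zero_device_requests(lines):
    """Return (line, seconds) for each work request that asked for 0 devices."""
    hits = []
    for line in lines:
        m = WORK_REQ.search(line)
        if m and float(m.group("devices")) == 0.0:
            hits.append((line.strip(), float(m.group("seconds"))))
    return hits


# The four log lines quoted in the post.
log = [
    "11/04/2019 18:56:52 | Milkyway@Home | [sched_op] NVIDIA GPU work request: 129416.79 seconds; 1.00 devices",
    "11/04/2019 18:56:56 | Milkyway@Home | Scheduler request completed: got 200 new tasks",
    "11/04/2019 18:18:06 | Milkyway@Home | [sched_op] NVIDIA GPU work request: 127884.51 seconds; 0.00 devices",
    "11/04/2019 18:18:09 | Milkyway@Home | Scheduler request completed: got 0 new tasks",
]

for line, secs in zero_device_requests(log):
    print(line)
```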


Joe
8) Message boards : News : 30 Workunit Limit Per Request - Fix Implemented (Message 68461)
Posted 31 Mar 2019 by JHMarshall
Post:
I also think mmonnin has found the problem for fast clients. On my fastest system I complete the 200 WUs in 40 min, getting no new work for requests each time completed WUs are reported. After all WUs are completed, my BOINC client does not request any new work for 10 minutes. So 40 min computing and 10 minutes sitting idle is not very efficient!!
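As a quick check on what that costs, here is the arithmetic for the 40 min busy / 10 min idle cycle quoted above (the helper name `duty_cycle` is just for illustration):

```python
def duty_cycle(busy_min: float, idle_min: float) -> float:
    """Fraction of wall-clock time the GPU spends computing."""
    return busy_min / (busy_min + idle_min)


# 40 min computing, then 10 min waiting for the client to ask for work again.
print(f"{duty_cycle(40, 10):.0%}")  # prints 80%
```

So roughly one fifth of the GPU's time is lost to the gap.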


I would certainly like to see the next_rpc_delay reduced.

Joe
9) Message boards : News : Update on This Weeks Errors (Message 66747)
Posted 23 Oct 2017 by JHMarshall
Post:
I've had a couple of systems running MW on autopilot for the last few days while I address other issues. I just got back to all the commotion. WOW!! I don't know about the others, but I crunch for the science, not for the worthless credits. So, I just brought 4 more machines back onto MW because so many were whining they were dropping out.

MW and Einstein were my first BOINC projects and they are still my favorite GPU projects.

Thanks for the hard work Jake and keep your head up, but duck when you have to!!!

Joe
10) Message boards : Number crunching : Errors (Message 66609)
Posted 15 Sep 2017 by JHMarshall
Post:
I'm seeing errors also and so are other systems crunching the same WUs.

Update: rebooted system and first 4 WUs completed normally.
Will update after more results. ???????

On my systems the "de_modfit_fast_20_3s_146_bundle5_" WUs are getting computation errors. Other WUs seem to be working.
11) Message boards : Number crunching : Updated GPU Requirements (Message 66013)
Posted 13 Dec 2016 by JHMarshall
Post:
Just added a Geforce 210 into a PC without on-board video. Looks like it is not supported by Milkyway. Tried to download some tasks, and am not getting any. I'm just checking to confirm that this card will not work here.

Thanks.


The GeForce 210 does not have double precision compute capability required by MW.

Joe
12) Message boards : News : Scheduled Maintenance Concluded (Message 65889)
Posted 17 Nov 2016 by JHMarshall
Post:
Great work! My 7950s,7970s, and R9 280Xs are all cranking away!

Joe
13) Message boards : Number crunching : AMD Radeon R9 Fury X - app_info.xml and apps - optimizations (Message 64946)
Posted 27 Jul 2016 by JHMarshall
Post:

I'm a bit disappointed that a project task lasts ~16 seconds with the Milkyway 1.36 ATI app on one FuryX VGA card (1 WU/GPU).
I looked at other PCs, e.g. hostid=590597 with 'R9 200 Series - Hawaii' VGA cards. This VGA card has just 44 Compute Units (CUs), and a task lasts ~15 seconds.
My FuryX has 64 CUs, but a task lasts ~16 seconds.

Is there something wrong - possibilities to optimize/fine tune?

Thanks.


Take a look at memory speed and whether the R9 280 has 128bit, 256 bit or 384 bit thruput speeds, the faster the memory and the faster the bits get transferred at one time, the faster the card crunches. It's not JUST the CU's anymore.


It's not just memory speed here. Since MW uses double precision (DP) calculations, the card's DP compute capability is the real driver.

R9 280 series DP = 1/4 SP
R9 290 series DP = 1/8 SP
R9 Fury X series DP = 1/16 SP

The R9 280 series may not be the fastest single precision performer but the 1/4 DP to SP ratio makes it the leader in double precision for the $.
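To put those ratios in perspective, the sketch below works out how much more single precision throughput a card would need just to tie an R9 280 in double precision. It uses only the DP:SP ratios listed above; the dictionary and function names are made up for the example, and real cards differ in other ways (clocks, memory, drivers) that this ignores.

```python
from fractions import Fraction

# DP:SP ratios as listed in the post.
DP_RATIO = {
    "R9 280 series": Fraction(1, 4),
    "R9 290 series": Fraction(1, 8),
    "R9 Fury X": Fraction(1, 16),
}


def sp_multiple_to_match(card: str, reference: str = "R9 280 series") -> Fraction:
    """Multiple of the reference card's SP throughput that `card` needs
    in order to equal the reference card's DP throughput."""
    return DP_RATIO[reference] / DP_RATIO[card]


for card in DP_RATIO:
    print(card, sp_multiple_to_match(card))
```

By this measure a Fury X would need 4 times the SP throughput of an R9 280 to match it in DP, and an R9 290 would need twice as much.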

Joe
14) Message boards : Number crunching : Computation Error on (Message 64873)
Posted 12 Jul 2016 by JHMarshall
Post:
My systems are normally very stable on MW; however, in the last day I've had 68 errors and 11 invalids while successfully completing over 5000 WUs. I checked the wingman results for all the tasks producing errors and they also had errors. None of these WUs were successfully completed and all were cancelled because of too many errors. Check your wingman results on your tasks. If they also had problems, it is just a batch of bad WUs.

Joe
15) Message boards : Number crunching : ATI 7770 issues (Message 56649)
Posted 27 Dec 2012 by JHMarshall
Post:
Mark,

Here are some run time references for comparison to your numbers. I have a Pentium E5300 dual core system (2.6GHz) on which I have recently run both an HD 7770 and an HD 7950. I don't have older run times for comparison. I run BOINC 7.0.28 and Win 7 Pro 64 bit SP1 with 8 GB RAM. It's an older system with a PCI-E 1.1 bus. One core runs Einstein (CPU tasks only) and the other core is free for MW and whatever else I'm doing, like surfing. Here are the times:

HD 7770 - 515 secs runtime with 8.8 secs CPU
HD 7950 - 61 secs runtime with 3.7 secs CPU

The only difference in the runs was the GPU, everything was the same.
I run Catalyst 12.6. I've had increased run times when trying all the more recent releases so I've stuck with 12.6.

Joe
16) Message boards : Number crunching : Nvidia Geforce GTX 650 Ti slower than GTX 285 ? (Message 56554)
Posted 17 Dec 2012 by JHMarshall
Post:
Thanks Joe, it all makes sense now.

I've just read that for the GTX285 the DP performance is only 1/12 that of the SP performance, this would explain why it is faster than the GTX650 Ti even though it has a slower SP performance.

I think I'll buy an AMD card next time.

Best Regards,
Daniel


Other projects, like DistRTgen and PrimeGrid, LOVE the Nvidia cards MUCH more!! There are still other projects, that like this one but SP projects, love the AMD cards more. Unfortunately there is no 'one card fits all' for all the possible gpu projects.


Right, that's why I run both types. Most of the time my Nvidia card runs Einstein, GPUGrid, or PrimeGrid. I move it to MW mostly for challenges. The AMD cards are king on MW because of the DP requirement. They also do well on Einstein (SP) which has a very good OpenCL app for AMD.

Joe
17) Message boards : Number crunching : GTX670's and the MilkyWay project (Message 56553)
Posted 17 Dec 2012 by JHMarshall
Post:
Sorry I haven't gotten back to you, I was on vacation and shut down.
Joe


I now go on vacation and leave my pc's ON and crunching, if they crash they crash, but if they don't it is better for me. It is VERY hard to compete with your Team if I don't keep them running!! I am kidding of course, your Team has some VERY prolific crunchers on it!! AND I LOVE the picture of Ingrid's home on the cliff!!


My AMD 7950 systems occasionally hang when starting a task and I hate the idea of them sucking juice and not doing anything for several days!

It sure does mess up my RAC when I shut down though!

Joe
18) Message boards : Number crunching : GTX670's and the MilkyWay project (Message 56540)
Posted 16 Dec 2012 by JHMarshall
Post:
Sorry I haven't gotten back to you, I was on vacation and shut down.

I can't see your task or work unit and I'm probably not the best person to answer/diagnose your problem. All I know is that my GTX 560 Ti runs fine but not as fast as my AMD cards.

I'm using BOINC 7.0.28, Windows 7 Pro 64 bit, and NVidia driver (from the BOINC log):
"306.97, Cuda 5.0, Compute capability 2.1".

I looked in the top 100+ computers on MW and all NVidia cards were using 306.97. That said, when I first started using the NVidia card on Einstein it was at 301.42. I don't remember if I updated the driver before or after I started using the card on MW.

Joe
19) Message boards : Number crunching : Nvidia Geforce GTX 650 Ti slower than GTX 285 ? (Message 56539)
Posted 16 Dec 2012 by JHMarshall
Post:
Sorry, I've been on vacation. The double precision throughput of the NVidia 6xx series cards is 1/24 x the single precision.

You can see from the AMD tables on the 79xx series that they are 1/4 x single precision. This is why the AMD cards are much faster in DP. In single precision the NVidia cards and AMD cards are closely equivalent for the dollar, but in DP the AMD cards can be 6 times faster.

NVidia wants you to buy Tesla cards for several thousand dollars, but even these don't compete well on a cost basis with the AMD cards in DP. However, the Tesla cards do have lots of error checking and correction built in that make them more reliable for critical computations. These projects handle the reliability problems by sending out work units to multiple machines to verify results.

Joe
20) Message boards : Number crunching : Nvidia Geforce GTX 650 Ti slower than GTX 285 ? (Message 56509)
Posted 15 Dec 2012 by JHMarshall
Post:
Try these

http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units


http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units




