
Posts by Gary Roberts

1) Message boards : Number crunching : Radeon 4800 (Message 66099)
Posted 8 Jan 2017 by Profile Gary Roberts
Have you been through this thread? You should have added your voice there - squeaky wheel syndrome :-).

There are lots of people with the same old cards which stopped being able to get work quite a while ago. OpenCL 1.0 doesn't seem to be the problem. The reason given was that some "update to the BOINC scheduler" has prevented these cards from being recognised.

I have no idea why it takes so long to fix this :-(.
2) Message boards : Number crunching : Updated GPU Requirements (Currently not supporting GPU tasks) (Message 63036)
Posted 18 Jan 2015 by Profile Gary Roberts
The HD4800 series card supports OpenCL version 1.0. I think Milky Way requires version 1.1

Nope. I have several hosts with HD4850 GPUs (OpenCL 1.0) that have been crunching continuously here for a long time without problems. Here is one that joined up back in May 2009.
3) Message boards : Number crunching : Please assist - can't get Radeon HD 4850 recognized in Linux (Message 59481)
Posted 30 Jul 2013 by Profile Gary Roberts
Thanks, Gary! That really made my day – so glad to know that you're continuing to crunch away on the 4850s with Linux.

You're most welcome. When people post useful experiences, they deserve to be told how useful it was and to be given thanks accordingly.

And thank you also for the advice on OC'ing the 4850. I just bumped up to 700, 750. More power to MilkyWay!

I've done the transition to 3 hosts so far without problem and #4 will be ready to go very shortly. I'm running down existing caches rather than trashing any tasks. That's quick to do with MWAH but takes longer for EAH. So I've adopted the strategy of suspending the unstarted EAH tasks and just waiting for the running tasks to finish. After reporting, and stopping BOINC, I can then remove the unstarted tasks by editing the state file. This way, upon restarting under Linux, the scheduler at EAH will notice the missing tasks, mark them as 'lost' and resend them to me, fully prepped to be crunched under Linux. Works a treat!
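As a rough sketch of the state-file trick (not my exact procedure, and all the names below are made up for illustration), the idea is simply to delete the whole <result>...</result> blocks for the unstarted tasks while BOINC is stopped, always keeping a backup first:

```python
import re

def drop_lost_candidates(state_xml: str, keep: set) -> str:
    """Remove whole <result>...</result> blocks whose <name> is not in keep.

    Illustrative only - run against a backup copy of client_state.xml
    with BOINC stopped. The scheduler then sees the missing tasks as
    'lost' and resends them on the next contact.
    """
    def _filter(m):
        name = re.search(r"<name>(.*?)</name>", m.group(0)).group(1)
        return m.group(0) if name in keep else ""
    return re.sub(r"<result>.*?</result>\s*", _filter, state_xml, flags=re.S)

# Hypothetical fragment of a state file, with hypothetical task names:
sample = """<client_state>
<result><name>running_task</name><state>2</state></result>
<result><name>unstarted_task</name><state>1</state></result>
</client_state>"""

print(drop_lost_candidates(sample, keep={"running_task"}))
```

The running task survives, the unstarted one is gone, and the project resends it prepped for the new platform.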

With regards to changing frequencies for the 4850, I've discovered that you can go lower than the lower limit of the 'allowed range'. So I tried setting the clocks to 700, 500 and it worked! It seems to give about 1-2 deg C reduction in temperature from the value reported for 700, 750, although I haven't yet tested things thoroughly enough to be certain it's real. 700, 500 seems to give no observable change in the crunch time so I'll try 700, 300 on the next host and see how that goes.

p.s. 1,000,000,000+ credits!!! – wow


12 4850s crunching for over 4 years should achieve something, I guess :-).
4) Message boards : Number crunching : Please assist - can't get Radeon HD 4850 recognized in Linux (Message 59465)
Posted 27 Jul 2013 by Profile Gary Roberts
... I didn't think the 2.7 SDK would work on a 4000 series, or at least not 'officially'. ;-)

I came across this thread recently while doing a bit of googling for things like catalyst drivers for Linux, OpenCL runtime and AMD APP SDK, HD4850, etc. I have 12 hosts with 4850s running Win XP. I'm doing Einstein on the CPUs and MW on the GPUs.

These 12 hosts are pretty much the last Win XP machines I have, as all my Einstein hosts run Linux (PCLinuxOS). The bulk were converted in the 2007-2009 period. I started MW GPU crunching in early 2009 and Win XP was the logical choice for the OS back then. The 12 GPU hosts have run pretty much continuously since the start. They crash occasionally and several have shown signs of overheating, but I clean the CPU and GPU fans when needed. Usually that fixes things, but on several now I've had to re-grease the GPU heat sink in order to get the temperature back to normal. All-in-all they soldier on with minimal interference from me, so there hasn't been any great urge to change the OS.

That all changed recently when the hard disk failed on one of them. I didn't relish the prospect of reinstalling XP so I decided it was the perfect time to transition all of them to Linux, with the failed hard drive host being the test machine. I read a lot of stuff at various places, but this thread was the clincher - the proof that it should be quite easy to do. Many thanks in particular to Shodan7 and captainjack for their valuable comments. The information they shared made the whole exercise a snap.

The failed drive wasn't being recognised by the BIOS so I had the thought that it mightn't be mechanical failure but rather a failure of the interface electronics. The drive is a Seagate ST320014A - 20GB IDE. I got a lot of these drives in 2007 in a bunch of P3 machines I bought very cheaply at auction. They would have been around 5 years old then so more than 10 years now. They have been surprisingly reliable - very few failures while I've owned them. So, before throwing it away, I decided to try swapping the interface board from a known good ST320014A. When I powered up, the drive was immediately recognised, Windows booted normally and crunching recommenced. You get lucky sometimes :-). However, my mind was made up. I was going to transition this machine to Linux - period!

All I had to do was put this refurbished drive aside (just in case) and install the latest PCLinuxOS on a replacement. I use a live USB HD image for installation so it's very quick compared to optical media. I also keep a fully updated copy of the PCLinuxOS repository on the same disk so immediately after the install I can update all packages and add any new ones required. Magically quick, compared to the internet. On my LAN, I have a machine that acts as a Samba file server so I never have to actually 'install' BOINC or download any apps, data files, etc. I keep 'templates' of everything which I just retrieve from the Samba share. So I just choose the BOINC version template I want to use, add the account xml files for the projects, add the required project directories with all their data files and apps and edit the state file template to give the new host its identity. In this case, I wanted the new host to have the same identity as the Win XP host it was replacing - for both MWAH and EAH. This is quite easy to do with a few key edits in the state file template.

After setting up all the BOINC and project files, confirming that the Catalyst 13.1 proprietary driver (fglrx-legacy for HD4850 and older) was installed, I downloaded the 2.7 version of the AMD APP SDK from the AMD website. I unpacked the tar file and ran the install shell script and it did its thing and reported success with no error messages.

The last thing was to reboot the machine and then with fingers crossed, launch BOINC. I was pleasantly surprised when I was able to download new tasks and start crunching them with no issues. So I can indeed confirm that the 2.7 SDK is fine for crunching on a HD4850 here at MWAH. I can't use these GPUs at EAH so I'm very happy they continue to be useful here.

I was happy to see tasks being returned and validated. The crunch times were slightly longer than for Win XP (120 secs compared to 115 secs). Then I remembered that the core frequency had been pushed up to 700MHz under Windows. So I tried 'aticonfig --help' and found all the options needed to enable overdrive, to read and set the clocks, and to commit the new values for posterity. So the card is now running at 700MHz (the highest allowed) for core and 750MHz (the lowest allowed) for memory. I remember the discussions years ago that memory clock frequency should be set as low as possible (to reduce heat) without affecting crunch time. I'd like to set it a lot lower than 750MHz.

So once again, many thanks to Shodan7 for the notes on how he achieved it and to captainjack for pointing out the aticonfig options. If anybody is interested in the host details, its hostID is 108757 and if you drill down to dates/times around 26 July 7:00 AM UTC you will see the last task crunched under WinXP and a few hours later, the first task crunched under Linux. An hour or two later you will also see the reduction in crunch time when I overclocked the core frequency. All I had to do (as root) was

# aticonfig --od-enable      (to enable overdrive)
# aticonfig --od-getclocks   (to see the default clocks [625,933] and the allowed limits)
# aticonfig --od-setclocks=700,750 (to set the new frequencies)
# aticonfig --od-commitclocks      (to save for posterity)
# aticonfig --od-gettemperature    (to check the temperature after overclocking)

The value reported was 62deg C, which seems pleasantly low when compared to what was reported in Catalyst control center under Win XP (70-80deg C if I remember correctly). As I write this, crunching under Linux has been in progress for over 12 hours with no tasks reporting as errors. Now, 11 more hosts to go! :-).
5) Message boards : News : Apology for recent bad batches of workunits (Message 55823)
Posted 16 Oct 2012 by Profile Gary Roberts
No, there are lots of crunching errors from tasks that were sent today....

Are you crunching two at a time?

Yesterday, all my hosts (which were crunching two at a time) were rapidly erroring out full caches of tasks. I noticed that if I got a new cache of work and immediately suspended all but one task, then that one task would complete without problems. If I then unsuspended a second task, it would also crunch successfully. The minute I tried to launch a second task with one already running, it would try to start and then error out after about 5 seconds.

So I reconfigured all my hosts to only crunch one at a time and the major problem went away. All have been crunching without further issue for about 12 hours now. I've had a look through a couple of lists of completed tasks and there are occasional (and quite different) failures - I would estimate about 5% of tasks are failing like this. With these tasks that fail, they seem to go to normal completion and then fail right at the very end.

A normal and successful completion shows the following right at the end of the stderr.txt output:

Integration time: 26.084413 s. Average time per iteration = 40.756896 ms
Integral 2 time = 26.670313 s
Running likelihood with 66200 stars
Likelihood time = 0.281871 s
<background_integral> 0.000229475606607 </background_integral>
<stream_integral>  29.075788494514907  1751.674920113726300  265.410894993209520 </stream_integral>
<background_likelihood> -3.630395539836397 </background_likelihood>
<stream_only_likelihood>  -50.559887396327156  -4.291799741661306  -3.193754139881464 </stream_only_likelihood>
<search_likelihood> -2.933654500572366 </search_likelihood>
06:07:42 (4016): called boinc_finish


A task that fails has the following output (note the 5th and 6th lines - the *** are mine):

Integration time: 26.096254 s. Average time per iteration = 40.775398 ms
Integral 2 time = 26.683731 s
Running likelihood with 66200 stars
Likelihood time = 0.345676 s
*** Non-finite result
Failed to calculate likelihood ***
<background_integral> 0.000127021322682 </background_integral>
<stream_integral>  0.000000000000000  21.097641379193686  135.128876475406910 </stream_integral>
<background_likelihood> -4.749804775987869 </background_likelihood>
<stream_only_likelihood>  -1.#IND00000000000  -3.735624970800870  -173.837339342064330 </stream_only_likelihood>
<search_likelihood> -241.000000000000000 </search_likelihood>
06:09:32 (1384): called boinc_finish


It would appear that the likelihood calculation is failing, perhaps through a 'divide by zero' or something like that.
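As a purely illustrative sketch of how that could happen (this is not the actual MW code): if a stream integral underflows to zero, a log-likelihood term like log(integral) goes non-finite. In C that produces -inf or NaN, and the '-1.#IND' in the stderr output above is just how older MSVC runtimes print a NaN:

```python
import math

def log_likelihood(stream_integral: float) -> float:
    """Toy log-likelihood term: blows up when the integral is zero."""
    try:
        return math.log(stream_integral)
    except ValueError:        # math.log(0) raises in Python; C gives -inf/NaN
        return float("nan")   # printed as -1.#IND by older MSVC runtimes

print(log_likelihood(29.0757))   # finite, well-behaved (a value from the good task)
print(log_likelihood(0.0))       # nan - the "Non-finite result" case
```

Note that the failing task's stream_integral output above does indeed start with 0.000000000000000, which fits this picture.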
6) Message boards : News : Separation updated to 1.00 (Message 53062)
Posted 10 Feb 2012 by Profile Gary Roberts
Those having apparent work-fetch problems with 7.X.X .... at present it works in a different way, I suspect - only that, no definitive knowledge - that it's being designed to reduce cache sizes at the user end, particularly massive days-long ones, due to the problems that can give servers.

It certainly works in a different way, but it's nothing to do with reducing cache sizes. The two cache settings now have different meanings in 7.x.x. The 'Connect Every ...' (CE) setting should now be regarded as the 'low water' mark. Many people would have this at zero (from past behaviour), and if you do, that is why BOINC won't get work until you actually run out. The 'Extra Days' (ED) setting should now be regarded as an increment above that, defining a 'high water' mark.

So, if you want BOINC to maintain a cache between X and X.01 days, you would set CE=X and ED=0.01.
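As a rough sketch of my reading of the 7.x.x behaviour (this is not the actual BOINC client code, just the low/high water idea above):

```python
def work_to_request(cache_days: float, ce: float, ed: float) -> float:
    """Days of work to ask for: nothing until the cache drops below the
    low-water mark (CE), then refill up to the high-water mark (CE + ED)."""
    if cache_days >= ce:
        return 0.0
    return (ce + ed) - cache_days

print(work_to_request(cache_days=2.5, ce=2.0, ed=0.01))  # 0.0 - still above low water
print(work_to_request(cache_days=1.8, ce=2.0, ed=0.01))  # ~0.21 - top up to 2.01 days
```

With the 'old style' settings (CE=0, ED=3.0) this rule never asks for anything until the cache is empty, and then asks for the lot - exactly the feast-or-famine behaviour being reported.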

It's as simple as that. Of course, it's not really documented properly anywhere and if you gripe about that you would probably be told that nothing is guaranteed with 'bleeding edge' versions so why are you using one ... Projects will continue to try to use new features that aren't ready for prime time so you can't really blame BOINC entirely for all the teething problems. It might have been better for volunteers if the new MW 1.0x app had been released in a testing environment similar to the way that the new Einstein CL app is being tested in the Albert@Home test project. I know that requires a lot more resources at the project end but it's a lot more 'friendly' for the long suffering volunteers.

At present, the cache will reduce down without being refreshed. At the point when it's almost dry, it will then download what you set in the cache preferences, all the other supply constraints being equal. Often had 200 WU downloads on other projects in one lump after apparent famine. It's feast or famine and will not play unless the cache is near zero.

This is exactly what to expect if you are using 7.x.x with 'old style' cache settings - something like CE=0 and ED=3.0 for example.

7) Message boards : News : had some corruption in the searches (Message 41721)
Posted 24 Aug 2010 by Profile Gary Roberts
I am not sure how BOINC figures out how much work to request.

The client part of BOINC requests work based on the values you set for two particular user controlled preferences - 'connect to the internet every X.XX days' and 'maintain enough work for an additional Y days'. I think the max for each is 10 days so it is theoretically possible to ask for a total of 20 days work. Of course, with a deadline of just 7 days, the theoretical max value is irrelevant.

The crazy part is that the log snippet you posted shows a request for 79.4 days of work spread over 3 GPUs. So there has to be some weird bug(s) in the BOINC client you are using that generates such an impossible work request. BTW, you mention 12 CPUs and 4 GPUs but your hosts are hidden so it's not possible to see what BOINC thinks about your host (or hosts). Also, you support quite a number of projects but how many of these are all fighting for work in competition with MW?

The server part of BOINC is bound to reject a 79.4 day request because it figures that the work could not possibly all be returned within the deadline. A critical part of the 'thinking' at the server end is to do with what you set for the first of the two preferences because if you set a large value there, the scheduler has to allow for the possibility that the client may really not be able to make a further contact with the server (to return any completed results) for that large number of days.

The 79.4 day request seems to suggest that perhaps you have large values for both preferences. If you have, it would be very interesting to experiment with something more reasonable like 0.01/1 or 0.01/2 or even 0.01/3 and see what happens. The first preference should always attempt to reflect reality - use a very low value or even zero if you have an 'always on' internet connection. Don't go overboard with the size of the 'extra days' preference or else you risk triggering BOINC bugs which cause the scheduler to make weird decisions just like you are seeing.

I don't know if any of this is relevant to your situation or not. It shouldn't be too hard to do a few experiments with preferences and see what happens. In the end it may just be that BOINC simply cannot handle the mix you are throwing at it. Since GPU processing was tacked onto BOINC as an afterthought, it's probably going to take quite a while yet for all the issues to get sorted out.

8) Message boards : Number crunching : nbody (Message 41711)
Posted 24 Aug 2010 by Profile Gary Roberts
You shouldn't post the same question in multiple threads. For a possible solution, check the response posted in the News thread. I don't think it has anything to do with nbody tasks as they are CPU only at this stage.
9) Message boards : News : had some corruption in the searches (Message 41709)
Posted 23 Aug 2010 by Profile Gary Roberts
8/23/2010 6:54:46 PM Milkyway@home [sched_op_debug] ATI GPU work request: 6863331.47 seconds; 3.00 GPUs

Look more closely. Your request is for nearly 7M secs of work and not 686K secs. Since a full week (the deadline) is just over 600K secs, it's not really surprising that the scheduler doesn't want to send you much work. How do you get your client to actually ask for 6863331.47 seconds of work for 3 GPUs? That is actually over 26 days of work per GPU. How are you able to set your cache that high?
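A quick sanity check on the arithmetic from that log line:

```python
# The request from the sched_op_debug line, converted to days.
SECONDS_PER_DAY = 86400
request = 6_863_331.47   # seconds of ATI GPU work requested
gpus = 3

total_days = request / SECONDS_PER_DAY
print(round(total_days, 1))         # about 79.4 days in total
print(round(total_days / gpus, 1))  # about 26.5 days per GPU, vs a 7-day deadline
```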
10) Message boards : Number crunching : Waiting for validation... (Message 38741)
Posted 14 Apr 2010 by Profile Gary Roberts
Here is my list waiting validation ...

No, that's not your pending list :-). It's actually the pending list of the person who happens to click it - provided that person is logged in. If not logged in, I imagine it would be a login screen.

Hover your mouse over the link and you will see what I mean :-).
11) Message boards : Number crunching : Marked as Invalid? (Part 2) (Message 38509)
Posted 9 Apr 2010 by Profile Gary Roberts
Does that not mean that the results in this case that are getting the status 'Valid' actually are 'Invalid' ...

I don't know why you would think that.

If <3% of results are marked invalid then >97% are marked valid. All that means is that >97% of results agree with each other. This makes no statement about whether the results are actually right or wrong. All non-5800 series crunching methods gave correct answers anyway. All 5800 series cards that are running V0.23 are giving correct answers now. The chance of two remaining 'bad versions' on 5800 series cards getting to form a quorum is decreasing and is probably quite low now. So I would think that the majority of 'valid' results are also correct results.

... or in other words the validation system is not working?

The validation system is working as designed. The validator can't rectify the situation if two hosts each send back the same incorrect answer. The situation will improve further if those still using the 'bad' versions get 'encouraged' into upgrading to V0.23. Send some PMs to offenders rather than suggesting that the validator is broken.
12) Message boards : News : upgrading the ATI 58x0 application (Message 38497)
Posted 9 Apr 2010 by Profile Gary Roberts
I have 5770 - BOINC sees it (1360 GFLOPs), but MilkyWay says "An ATI GPU supporting double precision math is required".

Take a look at the GPU Requirements sticky in Number crunching.

In particular, this message gives the full story.
13) Message boards : News : stock ATI 58x0 apps updated (Message 38375)
Posted 8 Apr 2010 by Profile Gary Roberts
What if at this stage, the computer which latest wu was sent to, detaches?

If it was the proper BOINC 'detach' operation, the client would advise the server that the task was not going to be completed and the server would immediately send out a new copy to a 4th machine.

A worse scenario would be if the owner of the 3rd machine happened to turn his machine off and go on holidays for a month. The server would not know and would have to wait for the deadline to expire before it could send out the 4th copy. The server can be infinitely patient and the quorum would eventually get completed. By that time it would be unlikely that such a delayed result would be of any use to the project and it would most likely be discarded. However, the hosts which were deemed to have supplied the agreeing results would still get credit even if the result was discarded.
14) Message boards : Number crunching : Any 2/3 length ATI cards available for double precision? (Message 38345)
Posted 7 Apr 2010 by Profile Gary Roberts
HD4770 thats like standing still while a HD5770 does a moonwalk over it.

I wonder why the HD5770 isn't on the supported GPU list then :-).
15) Message boards : News : testing new validator (Message 38344)
Posted 7 Apr 2010 by Profile Gary Roberts
Two anonymous platforms sporting versions 0.20b and 0.22 out-quorumed an HD5870 running version 0.23.

There are probably a significant number of people running AP and not paying close attention to the boards. Anybody noticing cases of 5800 series cards still running the wrong app should send a PM to the owner (if possible) since that will give them an email as well. Hopefully they are monitoring their email a bit more closely.
16) Message boards : Number crunching : GPU Requirements [OLD] (Message 38339)
Posted 7 Apr 2010 by Profile Gary Roberts
Just a quick note, the rare HD4830 works great also.

It's already listed in the opening post of this thread.
17) Message boards : News : validator strictness (Message 38245)
Posted 7 Apr 2010 by Profile Gary Roberts
... the format definitions have changed and are not compatible between different CAL versions (i.e. between 1.3 and 1.4). But as the HD5800 GPUs require a driver with CAL 1.4 support, that is not a showstopper. But as a consequence, I will now provide only builds for CAL 1.4 (i.e. Catalyst 9.3 and up). As newer Catalyst versions appear to run quite well also for older GPUs, I think it is about time for this step.

Also any person still using an older card with CAL 1.3 could still continue using their current app (rather than upgrading to CAL 1.4 and the corrected app) since (until Travis pulls the plug on Milkyway2) the old app would continue to give correct answers on their old hardware. People in that category should make preparations to upgrade to CAL 1.4 as they would seem to need that once Milkyway3 replaces Milkyway2.

Just give me a few more minutes to compile the different versions. I will post the links in the number crunching section so Travis/Anthony can upload them as stock apps and everybody with a HD58x0 GPU can use them as soon as possible.

Thank you very much for sorting this out so quickly.
18) Message boards : News : validator strictness (Message 38236)
Posted 6 Apr 2010 by Profile Gary Roberts
I think the application is going to need to be updated before the problem gets fixed. I'll make a news post as soon as we have new applications for the 58x0 series.

Just to clarify things a bit, if CP comes out with a corrected 'current generation' app before you release your new source code, that would provide an immediate solution to the 'invalids' problem if all 5800 series owners were to immediately adopt the new app.

You have mentioned several times about 'releasing the new code' and 'allowing people to compile their own apps' but I don't think you actually spelled out exactly what precompiled apps you would be releasing as well. I might be wrong but I got the impression at one point that you might be building for CPU and CUDA but perhaps not for ATI? In other words, we would need to rely on the continuing services of CP or someone else to port the new code and build the appropriate ATI apps. Is this how things will work when you release the new code?
19) Message boards : News : quorum down to 2 (Message 38235)
Posted 6 Apr 2010 by Profile Gary Roberts
I agree that something along the lines of crunch3r's suggestion should be adopted until CP announces a corrected app.

Then the real fun starts. How do you guarantee that all 5800 series owners actually start using the corrected app? Those using stock apps should be converted automatically. Those running under AP (and not paying attention) will continue to pollute the result stream. If there are enough of those they may still be able to cause incorrect results to be validated, particularly with a quorum of only 2.

Maybe it's possible to discriminate based on both GPU series and app version?
20) Message boards : MilkyWay@home Science : Screen saver? (Message 38214)
Posted 6 Apr 2010 by Profile Gary Roberts
Try reading this thread.
