Posts by mikey
1) Message boards : Number crunching : WUs sticking on AMD RX 560 GPU (Message 67515)
Posted 4 days ago by mikey
When I restarted the computer just now (for another reason), it's now stalling again. Maybe the card's dodgy and needs to warm up?!


Another thing to consider is what else the PC is doing that could be causing slowdowns.
2) Message boards : Number crunching : WUs sticking on AMD RX 560 GPU (Message 67495)
Posted 5 days ago by mikey
Both machines are running Asteroids on their CPUs. The good machine has an old 4 core i5 3570K. The offending machine has a brand new i5 8600K. So the machine that's playing up has a less powerful GPU with a CPU that has more cores, each core with more power than the good machine, so I can't see that limiting anything. The WUs are also not showing much in the way of CPU usage, as I'd expect with a faster CPU and slower GPU. When they pause, the main timer ticks on, but not the CPU timer, and the % complete sticks for a minute.

I'll have to look at it tomorrow to see if the GPU usage falls off when it sticks, by looking at the MSI Afterburner software. I neglected to set up remote access for it, and it's not somewhere I can easily connect a monitor just now.


Stop the Asteroids CPU units for the length of one MW workunit and see if the MW units stop stuttering; if so, you've found the problem.
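If you'd rather run that test from the command line than click around in the Manager, something along these lines should work on a standard install (boinccmd is in the BOINC program folder; run it from the BOINC data directory, or pass --passwd, so it can authenticate, and substitute the Asteroids project URL exactly as your Manager shows it):

boinccmd --project <asteroids_project_URL> suspend
(let one MW workunit run through, then)
boinccmd --project <asteroids_project_URL> resume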
3) Message boards : Number crunching : New Linux system trashes all tasks (Message 67494)
Posted 5 days ago by mikey
Thanks for the reply. I guess I will have to delete the project from that computer or just put it to indefinite suspend till it gets sorted out by the project.

I am running a specially compiled version of BOINC made for SETI users. It does not have the 1000-task restriction on the number of tasks allowed that any BOINC version > 7.02 has.

I will not update it to anything later as that would defeat my SETI usage which is my primary project.


Then load a separate instance of BOINC just for MW.
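A rough sketch of how that is usually done, assuming a reasonably recent client that supports the --allow_multiple_clients switch; the data directory and RPC port below are just example values, and the account key is whatever your MilkyWay account page shows:

boinc --allow_multiple_clients --dir C:\BOINC2 --gui_rpc_port 31418
boinccmd --host localhost:31418 --project_attach https://milkyway.cs.rpi.edu/milkyway/ <your_account_key>

The second instance keeps its own cache and settings, so your SETI-specific build carries on untouched.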
4) Message boards : Number crunching : Which GPU? (Message 67485)
Posted 6 days ago by mikey
I notice graphics card manufacturers are designing cards to work well with coin mining, so why don't they design them to work well with BOINC? We want double precision! And we want it now!


Because we don't buy enough of them to make it worthwhile; we may be many, but we don't buy new hardware often enough.
5) Message boards : Number crunching : Which GPU? (Message 67468)
Posted 7 days ago by mikey
I prefer AMD GPUs for price/performance ratio. But I can't find any information on which models are best for double precision. All the benchmarks just quote certain games, or single precision. And AFAIK Milkyway uses double precision. What do other projects use? Does anyone know where I can look at a list of currently available AMD GPUs to compare double precision?

All I have at the moment is an old R9 290, and a brand new RX 560. The 560 should be half the speed of the 290 (going by the single FP speed on reviews), but it's running at about 1/4 of the speed for Milkyway, so I assume they scrimped on the double precision. Anyone running an RX 580 (which I plan to get for gaming)?


Try here too:
http://www.geeks3d.com/20140305/amd-radeon-and-nvidia-geforce-fp32-fp64-gflops-table-computing/

GeForce GTX 580: 1581 GFLOPS FP32, 197 GFLOPS FP64 (FP64 = 1/8 FP32)
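As a sanity check on reading that table: 1581 / 8 ≈ 198 GFLOPS, which matches the 197 FP64 figure. The ratio is what matters for Milkyway. Going by the usual spec tables (so treat the numbers as approximate), the R9 290 (Hawaii) is rated at FP64 = 1/8 of FP32, while the RX 560 (Polaris) is only 1/16: roughly 4.8 TFLOPS / 8 ≈ 600 GFLOPS FP64 versus 2.6 TFLOPS / 16 ≈ 160 GFLOPS FP64. That's about a 4:1 gap in double precision even though the single-precision gap is only about 2:1, which would explain the quarter speed you're seeing. The RX 580 is also Polaris, so it's 1/16 as well.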
6) Message boards : Number crunching : Huge number of 'Validation inconclusive' WUs (Message 67436)
Posted 15 days ago by mikey
Hello,
at present I have more than 2000 'validation inconclusive' WUs (MilkyWay@Home v1.46 (opencl_ati_101)); these are of the 'Unsent' variety, on my three machines:

https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=764666&offset=0&show_names=0&state=3&appid=
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=765378&offset=0&show_names=0&state=3&appid=
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=763440&offset=0&show_names=0&state=3&appid=

Any idea what's going on?

Many thanks,
max


The problem is a bug in the program. Looking at the set of errors, the first user I looked at had 3 Titans but almost 20,000 errors. If all tasks error out, the number of "pending" will rise to the total number of work units.

I thought only 80 were allowed per day. Even with 3 Titans my math suggests it should have taken a week at 80 per 24 hours. All 19,736 were from May 7, 5 am to May 8, 13:00.


No, it's 80 per GPU, but if you buzz right through them at 2 seconds each you can zip through thousands of them per day, all errors of course! Jake wants us to send him links to computers like the one you found so he can put them on the 'suspicious' list. Unfortunately, when he did that automatically a while back, LOTS of people couldn't bring new GPUs on here to crunch, so it's now a manual process.
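Just to put numbers on 'thousands per day': at roughly 2 seconds per failed task, one GPU can chew through about 86,400 / 2 ≈ 43,000 tasks in 24 hours, and three GPUs about 130,000, so 19,736 errors between the morning of May 7 and the afternoon of May 8 is easily within reach once a card starts erroring everything out instantly.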
7) Message boards : Number crunching : Account Problems (Message 67426)
Posted 19 days ago by mikey
I used my authenticator string to get into that account; it's using another email address and I can't remember the password. I can't understand why they both have the same creation date and the same active computer, which I built back in November.

The email address I'm using now I've had since 1998; it's an old Netscape account that I use for everything but questionable stuff. I don't know how or why it gave my Threadripper rig 2 different host ID numbers. I hope he'll be able to combine both accounts.

Thanks again for your help.


Some projects will do that for you and some won't; it has to do with the science and the permissions they give the people we deal with. Permissions in the sense that if they let them do one thing, it automatically lets them do LOTS of other things too. But only an admin can help you with combining accounts.
8) Message boards : Number crunching : Validation pending (Message 67417)
Posted 27 days ago by mikey
Hi folks,

One of my tasks has status "Validation pending". The workunit lists 2 tasks, one "validation inconclusive" and the other "validation pending". One of these tasks was reported on April 22, the other on April 23. Why is this task not being validated, and why is no new task sent to resolve a potential conflict?

Workunit: https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1604568731

Thanks,

Tom


It means they still have to send it to another PC before you get any credit for it; the one unit was obviously run too short and is therefore marked as invalid.
9) Message boards : Number crunching : "Maximum CPU % for graphics" (Message 67403)
Posted 29 days ago by mikey
Ahh ok. I do not use the screensaver either. I will just leave it alone then. Thanks!


No problem.
10) Message boards : Number crunching : Validation inconclusive (Message 67402)
Posted 29 days ago by mikey
Very interesting! Thanks. Makes sense that it would take another confirmation to validate the results. Next up I'm going to look for more info on what differentiates the n-body WU from the regular one.


N-body workunits use multiple CPU cores on one workunit; regular workunits use one CPU core per workunit.
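If you ever want to cap how many cores an n-body task grabs, the usual BOINC mechanism is an app_config.xml in the MilkyWay project folder. This is only a sketch: the app name (milkyway_nbody), the mt plan class, and the --nthreads switch are assumptions you would want to check against what your Manager actually shows for those tasks.

<app_config>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
</app_config>

After saving it, use Options -> Read config files in the Manager (or restart BOINC) to pick it up.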
11) Message boards : Number crunching : "Maximum CPU % for graphics" (Message 67390)
Posted 24 Apr 2018 by mikey
Maximum CPU % for graphics

I have never understood this setting in MilkyWay@Home preferences.

What the heck does it do? Should I change it? Is it starving my GPU of CPU time if I run multiple WUs on my GPU?


I believe this is how much CPU time it's allowed to use to draw the screensaver graphics. I leave mine at the default setting since I don't use the screensaver, as it slows down the crunching.
12) Message boards : Number crunching : Validation inconclusive (Message 67389)
Posted 24 Apr 2018 by mikey
Hi folks, I currently have a lot (2000+) of tasks with a status of "Completed, validation inconclusive".

Credit is "pending". Is it normal to have this many inconclusive results? What further validation takes place after this status? What determines if I will/ will not receive credit for these?

Thanks!


Yes, it's normal, especially as fast as you are going through the workunits; most will be validated in time, you are just waiting on your wingman in each case. Each workunit here requires at least two different people to crunch it before it can be 'validated'. Since you are waiting on a wingman, yours are listed as 'inconclusive'.

'Pending' credit is different: those are units that don't require a wingman but that the project validates itself, and that depends on when they run the validation and, I guess, how many total there are from all the different users.
13) Message boards : Number crunching : n body crawling (Message 67380)
Posted 21 Apr 2018 by mikey
Yeah, I retired all my Core 2 Quads. The performance just isn't there for the energy cost. Even first-gen Core is getting my stinkeye as I start to consolidate; I'm moving to Sandy Bridge as the bottom cutoff for desktop stuff. I've found it's better still to just go multi-socket Xeon. LGA1366 is kinda iffy and is the bottom of what I will even consider. LGA2011 is plentiful and cheap on the used market.


You have mail
14) Message boards : Number crunching : Run Multiple WU's on Your GPU (Message 67379)
Posted 21 Apr 2018 by mikey
You have run into a BOINC software limitation, not a GPU limitation. BOINC itself can't see 12 GB of RAM on the GPU; it will in time, but not now, so running that many workunits that each take that much memory will be a problem.


How could this be a BOINC limitation? Do you have a citation on this? Or a link to the bug in the source code? It seems to me that if I ask BOINC to schedule 8 tasks per GPU, BOINC will do that without trying to determine if the GPU has enough RAM. Additionally, the errors I am seeing are coming from the Milkyway WUs. The errors are intermittent too. The computer can successfully handle 8 WUs per GPU most of the time, but even a 5% error rate is too high.



So this info is incorrect. There is no issue with BOINC and 12 GB of RAM on a graphics card. The issue is that the application running the WU doesn't know to throttle back if it runs out of memory. So with 12 GB of GPU RAM and 8 WUs going, you can go past the 12 GB of available RAM and it will error out and, I think, kill all running WUs (or at least the one that ran out of memory). This is not a BOINC limitation but a limitation of the application crunching the WU. I recently tested out a Tesla V100 with 16 GB of GPU RAM. I ran 10 WUs at a time and would peak at 14.5 GB of RAM used. It didn't error out; it worked fine. This was running BOINC 7.6.31 on Ubuntu 16.04. If I pushed 12 WUs, depending on how they ran (RAM usage ramps up as a WU processes), they would error out because I ran out of GPU RAM. In general a Milkyway WU will peak at the end around 1800 MB. Doing the math:


6 WU @ 1800 MB = 10.8 GB
8 WU @ 1800 MB = 14.4 GB


That's why you are erroring out at 8 WUs: you are randomly running out of GPU RAM. I say randomly because your WUs are all starting and ending at random times and it's rare for all of them to finish at once (and hit peak memory usage). You could probably get away with 7, but some will still fail randomly. Here is a V100 running 7 WUs. Notice the GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 000094A8:00:00.0 Off |                    0 |
| N/A   60C    P0   199W / 250W |   8915MiB / 16160MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     92457      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101  1838MiB |
|    0     92476      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101  1480MiB |
|    0     92484      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101  1838MiB |
|    0     92500      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101  1444MiB |
|    0     92523      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101  1480MiB |
|    0     92685      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101   406MiB |
|    0     92693      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101   358MiB |
+-----------------------------------------------------------------------------+


The ones at the top of the list have been running and are about to finish up. The ones at the bottom (higher PID) have just started.

There is a command-line version of BOINC. If I were you I'd open up a DOS prompt and go to c:\Program Files\BOINC, or wherever you have it installed. Run "boinccmd --get_project_status". Record the current time, the number of WUs you have, and the elapsed time. That's your baseline. Let it run for a couple of hours. Get the stats again. Calculate the difference (new time - old time and new total - old total). Do the math to find out how long you were taking per WU. Now divide that by 6. There is your approximate average per WU when running 6 at a time. Change it to 5 WUs, do it again. Change it to 4, do it again. Change it to 7, do it again (and watch for errors).


One of those settings will give you a number lower than the others. Stick with that many WUs.
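A worked example with made-up numbers, just to show the arithmetic: say the baseline showed 1,000 completed WUs at 10:00 and the second check showed 1,120 at 12:00. That's 120 WUs in 7,200 seconds, or one finished WU every 60 seconds; with 6 running at once, each WU is taking roughly 6 x 60 = 360 seconds of runtime. Repeat the same measurement at 4, 5, and 7 concurrent WUs and keep whichever setting gives the lowest seconds per finished WU.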


Also, in case anyone is wondering, after lots of playing a Tesla V100 seems optimal at 7 WUs, using 0.142 for the GPU setting, and I used 0.5 for the CPU. It seemed to give the best average WU time with quantity taken into account: 37 s per WU with 7 at a time, or an average of one WU per 5.3 seconds. I also tested a P100, which despite its price tag being 75% of a V100 is almost half the speed. The best I could get out of it was 54.9 s per WU with 6 at a time, or an average of one WU per 9.14 seconds. 4 or 5 WUs were just about the same (9.32); below or above were slower on average.
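For anyone who wants to try those numbers on their own card, the usual way to set them is an app_config.xml in the MilkyWay project folder. This is just a sketch, and the short app name ('milkyway', the separation application) is an assumption to check against your tasks page; gpu_usage of 0.142 ≈ 1/7 lets seven tasks share one GPU, and cpu_usage of 0.5 reserves half a CPU core per task.

<app_config>
  <app>
    <name>milkyway</name>
    <gpu_versions>
      <gpu_usage>0.142</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

Then tell BOINC to re-read its config files (or restart it) for the change to take effect.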


Thank you very much, I learned something new today!!
15) Message boards : Number crunching : n body crawling (Message 67371)
Posted 20 Apr 2018 by mikey
N-body is using little CPU and the run time is hitting 13 days. Any ideas?


Pause the crunching and then resume it; if that doesn't help, suspend it and then reboot the PC. If even that doesn't help, you may just want to stop doing the n-body WUs until they figure out the problem SOME people have running them.
16) Message boards : Number crunching : Need to know what other projects use the GPU so I can troubleshoot an issue (Message 67370)
Posted 20 Apr 2018 by mikey
Thanks. I hadn't noticed the Add a Project page on BOINC has a Nvidia icon for GPU projects, my bad!

Anyway, the problem persists on another GPU project, so I made a thread about it: https://boinc.berkeley.edu/dev/forum_thread.php?id=12415#85934

Thanks.


That's not 100% accurate, but it is a start. For instance, if you click on Seti or PrimeGrid it does not show they can use ATI cards, but both CAN! If you click on Moowrapper or Collatz you can see they can handle both kinds of GPU cards.
17) Message boards : Number crunching : My number of "invalids" decreased. How is this possible? (Message 67350)
Posted 18 Apr 2018 by mikey
I am looking at a problem with one of my systems that is reporting invalid tasks. The more concurrent tasks I run, for example 20 at a time, the higher the percentage of invalid tasks. When running one task at a time it seems I get no invalid work units at all. I have been picking and choosing drivers and varying the % CPU assigned to get the best performance on my S9100, which seems to have several drivers available.

Just a few hours ago the number of invalid units, as reported by your database, dropped from over 140 down to 42 on the system I am debugging. An invalid result is one that is different from the "wingman" and the task must be sent to a 3rd system to determine which answer is correct.

Since, on at least 140 of my units, I had failed the validity check, how is it now that things have changed and I no longer have that many invalid checks?

I double checked one of my failing work units but it is no longer available. I understand raw data and results are deleted on account of disk storage, but I would think the number of units processed would always be available.


No, the numbers are not cumulative. I have been crunching for MW with my GPUs since 2009 and my stats are:
State: All (1892) · In progress (177) · Validation pending (0) · Validation inconclusive (135) · Valid (1580) · Invalid (0) · Error (0)

There is no way that's all the workunits I have ever done, with ZERO invalid or errors!! You can see my credits and RAC under my name; I have had TONS of workunits that came back as invalid or with errors that are no longer showing up.
18) Message boards : Number crunching : AMD FirePro S9150 (Message 67343)
Posted 17 Apr 2018 by mikey
Hello, just this very night I set up my S9150 card: Windows 10 Pro with the appropriate driver from AMD, "15.201.2401-whql-firepro-retail".

My trouble is BOINC does not see the GPU. I am not sure why. Windows, GPUZ, and Afterburner all see the card. Attached pic.

Ideas? Thank you!



https://i.imgur.com/hC3P2eU.png

edit: I have deleted and reinstalled the latest BOINC app; it still does not see the card.


Try restarting the PC and make sure a monitor or "dummy plug" is plugged into the card.
19) Message boards : Number crunching : AMD FirePro S9150 (Message 67336)
Posted 15 Apr 2018 by mikey
PROBLEMS PROBLEMS PROBLEMS !!

I stopped processing multiple concurrent units on my S9100 as I was getting validate errors. Up until a few hours ago I had a total of 2 validate errors on over 36,000 completed work units. When running multiple concurrent tasks on this (the S9100) system, I was getting about 1 out of 4 tasks generating validate errors. Those invalid tasks, so far about 90, completed with no error but failed to validate. What bothers me is that the majority did get validated (the 3 out of 4 or so).

For example: I am the one in the middle and the one below me is an S9150 system. That system has only a few invalidates. It does have a different driver, but it is Windows 8.1, and drivers for the S9150 are not listed for Windows 10 x64, unlike my S9100 which does have Windows 10 x64 support. I downloaded driver 1800.12 thinking that would help with my 5-at-a-time workload. It did not; the percentage of invalid tasks decreased but the problem was still present.

I then allowed only 1 task to run at a time and have not seen a single invalid task since. Looking here, the invalid tasks all have 3-digit completion times, which indicates the failures are from 2 or more concurrent tasks. My wingman had only 16 invalid and appears to be running 2 at once, which is consistent with what I seem to be seeing (fewer invalid with less concurrency).

I have run multiple tasks on an HD7950 and HD7850 for a long time and had a total of "2" invalid, so it appears the driver for my FirePro has a problem. I asked at the AMD forum for help and possibly a diagnostic program, but running one at a time seems to have fixed the problem so far.


It could be a memory problem with the card, Windows, or even the software; every workunit uses memory, and maybe something isn't releasing it in time to crunch the next workunit, so it's having problems.
20) Message boards : News : New Runs (Message 67320)
Posted 9 Apr 2018 by mikey
Hi Jake,

why is your server down so often?


He has said in the past it was the normal backup processes etc. that put a strain on the system resources and caused the whole thing to crash. He tried putting in a 2nd CPU core in the last week or so, but something was damaged and it didn't work, and they are now reviewing their options.

