
Posts by bobgoblin

21) Message boards : Number crunching : GPU units hanging (Message 17642)
Posted 5 Apr 2009 by bobgoblin
Post:
OK, where to start.

I'm running Vista 64-bit, BOINC version 6.4.7, and Catalyst 9.2, trying to get GPU version 19e running. I have edited the System32 folder according to what the readme file says. I haven't altered the app_info.xml file, just inserted it. The machine runs fine for a few hours, but when I go back the units are just hanging. Other projects are running fine, but MilkyWay seems to have just frozen.

Any clues as to what I have done wrong, please? Thanks for reading.


I have tried going back to Catalyst 8.12, as I read that it is sometimes the only version that works on some boxes.



I'm running 6.4.7 and 9.2 on an i7. I set <cmdline>n8</cmdline> in the app_info.xml to limit the number that were actually processing. I found that if too many were running concurrently, they would lock up with only a 512 MB card. Then I would have to close BOINC and restart it to clear out the frozen WUs.

But I get so few WUs for the GPU these days that I'm mainly running ABC on this particular machine.
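
For anyone else trying this, the n8 bit goes in the <app_version> section of app_info.xml. A rough sketch from memory follows; the exe filename here is just a placeholder, so use whatever the readme that ships with the GPU app actually names it, and keep the matching <app> and <file_info> entries the readme gives you:

<app_version>
    <app_name>milkyway</app_name>
    <version_num>19</version_num>
    <cmdline>n8</cmdline>
    <file_ref>
        <file_name>astronomy_gpu_0.19_ati.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>

The n8 in <cmdline> is what limits how many WUs are actually processing at once; drop it lower if a 512 MB card still locks up.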
22) Message boards : Number crunching : ATI help please. (Message 13724)
Posted 3 Mar 2009 by bobgoblin
Post:
I'm running CCC 9.1 and BOINC 6.4.5 on an HD4870. If you were already running MilkyWay, were you running an opti app before going with the GPU app? Switching back and forth, I found I had to detach from MW and then reattach, abort any WUs it downloads during the reattach, then close BOINC, copy in the GPU files, and restart BOINC.
23) Message boards : Application Code Discussion : GPU app teaser (Message 12476)
Posted 23 Feb 2009 by bobgoblin
Post:
I left it at 9.1 but took your advice of upping the fan speed. That's brought it down from ~83°C to ~77°C.

And it made it through the night without locking up, so it could be that it was overheating. But I had also upgraded to version 0.19, so was there a change in there that corrected the problem?

Either way, v0.19 is working fine on an i7 and an HD4870 with 512 MB.

Besides the CPU detection, nothing changed between 0.17 and 0.19. So maybe it really is a temperature problem.

But I have seen that your crunch times are slightly on the high side. This could be caused by running too many WUs concurrently on the GPU. At a certain point the RAM on the graphics card is not sufficient for the number of WUs taking up space there. Before it errors out (when even more WUs would be crunched), it slows down (probably some swapping over PCI Express happens). And with 16 WUs it is already getting a bit crowded on a 512 MB card.
Another reason for the higher times could be that the card runs downclocked in a power-saving mode. Maybe you should check the clock speed of the card.

Furthermore, you may think about attaching to a second BOINC project with that i7. This will reduce the number of MW WUs that are running at the same time, but not the throughput. You will still finish the same number of WUs per hour even with fewer concurrently running WUs. In fact, it could even rise in your case. Furthermore, your CPU cores wouldn't be idling that much ;)



I enabled CPDN and ABC, but MW always ran 16 apps, then would switch to 8 ABC, never sharing. It was like how the Cylons and the humans can't get along: it always had to be one or the other, never both.

But with 16 MW WUs running it would eventually lock up. It really does look like a RAM problem; the 512 MB really can't handle a constant run of 16 apps of this length, though I didn't have trouble earlier with the shorter apps.

So what I finally did was set BOINC to only run with 6 CPUs max. Now, with only 12 WUs running at a time, the GPU seems much happier and doesn't lock up. And the actual crunch times are significantly shorter with only the 12: 2-3 minutes compared to the 8 or 9 when running 16. And as you pointed out, I end up processing more WUs.
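
In case it helps anyone, I believe there are two ways to do that cap: the "On multiprocessors, use at most N processors" preference, or a cc_config.xml in the BOINC data directory with an <ncpus> override. A minimal sketch of the latter, assuming the standard option name (check the client configuration docs for your BOINC version):

<cc_config>
    <options>
        <ncpus>6</ncpus>
    </options>
</cc_config>

BOINC has to be restarted (or told to re-read its config file) before it picks that up.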
24) Message boards : Application Code Discussion : GPU app teaser (Message 12106)
Posted 21 Feb 2009 by bobgoblin
Post:
I left it at 9.1 but took your advice of upping the fan speed. That's brought it down from ~83°C to ~77°C.

And it made it through the night without locking up, so it could be that it was overheating. But I had also upgraded to version 0.19, so was there a change in there that corrected the problem?

Either way, v0.19 is working fine on an i7 and an HD4870 with 512 MB.

Besides the CPU detection, nothing changed between 0.17 and 0.19. So maybe it really is a temperature problem.

But I have seen that your crunch times are slightly on the high side. This could be caused by running too many WUs concurrently on the GPU. At a certain point the RAM on the graphics card is not sufficient for the number of WUs taking up space there. Before it errors out (when even more WUs would be crunched), it slows down (probably some swapping over PCI Express happens). And with 16 WUs it is already getting a bit crowded on a 512 MB card.
Another reason for the higher times could be that the card runs downclocked in a power-saving mode. Maybe you should check the clock speed of the card.

Furthermore, you may think about attaching to a second BOINC project with that i7. This will reduce the number of MW WUs that are running at the same time, but not the throughput. You will still finish the same number of WUs per hour even with fewer concurrently running WUs. In fact, it could even rise in your case. Furthermore, your CPU cores wouldn't be idling that much ;)


Off the top of my head, I remembered reading that you basically just changed the version number, but I couldn't remember if that was just for the opti app or the GPU one.

Also, I've noticed the temp has dropped even further overnight, so I may reset the fan speed to 40%. And I have a Climate Prediction model sitting at 50% done, so I'll resume that one.
25) Message boards : Number crunching : Very Strange Time To Competion (Message 12072)
Posted 21 Feb 2009 by bobgoblin
Post:
I'm noticing some really long WUs on my older machines, but not many. I posted the details of one here:

http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=642&nowrap=true#12070

But it seems like it may just be a random, super-long WU.
26) Message boards : Application Code Discussion : GPU app teaser (Message 12065)
Posted 21 Feb 2009 by bobgoblin
Post:

I suggest you roll back to 8.9 and see how it goes.

EDIT: Oh! Another thing. Monitor the temperatures on your GPU while crunching if you can. I had to increase the fan speed on my 4850 to 50% because I was afraid that 85°C was a bit much. Even this slight increase from 30% to 50% cooled the card all the way down to a stable 63°C.


I left it at 9.1 but took your advice of upping the fan speed. That's brought it down from ~83°C to ~77°C.

And it made it through the night without locking up, so it could be that it was overheating. But I had also upgraded to version 0.19, so was there a change in there that corrected the problem?

Either way, v0.19 is working fine on an i7 and an HD4870 with 512 MB.
27) Message boards : Application Code Discussion : GPU app teaser (Message 11979)
Posted 21 Feb 2009 by bobgoblin
Post:
I upped the limit to 5000; it's probably not enough :P Didn't want to jump all the way to 10k and have craziness start happening.

You don't need to raise it until multiple-GPU support is working. 10,000 WUs a day on a dual core is just enough for an HD4870 at stock speed.


And with the longer WUs, 5k is probably enough.

Though I noticed on the i7:

Maximum daily WU quota per CPU 4999/day

and it was only 4998/day when I looked earlier, while all the other machines showed 5000.

And since we've gotten longer work units, I've been having problems with the GPU app freezing after about 4 hours (i7 with an ATI 4800). So this morning I updated to CCC 9.1, then noticed it froze after 2.5-3 hours when I checked it from my phone.

When I got home tonight, it didn't look like 9.1 had installed correctly, so I completely uninstalled and reinstalled 9.1; then it would freeze up after an hour or so. I just installed the new 0.19 GPU app, so maybe that will clear it up?

But I wonder if, because these are longer WUs, I'm just putting too much strain on the GPU by continuously crunching 16 WUs at a time?
28) Message boards : Number crunching : Core i7 and Optimized App (Message 11267)
Posted 17 Feb 2009 by bobgoblin
Post:
You know... I think it's OK now... I think it was just a problem of 8 tasks starting and finishing in unison. Now that they've sort of become staggered, things are running smoothly and my CPU is keeping busy. Times look to be between 6 and 9 minutes per WU now.

Cool. I'll let it run and see how it does...



I think you're just hitting the server when there's little to send. I'm running an i7, and it usually ranges from 1 or 2 that haven't started to a full house of 96.

And if you have an ATI 4800 graphics card in your machine, then you might want to give the GPU app a whirl. Then you can crunch 16 at a time in seconds.
29) Message boards : Application Code Discussion : GPU app teaser (Message 11081)
Posted 17 Feb 2009 by bobgoblin
Post:
With the GPU, I can crunch 16 at a time with the i7, so a 10k limit per core, or 80,000 in my case, would be more realistic than 4000.

But the throughput is still one WU every 9 seconds or so with an HD4870. It is not getting faster with more concurrent WUs. So with an HD4870, a limit of 10,000 WUs a day would be enough as long as there is no multi-GPU support implemented (or massive overclocking involved).

I would say 10,000 WUs per host per day are needed now. When multiple cards are working and/or newer GPUs are available, this needs to be raised again.



Oh, I agree with that too. The turnaround time for the GPU app was about 2.5 minutes. I've been running the opti app this week and it's crunching 8 WUs in 6 minutes since the 0.19s came out, so that limit needs to go much higher as well.
30) Message boards : Application Code Discussion : GPU app teaser (Message 11041)
Posted 16 Feb 2009 by bobgoblin
Post:
max_ncpus back to "1".

avg_ncpus set to 0.25. If I keep the default value of 0.50, I have only three WUs running at the same time: two optimized MW and one World Community Grid. With avg_ncpus at 0.25 I have four workunits (two for each project).

Travis > Thank you very much for your answer about the credits; it's very good news. For the maximum number of units per core, I can say this: with a Core2Duo and an HD4850 512 MB, I crunched 2,000 workunits in 6 hours (with no other project running at the same time, so the CPU wasn't at full use). So 4,000 workunits per core per day will be enough for this configuration (my computer is working "only" 12 hours per day or so, so 4,000 WUs per core per day is the very, very maximum). I don't know how many workunits can be crunched with another model of ATI card.

Cluster > If I understand correctly, your optimization uses "only" one core at a time, is that right? Is it possible to use more cores, so we can use only MilkyWay on one computer with more than one core?

Do you need anything special to help you with your tests?

PS: During this writing I reached my 2,000-workunit limit, later than before, probably because I tried setting avg_ncpus to 0.10...



With the GPU, I can crunch 16 at a time with the i7, so a 10k limit per core, or 80,000 in my case, would be more realistic than 4000.
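
For reference, those avg_ncpus / max_ncpus knobs sit in the <app_version> block of app_info.xml. A sketch with a placeholder exe name (use the filename from the GPU app's readme):

<app_version>
    <app_name>milkyway</app_name>
    <version_num>19</version_num>
    <avg_ncpus>0.25</avg_ncpus>
    <max_ncpus>1</max_ncpus>
    <file_ref>
        <file_name>astronomy_gpu_0.19_ati.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>

As I understand it, BOINC starts tasks until their avg_ncpus values add up to the CPU count, which is why the i7 runs 16 at a time with the default 0.50 (8 threads / 0.5 = 16); lowering avg_ncpus just lets more run side by side.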
31) Message boards : Number crunching : nm_s82_r7/r8 computation errors (Message 10483)
Posted 13 Feb 2009 by bobgoblin
Post:
... I would agree that the compute errors are also download errors. But after reading both threads again, it seems to be restricted to quad cores and up. My 1Gig up to the Core 2 Duo are all working fine; just the i7 is having issues.

Funnily enough, I had problems on all 4 of my puters... *sigh*

These are:
- C2D lappy (Win XP 64-bit)
- Quad (Vista 32-bit)
- P4 (Win XP 32-bit)
- P4 lappy (Win XP 32-bit)

So it doesn't seem to affect only faster/newer boxes. But it does seem to affect only Windows machines, is that right?


Yes, from what I've read, only Windows. I've been running the GPU app on the i7 and opti apps on all the other machines. Watching the i7 this morning: once a star file download fails, the WUs that download after it fail too, but the ones that downloaded successfully before it start failing with compute errors. But then BOINC starts scheduling way in the future, so it never sends back the failed WUs and doesn't try to download star files again. Once I run an update, it downloads star files successfully and runs about another 3 hours before failing again. I've just downgraded to the opti app on the i7 to see what happens.
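
By the way, for anyone who wants to trigger that update without clicking around in the manager, the BOINC command-line tool should do it; I think it's called boinccmd (boinc_cmd on some older installs). A sketch, using the MilkyWay project URL:

boinccmd --project http://milkyway.cs.rpi.edu/milkyway/ update

That kicks off the same scheduler contact as the Update button, so the stuck results get reported and fresh star files come down.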
32) Message boards : Number crunching : nm_s82_r7/r8 computation errors (Message 10473)
Posted 13 Feb 2009 by bobgoblin
Post:
I had two WUs on my lappy crunching with the stock app v0.17 now.
Both seem to have finished without failure, but one WU vanished from my results list as soon as it was reported, so I can't say for sure. :-(
The other one has gone now, too, and was here (at least I saw that result marked as successful before it was purged).

PS#1: This is no sign at all that the opti app isn't working correctly, because:
a) not all WUs are erroring out while using the opti app!
b) most of the errors are download errors, which have nothing to do with any app!
Even the ones which are declared as "Compute error" are in fact download errors; see this example:
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>stars-82-v2.txt</file_name>
<error_code>-200</error_code>
</file_xfer_error>
</message>


The other errors I had yesterday (exceeded disk limit and some Windows dumps) seem to have gone from all my systems today. ;-)

PS#2: Concerning the download errors, there must be something wrong on the server side, I guess... because even the stock app doesn't know where to get the input files from, right? *grin*
I mean, the server is giving out these input files and also the info on where they are stored...


I would agree that the compute errors are also download errors. But after reading both threads again, it seems to be restricted to quad cores and up. My 1Gig up to the Core 2 Duo are all working fine; just the i7 is having issues.
33) Message boards : Number crunching : MD5 checksum errors (Message 10447)
Posted 13 Feb 2009 by bobgoblin
Post:
Looks like there are two error types, including those discussed in this thread.

An example of the MD5 checksum error, from the BOINC Messages tab, is here:


12/02/2009 23:44:35|Milkyway@home|[error] MD5 check failed for /stars-79.txt
12/02/2009 23:44:35|Milkyway@home|[error] expected c0053b930967ae032adc2b8f23015a93, got 61a14313a3f0244e7f7be2df1645d5d8
12/02/2009 23:44:35|Milkyway@home|[error] Checksum or signature error for /stars-79.txt

This is much less of a problem, as errors go, than the ones discussed in the other thread.





I looked through that thread earlier, and I'm starting to see the compute errors too. After I submit an update, I'm seeing these errors now:

2/12/2009 8:29:27 PM|Milkyway@home|[error] garbage_collect(); still have active task for acked result nm_s82_r9_755819_1234487764_0; state 9
2/12/2009 8:29:27 PM|Milkyway@home|[error] garbage_collect(); still have active task for acked result nm_s82_r9_755820_1234487764_0; state 9
2/12/2009 8:29:27 PM|Milkyway@home|[error] garbage_collect(); still have active task for acked result nm_s82_r9_755826_1234487764_0; state 9
2/12/2009 8:29:27 PM|Milkyway@home|[error] garbage_collect(); still have active task for acked result nm_s82_r9_755802_1234487763_0; state 9

34) Message boards : Number crunching : MD5 checksum errors (Message 10444)
Posted 13 Feb 2009 by bobgoblin
Post:
The detach didn't work either; it all locked up about 3 hours later. But once I did an update after getting home, everything started working again. Could it be that we're exhausting the star files on our PCs and it doesn't know to go out and get new star files?
35) Message boards : Number crunching : MD5 checksum errors (Message 10387)
Posted 12 Feb 2009 by bobgoblin
Post:


What might have happened is that you didn't complete downloading one of the star files. You could try deleting it and making BOINC re-download it. I think a project reset should do that.


I did a reset and have now processed 100 WUs without errors, and everything looks like it's downloading correctly.

Thanks, Travis.



It happened again about 8 hours later. All the WUs errored out and BOINC rescheduled for 24 hours later, so it didn't update.

This time I completely detached and rejoined. I'll let you know how that goes.
36) Message boards : Number crunching : MD5 checksum errors (Message 10325)
Posted 11 Feb 2009 by bobgoblin
Post:


What might have happened is that you didn't complete downloading one of the star files. You could try deleting it and making BOINC re-download it. I think a project reset should do that.


I did a reset and have now processed 100 WUs without errors, and everything looks like it's downloading correctly.

Thanks, Travis.
37) Message boards : Number crunching : MD5 checksum errors (Message 10302)
Posted 11 Feb 2009 by bobgoblin
Post:
<core_client_version>5.10.28</core_client_version>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>/stars-79.txt</file_name>
  <error_code>-119</error_code>
  <error_message>MD5 check failed</error_message>
</file_xfer_error>


Probably a WU configuration problem; the filename starts with a slash.


All the sticky data files have started with a slash recently; I've not had a problem with Linux or XP with that.

[edit] Hmm, mind you, I'm using BOINC 6.x [/edit]



I'm seeing a sudden surge of these today:

2/11/2009 11:48:24 AM|Milkyway@home|[error] MD5 check failed for /stars-79.txt
2/11/2009 11:48:24 AM|Milkyway@home|[error] expected c0053b930967ae032adc2b8f23015a93, got 61a14313a3f0244e7f7be2df1645d5d8
2/11/2009 11:48:24 AM|Milkyway@home|[error] Checksum or signature error for /stars-79.txt

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>stars-79.txt</file_name>
<error_code>-200</error_code>
</file_xfer_error>

</message>
]]>


<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>stars-82-v2.txt</file_name>
<error_code>-200</error_code>
</file_xfer_error>

</message>
]]>

But only on my Vista 64. When these all error out, the scheduler resets to 24 hours before reconnecting.
38) Message boards : Number crunching : We're back (Message 10240)
Posted 9 Feb 2009 by bobgoblin
Post:
Maybe it can power a larger version of the abacus Travis will be sending out? :-P


Do the work of 4 abacuses, you know... a quad core. :P



A 4-cylinder TA manually crunching 15 MW WUs an hour. ;)
39) Message boards : Number crunching : We're back (Message 10199)
Posted 9 Feb 2009 by bobgoblin
Post:
LOL, yeah, that's what reminded me of being hit by a Pontiac Sunbird last year.

Oh good lord. Years ago I drove a Pontiac Sunbird. I hated that crappy car.


I don't care for them myself, just muscle cars and classics.



Maybe GM will come out with a new TA after the Camaro hits the streets, one that has wireless internet so you can crunch MW WUs in the car?
40) Message boards : Number crunching : We're back (Message 10198)
Posted 9 Feb 2009 by bobgoblin
Post:
LOL, yeah, that's what reminded me of being hit by a Pontiac Sunbird last year.

Oh good lord. Years ago I drove a Pontiac Sunbird. I hated that crappy car.


Well, there's a blue one driving around Cincinnati with my backside's impression on the hood and a broken windshield.

I still haven't decided if she was screaming because she hated the car or because she was about to run down a pedestrian.

