21) Message boards : Number crunching : GPU units hanging (Message 17642)
Posted 5 Apr 2009 by bobgoblin Post: Ook, where to start. I'm running BOINC 6.4.7 and CCC 9.2 on an i7. I set <cmdline>n8</cmdline> in the app_info.xml to limit the number of WUs actually processing. I found that if too many were running concurrently, they would lock up with only a 512M card; then I would have to close BOINC and restart to clear out the frozen WUs. But I get so few WUs for the GPU these days that I'm mainly running ABC on this particular machine.
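For reference, a <cmdline> element like the one mentioned above sits inside an <app_version> block of app_info.xml. A minimal sketch of that structure (the executable file name and version number here are made up for illustration; only the <cmdline>n8</cmdline> value comes from the post):

```xml
<app_info>
  <app>
    <name>milkyway</name>
  </app>
  <file_info>
    <!-- hypothetical GPU executable name -->
    <name>astronomy_gpu.exe</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>milkyway</app_name>
    <version_num>19</version_num>
    <!-- passed to the app on its command line; "n8" is the value from the post -->
    <cmdline>n8</cmdline>
    <file_ref>
      <file_name>astronomy_gpu.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>
```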
22) Message boards : Number crunching : ATI help please. (Message 13724)
Posted 3 Mar 2009 by bobgoblin Post: I'm running CCC 9.1 and BOINC 6.4.5 on an HD4870. If you were already running MilkyWay, were you running an opti app before going with the GPU app? Switching back and forth, I found I had to detach from MW, then reattach, abort any WUs it downloads during the reattach, then close BOINC, copy in the GPU files, and restart BOINC.
23) Message boards : Application Code Discussion : GPU app teaser (Message 12476)
Posted 23 Feb 2009 by bobgoblin Post: I left it at 9.1 but took your advice of upping the fan speed. That's brought it down from ~83C to ~77C. I enabled CPDN and ABC, but MW always ran 16 apps, then would switch to 8 ABC, never sharing; it was like how the Cylons and humans can't get along, it always had to be one or the other, never both. But with 16 MWs running it would eventually lock up. It really does look like a RAM problem: the 512M card really can't handle a constant run of 16 apps of this length, though I didn't have trouble earlier with the shorter apps. So what I finally did was reset BOINC to run with only 6 CPUs max. Now, with only 12 WUs running at a time, the GPU seems much happier and doesn't lock up. And the actual crunch times are significantly shorter with only the 12: 2-3 minutes compared to the 8 or 9 when running 16. And, as you pointed out, I end up processing more WUs.
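One way to cap the client at 6 CPUs, as described above, is the <ncpus> option in the BOINC client's cc_config.xml (this is a sketch; the same limit can also be set in the computing preferences):

```xml
<cc_config>
  <options>
    <!-- use at most 6 of the i7's logical CPUs,
         leaving headroom so the GPU app stays fed -->
    <ncpus>6</ncpus>
  </options>
</cc_config>
```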
24) Message boards : Application Code Discussion : GPU app teaser (Message 12106)
Posted 21 Feb 2009 by bobgoblin Post: I left it at 9.1 but took your advice of upping the fan speed. That's brought it down from ~83C to ~77C. Off the top of my head, I remembered reading that you basically just changed the version number, but couldn't remember if that was just the opti app or the GPU one. Also, I've noticed the temp has dropped even further overnight, so I may reset the fan speed to 40%. And I have a Climate Prediction model sitting at 50% done, so I'll resume that one.
25) Message boards : Number crunching : Very Strange Time To Completion (Message 12072)
Posted 21 Feb 2009 by bobgoblin Post: I'm noticing some really long WUs on my older machines, but not many. I posted the details of one here: http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=642&nowrap=true#12070 but it seems like it may just be a random super-long WU.
26) Message boards : Application Code Discussion : GPU app teaser (Message 12065)
Posted 21 Feb 2009 by bobgoblin Post: I left it at 9.1 but took your advice of upping the fan speed. That's brought it down from ~83C to ~77C, and it made it through the night without locking up. So it could be that it was overheating. But I had also upgraded to version .19, so was there a change in there that corrected the problem? Either way, v.19 is working fine on an i7 and HD4870 with 512M.
27) Message boards : Application Code Discussion : GPU app teaser (Message 11979)
Posted 21 Feb 2009 by bobgoblin Post: I upped the limit to 5000; it's probably not enough :P Didn't want to jump all the way to 10k and have craziness start happening, and with the longer WUs, 5k is probably enough. Though I noticed on the i7 "Maximum daily WU quota per CPU 4999/day", and it was only 4998/day when I looked earlier, while all the other machines showed 5000. And since we've gotten longer work units, I've been having problems with the GPU app freezing after about 4 hours (i7 with ATI 4800). So this morning I updated to CCC 9.1, then noticed it froze after 2 1/2 - 3 hours when I checked it from my phone. When I got home tonight, it didn't look like 9.1 had installed correctly, so I completely uninstalled and reinstalled 9.1; then it would freeze up after an hour or so. I just installed the new .19 GPU app, so maybe that will clear it up? But I wonder if, because these are longer WUs, I'm just putting too much strain on the GPU by continuously crunching 16 WUs at a time.
28) Message boards : Number crunching : Core i7 and Optimized App (Message 11267)
Posted 17 Feb 2009 by bobgoblin Post: You know... I think it's OK now. I think it was just a problem of 8 tasks starting and finishing in unison. Now that they've become staggered, things are running smoothly and my CPU is keeping busy. Times look to be between 6 and 9 minutes per WU now. I think you're just hitting the server when there was little to send. I'm running an i7, and it usually ranges from 1 or 2 WUs that haven't started to a full house of 96. And if you have an ATI 4800 graphics card on your machine, you might want to give the GPU app a whirl; then you can crunch 16 at a time in seconds.
29) Message boards : Application Code Discussion : GPU app teaser (Message 11081)
Posted 17 Feb 2009 by bobgoblin Post: With the GPU, I can crunch 16 at a time with the i7, so a 10k limit per core, or 80,000 in my case, would be more realistic than 4000. Oh, I agree with that too. The turnaround time for the GPU app was about 2 1/2 minutes. I've been running the opti app this week, and it's crunching 8 WUs in 6 minutes since the .19s came out, so that limit needs to go much higher as well.
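A quick back-of-the-envelope check of why a 4000/day quota is too low at the rates quoted above (16 WUs finishing roughly every 2.5 minutes on the GPU):

```python
# Rough daily-throughput estimate, using the figures from the post:
# 16 WUs per batch, a batch completing about every 2.5 minutes.
def daily_throughput(wus_per_batch: int, batch_minutes: float) -> int:
    """Return the approximate number of WUs completed in 24 hours."""
    batches_per_day = 24 * 60 / batch_minutes
    return int(wus_per_batch * batches_per_day)

print(daily_throughput(16, 2.5))  # 9216 WUs/day, well past a 4000/day quota
```

At roughly 9,200 WUs a day from a single GPU host, the old 4,000/day cap would run dry well before the day is out, which is the argument for raising the per-core limit.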
30) Message boards : Application Code Discussion : GPU app teaser (Message 11041)
Posted 16 Feb 2009 by bobgoblin Post: max_ncpus back to "1". With the GPU, I can crunch 16 at a time with the i7, so a 10k limit per core, or 80,000 in my case, would be more realistic than 4000.
31) Message boards : Number crunching : nm_s82_r7/r8 computation errors (Message 10483)
Posted 13 Feb 2009 by bobgoblin Post: ... I would agree that the compute errors are also download errors. But after reading both threads again, it seems to be restricted to quad cores and up. Everything from my 1Gig machine up to the Core 2 Duo is working fine; just the i7 is having issues. Yes, from what I've read, only Windows. I've been running the GPU app on the i7 and opti apps on all other machines. Watching the i7 this morning: once the star file download fails, those that download after it fail, but those that successfully downloaded before it start failing with compute errors. Then BOINC starts scheduling way in the future, so it never sends back the failed WUs and doesn't try to download star files again. Once I run an update, it downloads star files successfully and runs about another 3 hours before failing again. I've just downgraded to the opti app on the i7 to see what happens.
32) Message boards : Number crunching : nm_s82_r7/r8 computation errors (Message 10473)
Posted 13 Feb 2009 by bobgoblin Post: I had two WUs on my lappy crunching with stock app v0.17 now. I would agree that the compute errors are also download errors. But after reading both threads again, it seems to be restricted to quad cores and up. Everything from my 1Gig machine up to the Core 2 Duo is working fine; just the i7 is having issues.
33) Message boards : Number crunching : MD5 checksum errors (Message 10447)
Posted 13 Feb 2009 by bobgoblin Post: Looks like there are two error types, including those discussed in this thread. I'd looked through that thread earlier, and I'm starting to see the compute errors too. After I submit an update, I'm seeing these errors now:

2/12/2009 8:29:27 PM|Milkyway@home|[error] garbage_collect(); still have active task for acked result nm_s82_r9_755819_1234487764_0; state 9
2/12/2009 8:29:27 PM|Milkyway@home|[error] garbage_collect(); still have active task for acked result nm_s82_r9_755820_1234487764_0; state 9
2/12/2009 8:29:27 PM|Milkyway@home|[error] garbage_collect(); still have active task for acked result nm_s82_r9_755826_1234487764_0; state 9
2/12/2009 8:29:27 PM|Milkyway@home|[error] garbage_collect(); still have active task for acked result nm_s82_r9_755802_1234487763_0; state 9
34) Message boards : Number crunching : MD5 checksum errors (Message 10444)
Posted 13 Feb 2009 by bobgoblin Post: The detach didn't work either; it all locked up about 3 hours later. But once I did an update after getting home, everything started working again. Could it be we're exhausting the star files on our PCs and it doesn't know to go out and get new star files?
35) Message boards : Number crunching : MD5 checksum errors (Message 10387)
Posted 12 Feb 2009 by bobgoblin Post: It happened again about 8 hours later. All the WUs errored out and BOINC rescheduled for 24 hours later, so it didn't update. This time I completely detached and rejoined; I'll let you know how that goes.
36) Message boards : Number crunching : MD5 checksum errors (Message 10325)
Posted 11 Feb 2009 by bobgoblin Post: I did a reset and have now processed 100 WUs without errors, and everything looks like it's downloading correctly. Thanks, Travis.
37) Message boards : Number crunching : MD5 checksum errors (Message 10302)
Posted 11 Feb 2009 by bobgoblin Post:

<core_client_version>5.10.28</core_client_version>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>/stars-79.txt</file_name>
  <error_code>-119</error_code>
  <error_message>MD5 check failed</error_message>
</file_xfer_error>

I'm seeing a sudden surge of these today:

2/11/2009 11:48:24 AM|Milkyway@home|[error] MD5 check failed for /stars-79.txt
2/11/2009 11:48:24 AM|Milkyway@home|[error] expected c0053b930967ae032adc2b8f23015a93, got 61a14313a3f0244e7f7be2df1645d5d8
2/11/2009 11:48:24 AM|Milkyway@home|[error] Checksum or signature error for /stars-79.txt

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>stars-79.txt</file_name>
  <error_code>-200</error_code>
</file_xfer_error>
</message>
]]>

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>stars-82-v2.txt</file_name>
  <error_code>-200</error_code>
</file_xfer_error>
</message>
]]>

But only on my Vista 64. When these all err out, the scheduler resets to 24 hours before reconnecting.
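The MD5 check the client performs here can be reproduced by hand, which is useful for telling whether the copy on disk or the copy coming from the server is the bad one. A minimal sketch (the file path is just an example; compare the result against the "expected" digest from the log):

```python
import hashlib

def md5_of_file(path: str) -> str:
    """Compute the MD5 hex digest of a file, the same check BOINC applies
    to downloaded input files like stars-79.txt."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # read in chunks so large star files don't need to fit in memory
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Example: compare against the digest from the error message, e.g.
# md5_of_file("stars-79.txt") == "c0053b930967ae032adc2b8f23015a93"
```

If the hand-computed digest matches the "got" value from the log rather than the "expected" one, the file arrived intact but the server-side copy (or its catalogued checksum) is stale, which fits the pattern of everyone failing on the same star file at once.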
38) Message boards : Number crunching : We're back (Message 10240)
Posted 9 Feb 2009 by bobgoblin Post: Maybe it can power a larger version of the abacus Travis will be sending out? :-P A 4-cylinder TA manually crunching 15 MW WUs an hour. ;)
39) Message boards : Number crunching : We're back (Message 10199)
Posted 9 Feb 2009 by bobgoblin Post: lol, yeah, that's what reminded me of being hit by a Pontiac Sunbird last year. Maybe GM will come out with a new TA after the Camaro hits the streets, one that has wireless internet so you can crunch MW WUs in the car?
40) Message boards : Number crunching : We're back (Message 10198)
Posted 9 Feb 2009 by bobgoblin Post: lol, yeah, that's what reminded me of being hit by a Pontiac Sunbird last year. Well, there's a blue one driving around Cincinnati with my backside impression on the hood and a broken windshield. I still haven't decided if she was screaming because she hated the car or because she was about to run down a pedestrian.
©2024 Astroinformatics Group