Posts by mikey
1) Message boards : Number crunching : CPU MT N-body de_nbody_1_13 ALL FAIL. (Message 67063)
Posted 6 days ago by mikey
ALL the 1.68 MilkyWay@Home N-Body Simulation (mt) de_nbody_1_13

FAIL at startup. Is there something I am missing? I've upgraded BOINC and all drivers, tried running them with fewer CPUs, and nothing helps.

http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=2256760910

ALL MT 1.13 tasks seem to fail the same...

8-)


Turn off the N-body tasks in your settings and use your CPUs on something else. They DO have problems for some people, but not everyone; for those that do, it's easier to crunch something else instead.
2) Message boards : Number crunching : MW@H DBase problems (Message 67052)
Posted 7 days ago by mikey
I guess my utilization is low enough that I have never noticed the problem. I process MW tasks every day, but only about 30 a day, so I have never noticed an inability to upload because MW is such a small percentage compared to Seti.


I've got 3 PCs here right now but am about to reach another goal at Einstein, and PG is not far behind it. I'm 3rd on my team here at MW, but the 2 people ahead of me stopped crunching long ago. I would love to pass them but just can't bring more machines here with the dbase problems. I lost over 500 workunits' worth of credits the last time it crashed!! Way too often, for me, the in-progress and inconclusive numbers are the same, or there are even more inconclusives than in-progress workunits, and that scares me a lot!! Too many other projects don't have those problems for MW to STILL be having them; someone needs to figure out how to ask for help.
3) Message boards : News : Reducing Workunits to Unreliable Hosts (Message 67051)
Posted 7 days ago by mikey
Hey everyone,

I found a small bug in the way that the scheduler tests for reliability on workunits with priority 0. I am going to try changing everything to priority 1 and see if that fixes things. Hopefully this won't change how quickly our workunits are processed or affect how it interacts with workunits from other projects. Let me know if you see any issues with this on your end.

Jake


Thanks I hope that helps!

In the meantime maybe you can see if the pc is affected by it:
http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=763376

It got 128 tasks and trashed them ALL!!
4) Message boards : Number crunching : MW@H DBase problems (Message 67038)
Posted 9 days ago by mikey
Last night, once again, there was the unable-to-open-dbase error, so no results could be uploaded.
Can whomever is in charge of feeding the dbase hamsters ensure that they DO get their nosh?
This dbase problem has been ongoing for some time [years in fact]. Them poor 'amsters must be starving...


Seems to happen a lot, doesn't it? I would love to bring more PCs here but am worried about the dbase problems too; I have people to pass that I just can't.
5) Message boards : News : Reducing Workunits to Unreliable Hosts (Message 67035)
Posted 10 days ago by mikey
I will do a little more research into configuring the server better throughout the week and run some more tests...


Any update on this, Jake? I've had another result invalidated due to unreliable wingmen. I guess it doesn't make any difference, but all the hosts I've looked at appear to be using GPUs that aren't double precision.


Maybe they could blacklist some then? Not the actual host, but the model of GPU that can't do double precision. That should mean the host can't get work. Another project I crunch for bans any GPU with less than 2 GB of RAM; I realize that may be a bit easier, but it's a start.
6) Message boards : Number crunching : Hosts with only invalid results (Message 67015)
Posted 18 days ago by mikey
Hey Everyone,

I just tried turning on some options to fix this problem.

http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4227

Let me know if there are any issues there.

Jake


Maybe you can look at this host and see why your fix isn't working on it?

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=763378

798 workunits and NOT ONE is valid, yet you guys keep sending them!!!

Maybe THIS is why it takes soooo fricking long to get credits here. How many of these hosts are out there just clogging up the process?
7) Message boards : News : Reducing Workunits to Unreliable Hosts (Message 67010)
Posted 20 days ago by mikey
Unfortunately, host 643627 went through another cycle this morning of returning 80 errored tasks and getting 80 new tasks...


And it continues...


What's strange is they have 2 PCs: one works and one doesn't at all; it's got the 2-second error problem.
8) Message boards : Cafe MilkyWay : Where is everyone? (Message 67002)
Posted 22 days ago by mikey
Still not a lot happening here. Surprising given the troubles at Seti.


I'm still nervous about the DB here. I only have one unsent workunit in my inconclusives right now though.
9) Message boards : Number crunching : Hosts with only invalid results (Message 66998)
Posted 25 days ago by mikey
Replying to my own post, I resolved my problem with a lack of WUs by detaching from the project, then attaching. Not very elegant, but effective in this case. Apologies for the noise.


We all try different things; some work and some don't. Your way worked and you are back to crunching again!! WOO HOO!!!
10) Message boards : News : Reducing Workunits to Unreliable Hosts (Message 66997)
Posted 25 days ago by mikey
Hey Everyone,

I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know.

Thank you all for your continued support.

Jake


Thank you very much, hopefully the credits will flow more quickly now.
11) Message boards : Number crunching : Hosts with only invalid results (Message 66987)
Posted 26 days ago by mikey
80 workunits is the max any one GPU can get at a time; they do restrict all of us that way. As we return one we can get another, so if that PC returns an invalid workunit, gets another, and trashes it too, it could be going through at least 80 per day, a lot more if it connects again once it's out of trashed workunits.

Right. But, there seems to be a 24-hour backoff imposed. It's returning all 80 crashed tasks at once, getting 80 new ones, and repeating that cycle 24 hours later.

That is the BOINC mechanism in play and is working as designed with the 24 hour backoff.

I have seen many comments in various projects that bad hosts need to be expunged by the admins. But I have not once ever seen that action imposed. Seems that all admins are afraid of the publicity/recrimination of banning someone. Moderators ban users at will and frequently for violating posting policy. Why would banning an incorrectly configured host be any different?


One thing other projects do, though, is reduce the number of workunits a bad PC can get, and MW does NOT do that currently. For instance, if a PC returns 80 bad workunits today, it only gets 5 workunits tomorrow; if it returns those as invalid, it only gets one workunit per day until it starts returning valid workunits again.
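The throttling scheme described above can be sketched roughly as follows. This is a hypothetical illustration, not MilkyWay@Home's or BOINC's actual scheduler logic; the function name and the scaling factors are made up to reproduce the 80 → 5 → 1 progression in the example.

```python
def daily_quota(prev_quota, errors_yesterday, valid_yesterday,
                max_quota=80, min_quota=1):
    """Hypothetical per-host daily workunit quota.

    A host that mostly trashes its work has its quota cut sharply each
    day, down to a floor of one task; once it starts returning valid
    results again, the quota is restored gradually.
    """
    if errors_yesterday > valid_yesterday:
        # Mostly bad results: cut the quota sharply (80 -> 5 -> 1 ...).
        return max(min_quota, prev_quota // 16)
    # Returning valid work again: double the quota back toward the cap.
    return min(max_quota, prev_quota * 2)

# A host trashes all 80 tasks: quota drops to 5, then 1, and stays at 1.
q = 80
q = daily_quota(q, errors_yesterday=80, valid_yesterday=0)  # 5
q = daily_quota(q, errors_yesterday=5, valid_yesterday=0)   # 1
# Once it returns a valid result, the quota starts climbing again.
q = daily_quota(q, errors_yesterday=0, valid_yesterday=1)   # 2
```

The key design point is the asymmetry: quota falls fast on errors but recovers slowly, so a misconfigured host can't burn through hundreds of tasks a day while it stays broken.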
12) Message boards : Number crunching : exactly what url to use in cc_config.xml? (Message 66981)
Posted 28 days ago by mikey
I had to use https, not http

Not sure what is going on, but it is working now.


    <exclude_gpu>
    <url>https://milkyway.cs.rpi.edu/milkyway/</url>
    <device_num>0</device_num>
    </exclude_gpu>
    <exclude_gpu>
    <url>http://einstein.phys.uwm.edu/</url>
    <device_num>1</device_num>
    </exclude_gpu>



What is even stranger is that on one of my systems einstein.phys.uwm.edu does not work. Instead I need to use einsteinathome.org. I detached and re-attached and still got einsteinathome.org. My other systems have the correct phys.uwm.edu address. I will ask over at Einstein about this.



I will make a note of the https...thanks! I'm glad it's working.
13) Message boards : Number crunching : Hosts with only invalid results (Message 66978)
Posted 29 days ago by mikey
Has anyone ever PM'd the owner to inform them their host is producing nothing but invalids?

In my case, I did but got no response.

The host continues to generate nothing but errors. At least it's being limited to 80 tasks/day and is contacting the project only once every 24-hrs. So, it seems like some kind of restriction is being imposed. The user has another host and it's returning valid results.


80 workunits is the max any one GPU can get at a time; they do restrict all of us that way. As we return one we can get another, so if that PC returns an invalid workunit, gets another, and trashes it too, it could be going through at least 80 per day, a lot more if it connects again once it's out of trashed workunits.
14) Message boards : Number crunching : exactly what url to use in cc_config.xml? (Message 66977)
Posted 29 days ago by mikey
I used the one I was told to use, but it didn't work.

    cc_config.xml: bad URL in GPU exclusion: http://milkyway.cs.rpi.edu/milkyway/
    This project is using an old URL. When convenient, remove the project, then add http://milkyway.cs.rpi.edu/milkyway/



What I tried is as follows:


    <exclude_gpu>
    <url>milkyway.cs.rpi.edu/milkyway/</url>
    <device_num>1</device_num>
    <type>NVIDIA</type>
    </exclude_gpu>



Thanks for looking!



Like this:

<cc_config>
<options>
<use_all_gpus>1</use_all_gpus>
<exclude_gpu>
<url>http://milkyway.cs.rpi.edu/milkyway/</url>
<device_num>1</device_num>
</exclude_gpu>
</options>
</cc_config>

If you are using <device_num> you do NOT need the <type> part too; you would only use that if you had, for example, an ATI and an Nvidia GPU and wanted to exclude them that way instead. Then you could use the part of this line that you needed:
[<type>NVIDIA|ATI|intel_gpu</type>]
and skip the <device_num> line.
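For example, a cc_config.xml that excludes all NVIDIA GPUs from MilkyWay by <type> instead of by <device_num> might look like this (a sketch following the exclude_gpu format shown above; adjust the URL and GPU type for your own setup):

```xml
<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
    <exclude_gpu>
      <url>http://milkyway.cs.rpi.edu/milkyway/</url>
      <!-- exclude every NVIDIA GPU on this project; no device_num needed -->
      <type>NVIDIA</type>
    </exclude_gpu>
  </options>
</cc_config>
```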
15) Message boards : Number crunching : Hosts with only invalid results (Message 66973)
Posted 18 Jan 2018 by mikey
I've returned to the project after being away for a while, so I'm not sure how much of a problem this has been recently. Regardless, I thought I'd give this thread a bump...

I've only returned a few results so far, but all of my inconclusives for the MilkyWay@Home app (not the N-body) include a wingman whose host (ID 643627) has been returning nothing but errors for at least the last 4 days.

Can't something be done to halt the sending of new work to unreliable hosts?


The problem seems to be a lack of people power at the project level. There's more than one admin who has quit crunching; I don't know if that's because they are 'former' admins or for some other reason, but when the admins stop contributing to their own project, that's a problem to me. It means they have less and less contact with the daily goings-on at OUR level, and depend more and more on messages from people most of us have no clue how to contact.

In a thread last week one of the senior admins asked for a list of "unsent" workunits on our PCs; if I can see them for my account, can't they run a small script and get the list for everyone themselves? Was he taking the easy way out or just trying to be helpful? I don't know, but he never replied back in the thread. My "unsent" workunits were sent out to other computers by the next day, though.
16) Message boards : Cafe MilkyWay : MilkyWay@Home "standard" and N-Body Simulations (Message 66967)
Posted 15 Jan 2018 by mikey
Hi. I wanted to know what the difference is between the standard and N-Body apps. Thanks.


N-body workunits are CPU-only and usually use multiple CPU cores for each workunit.
17) Message boards : News : Validation Inconclusive Errors (Message 66966)
Posted 15 Jan 2018 by mikey
I own a measly laptop and have been trying to contribute what I can, but the N-body tasks have an interesting property: at least half tend to get 'caught' in their own processing, to the point where after a day of running they claim they will be done in two days, or on occasion get trapped for a few days and report a completion time beyond the reporting deadline.

Is there a way to let these processes that literally get stuck at a certain component automatically abort and report the troublesome step or stage to you guys, to permit improvements to the code?
As of right now, if after a few hours the time remaining starts ballooning, I abort manually: there is no way a process that is 95% complete after running for 2 hours should still be at 95% an hour later, which is something I have witnessed.
I note that BOINC doesn't even notice when the predicted time to completion ends up beyond the report time and cancel the process then, which is what was previously happening to N-body before I started manually monitoring its progress.
I don't know why N-body is doing this. I have offered plenty of memory (which it hasn't used), it steadily reduces CPU usage as it jams up, and BOINC has access to a few gigs of storage which it has barely scratched during this computation, so I am puzzled whether the issue is a bottleneck or a crashing component.


Some people have said that if you suspend and then resume them, they finish right up.
18) Message boards : MilkyWay@home Science : Science Summary (Message 66962)
Posted 14 Jan 2018 by mikey
Looking forward to some 'official' notice that crunching data for MilkyWay@home is still of value. Once I've hit approx. 1M I'll be moving on to another project and will only continue with MW@h if real value is confirmed; there's no point in using compute resources if the project has become a 'zombie'.


Read the News section; the problem isn't that the data isn't valuable, it seems to stem from a lack of money to keep the databases running like they want them to be.
19) Message boards : MilkyWay@home Science : Science Summary (Message 66958)
Posted 11 Jan 2018 by mikey
I know scientists have a lot of things to do, but
This page is up to date as of January 10, 2015--Siddhartha S.


A little update is welcome.
What are we crunching for?


If you look, 'Matthew', the person who started this page, is no longer crunching; the problem may be that no one else is willing to take on his duties. Other project admins have quit crunching here too; this could be a dying project.
20) Message boards : News : Validation Inconclusive Errors (Message 66957)
Posted 11 Jan 2018 by mikey
For me, the N-Body inconclusives are clearing out nicely. It seems now that it's the MilkyWay@Home app that's lagging...


Looks like progress is being made. All of my inconclusives have wingmen now.


Mine appear to be also, but then I haven't crunched any new ones since yesterday either! Having to babysit this project is a royal pain; I lost over 500 completed workunits during the last database crash, and I'm very hesitant to let that happen again. I assume they are trying, but something isn't working.




Copyright © 2018 AstroInformatics Group