Welcome to MilkyWay@home

Posts by AndreyOR

1) Message boards : News : Admin Updates Discussion (Message 77060)
Posted 12 days ago by AndreyOR
Post:
How does one prevent the single-thread executable from downloading and running? I have preferences set to run both applications with no limit on CPUs and Jobs. I get a mix of single and multi-threaded tasks.
2) Questions and Answers : Windows : Only one GPU ? (Message 74709)
Posted 27 Nov 2022 by AndreyOR
Post:
...Usually, while doing it the hard way, one finds other interesting things ...
Or learns how to go about finding other things ...

I agree and like your way of thinking.
3) Message boards : Number crunching : Validation Pending too many tasks (Message 74515)
Posted 20 Oct 2022 by AndreyOR
Post:
Validation is still happening, just at a very slow pace, so if one stops crunching, a reduction in one's Validation Pending is to be expected. Even without stopping there are occasional temporary reductions. However, users stopping is likely to slow things down even more, as there will be even fewer machines doing the little validation that is happening.

Unfortunately, it seems unlikely that things will get fixed until after the server migration.

I haven't run N-Body much over the last few months. Is N-Body validation also a problem, or is it just Separation?
4) Message boards : Number crunching : Validation Pending too many tasks (Message 74495)
Posted 19 Oct 2022 by AndreyOR
Post:
Skillz, mikey,
There's no huge backlog like a few months ago that will take many weeks to clear. Current queues get cleared within a couple of days or so, as the numbers are in the thousands and tens of thousands instead of millions or tens of millions. So work generation still occurs, just somewhat irregularly. With such a huge and growing validation queue, I'm wondering why the work being generated isn't almost all wingman/validation work. Yes, as tasks get generated (new or resends) they go to the back of the queue, but the queue gets cleared out every couple of days or so, so I'd expect validation to be occurring regularly rather than piling up.

If you look at your Validation Inconclusive tasks, you'll notice that each has a task In Progress or Unsent. In Validation Pending there's nothing like that, which makes me think that validation hasn't been attempted yet on those tasks. I could be missing something, but I suspect that something may be up with the validator.
5) Message boards : Number crunching : Validation Pending too many tasks (Message 74491)
Posted 18 Oct 2022 by AndreyOR
Post:
So it seems like the problems are unlikely to go away until after the migration to new hardware (and fixing the issues that might come with that) as the workunit pool overfill bug seems to be very persistent.

I don't think the WU overfill bug explains the growing validation queue, though. When the task generator creates new tasks, why does it seem like no wingman/validation tasks are being created, just new, initial ones, given the large and ever-growing Waiting for Validation queue? Overfill or not, I don't see why the validation queue is so large and growing.
6) Message boards : Number crunching : Validation Pending too many tasks (Message 74443)
Posted 14 Oct 2022 by AndreyOR
Post:
It seems like whatever the problem is, it's affecting the Validator. If tasks were getting validation attempts, they'd be getting marked as valid, invalid, or inconclusive. Instead they're stuck in pending. On a quick look, it seems like almost none of the pending tasks have "wing-man" tasks generated yet; otherwise they'd show up as assigned (to a machine) or unsent. So the only thing the Task Generator can do is generate new, _0, tasks. But that also doesn't seem to be working well, at least for Separation, as work there is hard to get.

In general, tasks get assigned to users in the order they were created, so when validation starts working again and second-attempt tasks start getting generated, they'll go to the back of the queue; to get to them we need to process everything in front of them. So I'd say the best thing to do is just to keep crunching. Hopefully things can get resolved soon on the server side. The good thing is that there's a plan to replace the server, by the end of the year I believe, which should prevent a recurrence of the significant problems the project has experienced this year.
7) Message boards : Number crunching : Validation Pending too many tasks (Message 74432)
Posted 12 Oct 2022 by AndreyOR
Post:
Really? I haven't seen it yet and just had to manually request tasks as the queue emptied out. Even with an empty queue I haven't been getting the maximum 300 tasks like before; I just got 224, and even fewer before that.
8) Message boards : Number crunching : New Benchmark Thread - times wanted for any hardware, CPU or GPU, old or new! (Message 74414)
Posted 11 Oct 2022 by AndreyOR
Post:
I've looked at them, but my system isn't very expandable when it comes to PCIe slots. If your system(s) can accommodate them and you can find a good deal, try them. I'm a fan of the R9 280X given that it can be had very cheaply if one is patient on eBay. It has at least 4 times the throughput of the 3060 Ti that came with my system.

The real power of these 1+ TFLOPS FP64 cards is being able to run many tasks simultaneously. The R9 280X produces its best throughput when running 4 to 5 tasks concurrently. I'd definitely be comfortable testing how 8-10 concurrent tasks do on those FirePros.
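
If anyone wants to experiment with that, here's a minimal app_config.xml sketch for running multiple Separation GPU tasks at once. The .25 gpu_usage (4 tasks per GPU) and .9 cpu_usage values are just illustrative starting points, not recommendations; tune them for your own card:
<app_config>
   <app>
      <name>milkyway</name>
      <gpu_versions>
          <gpu_usage>.25</gpu_usage>
          <cpu_usage>.9</cpu_usage>
      </gpu_versions>
   </app>
</app_config>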
9) Message boards : Number crunching : Very Long WU's (Message 74401)
Posted 10 Oct 2022 by AndreyOR
Post:
"Remaining time" being off by a lot is unusual for MilkyWay but somewhat common for projects that have a lot of variability in runtimes (LHC) or have very long (days to weeks) runtimes (CPDN). It's new to MilkyWay and Tom explained the reason for it earlier this week.
10) Message boards : Number crunching : Very Long WU's (Message 74400)
Posted 10 Oct 2022 by AndreyOR
Post:
It's best to think of BOINC credit in terms of average rather than absolute credit per task, especially when there's a lot of variability in runtimes. If you compare the average credit per runtime or CPU time for a bunch of N-Body tasks from before the runtime variability showed up to a bunch of tasks now, they're probably similar. BOINC doesn't handle a lot of variability well short term, but long term things average out. I'd suggest just continuing to crunch and letting the credit average itself out. Unless someone can find evidence otherwise, I don't think users get shorted on credit long term. I'd say long term is at least a couple of weeks, as one probably needs to complete a lot of tasks for BOINC to figure things out.
11) Message boards : Number crunching : Very Long WU's (Message 74393)
Posted 9 Oct 2022 by AndreyOR
Post:
N-Body tasks have a 12-day deadline, so even with long run times there shouldn't be any "not started by deadline" errors. The only reasons I can think of that would make one run out of time are a large BOINC queue and not having one's PC run close to 24/7.
12) Message boards : Number crunching : Validation Pending too many tasks (Message 74370)
Posted 6 Oct 2022 by AndreyOR
Post:
I actually think that the whole trusted computer thing doesn't work well on this project. I have plenty of consecutive valid tasks and still have over 5000 waiting for validation. So do you @mrchips: https://milkyway.cs.rpi.edu/milkyway/host_app_versions.php?hostid=792762 and @mikey: https://milkyway.cs.rpi.edu/milkyway/host_app_versions.php?hostid=810880.

The project is experiencing some problems and there's a backlog of workunits to validate. Validation will happen when at least one other computer processes the same workunits to validate them. That probably won't happen until the unsent tasks queue goes down to its normal levels and tasks for workunits that need validation start getting generated and assigned to other users. It'll take some time, but eventually it'll happen. I'd suggest just continuing to crunch, as the more we crunch the faster everything will clear out. Short term, the validation queue will likely continue to fluctuate and grow.
13) Message boards : Number crunching : Very Long WU's (Message 74369)
Posted 6 Oct 2022 by AndreyOR
Post:
That's weird that the estimated runtime was so different from the actual runtime. I'll keep an eye on that. Has anyone else seen that problem?

I've seen this with another project, LHC, which tends to have highly variable runtimes from task to task (and probably also highly variable estimated computation sizes), so I didn't think of it as unusual when I saw it here. I just figured that something changed with the science that made it difficult to estimate accurately. It could be that BOINC doesn't do well with a lot of variability and keeps trying to find consistency.
14) Message boards : Number crunching : Very Long WU's (Message 74353)
Posted 4 Oct 2022 by AndreyOR
Post:
I was commenting more generally about the BOINC credit system in that post. As far as the discrepancies you noticed... that's partly what I was referring to in an earlier post when saying that credit inconsistencies are not unusual with BOINC when the estimated computation size of tasks changes all of a sudden or is variable. It usually averages itself out over time, though. I believe the inconsistencies are due to BOINC trying to average things out, but it's not good at that and takes a long time. The greater and more frequent the computation-size variability between tasks, the longer it takes BOINC to average things out. I think it's weeks, not days. I don't think BOINC is trying to short users on credit; it's just trying to adapt to the changes, and it's not good at doing that. I've seen or read about this with projects that rely at least in part on BOINC's default system, like Rosetta, LHC, and MilkyWay. Other projects like Universe, CPDN, and Einstein have a set credit per completed task (or per trickle in the case of CPDN), which also varies by sub-project. That's why you haven't seen it on Einstein, for example.

We also just got an explanation from Tom as to why N-body tasks started taking so long. It's helpful to know that there's a scientific reason and not just glitchy or bad tasks.
15) Message boards : News : Server Maintenance 12:00 PM ET (16:00 UTC) 9/23/2022 (Message 74347)
Posted 4 Oct 2022 by AndreyOR
Post:
Yeah, in general it's usually best to just keep crunching and let the admins decide if something needs to be done server-side, like turning off the task generator, for example. The validator and unsent-tasks queues can feed each other and make it look like no progress is being made when it is; it's just not yet visible.

I contribute to various projects but sometimes focus on one at a time as well. I assume that's what you're doing too. I'd just hope that people wouldn't stop contributing because the project is experiencing some difficulties.
16) Message boards : Number crunching : Very Long WU's (Message 74345)
Posted 4 Oct 2022 by AndreyOR
Post:
It's my understanding that the BOINC credit calculation system has been criticized for years. The same glitches and oddities apply to everyone within a project, though, so it's only useful to compare within a project, never between projects. Even total BOINC credit across all projects is pretty useless for any comparison.

It's human nature to want some kind of metrics and rewards. I think there are simpler and better systems for doing this, though. The most important metric, I'd say, is the number of tasks completed; that's what matters most to the projects. A tally could be kept (for each sub-project) and a badge awarded for each given number of tasks completed. I'd argue that it's much more useful and meaningful to know that one has completed 1,000 tasks than that one got 1 million points, for example.
17) Message boards : Number crunching : Very Long WU's (Message 74339)
Posted 3 Oct 2022 by AndreyOR
Post:
May I suggest that we don't worry too much about the variations in runtimes, credit, etc. and just crunch. We can ask questions and speculate, like here: https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4930&postid=74329, but let's not give up, temporarily or permanently. Credit inconsistencies are not unusual with BOINC when the estimated computation size of tasks changes or is variable. It'll average itself out, though. If one is concerned, check the average credit per runtime or CPU time for a bunch of earlier tasks and compare it to the same number of current tasks; I bet they'll be similar even if the variability between tasks seems large right now. Anyone who has contributed to LHC, for example, knows that all of its subprojects have highly variable tasks, but the credit tends to be pretty consistent on average.

The disk crash a few months ago was a painful time for the project, but everything worked out: nothing was lost and everyone got their credit. It just took time, but this is a long-term project, so in the end it was just one of the hurdles to overcome. This (the high validation and tasks queues) is nothing close to that. The project does seem to have some server issues, but there are plans to replace the server soon, as mentioned in the forums. The longer runtimes are likely due to scientific reasons, as speculated in the post I linked above.

Some may already know this, but the project people who are most likely to watch and post on the forums are PhD students who are very busy with much higher priorities. We'll eventually get the info and things will eventually get fixed; it'll just take time. I'd encourage everyone to stay the course and crunch regardless of what's happening. That's what's most helpful to the project, and we'll get our credits and badges, even if it sometimes takes a bit of time.
18) Message boards : News : Server Maintenance 12:00 PM ET (16:00 UTC) 9/23/2022 (Message 74338)
Posted 3 Oct 2022 by AndreyOR
Post:
Making changes to app_config usually doesn't make tasks error out, as most of the entries affect BOINC only (not the project app) and incorrect formatting or syntax is usually ignored. However, there is one optional entry that affects the project app, cmdline (where --nthreads is usually the argument), and it can cause tasks to crash if the format or syntax is incorrect. That's what made your tasks crash. Here's the error log from one of the tasks:
Argument parsing error: --nthreads>2: unknown option
Failed to read arguments

The correct syntax is --nthreads x, where x is the number of threads you want to use. Having said that, the optional cmdline --nthreads ... entry is unnecessary for MilkyWay N-body, as avg_ncpus does the job for both BOINC and the project app.
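
For reference, if you did want to keep a cmdline entry, a correctly formatted app_version section would look something like this (the 4 threads here are just an illustration):
<app_version>
   <app_name>milkyway_nbody</app_name>
   <plan_class>mt</plan_class>
   <avg_ncpus>4</avg_ncpus>
   <cmdline>--nthreads 4</cmdline>
</app_version>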

Here's an app_config that has N-body use 4 threads per task and runs 2 separation GPU tasks simultaneously.
<app_config>
   <app>
      <name>milkyway</name>                  <!-- Separation (GPU) -->
      <gpu_versions>
          <gpu_usage>.5</gpu_usage>          <!-- 2 tasks per GPU -->
          <cpu_usage>.9</cpu_usage>
      </gpu_versions>
   </app>
   <app_version>
      <app_name>milkyway_nbody</app_name>    <!-- N-body (multithreaded CPU) -->
      <plan_class>mt</plan_class>
      <avg_ncpus>4</avg_ncpus>               <!-- 4 threads per task -->
   </app_version>
</app_config>
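
In case it's not obvious: app_config.xml goes in the MilkyWay project folder inside the BOINC data directory, and BOINC picks up changes via Options -> Read config files in BOINC Manager (or a client restart).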

One reason that only one N-body task is running could be how you allocated resources to BOINC itself and to the various projects. BOINC uses that info to determine how many tasks of which project to run and when. max_concurrent is only a limiter; it won't force BOINC to run a certain number of tasks.

I'd suggest not worrying about runtimes. The credit per unit of runtime is pretty much the same regardless of how long a task takes. It's perfectly fine to run N-body on 1 or 2 cores. Too many cores will actually make things less productive. I, for example, found that 4 cores per task gives the best tasks/hour rate, and anything above about 9 cores is no better than, and even worse than, 2 cores.

May I suggest you don't give up on CPU tasks, or GPU ones for that matter. Just take the time to figure out the resource allocation and to make a valid app_config. The high validation queue will come down in due time. The current issue is nothing close to the aftermath of the disk crash a few months ago, and in the end everything straightened out: no tasks were lost and everyone got their credit. This will be no different. Users leaving is very likely worse for the project, as that means fewer PCs to clear out the validation and the high N-Body queue, which just means that everything will take longer.
19) Message boards : News : Server Maintenance 12:00 PM ET (16:00 UTC) 9/23/2022 (Message 74332)
Posted 1 Oct 2022 by AndreyOR
Post:
N-Body can only use a maximum of 16 cores per task, so you'll have to reduce the number of cores per task and run multiple tasks simultaneously. Unless modified, it'll by default use all available cores, up to 16. I know some people do it, but I've never worried about leaving a core free for the GPU. It doesn't seem to make a difference, although I've never done a more detailed test.

Additionally, your app_config looks incorrect and is likely being ignored. Check here for the correct format and syntax of the file: https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration. If you're just trying to modify N-Body, "milkyway" is not the right name; you'd need to use the app_version section of app_config, along the lines of the sketch below.
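
As a rough sketch (the 4 cores per N-Body task is just an example; set avg_ncpus to whatever suits your machine), an app_config.xml that only touches N-Body might look like:
<app_config>
   <app_version>
      <app_name>milkyway_nbody</app_name>
      <plan_class>mt</plan_class>
      <avg_ncpus>4</avg_ncpus>
   </app_version>
</app_config>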
20) Questions and Answers : Unix/Linux : boic sees 2 GPU's but only uses 1 (Message 74251)
Posted 23 Sep 2022 by AndreyOR
Post:
It seems like you have an older version of BOINC; try updating to the latest one, and perhaps also check that your GPU drivers are up to date. Try shutting down BOINC, deleting coproc_info.xml, and having BOINC recreate it on restart, since it seems to be reporting your GPUs to the website incorrectly.

The explanation and format of cc_config.xml and app_config.xml are here: https://boinc.berkeley.edu/wiki/Client_configuration. They're the same for any OS. I'm assuming the files are in the correct places and that you restart BOINC after changing them. To use multiple GPUs, it seems you only need the use_all_gpus flag in cc_config (see the sketch below). I believe app_config only controls how many tasks of a given app (in general) or app version (more specifically) run concurrently; I don't think it controls the number of GPUs to use. Also, I think that if you have multiple copies of app_version for the same app_name and plan_class in app_config, the last one will just override the first.

I don't think you can distinguish different GPUs of the same plan_class without going the complicated route of an Anonymous Platform setup: https://boinc.berkeley.edu/wiki/Anonymous_platform. That would also be the last-resort option to try for multiple-GPU usage.
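
In case it helps, a minimal cc_config.xml sketch with that flag set (this goes in the BOINC data directory, not a project folder):
<cc_config>
   <options>
      <use_all_gpus>1</use_all_gpus>
   </options>
</cc_config>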

