Welcome to MilkyWay@home

Very Long WU's

JIM

Joined: 21 Jul 09
Posts: 4
Credit: 5,717,780
RAC: 652
Message 74181 - Posted: 14 Sep 2022, 20:37:40 UTC

What’s with the sudden influx of long de_nbody work units? I have an 8-core machine and am used to seeing WU’s that complete in 6 or 8 minutes running on all cores at once. Suddenly I am getting large numbers of WU’s with run times of 5 to 8 hours. That’s quite a change. What gives?

alanb1951

Joined: 16 Mar 10
Posts: 210
Credit: 105,926,123
RAC: 25,164
Message 74183 - Posted: 15 Sep 2022, 1:34:45 UTC - in response to Message 74181.  

What’s with the sudden influx of long de_nbody work units? I have an 8-core machine and am used to seeing WU’s that complete in 6 or 8 minutes running on all cores at once. Suddenly I am getting large numbers of WU’s with run times of 5 to 8 hours. That’s quite a change. What gives?
Someone else raised the same point in the thread "NBody tasks taking much longer ..." just under a week ago. And I've been noticing these longer tasks since the middle of last month...

We'd need the project scientist to give a proper explanation, but the long-running tasks admit to being long-running before they start, and seem to get credited accordingly so it looks like expected behaviour for certain work units!

Cheers - Al.

Septimus

Joined: 8 Nov 11
Posts: 205
Credit: 2,893,161
RAC: 375
Message 74323 - Posted: 30 Sep 2022, 10:59:53 UTC - in response to Message 74183.  

Same here… WU’s that took 4 minutes now take at least 4 hours, whether 4, 6, or 8 cores are allocated. Estimated run times bear no relation to reality. I am running my pile down as the times are unpredictable. I may even abort the last bunch. If it takes as long to validate, then no wonder the waiting-validation pile is hardly moving.

Dave Studdert

Joined: 26 Mar 09
Posts: 2
Credit: 20,799,320
RAC: 4,466
Message 74324 - Posted: 30 Sep 2022, 12:35:17 UTC

Yeah, some of these CPU units are taking crazy amounts of time. One is running on 6 cores and it's 9 hours in with only 27% complete on a 4.5 GHz CPU.

Septimus

Joined: 8 Nov 11
Posts: 205
Credit: 2,893,161
RAC: 375
Message 74325 - Posted: 30 Sep 2022, 13:26:10 UTC - in response to Message 74324.  

Same here. I have an Intel i7, and 2 tasks are running on 4 CPUs each; they say 7 hours to go after running for 2 hours so far.

Septimus

Joined: 8 Nov 11
Posts: 205
Credit: 2,893,161
RAC: 375
Message 74326 - Posted: 30 Sep 2022, 19:59:19 UTC - in response to Message 74325.  
Last modified: 30 Sep 2022, 20:07:23 UTC

Same here. I have an Intel i7, and 2 tasks are running on 4 CPUs each; they say 7 hours to go after running for 2 hours so far.


That’s my Nbody WU’s finished at last, one took over 96,000 CPU seconds, the other over 107,000 CPU sec over 4 CPU’s.

Apart from anything else credits are not consistent.

WU producing 61707 CPU seconds got 3939 credits

WU producing 96940 CPU seconds got 2139 credits

WU producing 52276 CPU seconds got 2228 credits

I will not be processing any more of these. Sorry.
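
For anyone who wants to put a number on the spread, here is a minimal Python sketch that works out credit per 1,000 CPU seconds for the three results listed above. The function name is purely illustrative and the figures are the ones quoted in this post; nothing here reflects how the server actually grants credit.

```python
def credit_per_kilosecond(cpu_seconds, credit):
    """Return credit granted per 1,000 CPU seconds."""
    return credit / cpu_seconds * 1000.0


# (CPU seconds, credit) pairs as quoted in the post above.
results = [(61707, 3939), (96940, 2139), (52276, 2228)]

rates = [credit_per_kilosecond(cpu, credit) for cpu, credit in results]
for (cpu, credit), rate in zip(results, rates):
    print(f"{cpu:>6} CPU s, {credit} credits: {rate:.1f} credits per 1,000 CPU s")

# Ratio between the best-paid and worst-paid task.
spread = max(rates) / min(rates)
print(f"best/worst ratio: {spread:.2f}")
```

On these three results the best-paid task earns close to three times as much per CPU second as the worst-paid one.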

San-Fernando-Valley

Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 74330 - Posted: 1 Oct 2022, 6:58:03 UTC - in response to Message 74326.  

I can confirm the inconsistencies.

Columns are run time (sec), CPU time (sec), and credit.

PC #1:

8,650.77___57,509.30_____2,036.67

7,802.28___51,971.95_____4,544.88

Both ran on the same PC, same number of CPUs, nothing else running.
-------------------------------------------------------------------------------------

PC #2:

10,502.32___37,127.45_____3,211.39

13,434.64___49,270.17_____2,041.10

Both ran on the same PC, same number of CPUs, nothing else running.
-------------------------------------------------------------------------------------

PC #3:

458.05___1,426.72_____54.40

462.13___1,433.34_____51.95

Both ran on the same PC, same number of CPUs, nothing else running.
-------------------------------------------------------------------------------------

Hope there is a sensible explanation for this!

Maybe we should open a new thread for this topic?

Septimus

Joined: 8 Nov 11
Posts: 205
Credit: 2,893,161
RAC: 375
Message 74335 - Posted: 2 Oct 2022, 15:39:21 UTC - in response to Message 74330.  
Last modified: 2 Oct 2022, 15:52:27 UTC

Looking back at old screenshots I took in April, the Nbody Simulation average was 0.12 or 0.13 of an hour, about 7 minutes. Now the average is over 2 hours, with some ridiculous maximums at 130 hours plus. Something has really changed. Also, the awaiting-validation count seems to be hovering between 1.69 and 1.80 million and has been like that for the last few weeks. The last Nbody’s I did mostly failed to complete and were timed out, not surprisingly.

San-Fernando-Valley

Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 74336 - Posted: 2 Oct 2022, 18:00:39 UTC - in response to Message 74335.  
Last modified: 2 Oct 2022, 18:04:30 UTC

... The last Nbody’s I did mostly failed to complete and were timed out, not surprisingly.

Mine are doing fine.
On the "slowest" PC run times are around max 15 hours.
On the others around max 3 hours.

I wonder why Tom doesn't respond to this "problem" of very long runtimes.
Something must have changed - most likely the amount of input data?

I think this will "scare off" a lot of crunchers. We are already down to around 2000 users.
Not to mention the (long) response times of the "homepage" ...

AndreyOR

Joined: 13 Oct 21
Posts: 44
Credit: 225,162,808
RAC: 7,665
Message 74339 - Posted: 3 Oct 2022, 4:46:23 UTC
Last modified: 3 Oct 2022, 4:48:08 UTC

May I suggest that we don't worry too much about the variations that happen with runtimes, credit, etc. and just crunch. We can ask questions, speculate, like here: https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4930&postid=74329 but let's not give up, temporarily or permanently. Credit inconsistencies are not unusual with BOINC when estimated computation size of tasks changes or is variable. It'll average itself out though. If one is concerned, check the average credit per runtime or cpu time of a bunch of tasks before and compare to the same bunch of current tasks, I bet they'll be similar even if variability between tasks seems large right now. Anyone who has contributed to LHC, for example, knows that all of their subprojects have highly variable tasks but the credit tends to be pretty consistent on average.

The disk crash a few months ago was a painful time for the project, but everything worked out: nothing was lost and everyone got their credit. It just took time, but this is a long-term project, so in the end it was just one of the hurdles to overcome. This (high validation and task queues) is nothing close to that. The project does seem to have some server issues, but there are plans to replace the servers soon, as mentioned in the forums. Longer runtimes are likely due to scientific reasons, as speculated in the post I linked above.

Some may already know this but project people who're most likely to watch and post on the forums are PhD students and are very busy with much higher priorities. We'll eventually get the info and things will eventually get fixed, it'll just take time. I'd encourage everyone to just stay the course and crunch regardless of what's happening. That's what's most helpful to the project and we'll get our credits and badges, even if sometimes it takes a bit of time.
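
The "check the average credit per runtime" comparison suggested above can be sketched in a few lines of Python. This is only an illustration: the helper name is made up, and the sample data are the two PC #1 results posted earlier in this thread, assuming the first figure in each row is run time in seconds and the last is credit. To compare periods, run the same function over a batch of older tasks and a batch of current ones.

```python
def credit_per_hour(tasks):
    """Total credit divided by total run time (hours) for (run_seconds, credit) pairs."""
    total_seconds = sum(run_seconds for run_seconds, _ in tasks)
    total_credit = sum(credit for _, credit in tasks)
    return total_credit / (total_seconds / 3600.0)


# The two PC #1 results from earlier in the thread: (run time in s, credit).
pc1_tasks = [(8650.77, 2036.67), (7802.28, 4544.88)]

# Per-task rates vary a lot; the batch rate smooths that out.
per_task = [credit_per_hour([task]) for task in pc1_tasks]
batch = credit_per_hour(pc1_tasks)

print("per task:", [round(rate) for rate in per_task])
print("batch   :", round(batch))
```

Summing credit and run time before dividing weights long tasks more heavily than short ones, which is usually what you want when asking what a machine earns per hour of crunching.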

San-Fernando-Valley

Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 74341 - Posted: 3 Oct 2022, 5:22:19 UTC - in response to Message 74339.  

+1
--------------------------------------------------------------------------------------------------

I still don't understand the interesting situations, where tasks that "run"
longer receive less credits.
--------------------------------------------------------------------------------------------------

I personally am not keen about credits - OK, they're nice, but I always thought
one does (should do) crunching for the pure sense of doing something
good for science.
Sort of like contributing to the understanding of our world/universe - etc..
--------------------------------------------------------------------------------------------------

Cheers

AndreyOR

Joined: 13 Oct 21
Posts: 44
Credit: 225,162,808
RAC: 7,665
Message 74345 - Posted: 4 Oct 2022, 14:15:45 UTC - in response to Message 74341.  

It's my understanding that the BOINC credit calculation system has been criticized for years. The same glitches and oddities apply to everyone within a project, though, so it's only useful to compare within a project, never between projects. Even total BOINC credit for all projects is pretty useless for any comparisons.

It's human nature to have some kind of metrics and rewards. I think there are simpler and better systems for doing this though. The most important metric I'd say is number of tasks completed, that's what's most important to the projects. A tally could be kept (for each sub-project) and a badge could be awarded for completing every given number of tasks. I'd argue that it's much more useful and meaningful to know that one has completed 1000 tasks than that one got 1 million points, for example.

San-Fernando-Valley

Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 74348 - Posted: 4 Oct 2022, 14:49:33 UTC - in response to Message 74345.  

Well said.

But, I am/was not comparing between different PCs running Milkyway.
Also not comparing between projects - either on the same PC or a different one.

The three examples I showed earlier, were run on the same PC using Milkyway.
I was trying to show that one task has a lesser run time, but is awarded
more credits than the other task, which ran longer.

I am not aware of any such discrepancies in credit calculations, for example, on Einstein.

Of course, a comparison of credits or run times between projects, on same PC or
different ones, is of no value.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 74351 - Posted: 4 Oct 2022, 15:00:37 UTC

I talk about this a little bit in this thread https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4930#74308, but I think I know what the problem is.

There are combinations of parameters (such as very dense dwarf galaxies) that cause the simulation to run for a long time. This is usually because the timestep resolution that you need to accurately simulate those systems is very small, so the simulation may choose to run 10,000 timesteps for very dense systems, but only 1,000 timesteps for a less dense system. Timesteps all take roughly the same amount of time to run, so in this example that would be a 10x increase in the time it would take to crunch that simulation.

Eric had implemented a system that avoided parameters that would cause very long runtimes. Specifically, if your client calculated that you needed a very large number of timesteps for a simulation, that workunit would abort and move on to the next task. These very dense dwarf galaxies are not realistic, so we don't lose any scientific value by not running them.

This was working for a while, but I am seeing N-body workunits that have very dense cores in the results pool. Something must have gotten changed when Eric made changes recently. I have made the team aware of this problem and we'll work on fixing it.
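
The scaling argument above can be written out as a rough Python sketch. It assumes, as the post says, that every timestep costs about the same; the per-timestep cost of 0.5 seconds is a made-up number for illustration, not a measured value from the N-body client.

```python
def estimated_runtime_hours(n_timesteps, seconds_per_timestep):
    """Runtime estimate assuming each timestep costs roughly the same."""
    return n_timesteps * seconds_per_timestep / 3600.0


SECONDS_PER_TIMESTEP = 0.5  # hypothetical cost per timestep, for illustration only

dense = estimated_runtime_hours(10_000, SECONDS_PER_TIMESTEP)   # very dense dwarf galaxy
sparse = estimated_runtime_hours(1_000, SECONDS_PER_TIMESTEP)   # less dense dwarf galaxy

print(f"dense: {dense:.2f} h, sparse: {sparse:.2f} h, ratio: {dense / sparse:.0f}x")
```

Whatever the real per-timestep cost is, it cancels out of the ratio, so ten times the timesteps means roughly ten times the runtime.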

Septimus

Joined: 8 Nov 11
Posts: 205
Credit: 2,893,161
RAC: 375
Message 74352 - Posted: 4 Oct 2022, 16:53:30 UTC - in response to Message 74351.  

Thanks for the update Tom😁

AndreyOR

Joined: 13 Oct 21
Posts: 44
Credit: 225,162,808
RAC: 7,665
Message 74353 - Posted: 4 Oct 2022, 19:58:59 UTC - in response to Message 74348.  

I was commenting more generally about the BOINC credit system in that post. As for the discrepancies that you noticed, that's partly what I was referring to in an earlier post by saying that credit inconsistencies are not unusual with BOINC when the estimated computation size of tasks changes all of a sudden or is variable. It usually averages itself out over time, though. I believe the inconsistencies are due to BOINC trying to average things out, but it's not good at doing that and takes a long time. The greater and more frequent the computation size variability (between tasks), the longer it takes BOINC to average things out. I think it's more like weeks than days. I don't think BOINC is trying to short users of credit; it's just trying to adapt to the changes, and it's not good at doing that. I've seen or read about this with projects that rely at least in part on the BOINC default system, like Rosetta, LHC, and MilkyWay. Other projects like Universe, CPDN, and Einstein have a set credit per task (or trickle, in the case of CPDN) completed, which also varies by sub-project. That's why you haven't seen it in Einstein, for example.

We also just got an explanation from Tom as to why N-body tasks started to take so long. It is helpful to know that there's a scientific reason and that these are not just glitchy or bad tasks.

mikey

Joined: 8 May 09
Posts: 3319
Credit: 520,303,301
RAC: 20,458
Message 74355 - Posted: 5 Oct 2022, 10:20:59 UTC - in response to Message 74345.  

It's my understanding that the BOINC credit calculation system has been criticized for years. The same glitches and oddities apply to everyone within a project, though, so it's only useful to compare within a project, never between projects. Even total BOINC credit for all projects is pretty useless for any comparisons.

It's human nature to have some kind of metrics and rewards. I think there are simpler and better systems for doing this though. The most important metric I'd say is number of tasks completed, that's what's most important to the projects. A tally could be kept (for each sub-project) and a badge could be awarded for completing every given number of tasks. I'd argue that it's much more useful and meaningful to know that one has completed 1000 tasks than that one got 1 million points, for example.


Try running the WUProp project then. It counts the hours your PCs put in, on both CPU cores and the GPU, which is more like your last sentence.

https://wuprop.boinc-af.org

It runs as an NCI (non-CPU-intensive) task, using about 0.25 of a CPU core, and you should run it on everything that crunches BOINC tasks so you get your hours counted and earn badges for reaching different milestones. The project also has a forum section letting you know when new apps are starting or restarting after being off for a while.

https://wuprop.boinc-af.org/forum_thread.php?id=351

mikey

Joined: 8 May 09
Posts: 3319
Credit: 520,303,301
RAC: 20,458
Message 74356 - Posted: 5 Oct 2022, 10:25:54 UTC - in response to Message 74353.  

I was commenting more generally about the BOINC credit system in that post. As for the discrepancies that you noticed, that's partly what I was referring to in an earlier post by saying that credit inconsistencies are not unusual with BOINC when the estimated computation size of tasks changes all of a sudden or is variable.


Credit inconsistencies come in when a project doesn't assign a fixed credit for each task. Yes, that has its own problems, like now, when some tasks run much longer than other tasks. The variable credit metric used is complicated, but it is a lot better than the 'CreditNew' system that comes built into the server side of BOINC.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 74357 - Posted: 5 Oct 2022, 13:45:11 UTC

Went through and looked at the N-body tasks yesterday. We have a system that throws out tasks if they have more than 150k timesteps (the number of timesteps is determined by how long the simulation needs to evolve, as well as how dense the dwarf galaxy is). It turns out that the current N-body runs have optimized to a point where the number of timesteps is very close to 150k - we calculated the number of timesteps for an arbitrary WU, and it was 147,500 timesteps.

Luckily, that means that the length of N-body tasks at the moment isn't because of a glitch. Everything is working as intended. The bad news is that there isn't any way to shorten the N-body simulations, unless we wanted to release a new client with a different timestep limit and put up new runs.

You may also see many of your N-body tasks recently have only taken a few seconds to run - that happens when the simulation calculates that it needs more than 150k timesteps.
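
As a rough illustration of the cutoff described above, here is a small Python sketch. The 150,000 limit and the 147,500 example come from this post; the function names are hypothetical, and the real check lives inside the N-body client, whose exact form is not shown in this thread.

```python
MAX_TIMESTEPS = 150_000  # limit quoted in this post


def will_run(n_timesteps, limit=MAX_TIMESTEPS):
    """True if the task stays within the limit; tasks above it are thrown out."""
    return n_timesteps <= limit


def fraction_of_limit(n_timesteps, limit=MAX_TIMESTEPS):
    """How close a task sits to the cutoff, as a fraction of the limit."""
    return n_timesteps / limit


example = 147_500  # the arbitrary WU mentioned above
print(will_run(example), f"{fraction_of_limit(example):.1%}")
print(will_run(150_001))
```

A workunit at 147,500 timesteps passes the check while sitting at about 98% of the limit, which is why it runs for a long time instead of exiting after a few seconds.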

nairb

Joined: 17 Feb 09
Posts: 24
Credit: 3,432,392
RAC: 68
Message 74358 - Posted: 5 Oct 2022, 22:33:22 UTC - in response to Message 74357.  

You may also see many of your N-body tasks recently have only taken a few seconds to run - that happens when the simulation calculates that it needs more than 150k timesteps.

Yup, just had one of those. Lasted all of 1 second.
The other w/u had a claimed runtime of about 1hr 40 mins before starting but took 19hrs 43 mins to complete. I am hoping some of the other w/u will be shorter, otherwise some w/u will not meet the deadline.
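
To see how far off that estimate was, here is a quick Python sketch using the two durations quoted above. The variable names are illustrative only.

```python
from datetime import timedelta

estimated = timedelta(hours=1, minutes=40)   # runtime claimed before the task started
actual = timedelta(hours=19, minutes=43)     # runtime it actually needed

# Dividing one timedelta by another gives a plain float.
overrun_factor = actual / estimated
print(f"actual runtime was {overrun_factor:.1f}x the estimate")
```

Multiplying the estimates of the remaining queued tasks by a factor like this gives a rough idea of whether they can still meet their deadlines.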

©2024 Astroinformatics Group