Welcome to MilkyWay@home

News General

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,033,605
RAC: 35,365
Message 73090 - Posted: 24 Apr 2022, 3:07:20 UTC - in response to Message 73086.  

I am sure you all remember the w/u =7 tasks that caused invalidations. Well, this morning I encountered the son of w/u=7. Its signature is w/u=5 and it causes validation errors. Que paso?

Yes, I vaguely remember some vinegar associated with that. Can you please refresh us on how to find these and exterminate them? Actually, do we need to exterminate them? Or will the 'system' do that for us?

I'm only a user, not a BOINC SysAdmin, but I'll have a go at this...

To address the "extermination" question first - I believe there is some code in the existing validator that spots results that don't have the expected number of parameters and invalidates them (it is potential bad science, after all!) I would hope that it also flags the entire BOINC workunit as bad so no further retries should be sent, but without looking at the code I can't be sure!

Regarding Frank's post: there's more than one reason a result can be declared invalid, so without being able to see the output from the Invalid task(s) it isn't possible to say whether there is a recurrence of that "too many WUs in a shipped task" problem. Current tasks have number_WUs = 5 and number_params_per_WU = 20, so that 5 isn't an issue!

In order to satisfy my curiosity on this, I went via Frank's profile to find his computers to see if I could spot the offending task(s). I found a relevant(*) Invalid task on two of the systems. In each case it was declared a Validation error because it had failed to calculate the likelihoods for the first of the five jobs in the task. Judging by the content of the result report, it appeared to try to start doing the first job twice and got confused about the state of the checkpoint file -- perhaps something interrupted BOINC whilst a checkpoint was being taken? The validator would spit this out at once because it couldn't find all the likelihood data!

I think it would be interesting to know whether the retry/retries also fail to validate - if they fail, then there's possibly a problem in the parameters, but if they don't fail the issue was specific to the computer(s) in question, not a work unit error.

Hope this helps clear things up a bit...

Cheers - Al.

* There were also some "completed, couldn't validate" tasks, and some orphaned invalid tasks from the big renumbering crash of late January 2021.


Not sure I got it, but to paraphrase Curly of the 3 stooges, "I'm tryin' to think, but nothin's happening!"

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,458,206
RAC: 36,287
Message 73091 - Posted: 24 Apr 2022, 6:40:22 UTC - in response to Message 73086.  

@Frank

Further to the post a couple of places above...

One of the two Invalid results mentioned now has three wingmen, all of whom validated successfully! Fortunately, one of them was another Windows CPU task, which allowed an easy comparison of the results (result structure being the same).

Ignoring the bit about Lua script errors (which everyone gets!), the successful task had this for the first "job"...
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 5 </number_WUs>
<number_params_per_WU> 20 </number_params_per_WU>
Using SSE4.1 path
Integral 0 time = 818.259082 s
Running likelihood with 47431 stars
Likelihood time = 1.152013 s
<background_integral> 0.000395229940402 </background_integral>
<stream_integral>  3.068682680690464  107.536654276206780  90.712861191606876 </stream_integral>
<background_likelihood> -6.591808580671233 </background_likelihood>
<stream_only_likelihood>  -72.128164401557285  -2.892107495485999  -13.431452465675566 </stream_only_likelihood>
<search_likelihood> -2.774016998251114 </search_likelihood>

whilst the invalid one has this...
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 5 </number_WUs>
<number_params_per_WU> 20 </number_params_per_WU>
Using SSE4.1 path
Failed to find header in checkpoint file
Failed to read state
Failed to calculate likelihood

Both seemed to produce numerically identical results for the remaining four items, so whatever caused the issue for the first job didn't spill over into the others... As I don't run Windows, and don't do Separation CPU tasks, I can't really offer any explanation :-(
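
For anyone wanting to repeat this comparison on their own results, here is a minimal Python sketch. It assumes (this is not confirmed by the project code) that each job's section in the task output begins with the "Switching to Parameter File" line, and that a job counts as complete when a <search_likelihood> entry follows it. The function name likelihoods_per_job is made up for illustration.

```python
import re

# Assumption for this sketch: a finished job reports a <search_likelihood> value.
LIKELIHOOD_TAG = re.compile(
    r"<search_likelihood>\s*(-?\d+(?:\.\d+)?)\s*</search_likelihood>"
)
# Assumption: this line opens each job's section, as in the output posted above.
JOB_START = "Switching to Parameter File"


def likelihoods_per_job(task_output):
    """Return one entry per job: the search likelihood as a float, or None if missing."""
    # Everything before the first marker is preamble, so drop the first chunk.
    sections = task_output.split(JOB_START)[1:]
    results = []
    for section in sections:
        match = LIKELIHOOD_TAG.search(section)
        results.append(float(match.group(1)) if match else None)
    return results


# Shortened copies of the two first-job reports quoted above.
good_output = """Switching to Parameter File 'astronomy_parameters.txt'
<search_likelihood> -2.774016998251114 </search_likelihood>
"""
bad_output = """Switching to Parameter File 'astronomy_parameters.txt'
Failed to find header in checkpoint file
Failed to read state
Failed to calculate likelihood
"""

good_jobs = likelihoods_per_job(good_output)
bad_jobs = likelihoods_per_job(bad_output)
```

A None entry marks a job with no likelihood data, which is the situation described for the invalid result.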

At least it doesn't appear to be the return of the ill-formed tasks!

Cheers - Al.

Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 73094 - Posted: 24 Apr 2022, 15:09:56 UTC

5 w/u the son of 7 w/u. Today, I encountered two that invalidated the tasks. I think the 5 w/u corruption is a sickness (a variant of the 7 w/u disease) that can be present in untold numbers of tasks yet to be run. Yes, it's a pandemic. It has to be corrected.
And no, there was no computer error, except those errors committed by the computer that built the tasks.

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,458,206
RAC: 36,287
Message 73098 - Posted: 25 Apr 2022, 0:42:53 UTC - in response to Message 73094.  

5 w/u the son of 7 w/u. Today, I encountered two that invalidated the tasks. I think the 5 w/u corruption is a sickness (a variant of the 7 w/u disease) that can be present in untold numbers of tasks yet to be run. Yes, it's a pandemic. It has to be corrected.
And no, there was no computer error, except those errors committed by the computer that built the tasks.

I notice that the two newly invalidated tasks appear to have the same issue I remarked upon above. I also notice that there is a retry in progress for both of them at the time I composed this message...

Until enough wingmen have responded and also failed for one reason or another, there's no proof that there's something wrong with the BOINC work unit from which the tasks are built. And, if the reason for failure is to do with the wingman's system, that says nothing about the actual tasks as the limit on errors is set at 2 results so any other results may get tagged as "Completed, can't validate"...

On the "can't validate" point - I notice that you also got some of those on one of your systems, and in each case there were two wingmen that ended up in an Error state; one of the systems had a GPU that can't do double precision (so no surprise there!) and the other was a Linux system using an OpenCL version observed to cause problems in the past. Your results (and those of the other wingman who couldn't validate on those units) would almost certainly have validated were it not for those two non-task-related errors getting in the way -- processing of those tasks was not helped by the long delay caused by the server problems :-(

And I repeat that these tasks are supposed to be 5 w/u, but I suspect you know that, and it does make for a catchy tag-line! That said, declare a work unit pandemic if and when your wingmen start to show identical symptoms - until then, watch and wait... And if you are genuinely interested in trying to find out why some of your tasks don't handle the first sub-task properly, a thread in the "Number crunching" forum is probably a good idea, rather than pursuing it in a News thread! :-)

Cheers - Al.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73107 - Posted: 25 Apr 2022, 23:45:40 UTC - in response to Message 73082.  

I am sure you all remember the w/u =7 tasks that caused invalidations. Well, this morning I encountered the son of w/u=7. Its signature is w/u=5 and it causes validation errors. Que paso?


WU = 5 is normal; there is currently one test run up that has a bundle size of 5. All of Jake's WUs had bundle size 5; the only reason that mine have 4 is that we are trying to fit more streams per stripe than Jake was.

Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 73631 - Posted: 21 May 2022, 17:03:41 UTC

I have a simple question. Are we done here or has MW cratered? Well, maybe MW is just over-subscribed and doesn't need all the computing power they have available.
MW seems hopelessly befuddled. The supply of runnable tasks is sporadic, at best. Validation is a bad joke, all the validation required copies of tasks don't get sent. Tasks error out for no response after 2 minutes in the client. What kind of a project plan could tolerate 8 million unsent N-Body tasks? Why are there 0 Separation tasks, especially on a weekend? Talk about erratic, what about the servers? I could continue this list but it isn't necessary.
There are a ton of problems. There is no money to generate fixes. So, are we done here?

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,033,605
RAC: 35,365
Message 73632 - Posted: 21 May 2022, 18:24:31 UTC - in response to Message 73631.  
Last modified: 21 May 2022, 18:29:04 UTC

Are we done here or has MW cratered? I hope not. I think there is good work to be done here. I recall from Prof. Heidi's video that she would like to run a bunch of simultaneous star streams, similar to the Orphan-Chenab stream in the video.

Well, maybe MW is just over-subscribed and doesn't need all the computing power they have available. Good question. Been wondering that myself.

MW seems hopelessly befuddled. That, I think, goes to leadership, and available resources. I don't think RPI management is willing to devote what it takes, in manpower, money, and compute resources to kick it up a notch or two. Money is very tight right now.

The supply of runnable tasks is sporadic, at best. Yep.

Validation is a bad joke, all the validation required copies of tasks don't get sent. Yep, something funky is going on with the servers. For now, would a periodic reboot of the whole enchilada be appropriate? Say once every 2 days? Once a week?

Tasks error out for no response after 2 minutes in the client. Yep, although I don't think I have seen that in several weeks. Shouldn't happen at all.

What kind of a project plan could tolerate 8 million unsent N-Body tasks? A week or so ago it was at 17 million. I suspect a configuration error in the server side. Plus there was a hard drive failure that was very difficult to recover from.

Why are there 0 Separation tasks, especially on a weekend? Good question again.

Talk about erratic, what about the servers? Funny you should mention that. I was just looking into Dell servers to see what it would cost to R&R the whole lot of 'em for faster performance. (Assuming server performance is the issue. I don't know at this point, but something is going on.)

I could continue this list but it isn't necessary. Maybe you should. Perhaps the active posters here could formulate a list of things we would like to see addressed?

There are a ton of problems. There is no money to generate fixes. So, are we done here?

Good post overall. Glad to see it. +1

P.S. Also, it could be the 25 year old boinc software as well. It might not be up to the large number of volunteers offering compute cycles, on ever faster computers, and now with screaming hot GPU processors.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,950,525
RAC: 21,840
Message 73641 - Posted: 22 May 2022, 2:46:07 UTC - in response to Message 73632.  

Are we done here or has MW cratered? I hope not. I think there is good work to be done here. I recall from Prof. Heidi's video that she would like to run a bunch of simultaneous star streams, similar to the Orphan-Chenab stream in the video.


I sure hope not too!!

Why are there 0 Separation tasks, especially on a weekend? Good question again.


Simply because people go home for the weekend and the staff that do stay aren't there to 'manage' the Servers.

P.S. Also, it could be the 25 year old boinc software as well. It might not be up to the large number of volunteers offering compute cycles, on ever faster computers, and now with screaming hot GPU processors.


Universe was just overwhelmed saying "The problem is now in very large number of concurrent HTTP connections.
Our server handles about 300 simultaneous connections and it is maximum what it can do"

Back in the days of dialup, Seti did a test of how long it took for each PC to request a connection from the Server, the Server to acknowledge that request and open a port, the PC to start sending data and the Server to receive it, and the Server to then say goodbye to the PC and close the port. It was in the 5 second range. Obviously today's connections and Servers are significantly faster, but time and the number of users are not on the Project's side in all this. Also, as you said above, today's computers can get through data significantly faster than just 5 years ago, and GPUs are even faster than that!!
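
As a rough back-of-envelope check on these figures, Little's law says the average number of open connections equals the request rate multiplied by how long each connection stays open. The Python sketch below pairs the Universe figure (about 300 simultaneous connections) with the old Seti figure (about 5 seconds per connection) purely for illustration. The two numbers come from different projects and eras, so the result is not a measurement of any real server, and the function names are hypothetical.

```python
def max_request_rate(max_concurrent, seconds_per_request):
    """Little's law rearranged: the highest sustainable rate in requests per second."""
    return max_concurrent / seconds_per_request


def concurrent_connections(requests_per_second, seconds_per_request):
    """Little's law: average number of connections open at the same moment."""
    return requests_per_second * seconds_per_request


# Illustrative inputs only: 300 connections (Universe quote), 5 s each (Seti dialup test).
ceiling = max_request_rate(300, 5)

# The same server ceiling if each exchange took 1 second instead (an assumed value).
faster_ceiling = max_request_rate(300, 1)
```

The point of the sketch is only that the ceiling scales inversely with how long each connection is held, so faster exchanges raise the sustainable request rate without any change to the connection limit.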

I think they are going to have to separate the cpu and gpu tasks into different Servers, or maybe even the NBody and Separation tasks, and that way each Server can take some of the load that the current Server is obviously being overwhelmed with. As you also said though, that requires money or a donation of hardware that is sufficiently up to date for the Project to accept it.

Years ago I got a Server from my son, who got it from his school, who got it from the US State Dept as a surplus thing for the IT class my son was in. They played with it for the whole school year, then the teacher wanted it just gone, so my son loaded it into his car and brought it to me. It was over 2 feet square and tall and VERY VERY heavy and loaded with scsi drives. After playing with it for over 6 months, including buying new memory and drives for it etc etc, I offered it to Seti when they said they were in desperate need of a Server. I sent them pictures and they said 'thanks but no thanks' and said if I wanted to ship it to them they would figure out something to do with it. I lived on the East Coast and said thanks anyway, so I found a guy at work who wanted to learn how to make the Server software work so he could get promoted. I gave it to him and he loaded it into his trunk and it made the car MUCH lower in the back. He did get the promotion!!

Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 73644 - Posted: 22 May 2022, 15:43:54 UTC - in response to Message 73632.  

I don't think overload of the servers is a problem. Just look at the numbers of users active on a given day. It has been hanging around 6,000 for weeks and the number of active contributors is around 15,000.
However, I have seen the response time of the servers stretch out substantially. I believe the slowness of server response may lie on the Internet. Could be electronic interference or overload of an intermediate server (Internet speed will always be controlled by the slowest server and collisions simply stop transmission until there is a channel open).
On the Errors by "Timed out - no response" yesterday I encountered 27 of them. They did not error in 2 minutes but rather in 2 hours.
If reboot can stabilize server operation I would encourage it. Probably in the dead of night.
I am encouraged by your attitude and intelligence. Maybe the crater can wait. However, the befuddlement fix cannot.

kotenok2000
Joined: 22 May 11
Posts: 66
Credit: 5,635,044
RAC: 46
Message 73645 - Posted: 22 May 2022, 15:47:40 UTC - in response to Message 73644.  

The problem is with the fact that night is relative.
On the other side of the Earth it will be day.

GolfSierra
Joined: 11 Mar 22
Posts: 42
Credit: 21,902,543
RAC: 0
Message 73651 - Posted: 22 May 2022, 20:54:48 UTC - in response to Message 73644.  

befuddlement fix


I love that word - hits the nail right on the head!

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,033,605
RAC: 35,365
Message 73656 - Posted: 23 May 2022, 2:44:05 UTC - in response to Message 73641.  



Universe was just overwhelmed saying "The problem is now in very large number of concurrent HTTP connections.
Our server handles about 300 simultaneous connections and it is maximum what it can do"

I think they are going to have to separate the cpu and gpu tasks into different Servers, or maybe even the NBody and Separation tasks, and that way each Server can take some of the load that the current Server is obviously being overwhelmed with. As you also said though, that requires money or a donation of hardware that is sufficiently up to date for the Project to accept it.

Wow. I had not even considered that. Let's say 10,000 participating computers (a little bit high for now), and they ping every 2 minutes, as Keith does. So 5,000/minute, or 83 pings per second. I have no idea, but it seems like a lot to me. Especially when you dog-pile on type of tasks, quantity to send of each type, ensure no duplicates, receive and sort returned tasks, database updates, yada yada yada, and pretty soon you are at 100% CPU utilization.
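
The estimate above can be written out as a small Python sketch. The host count and the 2 minute interval are the guesses from the post, not measured values, and the function name is hypothetical.

```python
def requests_per_second(hosts, interval_seconds):
    """Average request rate if every host contacts the server once per interval."""
    return hosts / interval_seconds


hosts = 10_000          # guessed number of participating computers
interval_seconds = 120  # each host pings every 2 minutes

per_second = requests_per_second(hosts, interval_seconds)
per_minute = per_second * 60

print(f"{per_minute:.0f} requests per minute, about {per_second:.0f} per second")
```

This is an average only. Real clients do not spread their requests evenly, so short bursts could be well above this figure.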

I'm in violent agreement with ya on splitting up the server load into CPU separation and GPU separation (and that might even be subdivided into AMD, NVIDIA, and Intel types, just because they run so dang fast), and also CPU N-Body.

All this depends on how many simultaneous connections are expected if and when Prof Heidi kicks off the multi stream project.

@ Tom, how many simultaneous connections are you seeing these days?

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,033,605
RAC: 35,365
Message 73657 - Posted: 23 May 2022, 2:50:46 UTC - in response to Message 73641.  

Years ago I got a Server from my son, who got it from his school, who got it from the US State Dept as a surplus thing for the IT class my son was in. They played with it for the whole school year, then the teacher wanted it just gone, so my son loaded it into his car and brought it to me. It was over 2 feet square and tall and VERY VERY heavy and loaded with scsi drives. After playing with it for over 6 months, including buying new memory and drives for it etc etc, I offered it to Seti when they said they were in desperate need of a Server. I sent them pictures and they said 'thanks but no thanks' and said if I wanted to ship it to them they would figure out something to do with it. I lived on the East Coast and said thanks anyway, so I found a guy at work who wanted to learn how to make the Server software work so he could get promoted. I gave it to him and he loaded it into his trunk and it made the car MUCH lower in the back. He did get the promotion!!
Very cool. I tried to give away my old IMSAI 8080. No joy. But it was a beast in its day.

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,033,605
RAC: 35,365
Message 73658 - Posted: 23 May 2022, 3:01:34 UTC - in response to Message 73645.  

The problem is with the fact that night is relative.
On the other side of the Earth it will be day.
Yeah. Good point. Perhaps reboot at the lowest usage point, whatever that is...?

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,033,605
RAC: 35,365
Message 73659 - Posted: 23 May 2022, 3:24:05 UTC - in response to Message 73644.  

I don't think overload of the servers is a problem. Just look at the numbers of users active on a given day. It has been hanging around 6,000 for weeks and the number of active contributors is around 15,000.
However, I have seen the response time of the servers stretch out substantially. I believe the slowness of server response may lie on the Internet. Could be electronic interference or overload of an intermediate server (Internet speed will always be controlled by the slowest server and collisions simply stop transmission until there is a channel open).
On the Errors by "Timed out - no response" yesterday I encountered 27 of them. They did not error in 2 minutes but rather in 2 hours.
If reboot can stabilize server operation I would encourage it. Probably in the dead of night.
I am encouraged by your attitude and intelligence. Maybe the crater can wait. However, the befuddlement fix cannot.

Truth be told, I am soooo captured by this project. I am slowly, very slowly, gathering info on the project characteristics to see if it can be tweaked in some way to support a more robust throughput at the client level. If tweaking won't get us there, then perhaps a campaign to rebuild the whole MW@H infrastructure from the ground up might be in order. That's very extreme. Prolly gonna take a sugar daddy or 2 to fund it. With or without boinc. (I was in the Find-a-drug cancer project lo, these many years ago, and it did not use boinc. I screened 525,000,000 molecules, and found 17 that had anti-cancer properties.) Obviously strong support from RPI will be required too. This is all Blue Sky Smack Talk right now, but could bear fruit down the line somewhere....

GolfSierra
Joined: 11 Mar 22
Posts: 42
Credit: 21,902,543
RAC: 0
Message 73660 - Posted: 23 May 2022, 8:46:50 UTC - in response to Message 73659.  
Last modified: 23 May 2022, 8:47:21 UTC

If tweaking won't get us there, then perhaps a campaign to rebuild the whole MW@H infrastructure from the ground up might be in order. That's very extreme. Prolly gonna take a sugar daddy or 2 to fund it.


Crowd funding could help with raising money for more capable hardware. However, one enthusiastic manager like Tom Donlon is just not enough to handle the rebuild of the MW software. We would need a team of 10 Donlons to do that.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,950,525
RAC: 21,840
Message 73661 - Posted: 23 May 2022, 9:27:04 UTC - in response to Message 73658.  

The problem is with the fact that night is relative.
On the other side of the Earth it will be day.


Yeah. Good point. Perhaps reboot at the lowest usage point, whatever that is...?


The problem is they go home at night and on weekends, their time, so unless they can get someone to come in and shut it down, restart it, and make sure everything is running, that may be a no go for them. Now, they could do it at a fixed time every week, for example, to see if things run more smoothly doing that. Remember, Seti used to shut down every Tuesday for 'maintenance'.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,950,525
RAC: 21,840
Message 73662 - Posted: 23 May 2022, 9:36:19 UTC - in response to Message 73659.  

I don't think overload of the servers is a problem. Just look at the numbers of users active on a given day. It has been hanging around 6,000 for weeks and the number of active contributors is around 15,000.
However, I have seen the response time of the servers stretch out substantially. I believe the slowness of server response may lie on the Internet. Could be electronic interference or overload of an intermediate server (Internet speed will always be controlled by the slowest server and collisions simply stop transmission until there is a channel open).
On the Errors by "Timed out - no response" yesterday I encountered 27 of them. They did not error in 2 minutes but rather in 2 hours.
If reboot can stabilize server operation I would encourage it. Probably in the dead of night.
I am encouraged by your attitude and intelligence. Maybe the crater can wait. However, the befuddlement fix cannot.


Truth be told, I am soooo captured by this project. I am slowly, very slowly, gathering info on the project characteristics to see if it can be tweaked in some way to support a more robust throughput at the client level. If tweaking won't get us there, then perhaps a campaign to rebuild the whole MW@H infrastructure from the ground up might be in order. That's very extreme. Prolly gonna take a sugar daddy or 2 to fund it. With or without boinc. (I was in the Find-a-drug cancer project lo, these many years ago, and it did not use boinc. I screened 525,000,000 molecules, and found 17 that had anti-cancer properties.) Obviously strong support from RPI will be required too. This is all Blue Sky Smack Talk right now, but could bear fruit down the line somewhere....


I think it would be a bad thing if MilkyWay went private and/or non Boinc. We need projects like MW for those who still have eyes on the sky but can't, or don't have the time to, do anything about figuring out what's up there and how our little blue World fits into it. It also helps prove that Science is not just for those in white coats sitting in a lab or with their eyes glued to a screen someplace; it's also for the little guy sitting at home who just wants to feel they can do something about it. Kinda like all the Covid Boinc projects that popped up: people felt like they were a part of the solution, not just going along with the flow wherever it leads.

DGPickett
Joined: 14 Oct 19
Posts: 6
Credit: 3,181,839
RAC: 2,388
Message 74056 - Posted: 14 Aug 2022, 21:06:28 UTC

MilkyWay is violating the BOINC Computing Preferences, running when my Linux Ubuntu 22.04 LTS system is in use.

kotenok2000
Joined: 22 May 11
Posts: 66
Credit: 5,635,044
RAC: 46
Message 74057 - Posted: 14 Aug 2022, 21:08:39 UTC - in response to Message 74056.  

Boinc will do so if you set it to run always instead of run based on preferences.
©2024 Astroinformatics Group