Welcome to MilkyWay@home

Posts by Peter Dragon

1) Message boards : News : Nbody WU Flush (Message 73929)
Posted 26 Jun 2022 by Peter Dragon
Post:
Is this e3 cpu or e5?

24x Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
Why is Boinc reporting it as E3?

Because that's reporting the virtual CPU; it's a VM. The E5-2640 above is the physical CPU being used on the host (answering the other member's question).
2) Message boards : News : Nbody WU Flush (Message 73927)
Posted 26 Jun 2022 by Peter Dragon
Post:
Is this e3 cpu or e5?

24x Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
3) Message boards : News : Nbody WU Flush (Message 73925)
Posted 26 Jun 2022 by Peter Dragon
Post:
It's a single machine: an IBM x3550 M4 1U server running RHEL8/Virt, with 6 VMs (2 CPUs each) running Ubuntu/Docker in a SWARM.

I needed an excuse to learn docker and docker SWARM in February, so this was my excuse to learn it **SHRUGS**
4) Message boards : News : Nbody WU Flush (Message 73906)
Posted 25 Jun 2022 by Peter Dragon
Post:
I ended up doing
docker run --rm --network boinc boinc/client boinccmd_swarm --get_tasks --force
and that looks to have brought down new tasks. Thanks for the replies.

I have also unhidden my nodes, feel free to take a peek.
5) Message boards : News : Nbody WU Flush (Message 73901)
Posted 24 Jun 2022 by Peter Dragon
Post:
All 6 of my nodes are quiet. Are any new bits going out, or are we in a lull?
6) Message boards : Number crunching : Validation inconclusive (Message 73870)
Posted 19 Jun 2022 by Peter Dragon
Post:
Looks like the end of the tunnel in sight, I still have 29 to go but a long way from where it was.
yep. n body well has run dry.


Agreed, all my n body work units have hit the floor :(

7) Message boards : Number crunching : Validation inconclusive (Message 73800)
Posted 5 Jun 2022 by Peter Dragon
Post:
Looks like the N-Body jobs kicked in around 11pm June 3rd. The image is from June 4th.

8) Message boards : Number crunching : Validation inconclusive (Message 73797)
Posted 5 Jun 2022 by Peter Dragon
Post:
N-Body tasks: so far I'm seeing completions and new tasks.
5 Jun 2022, 1:59:02 UTC 5 Jun 2022, 4:30:15 UTC Completed and validated Runtime(s): 1,499.61 Milkyway@home N-Body Simulation v1.82 (mt)
9) Message boards : News : Server Trouble (Message 73732)
Posted 28 May 2022 by Peter Dragon
Post:
Yes Thanks to all who have or are serving. Just 7% of the population have stepped forward and raised our hands and written that blank check.
I shall write one word. Authoritarian.


Of course it's authoritarian, it's the military. A combat soldier can be ordered to do some really counterintuitive things. Without an authoritarian system, those things wouldn't get accomplished. The notion of decision by committee when you've got bullets flying overhead will get you killed.

*shrugs*
10) Message boards : Number crunching : Validation inconclusive (Message 73682)
Posted 24 May 2022 by Peter Dragon
Post:
It's almost 2 days since I've got any task validated. There's over 1.5m workunits waiting for validation, don't think it's going good.


My task validations have been completing as of 24 hours ago.

23 May 2022, 4:07:46 UTC 23 May 2022, 15:43:21 UTC Completed and validated
11) Message boards : Cafe MilkyWay : WCG Friends (Message 73496)
Posted 14 May 2022 by Peter Dragon
Post:

I would guess they would have limited tasks available as they slowly ramp up to full capacity to ensure everything is okay going forward.


That's fine. I could spin up a 4 node cluster and task it to 8 of the Xeon cores. If it ramps up, the power will be there; if not, no loss.
12) Message boards : Cafe MilkyWay : WCG Friends (Message 73464)
Posted 11 May 2022 by Peter Dragon
Post:
Hmm I might have a go at WCG, I have some extra CPU cycles I could throw at it.

:)
13) Questions and Answers : Unix/Linux : BOINC in Docker Swarm (Message 73169)
Posted 29 Apr 2022 by Peter Dragon
Post:
So I run a 6 node (6 VMs running Ubuntu 18.04.6 LTS) Docker Swarm, which hosts 6 docker containers running BOINC on a mature IBM x3550 M4. Today I lost quorum in the cluster, which forced all containers to consolidate onto 1 node. However, I noticed that despite the containers being active and the BOINC service still running in each container, no processing was happening.

First thing I tried was to force a rebalance across the cluster after fixing the issue with the other 5 nodes.

:~# docker service update --force boinc
boinc
overall progress: 6 out of 6 tasks
1/6: running   [==================================================>]
2/6: running   [==================================================>]
3/6: running   [==================================================>]
4/6: running   [==================================================>]
5/6: running   [==================================================>]
6/6: running   [==================================================>]
verify: Service converged


Once the containers were spread evenly across the swarm and rebalanced, I waited but still saw no activity. So I hopped on BOINC Manager, connected to each container, and verified the service and project. I forced an update, but the containers just wouldn't check in.

This is where I learned how picky BOINC running in docker can be. I checked the logs and, come to find out, the consolidation (not the rebalance) actually changed the container names, so they no longer matched what was in the MilkyWay project under "View Computers".

So while annoying, it's not really a big deal. Since it's docker, just trash all 6 containers and redeploy...

:~# docker service rm boinc
boinc

:~# docker service ls
ID        NAME      MODE      REPLICAS   IMAGE     PORTS

:~# docker service create --replicas 6 --name boinc --network=boinc -p xxxx -e BOINC_GUI_RPC_PASSWORD="xxxxxxxxxxxxx" -e BOINC_CMD_LINE_OPTIONS="--allow_remote_gui_rpc" boinc/xxxxxxxxxxxxxx
overall progress: 6 out of 6 tasks
1/6: running   [==================================================>]
2/6: running   [==================================================>]
3/6: running   [==================================================>]
4/6: running   [==================================================>]
5/6: running   [==================================================>]
6/6: running   [==================================================>]
verify: Service converged

Then, once the new containers were deployed, I just had to register them under their new names with MilkyWay@home.

:~# docker run --rm --network boinc boinc/client boinccmd_swarm --passwd xxxxxxxxxxxxx --project_attach http://milkyway.cs.rpi.edu/milkyway/ <insert keys here, no you can't see mine>
===== Client at M.Y.I.P2 =====
===== Client at M.Y.I.P2 =====
===== Client at M.Y.I.P3 =====
===== Client at M.Y.I.P4 =====
===== Client at M.Y.I.P5 =====
===== Client at M.Y.I.P6 =====


Boom, activity seen after a few minutes, and they showed up under "View Computers". After that, it was just a matter of cleanup, by merging the old computer IDs with the new computer IDs.

And yeah, I know it's a complicated and odd way to run BOINC, but idle hands and all... Figured I would share in case anyone else was running BOINC in docker.
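
For anyone who wants the names to stay predictable after a reschedule, here is a minimal sketch, assuming the name change came from Docker generating a fresh hostname for each new container. Docker's service templates allow a per-replica hostname such as `boinc-{{.Task.Slot}}`. The image name `boinc/client` and the defaults below are placeholders taken from this thread, not a tested configuration, and the `-p` and `-e` options from the original command are left out for brevity. The script only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Sketch: build a `docker service create` command that gives each replica a
# predictable hostname through Docker's service template placeholders.
# SERVICE_NAME, REPLICAS and IMAGE are placeholders (assumptions), not a
# verified setup. The -p/-e options from the original command are omitted.

SERVICE_NAME=${SERVICE_NAME:-boinc}
REPLICAS=${REPLICAS:-6}
IMAGE=${IMAGE:-boinc/client}

build_create_cmd() {
    # {{.Task.Slot}} expands to the replica slot number, so the hostnames
    # would be boinc-1 .. boinc-6 regardless of which node runs the task.
    printf '%s ' docker service create \
        --replicas "$REPLICAS" \
        --name "$SERVICE_NAME" \
        --network "$SERVICE_NAME" \
        --hostname "${SERVICE_NAME}-{{.Task.Slot}}" \
        "$IMAGE"
    printf '\n'
}

# Print the command for review instead of running it.
build_create_cmd
```

This only stabilizes the hostname. Whether the project treats a redeployed container as the same computer also depends on the BOINC state kept in the client's data directory, which this sketch does not persist.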

-PD

PS: Please excuse the redacted parts, etc...etc..
14) Message boards : Number crunching : 1 of my 6 nodes, keeps timing out. (Message 73103)
Posted 25 Apr 2022 by Peter Dragon
Post:
So a quick update, as I think I found the cause. I run these docker images pretty tight, as they are running in a virtual environment on virt/KVM. I noticed the one node in question only had 989 MB free, as opposed to the 1.2 GB it normally has. So this past weekend I did a quick cleanup of its installer cache (apt clean), and it now looks like it's functioning as expected.

In the past the behavior I would see was no CPU time being spent, but now after the cleanup I see the CPU churning away. So it hit me: I've been getting hit by the compute preference for "disk" that reads "Leave at least 1 GB free".

So lesson learned, that's on me lol!
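
A quick way to catch this earlier is to compare free space against the preference before tasks stall. The following is a rough sketch, assuming the threshold is 1 GB (counted here as 1024 * 1024 KB) and that the directory passed in lives on the filesystem BOINC writes to. The function names, `BOINC_DIR`, and `MIN_FREE_GB` are made up for illustration and are not BOINC settings.

```shell
#!/bin/sh
# Sketch: warn when free disk space is below a "leave at least N GB free"
# style threshold. BOINC_DIR and MIN_FREE_GB are hypothetical names; point
# BOINC_DIR at the filesystem that holds your BOINC data.

# below_min_free AVAIL_KB MIN_FREE_GB
# Prints "LOW" if AVAIL_KB is under the threshold, otherwise "OK".
below_min_free() {
    avail_kb=$1
    min_kb=$(( $2 * 1024 * 1024 ))
    if [ "$avail_kb" -lt "$min_kb" ]; then
        echo "LOW"
    else
        echo "OK"
    fi
}

# free_kb DIR
# Prints the available space in KB for the filesystem holding DIR,
# read from the fourth column of POSIX-format df output.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

BOINC_DIR=${BOINC_DIR:-/}
MIN_FREE_GB=${MIN_FREE_GB:-1}
echo "$BOINC_DIR: $(below_min_free "$(free_kb "$BOINC_DIR")" "$MIN_FREE_GB")"
```

With the numbers from this post, 989 MB (1012736 KB) comes out as LOW and 1.2 GB comes out as OK, which matches what the node was doing.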
15) Message boards : Number crunching : 1 of my 6 nodes, keeps timing out. (Message 73097)
Posted 24 Apr 2022 by Peter Dragon
Post:
Thanks for the responses, and the explanations!
16) Message boards : Number crunching : 1 of my 6 nodes, keeps timing out. (Message 73076)
Posted 22 Apr 2022 by Peter Dragon
Post:
Hey guys,

A greenhorn here, and my first post. I run a 6 node Docker Swarm dedicated to just Milkyway@home. All 6 nodes used to run tasks without issue, until sometime last month. I have seen some of the challenges described in other threads, so I'm uncertain if my issue is related.

But around that time, 1 out of the 6 nodes stopped completing tasks, and started throwing timeout errors like seen in the results page below.

https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=924870

If I destroy the docker service and rebuild, once tasks are assigned I appear to get the same issue, only on a different node each time: 1 out of the 6 throws the timeout errors. As far as I can tell, it doesn't "appear" to be an issue on my side. But I wanted to get some feedback from the community and your thoughts.

This is not a pressing matter, just looking to better understand the behavior.

CPU Info Per Node:
Intel Xeon E312xx (Sandy Bridge, IBRS update) [Family 6 Model 42 Stepping 1]
(2 processors)

Thanks again!
-PD