Posts by Pascal

1) Message boards : News : New Nbody version 1.48 (Message 63269)
Posted 26 Mar 2015 by Pascal
Post:

I'd not heard the reverse problem, MT preempting before checkpoint, before - but it's worth investigating, to find what is triggering the switch. Something in 'even higher priority', perhaps? And as Jacob says, if the MT task is kept in memory when suspended, it shouldn't restart from square zero.

Not worth investigating. I was wrong. BOINC appears to be behaving correctly. The problem, however, was actually worse than I thought. All my ssh sessions are automatically logged. I just found this reviewing yesterday's logs:

8) -----------
   name: de_nbody_2_13_orphan_sim_2_1422013803_804445_1
   WU name: de_nbody_2_13_orphan_sim_2_1422013803_804445
   project URL: http://milkyway.cs.rpi.edu/milkyway/
   report deadline: Thu Mar 26 19:38:57 2015
   ready to report: no
   got server ack: no
   final CPU time: 0.000000
   state: downloaded
   scheduler state: scheduled
   exit_status: 0
   signal: 0
   suspended via GUI: no
   active_task_state: EXECUTING
   app version num: 148
   checkpoint CPU time: 0.000000
   current CPU time: 734050.600000
   fraction done: 1.000000
   swap size: 22847488.000000
   working set size: 3469312.000000
   estimated CPU time remaining: -735547.873428
9) -----------

That's just under 8.5 days straight Nbody spent on that task without checkpointing, apparently uninterrupted by BOINC. The CPU is an Intel Xeon E3-1271 v3 @ 3.60GHz. Not the latest & greatest, but certainly no slouch. Other tasks on the box use about half a core.

I first started looking at the box yesterday after I noticed BOINC was only using 1 of the 8 cores. It looks like my troubleshooting was what caused that task to start over several times yesterday, not BOINC. Eventually I aborted all the Nbody tasks and set nomorework for Milkyway.
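As a quick check of the "just under 8.5 days" figure, the current CPU time field from the task dump can be converted to days. This is only a sketch: it assumes the field is reported in seconds, and the helper name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def cpu_seconds_to_days(cpu_seconds: float) -> float:
    """Convert a CPU time in seconds (assumed unit) to days."""
    return cpu_seconds / SECONDS_PER_DAY


# Value copied from the "current CPU time" line of the task dump.
current_cpu_time = 734050.6
days = cpu_seconds_to_days(current_cpu_time)
print(f"{days:.3f} days")
```

The result lands slightly below 8.5, which matches the estimate in the post.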
2) Message boards : News : New Nbody version 1.48 (Message 63261)
Posted 25 Mar 2015 by Pascal
Post:

We had multithreaded the assignments of radii and velocities to bodies. Both of these were done through rejection sampling, using random numbers. However, when that code ran with multiple threads, the assignments of radii and velocities differed between runs even with the same random number seed and parameters. This was because the order in which the threads ran was indeterminate, meaning which body was assigned which radius and velocity was unpredictable. This was a very nasty bug, made nastier because it did not present itself. Runs would complete normally. However, because of the indeterminate nature of the algorithm, a poorer likelihood was reported than would be expected with a set of parameters, even if they were close. Therefore, overall, it led to poor convergence.

It sounds like you were trying to use a single PRNG for all threads. One solution is to use multiple independent PRNGs. If body distribution to threads is deterministic, then the original thread can pass each new thread a seed from its own PRNG, which the new thread can use to seed its own PRNG. If not, then every time a thread picks up a new body it will need to reseed its PRNG using something specific to the body. erand48(), nrand48(), or jrand48() may work for you.

So maybe you're hitting on a BOINC bug there? Are you using the latest version of BOINC? Are you using the "Leave application in memory" option? And can you describe the steps necessary to easily reproduce the problem?

I ran into the problem on BOINC 7.2.42 on Linux version 3.10.0-123.13.2.el7.x86_64. That is the current "Recommended version" for that OS. No, leave_apps_in_memory was unset. It may be possible to easily reproduce by setting cpu_scheduling_period_minutes to 1. I'll think about it.
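To illustrate the per-body reseeding suggestion earlier in this post, here is a minimal sketch in Python. It is not the Nbody code: random.Random stands in for the erand48() family, the seed-mixing function is a made-up example, and the target density is a toy placeholder. The point is only that each body's draws depend on (seed, body index) and not on which thread picks the body up.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

R_MAX = 10.0                       # toy cutoff radius (assumption)
F_MAX = 4.0 * math.exp(-2.0)       # maximum of r^2 * exp(-r), reached at r = 2


def body_seed(master_seed: int, body_index: int) -> int:
    """Hypothetical mixing: any fixed function of (seed, index) would do."""
    return master_seed * 1_000_003 + body_index


def sample_radius(master_seed: int, body_index: int) -> float:
    """Rejection-sample one radius from a toy density f(r) = r^2 * exp(-r)."""
    rng = random.Random(body_seed(master_seed, body_index))  # private PRNG per body
    while True:
        r = rng.uniform(0.0, R_MAX)
        if rng.uniform(0.0, F_MAX) <= r * r * math.exp(-r):
            return r


def assign_radii(master_seed: int, n_bodies: int, workers: int) -> list:
    """Assign radii in parallel; the result does not depend on thread scheduling."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: sample_radius(master_seed, i), range(n_bodies)))
```

With this layout, assign_radii(42, 50, workers=1) and assign_radii(42, 50, workers=8) return the same list, because no PRNG state is shared between bodies. The cost is one reseed per body, which is small compared with the rejection loop itself.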
3) Message boards : News : New Nbody version 1.48 (Message 63258)
Posted 25 Mar 2015 by Pascal
Post:

The biggest problem with initialization taking longer and not checkpointing comes when initialization takes longer than 60 minutes. By default BOINC switches tasks every 60 minutes. I just figured out today that for the past few weeks one of my boxes has spent 60 minutes running Nbody initialization, then it switches to a different project for 60 minutes, then it restarts the same initialization from scratch. Basically it has completely wasted half of the processing time since upgrading to 1.48.
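The scenario described in this post can be put into a toy model. This is a sketch under stated assumptions, not BOINC's actual scheduler: two projects strictly alternate in fixed slices, the task never checkpoints during initialization, and a preempted task loses its progress unless it is kept in memory. The function and parameter names are made up for illustration.

```python
def wall_minutes_to_finish_init(init_minutes, period_minutes=60,
                                leave_in_memory=False, max_slices=1000):
    """Wall-clock minutes until initialization completes, or None if it never does.

    Toy model: the task runs for one period, another project runs for one
    period, and so on. No checkpoint is written during initialization.
    """
    progress = 0.0   # minutes of initialization work retained so far
    wall = 0.0       # elapsed wall-clock minutes
    for _ in range(max_slices):
        remaining = init_minutes - progress
        if remaining <= period_minutes:
            return wall + remaining          # finishes inside this slice
        wall += 2 * period_minutes           # our slice plus the other project's slice
        # Without a checkpoint, progress survives preemption only in memory.
        progress = progress + period_minutes if leave_in_memory else 0.0
    return None


print(wall_minutes_to_finish_init(90))                         # never finishes
print(wall_minutes_to_finish_init(90, leave_in_memory=True))   # finishes in the second slice
```

In this model a 90-minute initialization with 60-minute slices never completes when the task is evicted from memory, which is the "restarts from scratch" loop described in the post. Keeping the task in memory, or checkpointing during initialization, breaks the loop.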