Welcome to MilkyWay@home

Posts by dobrichev

1) Message boards : Number crunching : Where's CUDA? (Message 28364)
Posted 25 Jul 2009 by dobrichev
Post:
BTW, which is the "single precision CPU version"?

I posted some results against double precision here
2) Message boards : MilkyWay@home Science : Single vs. Double Precision (Message 28363)
Posted 25 Jul 2009 by dobrichev
Post:
I experimented with the integration steps for "mu" and "nu", which resulted in significant differences in the final fitness.

Example: increasing the mu and nu steps in the test file astronomy_parameters-20.txt from 800 to 900 and from 80 to 90 respectively resulted in the following fitness values in double precision:

original astronomy_parameters-20.txt:
...
mu[min,max,steps]: 133, 249, 800
nu[min,max,steps]: -1.25, 1.25, 80
...
fitness = -2.985312797571516

modified:
...
mu[min,max,steps]: 133, 249, 900
nu[min,max,steps]: -1.25, 1.25, 90
...
fitness = -2.985312877665851

Natan said the optimization algorithm sometimes hits a local maximum, so careful manual checking of the consistency of the crunched numbers is always necessary. In other words, the algorithm is not noise-free, and highly accurate calculations are probably not necessary.
Something suggests to me that the scientific focus will move (if it has not already) toward using BOINC not for fully automatic solving of the problem, but more for brute-force experimenting, followed by more or less manual data processing. From this point of view I think the scientists will sooner or later prefer a quick response from the BOINC community over slow but more accurate processing results.

The fitness values for the above example, with a single-precision exponent in the innermost loop only, are respectively:
original: -2.985312797574330
modified: -2.985312877668430

A single-precision exponent, even including the stupid conversion from double to single and back to double immediately before and after the exp() call, improves overall performance by about 4 times.
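
For clarity, this is the kind of wrapper I mean (a minimal illustration, not the project's actual code):

    #include <math.h>

    double exp_single(double x)
    {
        /* narrow to float, call single-precision expf(), widen back */
        return (double)expf((float)x);
    }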

The production tasks currently use larger integration steps, so the deviations there would probably be smaller, but I am using a CPU and have not had time to experiment with 1-hour tasks.

For me, the requirement to use double precision sounds like a requirement not to make love for 3 months before measuring blood pressure because this may affect the results ;-)

I know the project is still in the alpha stage, so please treat my comment as a hint only.
3) Message boards : Cafe MilkyWay : Word Link IV (Message 28316)
Posted 24 Jul 2009 by dobrichev
Post:
Norway
4) Message boards : Cafe MilkyWay : Word Link IV (Message 28308)
Posted 24 Jul 2009 by dobrichev
Post:
rum
5) Message boards : Cafe MilkyWay : Word Link IV (Message 27843)
Posted 16 Jul 2009 by dobrichev
Post:
evolution
6) Message boards : Application Code Discussion : found a small memory error in the code (Message 27040)
Posted 4 Jul 2009 by dobrichev
Post:
Do you mean the xyz[convolve][3] array?


Exactly.

I compiled the v0.18 source on Windows. Adding some intrinsics in the innermost loop and hoisting 1/(q*q), sinb, sinl, cosb and cosl out of the loops, along with some other minor changes, resulted in a performance improvement of ~11% over the latest Gipsel's SSE3 build for Windows. Half of the CPU time is consumed by the exp() function in the inner loop. Using the vectorized exp() from Intel's compiler didn't help, but storing the intermediate results in an array and executing the exp() calls in a separate loop did.
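
A minimal sketch of that restructuring (the names and the convolve size are illustrative, not the actual identifiers from the source):

    #include <math.h>

    #define CONVOLVE 120                   /* illustrative size */

    double convolved_sum(const double *weight, const double *arg)
    {
        double tmp[CONVOLVE];
        double sum = 0.0;
        int i;

        for (i = 0; i < CONVOLVE; i++)     /* pass 1: exp() only, no dependencies */
            tmp[i] = exp(arg[i]);

        for (i = 0; i < CONVOLVE; i++)     /* pass 2: accumulate the products */
            sum += weight[i] * tmp[i];

        return sum;
    }

Splitting the passes lets the compiler (or a vectorized math library) process the exp() calls as one homogeneous batch instead of interleaving them with the accumulation.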
The rearrangement I made manually processes two consecutive loop iterations in parallel using the XMM registers. When x[j] and x[j+1] are located sequentially in memory, they can be loaded into a register with one instruction, without wasting cycles and dirtying other registers for shuffling. This may be applicable to some other arrays too.
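
To illustrate the pairing with SSE2 intrinsics (a hypothetical dot product, not the actual inner loop):

    #include <emmintrin.h>

    double dot_pairs(const double *x, const double *y, int n)
    {
        __m128d acc = _mm_setzero_pd();
        double tail[2];
        int j;

        for (j = 0; j + 1 < n; j += 2) {
            __m128d vx = _mm_loadu_pd(&x[j]);  /* one load fetches x[j] and x[j+1] */
            __m128d vy = _mm_loadu_pd(&y[j]);
            acc = _mm_add_pd(acc, _mm_mul_pd(vx, vy));
        }
        _mm_storeu_pd(tail, acc);              /* sum the two accumulator lanes */
        return tail[0] + tail[1] + (j < n ? x[j] * y[j] : 0.0);
    }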

BTW, any suggestions on where and whether to publish my observations (source modifications, code and comments) are welcome. The message boards are messed up (just as I am doing with this post). I am not sure whether anybody is still interested in CPU (rather than GPU) versions, since the task is highly amenable to parallelization and GPU builds therefore probably contribute almost 100% of the scientific results.
7) Message boards : Application Code Discussion : found a small memory error in the code (Message 26987)
Posted 3 Jul 2009 by dobrichev
Post:
All memory allocations are executed within seconds, so this will NOT affect the MW application's performance. The compiler runtime usually requests larger memory blocks from the OS, and later malloc calls are managed by the runtime internally within these blocks. When the amount of necessary space is known (which is the case here), it is better to allocate the memory at once instead of giving the runtime a chance to fragment the memory or to request unusual block sizes from the OS. At least this is the way I like to code.
The really slow operation is the OS memory request on a multiprocessor system, due to the locks. Frequent OS memory allocation requests will affect the rest of the running processes and cause delays if some really memory-intensive applications are running in parallel. This is not the case for desktop machines.
Actually, I am not sure how the BOINC environment runs the application. Probably on process termination all the leaked memory goes away along with the process itself, so there might be no memory allocation problems at all.

A real benefit would come from replacing the arrays laid out as x0, y0, z0, x1, y1, z1... with x0, x1... y0, y1... z0, z1..., which allows effective loop vectorization. But this is another matter.
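
In C terms, the change is from an array-of-structures to a structure-of-arrays layout (field names are illustrative):

    /* interleaved: x0,y0,z0, x1,y1,z1, ... - awkward to vectorize */
    typedef struct { double x, y, z; } point_aos;

    /* separated: x0,x1,..., y0,y1,..., z0,z1,... - each array vectorizes well */
    typedef struct { double *x, *y, *z; } points_soa;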
8) Message boards : Application Code Discussion : found a small memory error in the code (Message 26947)
Posted 2 Jul 2009 by dobrichev
Post:
Here is the list of the memory leaks I found in the application source v0.18.

file boinc_astronomy.C

void worker()
...
    free(sp);
    free_search_parameters(s); // <-- missing
...


file evaluation_optimized.c

void free_constants(ASTRONOMY_PARAMETERS *ap)
...
    //for (i = 0; i < ap->number_streams; i++) { // <-- wrong
    for (i = 0; i < ap->convolve; i++) { // correct
...


file evaluation_state.c

void free_state(EVALUATION_STATE* es) {
    int i;
    free(es->stream_integrals);
    for (i = 0; i < es->number_integrals; i++) {
        free_integral_area(es->integral[i]);
        free(es->integral[i]); // <-- missing
    }
    free(es->integral);
}

int read_checkpoint(EVALUATION_STATE* es) {
...
    if (1 > fscanf(file, "background_integral: %lf\n", &(es->background_integral))) return 1;
    free(es->stream_integrals); // <-- missing; read_double_array below allocates memory
    es->number_streams = read_double_array(file, "stream_integrals", &(es->stream_integrals));
...


file search_parameters.c

void free_search_parameters(SEARCH_PARAMETERS *parameters) {
    free(parameters->search_name);
    free(parameters->parameters);
    free(parameters->metadata);
    free(parameters); // <-- missing
}

=====

There is also a 4-byte memory leak somewhere in the BOINC library, probably around this line in diagnostics_win.cpp:
    diagnostics_threads.push_back(pThreadEntry);

=====

Additionally, when a checkpoint is loaded, the memory has already been allocated based on the workunit, but the loops that read the data are driven by the checkpoint content. This causes heap corruption when the checkpoint file was generated for a different workunit, due to the difference in the data member counts. The same could occur if the checkpoint file is corrupt for some reason.
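
A sketch of the missing sanity check (the format string and field names are assumptions; the real checkpoint layout may differ):

    #include <stdio.h>

    int checkpoint_counts_ok(FILE *file, int expected_integrals, int expected_streams)
    {
        int n_integrals, n_streams;

        if (2 != fscanf(file, "number_integrals: %d\nnumber_streams: %d\n",
                        &n_integrals, &n_streams))
            return 0;                          /* unreadable checkpoint */

        /* counts must match what the workunit allocated */
        return n_integrals == expected_integrals && n_streams == expected_streams;
    }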

=====

In general, since the amount of required heap memory is known at the beginning, all the memory allocation calls could be reduced to one allocation per data type. This should improve performance, especially on multiprocessor machines.
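
A minimal sketch of the idea for one data type (illustrative, not from the source): allocate a single block for all n vectors of length m and hand out row pointers into it.

    #include <stdlib.h>

    double **alloc_matrix(size_t n, size_t m)
    {
        double **rows = malloc(n * sizeof *rows);
        double *block = malloc(n * m * sizeof *block); /* one request instead of n */
        size_t i;

        if (rows == NULL || block == NULL) {
            free(rows);
            free(block);
            return NULL;
        }
        for (i = 0; i < n; i++)
            rows[i] = block + i * m;                   /* row i points into the block */
        return rows;
    }

    /* freeing takes just two calls: free(rows[0]); free(rows); */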
9) Message boards : MilkyWay@home Science : What are we doing and for whom? (Message 26926)
Posted 2 Jul 2009 by dobrichev
Post:
Is it normal for recently issued workunits to contain three streams, two of which have exactly the same definition?
An example workunit is 91405739 where the first and the second stream in parameter_212F5_3s.txt are equal.




©2022 Astroinformatics Group