Welcome to MilkyWay@home

problem with checkpoints?

Message boards : Application Code Discussion : problem with checkpoints?

mindc
Joined: 9 Jul 08
Posts: 7
Credit: 11,070,991
RAC: 0
Message 8813 - Posted: 21 Jan 2009, 14:56:09 UTC

Since the new assimilator/validator arrived, I've had a problem with WUs that have been restarted: they are all marked as Invalid. Any clues?
Lazarus-uk

Joined: 22 Mar 08
Posts: 7
Credit: 9,175,991
RAC: 0
Message 8816 - Posted: 21 Jan 2009, 17:01:34 UTC - in response to Message 8813.  



I also have a couple of tasks that have just failed to validate after restarting from checkpoints:

63306531
63306530

I'm using Ubuntu 8.10 and speedimic's SSE4.1 Linux 64-bit app. Other WUs that run straight through without checkpoints are validating.


Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 8821 - Posted: 21 Jan 2009, 20:00:53 UTC - in response to Message 8816.  



I'll take a look into this.
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 8899 - Posted: 23 Jan 2009, 13:36:00 UTC - in response to Message 8821.  
Last modified: 23 Jan 2009, 14:13:34 UTC

I'll take a look into this.


I've just done that, too, as I only read the news about the problems now.
I changed nothing in the checkpointing code in calculate_integral (besides increasing the precision of the values stored in the checkpoints, which is unrelated to this bug), so I would assume the problem also applies to the stock app (and probably all other apps out there). The report of the same problem for the Linux applications supports this.

A quick glance revealed that when resuming from a checkpoint, the app starts from the last values of mu, nu and r. So the values for that last combination are added to the integrals twice when resuming from a checkpoint. This of course screws up the result (sometimes more, sometimes less, depending on the mu, nu, r combination).
I will add a fix to my application.

Edit:
I guess I will just move the writing of the checkpoint from the end of the loop to the beginning. It should produce correct results then. I think I had already done this at some point, but reverted it to match the scheme of the stock app without thinking too much about it.
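
The double counting described in this post can be reproduced with a small stand-alone sketch. Everything in it (struct state, term, run, interrupted_sum and the in-memory "checkpoint") is hypothetical and only mimics the shape of a single accumulation loop; it is not the MilkyWay@home code.

```c
#include <assert.h>  /* only needed for the checks in the usage note */

/* Hypothetical stand-in for the checkpointed state (not the real struct). */
struct state {
    int step;    /* loop index the run will continue from */
    double sum;  /* running integral */
};

/* Placeholder integrand: the contribution of one loop step. */
static double term(int step) { return (double)(step + 1); }

/*
 * Run the loop from s->step up to n_steps. When the loop index equals
 * stop_at, copy the state into *ckpt and return, simulating a shutdown
 * right after a checkpoint. Pass stop_at = -1 to run to completion.
 */
static void run(struct state *s, int n_steps, int stop_at,
                int checkpoint_at_top, struct state *ckpt)
{
    for (; s->step < n_steps; s->step++) {
        if (checkpoint_at_top && s->step == stop_at) {
            *ckpt = *s;   /* this step is not in the sum yet: safe to redo */
            return;
        }
        s->sum += term(s->step);
        if (!checkpoint_at_top && s->step == stop_at) {
            *ckpt = *s;   /* bug: step is in the sum, index still points at it */
            return;
        }
    }
}

/* Interrupt once at stop_at, resume from the checkpoint, return the total. */
double interrupted_sum(int n_steps, int stop_at, int checkpoint_at_top)
{
    struct state s = {0, 0.0};
    struct state ckpt = {0, 0.0};
    run(&s, n_steps, stop_at, checkpoint_at_top, &ckpt);
    s = ckpt;             /* "restart" from the saved state */
    run(&s, n_steps, -1, checkpoint_at_top, &ckpt);
    return s.sum;
}
```

With five steps contributing 1 through 5 (true total 15), interrupted_sum(5, 2, 0) returns 18 because the contribution of step 2 is added twice, while interrupted_sum(5, 2, 1), which checkpoints at the top of the loop, returns 15.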
Lazarus-uk

Joined: 22 Mar 08
Posts: 7
Credit: 9,175,991
RAC: 0
Message 8901 - Posted: 23 Jan 2009, 14:31:50 UTC - in response to Message 8899.  
Last modified: 23 Jan 2009, 14:32:40 UTC

Those 2 WUs I reported as failures earlier in this thread were started from checkpoints after a reboot of the PC. Since then I've had no more failures from checkpoints, whether after rebooting or not. The problem seems to have disappeared, for me at least. Or maybe I've just been lucky in restarting from a place where the values don't affect the result too much.
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 8902 - Posted: 23 Jan 2009, 15:03:50 UTC - in response to Message 8899.  
Last modified: 23 Jan 2009, 15:12:36 UTC

Edit:
I guess I will just move the writing of the checkpoint from the end of the loop to the beginning. It should produce correct results then. I think I had already done this at some point, but reverted it to match the scheme of the stock app without thinking too much about it.

Couldn't edit anymore, so this is just an Edit2:

I will now leave the checkpointing at the end of the loop, but remove the "ia->r_step_current++" from the for statement and put it directly in front of the checkpointing. That way the already updated value is stored in the checkpoint and the double calculation of some values should be gone. It also appears to be the smallest change that corrects the problem.
Can somebody verify that this change will produce correct results in all cases?

And should I update the application_name or the version to distinguish between the old and the updated one? I would prefer the version number (0.14 is a good one, isn't it?).

PS: The innermost loop in calculate_integral now looks like this:
for (; ia->r_step_current < ia->r_steps;) {
  [some code here]
  ia->r_step_current++;
#ifdef GMLE_BOINC
  int retval;
  if (boinc_time_to_checkpoint()) {
    retval = write_checkpoint(es);
    if (retval) {
      fprintf(stderr, "APP: astronomy checkpoint failed %d\n", retval);
      return;
    }
    boinc_checkpoint_completed();
  }
  boinc_fraction_done(calculate_progress(es));
#endif
}
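
As a rough way to approach the "all cases" question for a single loop, the following sketch interrupts a toy version of the loop after every possible step and compares the resumed total with an uninterrupted run. The names struct area, contribution, integrate and resume_matches_everywhere are made up for illustration, and the sketch does not model the outer mu and nu loops or the BOINC calls.

```c
#include <assert.h>  /* only needed for the checks in the usage note */

/* Hypothetical, reduced version of the integral-area state. */
struct area {
    int r_step_current;
    int r_steps;
    double integral;
};

/* Placeholder for the per-step work hidden behind "[some code here]". */
static double contribution(int r_step) { return 0.5 * (double)(r_step + 1); }

/*
 * Loop with the increment placed directly in front of the checkpoint.
 * If stop_after_step >= 0, save the state into *ckpt right after that
 * step and return, simulating a shutdown after a checkpoint.
 */
static void integrate(struct area *ia, int stop_after_step, struct area *ckpt)
{
    for (; ia->r_step_current < ia->r_steps;) {
        ia->integral += contribution(ia->r_step_current);
        ia->r_step_current++;                 /* advance before saving */
        if (ia->r_step_current - 1 == stop_after_step) {
            *ckpt = *ia;                      /* index already points at the next step */
            return;
        }
    }
}

/* Returns 1 if resuming after any single step reproduces the full result. */
int resume_matches_everywhere(int r_steps)
{
    struct area ref = {0, r_steps, 0.0};
    struct area unused = ref;
    integrate(&ref, -1, &unused);             /* uninterrupted reference run */

    for (int stop = 0; stop < r_steps; stop++) {
        struct area ia = {0, r_steps, 0.0};
        struct area ckpt = ia;
        integrate(&ia, stop, &ckpt);          /* run until the "shutdown" */
        ia = ckpt;                            /* restart from the checkpoint */
        integrate(&ia, -1, &ckpt);
        if (ia.integral != ref.integral) {
            return 0;
        }
    }
    return 1;
}
```

The exact comparison is valid here because a resumed run performs the same additions in the same order as the reference run. A check like this only covers the innermost loop; how the outer loops reset r_step_current after a resume would still need to be checked against the real code.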
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 8911 - Posted: 23 Jan 2009, 21:30:51 UTC - in response to Message 8901.  

I turned off the setting that was not awarding credit to workunits with these types of results (since it's a problem on our end), so everyone should be getting credit for these until we update the code and get the problem fixed.

Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 8920 - Posted: 23 Jan 2009, 23:49:13 UTC - in response to Message 8911.  

(since it's a problem on our end), so everyone should be getting credit for these until we update the code and get the problem fixed.

Actually, the wrong r value in the checkpoint is the smaller problem. I've tested my solution and found that it is not sufficient. Apparently the stream_integral and background_integral values are read back as zero from the checkpoints, even though other values were written to the file.

So I traced it to the functions fwrite_integral_area and fread_integral_area. The background_integral is written to the checkpoint with

fprintf(file, "background integral: %.10lf\n", ia->background_integral);

but read with

fscanf(file, "background_integral: %lf\n", &(ia->background_integral));

The underscore is missing when writing to the file! When resuming from a checkpoint, the fscanf fails at the mismatch and all values after that are read as zero.

I have added the underscore to the fprintf and changed the fscanf line to

fscanf(file, "background%*cintegral: %lf\n", &(ia->background_integral));

That construct accepts any single character between "background" and "integral" ;)
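
The difference between the two format strings can be seen in isolation with sscanf, which follows the same matching rules as fscanf. This is only a sketch: parse_strict and parse_tolerant are hypothetical helper names, and the input lines are made up to look like the checkpoint line quoted above.

```c
#include <assert.h>  /* only needed for the checks in the usage note */
#include <stdio.h>

/*
 * Strict format: the literal underscore must match the input exactly.
 * Returns the number of values converted (1 on success, 0 on a mismatch).
 */
int parse_strict(const char *line, double *out)
{
    return sscanf(line, "background_integral: %lf", out);
}

/*
 * Tolerant format: %*c reads exactly one character and discards it, so a
 * space, an underscore, or any other single separator is accepted.
 */
int parse_tolerant(const char *line, double *out)
{
    return sscanf(line, "background%*cintegral: %lf", out);
}
```

For the line "background integral: 1.5", parse_strict returns 0 and leaves the output variable untouched, while parse_tolerant returns 1 and stores 1.5. Checking the return value of each fscanf call in the reader would turn this kind of mismatch into a visible error instead of silently leaving values at zero.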
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 8925 - Posted: 24 Jan 2009, 1:30:43 UTC - in response to Message 8920.  
Last modified: 24 Jan 2009, 2:18:53 UTC

So, hopefully, here is the last bug in this checkpointing stuff. All the changes above won't help unless the following lines
ia->background_integral = 0;
for (i = 0; i < ap->number_streams; i++) {
	ia->stream_integrals[i] = 0;
}

are deleted from the beginning of calculate_integral. These statements simply overwrite the values just read from the checkpoint.
I will report back on whether all 3 changes together correct the problem.

Edit:
I can now confirm that the changes together enable correct checkpointing, i.e. the output is the same whether running uninterrupted or resuming from a checkpoint.
I hope this helps you release an updated version.
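
The ordering issue can be sketched as follows: zero the accumulators once when the state is created, let a checkpoint restore overwrite them, and keep the compute routine free of any reset. The names struct area, NUM_STREAMS, init_area, restore_area and accumulate are hypothetical, and the per-step values are placeholders rather than the real integrand.

```c
#include <assert.h>  /* only needed for the checks in the usage note */

#define NUM_STREAMS 2   /* hypothetical fixed stream count for the sketch */

/* Hypothetical, reduced version of the integral-area state. */
struct area {
    int r_step_current;
    double background_integral;
    double stream_integrals[NUM_STREAMS];
};

/* Zero the accumulators exactly once, when the state is created. */
void init_area(struct area *ia)
{
    ia->r_step_current = 0;
    ia->background_integral = 0.0;
    for (int i = 0; i < NUM_STREAMS; i++) {
        ia->stream_integrals[i] = 0.0;
    }
}

/* Stand-in for reading a checkpoint: overwrite the state with saved values. */
void restore_area(struct area *ia, const struct area *saved)
{
    *ia = *saved;
}

/*
 * Continue from whatever is in *ia. There is deliberately no reset here,
 * so values restored from a checkpoint are kept.
 */
void accumulate(struct area *ia, int r_steps)
{
    for (; ia->r_step_current < r_steps; ia->r_step_current++) {
        ia->background_integral += 1.0;       /* placeholder contribution */
        for (int i = 0; i < NUM_STREAMS; i++) {
            ia->stream_integrals[i] += 2.0;   /* placeholder contribution */
        }
    }
}
```

Running four steps, saving the state, restoring it into a freshly initialised area and continuing to ten steps then gives the same totals as ten uninterrupted steps. Putting the zeroing back at the top of accumulate would discard the restored partial sums, which is the behaviour described in this post.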

@Travis: Have you read my PM about the GPU stuff? What do you think of it?
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 8931 - Posted: 24 Jan 2009, 17:36:24 UTC - in response to Message 8902.  
Last modified: 24 Jan 2009, 17:38:02 UTC

Yeah, this looks like it was causing a problem. I just moved the checkpointing code to the top of the loop (as opposed to the bottom).

I also moved the checkpoint code in the calculate_integrals function outside the loop, so that it can no longer cause an integral to be skipped:

        for (; es->current_cut < ap->number_cuts; es->current_cut++) {
                if (es->current_cut == -1) {
                        current_area = es->main_integral;
                } else {
                        current_area = es->cuts[es->current_cut];
                }
                calculate_integral(ap, current_area, es);

                if (es->current_cut == -1) {
                        es->background_integral = current_area->background_integral;
                        for (i = 0; i < ap->number_streams; i++) es->stream_integrals[i] = current_area->stream_integrals[i];
//                      printf("[main] background: %.10lf, stream[0]: %.10lf\n", current_area->background_integral, current_area->stream_integrals[0]);
                } else {
                        es->background_integral -= current_area->background_integral;
                        for (i = 0; i < ap->number_streams; i++) es->stream_integrals[i] -= current_area->stream_integrals[i];
//                      printf("[cut %d] background: %.10lf, stream[0]: %.10lf\n", es->current_cut, current_area->background_integral, current_area->stream_integrals[0]);
                }
        }
        #ifdef GMLE_BOINC
                int retval = write_checkpoint(es);
                if (retval) {
                        fprintf(stderr,"APP: astronomy checkpoint failed %d\n",retval);
                        return retval;
                }
        #endif


As to:

ia->background_integral = 0;
for (i = 0; i < ap->number_streams; i++) {
	ia->stream_integrals[i] = 0;
}


I think this needs to be moved to before the checkpoint is calculated (to initialize the values in case there isn't a checkpoint). I'm testing these changes right now, and once I know they're working I'll update the stock app to 0.14.
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 8932 - Posted: 24 Jan 2009, 17:39:52 UTC - in response to Message 8925.  

@Travis: Have you read my PM about the GPU stuff? What do you think of it?


I got it last night after getting back from snowboarding, so I was a bit too tired to formulate any decent response :D It seems pretty cool though and your values should be within our level of tolerance for what we're doing right now. It'd be great if you could post the GPU code in here so other people can look at it as well. I'd be happy to sticky and put a stamp of approval on it after we look at it.
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 8935 - Posted: 24 Jan 2009, 17:51:59 UTC - in response to Message 8931.  


As to:
ia->background_integral = 0;
for (i = 0; i < ap->number_streams; i++) {
	ia->stream_integrals[i] = 0;
}


I think this needs to be moved to before the checkpoint is calculated (to initialize the values in case there isn't a checkpoint). I'm testing these changes right now, and once I know they're working I'll update the stock app to 0.14.


Update to this: Yeah these can just be removed because they're initialized in initialize_state in evaluation_state.c
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 8936 - Posted: 24 Jan 2009, 17:56:44 UTC - in response to Message 8902.  

This change also has to be made in the innermost loop of calculate_likelihood (for the same reason). I just put the checkpoint code at the beginning of the loop, instead of moving ia->r_step_current++ (and, in the case of calculate_likelihood, es->current_star_point++) into the loop.
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 8971 - Posted: 24 Jan 2009, 22:39:14 UTC - in response to Message 8932.  

@Travis: Have you read my PM about the GPU stuff? What do you think of it?

[..]
It seems pretty cool though and your values should be within our level of tolerance for what we're doing right now. It'd be great if you could post the GPU code in here so other people can look at it as well. I'd be happy to sticky and put a stamp of approval on it after we look at it.

Nice to hear!

And I have not even told you how fast it really is. You will be surprised, I promise!

But there is still the problem that the BOINC client has no support for proper scheduling of GPU tasks on ATI cards (nvidia only at the moment). But maybe one could ask the BOINC devs for that?
But as you are interested, I will go ahead and try to figure out how it can also be run on current clients.

If there is a real chance it can be distributed as the stock GPU app here, I'm definitely willing to share the GPU code. But I have to warn you, it will be a little harder to maintain than typical C code. Unfortunately, ATI's Stream SDK isn't that mature and some functionality is basically missing at the moment. So I had to go the hard way and actually implemented it in a kind of GPU assembly language. But as the graphics card only calculates the r loop in calculate_integral (which is 99+% of the whole computation), one can still change the code around it without needing to touch the GPU code.
One may ask why one should make this effort when CUDA is already much more mature. The simple answer is performance. ATI beats the crap out of nvidia in double precision calculations. A GTX280 is maybe on the heels of an HD3850, but an HD4870 has about three times the power.

And just as a teaser for the crowd, do you remember the days of the old 1.22 application? A single HD4870 would generate more than twice the throughput of the whole MW project at that time ;)
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 8972 - Posted: 24 Jan 2009, 22:43:36 UTC - in response to Message 8936.  
Last modified: 24 Jan 2009, 22:44:35 UTC

In which version was it put inside that loop? It runs through this loop quite fast (just a few seconds on any recent CPU). I would say it makes no sense to put it inside.
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 9008 - Posted: 25 Jan 2009, 1:56:30 UTC - in response to Message 8972.  

I think in the swap from astronomyX to milkywayX it was moved in there. We could probably move it to the middle loop and would probably see some speedup.
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 9047 - Posted: 25 Jan 2009, 3:53:37 UTC - in response to Message 9008.  

I actually just tried this and the performance difference wasn't noticeable.
UBT - Ben

Joined: 8 Mar 08
Posts: 17
Credit: 4,411,459
RAC: 0
Message 9236 - Posted: 26 Jan 2009, 22:54:51 UTC - in response to Message 8971.  

I'd love to see this kind of program being implemented! The only concern I have is that, because a lot of people here would be only too happy to run it, how would the server hold up?

With GPUs running the WUs, either the WUs would have to become much longer or the server would have to be upgraded, as I don't think the network would hold up on the server end otherwise :P


©2024 Astroinformatics Group