problem with checkpoints?
Author | Message |
---|---|
Send message Joined: 9 Jul 08 Posts: 7 Credit: 11,070,991 RAC: 0 |
Since the new assimilator/validator arrived, I've had a problem with WUs that have been restarted: they are all marked as Invalid. Any clues? |
Send message Joined: 22 Mar 08 Posts: 7 Credit: 9,175,991 RAC: 0 |
|
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
|
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
I'll take a look into this. I've just done that, too, as I only read the news about the problems now. I changed nothing in the checkpointing code in calculate_integral (apart from increasing the precision of the values stored in the checkpoints, which is unrelated to this bug), so I assume the problem also applies to the stock app (and probably to all other apps out there). The report of the same problem for the Linux applications points the same way. A quick glance revealed that when resuming from a checkpoint, the app starts from the last stored values for mu, nu and r. So the values for that last combination are added to the integrals twice when resuming from a checkpoint. This of course skews the result (sometimes more, sometimes less, depending on the mu, nu, r combination). I will add a fix to my application. Edit: I guess I will just move the writing of the checkpoint from the end of the loop to the beginning. It should produce correct results then. I think I had already done this at some point, but reverted it to match the stock app's scheme again without thinking too much about it. |
Send message Joined: 22 Mar 08 Posts: 7 Credit: 9,175,991 RAC: 0 |
I'll take a look into this. Those 2 WUs I reported as failures, earlier in this thread, were started from checkpoints after a reboot of the pc. Since then I've had no more failures from checkpoints, whether after rebooting or not. The problem seems to have disappeared, for me at least. Or, maybe I've just been lucky in restarting from a place where the values don't affect the result too much. |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
Edit: Couldn't edit anymore, so this is just an Edit2: I am now leaving the checkpointing at the end of the loop, but will remove the "ia->r_step_current++" from the for statement and put it directly in front of the checkpointing. That way the already updated value will be stored in the checkpoint and the double calculation of some values should be gone. Furthermore, it appears to be the smallest change that corrects the problem. Can somebody verify that this change will produce correct results in all cases? And should I update the application_name or the version to distinguish between the old and the updated one? I would prefer the version number (0.14 is a good one, isn't it?). PS: The innermost loop in calculate_integrals now looks like this:

    for (; ia->r_step_current < ia->r_steps;) {
        [some code here]
        ia->r_step_current++;
    #ifdef GMLE_BOINC
        int retval;
        if (boinc_time_to_checkpoint()) {
            retval = write_checkpoint(es);
            if (retval) {
                fprintf(stderr,"APP: astronomy checkpoint failed %d\n",retval);
                return;
            }
            boinc_checkpoint_completed();
        }
        boinc_fraction_done(calculate_progress(es));
    #endif
    }

|
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
I'll take a look into this. I turned off the setting that was not awarding credit to workunits with these types of results (since it's a problem on our end), so everyone should be getting credit for these until we update the code and get the problem fixed. |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
(since it's a problem on our end), so everyone should be getting credit for these until we update the code and get the problem fixed. Actually, the wrong r value in the checkpoint is the smaller problem. I've tested my solution and found that it is not sufficient. Apparently the stream_integral and background_integral values are read back as zero from the checkpoints, even if other values were written to the file. I traced it to the functions fwrite_integral_area and fread_integral_area. The background_integral is written to the checkpoint with

    fprintf(file, "background integral: %.10lf\n", ia->background_integral);

but read with

    fscanf(file, "background_integral: %lf\n", &(ia->background_integral));

The underscore is missing when writing to the file! When resuming from a checkpoint, the mismatch makes fscanf fail at that point, so all values after it come back as zero. I have added the underscore to the fprintf and changed the fscanf line to

    fscanf(file, "background%*cintegral: %lf\n", &(ia->background_integral));

That construct will accept any single character between "background" and "integral" ;) |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
So, hopefully this is now the last bug in this checkpointing stuff. All the changes above won't help if the following lines

    ia->background_integral = 0;
    for (i = 0; i < ap->number_streams; i++) {
        ia->stream_integrals[i] = 0;
    }

are not deleted from the beginning of calculate_integral. These statements simply overwrite the values just read from the checkpoints. I will report back whether all 3 changes together correct the problem. Edit: I can now confirm that together the changes enable correct checkpointing, i.e. the output is the same when running uninterrupted or when resuming from a checkpoint. Hope this helps you to release an updated version. @Travis: Have you read my PM about the GPU stuff? What do you think of it? |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
Edit: Yeah, this looks like it was causing a problem. I just moved the checkpointing code to the top of the loop (as opposed to the bottom). I also moved the checkpoint code in the calculate_integrals function outside the loop to prevent it from not calculating an integral:

    for (; es->current_cut < ap->number_cuts; es->current_cut++) {
        if (es->current_cut == -1) {
            current_area = es->main_integral;
        } else {
            current_area = es->cuts[es->current_cut];
        }
        calculate_integral(ap, current_area, es);
        if (es->current_cut == -1) {
            es->background_integral = current_area->background_integral;
            for (i = 0; i < ap->number_streams; i++) es->stream_integrals[i] = current_area->stream_integrals[i];
            // printf("[main] background: %.10lf, stream[0]: %.10lf\n", current_area->background_integral, current_area->stream_integrals[0]);
        } else {
            es->background_integral -= current_area->background_integral;
            for (i = 0; i < ap->number_streams; i++) es->stream_integrals[i] -= current_area->stream_integrals[i];
            // printf("[cut %d] background: %.10lf, stream[0]: %.10lf\n", es->current_cut, current_area->background_integral, current_area->stream_integrals[0]);
        }
    }
    #ifdef GMLE_BOINC
    int retval = write_checkpoint(es);
    if (retval) {
        fprintf(stderr,"APP: astronomy checkpoint failed %d\n",retval);
        return retval;
    }
    #endif

As to:

    ia->background_integral = 0;
    for (i = 0; i < ap->number_streams; i++) {
        ia->stream_integrals[i] = 0;
    }

I think this needs to be moved to before the checkpoint is calculated (to initialize the values in case there isn't a checkpoint). I'm testing these changes right now, and once I know they're working I'll update the stock app to 0.14. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
@Travis: Have you read my PM about the GPU stuff? What do you think of it? I got it last night after getting back from snowboarding, so I was a bit too tired to formulate any decent response :D It seems pretty cool though and your values should be within our level of tolerance for what we're doing right now. It'd be great if you could post the GPU code in here so other people can look at it as well. I'd be happy to sticky and put a stamp of approval on it after we look at it. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
Update to this: Yeah these can just be removed because they're initialized in initialize_state in evaluation_state.c |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
Edit: This change also has to be done in the innermost loop of calculate_likelihood (for the same reason). I just put the checkpoint code at the beginning of the loop, instead of moving ia->r_step_current++ into the loop, (and in the case of calculate likelihood, es->current_star_point++). |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
@Travis: Have you read my PM about the GPU stuff? What do you think of it? Nice to hear! And I have not even told you how fast it really is. You will be surprised, I promise! But there is still the problem that the BOINC client has no support for proper scheduling of GPU tasks on ATI cards (NVIDIA only at the moment). But maybe one could ask the BOINC devs for that? As you are interested, I will go ahead and try to figure out how it can also be run on current clients. If there is a real chance it can be distributed as the stock GPU app here, I'm definitely willing to share the GPU code. But I have to warn you, it will be a little harder to maintain than typical C code. Unfortunately, ATI's Stream SDK isn't that mature and some functionality is basically missing at the moment. So I had to go the hard way and actually implemented it in a kind of GPU assembler language. But as the graphics card only calculates the r loop in calculate_integral (which is 99+% of the whole computation), one can still change the stuff around it without needing to touch the GPU code. One may ask why one should make this effort when CUDA has already matured much more. The simple answer is performance. ATI beats the crap out of NVIDIA in double precision calculations. A GTX280 is maybe on the heels of a HD3850, but a HD4870 has about three times the power. And just as a teaser for the crowd, do you remember the days of the old 1.22 application? A single HD4870 would generate more than two times the throughput of the whole MW project at that time ;) |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
This change also has to be done in the innermost loop of calculate_likelihood (for the same reason). I just put the checkpoint code at the beginning of the loop, instead of moving ia->r_step_current++ into the loop, (and in the case of calculate likelihood, es->current_star_point++). In which version was it put inside of that loop? It runs through this loop quite fast (just a few seconds on any recent CPU). I would say it makes no sense to put it inside. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
This change also has to be done in the innermost loop of calculate_likelihood (for the same reason). I just put the checkpoint code at the beginning of the loop, instead of moving ia->r_step_current++ into the loop, (and in the case of calculate likelihood, es->current_star_point++). I think in the swap from astronomyX to milkywayX it was moved in there. We could probably move it to the middle loop and would probably see some speedup. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
This change also has to be done in the innermost loop of calculate_likelihood (for the same reason). I just put the checkpoint code at the beginning of the loop, instead of moving ia->r_step_current++ into the loop, (and in the case of calculate likelihood, es->current_star_point++). I actually just tried this and the performance difference wasn't noticeable. |
Send message Joined: 8 Mar 08 Posts: 17 Credit: 4,411,459 RAC: 0 |
@Travis: Have you read my PM about the GPU stuff? What do you think of it? I'd love to see this kind of program being implemented! My only concern is how the server would fare, because a lot of people here would be only too happy to run it. With GPUs running the WUs, they would either have to become much longer or the server would have to be upgraded, as I don't think the network would hold up on the server end otherwise :P |