Happy Tuesday everyone!
So, fair warning: this is going to be a very long post. You can read it, skim it, or just read the TL;DR at the bottom, whatever works for you. Since this is kind of a technical post and I know the audience is diverse, here is how I suggest you parse it: general crunchers (TL;DR), more science-savvy crunchers interested in how MW@home works behind the scenes (read/skim), and future MW@home scientists who will inevitably be lost in the massive project that is MW@home (please read thoroughly and reference all of the papers).
Throughout the post you will see links to help define terms that you may not have heard before. Obviously not every word will have this nice feature, so if you have questions or want better clarification on anything, feel free to ask! I will answer as soon as possible.
So here we go:
Good news! I found the bug which prevented us from generating test data for modfit. Before I get to the bug, let's first talk about the basic steps required to create test data. This will hopefully be a simple enough explanation for everyone to follow, while also giving enough insight into why this bug took so long to find.
Building a Wedge:
Step zero involves some pre-calculations needed to generate the data. Mainly, we calculate how many stars we expect to find in the background and in each stream.
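To make that a bit more concrete, here is a minimal sketch of what I mean, assuming we already know what fraction of the stars belongs to each stream. The numbers and names below are placeholders for illustration, not the real fit parameters:

```python
# Hypothetical pre-calculation: split a total star count between the
# background and each stream using assumed fractions. The real code
# derives these fractions from the wedge's fit parameters.
total_stars = 100_000
stream_fractions = [0.10, 0.05]                 # assumed fraction of stars per stream
background_fraction = 1.0 - sum(stream_fractions)

n_background = round(total_stars * background_fraction)
n_per_stream = [round(total_stars * f) for f in stream_fractions]

print(n_background, n_per_stream)               # 85000 [10000, 5000]
```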
Step one in generating test data for MilkyWay@home is to generate a background distribution of stars consistent with what our application expects. This is pretty easy to do since we already know the distribution we expect: just normalize it by dividing by the function's maximum in the wedge to get the probability that a star will be present at any given point in our data wedge. Then we use a method called rejection sampling to generate stars according to this distribution. Rejection sampling is not the fastest or most efficient method, but it is the easiest to code and debug, and since this is not a time-sensitive application we chose it.
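If you want to see what rejection sampling looks like in practice, here is a rough sketch in Python. The density function below is a simple stand-in that falls off with distance, not the actual background profile MilkyWay@home fits:

```python
import random

def background_density(r):
    # Stand-in density that falls off with distance; the real
    # simulator uses the background profile from our model.
    return 1.0 / (1.0 + r) ** 3

def rejection_sample(n_stars, r_min, r_max):
    # The density is largest at r_min here, so dividing by that value
    # normalizes it to a probability between 0 and 1.
    density_max = background_density(r_min)
    stars = []
    while len(stars) < n_stars:
        r = random.uniform(r_min, r_max)                 # propose a candidate position
        if random.random() < background_density(r) / density_max:
            stars.append(r)                              # accept with probability density/max
    return stars

sample = rejection_sample(1000, r_min=1.0, r_max=50.0)
```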
Step two is to generate each stream. To do this, we first calculate the length each stream runs through the wedge. Each stream is modeled as a cylinder in our data, and the density of the stream drops with radius from the center line of the cylinder. Calculating the length involves making the stream incrementally longer until each end cap is completely out of the wedge. We need to ensure the entire end cap is out of the wedge because a stream can still have stars within the wedge even when the center of the end cap is outside of it. (I drew a few crude pictures to prove this to myself, but I currently don't have any worthy of putting on here.) Once we know the length of the stream, we can generate a cylindrical distribution of stars using a uniform random number along the z-axis, a normal random number along the x-axis, and a different normal random number along the y-axis, as sketched below.
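Here is a stripped-down sketch of that cylinder sampling. The length and width values are placeholders, and the real simulator then rotates these stream-frame coordinates into the wedge:

```python
import random

def sample_stream_stars(n_stars, length, sigma):
    # Cylinder aligned with the z-axis: uniform along the axis,
    # independent Gaussian scatter perpendicular to it.
    stars = []
    for _ in range(n_stars):
        z = random.uniform(0.0, length)      # uniform along the stream
        x = random.gauss(0.0, sigma)         # normal scatter off the center line
        y = random.gauss(0.0, sigma)         # a different normal draw for y
        stars.append((x, y, z))
    return stars

stream = sample_stream_stars(5000, length=20.0, sigma=1.5)
```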
Once all of the streams and background are generated, we have what I call an ideal distribution of stars. This is what the distribution would look like if all of the stars had a single brightness and if there were no random errors or observational effects. Since MilkyWay@home expects and accounts for observational effects and intrinsic characteristics of the data, we must put these into our simulated data as well.
The final step in generating the test data is to simulate observational effects and intrinsic star properties. We start by "fuzzing" out star brightnesses to match the distribution we expect to find in our data: each star's current brightness becomes the center of a distribution, and we make the star brighter or dimmer by drawing from that distribution. Next we add in the detection efficiency of the telescope used to take our actual data, and simulate the number of stars which will no longer meet our selection criteria due to random error in brightness measurements (especially for distant stars). The result is that our test data contains fewer stars far away from us than nearby. This makes sense because dimmer stars are less likely to be picked up by our telescope than bright ones due to random error, and brightness measurements become less accurate the dimmer the object, so we again expect fewer distant (dimmer) stars to fall within our selection criteria.
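Here is the flavor of that step as a hedged sketch. The Gaussian width, the efficiency curve, and the magnitude cut below are placeholders for illustration, not the actual values we match to the survey data:

```python
import math
import random

def fuzz_magnitude(true_mag, sigma=0.2):
    # Scatter the brightness around its "ideal" value; sigma here is a
    # placeholder, not the measured photometric error.
    return random.gauss(true_mag, sigma)

def detection_efficiency(mag):
    # Placeholder logistic falloff: bright stars are almost always kept,
    # faint stars increasingly drop out. The real curve is fit to the
    # survey's completeness.
    return 1.0 / (1.0 + math.exp(mag - 22.0))

def observe(true_mags, mag_limit=22.5):
    observed = []
    for m in true_mags:
        m_obs = fuzz_magnitude(m)
        # Keep the star only if it survives the selection cut and is detected.
        if m_obs < mag_limit and random.random() < detection_efficiency(m_obs):
            observed.append(m_obs)
    return observed

kept = observe([random.uniform(18.0, 24.0) for _ in range(10_000)])
```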
So, our bug was in the way we calculate the length of the stream in step two. Instead of increasing the length of the cylinder along the z-axis (essentially the direction along the cylinder), we increased the length along the x-axis. The result was a strange shape that resembled a stream but was not really a cylinder, and it was also 90 degrees off from the orientation it was supposed to have. This bug is now fixed and our test data simulator seems to be working.
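In pseudocode terms, the fix boils down to something like the snippet below. The function and variable names are illustrative, not the actual names in the simulator, and `endcap_outside_wedge` stands in for the real geometric test:

```python
def stream_half_length(center, axis_unit, endcap_outside_wedge, step=0.1):
    # Grow the stream along its own axis until both end caps have left
    # the wedge. The buggy version added the increment to the
    # x-coordinate instead of stepping along axis_unit.
    half = 0.0
    while True:
        end_a = [c + half * a for c, a in zip(center, axis_unit)]
        end_b = [c - half * a for c, a in zip(center, axis_unit)]
        if endcap_outside_wedge(end_a) and endcap_outside_wedge(end_b):
            return half
        half += step
```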
For some proof that it is working now, let's look at the same plot I linked earlier, but now with our new simulated data.
In this plot, the blue dot represents the expected location of the likelihood peak and the green dot represents the actual likelihood peak in this parameter sweep. They are almost perfectly on top of each other! The small difference is easily accounted for by the step sizes used in our sweep and is well within our expected error tolerance for our program.
I put new runs up for Separation Modfit over the weekend:
These runs will show how easily Modfit optimizes to the expected values. I am also running non-modfit runs, which will show what differences we expect to see. This is important because it will give us a sanity check when we run modfit on real data and want to compare it to Matt Newby's pre-modfit results. All of these results will be included in a paper I am writing for publication explaining the Modfit algorithm and how it changes our fits.
Hope you guys appreciate the transparency this blog is bringing to our development process. If you have any suggestions for things you want me to include in my blog posts, or questions about what I just posted, please let me know.
TL;DR: I found the bug in the test data simulation program, and we have new runs up with the correct data. If you like pretty pictures and want proof, look at the parameter sweep and see that the two dots are almost the same.