Bootstrapping
By Elijah Bernstein-Cooper, August 19, 2015.

## Likelihoods vs. Bootstraps

Motivated by Leroy et al. (2009), I decided to attempt a simpler, more data-driven approach to estimating the uncertainties of the dust-to-gas ratio (DGR) and intercept. Using a bootstrap Monte Carlo, I am now able to derive meaningful uncertainties on the DGR and intercept with very few assumptions about the data.

Bootstrapping allows the data to speak for themselves. Maximum likelihood estimation (MLE), by contrast, requires much better constraints on the scale and distribution of the data uncertainties.

Bootstrapping also means that we do not have to bin or mask the data. In the MLE analysis, the results kept changing depending on how either step was fine-tuned.

### Relevance to Median fitting

The bootstrapping technique is similar to fitting to the medians of bins of the distribution. Resampling with replacement produces datasets dominated by the most common kind of data point. For the clouds under consideration, the most prevalent pixels are the diffuse pixels.

The advantage of bootstrapping over a median fit is that it gives us a probability density function (PDF) for each of the parameters, i.e. uncertainties on each parameter. We can sample this PDF in a Monte Carlo later on to estimate the uncertainties of parameters in the Krumholz model.
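As a rough illustration of how a set of bootstrap fits turns into an uncertainty, the sketch below summarizes a list of fitted DGR values by its median and the 16th and 84th percentiles, which bracket a 68% confidence interval. The function names and the sample values are made up for this example, and only the Python standard library is used.

```python
def percentile(sorted_vals, p):
    """Linearly interpolated percentile (p in [0, 100]) of a sorted list."""
    rank = p / 100.0 * (len(sorted_vals) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = rank - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def summarize(samples):
    """Return (median, lower error, upper error) at 68% confidence."""
    s = sorted(samples)
    p16, p50, p84 = (percentile(s, p) for p in (16, 50, 84))
    return p50, p50 - p16, p84 - p50


# Hypothetical DGR values from five bootstrap fits (made-up numbers).
dgr_samples = [0.09, 0.11, 0.10, 0.12, 0.08]
median, err_lo, err_hi = summarize(dgr_samples)
```

The same summary can be applied to the intercept samples, and the raw list of samples can be reused directly when propagating the uncertainties into a later Monte Carlo.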

### The Monte Carlo

For each bootstrap iteration I first simulate the $A_V$ noise by assuming normally distributed errors, and add this simulated noise to the $A_V$ data. I then resample the noisy $A_V$ data with replacement. I also add a small amount of additional noise corresponding to the scatter between the Schlegel et al. (1998) IR $A_V$ image and the Planck data.
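A single iteration of the procedure described above might look like the following sketch. The function name, the argument names, and the choice to combine the extra scatter with the per-pixel error in quadrature are assumptions made for illustration, not the analysis code itself.

```python
import math
import random


def noisy_resample(nhi, av, av_err, rng, extra_scatter=0.0):
    """Return one bootstrap realization as (nhi, av, av_err) lists."""
    n = len(av)
    # Total noise per pixel: measurement error plus the extra scatter term,
    # combined in quadrature (an assumption of this sketch).
    sigma = [math.hypot(e, extra_scatter) for e in av_err]
    # Step 1: perturb each A_V value with normally distributed noise.
    av_noisy = [a + rng.gauss(0.0, s) for a, s in zip(av, sigma)]
    # Step 2: draw n pixel indices with replacement, keeping each pixel's
    # N(HI), A_V, and error together.
    idx = [rng.randrange(n) for _ in range(n)]
    return ([nhi[i] for i in idx],
            [av_noisy[i] for i in idx],
            [sigma[i] for i in idx])


# Made-up values: N(HI) in 1e20 cm^-2, A_V and its error in mag.
rng = random.Random(42)
nhi = [1.0, 2.0, 3.0, 4.0]
av = [0.15, 0.25, 0.40, 0.45]
av_err = [0.02, 0.02, 0.03, 0.03]
nhi_boot, av_boot, err_boot = noisy_resample(nhi, av, av_err, rng,
                                             extra_scatter=0.01)
```

Each call produces one simulated dataset; repeating the call and refitting each realization builds up the distribution of fitted parameters.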

We can now perform weighted least squares on the resampled, simulated data. I minimize the error, $s$, defined as

$$ s = \sum_i w_i \left( y_i - \hat{y}_i \right)^2, $$

where $w_i$ is the weight, $y_i$ is the data, and $\hat{y}_i$ is the model for the $i$th element. The weights are defined as the inverse of the data variance.
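For a straight-line model, the weighted sum of squared residuals has a closed-form minimum. The sketch below assumes the model is $\hat{y}_i = \mathrm{DGR} \times x_i + \mathrm{intercept}$ with $x_i$ the N(HI) and $y_i$ the $A_V$ of pixel $i$; the function name and the example numbers are hypothetical.

```python
def wls_line(x, y, sigma, fit_intercept=True):
    """Weighted least-squares fit of y = slope * x + intercept.

    Minimizes s = sum_i w_i * (y_i - yhat_i)**2 with w_i = 1 / sigma_i**2.
    Returns (slope, intercept); intercept is 0.0 when fit_intercept is False.
    """
    w = [1.0 / s ** 2 for s in sigma]
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    if not fit_intercept:
        # Line through the origin: only the slope is free.
        return sxy / sxx, 0.0
    det = sw * sxx - sx * sx
    slope = (sw * sxy - sx * sy) / det
    intercept = (sxx * sy - sx * sxy) / det
    return slope, intercept


# Made-up example: N(HI) in 1e20 cm^-2, A_V in mag.
nhi = [1.0, 2.0, 3.0, 4.0]
av = [0.25, 0.45, 0.65, 0.85]  # exactly 0.2 * nhi + 0.05
av_err = [0.05, 0.05, 0.10, 0.10]
dgr, intercept = wls_line(nhi, av, av_err)
```

Setting `fit_intercept=False` holds the intercept at 0 mag, which corresponds to the fits without an intercept.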

## Application

### Selecting HI Range

I decided to use the method adopted by Martin et al. (2015) and Planck (2011) to select the HI range. These authors examine the standard deviation of the HI at each velocity. Valleys in the standard deviation spectrum mark the separation between independent structures. I outlined the steps I would use in a previous post. I used the HI range of -5 to 15 km/s for each cloud. California’s standard deviation spectrum is somewhat peculiar in that it has two components. However, California also has two CO components.
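The idea of locating valleys in the standard deviation spectrum can be sketched as below. This is a simplified illustration, not the procedure from the cited papers: it treats the cube as a list of per-pixel spectra, takes the population standard deviation across pixels in each channel, and flags strict local minima. The names and the toy data are hypothetical, and real spectra would likely need smoothing before valleys can be picked out reliably.

```python
import statistics


def std_spectrum(spectra):
    """Standard deviation across pixels at each velocity channel.

    spectra: list of per-pixel spectra, each a list with one value per channel.
    """
    return [statistics.pstdev(channel) for channel in zip(*spectra)]


def valley_velocities(velocities, spectrum):
    """Velocities of strict local minima (the two end channels are ignored)."""
    return [velocities[i] for i in range(1, len(spectrum) - 1)
            if spectrum[i] < spectrum[i - 1] and spectrum[i] < spectrum[i + 1]]


# Toy cube: 3 pixels x 5 channels, with two varying components
# separated by a channel that is constant across pixels.
velocities = [-10.0, -5.0, 0.0, 5.0, 10.0]
spectra = [
    [1.0, 4.0, 2.0, 6.0, 1.0],
    [1.0, 2.0, 2.0, 2.0, 1.0],
    [1.0, 6.0, 2.0, 10.0, 1.0],
]
sigma_v = std_spectrum(spectra)
valleys = valley_velocities(velocities, sigma_v)
```

In this toy case the only valley falls at 0 km/s, between the two components.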

### Lee+12 Comparison

We compare the bootstrap fits to the Lee+12 IRIS $A_V$ data with the MLE fits. The figure below shows the results with an intercept included in the fit, and with the intercept held at 0 mag. Including an intercept makes the uncertainty of the fit huge for Perseus. Leroy et al. (2009) held the intercept at 0 mag in their fit. We show an alternative to using an intercept later in the post.

#### Figure 1

Lee+12 $A_V$ vs. N(HI) with, top-left: bootstrap fit, right: MLE fit, bottom-left: bootstrap fit without an intercept. The errors reported on the bootstrap fit are at 68% confidence. The bootstrap fit corresponds well to the polynomial fit to the entire dataset on the right. However, the DGR is almost twice as large as the Lee et al. (2012) DGR of 0.11. Without an intercept, the bootstrap DGR is much closer to that of Lee et al. (2012).

### Planck

The rest of the results use the Planck $A_V$ data. Only 100 bootstraps were run. At least 10,000, comparable to the number of pixels in the data, are needed to realistically sample the variation in the data. I show results with and without an intercept, as well as results that consider a background dust population. Fits without an intercept are shown for comparison with Leroy et al. (2009), who did not fit an intercept but instead removed a background by hand.

#### With intercept

Perseus

Taurus

California

##### Figure 2

Left: Planck $A_V$ vs. N(HI) bootstrap fit. Right: Distribution of intercept and DGR from all bootstraps. The fits to California tend to trace a much more linear regime of the $A_V$ vs. N(HI) distribution compared to a simple polynomial fit, which found DGRs on the order of 0.4.

#### Without intercept

Perseus

Taurus

California

##### Figure 3

Left: Planck $A_V$ vs. N(HI) bootstrap fit. Right: Distribution of the DGR from all bootstraps.

#### Background Cloud

We could also fit for multiple clouds along the line of sight as done in Planck (2011). This would allow us to associate the excess of dust emission with HI not associated with the cloud, especially in California.

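Our model $A_V$ would be represented as

$$ A_V = A_{V,B} + A_{V,C} + A_{V,I}, $$

where the B subscript represents the background, the C subscript represents the cloud, and $A_{V,I}$ is the intercept. Each model $A_V$ component would be represented as

$$ A_{V,B} = \mathrm{DGR}_B \times N(\mathrm{HI})_B, \qquad A_{V,C} = \mathrm{DGR}_C \times N(\mathrm{HI})_C. $$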

We get the cloud N(HI), N(HI)$_C$, as usual, by selecting the HI range from the standard deviation spectrum. The background N(HI), N(HI)$_B$, is the integrated HI along the line-of-sight excluding the cloud HI range. For the three clouds the background HI range consists of -100 to -5 km/s, and 15 to 100 km/s. Only in California and Taurus are the backgrounds significant.
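Splitting a spectrum into cloud and background column densities over these velocity ranges could look like the sketch below. It assumes optically thin HI, for which $N(\mathrm{HI}) = 1.823 \times 10^{18} \int T_B \, dv$ cm$^{-2}$ with $T_B$ in K and $v$ in km/s; the post does not state how N(HI) was computed, so this conversion, the function name, and the flat toy spectrum are assumptions. Half-open velocity intervals keep the channels at -5 and 15 km/s from being counted twice.

```python
NHI_PER_K_KMS = 1.823e18  # cm^-2 per (K km/s), valid for optically thin HI


def column_density(velocities, tb, dv, vmin, vmax):
    """N(HI) in cm^-2 from channels with vmin <= v < vmax.

    velocities: channel velocities in km/s; tb: brightness temperature in K;
    dv: channel width in km/s.
    """
    integral = sum(t * dv for v, t in zip(velocities, tb) if vmin <= v < vmax)
    return NHI_PER_K_KMS * integral


# Toy spectrum: 1 K in every 1 km/s channel from -100 to 99 km/s.
velocities = [float(v) for v in range(-100, 100)]
tb = [1.0] * len(velocities)

nhi_cloud = column_density(velocities, tb, 1.0, -5.0, 15.0)
nhi_background = (column_density(velocities, tb, 1.0, -100.0, -5.0)
                  + column_density(velocities, tb, 1.0, 15.0, 100.0))
```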

We fit for DGR$_B$, DGR$_C$, and $A_{V,I}$ as three separate parameters.
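A minimal sketch of such a three-parameter weighted fit is shown below, assuming the model $A_V = \mathrm{DGR}_B \, N(\mathrm{HI})_B + \mathrm{DGR}_C \, N(\mathrm{HI})_C + A_{V,I}$ and weights equal to the inverse variance. It solves the weighted normal equations with a small Gaussian elimination so that it needs only the standard library; the function names and data are hypothetical, and a linear-algebra library would normally do this step.

```python
def solve_linear(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_cloud_background(nhi_b, nhi_c, av, av_err):
    """Weighted fit of A_V = DGR_B * N_B + DGR_C * N_C + A_V,I.

    Returns [DGR_B, DGR_C, A_V,I].
    """
    rows = [[nb, nc, 1.0] for nb, nc in zip(nhi_b, nhi_c)]
    w = [1.0 / e ** 2 for e in av_err]
    # Weighted normal equations: (X^T W X) p = X^T W y.
    xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, rows))
             for k in range(3)] for j in range(3)]
    xtwy = [sum(wi * r[j] * yi for wi, r, yi in zip(w, rows, av))
            for j in range(3)]
    return solve_linear(xtwx, xtwy)


# Made-up pixels: N(HI) in 1e20 cm^-2, A_V in mag, built from known parameters.
nhi_b = [1.0, 2.0, 3.0, 4.0, 5.0]
nhi_c = [2.0, 1.0, 4.0, 3.0, 6.0]
av = [0.05 * nb + 0.10 * nc + 0.20 for nb, nc in zip(nhi_b, nhi_c)]
av_err = [0.1] * 5
dgr_b, dgr_c, av_intercept = fit_cloud_background(nhi_b, nhi_c, av, av_err)
```

Running this fit on each bootstrap realization gives joint distributions of the three parameters, such as background DGR vs. cloud DGR.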

Perseus

Taurus

California

##### Figure 4

Top-left: Planck $A_V$ vs. cloud N(HI), bottom-left: $A_V$ vs. background N(HI). Top-right: Bootstrap distribution of background vs. cloud DGR, bottom-right: Bootstrap distribution of intercept vs. cloud DGR. The inclusion of a background cloud seems to give reasonable results, especially for California, which has so much HI along the line-of-sight. Without the background cloud fit, California’s DGR is on the order of $0.2 \times 10^{-20}$ cm$^2$ mag.

## Next Steps

1. Run simulation with many more bootstraps.

2. Choose whether or not to use the background cloud fit.

3. Incorporate more uncertainties in the Monte Carlo bootstrapping. Examples are:

• Uncertainty in the calibration between different data sets, e.g. 2MASS and Schlegel et al. (1998).

• Varying dust opacity