[PDF] Benchmark Tests for Markov Chain Monte Carlo Fitting of Exoplanet Eclipse Observations

Abstract

Ground-based observations of exoplanet eclipses provide important clues to the planets' atmospheric physics, yet systematics in light curve analyses are not fully understood. It is unknown if measurements suggesting near-infrared flux densities brighter than models predict are real, or artifacts of the analysis processes. We created a large suite of model light curves, using both synthetic and real noise, and tested the common process of light curve modeling and parameter optimization with a Markov Chain Monte Carlo (MCMC) algorithm. With synthetic white-noise models, we find that input eclipse signals are generally recovered within 10% accuracy for eclipse depths greater than the noise amplitude, and to smaller depths for higher sampling rates and longer baselines. Red-noise models see greater discrepancies between input and measured eclipse signals, often biased in one direction. Finally, we find that in real data, systematic biases result even with a complex model to account for trends, and significant false eclipse signals may appear in a non-Gaussian distribution. To quantify the bias and validate an eclipse measurement, we compare both the planet-hosting star and several of its neighbors to a separately-chosen control sample of field stars. Re-examining the Rogers et al. (2009) Ks-band measurement of CoRoT-1b finds an eclipse 3190 +370 −440 ppm deep centered at ϕ me = 0.50418 +0.00197 −0.00203 . Finally, we provide and recommend the use of selected datasets we generated as a benchmark test for eclipse modeling and analysis routines, and propose criteria to verify eclipse detections.

Full PDF

aa r X i v : . [ a s t r o - ph . E P ] F e b Benchmark Tests for Markov Chain Monte Carlo Fitting ofExoplanet Eclipse Observations

Justin Rogers , , Mercedes L´opez-Morales , , D´aniel Apai , Elisabeth Adams e-mail: [email protected] Received October 24, 2018

ABSTRACT

Ground-based observations of exoplanet eclipses provide important clues tothe planets’ atmospheric physics, yet systematics in light curve analyses are notfully understood. It is unknown if measurements suggesting near-infrared ﬂuxdensities brighter than models predict are real, or artifacts of the analysis pro-cesses. We created a large suite of model light curves, using both synthetic andreal noise, and tested the common process of light curve modeling and parame-ter optimization with a Markov Chain Monte Carlo (MCMC) algorithm. Withsynthetic white-noise models, we ﬁnd that input eclipse signals are generally re-covered within 10% accuracy for eclipse depths greater than the noise amplitude,and to smaller depths for higher sampling rates and longer baselines. Red-noisemodels see greater discrepancies between input and measured eclipse signals, of-ten biased in one direction. Finally, we ﬁnd that in real data, systematic biasesresult even with a complex model to account for trends, and signiﬁcant falseeclipse signals may appear in a non-Gaussian distribution. To quantify the biasand validate an eclipse measurement, we compare both the planet-hosting starand several of its neighbors to a separately-chosen control sample of ﬁeld stars.Re-examining the Rogers et al. (2009) Ks-band measurement of CoRoT-1b ﬁndsan eclipse 3190 +370 − ppm deep centered at φ me =0 . +0 . − . . Finally, we pro-vide and recommend the use of selected datasets we generated as a benchmark Johns Hopkins University, Department of Physics and Astronomy, 366 Bloomberg Center, 3400 N.Charles Street, Baltimore, MD 21218, USA Carnegie Institution of Washington, Department of Terrestrial Magnetism, 5241 Broad Branch Rd. NW,Washington D.C., 20015-1305, USA Harvard-Smithsonian Center for Astrophysics, 60 Garden St., Cambridge, MA 02138, USA Department of Astronomy, The University of Arizona, and Department of Planetary Sciences, TheUniversity of Arizona, Tucson, AZ, USA

Subject headings: binaries:eclipsing – planetary systems – stars:individual (CoRoT–1) – techniques: photometric – methods: analytical – methods: statistical

1. Introduction

Eclipses of transiting exoplanets, detected as a dip in ﬂux while the planet passes be-hind the star, provide a wealth of information about the planets’ atmospheric properties.The depth of an eclipse in a given wavelength passband provides information about theplanet’s thermal or reﬂected emission, or both, relative to the ﬂux of the host star. There-fore, an eclipse detection, even in a single passband, can constrain the day-side tempera-ture and atmospheric energy circulation of the planet at a given atmospheric height. Ameasurement at optical wavelengths can also place an upper limit on the planet’s albedo(see e.g. L´opez-Morales & Seager 2007). Multiple photometric detections at diﬀerent wave-lengths, or equivalently a spectrum, can further identify the albedo and energy trans-fer, and provide information about the chemical composition of the planets’ atmospheres,their vertical temperature proﬁles, and the presence or absence of thermal inversion layers(e.g. Burrows et al. 2008; Fortney et al. 2008).To date, more than a dozen exoplanets have published eclipse detections. Most of thoseare from the

Spitzer Space Telescope , including eclipses in the four mid-infrared bands of theIRAC instrument at 3.6, 4.5, 5.6 and 8 µ m (e.g. Charbonneau et al. 2005; Machalek et al.2008; Knutson et al. 2009), at 16 µ m with IRS (e.g. Deming et al. 2006; Grillmair et al.2008), and at 24 µ m with MIPS (e.g. Deming et al. 2005; Charbonneau et al. 2008). Re-cently, optical and near-IR eclipse detections have also been made with the Hubble Space Tele-scope (Swain et al. 2009), the

CoRoT and

Kepler missions (Alonso et al. 2009; Snellen et al.2009; Borucki et al. 2009; Coughlin & L´opez-Morales 2012), and also with ground-based tele-scopes (Sing & L´opez-Morales 2009; de Mooij & Snellen 2009; Gillon et al. 2009; Rogers et al.2009; Anderson et al. 2010; Gibson et al. 2010; L´opez-Morales et al. 2010; Croll et al. 2010a,b;Smith et al. 2011; Croll et al. 2011; C´aceres et al. 2011; de Mooij et al. 2011; Zhao et al.2012b,a; Crossﬁeld et al. 2012).The optical and near-IR detections are of particular interest, since they can be madefrom the ground, can provide information about the planetary albedos (optical observations),and because they are closest to the blackbody emission peaks of hot jupiter exoplanets. Thislast factor is important because it allows an estimation of the bulk of the energy emitted by 3 –those planets. One feature observed in the ground-based detections to date is that severaleclipses appear too deep (i.e. the planet is too bright) in the near-IR (around 2.0 µ m),which suggests that the planets are hotter than predicted by standard models. This hasbeen seen for CoRoT-1b in the eclipse observations by two independent teams at similarwavelengths, i.e. in Ks-band (2.2 µ m) (Rogers et al. 2009) and in NB2090 band (2.09 µ m)(Gillon et al. 2009), for WASP-19b in NB2090 band (Gibson et al. 2010), for WASP-12bin Ks-band (Croll et al. 2011), for HAT-P-1b in Ks-band (de Mooij et al. 2011), and mostrecently for WASP-3b in Ks-band (Zhao et al. 2012a). These results may be attributed toa real physical eﬀect, suggesting alternate atmospheric condition scenarios – for example,unexpected chemical abundances or non-LTE conditions. However, because the signal-to-noise ratios are low, any bias or systematic eﬀects in the analytical methods used to detectand model the eclipses could dramatically change the eclipse measurements and conclusions.The most common methods used to measure physical characteristics from transit andeclipse light curves involve constructing a parameterized model of the light curve and its sys-tematic eﬀects, and then using a routine – typically a Markov Chain Monte Carlo (MCMC)algorithm – to derive the best-ﬁtting set of parameters according to a simple criterion such asthe χ or Bayesian Information Criterion value (e.g. Harrington et al. 2007). Some alterna-tives exist; for example, SysRem (Tamuz et al. 2005), wavelet-based algorithms (Carter & Winn2009), non-parametric algorithms (Waldmann 2011), and Gaussian processes (Gibson et al.2012). These are discussed in Section 2.2.In this work, however, we focus on what remain the most common methods for measuringthe signals and potential systematics associated with exoplanet eclipse observations from theground, i.e. MCMC algorithms, and present benchmark data sets to test the eﬀectivenessof these methods. In Section 2 we describe the characteristics of a typical ground-basedeclipse observation and the data modeling and signal extraction process. In Section 3 wedescribe our model implementation to test potential systematic eﬀects associated with thesteps described in Section 2. Section 4 presents the results of the tests for synthetic data,both with and without red noise, and in Section 5 we show the results of those same testsusing real datasets. In Section 6 we discuss the measurement of real eclipses and our re-evaluation of the depth and uncertainty of the published Ks-band eclipse of CoRoT–1b. Ourconclusions and the implications for future exoplanet eclipse searches are summarized inSection 7. Appendix A provides a collection of benchmark data sets that can be shared fortesting and verifying eclipse modeling and analysis processes. 4 –

2. Secondary Eclipse Observations and Detection Methods

The methods described in this section focus on ground-based observations of secondaryeclipses, which are becoming the primary means of exploring the optical and near-IR portionsof exoplanetary emission spectra, but are currently made at only modest conﬁdence levels(typically less than 10 σ ). However, this same description may also apply to primary transitobservations and to observations from space, if the signiﬁcance levels of the signals are low.Detections of eclipses in the optical and near-IR are presently limited to “very hotJupiters” (VHJs), i.e. Jupiter-size exoplanets which reside very close to their host stars (lessthan 3-day orbits) and are expected to be tidally locked, with typical atmospheric day-sidetemperatures around 1500–2500 K. The large radii and hot day-sides make these VHJs themost favorable exoplanets for detection. However, even in these cases, the eclipse signals ofVHJs are typically very small ( < < “red” noise, caused mainly by systematic trends related to changing atmosphericand instrumental parameters (Pont et al. 2006). Although noise levels should be improvedby collecting a large number of data points during one or multiple eclipse epochs, this is thecase if the only noise present is random (Poisson or white noise). In practice, the red noiselimits the noise reduction, which in turn limits the conﬁdence of the eclipse measurement.This explains all of the published ground-based detections being at the modest 3-10 σ con-ﬁdence level, except for the K S -band detection of WASP-12b by Croll et al. (2011) — byobserving the target for 6.2 hours with no dithering and 22 stable reference stars, this eclipsewas measured to a remarkable 25- σ level. New techniques for detrending and removal ofsystematics have signiﬁcantly lowered the amplitude of residual red noise with respect to theunbinned dispersion due to Poisson noise, but some level of residuals are still typical, as seenin most of the published detection papers. Even if the instrumental and other systematicsare well-accounted for, intrinsic stellar noise itself can contribute above the Poisson level(Gilliland et al. 2011). A typical exoplanet eclipse observation consists of a series of short exposures – a “lightcurve” – in a single night over several hours (enough to capture the entire eclipse andsuitable baseline). The most critical attributes determining whether the eclipse signal can 5 –be extracted are the depth of the signal (i.e. the planet-to-star ﬂux ratio) and the noise levels(both red and white components). The chief goals are to minimize the point-to-point ﬂuxdispersion of the light curve and to obtain as many data points as possible, in order to bestdetect a diﬀerence between the in- and out-of-eclipse ﬂux levels. These goals are complicated,however, as each characteristic of the observation set – comparison stars, airmass, ditheringstrategy, phase coverage, sampling rate – contributes to the level and structure of the noise.There are a number of observational observational techniques employed to optimize the lightcurve for detection, whether from a space-borne dedicated instrument (see e.g. Gilliland et al.2011 for the

Kepler process), or using existing ground-based facilities.Most VHJ eclipses last between 1.5 and 2.5 hours, the time it takes for the planet to passbehind the star from the observer’s viewpoint. Table 1 lists the parameters that determinethis duration for CoRoT-1b (Barge et al. 2008; Bean 2009), a VHJ that we have targeted forsecondary eclipse detection, and a synthetic planet that we discuss in Section 4. T is thetotal time from the beginning of the eclipse’s ingress (ﬁrst contact) to the end of its egress(fourth contact), i.e. the “Full Duration” of the eclipse; T is the “Flat-Bottom Duration”– the time from second to third contact during which that the planet is entirely eclipsed bythe star (i.e. when the light curve is ﬂat and at minimum ﬂux).While it is possible to measure the depth of an eclipse with only partial coverage (e.g. theWASP-12b J-band measurement by Croll et al. 2011), the optimal scenario is to observe acontinuous window including the entire eclipse and some out-of-eclipse baseline both beforeand after. If the total number of baseline observations is less than the number of observationsduring full eclipse, the conﬁdence level of the eclipse detection will be limited by the baseline’snoise statistics. Obtaining a baseline at least as long as the eclipse duration will provide abetter opportunity to make a detection, with the noise statistics only limited by the numberof in-eclipse points.Obtaining many data points requires a sampling rate as fast as possible. There isa competing interest, however, in obtaining enough counts per frame to limit the photonnoise in the target and comparison stars. An established strategy is to set the exposuresto be just long enough that the dispersion is dominated by correlated noise rather thanphoton statistics. For ground-based observations, this typically means obtaining at least10 integrated photons per frame (in both the target and comparisons) to limit the photonnoise to 10 − , approximately equal to the contribution level of the red noise, readout errors,ﬂat-ﬁelding errors, etc. Longer exposures, limited by the same systematic noise factors, willnot substantially improve the photometric precision. For near-IR observations, particularlyK-band, there are also sky background concerns to consider – long exposures will saturate,and the faster sky variability is another motivation to have a fast sampling rate. To avoid 6 –saturation but still obtain enough photon counts, it is sometimes necessary to defocus thestar images and spread the signal over more pixels on the chip. This allows the collectionof more photons and can also reduce the sensitivity to small shifts in the stars’ locationson the chip, but the greater number of pixels used will also contribute more noise. Moderninstruments can employ other methods of improving the observing cadence, such as the useof faster-readout subframes or routines designed to optimize the duty cycle. For example, theeclipse of OGLE-TR-56b (Sing & L´opez-Morales 2009) was detected with an E2V camerafeaturing frame-transfer technology allowing almost no readout time and near 100% dutycycle, and the WASP-4b detection (Caceres et al. 2011) used a datacube mode that canstore 250 images at a time with no readout overhead until the end. Taking all of this intoconsideration, the ideal sampling rate is set anywhere from one point every couple of minutesto as many as 100 points per minute, depending on the telescope’s aperture and the apparentbrightness of the host and comparison stars.One of the most important systematic noise minimization strategies is to have at leastone nearby star suitable to be used as a comparison for diﬀerential photometry. An idealcomparison star should be non-variable and similar to the target in brightness (to achievegood photon noise in the same exposure time as the target, without saturating) and color(to limit diﬀerential atmospheric extinction). Better still is to have many comparison stars,although this is limited by the instrument’s ﬁeld of view and the brightness of the target.Measuring the target system’s ﬂux relative to the comparison stars, rather than the absoluteﬂux count of the target, substantially reduces the atmospheric eﬀects, including airmass andthin clouds. To minimize chromatic eﬀects, it is also desirable to limit the observations toan airmass less than ∼ Even with all the noise-minimization techniques described in the previous section, asigniﬁcant amount of red noise systematics will likely remain in the observed light curve. 7 –The next (and most diﬃcult) step is to model and remove these systematic trends; otherwisethey can confound the extraction of the eclipse signal.The amount of correlated noise in a data series can be estimated by calculating thedispersion σ N when binning the residuals (i.e. the diﬀerence between the observed data andthe best-ﬁt model) into N points. For an entirely random noise distribution (white noise), σ N will decrease as N − / , but in the presence of red noise it instead follows: σ N = σ w N + σ r (1)where σ w and σ r are the amplitudes of white and red noise, respectively. Thus, the contri-bution of the white noise can be reduced by binning over larger samples, but the red noisecannot. Over an eclipse light curve, the red noise often contributes 20 to 50% as much noiseas the unbinned white noise level before detrending (Pont et al. 2006). The white and rednoise amplitudes can be estimated for any data series by ﬁnding the σ w and σ r values thatbest suit this equation.The red noise is not completely unexplained, however. In practice, the detected ﬂuxis often found to correlate with various eﬀects of the atmosphere and instrument (seee.g. Harrington et al. 2007). The observed light curve can be thought of as the productof three components: the ﬂux reduction by the actual eclipse, the intrinsic random whitenoise (photon noise, read noise, etc.), and the red noise resulting from systematic correla-tions. The goal is to remove the systematic trends and recover the basic parameters thatdescribe the shape of the eclipse signal: the eclipse depth D e , which represents the emis-sion of the planet in the observed passband, and the mid-eclipse phase φ me , which providesinformation about the eccentricity.There are several approaches to model and remove these systematics, and several so-phisticated algorithms have been proposed and tested. For example, for ﬁelds with manystars, algorithms like SysRem (Tamuz et al. 2005) can automatically measure trends andremove them from the light curve to reduce the red noise (e.g. Sing & L´opez-Morales 2009).Carter & Winn (2009) introduced a wavelet-based method to estimate the parameters. In it,the noise is treated as the sum of time-correlated and uncorrelated components, and the like-lihood function is calculated with a fast wavelet transform. Waldmann (2011) has proposeda non-parametric algorithm to blindly de-convolve all non-Gaussian signals from the eclipselight curve. This method is best suited for simultaneous observations at diﬀerent wave-lengths (e.g. binned spectroscopic observations); most photometric detections are based on asingle light curve, or curves of eclipses on diﬀerent nights. Both of these alternative methodspreferentially remove the correlated noise without making use of the measured atmosphericand instrumental quantities that are often used to seek correlations and systematic eﬀects. 8 –In another method, Gibson et al. (2012) have used Gaussian processes for regression, whichdeﬁne a distribution over function space, and maintain some advantages of both the linearparameter model and the Waldmann algorithm. Their technique makes use of additionalmeasured parameters, but then ﬁnds the distributions of systematics in a non-parametricway. Most common, however, is to simply look for correlations between observed ﬂux andmeasured atmospheric or instrumental parameters, and remove them to reveal the signal.The most basic method is to use only the out-of-eclipse baseline points of the light curves,i.e. the portions that are expected to be constant in time, to manually look for and removetrends one at a time, beginning with the most strongly correlated. For example, if thebaseline ﬂux appears to correlate strongly with airmass, one can ﬁt a linear trend betweenairmass and the diﬀerential ﬂux in the baseline, and then remove this trend from the entirelight curve. This process can be repeated iteratively with other parameters until the trendsremaining in the baseline are minimized. Since these ﬁts are done using only the out-of-eclipseportions of the light curve, systematics in the in-eclipse data are removed by interpolation ofthe out-of-eclipse ﬁts. Many of the ground-based eclipse detections (e.g. Rogers et al. 2009;Croll et al. 2011) have been made in this manner.Some weaknesses of looking for trends only in the out-of-eclipse portions of the lightcurve are that it requires a prior guess of which points are not in the eclipse, and it does notmake use of the large amount of in-eclipse data, in which other trends could be present. Themanual removal of one correlation at a time can also be problematic in cases where thereis not a dominant trend, as the removal of a second trend can reintroduce correlations withthe ﬁrst parameter.Arguably a better approach is to consider the eﬀects of the correlation trends andthe shape of the eclipse signal simultaneously together in a model. This can be done byconstructing a correlation function C q ( q ) with each independent variable q one might pickout from the observations (e.g. airmass, position on the detector, sky brightness, exposuretime, shape of the images, etc.), and then combining those functions with the eclipse lightcurve F ecl to form a more detailed model for the observed ﬂux: F obs = F ecl Y q C q ( q ) (2)At ﬁrst-order one may use linear C q functions, although in fact the best correlations maybe non-linear. This has been demonstrated to be a useful technique, simultaneously ac-counting for the physical eﬀect of the system and the atmospheric and instrumental eﬀects(Harrington et al. 2007).A challenging aspect of this method, however, is the need to test a large number of free 9 –parameters, without initially knowing which parameters are the most essential. Testing acomprehensive grid of possible combinations of these parameters can be time-prohibitive;instead, a method is needed to eﬃciently ﬁnd the set of parameters, with uncertainties, thatbest models the observed light curve. A few diﬀerent algorithms are used to achieve this,but the most common is the MCMC, which we discuss in Section 3.2.We consider this last approach to be the most robust; therefore it is the one we havefocused on in the rest of this work. Our implementation is discussed in detail in Section 3.

3. Model Implementation and Systematic Eﬀects Tests

To test the analysis process which extracts eclipse detections from photometric lightcurves, we set out to ﬁrst generate a variety of test datasets (discussed in Sections 4 and5), and then evaluate them using our own adaptation of a multi-parameter model of theeclipse plus correlation trends, and MCMC routines to ﬁnd the best-ﬁt set of parameters.The goal is to determine how well this type of model and the MCMC process measure aninput eclipse of known depth and phase placement. If the process is working eﬀectively,the best-ﬁt solution should recover the same correlation and eclipse signal that was input.Under a range of noise levels, systematics, and other conditions, how close does the MCMCcome to recovering a signal of a given depth? At what noise limits do we no longer detectthe input signal? Are there any systematic errors in the measurement? In order to answerthese key questions, we required our own implementation of an observed light curve modeland a process for ﬁtting the model with the best parameters.

We began constructing our model of the light curve of a secondary eclipse of a typicalVHJ with a simple eclipse curve - the light curve that the system’s ﬂux would produce in theabsence of any noise. This shape is determined by a number of star-planet system parameters,generally constrained by transit and radial velocity observations to better precision thanachievable from eclipse observations: stellar radius R ∗ , planet radius R P , orbital period P ,semi-major axis a , inclination i , and eccentricity e . We selected for these quantities valuescomparable to those of CoRoT-1b and other representative VHJs, as given in Table 1.Similarly to the transit models in Mandel & Agol (2002), the eclipse curve was generatedby simulating the motion of the planet relative to the star, and calculating from this thefraction of the planet’s surface area eclipsed P ecl as a function of orbital phase φ . The 10 –eclipse window for a transiting VHJ is typically a small enough portion of the period thatorbital eccentricity aﬀects only φ me , the phase at which the middle of the eclipse is centered,which is directly related to e cos ω , where ω is the argument of the periastron. We beganby calculating the function r ( φ ), the separation between star and planet disk centers in theplane of the sky, in units of R ∗ : r ( φ ) = aR ∗ [1 − sin i cos ( φ − φ me )] / , (3)and from there we found the portion of the planet that is eclipsed as a function of r : P ecl ( r ) =  r ≥ p πp ( θ − sin θ cos θ ) + π ( θ − sin θ cos θ ) 1 − p < r < p r ≤ − p (4)where p is the unitless ratio R P /R ∗ , and the angles θ and θ are deﬁned by:cos θ = 1 + r − p r ; cos θ = r + p − rp . (5)Because it is the planet and not the star that is eclipsed, stellar limb darkening does notaﬀect the eclipse shape, and the portion of the light curve when the planet is fully obscuredwill be ﬂat. The ﬂux decrement, or depth D e , of the eclipse depends on the brightness(thermal and reﬂected emission) of the planet. For our purposes, we considered the planetto be seen as a uniformly bright disk, allowing us to model the eclipse shape as F ecl ( φ ) = F b ∗ { D e ∗ [1 − P ecl ( φ, φ me )] } (6)where F b is a normalizing baseline ﬂux parameter. P ecl ( φ, φ me ) was found using Equations3, 4, and 5, the mid-eclipse phase φ me , and the other ﬁxed parameters, and then used with D e to model the expected ﬂux from the star and planet throughout the eclipse (ﬂux of 1 infull eclipse, 1+ D e when the planet is fully visible, normalized to the brightness of the staralone). Thus, with all the other star and planet parameters held constant, the model for theeclipse shape depends only on F b , D e , and φ me .Next, in order to model the systematic eﬀects, we allowed possible relationships betweenthe detected ﬂux and various independent variables measured in the dataset, constructingfor each variable a simple linear correlation function: C q ( q ) = 1 + m q q (7)where m q is the correlation slope factor between a given independent variable q and the ﬂux.Each linear function could in principle also have an intercept b q , but these can easily becombined into the overall baseline F b , so they are not needed in each correction curve. 11 –Note that this linear function is merely the simplest possible correlation one could devise.For measurements such as pointing/pixel eﬀects, it is an approximation for eﬀects that couldbe nonlinear. Thus, this model can be expanded to include polynomial or other correlationfunctions, e.g. a quadratic function C q ( q ) = 1 + m q q + m q q . Later, we explored this in our“real data” tests (Section 5.3).We considered several variables that might be available and relevant: the airmass ( z ),the full-width half-maximum size of the star images ( w ), their x- and y- positions on the chip( x , y ), sky brightness ( s ), and time (or, equivalently, orbital phase φ ). Correlations withother characteristics can of course also be modeled as needed, but the model parameters weselected for this work are these, also listed in Table 2.Some of these ( φ , z ) are universal in each dataset, but the rest ( m w , m x , m y , m s )are measured separately for each star in the ﬁeld. As long as a variable exhibits the samebehavior over time for each star (e.g. if one star moves three pixels in the x-direction and itsFWHM increases by 10%, the same happens to all other stars in the ﬁeld), a single set ofvalues can be used to determine that trends with that variable. Even though each star maybe aﬀected diﬀerently by changes in one parameter, the diﬀerential ﬂux can be modeled witha single correlation. However, if the behavior of the same variable diﬀers greatly from starto star, it may be better to use a diﬀerent set of measurements, and model parameters, foreach star (e.g. if there is a diﬀerent sky brightness behavior in diﬀerent parts of the ﬁeld, use s , s , etc. rather than just one s – and increase the number of corresponding slope factors m s ).In addition, as discussed in Section 2.1, a single dataset may comprise observations frommultiple oﬀset / dither positions, each of which can exhibit diﬀerent trends. For example, x , y , and w each aﬀect the distribution of photons across diﬀerent pixels on the detector, sotheir eﬀect on the measured ﬂux will not be the same for diﬀerent oﬀset positions. The twoprimary parameters ( D e , φ me ) in Table 2 are global to the entire dataset, deﬁning the shapeand placement of the eclipse in the light curve, but the rest can have diﬀerent behavior atdiﬀerent oﬀset positions; thus, we use a separate set of these parameters to apply to thepoints at each oﬀset position.All the correlation functions for the independent variables are then multiplied by thenoiseless secondary eclipse light curve (Equation 6) to form a more accurate model for theobserved ﬂux of the target planet-star system as a function of orbital phase φ , and theadjustable model parameters: F mod = F ecl ( φ, F b , D e , φ me ) Y q C q ( q ) . (8) 12 –By ﬁnding the parameters that produce a single best-ﬁt match of the model in Equa-tion 8 to the observed light curve, we derive the key values we are seeking: the eclipse’sdepth and central phase. With diﬀerently-behaving parameters, multiple stars and oﬀset positions, the numberof parameters required to produce the full light curve model can begin to grow rapidly,and testing even a sparse grid of possible combinations can become prohibitive. To moreeﬃciently sample the parameter space and ﬁnd the best solution with uncertainties, weimplemented two separate MCMC algorithms, each developed independently to ensure thatwe were not seeing the eﬀects of a problematic procedure.The ﬁrst MCMC process, which we called “MCMC-A”, begins by generating a set ofinitial guesses for all the parameters, and in each step randomly selects one parameter toperturb by a small amount, or “jump.” Each step generates a model light curve for thisset of parameters including the jump, and calculates an information criterion, or “ﬁtness”,for example, the χ value between that model and the observed points (see Section 3.3).The new ﬁtness is compared to the ﬁtness at the previous step, and then the jump is eitheraccepted or rejected, following the Metropolis-Hastings algorithm: If the new ﬁtness is lower(i.e. better) than the previous one, the jump is accepted, and the chain proceeds to thenext step with a set of parameter values including that new jumped-to value; if the newﬁtness is worse than the previous one, the jump may still be accepted, with a probability P jump = e − ∆ χ , where ∆ χ is the diﬀerence in ﬁtness caused by this jump. The secondoption prevents the solution from remaining in a local minimum in the parameter space. Ifthe jump is rejected, the pre-jump parameters are carried over to the next step. This processof random jumps being accepted or rejected is repeated up to four million times, resultingin a distribution of values for each parameter from all the steps of the MCMC. To removebiases from the starting position, we discard the ﬁrst 20% of the steps, and then examine theremaining distribution for each parameter. These values are then binned into a histogram,and the midpoint of the most heavily-populated bin (i.e. the mode average) is taken to bethe best-ﬁt value of that parameter, with the errors set such that the central 68% of thepoints fall within the 1- σ limits.For every eclipse we tested, MCMC-A ﬁrst ran two short (10,000-step) MCMC trials toadjust the jump size for each parameter, i.e. the amplitude of the random-normal distributionfrom which each jump is selected. If a jump size were set too small, the distribution wouldbe too diﬀuse; if the jumps were too large, the chain would not jump enough. After setting 13 –jump sizes for each parameter that would result in 30 - 40% of jumps being accepted, we ranﬁfteen separate MCMC’s – three starting from each of ﬁve diﬀerent selected initial parameterguesses, in order to test the eﬀects that the starting position would have on the chain. Werepeated each test with two diﬀerent information criteria used for the Metropolis-Hastingstest: χ and the Bayesian Information Criterion (described in Section 3.3). All of the datadescribed in the following sections (synthetic and real datasets) were analyzed with thisMCMC-A implementation.We also tested a representative sample of light curves with a second, independent MCMCanalysis process by a diﬀerent team member, using the model described in Carter et al.(2011). This analysis, which we called MCMC-B, used a Gibbs sampler to vary a randomparameter for each step, before also using the Metropolis-Hastings jump acceptance criterion.Before each long chain, a sequence of shorter sequences was run to tune the step sizes foreach parameter to between 20-40%. For each test, one chain of 1,000,000 values was run,with the ﬁrst 5% discarded, and the median and standard deviation of the truncated chainfor each parameter are reported as the best-ﬁt values and errors.When run independently on the same datasets, the two MCMC processes producedsimilar results within their reported uncertainties. Table 3 compares the results between thetwo processes on a sample of eight selected synthetic light curves. The ﬁrst three columnsdescribe the characteristics of each dataset: the amplitude σ r and pattern of the red noiseused in the light curve (see Section 4.4 and Figure 1), the length of the observing window,and the sampling rate of the observations. Columns 4 through 6 list the eclipse depth thatis put into the light curve initially, and recovered by each MCMC process, where “Fit (A)”and “Fit (B)” correspond to MCMC-A and MCMC-B respectively. Columns 7 through 9show the inputs and results for the mid-eclipse phase. All eight tests used the χ informationcriterion, and had an input white noise amplitude σ w set to 1000 ppm. The best-ﬁt solutionsfor eclipse depth D e and mid-eclipse phase are also shown in Figure 2. From these testswe concluded that both independently-developed MCMCs produce consistent results. Thusconﬁrming the reliability of each, we proceeded to use MCMC-A in the remainder of thiswork. When including parameters in the model, we do not initially know which are relevantand which will simply complicate the process without helping to model the systematics; forexample, one light curve may have a strong dependence on the position of the stars on thechip, but little to no eﬀect from the changing airmass. In such a case, allowing a dependence 14 –on airmass only increases the uncertainty on the other parameters without improving theﬁt. Thus, a quantitative measure is needed to describe both the goodness of ﬁt and theusefulness of the free parameters. For that purpose, in addition to the basic χ value, wealso tested an alternative, the Bayesian Information Criterion (BIC), that takes into accountthe number of parameters used. The BIC is deﬁned as: BIC = χ + k ln n (9)where n is the number of observed points and k is the number of nonzero parameters. Thepurpose of the BIC is to introduce a preference for simpler models over more complex modelsthat may be “overﬁtting” the data with more free parameters than needed. A complex model(i.e. one with a higher number of nonzero parameters k ) will still be preferred in some cases,but only if it signiﬁcantly improves the ﬁt to overcome the greater k ln n term. For example,removing one parameter from the model (i.e. setting it to zero) reduces the BIC value by ln ( n ), so if doing so does not raise the χ value by more than that amount, the solution withone fewer parameter is preferred.The MCMC will not naturally jump a parameter to exactly zero, so in order to im-plement the BIC, we set a threshold range for each parameter such that jumps to valuesnear zero would be set to zero to minimize the BIC. We found that using a threshold largerthan a few times a parameter’s adjusted jump size would raise rather than lower the BIC.Thus, we set the threshold for each parameter equal to the jump size (previously adjustedas described in Section 3.2).

4. Synthetic Datasets

As discussed in Section 2.1, the recovery of a secondary eclipse signal by the modelingand analysis process depends on the depth of the eclipse and the noise levels, as well asother aspects of the data set such as phase coverage and sampling rate. Before beginning onreal data, which has unknown noise patterns, we wanted to test our method on completelysynthetic datasets with known and controlled noise sources. To do this, and investigate howeach characteristic might aﬀect the process’s results, we created a wide variety of syntheticeclipse light curves, and then ran them through the MCMC-A process to see what depthsand phases would be measured. Each test dataset consisted of a secondary eclipse curve witha controlled eclipse depth D e , and mid-eclipse phase φ me , combined with a noise proﬁle. 15 – We began by creating a model “Exoplanet X” similar to many detected VHJs. Theplanet and star radii, orbital period, semimajor axis, and inclination, which constrain theduration and shape of the eclipses, are listed in Table 1. The period, 25 hours, was selectedfor an easy conversion between time and phase – every hour of observation corresponds toa span of 0.04 in phase. This conﬁguration results in a ﬂat-bottom portion of the eclipse( T ) spanning 94 minutes, with ingress and egress lasting 18 minutes each, for a full eclipseduration ( T ) of 130 minutes.For most of the trials, we used an observation window beginning two hours before andending two hours after the mid-eclipse time (240 minutes total, extending from phase 0.42 to0.58). This provided 55 minutes of baseline on each side of the eclipse, giving a total baselineof 110 minutes, which is slightly longer than the time spent in the ﬂat-bottom portion of theeclipse shape. We also tested runs with shorter and longer baselines (15, 94, and 188 minuteson each side; total durations of 160, 318, and 506 minutes), and a 240-minute series withthe mid-eclipse phase oﬀset to 0.475, to provide a baseline that was short (17.5 minutes) onone side and long on the other.We then set up an array of observation points across the selected window with a givensampling rate, and a small random error (normally distributed; σ =0.1 s) added to the timingof each point to simulate imperfect data spacing. We tested sampling rates of 25, 75, and250 points per hour (100, 300, and 1000 data points, respectively, in the four-hour observingwindow). Table 4 lists all the values we adjusted and tested in this section.Next, using these phase points and the planet model, we built the (noiseless) secondaryeclipse shape with the same function (Equation 6) used for that part of the light curve model.Given all the ﬁxed star and planet values, and a set baseline ﬂux F b = 1, the only adjustableparameters that would adjust the noiseless eclipse shape were D e and φ me .Finally, we generated a noise proﬁle to add to the eclipse curve, with either white noiseonly (Section 4.3) or a combination of white and red noise (Section 4.4). An example noiselesseclipse shape is shown in panel (a) of Figure 3, along with diﬀerent noise models, constructedlight curves, and their best-ﬁt model solutions. As described in Section 3.2, the MCMC-A process ran ﬁfteen Markov chains for eachlight curve evaluated, starting from ﬁve distinct sets of model parameter values. Only the 16 –ﬁrst four parameters ( D e , φ me , F b , and m φ ) from Table 2 were used in ﬁtting the syntheticeclipses, as values for the other measured atmospheric and detector eﬀects did not exist. Forreal datasets (Section 5), we would include all of the parameters.As an example, Table 5 shows the results for a complete set of ﬁfteen trials on one lightcurve: 75 data points per hour in a four-hour observing window, with a synthetic ExoplanetX eclipse of depth D e =4000 parts per million (ppm), centered at phase φ me =0.500, and rednoise (Pattern 3; see Section 4.4 and Figure 1) equal in magnitude to the white noise (1000ppm), and run to optimize the χ information criterion. All of these aspects are discussedin Section 4.4, and this is also the light curve shown in panel (d) of Figure 3.The trials are divided into ﬁve subsets of three trials, each one set starting from diﬀerentinitial guesses for D e and φ me . The trials are identiﬁed in the ﬁrst column by their subset’sstarting point and then the trial number within that subset, e.g. 1-2 for the second trial atthe ﬁrst starting position. The second and third columns provide the initial guesses for eachsubset. Columns four through six show the best-ﬁt results for the eclipse depth, mid-eclipsephase, and baseline ﬂux (phase slope m φ is not shown). The ﬁnal two columns show twomeasures of the model’s ﬁtness – the χ value and the root mean square of the residuals.The results among each set of trials were generally in good agreement and independent ofthe point from which the chain was started, though occasionally the series from one startingpoint would ﬁnd a less optimal solution, which was reﬂected by a higher dispersion of theresiduals between the observed and model points. For example, in the set of trials shown inTable 5, the recovered values from four of the ﬁve starting points were in agreement witheach other within their errors, and had very similar χ and residual RMS values, suggestingthat the space they populated represented the global best ﬁt. Meanwhile, the dispersionsof the residuals from starting point 4 were much worse, indicating that the MCMC foundonly a local minimum. Of the ﬁfteen trials, we adopted the solution with the lowest residualdispersion to provide the model light curve and best-ﬁt parameters. The last row of thetable lists the key results from this example’s best ﬁt (trial 5-1): eclipse depth (adjusted forthe baseline), and mid-eclipse phase. This solution diﬀers from the input depth by 360 ppm(7.3 σ , but only 9% of the input depth), and the input mid-eclipse phase by 0.00084 (3.2 σ ),equivalent to 76 s for the orbit of Exoplanet X. Both of these values suggest that the errorsmay be underestimated. 17 – In the simplest scenario, we generated normally-distributed white noise to add to thenoiseless eclipse curves. Because the detectability would be dependent on the depth-to-noiseratio, we chose to keep the white noise amplitude σ w (i.e. the RMS of the normal distribution)ﬁxed at 0.1%, or 1000 ppm, and test diﬀerent eclipse depths, beginning at 5000 ppm (ﬁvetimes the white noise level) and continuing down to 50 ppm, as well as diﬀerent values forthe other attributes of the light curves (see Table 2). Panel (b) in Figure 3 shows a whitenoise pattern and the combined light curve (eclipse plus noise) with its best-ﬁt model.Figure 4 shows the diﬀerence between the eclipse depths that were input into the testsand those that were measured by the analysis, as a percentage of the input depths. We testedthe range of eclipse depths for four diﬀerent observing window lengths, and using both the χ and Bayesian information criteria. Each of these sets is shown in a separate plot, withthe three diﬀerent sampling rates in diﬀerent colors. At the highest input depths, both theeclipse depth and mid-eclipse phase that were recovered were in close agreement with thevalues from the input models, as expected; as the model depths were decreased, these valuesbegan to disagree by greater amounts. By looking at each series starting with the deepestinput and moving to the shallower ones, we were able to examine under what conditions thesignals continued to be recovered successfully.To quantify the results, we calculated the percent diﬀerence between the recovered andinput values at each eclipse depth, beginning with the deepest and looked for a point belowwhich we could no longer recover the eclipse signal reliably (for our purposes, we referred toa recovered eclipse depth within 10% of the input depth as “successful”). We then deﬁneda “critical depth” D e,crit as the input depth of the last successful recovery before the second time the diﬀerence was greater than 10% (to avoid being confounded by one outlier point).We found that the mid-eclipse phases were also recovered accurately for tests above thisthreshold – 63% contained the input φ me within 1 σ of the calculated values and errors; 91%within 2 σ – but diverged as the input depths became shallower. Above the critical depth, wecalculated a “dispersion” as the average of the absolute values of the errors after discardingthe highest and lowest values, and considered whether the deviations were systematic orrandom in nature. Figure 4 illustrates the accuracy of the recovered eclipse depths for allthe synthetic, white noise only tests, and the critical depths and average errors for everycombination of observing window, sampling rate, and information criterion are listed in thecolumns in Table 6 labeled “ σ r =0”.Typically, the signals were recovered in agreement with the inputs as long as the depthwas at or above the white noise level (1000 ppm). The only instances where this was not thecase were at the lowest sampling rate (25 points per hour), where there were fewer points in 18 –the light curve; however, even in these instances the results were not oﬀ by more than about15% until we passed 1000 ppm. At depths below that level, the process was much less likelyto recover the input depth, although the observing window and sampling rate did aﬀect this.For the standard 240-minute observing block, D e,crit was at or near the white noise level(800 – 1000 ppm) when using both the χ and BIC criteria. We also found that, while thesampling rate had little eﬀect on the critical depth reached, it did aﬀect the dispersion of theresults for the points above D e,crit . As we increased the sampling rate, the recovered depthsclustered closer to the input values. This showed, as expected, that a faster sampling rateat the same noise level will return a more accurate measurement.In the case of the shorter (160-minute) observing windows, eclipse depths were recov-ered less accurately, and the critical depth was reached sooner. For the longer windows(particularly the longest, 506 minutes) and at the faster sampling rates, accurate depthswere recovered from eclipses with signals as low as 250 to 300 ppm, i.e. a factor of threeto four times shallower than the white noise level. This demonstrated that baselines thatsigniﬁcantly exceed the duration of the eclipse are preferable.In all cases tested, there were no systematic biases toward measuring the eclipses to bedeeper or shallower eclipses than the inputs; rather, the errors appeared random. Next, we created an additional component of the noise model to simulate the eﬀects ofred noise. Carter & Winn (2009) took a multi-step approach to synthesizing red noise: ﬁrst,they established the covariance conditions for wavelet and scaling coeﬃcients for the caseof combined white and red noise. Then they drew a sequence of 1024 independent randomvariables, set such that their wavelet and scaling coeﬃcients would obey these conditions,and ﬁnally, took the inverse Fast Wavelet Transform of the sequence to generate a red noisesignal.We constructed our initial models in a simpler fashion, by adding together four sinewaves with periods equal to 2, 1, 0.5, and 0.25 times the four-hour observation window(i.e. P n = 480, 240, 120, or 60 minutes): R ( φ ) = X n =1 A n sin ( φ/P n + ω n ) (10)This model has eight adjustable parameters (amplitude A n and phase angle ω n for eachcomponent). We chose sets of parameter values to create three distinct red noise models, 19 –shown as Patterns 1, 2, and 3 in Figure 1, and measured the RMS of the points in eachpattern. Then to use a given red-noise pattern, we scaled the amplitude to set the RMS toa desired red noise level σ r . We began with a red noise amplitude of σ r = 200 ppm (20% of σ w , a typical level for the red noise found in diﬀerential light curves), but also tested higherred noise levels of 500 ppm and 1000 ppm.While we anticipated that our models using a few frequencies would produce similareﬀects as typical observed noise, real systematics can include power at all frequencies, andincreasing for longer wavelengths. Thus, in addition to the initial patterns, we generatedBrownian noise (1 /f ) models by integrating series of Gaussian white noise. Two suchpatterns are shown in Figure 1 as Patterns 4 and 5. These were scaled and added into thelight curves the same way as the sinusoidal patterns.For each test, we then combined the red-noise pattern with the randomized white noisepattern (always kept at σ w = 1000 ppm) and eclipse shape, and ran the resulting light curvethrough the MCMC and modeling process as before in the white noise tests. An examplelight curve (noise and eclipse signal) for Pattern 3, σ r =1000 ppm, and a synthetic eclipse of D e = 4000 ppm is shown in panel (c) in Figure 3.Figure 5 shows the recovery results for the light curves with a red noise component(Pattern 3) of amplitude 500 ppm, i.e. 50% of the white noise level, added to the raw eclipseshape and white noise. These are subdivided by observing window length, informationcriterion, and sampling rate. In all of these tests, we saw little diﬀerence between the χ and BIC results, due to the fact that these models had fewer free parameters than the realdata tests with instrumental and atmospheric measurements. In Section 5.2, we will notethat this is no longer the case.We calculated critical depths and error dispersions in the same manner as for the whitenoise tests (Section 4.3); these results for every observing window / sampling rate / informa-tion criterion combination are also given in Table 6 in the columns marked “ σ r =500 ppm”.As in the white-noise case, the accuracy of both the depth and mid-eclipse phase measure-ments diverged as the input depths became smaller, but this began to happen at depthsgreater than the white noise level, or even the combined point-to-point uncertainty from thewhite and red noise together. The dispersions above the critical depths were greater as well,and we saw stronger systematic errors than in the white noise tests.In the series of 240-minute tests for all three red noise levels, the recovered eclipse depthswere within ∼

10% of the input depths down to near the white noise level, yet the best-ﬁtmodels consistently measured eclipse signals slightly deeper than the inputs, as seen in thesecond row of Figure 5. The critical depths for this baseline ranged from 600 ppm (60% of 20 –the white noise level) to as large as 3300 ppm (over three times the white noise amplitude)for σ r =200 ppm and 500 ppm. For inputs above the critical depths, both information criteriamethods resulted in eclipse measurements 3 to 6% deeper than the input depths. For the σ r =1000 ppm series, the critical depths were all 2300 ppm or higher, and the dispersion wasgreater (5 to 9%). With the shorter 160-minute baseline, there was a stronger bias towardsdeeper detections than the inputs, but the critical depths were much higher as well (2000 to4000 ppm).Baselines of 318 minutes showed improvement in D e,crit and dispersion, but when thebaseline was extended to 506 minutes, the accuracy became systematically worse again, witheach series tending to measure the eclipse signals around 5% shallower than the inputs for σ r =200 ppm, and even more so for the higher levels of red noise. Above the critical depths,these deviations were greater than the error bars on the depths measured by the MCMC,particularly for the cases with faster sampling rates.The strongest biases were seen in the tests with the most red noise ( σ r =1000, equal to σ w ). Especially in the 160- and 240-minute baselines, the recovered eclipses were oﬀset fromthe input values by more than 10% even for the highest input depths (see again Table 6).These trends illustrate the red noise ﬂoor explained by Equation 1 – increasing the numberof points reduces the error from the white noise, but not from the red noise. As expected,the critical depths and dispersions increased with higher red noise amplitudes.Examining the shape of red noise Pattern 3 in Figure 1 indicated why the diﬀerentbaselines showed diﬀerent systematic biases. For the longest baseline (506 minutes, extendingfrom phase 0.33 to 0.67), the full red noise pattern was lower on both ends and high in themiddle, which counteracted the inserted eclipse shape. When the ends were clipped to ashorter phase coverage, that inﬂuence was removed, and when clipped even further (to 240or 160 minutes), the dip at phase 0.51-0.52 became a signal that was added to the eclipsedepth. With Pattern 1, the longest-baseline set showed a bias recovering deeper eclipsesthan input; with Pattern 2 there was little bias either way. These also suggest that with along observing window it may be prudent to attempt ﬁts with only the necessary portion ofit, since extended baselines can have trends that confound the eclipse measurement.The eﬀects of the Brownian noise patterns (4 and 5) on one series (four-hour observingwindow, sampling rate of 250 points per hour) are shown in Figure 6: for both χ andBIC, Pattern 5 behaves similarly to the simpler Pattern 3, though with a slightly higherpositive bias at most depths. Pattern 4 acts in the opposite direction with similar amplitudes,resulting in slightly lesser measured eclipse depths. They did not show qualitatively diﬀerentresults than the sinusoidal noise patterns. 21 –A key ﬁnding was that the resulting uncertainties from the MCMC process far under-predicted the actual oﬀsets between input and measured eclipse depths. For example, in thePattern 4, σ r =500 ppm, χ series, the measured depths diﬀered from the input depths by2% to 10% above D e,crit , but the estimated errors from the MCMC were between 0.7% and3%, a factor of 2 - 4 smaller.These results showed that the unknown structure of the red noise, even in the baselineportions of the light curve, can introduce in addition to the random error an additionalsystematic bias aﬀecting all the measurements on a series of tests with a ﬁxed noise patternin a certain way. However, in practice on a real dataset, this bias is only a once-observedeﬀect, which amounts to a greater uncertainty on the actual depth of the eclipse event. Thiswas reﬂected by the disagreements between the input and calculated eclipse depths beinggreater than the error bars found by the MCMC process.

5. Real Data Systematics

After a thorough exploration of synthetic light curves, we began to test the process onreal data, which would have measurable white and red noise levels, but unknown systematicsand trends. We took some photometry from a previously observed exoplanet ﬁeld to betreated as the noise and, as in the previous section, added a range of simulated eclipsesignals to see which could be recovered.

The dataset we used for these tests was the photometry from the CoRoT–1b ﬁeld, takenin Ks-band with APO/NICFPS, on UT 15 January, 2009, and published in Rogers et al.(2009) (hereafter R09). This set included 758 images alternating between two oﬀset po-sitions 15 arcseconds apart. However, ﬁnding the diﬀerential photometric precision to besubstantially worse at high airmass, we removed the points at z > σ from their neighbors within 20 minutes, leaving 691data points. In addition to the target, there were four stars of relatively similar magnitude(between 0.95 and 1.8 times the target ﬂux) in the same area of the chip, and many morestars in the ﬁeld. We assigned each star a number from 0 through 13, with 0 being the target,CoRoT-1, and 1 through 4 the bright nearby reference stars used in R09. The remainingimages spanned 293 minutes, with a small gap (8.8 minutes) in the coverage near phase 0.55,resulting in a cadence of approximately 145 points per hour. The observing window and 22 –sampling rate for our real dataset tests were ﬁxed by those characteristics of the datasetitself, unlike in the case of the synthetic tests.To produce light curves for use with our synthetic eclipse tests, we selected bright, stableﬁeld stars one at a time to treat as though they were the target, and performed diﬀerentialphotometry with respect to other nearby stars in the ﬁeld, using the comparison stars whichproduced the lowest ﬂux RMS in the diﬀerential light curves. To avoid issues with the actualeclipse in the data, we excluded the main target star CoRoT-1 from this process. For eachlight curve, we then measured the amount of red and white noise as the values for σ w and σ r that best ﬁt the binned dispersions to Equation 1, ﬁnding that both were greater than in thesynthetic noise tests. For example, the noise levels for star 1’s diﬀerential light curve werefound to be σ w = 4580 ppm and σ r = 1540 ppm. We then also made the same measurementsusing just the out-of-eclipse baseline points, since for a measurement on the target light curvewe would want to use only these points to judge the noise structure. In most cases the levelswere very similar to those in the full light curve. We repeated this for several stars in theﬁeld, numbered 1, 2, 3, 4, 8, and 13, in order to test the systematics of each. For stars 1 and4, the best comparisons and the measured red and white noise for each (in ppm) light curveare given in the third and fourth columns of Table 7 (identiﬁed as “1-a” and “4-a”).The diﬀerential ﬂux for light curve 1-a is shown at the top of Figure 7 in blue and red,with each color representing one of the two oﬀset positions. Below it are the other measuredcharacteristics that we used for de-correlation: the airmass ( z ), full width at half-maximumof the star images ( w ), displacement of the stars on the chip in the x- and y-directions ( x , y ),and the sky brightness ( s ). In some datasets we have seen these attributes behave diﬀerentlyfor each star; In this particular dataset, however, all of these comparison stars were localizedto the same area, and we tested that the values behaved similarly for each star. Therefore,we were able to use single arrays for each variable φ , z , w , x , y , and s , although each variablestill required two diﬀerent slope coeﬃcients – one for the behavior at each oﬀset position.In total, the light curve model required 16 parameters (see Table 2: F b , m φ , m z , m w , m x , m y , and m s for each of the two oﬀset positions, along with the global parameters D e and φ me . When using the χ value as the ﬁtness criterion, the models would make use of all16 parameters, allowing correlations with each variable / oﬀset position combination. In theBIC runs, however, we expected that most of the best-ﬁt models would use only the mostinﬂuential parameters, with the others ﬁxed to zero for a better BIC value. 23 – The resulting light curves from each test star’s diﬀerential photometry were then usedas the noise components of new eclipse light curves to be tested. As before in Section 4,we constructed a noiseless eclipse shape for each given depth and mid-eclipse phase, withthe same star and planet parameters of the hypothetical Exoplanet X (Table 1). We thencombined this signal with the real noise proﬁles to form light curves to be analyzed by ourmodeling and parameter optimization processes, and then ran the tests for a range of eclipsedepths, two diﬀerent mid-eclipse phases (0.5 and 0.475), and both the χ and BIC informationcriteria. While the models with χ optimization used all sixteen model parameters, the BICruns typically chose only nine to twelve, omitting some of the correlation slope parameters(see Table 7).The behavior of the recovered eclipse signals can be seen in Figure 8. In most cases,as with the synthetic tests, the recovered eclipse parameters were close to the input valuesfor large input eclipse depths, but began to deviate more at depths lower than the whitenoise level. Unlike in the synthetic tests, which had a newly-generated random white noisecomponent for each trial, the noise structure was the same in every run for a given targetstar. This generally resulted in calculated eclipse depths consistently higher or lower thanthe input depth, with the systematic bias getting worse as the input model depths decreased.For each test star’s series, both the uncertainties on the depth measurements and thediﬀerences between the measured and input depths were substantially higher than in thesynthetic-noise tests. This led to higher values for the critical depth D e,crit , ranging from2000 to 9500 ppm, and in some cases (e.g. for star 4) the recovered eclipse depths werefar outside the 10% range for all depths tested. This was partially due to the point-to-point uncertainty being greater, as evidenced by the higher measured white and red noiseamplitudes, shown in each window of Figure 8 as vertical gray and red bars. Notably, therewas far more diﬀerence between the χ and BIC results for these series than for the syntheticmodel tests, because there were far more parameters that could be set to zero for diﬀerentpreferred solutions.For the best-behaved test stars (2 and 3), the critical depths occurred at or below thewhite noise amplitude. In these well-behaved series we saw better behavior from the BICthan from the χ models. However, the others (1, 4, 8, and 13) had strong biases from theirinputs at greater depths, and had mixed results comparing χ and BIC. Star 8, for example,behaved well with the BIC to below the level of σ w before getting abruptly worse, while for χ it diﬀered by more than 20% even from input signals greater than 10 ppm.The disagreements between the reported errors and the actual oﬀsets between input 24 –and measured eclipse depths varied substantially as well. For example, for Star 1 and Star4 using the BIC, the oﬀsets were 3 - 4 times greater than the error bars, whereas for Star 1using χ and Star 2 using either criterion, the error bars appeared to be appropriately-sized,containing many of the oﬀsets within 1 σ .Recalling from the tests in Section 4.4 that diﬀerent baselines can aﬀect the depthmeasurement, we also tried clipping the baseline on two of the light curves: one well-behavedseries (star 2), and one poorly-behaved one (star 4). In addition to running a series with allthe available points, we ran tests with only the points between phases 0.444 and 0.556 (572points over 241 minutes). The series for star 2 was made signiﬁcantly worse by removingthe extra baseline, but interestingly, the tests for star 4 resulted in values closer to the inputdepths when using χ (though still mostly outside the 10% error threshold) while the BICresults remained about the same. The results for these two clipped light curves can be seenin the bottom row of Figure 8.Even if we account for the diﬀerent noise sources and consider the eclipse depths relativeto the noise levels, it is apparent that systematic biases exist, and are diﬀerent for each choiceof target star and the amount of baseline. In practice, it appears that each star’s light curvehas its own systematics relative to the others, which can lead to biases in a measured eclipsesignal. As discussed in Section 3.1, some of the parameters may have non-linear eﬀects; e.g. al-though we modeled them as such, there is no reason that the relationship between ﬂuxsystematics and position on the chip, or the size and shape of the star images, should behavelinearly. To explore this, we added the use of a quadratic ﬁt to the x- and y-displacementparameters, since those are values that may most likely have non-linear shapes. This in-creased the number of model parameters by two per oﬀset position, with two coeﬃcients inthe x and y models rather than one.Figure 9 shows the results in the same format as Figures 4, 5, and 8, with the calculateddepth oﬀsets versus the input depths for synthetic eclipses put into the light curve using Star2 from the CoRoT-1 Ks-band ﬁeld. The green and pink points with solid lines represent theoﬀsets with linear correlations only (same as the second plot in Figure 8, but using the eclipsemodel of CoRoT-1b instead of Exoplanet X), while the blue and orange points with dashedlines show the results using quadratic correlation models for the positional displacement.In these tests we found that the quadratic models underestimated the eclipse depths for 25 –all input depths, meaning that they were both more complex and less accurate. However,that may be the case only for this speciﬁc data set, and knowing that using a quadratic ormore complicated ﬁt model in place of a linear correlation might aﬀect the depth estimation,it is recommended to at least perform this test on some of the ﬁeld stars.

6. Identifying and Measuring Real Eclipses

Having tested the MCMC analysis described in the previous sections on a wide rangeof test eclipse light curves with both real and synthetic noise patterns, we then applied theprocess to re-evaluate the CoRoT-1b Ks-band eclipse depth measurement in R09. As we sawin Section 5, each light curve had its own systematics that caused a diﬀerent bias for eachstar when treated as the target. We measured the eclipse signals in the target light curveas well as other ﬁeld stars in multiple ways, in order to establish a corrected, more robustvalue for the true eclipse depth.

The main analysis of R09 optimized the diﬀerential photometry between the target starCoRoT-1 and four nearby reference stars (those numbered 1, 2, 3, and 4 in this analysis), andthen tested the ﬁtting and removal of trends manually, one at a time. That analysis foundthat correcting for a correlation between the out-of-eclipse diﬀerential ﬂux and the FWHMof the star images was suﬃcient to reach a detection of an eclipse of depth D e = 3360 ± φ me = 0 . +0 . − . .For consistency in re-measuring the eclipse, we took the weighted diﬀerential light curvewith respect to the same four comparison stars (without adding any eclipse signal). Theresulting light curve had a ﬂux RMS of 5390 ppm, and measured white and red noise levels(ﬁtting to Equation 1) of σ w =7320 ppm, σ r =1230 ppm. Looking only at the out-of-eclipsebaseline points, however, the noise levels drop, with the red noise component nearly vanish-ing: σ w =6610 ppm, σ r =170 ppm.We then ran the MCMC-A analysis process to ﬁnd the best-ﬁt model incorporating asecondary eclipse signal – now using the previously-measured characteristics for the planetCoRoT-1b by its host star (see Table 1) rather than “Exoplanet X” – and allowing corre-lations with any of the other variables in Table 2. The best-ﬁt eclipse model using the χ criterion included an eclipse 3240 +330 − ppm deep, centered at phase 0 . +0 . − . ; with theBIC, the best-ﬁt model had D e = 3150 +380 − ppm, φ me = 0 . +0 . − . . These results, all 26 –agreeing with the previously published values within the reported errors of R09, are listedin Table 7 in the column marked “0-a”.The small uncertainties on φ me appear to suggest an eccentricity that is nonzero at the4 to 5- σ level; however, the phase values for the real datasets were calculated based on theephemeris from Bean (2009), derived from multiple transit timing observations: T (RJD)=54159 . ± . p = 1 . ± . φ me to 0.50410 +0 . − . ( χ ) and0.50426 +0 . − . (BIC). Both of these now show a nonzero eccentricity at only a 2.1- σ level,and place the R09 value of φ me =0.5022 at 1 σ away from the new measurements.Because the two information criteria methods gave nearly identical results, we deter-mined it best to adopt average values of the two solutions, with the error bars adjusted toinclude the spread of both. Thus, we state as our re-measurement of the R09 eclipse thevalues D e =3190 +370 − and φ me =0.50418 +0 . − . .We also tried constructing other light curves using diﬀerent combinations of stars 1through 4 as the comparison ﬂux for the target CoRoT-1. The measurements resulting fromone comparison star were inconsistent, sometimes detecting a negative depth or stronglyoﬀset phases, and high levels of red noise in the residuals. However, there was better consis-tency in the tests that used two or more reference stars, with a most of the best-ﬁt eclipsemodels having a depth between 2800 and 3500 ppm and a central phase of 0.502 to 0.504.Although these measurements produced eclipse attributes consistent with the results ofR09, we saw in the analysis for Section 5.2 that it was common to ﬁnd eclipse solutions ofsigniﬁcantly non-zero depth in the light curves of other ﬁeld stars with no synthetic eclipseadded. Two examples of this are seen in the best-ﬁt results for light curves 1-a and 4-a,run through the same process as the target (the third and fourth columns of Table 7), andexamining the results from test target stars 1, 2, 3, 4, 8, and 13, we noted two importantﬁndings. First, the detected depths appeared to be spread in a non-Gaussian distribution;rather than ﬁnding many of the depths close to zero with few outliers, they were clusteredclose to the red noise amplitude in both the positive and negative direction. Second, mostof the spurious detections were not centered at phases close to 0.5 – only one of the positiveeclipse values (4-a with BIC) had a φ me within even 5 σ of 0.5. These spurious detectionsmake it diﬃcult to determine how much a detected eclipse signal in the target light curve isthe result of systematic eﬀects versus an actual measurement of the planet ﬂux. In particular,since the reliability of the measurements is poorest for eclipse signals below the noise levels,eclipses of small depth are very diﬃcult to measure. 27 – In all of the real noise tests, the systematics were produced by a combination of thetarget star’s light curve and the composite ﬂux of the diﬀerent reference stars, so a bias couldresult from either the target or the reference star ﬂux. To create a more controlled datasetto compare the detection from CoRoT-1 with the spurious detections from the other teststars, we prepared an external “Group B” of stars (numbers 5, 6, 7, 8, 9, 11, 12, and 13 inthe ﬁeld), and combined their ﬂuxes into a single composite light curve that we could use totest each star from “Group A” (stars 1, 2, 3, 4, and CoRoT-1) in a consistent manner. Withthese we created ﬁve new diﬀerential light curves, “0-b” through “4-b”. Since we were notusing the optimal set of reference stars, their noise levels were approximately 1.5 to 2 timeshigher than for the “A” set of light curves.We then ran these light curves through the MCMC-A process as before; the right-handside of Table 7 lists three of them – 0-b, 1-b, and 4-b – with their calculated white andred noise amplitudes (in ppm) and the results fo their best-ﬁt solutions. We saw similarbehavior to what appeared in the Group A light curve results: eclipse depths at large values,and mid-eclipse phases often far from 0.5. The depths of these spurious eclipse detectionswere more consistent in these tests than in the Group A tests, clustered mainly at 4100 to4800 ppm, and with only one negative depth detection among them. The measurement ofthe eclipse on the target light curve 0-b did not plainly stand out from the rest; its depthusing χ was only slightly greater than that for 4-b, and while its φ me values were the closestto 0.5, they were not much closer than the values found for 1-b.The eclipse depths found at large phase oﬀsets, however, tell us little about the biasthat the comparison set may contribute when combined with a real eclipse. When a planet’sorbital eccentricity is known from other measurements, as is the case with CoRoT-1b, thatinformation can be used to better rule out false eclipse detections. We expected the trueeclipse to be centered close to φ =0.5 (recall R09 measured 0 . +0 . − . , and the detectionsin other bands either had similar results or assumed a circular orbit), and thus tried runningadditional MCMCs with that parameter ﬁxed. Table 8 lists these results for ﬁxing φ me toboth 0.5 and 0.5022, and in each of these cases, we saw new trends. Since the routine couldno longer ﬁt deep eclipses at phases far from 0.5, the eclipses measured in the light curvesfrom stars 1 through 4 were shallower than before. There also were no negative depthsmeasured, and the eclipse signal for light curve 3-b was nearly zero.The greatest depth measurements were now by a signiﬁcant margin those for the targetlight curve 0-b, ranging from 4310 +700 − to 5780 +380 − ppm. These values are signiﬁcantly higherthan in R09; however, there was a greater amount of noise in the comparison ﬂux (compositeof the eight “Group B” stars), than in the optimal reference composite from stars 1 through 28 –4 (“Group A”), and the uniformly positive spurious detections in 1-b through 4-b suggestthat some of this depth is a bias in the comparison light curve. Conveniently, we can lookat these results for the other light curves with the same comparisons to determine what biasthey may be introducing.For each ﬁxed φ me and information criterion, we estimated the Group B bias contributionwith a weighted average of the eclipse depths found for target stars 1 through 4. Theweighting factors were the same that were used for the composite comparison ﬂux fromGroup A: 40% for 2-b, 22% for 3-b, and 19% each for 1-b and 4-b. Then to determine acorrected depth measurement for the real eclipse, we simply subtracted the bias measurementfrom the best-ﬁt 0-b light curve result. The ﬁnal, corrected D e values for each ﬁxed-phase testare given (in ppm) in the last row of Table 8. These values end up in fairly close agreementwith one another as well as with the R09 value. In particular, the best-ﬁt depths using χ arenearly identical (3210 +560 − ppm at φ me =0.5, 3240 +530 − ppm at φ me =0.5022), suggesting thatthe measured ﬂux decrement is not strongly dependent on the exact placement of the mid-eclipse phase in these trials. The BIC results are less consistent with one another (2650 +760 − ppm at φ me =0.5, 3750 ±

620 ppm at φ me =0.5022), but remain in agreement within theirerrors of the χ results, the values in Section 6.1, and the original measurement in R09.These additional test results served to reinforce that the measurements of CoRoT-1b arethe most consistent in our photometric dataset; the eclipse was found at close to the samedepth and phase placement by a number of diﬀerent means. The red noise was eliminatedor nearly eliminated in many of the residuals, indicating that the process did a good job ofremoving trends.

7. Summary & Discussion

Much of our understanding of exoplanet atmospheres is based on a small number ofemission detections, making it crucial to know how accurate those observations are. Espe-cially when using ground-based instruments, correlated “red” noise complicates obtaininga reliable measurement. Measurements subject to red noise require an optimization of theobservations to limit the systematic eﬀects, and then the identiﬁcation, modeling, and re-moval of the trends from the resulting light curves. Simultaneously considering the eclipseshape and correlation trends together in a single model often requires a large number ofparameters to be optimized, making use of statistical tools like the Bayesian InformationCriterion (BIC) and a Markov Chain Monte Carlo (MCMC) algorithm.To test the reliability of commonly used analysis methods, we created a sample of 29 –hundreds of test eclipse light curves with both synthetic and real noise and ran them throughthe processes we developed for modeling (the combination of the eclipse signal and trends)and parameter optimization (MCMC). To simulate red noise, we used both simple sinusoidalpatterns and stochastic Brownian noise, observing similar results with each. Another concernregarding the synthetic models was that the noise amplitude was time-invariant, whereasin actual data sets the changes in airmass or sky brightness, or instrumental changes likedefocusing will change the noise level over the course of the observation. This assumptionof stationarity is used in all of the current modeling approaches, but nevertheless may leadto an underestimation of the eﬀects of real noise. A selection of these data sets are madeavailable in Appendix A for testing other analysis methods. From our tests and results, wedetermined some lessons for both the observation and analysis process.In the case where only white noise is present in the light curves, our tests ﬁnd for eachsampling rate, baseline, and information criterion a critical input depth dividing the results.Eclipse signals with input depths greater than this value are generally recovered within 10%without any systematic biases, while below it the eclipse depth measurements diﬀer from theinput depths by more than their measured uncertainties. This critical depth can be foundby beginning with synthetic eclipses of a great depth relative to the white noise level, andtesting lesser depths until they are no longer measured within the desired accuracy. We ﬁndthat the sampling rate and baseline are both important considerations: a longer baselineand a faster cadence improve the limits to which one can make a successful measurementand reduce the dispersion of the errors.When red noise is present in the light curve, the dispersion of results is greater and thecritical depth is reached at a higher input value; more importantly, we ﬁnd that several ofthe series exhibit systematic biases. Whereas in the white noise tests, the discrepancy be-tween the input and measured eclipse depths and mid-eclipse phases was randomly positiveor negative, structure in the red noise often results in measured eclipses that are consistentlydeeper or shallower than the inputs, or centered at a particular phase oﬀset. The uncer-tainties reported by the MCMC also may underestimate the depth errors by a factor of twoto four. We ﬁnd that baselines that signiﬁcantly exceed the duration of the eclipse are stillusually preferable; however, in certain cases the structure of the noise can negatively aﬀectthe best-ﬁt solutions. Including all available baseline may not be optimal for real datasetseither, particularly if some of it is noisier (e.g. taken at high airmass or diﬀerent weatherconditions). In cases when this is possible, additional tests should be done with a clippedbaseline.Examining datasets with real noise, we ﬁnd unexplained or under-modeled systematicsto be more prevalent still, which can lead to large biases or even completely spurious de- 30 –tections. Even with a detailed model to account for a wide variety of trends in the lightcurves, we ﬁnd that discrepant results between input and output signals appear when usingdiﬀerent ﬁeld stars for photometric comparison. For this reason, conﬁrming a measurementusing diﬀerent sets of comparison stars is advised. Much larger biases can accumulate onshallow eclipses than on deeper ones, and each star shows its own systematics and biasesthat sometimes cannot be removed by de-correlation techniques. The actual depth biasescan be anywhere from about the same magnitude as the measured uncertainties to three tofour times as large, reinforcing the common attitude of preferring substantially more thana 3- σ detection. A signiﬁcant ﬁnding is that spurious eclipse detections measured on ﬁeldstars do not follow a Gaussian distribution, but rather cluster around the level of the lightcurve’s red noise, with depths in both the positive and negative direction.This uncertainty of what is a real eclipse signal and what is a false detection fromred noise poses a serious problem to the measurement of eclipses, and further tests shouldbe used to determine the nature of the biases and verify the detections. One way to doso is to design a control set of comparison stars and determine a bias level by combiningthe spurious detections of other stars in place of the target. If the expected mid-eclipsetime (i.e. eccentricity) is well-known, this value can be ﬁxed to better determine the noisestructure’s contribution to the measurement of the actual eclipse depth. The establishedbias level can then be corrected for in the eclipse measurement of the actual target to verifyits depth.These additional tests and veriﬁcation processes are particularly advisable in the caseof low-conﬁdence ( < σ ) detections, i.e. most of the ground-based measurements that havebeen published. In order to claim a conﬁdent detection, we recommend the following criteriabe met: • A suﬃcient level of signiﬁcance on the measured signal (commonly adopted as ∼ σ ),and agreement using diﬀerent comparison stars and/or methods • Mid-eclipse phase consistent with other studies of the eccentricity • A relatively small amount of red noise in the residuals • Spurious eclipse signals with similar conﬁdence levels and phase placements as thetarget not seen in the light curves of other nearby starsThis set of criteria is stricter than adopted in the literature, but ensures more reliabledetections and depth measurements. We provide as an example the tests performed on theCoRoT-1 data from R09. Whereas the tests on other stars in the ﬁeld produce a distribution 31 –of eclipse detections with positive and negative depths and a range of mid-eclipse phasessuggesting strongly nonzero eccentricities, the measurements of the eclipse in the CoRoT-1light curve agree closely in depth and with a mid-eclipse phase close to 0.5. Testing diﬀerentlight curves for the target using diﬀerent reference stars, the results are not strongly aﬀectedas long as at least two of the bright, nearby comparison stars are used. When using a separateset of less-ideal reference stars, a deeper eclipse is measured, but this bias is estimated andremoved eﬀectively by ﬁnding the depth measured in other stars with the same referenceset, ﬁxed to the same mid-eclipse phase. Through this variety of tests, we establish the R09detection to be robust and consistent.Given the diﬃculties and complications of measuring secondary eclipse signals in typicalground-based data, we caution that both published and working detections be checked toensure that they represent an actual eclipse measurement and are not the products of unre-solved biases. If the published eclipse depths (and, subsequently, planetary thermal emissionlevels) are adjusted, the best-ﬁtting models and conclusions about atmospheric composi-tion and structure can be dramatically altered. We recommend that these analysis routinesand tests be applied to current and future eclipse light curves, as a means to evaluate andminimize the presence of systematics in the reported eclipse signals.This research has been supported by the National Science Foundation through grantAST-0908278. Critical support was provided by the Space Telescope Science Institute Di-rector’s Discretionary Research Fund D0101.90131. We thank the scientiﬁc editor and theanonymous referee for insightful comments and suggestions, and Veselin Kostov for helpfuldiscussion.

REFERENCES

Alonso, R., Guillot, T., Mazeh, T., Aigrain, S., Alapini, A., Barge, P., Hatzes, A., & Pont,F. 2009, A&A, 501, L23Anderson, D. R., Gillon, M., Maxted, P. F. L., Barman, T. S., Collier Cameron, A., Hellier,C., Queloz, D., Smalley, B., & Triaud, A. H. M. J. 2010, A&A, 513, L3+Barge, P., Baglin, A., Auvergne, M., Rauer, H., L´eger, A., Schneider, J., Pont, F., Aigrain,S., Almenara, J.-M., Alonso, R., Barbieri, M., Bord´e, P., Bouchy, F., Deeg, H. J., LaReza, D., Deleuil, M., Dvorak, R., Erikson, A., Fridlund, M., Gillon, M., Gondoin, P.,Guillot, T., Hatzes, A., Hebrard, G., Jorda, L., Kabath, P., Lammer, H., Llebaria, A., 32 –Loeillet, B., Magain, P., Mazeh, T., Moutou, C., Ollivier, M., P¨atzold, M., Queloz,D., Rouan, D., Shporer, A., & Wuchterl, G. 2008, A&A, 482, L17Bean, J. L. 2009, A&A, 506, 369Borucki, W. J., Koch, D., Jenkins, J., Sasselov, D., Gilliland, R., Batalha, N., Latham,D. W., Caldwell, D., Basri, G., Brown, T., Christensen-Dalsgaard, J., Cochran, W. D.,DeVore, E., Dunham, E., Dupree, A. K., Gautier, T., Geary, J., Gould, A., Howell,S., Kjeldsen, H., Lissauer, J., Marcy, G., Meibom, S., Morrison, D., & Tarter, J. 2009,Science, 325, 709Burrows, A., Budaj, J., & Hubeny, I. 2008, ApJ, 678, 1436C´aceres, C., Ivanov, V. D., Minniti, D., Burrows, A., Selman, F., Melo, C., Naef, D., Mason,E., & Pietrzynski, G. 2011, A&A, 530, A5Caceres, C., Ivanov, V. D., Minniti, D., Burrows, A., Selman, F., Melo, C., Naef, D., Mason,E., & Pietrzynski, G. 2011, ArXiv e-printsCarter, J. A., Rappaport, S., & Fabrycky, D. 2011, ApJ, 728, 139Carter, J. A. & Winn, J. N. 2009, ApJ, 704, 51Charbonneau, D., Allen, L. E., Megeath, S. T., Torres, G., Alonso, R., Brown, T. M.,Gilliland, R. L., Latham, D. W., Mandushev, G., O’Donovan, F. T., & Sozzetti, A.2005, ApJ, 626, 523Charbonneau, D., Knutson, H. A., Barman, T., Allen, L. E., Mayor, M., Megeath, S. T.,Queloz, D., & Udry, S. 2008, ApJ, 686, 1341Coughlin, J. L. & L´opez-Morales, M. 2012, AJ, 143, 39Croll, B., Albert, L., Lafreniere, D., Jayawardhana, R., & Fortney, J. J. 2010a, ApJ, 717,1084Croll, B., Jayawardhana, R., Fortney, J. J., Lafreni`ere, D., & Albert, L. 2010b, ApJ, 718,920Croll, B., Lafreniere, D., Albert, L., Jayawardhana, R., Fortney, J. J., & Murray, N. 2011,AJ, 141, 30Crossﬁeld, I. J. M., Barman, T., Hansen, B. M. S., Tanaka, I., & Kodama, T. 2012, ApJ,760, 140 33 –de Mooij, E. J. W., de Kok, R. J., Nefs, S. V., & Snellen, I. A. G. 2011, A&A, 528, A49+de Mooij, E. J. W. & Snellen, I. A. G. 2009, A&A, 493, L35Deming, D., Harrington, J., Seager, S., & Richardson, L. J. 2006, ApJ, 644, 560Deming, D., Seager, S., Richardson, L. J., & Harrington, J. 2005, Nature, 434, 740Fortney, J. J., Lodders, K., Marley, M. S., & Freedman, R. S. 2008, ApJ, 678, 1419Gibson, N. P., Aigrain, S., Pollacco, D. L., Barros, S. C. C., Hebb, L., Hrudkov´a, M.,Simpson, E. K., Skillen, I., & West, R. 2010, MNRAS, 404, L114Gibson, N. P., Aigrain, S., Roberts, S., Evans, T. M., Osborne, M., & Pont, F. 2012, MNRAS,419, 2683Gilliland, R. L., Chaplin, W. J., Dunham, E. W., Argabright, V. S., Borucki, W. J., Basri,G., Bryson, S. T., Buzasi, D. L., Caldwell, D. A., Elsworth, Y. P., Jenkins, J. M.,Koch, D. G., Kolodziejczak, J., Miglio, A., van Cleve, J., Walkowicz, L. M., & Welsh,W. F. 2011, ApJS, 197, 6Gillon, M., Demory, B., Triaud, A. H. M. J., Barman, T., Hebb, L., Montalb´an, J., Maxted,P. F. L., Queloz, D., Deleuil, M., & Magain, P. 2009, A&A, 506, 359Grillmair, C. J., Burrows, A., Charbonneau, D., Armus, L., Stauﬀer, J., Meadows, V., vanCleve, J., von Braun, K., & Levine, D. 2008, Nature, 456, 767Harrington, J., Luszcz, S., Seager, S., Deming, D., & Richardson, L. J. 2007, Nature, 447,691Knutson, H. A., Charbonneau, D., Cowan, N. B., Fortney, J. J., Showman, A. P., Agol, E.,Henry, G. W., Everett, M. E., & Allen, L. E. 2009, ApJ, 690, 822L´opez-Morales, M., Coughlin, J. L., Sing, D. K., Burrows, A., Apai, D., Rogers, J. C.,Spiegel, D. S., & Adams, E. R. 2010, ApJ, 716, L36L´opez-Morales, M. & Seager, S. 2007, ApJ, 667, L191Machalek, P., McCullough, P. R., Burke, C. J., Valenti, J. A., Burrows, A., & Hora, J. L.2008, ApJ, 684, 1427Mandel, K. & Agol, E. 2002, ApJ, 580, L171Pont, F., Zucker, S., & Queloz, D. 2006, MNRAS, 373, 231 34 –Rogers, J. C., Apai, D., L´opez-Morales, M., Sing, D. K., & Burrows, A. 2009, ApJ, 707,1707Sing, D. K. & L´opez-Morales, M. 2009, A&A, 493, L31Smith, A. M. S., Anderson, D. R., Skillen, I., Collier Cameron, A., & Smalley, B. 2011,MNRAS, 416, 2096Snellen, I. A. G., de Mooij, E. J. W., & Albrecht, S. 2009, Nature, 459, 543Swain, M. R., Vasisht, G., Tinetti, G., Bouwman, J., Chen, P., Yung, Y., Deming, D., &Deroo, P. 2009, ApJ, 690, L114Tamuz, O., Mazeh, T., & Zucker, S. 2005, MNRAS, 356, 1466Waldmann, I. P. 2011, ArXiv e-printsZhao, M., Milburn, J., Barman, T., Hinkley, S., Swain, M. R., Wright, J., & Monnier, J. D.2012a, ApJ, 748, L8Zhao, M., Monnier, J. D., Swain, M. R., Barman, T., & Hinkley, S. 2012b, ApJ, 744, 122

This preprint was prepared with the AAS L A TEX macros v5.2.

35 –Table 1: Characteristics of Exoplanet X and CoRoT-1b.

References. (1) See Section 4.1(2) Barge et al. 2008 (3) Bean 2009Parameter Exoplanet X CoRoT–1bStar Radius R ∗ [ R Sun ] 1.0 0.95Planet Radius R P [ R Jup ] 1.5 1.49Period P [days] 1.041667 1.508956Semimajor Axis a [AU] 0.01978 0.02540Inclination i [degrees] 88.0 85.1Eccentricity e T [minutes] 130.15 150.32 T [minutes] 94.13 106.56Reference 1 2, 3Table 2: Parameters Used for Light Curve ModelsModel Parameter SymbolEclipse Depth D e Mid-Eclipse Phase φ me Baseline Flux Level F b Phase Slope m φ Airmass Slope m z FWHM Slope m w X-displacement Slope m x Y-displacement Slope m y Sky Brightness Slope m s Table 3: Results from Two Independent MCMC Methods σ r (ppm) Obs. Win Samp. Rate Eclipse Depth D e (ppm) Mid-eclipse phase φ me (pattern) (minutes) (points/hour) Input Fit (A) Fit (B) Input Fit (A) Fit (B)0 240 25 1300 1318 +141 − ±

210 0.500 0.50093 +0 . − . ± +79 − ±

99 0.500 0.50123 +0 . − . ± +31 − ±

65 0.500 0.49914 +0 . − . ± +130 − ±

119 0.500 0.50176 +0 . − . ± +88 − ±

120 0.500 0.49940 +0 . − . ± +72 − ±

119 0.500 0.50207 +0 . − . ± +100 − ±

120 0.500 0.49929 +0 . − . ± +51 − ±

66 0.500 0.49994 +0 . − . ± σ w (ppm) 1000Red Noise Pattern (See Fig. 1) 0, 1, 2, 3Red Noise Level σ r (ppm) 200, 500, 1000Observing Window (minutes) 160, 240, 318, 506Input Mid-Eclipse Phase φ me .475, .500Sampling Rate (points / hour) 25, 75, 250Input Eclipse Depth D e (ppm) 50, 100, 150, 200, 250, 300, 400, 500,600, 800, 1000, 1300, 1600, 2000, 2300,2600, 3000, 3300, 3600, 4000, 4500, 5000Information Criterion χ , BIC Table 5: Example Results of MCMC TrialsMCMC Starting Point Recovered Values ResidualsTrial D e (ppm) φ me D e (ppm) φ me F b χ RMS (ppm)1-1 -744 0.50311 4390 +24 − . +0 . − . . +0 . − . +31 − . +0 . − . . +0 . − . +47 − . +0 . − . . +0 . − . +65 − . +0 . − . . +0 . − . +58 − . +0 . − . . +0 . − . +24 − . +0 . − . . +0 . − . +53 − . +0 . − . . +0 . − . +35 − . +0 . − . . +0 . − . +30 − . +0 . − . . +0 . − . − +98 − . +0 . − . . +0 . − . − +154 − . +0 . − . . +0 . − . − +113 − . +0 . − . . +0 . − . +51 − . +0 . − . . +0 . − . +22 − . +0 . − . . +0 . − . +25 − . +0 . − . . +0 . − . +51 − . +0 . − . Table 6: Synthetic Series Critical Depths and DispersionsObs. Sampling σ r =0 σ r =200 ppm σ r =500 ppm σ r =1000 ppmWindow Rate Info. D e,crit Disp. D e,crit Disp. D e,crit Disp. D e,crit Disp.(minutes) (pts./hr.) Crit. (ppm) (%) (ppm) (%) (ppm) (%) (ppm) (%)160 25 χ > χ > χ > > > χ

800 3.84 2000 5.59 2300 5.56 3000 5.94240 75 χ χ

800 2.41 600 4.01 2000 4.14 2300 8.59240 25 BIC 800 5.98 3300 4.78 1300 5.90 > χ χ

500 3.41 1300 1.96 500 3.38 1300 3.65318 250 χ

500 2.91 400 1.59 250 2.86 800 2.86318 25 BIC 1600 3.20 600 4.37 2600 5.78 1600 4.75318 75 BIC 800 4.62 500 3.75 800 4.37 600 3.44318 250 BIC 250 3.59 800 2.99 800 2.65 1000 2.92506 25 χ

800 4.82 800 4.72 2300 6.29 > χ

300 3.96 1300 4.83 3600 8.98 > χ

300 2.80 1600 4.75 3000 7.66 > > > > Table 7: Real Noise Data Sets and ResultsLight Curve 0-a 1-a 4-a 0-b 1-b 4-bTarget CoRoT-1 Star 1 Star 4 CoRoT-1 Star 1 Star 4Comp. Stars 1,2,3,4 2,3,4,13 1,2,3,8,13 Group B Group B Group B σ w (all) 7318 4582 4550 8265 8982 9283 σ r (all) 1225 1535 1691 2257 1451 5437 σ w (out) 6608 4462 4378 7446 8451 9329 σ r (out) 171 1609 1903 1527 1878 3688 Best-ﬁt Model Results ( χ ) D e (ppm) 3237 +327 − − +513 − +495 − +375 − +436 − +542 − φ me . +0 . − . . +0 . − . . +0 . − . . +0 . − . . +0 . − . . +0 . − . Nonzero params 15 16 16 16 16 16 σ w (residuals) 6544 5660 5505 7177 7951 7773 σ r (residuals) 125 0 606 0 0 646 Best-ﬁt Model Results (BIC) D e (ppm) 3150 +383 − − +829 − +354 − +434 − +311 − +457 − φ me . +0 . − . . +0 . − . . +0 . − . . +0 . − . . +0 . − . . +0 . − . Nonzero params 15 10 9 12 8 8 σ w (residuals) 6411 5617 5498 7280 8056 7944 σ r (residuals) 0 0 781 0 335 522 Table 8: Independent Reference Set (Group B) Tests and Corrections φ me ﬁxed to 0.5000 φ me ﬁxed to 0.5022 Target Best-Fit D e ( χ ) Best-Fit D e (BIC) Best-Fit D e ( χ ) Best-Fit D e (BIC)CoRoT-1 4992 +430 − +699 − +426 − +375 − Star 1 3861 +332 − +280 − +452 − +306 − Star 2 1618 +268 − +340 − +290 − +696 − Star 3 101 +462 − +36 − +427 − +24 − Star 4 2003 +427 − +322 − +460 − +527 − Avg. 1-4 1784 +362 − +285 − +391 − +514 − Corrected 3208 +562 − +755 − +528 − +622 −

42 –Table 9: Benchmark Data Set AttributesName Type σ w σ r Obs. Win Samp. Rate D e φ me S1 Synth 1000 0 240 75 4000 0.500S2 Synth 1000 0 240 75 3300 0.475S3 Synth 1000 0 160 75 1000 0.500S4 Synth 1000 0 506 75 1600 0.500S5 Synth 1000 200 240 75 2000 0.500S6 Synth 1000 1000 240 250 4000 0.500S7 Synth 1000 200 318 25 2600 0.500R1 Star 2 6188 2239 293 141 4000 0.500R2 Star 0 7318 1225 293 141 1500 0.500R3 Star 0 7318 1225 293 141 0 0.500Table 10: Data Set S1: Synthetic Noise (Full Table Online)Phase O.Flux Error E.Shape N.Base W.Noise R.Noise0.42000 1.00452 0.00099 1.0040 1.00052 0.00052 0.000000.42053 1.00393 0.00099 1.0040 0.99993 -0.00007 0.000000.42107 1.00404 0.00099 1.0040 1.00004 0.00004 0.000000.42161 1.00460 0.00099 1.0040 1.00060 0.00060 0.000000.42214 1.00330 0.00099 1.0040 0.99930 -0.00070 0.00000... ... ... ... ... ... ...

Table 11: Data Set R1: Real Noise (Full Table Online)Phase O.Flux Error E.Shape N.Base O.Pos Airm. FWHM x Disp y Disp Sky0.43242 1.00245 0.00418 1.0040 0.99846 0 1.4050 10.6442 4.9667 -1.8051 -38.1740.43279 1.00067 0.00405 1.0040 0.99668 1 1.4023 10.6731 5.8475 -2.4504 35.4850.43294 1.00390 0.00411 1.0040 0.99990 1 1.4010 10.6792 5.8099 -2.4756 14.2240.43315 0.99956 0.00415 1.0040 0.99557 0 1.3995 10.6743 5.3403 -2.8960 -7.6870.43331 0.99974 0.00415 1.0040 0.99576 0 1.3983 10.6720 4.6221 -2.7726 -2.658... ... ... ... ... ... ... ... ... ... ... 44 –

A. Benchmark Data Sets

Given the biases and complications in detecting exoplanet eclipses, and the diversity ofroutines developed to make these measurements, there is a strong need for common referencedatasets containing realistic noise levels and patterns against which diﬀerent teams can testthe performance of their routines. Such datasets are not commonly available; thus, weprovide here a set of models to serve this purpose for the exoplanet communityWe selected a number of representative light curves from our studies that can be usedas benchmark testing for any routines designed to ﬁt and measure eclipses: ten datasets thatspan the full breadth of our tests, both the synthetic noise models (white and red noise) andthe real data systematics (from the CoRoT-1 Ks-band photometry). Various baselines from160 to 506 minutes and sampling rates from 25 to 250 points per hour are featured. Thedepth and central phase of the eclipses that are input are provided, so that any team cancheck the results from their analysis routine against them.Table 9 shows the characteristics of each dataset: the type of noise (synthetic or fromone of the real stars’ diﬀerential light curves), white and red noise amplitude σ w and σ r (inppm), observing window (in minutes), sampling rate (in points per hour), and the inputeclipse signal’s depth (in ppm) and mid-eclipse phase. For the synthetic datasets, σ w and σ r are the input values that we started with; for the datasets with real noise, these values weremeasured by ﬁtting to Equation 1. These datasets are made available in machine-readableASCII format on the Vizier Data Service. The ﬁrst seven sets, labeled S1 through S7, are synthetic light curves constructed inthe manner described in Section 4.1. Sets S1 through S4 contain white noise only, while S5through S7 also include red noise components (S5 and S6 use Pattern 3, S7 uses Pattern 4;see Figure 1). Sets S1 and S6 are also the data shown in panels (b) and (c), respectively, ofFigure 3. The ﬁrst several lines of set S1 are shown in Table 10 as an example to illustratethe format and data types. The ﬁrst three columns show the orbital phase at which eachobservation occurs, the observed ﬂux, and the ﬂux uncertainty. Only these three columnsare necessary for testing purposes; the remaining four columns simply separate the diﬀerentcomponents of the construction: the noiseless eclipse shape, the combined noise base, andthe white and red noise components. A greater number of signiﬁcant digits are given in theonline tables.The three remaining datasets R1, R2, and R3 use the noise and systematics from the For the time being, they can be found at ;this is to be replaced later with Vizier url and citation, when the paper is accepted.

45 –Fig. 1.— Five red noise models we used for the synthetic light curve tests. Patterns 1,2, and 3 were constructed by adding together four sinusoidal waves of diﬀerent periods,amplitudes, and phase angles. Patterns 4 and 5 were constructed by integrating a randomnormal distribution to form Brownian noise. Each selected noise model was normalized tothe desired red noise amplitude, and then combined with white noise and a simulated eclipsecurve to form a synthetic light curve. 46 –Fig. 2.— A comparison of the results from each of the two independently developed MCMCanalysis processes we tested on eight selected synthetic light curves. Identical sets of lightcurves, covering a range of eclipse depth, baseline length, sampling rate, and red noiseamplitude and pattern, were run through each analysis, in order to compare the results toeach other and to the input values. The x-positions of each result are oﬀset for purposes ofclarity. The results from both MCMCs were consistent with each other, and are also shownin Table 3 along with a description of the light curves. 47 –Fig. 3.— Examples of construction of eclipse light curves: a 4000-ppm deep eclipse with nonoise at all (a), and with each of the three types of noise models (shown in red): White noiseonly (b), white noise plus red noise (c), and real noise systematics from NICFPS Ks-bandphotometry (d). For each case, we combined the eclipse shape and noise into a light curveto model with the MCMC and recover the best-ﬁt depth and mid-eclipse phase. 48 –Fig. 4.— The behavior of the best-ﬁt recovered eclipse signals as a function of input depth,for the tests with white noise only. The gray vertical bar shows the white noise level of σ w =1000 ppm. The accuracy is expressed as a percent diﬀerence between the input andrecovered depth. Each row of plots represents a diﬀerent baseline length, and the colorsshow the diﬀerent sampling rates. The plots in the left column use the χ informationcriterion, while those in the right column use the BIC. The x-positions of the symbols areslightly oﬀset in order to distinguish the data points and error bars. Following each series ofpoints from left to right, the critical depth is deﬁned as the point at which the depth percenterrors leave the ±

10% range for the second time. 49 –Fig. 5.— The behavior of the best-ﬁt recovered eclipse signals as a function of input depth,for the tests with white noise and red noise (Pattern 3) added. The amplitudes of these noisepatterns ( σ w =1000 ppm, σ r =500 ppm) are shown by the vertical gray and red bars. Theaccuracy is expressed as a percent diﬀerence between the input and recovered depth. Eachrow represents a diﬀerent baseline length, and the colors red, green, and blue represent thediﬀerent sampling rates. The plots in the left column use the χ information criterion, whilethose in the right column use the BIC. The x-positions of the symbols are slightly oﬀset inorder to distinguish the data points and error bars. 50 –Fig. 6.— Comparison of the two diﬀerent types of red noise models we used in these analyses:a sum of sinusoids (Pattern 3), and two Brownian random walks (Patterns 4 and 5, seeFigure 1), with amplitudes of σ w =1000 ppm, σ r =500 ppm, shown by the vertical gray andred bars. These tests all used the 240-minute observing window with a cadence of 250 pointsper hour. The accuracy is expressed as a percent diﬀerence between the input and recovereddepth. 51 –Fig. 7.— The light curve and trends from the real dataset (CoRoT-1b ﬁeld in Ks). Thetop set of points (red and blue; each represents one of the two oﬀset positions) shows thediﬀerential ﬂux between Comparison Star 1 and the others in the ﬁeld, to be used as realnoise for the tests in Sections 5 and 6. The points are in 15-minute bins for clarity. Below arecurves (normalized, not to scale with one another) showing the other parameters measuredto look for correlation. 52 –Fig. 8.— The depth recovery results from the combination of real noise and synthetic eclipsesof varying input depths. The dark green squares represent the results using χ , while thepink diamond points represent the results using the BIC. The measured white and red noiseamplitudes are indicated by the vertical gray and red bars, respectively. Each of the top sixpanels uses a diﬀerent star as the target, while the The bottom row shows results from twolight curves – one from a well-behaved series (Star 2), and one from a poorly-behaved series(Star 4) – clipped to only 50 minutes before and after the eclipse. 53 –Fig. 9.— Comparison of the results from using the linear and quadratic models for depen-dence on x- and y-displacement, for both χ (top panel) and BIC (bottom panel). Each ofthese are performed using the Star 2 light curve, with synthetic eclipses of varying depthsinput. The measured white and red noise amplitudes are indicated by the vertical gray andred bars, respectively. 54 –photometry of the CoRoT-1 ﬁeld in Ks. Set R1 treats comparison star 2 as the target andadds a 4000-ppm eclipse, modeled using the characteristics of “Exoplanet X” from Table 1.Refer to that table and Section 4.1 for all the other parameters that are needed to reconstructthe eclipse signal in light curve. This is the light curve shown in panel (d) of Figure 3; itis also equivalent to the eclipseless light curve 2-a described in Section 5.1 with a 4000-ppmeclipse shape added. Sets R2 and R3 use the light curve from the actual target star, with setR2 having an extra 1500-ppm eclipse added on top of the true signal. If a routine is workingproperly it should ﬁnd an eclipse 1500 ppm deeper for R2 than for R3. Table 11 shows theﬁrst several lines of set R1. The data to ﬁt is again in the ﬁrst three columns, but the realnoise sets include the other variables – oﬀset position, z , w , x , y , and ss