Curso de Cosmologia Pós – 2019/1 Part VII: Bayesian Statistics

Part VII: Bayesian Statistics
[R. Trotta 0803.4089]
[R. Trotta 1701.01467]
[Amendola & Tsujikawa ch. 13]

Miguel Quartin
Instituto de Física, UFRJ
Astrofísica, Relativ. e Cosmologia (ARCOS)
Curso de Cosmologia Pós – 2019/1



Topics

Short review of probability theory
Bayes' Theorem
The Likelihood method
Model Selection
[optional] Fisher Matrix


Probabilities

Classical interpretation of probability: the limit of relative frequencies over infinite realizations. Probability is "the number of times the event occurs over the total number of trials, in the limit of an infinite series of equiprobable repetitions."

Let's define 2 random (stochastic) variables x and y (e.g. numbers on a die roll).
p(X) is the probability of getting the result x = X
p(X, Y) or p(X ∩ Y) → probability of getting the results x = X AND y = Y
p(X | Y) or p(X ; Y) → probability of x = X given the fact that y = Y
p(X ∪ Y) → probability of getting the results x = X OR y = Y
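A minimal simulation of this limit, assuming a fair die (all numbers purely illustrative):

```python
# Sketch: the relative frequency of rolling a 6 converges to 1/6 as the
# number of trials grows (the frequentist definition of probability).
import numpy as np

rng = np.random.default_rng(42)
for n in [100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=n)   # fair die: faces 1..6
    freq = np.mean(rolls == 6)           # relative frequency of "6"
    print(f"n = {n:>9}: freq = {freq:.4f} (true p = {1/6:.4f})")
```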


Probabilities (2)

Some properties:
Joint probabilities are symmetric: p(X, Y) = p(Y, X)
Joint probability of independent events: p(X, Y) = p(X) p(Y)
Joint probability of dependent events: p(X, Y) = p(X | Y) p(Y)
Disjoint probability of mutually exclusive events: p(X ∪ Y) = p(X) + p(Y)
In particular, summing over all mutually exclusive outcomes: Σ_i p(X_i) = 1


Probabilities (3)

Let's discuss the conditional probability property: p(A | B) = p(A, B) / p(B)

Suppose A refers to "person that studies physics" and B to "person that plays piano".
Suppose also that we know that: p(B) = 1/100 and p(A, B) = 1/1000.
In other words, out of 1000 random people, 10 will play the piano and 1 will play the piano AND be a physicist.
So, if someone plays piano, they have a 1/10 chance of also being a physicist.


Bayes' Theorem

Note that the probability of A given B is not the probability of B given A.
E.g.: the probability of winning the lottery given that you played twice in your life is not the same as the probability that you played twice in your life given that you won the lottery!

From the symmetry of the joint probabilities, p(A | B) p(B) = p(A, B) = p(B | A) p(A), we get

p(A | B) = p(B | A) p(A) / p(B)

This is Bayes' Theorem of conditional probabilities.
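A quick numerical check of these rules with the piano/physicist numbers from the previous slide; the value of p(A) is an extra assumption introduced only to illustrate the Bayes inversion:

```python
# Check the conditional-probability rule p(A|B) = p(A,B)/p(B)
# and Bayes' theorem with the piano/physicist numbers.
p_B = 1 / 100     # p(plays piano)
p_AB = 1 / 1000   # p(physicist AND plays piano)

p_A_given_B = p_AB / p_B
print(p_A_given_B)   # 0.1: a piano player has a 1/10 chance of being a physicist

# Bayes' theorem inverts the conditioning, given p(A) (assumed here, say 1/200):
p_A = 1 / 200
p_B_given_A = p_AB / p_A
print(p_B_given_A * p_A / p_B)   # Bayes: p(A|B) = p(B|A) p(A) / p(B) -> 0.1 again
```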


Interpretation

The classical "frequentist" interpretation of probability (infinite realization limit of relative frequencies) has limitations:
It is circular → it assumes that the repeated trials have the same probability of outcomes.
It cannot deal with unrepeatable situations [e.g. (i) the probability I will die in a car accident; (ii) the probability the Big Bang happened the way it did].
"What is the probability that it rained in Manaus during D. Pedro II's 43rd birthday?"
How to correct for finite realizations? How many realizations are needed for the frequencies to approximate the probabilities? To which % accuracy does this approximation hold?


Interpretation (2)

The Bayesian interpretation is based on Bayes' Theorem.
Re-interpret the theorem not in terms of regular random variables but in terms of data (D) and theory (T).
Inverse statistical problem: what is the probability that theory T is correct given we measured the data D?

The "theory" might be a model (such as ΛCDM or DGP) or just the parameter values of an assumed model (such as Ωm0 and ΩΛ0, assuming ΛCDM).


Interpretation (3)

Bayesian analysis has some philosophical implications.
The best theory will be the most probable theory.
Bayesian analysis carries a mathematically precise formulation of Occam's Razor: "if 2 hypotheses are equally likely, the hypothesis with the fewest assumptions should be selected."
This was the only strong reason before the 18th century to choose Copernicus' model over Ptolemy's.
Karl Popper: we prefer simpler theories to more complex ones "because their empirical content is greater; and because they are better testable" → simple theories are more easily falsifiable.


The Likelihood Method

Let's define the data as a vector x and the parameters as a vector θ.

We write Bayes' theorem as

P(θ | x) = L(x | θ) p(θ) / g(x)

P → posterior probability
p(θ) → prior probability
L → likelihood function
g(x) → as we will see, usually just a normalization factor

We are interested in the posterior.
In the literature people sometimes refer to the posterior also as the "likelihood".


The Likelihood Method (2)

The posterior is a probability, so it has to be normalized to unity:

∫ P(θ | x) dθ = 1  →  g(x) = ∫ L(x | θ) p(θ) dθ

This integral is called the evidence.
g(x) does not depend on the parameters, so it is useless for parameter determination.
But it is very useful to choose between models.
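A minimal grid sketch of this normalization for a toy one-parameter problem (the likelihood and prior below are made up):

```python
# Sketch: normalize a toy 1-parameter posterior on a grid; the evidence
# g(x) is the integral of likelihood × prior over the parameter.
import numpy as np

theta = np.linspace(-5, 5, 2001)
dtheta = theta[1] - theta[0]
like = np.exp(-0.5 * ((theta - 1.0) / 0.5) ** 2)   # toy L(x|θ)
prior = np.full_like(theta, 1 / 10.0)              # flat prior on [-5, 5]
evidence = np.sum(like * prior) * dtheta           # g(x) = ∫ L p dθ
posterior = like * prior / evidence
print(evidence, np.sum(posterior) * dtheta)        # second number = 1
```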


Prior and Prejudice

Priors are inevitable in the likelihood (posterior) method.
Frequentists don't like it → subjective prior knowledge.
Bayesians have to learn to love it → after all, we always know something before the analysis.

E.g.: we can use p(Ωm0 < 0) = 0, as it does not make sense to have negative matter density.

We can add information from a previous experiment. E.g. experiment A measured h = 0.72 ± 0.08, so we can use, say, the Gaussian prior

p(h) ∝ exp[−(h − 0.72)² / (2 · 0.08²)]
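As a sketch, that prior as a normalized function in code:

```python
# Sketch: Gaussian prior encoding a previous measurement h = 0.72 ± 0.08.
import numpy as np

def prior_h(h, mean=0.72, sigma=0.08):
    """Normalized Gaussian prior p(h) from experiment A's measurement."""
    return np.exp(-0.5 * ((h - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print(prior_h(0.72), prior_h(0.90))   # strongly penalizes h far from 0.72
```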


Prior and Prejudice (2)

You are free to choose your prior → but the choice must be explicit.
You HAVE to choose a prior → p(θ) = 1 *is* a particular prior, which under a parameter change will no longer be constant.
E.g.: p(t) = 1 ≠ p(z) = 1 ≠ p(log t) = 1 …
E.g. 2: a measurement of ΩΛ0 assumes the strong prior that the model is ΛCDM.

Priors may be subjective, but the analysis is objective.
Priors are an advantage of Bayes → no inference can be made without assumptions.
Data can show that the priors were "wrong".


The Likelihood Method (3)

If we are not interested in model selection we can neglect the function g(x).
The posterior P must then be normalized.
The best-fit parameter values are the ones that maximize P.

The n% confidence region R of the parameters is the region around the best fit for which

∫_R P(θ | x) dθ = n/100

The confidence region in general is not symmetric.
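A sketch of finding such a region numerically on a grid, accumulating probability from the highest posterior values downward (toy Gaussian posterior):

```python
# Sketch: 68.3% confidence (credible) region on a grid, built by adding
# posterior mass from the highest values down (highest-density region).
import numpy as np

theta = np.linspace(0, 1, 2001)
dtheta = theta[1] - theta[0]
post = np.exp(-0.5 * ((theta - 0.3) / 0.1) ** 2)   # toy posterior
post /= post.sum() * dtheta                        # normalize

order = np.argsort(post)[::-1]                     # grid points, highest P first
mass = np.cumsum(post[order]) * dtheta
inside = order[mass <= 0.683]                      # points inside the 68.3% region
print(theta[inside].min(), theta[inside].max())    # ≈ [0.2, 0.4] here
```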


The Likelihood Method (4)

If the likelihood (i.e. the posterior) has many parameters, it is interesting to know what information it has on each parameter (or each pair) independently of the others.
We must do a weighted sum over the other parameters:

P(θ1) = ∫ P(θ1, θ2, …, θn) dθ2 … dθn

This is referred to as marginalization over a parameter.
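A grid sketch of marginalization for a toy 2-parameter posterior:

```python
# Sketch: marginalize a 2D grid posterior P(θ1, θ2) over θ2.
import numpy as np

t1 = np.linspace(-3, 3, 301)
t2 = np.linspace(-3, 3, 301)
T1, T2 = np.meshgrid(t1, t2, indexing="ij")
post = np.exp(-0.5 * (T1**2 + (T2 - 0.5 * T1) ** 2))  # toy correlated posterior

d2 = t2[1] - t2[0]
post_t1 = post.sum(axis=1) * d2                # P(θ1) = ∫ P(θ1, θ2) dθ2
post_t1 /= post_t1.sum() * (t1[1] - t1[0])     # normalize the 1D result
print(t1[np.argmax(post_t1)])                  # marginalized peak, ≈ 0 here
```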


The Likelihood Method (5)

It is customary to use the following confidence regions: 68.3%, 95.4% and 99.73%. The reason is that, for Gaussian posteriors, these are the 1, 2 and 3 standard deviations.
We therefore often refer to these regions, for simplicity, as just the 1σ, 2σ and 3σ regions.

Say, if for Ωm0 the best fit is 0.3 and the 68% confidence region (1σ) is [0.1, 0.4], we write

Ωm0 = 0.3 (+0.1, −0.2)

Note that here the 2σ region will not simply be twice as wide.


The Likelihood Method (6)

[Figure: example posterior with the best fit marked]


How to Build the Likelihood?

The likelihood is a function of the data → its functional form depends on the instrument used to collect the data.
Usually instruments have (approximately) either Gaussian or Poisson noise.
It is common to assume Gaussian noise by default.

If the likelihood is Gaussian in the data and the data are independent (uncorrelated errors) we have

L(x | θ) = ∏_i (2πσ_i²)^(−1/2) exp[−(x_i − μ_i(θ))² / (2σ_i²)]

where μ_i(θ) is the theoretical prediction for data point i.
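As a sketch, the corresponding log-likelihood in code (data, errors and model values below are made up):

```python
# Sketch: Gaussian likelihood for independent data points. Working with
# ln L avoids numerical underflow for many data points.
import numpy as np

def log_likelihood(model, x, sigma):
    """ln L = -0.5 * chi^2 + const, for uncorrelated Gaussian errors."""
    chi2 = np.sum(((x - model) / sigma) ** 2)
    norm = -0.5 * np.sum(np.log(2 * np.pi * sigma**2))
    return -0.5 * chi2 + norm

x = np.array([1.1, 1.9, 3.2])        # toy measured data
sigma = np.array([0.1, 0.2, 0.15])   # per-point errors
print(log_likelihood(np.array([1.0, 2.0, 3.0]), x, sigma))
```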


Model Selection

We now want to address the more general problem: how to tell which of 2 competing theories is statistically better given some data?

Frequentist approach: compare the reduced χ² (i.e. the χ² per degree of freedom, d.o.f.) of the data in the 2 theories.
The χ²-distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal (i.e. Gaussian) random variables.

The p.d.f. is given by (although we do not use it explicitly)

f(x; k) = x^(k/2 − 1) e^(−x/2) / [2^(k/2) Γ(k/2)]


Model Selection (2)

This is the distribution if the likelihood of the data was exactly given by L ∝ exp(−χ²/2).

In a nutshell, it is the sum of squares of the "distance, in units of standard deviations, between data points and theoretical curve".
We refer to the total χ² as the sum

χ² = Σ_i (x_i − μ_i(θ))² / σ_i²

Frequentist mantra: good models have χ²/d.o.f. ≈ 1.
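A toy check of the mantra with a straight-line fit to mock data (all numbers illustrative):

```python
# Sketch: chi^2 per degree of freedom for a toy straight-line fit;
# a good model should give chi^2/d.o.f. ≈ 1.
import numpy as np

rng = np.random.default_rng(1)
xdat = np.linspace(0, 10, 50)
sigma = 0.3
ydat = 2.0 * xdat + 1.0 + rng.normal(0, sigma, xdat.size)  # mock data

a, b = np.polyfit(xdat, ydat, 1)    # fit a line (2 parameters)
chi2 = np.sum(((ydat - (a * xdat + b)) / sigma) ** 2)
dof = xdat.size - 2                 # 50 points minus 2 fitted parameters
print(chi2 / dof)                   # should come out close to 1
```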


Model Selection (3)

The Bayesian equivalent of the χ² comparison is the Bayes ratio → the ratio of evidences of models "1" and "2":

B12 = E1 / E2

For a model "M" the evidence is

E(M) = ∫ L(x | θ, M) p(θ | M) dθ

B12 > 1 → model 1 is favored by the data (and vice-versa).

If you have an a priori reason to favor a model → generalize the above to include model priors.

The Bayes factor between the 2 models is then just

p(M1 | x) / p(M2 | x) = B12 · p(M1) / p(M2)
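A toy numerical Bayes ratio: one datum, model 1 with no free parameter vs. model 2 with one flat-prior parameter (all values are made up):

```python
# Sketch: Bayes ratio B12 = E1/E2 for a single datum x = 0.8 with
# Gaussian noise σ = 0.3. Model 1 fixes θ = 0 (no free parameter);
# model 2 has θ free with a flat prior on [-2, 2].
import numpy as np

x, sig = 0.8, 0.3
like = lambda th: np.exp(-0.5 * ((x - th) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

E1 = like(0.0)   # no free parameter: the evidence is just L at the fixed value
theta = np.linspace(-2, 2, 4001)
E2 = np.sum(like(theta) * (1 / 4.0)) * (theta[1] - theta[0])   # flat prior 1/4
print(E1, E2, E1 / E2)   # B12 < 1 here: the data prefer model 2
```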


Model Selection (4)

The Bayes factor has several advantages over the simple χ².
If the data is poor and a particular parameter of one model is unconstrained by it, the model is not penalized.
E.g.: a given dark energy model has a parameter related to, say, cluster abundance at z = 2, for which data is poor. This is good, because poor data ≠ poor model!

Mathematically → the posterior is approximately flat in this parameter λ; assuming (as usual) that the priors are independent and normalized, we have that:

E(M) ≈ ∫ L(x | θ, M) p(θ) dθ · ∫ p(λ) dλ = ∫ L(x | θ, M) p(θ) dθ

i.e. the unconstrained parameter simply drops out of the evidence.


Model Selection (5)

To get a better intuition, we can study the simple case of Gaussian likelihoods + Gaussian priors → analytical E(x).
Assuming uncorrelated parameters, the posterior is then (integrating over the data) [note f ↔ L]: for each parameter, with a likelihood f of width σ centered at θ̂ and a prior of mean μ and width Σ,

P(θ) ∝ f(θ) p(θ) → a Gaussian with mean (θ̂/σ² + μ/Σ²) / (1/σ² + 1/Σ²) and variance (1/σ² + 1/Σ²)⁻¹


Model Selection (6)

The evidence is then given by (per parameter, for a likelihood of width σ peaking at θ̂ and a prior of mean μ and width Σ)

E = f_max · σ/√(σ² + Σ²) · exp[−(θ̂ − μ)² / (2(σ² + Σ²))]

Let's analyze the 3 distinct terms above:
f_max is the maximum likelihood → how well the model fits the data.
σ/√(σ² + Σ²) is always < 1 → penalizes extra parameters constrained by the data → Occam's Razor factor.
exp[…] → penalizes cases where the prior best fit is very different from the posterior best fit.
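A quick numerical check of the decomposition above for one parameter (all widths and peaks are arbitrary toy values):

```python
# Sketch: numerical evidence vs. the analytic decomposition for a
# 1-parameter Gaussian likelihood and Gaussian prior.
import numpy as np

th_hat, sig = 0.5, 0.1   # likelihood peak and width
mu, Sig = 0.0, 1.0       # prior mean and width
fmax = 1.0               # likelihood maximum

theta = np.linspace(-6, 6, 20001)
f = fmax * np.exp(-0.5 * ((theta - th_hat) / sig) ** 2)
p = np.exp(-0.5 * ((theta - mu) / Sig) ** 2) / (Sig * np.sqrt(2 * np.pi))
E_num = np.sum(f * p) * (theta[1] - theta[0])

# fmax × Occam factor (<1) × prior/posterior mismatch penalty:
E_ana = fmax * sig / np.sqrt(sig**2 + Sig**2) \
        * np.exp(-0.5 * (th_hat - mu) ** 2 / (sig**2 + Sig**2))
print(E_num, E_ana)   # the two agree
```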


Jeffrey's Scale

As we have seen: B12 > 1 → model 1 is favored by the data (and vice-versa).

There is no absolute rule for how big B12 must be to conclude whether one model must be replaced by another.

A simple rule-of-thumb, though, is to use a simple scale to guide the discussion. Jeffrey's scale is often used; in the convention of Trotta (0803.4089):

|ln B12| < 1.0 → inconclusive
1.0 ≤ |ln B12| < 2.5 → weak evidence
2.5 ≤ |ln B12| < 5.0 → moderate evidence
|ln B12| ≥ 5.0 → strong evidence


Fisher Matrix

In a nutshell → the Fisher Matrix method is an approximation for the computation of the posterior under the assumption that it is Gaussian in the parameters.
Advantages:
very fast to compute (either analytically or numerically)
gives directly the (elliptical) confidence-level contours
Disadvantages:
gives wrong results when non-Gaussianity is strong
no intrinsic flags to warn you when non-Gaussianity is strong
numerical derivatives can be noisy

For a 4-page quick-start guide, see: arXiv:0906.4123
For more detail, see Amendola & Tsujikawa, Sect. 13.3


Fisher Matrix (2)

We write the posterior as a multivariate Gaussian:

P(θ) ∝ exp[−½ (θ − θ̂)ᵀ F (θ − θ̂)]

The matrix F is called the Fisher (or information) matrix:

F_ij = −∂² ln L / ∂θ_i ∂θ_j, evaluated at the peak

To compute F, we Taylor expand the posterior near its peak, the maximum likelihood (ML) point θ̂.
We need to compute this point first, but this is simple:
When doing forecasts for future experiments, we know the ML beforehand (it is our fiducial model).
For real data → multi-dimensional minimization algorithms are fast.
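A minimal numerical sketch: estimating F by finite differences of a toy 2-parameter ln L around its peak (the likelihood and step size are illustrative):

```python
# Sketch: Fisher matrix by finite differences of ln L around the peak,
# F_ij = -d^2 ln L / dθ_i dθ_j, for a toy 2-parameter Gaussian likelihood.
import numpy as np

def loglike(th):   # toy ln L, peak at (1.0, 2.0)
    d = th - np.array([1.0, 2.0])
    return -0.5 * (d[0] ** 2 / 0.1**2 + d[1] ** 2 / 0.3**2 + d[0] * d[1])

def fisher(loglike, peak, h=1e-4):
    n = len(peak)
    F = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            # central finite-difference second derivative:
            F[i, j] = -(loglike(peak + e_i + e_j) - loglike(peak + e_i - e_j)
                        - loglike(peak - e_i + e_j) + loglike(peak - e_i - e_j)) / (4 * h**2)
    return F

F = fisher(loglike, np.array([1.0, 2.0]))
print(F)                  # ≈ [[100, 0.5], [0.5, 11.1]]
print(np.linalg.inv(F))   # covariance matrix C = F^{-1}
```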


Properties of the Fisher Matrix

Once we have F, the covariance matrix is simply its inverse: C = F⁻¹.
For 2 parameters:

C = ( σ1²  σ12 ; σ12  σ2² )

The ellipse axis lengths (α a and α b) and rotation angle are given by the eigenvalues and eigenvectors of C: the semi-axes are α√λ1 and α√λ2, with α ≈ 1.52 (68.3%) or 2.48 (95.4%) for 2D contours [cf. arXiv:0906.4123].
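A sketch of extracting the ellipse geometry from a toy C (the matrix entries are made up; the α values follow the quick-start guide cited above):

```python
# Sketch: 68.3% error ellipse from a 2×2 covariance matrix C = F^{-1};
# semi-axes are α√λ_i with λ_i the eigenvalues of C.
import numpy as np

C = np.array([[0.010, 0.003],
              [0.003, 0.090]])
lam, vec = np.linalg.eigh(C)       # eigenvalues (ascending) and eigenvectors
alpha = 1.52                       # 2D 68.3% contour (2.48 for 95.4%)
a, b = alpha * np.sqrt(lam[::-1])  # semi-major, semi-minor axes
angle = np.degrees(np.arctan2(vec[1, -1], vec[0, -1]))  # major-axis rotation
print(a, b, angle)
```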


Properties of the Fisher Matrix (2)

Marginalization over a parameter → simply remove the row & column of that parameter from C = F⁻¹ and invert the new, reduced C.

Fixing a parameter to its best fit → simply remove the row & column of that parameter from F.

Adding independent datasets → simply add: F_tot = F1 + F2.

Changing variables → a simple Jacobian transformation: F' = Jᵀ F J, with J_ij = ∂θ_i/∂θ'_j.
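These rules in code, on the toy Fisher matrix from the previous sketch:

```python
# Sketch: marginalizing vs. fixing a parameter in the Fisher formalism.
import numpy as np

F = np.array([[100.0, 0.5],
              [0.5, 11.1]])
C = np.linalg.inv(F)

sigma1_marg = np.sqrt(C[0, 0])          # marginalize over θ2: use C = F^{-1}
sigma1_fixed = np.sqrt(1.0 / F[0, 0])   # fix θ2: drop its row/column from F
print(sigma1_marg, sigma1_fixed)        # fixing always gives the smaller error

F_tot = F + F                           # adding an identical, independent dataset
print(np.sqrt(np.linalg.inv(F_tot)[0, 0]))   # error shrinks by 1/sqrt(2)
```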


Covariance Matrix

When the data are correlated, we need to compute the covariance matrix Σ:

Σ_ij = ⟨(x_i − ⟨x_i⟩)(x_j − ⟨x_j⟩)⟩ = ⟨x_i x_j⟩ − ⟨x_i⟩⟨x_j⟩


Covariance Matrix (2)

The cov matrix is related to the correlation matrix:

Corr_ij = Σ_ij / (σ_i σ_j), with σ_i = √Σ_ii

It corresponds to the cov matrix of the standardized random variable set (x_i − ⟨x_i⟩)/σ_i.

We can also define the cross-covariance between 2 vectors:

Σ^(xy) = ⟨(x − ⟨x⟩)(y − ⟨y⟩)ᵀ⟩

Some properties of Σ:
It is positive-semidefinite and symmetric.


Covariance Matrix (3)

To compute the cov matrix, we need to compute expected values (means) over many realizations.
Sometimes it can be computed analytically.
But more often it cannot, and one has to rely on simulations of mock data.
Mock data, or mock catalogs, are collections of random realizations of data according to some distribution.
Many mock (Monte Carlo) catalogs have to be generated in order to estimate the cov matrix with good precision.
Never forget the golden rule: statistical errors decrease as 1/sqrt(N).
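A minimal sketch of the Monte Carlo estimate, with a made-up 5-bin data vector whose true covariance is known by construction:

```python
# Sketch: estimating a covariance matrix from n_mock random realizations
# of a toy data vector with correlated noise (true covariance = L L^T).
import numpy as np

rng = np.random.default_rng(3)
n_bins, n_mock = 5, 10_000
L = np.tril(rng.normal(0, 0.1, (n_bins, n_bins))) + np.eye(n_bins)
mocks = rng.normal(0, 1, (n_mock, n_bins)) @ L.T   # correlated mock data vectors

cov = np.cov(mocks, rowvar=False)                  # estimated covariance matrix
print(np.abs(cov - L @ L.T).max())   # small: sampling noise ~ 1/sqrt(n_mock)
```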


Example: 2-pt correlation function

Let's study one particular example involving the 2-point correlation function in astronomy.

We want to study how a given class of objects is distributed in the sky.
Let's focus on galaxies, for instance.
Given a random galaxy in a location, the 2-point correlation function ξ(r) describes the excess probability that another galaxy will be found within a given (scalar) distance r, compared to a uniform distribution.

Because gravity attracts objects, they tend to cluster together, so we expect ξ(r) to decrease as r increases.


Example: 2-pt correlation function (2)

In principle, ξ depends on the vector r, but if our data is assumed to be statistically homogeneous (same statistics everywhere), then ξ only depends on r = |r|. Let's assume this.

If we define the number of galaxies dN in a small volume dV, located at distance r from a given galaxy, and the average density as ρ0, we have

dN = ρ0 [1 + ξ(r)] dV

It's an excess probability → we have an integral constraint: ξ must average to zero over the survey volume, ∫ ξ(r) dV ≈ 0.


Example: 2-pt correlation function (3)

If the correlation ξ is positive (negative), there are more (fewer) particles than in a uniform distribution.

For a given catalog, unless the volume has a very simple geometry (say, a perfect sphere), it is impossible to compute the correlation function or its cov matrix analytically.

We can estimate ξ in a given catalog with the following estimator, where DD means the number of galaxy pairs at a distance (r, r + Δr) in the data, and RR the same in a random, uniform catalog with the same volume:

ξ̂(r) = DD(r)/RR(r) − 1


Example: 2-pt correlation function (4)

We compute the numbers DD(r) and RR(r) for all pairs of objects (here, galaxies).

We do this for a number of distance bins.
In each bin, we need to estimate the error bars.
Since the same objects enter different bins, the data is highly correlated → we need to compute the cov matrix.
We need to generate many random data catalogs!
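A toy pair-counting sketch of this estimator (uniform points in a unit box, so ξ should come out ≈ 0; with equal-size catalogs the DD and RR normalizations cancel):

```python
# Sketch of the DD/RR - 1 estimator: count data-data and random-random
# pairs in distance bins and form ξ(r) for each bin.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(7)
data = rng.random((500, 3))   # toy "galaxy" positions in a unit box
rand = rng.random((500, 3))   # uniform random catalog, same volume and size

bins = np.linspace(0.01, 0.3, 10)
DD, _ = np.histogram(pdist(data), bins)   # data-data pair counts per bin
RR, _ = np.histogram(pdist(rand), bins)   # random-random pair counts per bin
xi = DD / RR - 1                          # ≈ 0 here: the toy data are uniform
print(xi)
```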


Comparison of Different Methods

When we have a posterior with non-linear dependence on the parameters (i.e., it is not Gaussian in the parameters, even if it is Gaussian in the data), the Fisher Matrix approach might yield incorrect results.

We then have several options to compute it. Let's assume we have N parameters. The most common are:
Grid analysis → compute the posterior numerically on an N-dimensional tensor, which is the exterior product of the different vectors of values for each parameter.
Must guess the parameter ranges, or apply trial & error.
Run first a very coarse-grained tensor, then refine.
Very simple to code and implement.


Comparison of Different Methods (2)

MCMC analysis → usually based on the Metropolis-Hastings algorithm.
Goal: probe the N-dimensional space in a "non-rectangular" way → concentrate on the high-posterior region → more efficient search.
We will study it later.
Fisher Matrix → do the analysis anyway and hope for the best.
Nested Sampling analysis → see 1306.2144 & 1506.00171.

Comparison of techniques:
Fisher Matrix → fast and simple approximation, but may be very wrong.
MCMC → complexity is N log(N), but the code requires tuning.
Grid algorithm → complexity grows as Exp(N) [only OK for up to ~6 params].
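A minimal Metropolis-Hastings sketch on a toy 2-parameter Gaussian posterior (proposal width, burn-in and chain length are illustrative and would need tuning in a real analysis):

```python
# Sketch: Metropolis-Hastings sampling of a toy 2-parameter posterior.
import numpy as np

def log_post(th):   # toy ln posterior, peak at (0.3, 0.7)
    d = th - np.array([0.3, 0.7])
    return -0.5 * (d[0] ** 2 / 0.01 + d[1] ** 2 / 0.04 + 2.0 * d[0] * d[1])

rng = np.random.default_rng(0)
th = np.array([0.0, 0.0])
chain = []
for _ in range(20_000):
    prop = th + rng.normal(0, 0.05, 2)   # symmetric Gaussian proposal
    # Accept with probability min(1, P(prop)/P(th)):
    if np.log(rng.random()) < log_post(prop) - log_post(th):
        th = prop
    chain.append(th)

chain = np.array(chain[5000:])                 # drop burn-in
print(chain.mean(axis=0), chain.std(axis=0))   # ≈ posterior means and errors
```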