
Introduction to distribution theory and regression analysis

(STA2030S)

Christien Thiart and the STA2030 team
Department of Statistical Sciences

University of Cape Town

© 20 July 2009


Contents

1 Random variables, univariate distributions
  1.1 Assumed statistical background
  1.2 Random variables
  1.3 Probability mass functions
  1.4 Probability density functions
      1.4.1 The Gamma distribution
      1.4.2 The Beta distribution
  1.5 Distribution function
  1.6 Functions of random variables - cumulative distribution technique
  Tutorial Exercises

2 Bivariate Distributions
  2.1 Joint random variables
  2.2 Independence and conditional distributions
  2.3 The bivariate Gaussian distribution
  2.4 Functions of bivariate random variables
      2.4.1 General principles of the transformation technique
  Tutorial Exercises

3 Moments of univariate distributions and moment generating function
  3.1 Assumed statistical background
  3.2 Moments of univariate distributions
      3.2.1 Moments - examples A-F
  3.3 The moment generating function
  3.4 Moment generating functions for functions of random variables
  3.5 The central limit theorem
  Tutorial Exercises


4 Moments of bivariate distributions
  4.1 Assumed statistical background
  4.2 Moments of bivariate distributions: covariance and correlation
  4.3 Conditional moments and regression of the mean
  Tutorial Exercises

5 Distributions of Sample Statistics
  5.1 Random samples and statistics
  5.2 Distributions of sample mean and variance for Gaussian distributed populations
  5.3 Application to χ2 goodness-of-fit tests
  5.4 Student's t distribution
  5.5 Applications of the t distribution to two-sample tests
  5.6 The F distribution
  Tutorial Exercises

6 Regression analysis
  6.1 Introduction
  6.2 Simple (linear) regression - model, assumptions
  6.3 Matrix notation for simple regression
  6.4 Multivariate regression - model, assumptions
  6.5 Graphical residual analysis
  6.6 Variable diagnostics
      6.6.1 Analysis of variance (ANOVA)
  6.7 Subset selection of regressor variables - building the regression model
      6.7.1 All possible regressions
      6.7.2 Stepwise regression
  6.8 Further residual analysis
  6.9 Inference about regression parameters and predicting
      6.9.1 Inference on regression parameters
      6.9.2 Drawing inferences about E[Y|x_h]
      6.9.3 Drawing inferences about future observations
  Tutorial Exercises

A Attitude


B Maths Toolbox
  B.1 Differentiation (e.g. ComMath, chapter 3)
  B.2 Integration (ComMath, chapter 7)
  B.3 General
  B.4 Double integrals
  B.5 Matrices: e.g. ComMath, Chapter 5


Chapter 1

Random variables, univariate distributions

These notes have been prepared for the course STA2030S, for which the pre-requisite at UCT is a pass in STA1000, STA2020 and a course in mathematics. It is thus assumed that the material in the textbook for STA1000 ("INTROSTAT", by L G Underhill and D J Bradfield) is known.

This course is based on the following three principles:

• First year statistical background (Introstat).

• Sharp mathematical tools (Appendix B) and

• Attitude (Appendix A).

1.1 Assumed statistical background

• Concept of a random variable (Introstat, chapter 4)

• Probability mass functions: binomial, Poisson (Introstat, chapters 5 and 6)

• Probability density functions: normal, uniform, exponential (Introstat, chapter 5 and 6)

• Concept of a cumulative distribution function (Introstat, chapter 6)

• Quantiles (Introstat, chapter 1, chapter 6)

1.2 Random variables

[Random variable: a numerical quantity whose value is determined by the outcome of a random experiment (or repeatable process). Each repetition of the random experiment is called a trial.]

It is useful to start this course with the question: "what do we mean by probability?" The basis in STA1000 was the concept of a sample space, subsets of which are defined as events. Very often the events are closely linked to values of a random variable, i.e. a real-valued number whose "true" value is at present unknown, either through ignorance or because it is still undetermined. Thus if T is a random variable, then typical events may be T ≤ 20, 0 < T < 1, T is a prime number.


In general terms, we can view an event as a statement, which is in principle verifiable as true or false, but the truth of which will only be revealed for sure at a later time (if ever). The concept of a random experiment was introduced in STA1000, as being the observation of nature (the real world) which will resolve the issue of the truth of the statement (i.e. whether the event has occurred or not). The probability of an event is then a measure of the degree of likelihood that the statement is true (that the event has occurred or will occur), i.e. the degree of credibility in the statement. This measure is conventionally standardized so that if the statement is known to be false (the event is impossible) then the probability is 0; while if it is known to be true (the event is certain) then the probability is 1. The axioms of Kolmogorov ("INTROSTAT", chapter 3) give minimal requirements for a probability measure to be rationally consistent.

Kolmogorov (definition of Probability measure)

A probability measure on a sample space Ω is a function P from subsets (events, say A) of Ω to the interval [0, 1] which satisfies the following three axioms:

(1) P [Ω] = 1

(2) If A ⊂ Ω then P [A] ≥ 0.

(3) If A1 and A2 are disjoint then P[A1 ∪ A2] = P[A1] + P[A2]. In general, if A1, A2, ···, Ar, ··· is a sequence of mutually exclusive events (that is, the intersection between Ai and Aj is empty, Ai ∩ Aj = ∅, for i ≠ j) then

P[A1 ∪ A2 ∪ ···] = P[⋃_{i=1}^{∞} Ai] = ∑_{i=1}^{∞} P[Ai]

Within this broad background, there are two rather divergent ways in which a probability measure may be assessed and/or interpreted:

Frequency Intuition: Suppose the identical random experiment can be conducted N times (e.g. rolling dice, spinning coins), and let M be the number of times that the event E is observed to occur. We can define a probability of E by:

Pr[E] = lim_{N→∞} M/N

i.e. probability is interpreted as the relative proportion of times that E occurs in many trials. This (interpretation) is still a leap of faith! Why should the future behave like the past? Why should the occurrence of E at the next occasion obey the average laws of the past? In any case, what do we mean by "identical" experiments? Nevertheless, this approach does give a sense of objectivity to the interpretation of probability.

Very often, the "experiments" are hypothetical mental experiments! These tend to be based on the concept of equally likely elementary events, justified by symmetry arguments (e.g. the faces of a die).

Subjective Belief: The problem is that we cannot even conceptually view all forms of uncertainty in frequency of occurrence terms. Consider for example:

• The petrol price next December

• The grade of ore in a specific undeveloped block of ore in a mine

• Your marks for this course


None of these can be repeatedly observed, and yet we may well have a strong subjective sense of the probability of events defined in terms of these random variables. The subjective view accepts that most, in fact very nearly all, sources of uncertainty include at least some degree of subjectivity, and that we should not avoid recognizing probability as a measure of our subjective lack of knowledge. Of course, where well-defined frequencies are available, or can be derived from elementary arguments, we should not lightly dismiss these. But ultimately, the only logical rational constraint on probabilities is that they do satisfy Kolmogorov's axioms (for without this, the implied beliefs are incoherent, in the sense that actions or decisions consistent with stated probabilities violating the axioms can be shown to lead to certain loss).

The aim of statistics is to argue from specific observations of data to more general conclusions ("inductive reasoning"). Probability is only a tool for coping with the uncertainties inherent in this process. But it is inevitable that the above two views of probability extend to two distinct philosophies of statistical inference, i.e. the manner in which sample data should be extrapolated to general conclusions about populations. These two philosophies (paradigms) can be summarized as follows:

Frequentist, or Sampling Theory: Here probability statements are used only to describe the results of (at least conceptually) repeatable experiments, but not for any other uncertainties (for example regarding the value of an unknown parameter, or the truth of a "null" hypothesis, which are assumed to remain constant no matter how often the experiment is repeated). The hope is that different statisticians should be able to agree on these probabilities, which thus have a claim to objectivity. The emergence of statistical inference during the late 19th and early 20th centuries as a central tool of the scientific method occurred within this paradigm (at a time when "objectivity" was prized above all else in science). Certainly, concepts of repeatable experiments and hypothesis testing remain fundamental cornerstones of the scientific method.

The emphasis in this approach is on sampling variability: what might have happened if the underlying experiments could have been repeated many times over? This approach was adopted throughout first year.

Bayesian: Here probability statements are used to represent all uncertainty, whether a result of sampling variability or of lack of information. The term "Bayesian" arises because of the central role of Bayes' theorem in making probability statements about unknown parameters conditional on observed data, i.e. "inferring likely causes" from observed consequences, which was very much the context in which Bayes worked.

The emphasis is on characterizing how degrees of uncertainty change from the position prior to any observations, to the position after (or posterior to) these observations.

One cannot say that one philosophy of inference is "better" than another (although some have tried to argue in this way). Some contexts lend themselves to one philosophy rather than the other, while some statisticians feel more comfortable with one set of assumptions rather than the other.

Nevertheless, for the purposes of this course and for much of third year (STA3030), we will largely limit ourselves to discussing the frequentist approach: this approach is perhaps simpler, will avoid confusion of concepts at this relatively early stage in your training, and is the "classical" approach used in reporting experimental results in many fields. In fact, the fundamental purpose of this course is to demonstrate precisely how the various tests, confidence intervals, etc. in first year are derived from basic distributional theory applied to sampling variability viewed in a frequentist sense.

Notation: We shall use upper case letters (X, Y, ...) to signify random variables (e.g. time to failure of a machine, size of an insurable loss, blood pressure of a patient), and lower case letters (x, y, ...) to represent algebraic quantities. The expression X ≤ x thus represents the event, or the assertion, that the random variable denoted by X takes on a value not exceeding the real number x. (We can quite legitimately define events such as X ≤ 3.81.)

1.3 Probability mass functions

Now suppose that X can take on discrete values only. Let Ω be the set of all such values, e.g. Ω = {0, 1, 2, ···}.

Discrete random variable: A random variable X is said to be discrete if its range of possible values is countable; e.g. the number of cars parked in front of Fuller Hall; the number of babies born at Tygerberg hospital; the number of passengers on the Jammie Shuttle at 10 am.

We define the probability mass function (pmf) (sometimes simply termed the probability function) by p_X(•) = Pr[X = •] (where • is any scalar value). The following properties are then evident:

(i) p_X(x_i) ≥ 0 for x_i ∈ Ω (a countable set)
(ii) p_X(x) = 0 for all other values of x
(iii) ∑_{i=1}^{∞} p_X(x_i) = 1

Example A: Consider the experiment of tossing a coin 4 times. Define the random variable Y as the number of heads when a coin is tossed 4 times.

(1) Write down the sample space.

(2) Derive the probability mass function (pmf) (assume Pr(head) = p).

(3) Check that the 3 properties of a pmf are satisfied.

First list the sample space: we need to fill 4 positions, and the rv Y (the number of heads) can take on the values 0, 1, 2, 3 or 4. The sample space is obtained by listing all the possibilities:

Y    Sample space                               Probability of each outcome    pmf
0    TTTT                                       (1−p)^4                        (1−p)^4
1    HTTT, THTT, TTHT, TTTH                     p(1−p)^3                       4p(1−p)^3
2    HHTT, HTHT, HTTH, THHT, THTH, TTHH         p^2(1−p)^2                     6p^2(1−p)^2
3    HHHT, HHTH, HTHH, THHH                     p^3(1−p)                       4p^3(1−p)
4    HHHH                                       p^4                            p^4


Thus, the pmf is given by

p_Y(y) = (4 choose y) p^y (1 − p)^(4−y),   y = 0, 1, 2, 3, 4

Note that p_Y(y) ≥ 0 for all y = 0, 1, 2, 3, 4 (a countable set), and is zero for all other values of y. Furthermore the pmf sums to one:

∑_{i=0}^{4} (4 choose i) p^i (1 − p)^(4−i) = (p + (1 − p))^4 = 1   (binomial theorem)

The random variable Y in Example A is an example of a binomial random variable. In general the probability mass function of a binomial random variable with parameters n and p, where n is the number of trials and p is the probability of observing a 'success' (be careful how you define the concept of 'success'), is given as:

p_X(x) = (n choose x) p^x (1 − p)^(n−x),   x = 0, 1, 2, ···, n

In shorthand notation we write this pmf as X ∼ B(n, p). For more exercises on the Binomial see Introstat, chapter 5.
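As an illustrative check (not part of the original notes), the pmf of Example A can be tabulated in Python and confirmed to satisfy property (iii); the value p = 0.5 is an arbitrary choice.

```python
from math import comb

def binomial_pmf(x, n, p):
    # p_X(x) = (n choose x) p^x (1 - p)^(n - x), for x = 0, 1, ..., n
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 4, 0.5                                # four tosses of an assumed fair coin
pmf = [binomial_pmf(y, n, p) for y in range(n + 1)]
print(pmf)                                   # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(pmf))                              # 1.0 -- property (iii) of a pmf
```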

Example B: Consider the following probability mass function:

p_X(x) = x/k,   x = 1, 2, 3, 4;  zero elsewhere

(1) Find the constant k so that p_X(x) satisfies the conditions of being a pmf of a random variable X.

(2) Find Pr[X = 2 or 3].

(3) Find Pr[3/2 < X < 9/2].

Solution:

(1) To find k we need to evaluate the sum of the pmf and set it equal to one (third property of a pmf):

∑_{i=1}^{4} p_X(x_i) = 1/k + 2/k + 3/k + 4/k = (1 + 2 + 3 + 4)/k = 1   (given it is a pmf)

Thus, 10/k = 1 ⟹ k = 10.

(2) Pr[X = 2 or 3] = Pr[X = 2] + Pr[X = 3] = 2/10 + 3/10 = 5/10.


(3) Pr[3/2 < X < 9/2] = Pr[2 ≤ X ≤ 4] = 1 − Pr[X = 1] = 1 − 1/10 = 9/10.

Example C: The pmf of the random variable S is given in the following table:

Value of S    0      1      2      3      4
p_S(s)        0.15   0.25   0.25   0.15   c

(1) Find the constant c so that p_S(s) satisfies the conditions of being a pmf of the random variable S.

(2) Find Pr[S = 6 or 2.5].

(3) Find Pr[S > 3].

Solution:

(1) To find c we need to evaluate the sum of the pmf and set it equal to one (third property of a pmf):

c = 1 − (0.15 + 0.25 + 0.25 + 0.15) = 0.20

(2) Pr[S = 6 or 2.5] = 0

(3) Pr[S > 3] = Pr[S = 4] = 0.20

In first year you covered other discrete probability mass functions (you need to revise them yourself), but here are some important notes on some of them:

Binomial (Example A and Introstat Chapter 5)

The binomial sampling situation refers to the number of "successes" (which may, in some cases, be the undesirable outcomes!) in n independent "trials", in which the probability of success in a single trial is p for all trials. This situation is often described initially in terms of "sampling with replacement", where the replacement ensures a constant value of p (as otherwise we have the hypergeometric distribution), but applies whenever sequential trials have constant success probability (e.g. sampling production from a continuous process).

Poisson (Introstat, Chapter 5)

(1) The Poisson distribution with parameter λ = np is introduced as an approximation to the binomial distribution when n is large and p is small.

(2) The Poisson distribution also arises in conjunction with the exponential distribution in the important context of a "memoryless (Poisson) process". Such a process describes discrete occurrences (e.g. failures of a piece of equipment, claims on an insurance policy), when the probabilities of future events do not depend on what has happened in the past. For example, if the probability that a machine will break down in the next hour is independent of how long it is since the last breakdown, then this is a memoryless process. For such a process in which the rate of occurrence is λ (number of occurrences per unit time), it was shown in INTROSTAT that (a) the time between successive occurrences is a continuous random variable having the exponential distribution with parameter λ (i.e. a mean of 1/λ); and (b) the number of occurrences in a fixed interval of time t is a discrete random variable having the Poisson distribution with parameter (i.e. mean) λt.
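The link between exponential inter-arrival times and Poisson counts can be illustrated by a small simulation (a sketch added here, not part of the notes; the rate λ = 3 and interval t = 2 are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
lam, t, runs = 3.0, 2.0, 50_000             # rate per unit time, interval length, simulations

counts = []
for _ in range(runs):
    total, n = 0.0, 0
    while True:
        total += rng.exponential(1 / lam)   # exponential gap with mean 1/lambda
        if total > t:
            break
        n += 1
    counts.append(n)

# the counts should behave like Poisson(lambda * t): mean close to 6
print(np.mean(counts))
```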

These distributions are assumed known; revise Chapters 5 and 7 (Introstat), and also check and see if you can prove that these probability mass functions satisfy the conditions of a pmf. For each discrete distribution do at least 4 exercises of Introstat.

You need to realize one important rule:


• Although some pmf's can have special names (e.g. binomial, Poisson, geometric, negative binomial), each will still follow the rules of a pmf.

1.4 Probability density functions

A discrete random variable can only give rise to a countable number of possible values; in contrast, a continuous random variable is a random variable whose set of possible values is uncountable (e.g. you can measure your weight to 2 decimals, 4 decimals - you can measure it to any degree of accuracy!). Examples of continuous random variables:

• Lifetime of an energy saver globe

• Petrol consumption of your car

• The length of time you have to wait for the Jammie shuttle

A continuous random variable (say X) is described by its probability density function (pdf). The function f_X(x) is the probability density function (pdf). Once again, the subscript X identifies the random variable under consideration, while the argument x is an arbitrary algebraic quantity. The pdf satisfies the following properties:

f_X(x) ≥ 0 ... but can be greater than 1!

∫_{−∞}^{∞} f_X(x) dx = 1

Pr[a < X ≤ b] = ∫_{a}^{b} f_X(x) dx

Example D: Let T be a random variable of the continuous type with pdf given by

f_T(t) = c t²   for 0 ≤ t ≤ 2

(1) Show that for f_T(t) to be a pdf, c = 3/8.

(2) Find P[T > 1].

(3) Draw the graph of f_T(t).

Solution:

(1) Since f_T(t) is a pdf, we must have that ∫_0^2 f_T(t) dt = 1:

∫_0^2 f_T(t) dt = ∫_0^2 c t² dt = c [t³/3]_0^2 = (c/3)(8 − 0) = 8c/3

thus 8c/3 = 1 (given it is a pdf) ⇒ c = 3/8


(2) P[T > 1] = 1 − P[T ≤ 1]
            = 1 − ∫_0^1 f_T(t) dt
            = 1 − ∫_0^1 (3/8) t² dt
            = 1 − (3/8) [t³/3]_0^1
            = 1 − (3/8)(1/3)
            = 7/8
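A quick numerical check of Example D (an added illustration, not part of the notes): the pdf should integrate to 1 over [0, 2], and the tail probability should come out as 7/8.

```python
from scipy.integrate import quad

f = lambda t: (3 / 8) * t**2     # pdf of T on [0, 2]
area, _ = quad(f, 0, 2)
tail, _ = quad(f, 1, 2)
print(area)                      # 1.0   -> f_T is a proper pdf
print(tail)                      # 0.875 -> P[T > 1] = 7/8
```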

Example E: Let W be a random variable of the continuous type with pdf given by

f_W(w) = 2e^(−2w)   for 0 < w < ∞

(1) Show that f_W(w) is a pdf.

(2) Find P[2 < W < 10].

(3) Draw the graph of f_W(w).

Solution:

(1) It is clear that f_W(w) ≥ 0; we need to show that ∫_0^∞ f_W(w) dw = 1:

∫_0^∞ f_W(w) dw = ∫_0^∞ 2e^(−2w) dw
               = −1 · ∫_0^∞ (−2)e^(−2w) dw
               = −1 · [e^(−2w)]_0^∞
               = −1 (e^(−∞) − e⁰)
               = −1 (0 − 1)
               = 1

(2) P[2 < W < 10] = ∫_2^10 2e^(−2w) dw
                 = −1 · ∫_2^10 (−2)e^(−2w) dw
                 = −1 · [e^(−2w)]_2^10
                 = −1 (e^(−20) − e^(−4))
                 = −1 (−0.01832)
                 = 0.01832


Comment: The random variable W follows an exponential distribution with parameter λ = 2 (see below, and Introstat, chapter 5).

Example F: Let X be a random variable of the continuous type with pdf given by

f_X(x) = 1/10   for 2 < x < c

(1) Find c so that f_X(x) is a pdf.

(2) Find P[X > 14].

(3) Find P[X ≤ 5].

(4) Draw the graph of f_X(x).

Solution:

(1) ∫_2^c f_X(x) dx = ∫_2^c (1/10) dx = [x/10]_2^c = (c − 2)/10

For f_X(x) to be a pdf, (c − 2)/10 = 1, thus c = 12.

(2) P[X > 14] = 0 (f_X(x) = 0 outside the bounds)

(3) P[X ≤ 5] = ∫_2^5 (1/10) dx = [x/10]_2^5 = (5 − 2)/10 = 3/10

Comment: The random variable X follows a uniform distribution (X ∼ U(2, 12)).

If the continuous random variable X is equally likely to take on any value in the interval (a, b) then X has the uniform distribution, X ∼ U(a, b), with probability density function

f_X(x) = 1/(b − a)   for a < x < b

and 0 otherwise. (Introstat, chapter 5).

Other continuous probability density functions that you came across in first year include the normal (Introstat, Chapter 5), t, F and chi-squared distributions (Introstat, Chapters 9 and 10, and in later chapters of these notes).

You need to realize two important rules:


• Although some pdf's can have special names (e.g. Weibull, log-normal, Laplace), each will still follow the rules of a pdf.

• A pdf given without its bounds is not a pdf! A pdf will always have a lower bound and an upper bound. Always specify these bounds, even when they are −∞ or ∞.

1.4.1 The Gamma distribution

For your maths toolbox:

(1) We first define the gamma function as follows:

Γ(n) = ∫_0^∞ x^(n−1) e^(−x) dx.

Note that this equation gives a function of n, and not a function of x, which is an arbitrarily chosen variable of integration! The argument n may be an integer or any real number.

(2) The easiest case is:

Γ(1) = ∫_0^∞ e^(−x) dx = 1.

(3) It can also be shown that Γ(1/2) = √π. (This follows by a change of variable in the integration from x to z = √(2x), and recognizing the form of the density of the normal distribution, whose integral is known. Only the students with strong mathematical skills should try showing this result.)

(4) An important property of the gamma function is given by the following, in which the second line follows by integration by parts:

Γ(n + 1) = ∫_0^∞ x^n e^(−x) dx
         = [x^n (−e^(−x))]_0^∞ − ∫_0^∞ n x^(n−1) (−e^(−x)) dx
         = 0 + n ∫_0^∞ x^(n−1) e^(−x) dx
         = n Γ(n)

Use of this result together with Γ(1) = 1 and Γ(1/2) = √π allows us to evaluate Γ(n) for all integer and half-integer arguments. In particular, it is easily confirmed that for integer values of n, Γ(n) = (n − 1)!.

(5) Evaluate

(a) Γ(5) = (5 − 1)! = 24

(b) Γ(4½) = Γ(3½ + 1) = 3½ Γ(3½) = ··· = 3½ · 2½ · 1½ · ½ · Γ(½) = 3½ · 2½ · 1½ · ½ · √π
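Both evaluations can be confirmed numerically (an added check, not part of the notes; scipy's gamma function is used here purely for verification).

```python
from math import factorial, pi, sqrt
from scipy.special import gamma

print(gamma(5), factorial(4))                         # 24.0  24
# Gamma(4.5) = 3.5 * 2.5 * 1.5 * 0.5 * sqrt(pi)
print(gamma(4.5), 3.5 * 2.5 * 1.5 * 0.5 * sqrt(pi))   # both approximately 11.6317
```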

The Gamma distribution is defined as:

f_X(x) = λ^α x^(α−1) e^(−λx) / Γ(α)   for 0 < x < ∞, λ > 0 and α > 0.

(Note that in general we will assume that the value of a density function is 0 outside of the range for which it is specifically defined.)


Exercise: Show that the pdf given above for the gamma distribution is indeed a proper pdf. (HINT: Transform the integration to a new variable u = λx.)

Alternative definition: In some texts, the gamma distribution is defined in terms of α and a parameter β defined by β = 1/λ. In this case the density is written as:

f_X(x) = x^(α−1) e^(−x/β) / (Γ(α) β^α)   for 0 < x < ∞.

Of course, the mathematical fact that f_X(x) satisfies the properties of a pdf does not imply or show that any random variable will have this distribution in practical situations. We shall, a little later in this course, demonstrate two important situations in which the gamma distribution arises naturally. These situations are:

(1) The gamma special case in which α = n/2 (for integer n) and λ = 1/2 is the χ2 (chi-squared) distribution with n degrees of freedom (which you met frequently in the first year course).

(2) When α = 1 the gamma distribution is called the exponential distribution.

1.4.2 The Beta distribution

We start by defining the beta function, as follows:

B(m, n) = ∫_0^1 x^(m−1) (1 − x)^(n−1) dx

Note that this equation gives a function of m and n (and not of x). Note the symmetry of the arguments: B(m, n) = B(n, m). It can be shown that the beta and gamma functions are related by the following expression, which we shall not prove:

B(m, n) = Γ(m)Γ(n) / Γ(m + n)

Clearly, the function defined by:

f_X(x) = x^(m−1) (1 − x)^(n−1) / B(m, n)   for 0 < x < 1

satisfies all the properties of a probability density function. A probability distribution with pdf given by f_X(x) is called the beta distribution, or more correctly the beta distribution of the first kind. (We shall meet the second kind shortly.) We will later see particular situations in which this distribution arises naturally, e.g. in comparing variances or sums of squares.
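A small numerical illustration (not from the notes) of the beta-gamma relationship quoted above, for the arbitrary choice m = 3, n = 5:

```python
from scipy.integrate import quad
from scipy.special import beta, gamma

m, n = 3, 5
lhs, _ = quad(lambda x: x**(m - 1) * (1 - x)**(n - 1), 0, 1)   # B(m, n) from its definition
rhs = gamma(m) * gamma(n) / gamma(m + n)
print(lhs, rhs, beta(m, n))                                    # all three equal 1/105 = 0.00952...
```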

1.5 Distribution function

The distribution function (sometimes denoted as the cumulative distribution function (cdf)) of the random variable X is defined by:

F_X(b) = Pr[X ≤ b]

The subscript X refers to the specific random variable under consideration. The argument b is an arbitrary algebraic symbol: we can just as easily talk of F_X(y), or F_X(t), or just F_X(5.17).

Since for any pair of real numbers a < b, the events a < X ≤ b and X ≤ a are mutually exclusive, while their union is the event X ≤ b, it follows that Pr[X ≤ b] = Pr[X ≤ a] + Pr[a < X ≤ b], or (equivalently):

Pr[a < X ≤ b] = F_X(b) − F_X(a)

Some properties of the cdf F are


(1) F is a monotone, non-decreasing function; that is, if a < b, then F_X(a) ≤ F_X(b).

(2) lim F_X(a) = 1 as a → ∞ (in practice it means that if a → upper bound then F_X(a) = 1).

(3) lim F_X(a) = 0 as a → −∞ (in practice it means that if a → lower bound then F_X(a) = 0).

Some examples: Find the cdf for Examples A - F:

Example A (cont): The cdf is given by:

F_Y(•) = ∑_{y=0}^{•} (4 choose y) p^y (1 − p)^(4−y)

e.g. for • = 1

F_Y(1) = (1 − p)^4 + 4p(1 − p)^3

Example B (cont): The cdf is given by:

F_X(•) = ∑_{x=1}^{•} x/10

e.g. for • = 3

F_X(3) = ∑_{x=1}^{3} x/10 = 1/10 + 2/10 + 3/10 = 6/10

Example C (cont): The cdf is given by:

F_S(•) = ∑_{s=0}^{•} p_S(s)

e.g. for • = 5

F_S(5) = F_S(4) = 1

Example D (cont): The cdf is given by:

F_T(♣) = ∫_0^♣ (3/8) t² dt = (3/8) [t³/3]_0^♣ = ♣³/8


Thus the cdf of T is:

F_T(t) = 0,       t < 0
       = t³/8,    0 ≤ t ≤ 2
       = 1,       t > 2

Example E (cont): The cdf is given by:

F_W(♦) = ∫_0^♦ 2e^(−2w) dw = −1 · ∫_0^♦ (−2)e^(−2w) dw = −1 · [e^(−2w)]_0^♦ = −1 [e^(−2♦) − e⁰] = 1 − e^(−2♦)

Thus the cdf of W is given by

F_W(w) = 0,            w ≤ 0
       = 1 − e^(−2w),  w > 0

Example F (cont): The cdf is given by:

F_X(♥) = ∫_2^♥ (1/10) dx = [x/10]_2^♥ = (♥ − 2)/10

Thus the cdf of X is given by

F_X(x) = 0,            x ≤ 2
       = (x − 2)/10,   2 < x < 12
       = 1,            x ≥ 12

The cumulative distribution function can also be used to find the quantiles (e.g. lower quartile, median and upper quartile) of distributions. (Reminder from first year (Introstat, chapter 1): 1/4 of the sample data is below the lower quartile; 1/2 (or 50%) is below the median; and 3/4 (75%) is below the upper quartile.)

Example D (quantiles): Use the cdf of T and find the median:


The cdf of T is given by:

F_T(t) = 0,       t < 0
       = t³/8,    0 ≤ t ≤ 2
       = 1,       t > 2

The median can be denoted by ε_{1/2} or by t_(m).

F_T(t_(m)) = t_(m)³/8 = 1/2

thus t_(m)³ = 8/2 = 4, and the median is t_(m) = 4^(1/3).

Example E (quantiles): Use the cdf of W and find the lower quartile:

The cdf of W is given by

F_W(w) = 0,            w ≤ 0
       = 1 − e^(−2w),  w > 0

The lower quartile can be denoted by ε_{1/4} or by w_(l).

F_W(w_(l)) = 1 − e^(−2 w_(l)) = 1/4

thus

e^(−2 w_(l)) = 3/4
−2 w_(l) = ln(3/4)
w_(l) = 0.143841

Example F (quantiles): Use the cdf of X and find the upper quartile:

The cdf of X is given by:

F_X(x) = 0,            x ≤ 2
       = (x − 2)/10,   2 < x < 12
       = 1,            x ≥ 12

The upper quartile can be denoted by ε_{3/4} or by x_(u).

F_X(x_(u)) = (x_(u) − 2)/10 = 3/4

and the upper quartile is x_(u) = 9½.
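Each quantile calculation above amounts to inverting the cdf at the required probability. The sketch below (an added illustration, not part of the notes) repeats the three calculations with a numerical root-finder.

```python
import math
from scipy.optimize import brentq

# Example D: F_T(t) = t^3 / 8 on [0, 2]; the median solves F_T(t) = 1/2
print(brentq(lambda t: t**3 / 8 - 0.5, 0, 2))                  # 1.5874... = 4^(1/3)

# Example E: F_W(w) = 1 - exp(-2w); the lower quartile solves F_W(w) = 1/4
print(brentq(lambda w: 1 - math.exp(-2 * w) - 0.25, 0, 10))    # 0.14384...

# Example F: F_X(x) = (x - 2)/10 on (2, 12); the upper quartile solves F_X(x) = 3/4
print(brentq(lambda x: (x - 2) / 10 - 0.75, 2, 12))            # 9.5
```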


1.6 Functions of random variables - cumulative distribution technique

Before proceeding to consideration of specific distributions, let us briefly review some results from integration theory. Suppose we have an integral of the form:

∫_a^b f(x) dx

but that we would rather work in terms of a "new" variable defined by u = g(x). There are at least two possible reasons for this change:

(i) the variable u may have more physical or statistical meaning (i.e. a statistical reason); or

(ii) it may be easier to solve the integral in terms of u than in terms of x (i.e. a mathematical reason).

We shall suppose that the function g(x) is monotone (increasing or decreasing) and continuously differentiable. We can then define a continuously differentiable inverse function, say g^(−1)(u), which is nothing more than the solution for x in terms of u from the equation g(x) = u. For example, if g(x) = e^(−x), then g^(−1)(u) = −ln(u). We then define the Jacobian of the transformation from x to u by:

|J| = |dx/du| = |dg^(−1)(u)/du|

Note that J is a function of u.

Example: Continuing with the example of g(x) = e^(−x), and g^(−1)(u) = −ln(u), we have that:

|J| = |d[−ln(u)]/du| = |−1/u| = 1/u

since u > 0.

Important Note: Since dx/du = [du/dx]^(−1), it follows that we can also define the Jacobian by |J| = |dg(x)/dx|^(−1), but the result will still be written in terms of x, requiring a further substitution to get it in terms of u as required. Note also that some texts define the Jacobian as the inverse of our definition, and care has thus to be exercised in interpreting results involving Jacobians.

Example (continued): In the previous example, dg(x)/dx = −e^(−x), and thus the Jacobian could be written as e^x; substituting x = −ln(u) then gives |J| = e^(−ln(u)) = u^(−1), as before.

Theorem 1.1 For any monotone function g(x), defining a transformation of the variable of integration:

∫_a^b f(x) dx = ∫_c^d f[g^(−1)(u)] |J| du

where c is the smaller of g(a) and g(b), and d the larger.

This theorem then defines a procedure for changing the variable of integration from x to u = g(x) (where g(x) is monotone):

(1) Solve for x in terms of u to get the inverse function g^(−1)(u).


(2) Differentiate g^(−1)(u) and take the absolute value to obtain the Jacobian |J|.

(3) Calculate the minimum and maximum values for u (i.e. c and d in the theorem).

(4) Write down the new integral, as given by the theorem.

Example: Evaluate:

∫_0^∞ x e^(−½x²) dx

(1) Solve for x in terms of u to get the inverse function g^(−1)(u).

Substitute u = ½x², which is monotone over the range given, which gives x = √(2u), and

(2) Differentiate g^(−1)(u) to obtain the Jacobian |J|:

|J| = |d√(2u)/du| = 1/√(2u).

(3) Calculate the minimum and maximum values for u (i.e. c and d in the theorem).

Clearly, u also runs from 0 to ∞.

(4) Write down the new integral, as given by the theorem:

∫_0^∞ x e^(−½x²) dx = ∫_0^∞ √(2u) e^(−u) (1/√(2u)) du = ∫_0^∞ e^(−u) du = 1
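The substitution can be double-checked by direct numerical integration (an added illustration, not part of the notes):

```python
import numpy as np
from scipy.integrate import quad

value, _ = quad(lambda x: x * np.exp(-x**2 / 2), 0, np.inf)
print(value)   # 1.0, agreeing with the substitution u = x^2 / 2
```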

We consider Examples D, E and F. In what follows we are going to use the cdf (found in Examples D, E and F) to evaluate the cdf and pdf of a function of the random variables.

Example D (cont): Define Y = (T + 2)/2, find the pdf of Y by using the cdf of T.

Solution:

F_Y(y) = Pr[Y ≤ y]
       = Pr[(T + 2)/2 ≤ y]
       = Pr[T ≤ 2y − 2]
       = F_T(2y − 2)
       = (2y − 2)³/8

thus the pdf of Y is:

f_Y(y) = dF_Y(y)/dy = 3(2y − 2)² · 2 / 8 = 3(y − 1)²   for 1 ≤ y ≤ 2

Homework: check that your answer is right.

Example E (cont): Define H = 2W , find the pdf of H by using the cdf of W .

Solution:

F_H(h) = Pr[H ≤ h]
       = Pr[2W ≤ h]
       = Pr[W ≤ h/2]
       = F_W(h/2)
       = 1 − exp[−2(h/2)]
       = 1 − exp[−h]


thus the pdf of H is:

f_H(h) = dF_H(h)/dh = 0 + exp[−h]   for h > 0.

Example F (cont): Define Y = X + 8, find the pdf of Y by using the cdf of X.

Solution:

F_Y(y) = Pr[Y ≤ y]
       = Pr[X + 8 ≤ y]
       = Pr[X ≤ y − 8]
       = F_X(y − 8)
       = (y − 8 − 2)/10
       = (y − 10)/10

thus the pdf of Y is:

f_Y(y) = dF_Y(y)/dy = 1/10   for 10 < y < 20.

Homework: check that these answers are correct in D,E and F.
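One way to tackle the homework check for Example D is a quick Monte Carlo comparison (a sketch, not the only method, added here as an illustration): simulate T by inverting its cdf, transform to Y = (T + 2)/2, and compare a histogram with the derived density 3(y − 1)² on [1, 2].

```python
import numpy as np

rng = np.random.default_rng(1)

# simulate T with cdf F_T(t) = t^3 / 8 on [0, 2] by inversion: T = (8U)^(1/3)
u = rng.uniform(size=200_000)
t = (8 * u) ** (1 / 3)
y = (t + 2) / 2                                      # the transformation of Example D

hist, edges = np.histogram(y, bins=20, range=(1, 2), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - 3 * (mids - 1) ** 2)))    # small -> histogram matches 3(y - 1)^2
```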

Tutorial Exercises

1. Which of the following functions are valid probability mass functions or probability density functions:

(a) f_X(x) = (1/6)(x² − 1),  0 ≤ x ≤ 3
          = 0 otherwise

(b) f_X(x) = (7/20)(x² − 1),  1 ≤ x ≤ 3
          = 0 otherwise

(c) f_X(x) = 1/5,  1 ≤ x ≤ 6
          = 0 otherwise

(d) p_X(x) = e^(−1)/x!,  x = 1, 2, ···

(e) p_S(s) = s,  s = 1/12; 3/12; 1/3; 5/12
          = 0 otherwise


(f) p_T(t) = (4 choose t)(1/2)^4,  t = 0, 1, 2, 3
          = 0 otherwise

2. In question one, state the conditions that would make the invalid pdf's or pmf's valid.

3. For what values of D can p_Y(y) be a probability mass function?

p_Y(y) = (1 − D)/4,  y = 0
       = (1 + D)/2,  y = 2
       = (1 − D)/4,  y = 4
       = 0 otherwise

4. A random variable X has probability density function given by

f_X(x) = k x³,  0 ≤ x ≤ 4
       = 0 otherwise

(a) For f_X(x) to be a valid pdf show that k = 1/64.

(b) Find the cumulative distribution function (cdf) of X .

(c) Find the Pr[−2 < x ≤ 3].

(d) Find the lower quartile.

(e) Say Y = 16X , find the pdf of Y .

5. A random variable Y has probability density function given by

f_Y(y) = c e^(−y),  y ≥ 0
       = 0 otherwise

(a) For f_Y(y) to be a valid pdf show that c = 1.

(b) Find the cumulative distribution function (cdf) of Y .

(c) Find the Pr[y > 4].

(d) Find the median.

(e) Say S = 3Y , find the pdf of S .

6. A random variable X has probability density function given by

f_X(x) = 2x,  0 ≤ x ≤ b
       = 0 otherwise

(a) For f_X(x) to be a valid pdf show that b = 1.

(b) Find the cumulative distribution function (cdf) of X .

(c) Find the Pr[1/2 < x ≤ 5].


(d) Find the upper quartile.

(e) Say Y = 4X , find the pdf of Y .

7. A random variable X has probability mass function given by

p_X(x) = e^(−4) c^x / x!,  x = 0, 1, 2, ···
       = 0 otherwise

(a) For p_X(x) to be a valid pmf show that c = 4.

(b) Find the cumulative distribution function (cdf) of X (hint: do it term by term, no pattern emerges).

(c) Find the Pr[x > 3].

8. A random variable T has probability mass function given by


p_T(t) = (5 choose t)(4 choose 2 − t)/c,  t = 0, 1, 2
       = 0 otherwise

(a) For p_T(t) to be a valid pmf show that c = 36.

(b) Find the cumulative distribution function (cdf) of T .

(c) Find the Pr[t > 3].

(d) Find the Pr[t = 23].

10. A random variable X has probability density function given by

f_X(x) = 10/x²,  x > c
       = 0,      x ≤ c

(a) For f_X(x) to be a valid pdf show that c = 10.

(b) Find the cumulative distribution function (cdf) of X .

(c) Find the Pr[−2 < x ≤ 3].

(d) Find the lower quartile.

(e) Find the Pr[x > 7].

11. Two fair dice are rolled. Let S equal the sum of the two dice.

(a) List all the possible values S can take.

(b) List the sample space.

(c) Find the pmf of S .

12. Let D represent the difference between the number of heads and the number of tails obtained when an (unbiased) coin is tossed 5 times.

(a) List all the possible values D can take.

(b) List the sample space.

(c) Find the pmf of D.


13. The following question is from Ross (1998); suppose X has the following cumulative distribution function:

F_X(∗) = 0,                  ∗ < 0
       = ∗/4,                0 ≤ ∗ < 1
       = 1/2 + (∗ − 1)/4,    1 ≤ ∗ < 2
       = 11/12,              2 ≤ ∗ < 3
       = 1,                  3 ≤ ∗

(a) Find Pr[X = i], i = 1,2,3.


Chapter 2

Bivariate Distributions

Assumed statistical background

• Chapter one of these notes

Maths Toolbox

• Please work through the section on double integrals - Maths Toolbox B.4.

2.1 Joint random variables

Up to now, we have assumed that our "random experiment" results in the observation of a value of a single "random variable". Quite typically, however, a single experiment (observation of the real world) will result in a (possibly quite large) collection of measurements, which can be expressed as a vector of observations describing the outcome of the experiment. For example:

• A medical researcher may record and report various characteristics of each subject being tested, such as blood pressure, height, weight, smoking habits, etc., as well as the direct medical observations;

• An investment analyst may wish to examine a large number of financial indicators specific for each share under consideration;

• The environmentalist would be interested in recording many different pollutants in the air.

Each such vector of observations is termed a multivariate observation or random variable. A very large proportion of statistical analysis is concerned with the analysis of such multivariate data. For this course (except in the chapter on regression), however, we will be limiting ourselves to the simplest case of two variables only, i.e. bivariate random variables.

For any bivariate pair of random variables, say (X, Y), we can define events such as X ≤ x, Y ≤ y, i.e. the joint occurrence of the events X ≤ x and Y ≤ y. Once again, upper case letters (X, Y) will denote random variables, while lower case letters (x, y) denote observable real numbers. As with univariate random variables, the discrete case is easily handled by means of a probability mass function. We shall once again, with no loss of generality, assume that the discrete random variables are defined on the non-negative integers. For a discrete pair of random variables, the joint probability mass function is defined by:

pXY (x, y) = Pr[X = x, Y = y]


i.e. by the probability of the joint occurrence of X = x and Y = y. Just as in the univariate case, the joint pmf will have the following three properties:

(i) p_XY(x, y) ≥ 0
(ii) p_XY(x, y) > 0 on a countable set, and
(iii) ∑_x ∑_y p_XY(x, y) = 1

The joint cumulative distribution function is given as:

F_XY(x, y) = ∑_{i=0}^{x} ∑_{j=0}^{y} p_XY(i, j).

Note that the event X = x is equivalent to the union of all (mutually disjoint) events of the form X = x, Y = y as y ranges over all non-negative integers. It thus follows that the marginal probability mass function of X, i.e. the probability that X = x, must be given by:

p_X(x) = ∑_{y=0}^{∞} p_XY(x, y)

and similarly the marginal for Y is:

p_Y(y) = ∑_{x=0}^{∞} p_XY(x, y).

Example G: Assume the joint probability mass function of X and Y is given in the following joint probability table:

        x = 0     x = 1     x = 2     x = 3
y = 0   0.03125   0.06250   0.03125   0
y = 1   0.06250   0.15625   0.12500   0.03125
y = 2   0.03125   0.12500   0.15625   0.06250
y = 3   0         0.03125   0.06250   0.03125

Note the following:

From the table it is clear that X can take on the values x = 0, 1, 2, 3 and Y the values y = 0, 1, 2, 3. The pmf is only defined over this range of 16 points in two-dimensional space.

p_XY(0, 1) = Pr[X = 0, Y = 1] = 0.06250
p_XY(5, 1) = 0 (X is outside the valid range)
∑_{i=0}^{3} ∑_{j=0}^{3} p_XY(i, j) = 1 (the joint pmf sums to 1)

Summing down each column then gives the marginal pmf for X ; so, for example:

p_X(0) = 0.03125 + 0.0625 + 0.03125 + 0 = 0.125

and similarly p_X(1) = p_X(2) = 0.375 and p_X(3) = 0.125. It is easily seen that this is also the marginal distribution of Y (obtained by adding across the rows of the table).
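The marginal calculations for Example G can be reproduced directly from the table (an added illustration, not part of the notes):

```python
import numpy as np

# joint pmf of (X, Y) from Example G: rows indexed by y = 0..3, columns by x = 0..3
p = np.array([[0.03125, 0.06250, 0.03125, 0.00000],
              [0.06250, 0.15625, 0.12500, 0.03125],
              [0.03125, 0.12500, 0.15625, 0.06250],
              [0.00000, 0.03125, 0.06250, 0.03125]])

print(p.sum())          # 1.0 -- the joint pmf sums to one
print(p.sum(axis=0))    # marginal pmf of X: [0.125, 0.375, 0.375, 0.125]
print(p.sum(axis=1))    # marginal pmf of Y: [0.125, 0.375, 0.375, 0.125]
```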


For continuous random variables, we need to introduce the concept of the joint probability density function f_XY(x, y). In principle, the joint pdf is defined to be the function for which:

Pr[a < X ≤ b, c < Y ≤ d] = ∫_{x=a}^{b} ∫_{y=c}^{d} f_XY(x, y) dy dx

for all a < b and c < d. Just as in the univariate case, the joint pdf will have the following three properties:

(i) f_XY(x, y) ≥ 0
(ii) f_XY(x, y) > 0 on a measurable set, and
(iii) the total volume (under the surface over the x-y plane) is one: ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_XY(x, y) dx dy = 1

Note that in particular this condition requires that:

F_XY(x, y) = ∫_{u=−∞}^{x} ∫_{v=−∞}^{y} f_XY(u, v) dv du

and it is easy to move from the joint cdf to the joint pdf:

f_XY(x, y) = ∂/∂y [∂F_XY(x, y)/∂x] = ∂²F_XY(x, y)/∂x∂y

As for discrete random variables, we can also define marginal pdf’s for X and Y :

f_X(x) = ∫_{y=−∞}^{∞} f_XY(x, y) dy

and similarly for the marginal pdf of Y :

f_Y(y) = ∫_{x=−∞}^{∞} f_XY(x, y) dx.

The marginal pdf's describe the overall variation in one variable, irrespective of what happens with the other. For example, if X and Y represent height and weight of a randomly chosen individual from a population, then the marginal distribution of X describes the distribution of heights in the population.

Example H: Suppose the joint pdf of X, Y is given by:

f_XY(x, y) = (3/28)(x + y²)   for 0 ≤ x ≤ 2, 0 ≤ y ≤ 2

Is this function a joint pdf? Yes, because

• f_XY(x, y) ≥ 0 and


• ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_XY(x, y) dy dx = ∫_{x=0}^{2} ∫_{y=0}^{2} (3/28)(x + y²) dy dx
  = (3/28) [ ∫_{x=0}^{2} ∫_{y=0}^{2} x dy dx + ∫_{x=0}^{2} ∫_{y=0}^{2} y² dy dx ]
  = (3/28) [ ∫_{x=0}^{2} x [y]_{y=0}^{2} dx + ∫_{x=0}^{2} [y³/3]_{y=0}^{2} dx ]
  = (3/28) [ 2 ∫_{x=0}^{2} x dx + ∫_{x=0}^{2} (8/3) dx ]
  = (3/28) [ 2 [x²/2]_{x=0}^{2} + (8/3) [x]_{x=0}^{2} ]
  = (3/28) [ 4 + 16/3 ]
  = (3/28)(28/3) = 1

The marginal pdf of X is:

f_X(x) = ∫_{y=0}^{2} (3/28)(x + y²) dy
       = (3/28) [xy + y³/3]_{y=0}^{2}
       = (3/28)(2x + 8/3)   for 0 ≤ x ≤ 2

The marginal pdf of Y is:

f_Y(y) = ∫_{x=0}^{2} (3/28)(x + y²) dx
       = (3/28) [x²/2 + y²x]_{x=0}^{2}
       = (3/28)(2 + 2y²)   for 0 ≤ y ≤ 2

Note that marginal pdf’s are still pdf’s!

Find the probability that both X and Y are less than one:


Pr[0 ≤ X ≤ 1, 0 ≤ Y ≤ 1] = ∫_{x=0}^{1} ∫_{y=0}^{1} (3/28)(x + y²) dy dx
  = (3/28) [ ∫_{x=0}^{1} ∫_{y=0}^{1} x dy dx + ∫_{x=0}^{1} ∫_{y=0}^{1} y² dy dx ]
  = (3/28) [ ∫_{x=0}^{1} x [y]_{y=0}^{1} dx + ∫_{x=0}^{1} [y³/3]_{y=0}^{1} dx ]
  = (3/28) [ ∫_{x=0}^{1} x dx + ∫_{x=0}^{1} (1/3) dx ]
  = (3/28) [ [x²/2]_{x=0}^{1} + (1/3) [x]_{x=0}^{1} ]
  = (3/28) [ 1/2 + 1/3 ]
  = (3/28) (3 + 2)/6 = 5/56
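Both of these double integrals can be verified numerically (an added check, not part of the notes):

```python
from scipy.integrate import dblquad

f = lambda y, x: (3 / 28) * (x + y**2)   # dblquad integrates over y first, then x

total, _ = dblquad(f, 0, 2, 0, 2)        # x from 0 to 2, y from 0 to 2
prob, _  = dblquad(f, 0, 1, 0, 1)        # both variables below 1
print(total)                             # 1.0
print(prob)                              # 0.0892857... = 5/56
```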

Example I: Suppose that X and Y are continuous random variables, with joint pdf given by:

f_XY(x, y) = c e^(−x) e^(−2y)   for x, y > 0

1. Find c.

We need to integrate out the joint pdf and set the definite integral equal to 1:

∫_{−∞}^{∞} ∫_{−∞}^{∞} f_XY(x, y) dy dx = ∫_{x=0}^{∞} ∫_{y=0}^{∞} c e^(−x) e^(−2y) dy dx
  = c ∫_{x=0}^{∞} e^(−x) [ ∫_{y=0}^{∞} e^(−2y) dy ] dx
  = c ∫_{x=0}^{∞} e^(−x) (1/(−2)) ∫_{y=0}^{∞} (−2) e^(−2y) dy dx
  = (c/(−2)) ∫_{x=0}^{∞} e^(−x) [e^(−2y)]_{y=0}^{∞} dx
  = (c/2) ∫_{x=0}^{∞} e^(−x) dx
  = (c/2) (−1) [e^(−x)]_{x=0}^{∞}
  = c/2

thus setting the integral equal to 1 we obtain c = 2.

2. Find the marginal pdf for X :


f_X(x) = ∫_{y=0}^{∞} 2 e^(−x) e^(−2y) dy
       = e^(−x) (−1) ∫_{y=0}^{∞} (−2) e^(−2y) dy
       = −e^(−x) [e^(−2y)]_{y=0}^{∞}
       = −e^(−x) [0 − e⁰]
       = e^(−x)   for x > 0

3. Find the marginal pdf of Y :

f_Y(y) = ∫_{x=0}^{∞} 2 e^(−x) e^(−2y) dx
       = 2 e^(−2y) (−1) [e^(−x)]_{x=0}^{∞}
       = 2 e^(−2y)   for y > 0

Note that marginal pdf’s are still pdf’s!

4. Find Pr[Y < X ].

We need to evaluate the double integral over the region y < x, within the domain 0 < x < ∞ and 0 < y < ∞ (see figure 2.1).

Pr[Y < X] = ∫∫_{y<x} 2 e^(−x) e^(−2y) dy dx
  = ∫_{x=0}^{∞} ∫_{y=0}^{x} 2 e^(−x) e^(−2y) dy dx
  = ∫_{x=0}^{∞} e^(−x) (−1) [e^(−2y)]_{y=0}^{x} dx
  = ∫_{x=0}^{∞} e^(−x) [1 − e^(−2x)] dx
  = ∫_{x=0}^{∞} e^(−x) dx − ∫_{x=0}^{∞} e^(−3x) dx
  = (−1) [e^(−x)]_{x=0}^{∞} − (−1/3) [e^(−3x)]_{x=0}^{∞}
  = 1 − 1/3
  = 2/3
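Since X and Y here turn out to be exponential variables with rates 1 and 2 (and, as shown in the next section, independent), the result Pr[Y < X] = 2/3 can be checked by simulation (an added illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=500_000)   # f_X(x) = e^(-x)
y = rng.exponential(scale=0.5, size=500_000)   # f_Y(y) = 2e^(-2y), mean 1/2
print(np.mean(y < x))                          # approximately 2/3
```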

2.2 Independence and conditional distributions

Recall from the first year notes the concepts of conditional probabilities and of independence of events. If A and B are two events then the probability of A conditional on the occurrence of B is given by:

Pr[A|B] = Pr[A ∩ B] / Pr[B]


Figure 2.1: Dotted area indicates region of integration

provided that Pr[B] > 0. The concept of the intersection of two events (A ∩ B) is that of the joint occurrence of both A and B. The events A and B are independent if Pr[A ∩ B] = Pr[A]·Pr[B], which implies that Pr[A|B] = Pr[A] and Pr[B|A] = Pr[B] whenever the conditional probabilities are defined.

The same ideas carry over to the consideration of bivariate (or multivariate) random variables. For discrete random variables, the linkage is direct: we have immediately that:

Pr[X = x | Y = y] = Pr[X = x, Y = y] / Pr[Y = y] = p_XY(x, y) / p_Y(y)

provided that p_Y(y) > 0. This relationship applies for any x and y such that p_Y(y) > 0, and defines the conditional probability mass function for X, given that Y = y. We write the conditional pmf as:

p_{X|Y}(x|y) = p_XY(x, y) / p_Y(y).

In similar manner, we define the conditional probability mass function for Y, given that X = x, i.e. p_{Y|X}(y|x).

By definition of independent events, the events X = x and Y = y are independent if and only if p_XY(x, y) = p_X(x)·p_Y(y).

If this equation holds true for all x and y, then we say that the random variables X and Y are independent. In this case, it is easily seen that all events of the form a < X ≤ b and c < Y ≤ d are independent of each other, which (inter alia) also implies that F_XY(x, y) = F_X(x)·F_Y(y) for all x and y.

Example G (cont.) Refer back to Example G in the previous section, and calculate the conditional probability mass function for X, given Y = 2. We have noted that p_Y(2) = 0.375,


and thus:

p_{X|Y}(0|2) = p_XY(0, 2) / p_Y(2) = 0.03125 / 0.375 = 0.0833
p_{X|Y}(1|2) = p_XY(1, 2) / p_Y(2) = 0.12500 / 0.375 = 0.3333
p_{X|Y}(2|2) = p_XY(2, 2) / p_Y(2) = 0.15625 / 0.375 = 0.4167
p_{X|Y}(3|2) = p_XY(3, 2) / p_Y(2) = 0.06250 / 0.375 = 0.1667

Note that the conditional probabilities again add to 1, as required.

The random variables are not independent, since (for example) p_X(2) = p_Y(2) = 0.375, and thus p_X(2)·p_Y(2) = 0.140625, while p_XY(2, 2) = 0.125.

Once again there is a slight technical problem when it comes to continuous distributions, since all events of the form X = x have zero probability. Nevertheless, we continue to define the conditional probability density function for X given Y = y as:

f_{X|Y}(x|y) = f_XY(x, y) / f_Y(y)

provided that f_Y(y) > 0, and similarly for Y given X = x. This corresponds to the formal definition of conditional probabilities in the sense that for "small enough" values of h > 0:

Pr[a < X ≤ b | y < Y ≤ y + h] = ∫_{x=a}^{b} f_{X|Y}(x|y) dx

The continuous random variables X and Y are independent if and only if f_XY(x, y) = f_X(x) f_Y(y) for all x and y, which is clearly equivalent to the statement that the marginal and conditional pdf's are identical.

Example H (cont.) In this example, we have from the previous results that:

f_X(x) f_Y(y) = (3/28)(2x + 8/3) · (3/28)(2 + 2y²) ≠ f_XY(x, y)

thus X and Y are not independent and the conditional pdf of Y |X is

f_{Y|X}(y|x) = f_XY(x, y) / f_X(x) = [(3/28)(x + y²)] / [(3/28)(2x + 8/3)] = (x + y²) / (2x + 8/3)   for 0 ≤ y ≤ 2, given any x ∈ [0, 2]

Homework: 1. Find the conditional pdf of X |Y and

2. Show that the two conditional pdf’s are indeed pdf’s.

Find Pr[(Y < 1)|(X = 0)]:


Pr[(Y < 1)|(X = 0)] = ∫_{y=0}^{1} f_{Y|X}(y|0) dy
                   = ∫_{y=0}^{1} (3y²/8) dy
                   = (3/8) [y³/3]_0^1
                   = 1/8

Example I (cont.) In this example, we have from the previous results that:

f_X(x) f_Y(y) = e^(−x) · 2e^(−2y) = f_XY(x, y)

thus X and Y are independent, and the conditional pdf of Y|X is just the marginal of Y:

f_{Y|X}(y|x) = f_XY(x, y) / f_X(x) = f_X(x) f_Y(y) / f_X(x) = f_Y(y)

2.3 The bivariate Gaussian distribution

In the same way that the Gaussian (normal) distribution played such an important role in the univariate statistics of the first year syllabus, its generalization is equally important for bivariate (and in general, multivariate) random variables. The bivariate Gaussian (normal) distribution is defined by the following joint probability density function for (X, Y):

f_XY(x, y) = [1 / (2π σ_X σ_Y √(1 − ρ²))] exp{ −Q(x, y) / [2(1 − ρ²)] }        (2.1)

where Q(x, y) is a quadratic function in x and y defined by:

Q(x, y) = ((x − µ_X)/σ_X)² − 2ρ ((x − µ_X)/σ_X)((y − µ_Y)/σ_Y) + ((y − µ_Y)/σ_Y)²

In matrix notation this pdf can be expressed in the form:

f_XY(x, y) = [1 / (2π |Σ|^(1/2))] exp{ −(1/2) (z − µ)′ Σ^(−1) (z − µ) }

where z and µ are column vectors of (x, y) and (µ_X, µ_Y) respectively, and the matrix Σ is given by:

Σ = [ σ_X²        ρσ_Xσ_Y ]
    [ ρσ_Xσ_Y     σ_Y²    ]

The form (2.1) applies also to the general multivariate Gaussian distribution, except that for a multivariate random variable of dimension p, the 2π term is raised to the power of p/2.

We now briefly introduce a few key properties of the bivariate Gaussian distribution:

PROPERTY 1: MARGINAL DISTRIBUTIONS ARE GAUSSIAN.

• In other words, f_Y(y) is the pdf of a Gaussian distribution with mean µ_Y and variance σ_Y².


• Similarly, the marginal distribution of X is Gaussian with mean µ_X and variance σ_X².

PROPERTY 2: CONDITIONAL DISTRIBUTIONS ARE GAUSSIAN.

• The conditional pdf for X given Y = y is a Gaussian distribution, with mean (i.e. the conditional mean for X given Y = y) µ_X + [ρσ_X/σ_Y](y − µ_Y) and conditional variance σ_X²(1 − ρ²).

• Similarly the conditional distribution for Y given X = x is Gaussian with mean µ_Y + [ρσ_Y/σ_X](x − µ_X) and conditional variance σ_Y²(1 − ρ²).

Note that both regressions are linear, but that the slopes are not reciprocals of each other unless |ρ| = 1.

Note also that the conditional distributions have reduced variances, by the fractional factor (1 − ρ²_XY).
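Property 2 can be illustrated by simulation (a sketch with arbitrary parameter values, not part of the notes): draw from a bivariate Gaussian and compare the mean and variance of Y within a thin slice of x values against the stated conditional formulas.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_x, mu_y, sd_x, sd_y, rho = 1.0, 2.0, 1.0, 2.0, 0.6     # arbitrary illustrative values

cov = [[sd_x**2, rho * sd_x * sd_y],
       [rho * sd_x * sd_y, sd_y**2]]
xy = rng.multivariate_normal([mu_x, mu_y], cov, size=1_000_000)

x0 = 1.5                                                  # condition (approximately) on X = x0
slice_y = xy[np.abs(xy[:, 0] - x0) < 0.02, 1]

print(slice_y.mean(), mu_y + rho * sd_y / sd_x * (x0 - mu_x))   # conditional mean: 2.6
print(slice_y.var(),  sd_y**2 * (1 - rho**2))                   # conditional variance: 2.56
```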

PROPERTY 3: ρ_XY = 0 IMPLIES THAT X AND Y ARE INDEPENDENT.

2.4 Functions of bivariate random variables

When faced with bivariate, or in general multivariate, random variables (observations), there is often a need to derive the distributions of particular functions of the individual variables. There are at least two reasons why this may be necessary:

• It may be useful to summarize the data in some way, by taking the means, sums, differences or ratios of similar or related variables. For example, height and weight together give some measure of obesity, but weight divided by (height)² may be a more succinct measure.

• Some function, such as the ratio of two random variables, may be more physically meaningful than the individual variables on their own. For example, the ratio of prices of two commodities is more easily compared across countries and across time, than are the individual variables.

2.4.1 General principles of the transformation technique

The principles will be stated for the bivariate case only, but do in fact apply to multivariate distributions in general.

Suppose that we know the bivariate probability distribution of a pair (X, Y) of random variables, and that two new random variables U and V are defined by the following transformations:

U = g(X, Y ) V = h(X, Y )

which are assumed to be jointly one-to-one (i.e. each pair X, Y is transformed to a unique pair U, V). In other words, each event of the form X = x, Y = y corresponds uniquely to the event U = u, V = v, where u = g(x, y) and v = h(x, y). The uniqueness of this correspondence allows us in principle to solve for x and y in terms of u and v, whenever only u and v are known. This solution defines an inverse function, which we shall express in the form:

x = φ(u, v) y = ψ(u, v).

Let us further suppose that all the above functions are continuously differentiable. We can then define the Jacobian of the transformation (precisely as we did for change of variables in multiple integrals) as follows:


|J| = \begin{vmatrix} \partial\phi(u,v)/\partial u & \partial\phi(u,v)/\partial v \\ \partial\psi(u,v)/\partial u & \partial\psi(u,v)/\partial v \end{vmatrix} = \frac{\partial\phi(u,v)}{\partial u}\frac{\partial\psi(u,v)}{\partial v} - \frac{\partial\psi(u,v)}{\partial u}\frac{\partial\phi(u,v)}{\partial v}

We then have the following theorem.

Theorem 2.1 Suppose that the joint pdf of X and Y is given by f_{XY}(x, y), and that the continuously differentiable functions g(x, y) and h(x, y) define a one-to-one transformation of the random variables X and Y to U = g(X, Y) and V = h(X, Y), with inverse transformation given by X = φ(U, V) and Y = ψ(U, V). The joint pdf of U and V is then given by:

f_{UV}(u, v) = f_{XY}(\phi(u, v), \psi(u, v))\, |J|

Note 1: We have to have as many new variables (i.e. U and V) as we had original variables (in this case 2). The method only works in this case. Even if we are only interested in a single transformation (e.g. U = X + Y) we need to “invent” a second variable, quite often something trivial such as V = Y. We will then have the joint distribution of U and V, and we will need to extract the marginal pdf of U by integration.

Note 2: Some texts define the Jacobian in terms of the derivatives of g(x, y) and h(x, y) w.r.t. x and y. With this definition, one must use the inverse of |J| in the theorem, as it can be shown that with our definition of |J|:

|J| = \begin{vmatrix} \partial g(x,y)/\partial x & \partial g(x,y)/\partial y \\ \partial h(x,y)/\partial x & \partial h(x,y)/\partial y \end{vmatrix}^{-1}

This is sometimes the easier way to compute the Jacobian in any case.

Example Suppose that X and Y are independent random variables, each having the gamma distribution with λ = 1 in each case, and with α = a and α = b respectively, i.e.:

f_{XY}(x, y) = \frac{x^{a-1} y^{b-1} e^{-(x+y)}}{\Gamma(a)\Gamma(b)}

for non-negative x and y. We define the following transformations:

U = \frac{X}{X + Y}    V = X + Y.

Clearly 0 < U < 1 and 0 < V < ∞. The inverse transformations are easily seen to be defined by the functions:

x = uv    y = v(1 − u)

and thus

|J| = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix} = \begin{vmatrix} v & u \\ -v & (1 - u) \end{vmatrix} = |v(1 - u) + uv| = v


The joint pdf is thus:

f_{UV}(u, v) = \frac{u^{a-1} v^{a-1} v^{b-1} (1 - u)^{b-1} e^{-v}}{\Gamma(a)\Gamma(b)}\, v = \frac{u^{a-1}(1 - u)^{b-1} v^{a+b-1} e^{-v}}{\Gamma(a)\Gamma(b)}

over the region defined by 0 < u < 1 and 0 < v < ∞. As an exercise, show that U has the Beta distribution (of the first kind) with parameters a and b, while V has the gamma distribution with parameters 1 and a + b, and that U and V are independent.
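This exercise can also be checked numerically. The following is a minimal Monte Carlo sketch (not part of the original notes), assuming numpy and scipy are available; the parameter values a = 2, b = 3 and the sample size are arbitrary choices made for illustration only.

```python
# Sketch: simulate X ~ Gamma(a, 1) and Y ~ Gamma(b, 1) independently, then check that
# U = X/(X+Y) behaves like Beta(a, b), V = X+Y like Gamma(a+b, 1), and that U and V
# show no (linear) dependence.  Parameter choices are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b, n = 2.0, 3.0, 100_000
x = rng.gamma(shape=a, scale=1.0, size=n)
y = rng.gamma(shape=b, scale=1.0, size=n)
u, v = x / (x + y), x + y

print(stats.kstest(u, "beta", args=(a, b)).pvalue)     # large p-value: U consistent with Beta(a, b)
print(stats.kstest(v, "gamma", args=(a + b,)).pvalue)  # large p-value: V consistent with Gamma(a+b, 1)
print(np.corrcoef(u, v)[0, 1])                         # near 0: no linear dependence between U and V
```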

Tutorial Exercises

1. The joint mass function of two discrete random variables, W and V, is given by

p_{WV}(1, 1) = 1/8,   p_{WV}(1, 2) = 1/4,   p_{WV}(2, 1) = 1/8,   p_{WV}(2, 2) = 1/2

(a) Find the marginal mass functions of W and V .

(b) Find the conditional mass function of V given W = 2.

(c) Are W and V independent?

(d) Compute [i] P (W V ≤ 3). [ii] P [W + V > 2] [iii] P [W/V > 1].

2. If the joint pmf of C and D is given by

p_{CD}(c, d) = \frac{cd}{36}  for c = 1, 2, 3 and d = 1, 2, 3;   0 elsewhere

(a) Find the marginal pmf of C and D.

(b) Find the conditional mass function of C given D = 2.

(c) Are C and D independent?

(d) Find the probability mass function of X = C D.

3. The joint pdf of X and Y is given by

f_{XY}(x, y) = k\left(x^2 + \frac{xy}{2}\right),  0 < x < 1, 0 < y < 1;   0 elsewhere

(a) Use the definition of a joint pdf to find k (a constant).

(b) Find the marginal pdf of Y

(c) Find f X|Y (x|y).

(d) Find [i] P(1 < Y < 2 | x = 3) [ii] P(0 < X < 1/2 | y = 1)

4. The joint mass function of two discrete random variables, W and V, is given in the following table:


v \ w     2      3      4
 -1      0.20   0.05   0.04
  0      0.05   0.10   0.07
  1      0.04   0.04   0.16
  2      0.15   0.10   k

(a) Use the definition of a joint pmf to find k.

(b) Find the marginal mass functions of W and V .

(c) Find the conditional mass function of V given W = 2.

(d) Find [i] P r(W ≤ 2, V > 0). [ii] P r[0 < V < 2|W = 2]

(e) Find the pmf of X = V + W − 2.

5. The joint pdf of X and Y is given by

f_{XY}(x, y) = x + y,  0 < x < 1, 0 < y < 1;   0 elsewhere

(a) Find the marginal density functions of X and Y .

(b) Are X and Y independent?

(c) Find the conditional density function of X given Y = 1/2.

(d) Find [i] Pr[X ≤ 1/2] [ii] Pr[(0 < X < 0.5) | (Y = 1/2)]

(e) Find the joint pdf of W = X Y and V = Y .

(f) Find the marginal pdf of W .

(g) Find the pdf of S = X + Y .

6. The joint pdf of S and T is given by

f_{ST}(s, t) = c(2s + 3t),  0 ≤ s ≤ 1, 0 ≤ t ≤ 1;   0 elsewhere

(a) Use the definition of a joint pdf to show that c = 2/5.

(b) Find the marginal density functions of S and T .

(c) Are S and T independent?

(d) Find the conditional density function of S given T .

(e) Find [i] Pr[S < 1/2, T ≥ 0] [ii] Pr[(0 ≤ S ≤ 1) | (T = 1/2)]

(f) Find the joint pdf of W = S + T and V = T .


Chapter 3

Moments of univariate distributions and moment generating function

3.1 Assumed statistical background

• More about random variables (Introstat, chapter 6).

• Probability mass functions: binomial, poisson, geometric (Introstat, chapter 5 and 6).

• Probability density functions: gaussian, uniform, exponential (Introstat, chapter 5 and 6).

• Chapter one of these notes.

3.2 Moments of univariate distributions

Let g(x) be any real-valued function defined on the real line. Then, by g(X) we mean the random variable which takes on the value g(x) whenever X takes on the value x (e.g. g(x) = 2x + 3). In previous chapters, we have seen how in particular circumstances we can derive the distribution of g(X), once we know the distribution of X. The expectation of g(X) is, in the case of a continuous random variable, defined by:

E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx

and in the case of a discrete random variable defined as:

E[g(X)] = \sum_{x} g(x)\, p_X(x)

Note that while g(X) is a random variable, E[g(X)] is a number (a constant).

If we interpret the probabilities in a frequentist sense, then this expectation can be seen as a “long-run average” of the random variable g(X). More generally, the value represents in a sense the “centre of gravity” of the distribution of g(X). An important special case arises when g(x) = x^r for


some positive integer value of r; the expectation is then called the r-th moment of the distribution of X, written in the case of a continuous random variable as:

\mu'_r = E[X^r] = \int_{-\infty}^{\infty} x^r f_X(x)\, dx

and in the case of a discrete random variable as:

\mu'_r = E[X^r] = \sum_{x} x^r p_X(x)

The case r = 1, or E[X], is well-known: it is simply the expectation, or the mean, of X itself, which we shall often write as µ_X.

For r > 1 it is more convenient to work with the expectations of (X − µ_X)^r. These values are called the central moments of X, where the r-th central moment for a continuous random variable is defined by:

\mu_r = E[(X - \mu_X)^r] = \int_{-\infty}^{\infty} (x - \mu_X)^r f_X(x)\, dx

and in the case of a discrete random variable as:

\mu_r = E[(X - \mu_X)^r] = \sum_{x} (x - \mu_X)^r p_X(x)

Each central moment µ_r measures, in its own unique way, some part of the manner in which the distribution (or the observed values) of X are spread out around the mean µ_X.

You should be familiar with the case of r = 2, which gives the variance of X, also written as σ_X^2, or Var[X]. In the case of a continuous random variable we write:

\mu_2 = E[(X - \mu_X)^2]
      = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\, dx
      = \int_{-\infty}^{\infty} (x^2 - 2\mu_X x + \mu_X^2) f_X(x)\, dx
      = \int_{-\infty}^{\infty} x^2 f_X(x)\, dx - \int_{-\infty}^{\infty} 2\mu_X x f_X(x)\, dx + \int_{-\infty}^{\infty} \mu_X^2 f_X(x)\, dx
      = E[X^2] - 2\mu_X \int_{-\infty}^{\infty} x f_X(x)\, dx + \mu_X^2
      = \mu'_2 - 2\mu_X \mu_X + \mu_X^2
      = \mu'_2 - \mu_X^2

Homework: In the case of a discrete random variable, show that \mu_2 = \mu'_2 - \mu_X^2.

The variance is always non-negative (in fact, strictly positive, unless X takes on one specific value with probability 1), and measures the magnitude of the spread of the distribution. This interpretation should be well-known from first year. We now introduce two further central moments, the 3rd and the 4th.

Consider a probability density function which has a “skewed” shape similar to that shown in

Figure 3.1. The mean of this distribution is at x = 2, but the distribution is far from symmetric around this mean.


Figure 3.1: Example of a skew distribution

Now consider what happens when we examine the third central moment. For X < µ_X, we have (X − µ_X)^3 < 0, while (X − µ_X)^3 > 0 for X > µ_X. For a perfectly symmetric distribution, the positives and negatives will cancel out in taking the expectation, so that µ_3 = 0. But for a distribution such as that shown in Figure 3.1, very large positive values of (X − µ_X) will occur, but no very large negative values. The nett result is that µ_3 > 0, and we term such a distribution positively skewed. Negatively skewed distributions (with the long tail to the left) can also occur, but are perhaps less common in practice and in applications.

Of course, the magnitude of µ_3 will also depend on the amount of spread. In order to obtain a feel for the “skewness” of the distribution, it is useful to eliminate the effect of the spread itself (which is measured already by the variance). This elimination is effected by defining a coefficient of skew by:

\frac{\mu_3}{\sigma_X^3} = \frac{\mu_3}{(\sqrt{\mathrm{Var}(X)})^3}

which, incidentally, does not depend on the units of measurement used for X. For the distribution illustrated in Figure 3.1, the coefficient of skew turns out to be 1.4. (This distribution is, in fact, the gamma distribution with α = 2.)

In defining and interpreting the fourth central moment, we may find it useful to examine Figure 3.2. The two distributions do in fact have the same mean (0) and variance (1). This may be surprising at first sight, as the more sharply peaked distribution appears to be more tightly concentrated around the mean. What has happened, however, is that this distribution has much longer tails. The flatter distribution (actually a Gaussian distribution) has a density very close to zero outside of the range −3 < x < 3; but for the more sharply peaked distribution, the density falls away much more slowly, and is still quite detectable at ±5. In evaluating the variance, the occasionally very large values for (X − µ_X)^2 inflate the variance sufficiently to produce equal variances for the two distributions.


Figure 3.2: Example of differences in kurtosis

But consider what happens when we calculate µ_4: the occasional large discrepancies create an even greater effect when raised to the power 4, and thus µ_4 is larger for the sharply-peaked-and-long-tailed distribution than for the flatter, short-tailed distribution. The single word to describe this contrast is kurtosis, and the sharp-peaked, long-tailed distribution is said to have greater kurtosis than the other.

Thus the fourth central moment, µ_4, is a measure of kurtosis, in the sense that for two distributions having the same variance, the one with the higher µ_4 has the greater kurtosis (is more sharply peaked and long-tailed). But as with the third moment, µ_4 is also affected by the spread, and thus once again it is useful to have a measure of kurtosis only (describing the shape, not the spread, of the distribution). This elimination of spread is achieved by the coefficient of kurtosis defined as:

\frac{\mu_4}{\sigma_X^4} = \frac{\mu_4}{(\mathrm{Var}(X))^2}

which again does not depend on the units of measurement used for X.

For the Gaussian distribution (the flatter of the two densities illustrated in Figure 3.2), the coefficient of kurtosis is always 3 (irrespective of mean and variance). The more sharply peaked density in Figure 3.2 is that of a random variable which follows a so-called “mixture distribution”, i.e. its value derives from the Gaussian distribution with mean 0 and variance 4 with probability 0.2, and from the Gaussian distribution with mean 0 and variance 0.25 otherwise. The coefficient of kurtosis in this case turns out to be 9.75.

There is, however, an alternative definition for the coefficient of kurtosis obtained by subtracting 3 from the above, i.e.:

\frac{\mu_4}{\sigma_X^4} - 3


so that the second definition in effect measures departure from the Gaussian distribution (negative values corresponding to distributions which are flatter and shorter-tailed than the Gaussian, and positive values to distributions which are more sharply peaked and heavy-tailed than the Gaussian).

One could in principle continue further with higher order moments still, but there seems to be little practical value in doing so: the first four moments do give considerable insight into the shape of the distribution. Working with moments has great practical value, since from any set of observations of a random variable, we can obtain the corresponding sample moments based on \sum_{i=1}^{n} (x_i - \bar{x})^r. These sample moments can be used to match the sample data to a particular family of distributions.

Some useful formulae: Apart from the first moment, it is the central moments µ_r which best describe the shape of the distribution, but more often than not it is easier to calculate the raw or uncentred moments µ'_r. Fortunately, there are close algebraic relationships between the two types of moments, which are stated below for r = 2, 3, 4. The derivation of these relationships is left as an exercise. The relationship for the variance is so frequently used that it is worth remembering (although all three formulae are easily recollected once their derivation is understood):

\mu_2 = \sigma_X^2 = \mu'_2 - (\mu_X)^2
\mu_3 = \mu'_3 - 3\mu'_2\mu_X + 2(\mu_X)^3
\mu_4 = \mu'_4 - 4\mu'_3\mu_X + 6\mu'_2(\mu_X)^2 - 3(\mu_X)^4
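As an illustrative aside (not part of the original notes), these three formulae can be wrapped in a pair of small Python functions. The usage example applies them to the gamma distribution with α = 2 and λ = 1 of Figure 3.1, whose first four raw moments are 2, 6, 24 and 120; the computed coefficient of skew comes out at about 1.4, matching the value quoted above.

```python
# Sketch of the raw-to-central-moment formulae above, with mu1 = E[X] and
# mu2p, mu3p, mu4p the raw moments E[X^2], E[X^3], E[X^4].
def central_moments(mu1, mu2p, mu3p, mu4p):
    mu2 = mu2p - mu1**2
    mu3 = mu3p - 3*mu2p*mu1 + 2*mu1**3
    mu4 = mu4p - 4*mu3p*mu1 + 6*mu2p*mu1**2 - 3*mu1**4
    return mu2, mu3, mu4

def skew_kurtosis(mu2, mu3, mu4):
    # coefficient of skew mu3/sigma^3 and coefficient of kurtosis mu4/sigma^4
    return mu3 / mu2**1.5, mu4 / mu2**2

# Raw moments 2, 6, 24, 120 of the Gamma(alpha=2, lambda=1) distribution (Figure 3.1):
mu2, mu3, mu4 = central_moments(2, 6, 24, 120)
print(skew_kurtosis(mu2, mu3, mu4))   # approximately (1.414, 6.0)
```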

Example: Suppose that X has the exponential distribution with parameter λ. The mean is given by:

\mu_X = E[X] = \int_0^{\infty} x f_X(x)\, dx = \int_0^{\infty} x \lambda e^{-\lambda x}\, dx

which can be integrated by parts as follows:

\mu_X = \left[ x(-e^{-\lambda x}) \right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\, dx = 0 + \frac{1}{\lambda} = \frac{1}{\lambda}

For r > 1 we have in general that:

\mu'_r = E[X^r] = \int_0^{\infty} x^r f_X(x)\, dx = \int_0^{\infty} x^r \lambda e^{-\lambda x}\, dx

which may again be integrated by parts as follows:

\mu'_r = \left[ x^r(-e^{-\lambda x}) \right]_0^{\infty} + \int_0^{\infty} r x^{r-1} e^{-\lambda x}\, dx = 0 + r\,\frac{1}{\lambda} \int_0^{\infty} x^{r-1} \lambda e^{-\lambda x}\, dx = \frac{r}{\lambda}\,\mu'_{r-1}

From this recursion relationship it follows that µ'_2 = 2/λ^2, µ'_3 = 6/λ^3, and µ'_4 = 24/λ^4. It is left as an exercise to convert these raw moments (using the above formulae) to the second, third and fourth central moments, and hence also to determine the coefficients of skew (2) and kurtosis (9) for the exponential distribution. Note that these coefficients do not depend on λ.
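The claimed values can also be checked symbolically. The following sketch (not part of the notes) assumes sympy is available; it computes the raw moments directly from the exponential pdf and confirms that the coefficients of skew and kurtosis are 2 and 9 for every λ.

```python
# Symbolic check of the exponential-distribution exercise above.
import sympy as sp

lam, x = sp.symbols("lambda x", positive=True)
pdf = lam * sp.exp(-lam * x)
raw = [sp.integrate(x**r * pdf, (x, 0, sp.oo)) for r in range(1, 5)]  # mu'_1 .. mu'_4
m1 = raw[0]
mu2 = sp.simplify(raw[1] - m1**2)
mu3 = sp.simplify(raw[2] - 3*raw[1]*m1 + 2*m1**3)
mu4 = sp.simplify(raw[3] - 4*raw[2]*m1 + 6*raw[1]*m1**2 - 3*m1**4)
print(sp.simplify(mu3 / mu2**sp.Rational(3, 2)))  # coefficient of skew: 2
print(sp.simplify(mu4 / mu2**2))                  # coefficient of kurtosis: 9
```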


3.2.1 Moments - examples A − F

We consider Example A − F of chapter one again, to obtain some of the moments:

Example A:

p_Y(y) = \binom{4}{y} p^y (1-p)^{4-y},  y = 0, 1, 2, 3, 4

note that p_Y(y) ≥ 0 for all y = 0, 1, 2, 3, 4 (a countable set) and zero for all other values of y.

Find the mean of Y:

E[Y] = \sum_{y=0}^{4} y\, p_Y(y)
     = \sum_{y=0}^{4} y \binom{4}{y} p^y (1-p)^{4-y}
     = 0 + \sum_{y=1}^{4} 4p\,\frac{(4-1)!}{(y-1)!(4-y)!}\, p^{y-1} (1-p)^{4-y}
     = 4p \sum_{x=0}^{3} \frac{3!}{x!\,((3+1)-(x+1))!}\, p^{x} (1-p)^{3+1-(x+1)}   (by setting x = y − 1 and m = 4 − 1)
     = 4p \sum_{x=0}^{3} \binom{3}{x} p^{x} (1-p)^{3-x}
     = 4p(1)   because \sum_{x=0}^{3} \binom{3}{x} p^{x}(1-p)^{3-x} = 1 (pmf of B(3, p))
     = 4p

Example B: Recall that the pmf of X is given by

p_X(x) = \frac{x}{10},  x = 1, 2, 3, 4;  zero elsewhere

Find the mean and variance of X:

E[X] = \sum_{x=1}^{4} x\, p_X(x) = \sum_{x=1}^{4} x\,\frac{x}{10}
     = 1\cdot\tfrac{1}{10} + 2\cdot\tfrac{2}{10} + 3\cdot\tfrac{3}{10} + 4\cdot\tfrac{4}{10}
     = \frac{1 + 4 + 9 + 16}{10} = \frac{30}{10}


E[X^2] = \sum_{x=1}^{4} x^2\,\frac{x}{10}
       = 1\cdot\tfrac{1}{10} + 4\cdot\tfrac{2}{10} + 9\cdot\tfrac{3}{10} + 16\cdot\tfrac{4}{10}
       = \frac{1 + 8 + 27 + 64}{10} = \frac{100}{10}

Var[X] = E[X^2] - [E[X]]^2 = \frac{100}{10} - 9 = 1

Example C: Recall that the pmf of S is given by the following table:

Value of S:   0     1     2     3     4
p_S(s):      0.15  0.25  0.25  0.15  0.20

The mean and variance of S are:

E[S] = \sum_{s=0}^{4} s\, p_S(s) = 0 × 0.15 + 1 × 0.25 + 2 × 0.25 + 3 × 0.15 + 4 × 0.20 = 2

E[S^2] = \sum_{s=0}^{4} s^2\, p_S(s) = 0 × 0.15 + 1 × 0.25 + 4 × 0.25 + 9 × 0.15 + 16 × 0.20 = 5.8

Var[S] = E[S^2] - [E[S]]^2 = 1.8

Example D: Recall that the pdf of T is given by

f_T(t) = \frac{3}{8} t^2  for 0 ≤ t ≤ 2

Find µ'_r:


\mu'_r = E[T^r] = \int_0^2 t^r f_T(t)\, dt = \int_0^2 t^r\,\frac{3}{8} t^2\, dt = \int_0^2 \frac{3}{8} t^{r+2}\, dt = \frac{3}{8}\left[\frac{t^{r+2+1}}{r+2+1}\right]_{t=0}^{2} = \frac{3}{8}\cdot\frac{2^{r+2+1}}{r+2+1}

thus the mean is

E[T] = \mu'_1 = \frac{3}{8}\cdot\frac{2^{1+2+1}}{1+2+1} = \frac{3}{2}

and

E[T^2] = \mu'_2 = \frac{3}{8}\cdot\frac{2^{2+2+1}}{2+2+1} = \frac{12}{5}

thus the variance is

Var[T] = E[T^2] - [E[T]]^2 = \frac{12}{5} - \frac{9}{4} = \frac{12 \times 4 - 9 \times 5}{20} = \frac{3}{20}

Example E: Recall that the pdf of W is given by

f_W(w) = 2e^{-2w}  for 0 < w < ∞

Find µ'_r:


\mu'_r = E[W^r] = \int_0^{\infty} w^r f_W(w)\, dw = \int_0^{\infty} w^r\, 2e^{-2w}\, dw
       = \int_0^{\infty} 2\left(\frac{x}{2}\right)^r e^{-x}\,\frac{1}{2}\, dx   [by setting 2w = x, \Rightarrow dw = \tfrac{1}{2}dx]
       = \left(\frac{1}{2}\right)^r \int_0^{\infty} x^{(r+1)-1} e^{-x}\, dx
       = \left(\frac{1}{2}\right)^r \Gamma[r+1]

thus the mean is

E[W] = \mu'_1 = \left(\frac{1}{2}\right)^1 \Gamma[1+1] = \frac{1}{2}

E[W^2] = \mu'_2 = \frac{1}{2^2}\,\Gamma[2+1] = \frac{1}{4} \times 2 = \frac{1}{2}

Var[W] = \mu'_2 - [\mu'_1]^2 = \frac{1}{2} - \frac{1}{4} = \frac{1}{4}

Example F: Recall that the pdf of X is given by

f_X(x) = \frac{1}{10}  for 2 < x < 12

Find µ'_r:


\mu'_r = E[X^r] = \int_2^{12} x^r f_X(x)\, dx = \int_2^{12} x^r\,\frac{1}{10}\, dx = \frac{1}{10}\left[\frac{x^{r+1}}{r+1}\right]_{x=2}^{12} = \frac{1}{10}\cdot\frac{12^{r+1} - 2^{r+1}}{r+1}

thus the mean is

E[X] = \mu'_1 = \frac{1}{10}\cdot\frac{12^2 - 2^2}{2} = \frac{1}{10}\cdot\frac{1}{2}(144 - 4) = 7

E[X^2] = \mu'_2 = \frac{1}{10}\cdot\frac{12^3 - 2^3}{3} = \frac{1}{30}(12^3 - 2^3) = \frac{1720}{30}

and the variance is

Var[X] = \mu'_2 - [\mu'_1]^2 = \frac{172}{3} - 49 = 8.3333

3.3 The moment generating function

The moment generating function (mgf) of a random variable X is defined by:

M_X(t) = E[e^{tX}]

for real-valued arguments t, i.e. by:

M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x)\, dx


for continuous random variables, or by:

M_X(t) = \sum_{x=0}^{\infty} e^{tx} p_X(x)

for discrete random variables. Note that the moment generating function is a function of t, and not of realizations of X. The random variable X (or more correctly, perhaps, its distribution) defines the function of t given by M_X(t).

Clearly, M_X(0) = 1, since for t = 0, e^{tx} = 1 (a constant). For non-zero values of t, we cannot be sure that M_X(t) exists at all. For the purposes of this course, we shall assume that M_X(t) exists for all t in some neighbourhood of t = 0, i.e. that there exists an ε > 0 such that M_X(t) exists for all −ε < t < +ε. This condition is a restrictive assumption, in that some otherwise well-behaved distributions are excluded. We are, however, invoking this assumption for convenience only in this course. If we used a purely imaginary argument, e.g. it, where i = √−1, then the corresponding function E[e^{itX}] (called the characteristic function of X, when viewed as a function of t) does exist for all proper distributions. Everything which we shall be doing with the mgf's carries through to characteristic functions as well, but that extension involves us in issues of complex analysis. For ease of presentation in this course, therefore, we restrict ourselves to the mgf.

Recalling the power series expansion of e^x, we see that:

M_X(t) = E\left[1 + tX + \frac{t^2X^2}{2!} + \frac{t^3X^3}{3!} + \frac{t^4X^4}{4!} + \cdots\right]
       = 1 + tE[X] + \frac{t^2}{2!}E[X^2] + \frac{t^3}{3!}E[X^3] + \frac{t^4}{4!}E[X^4] + \cdots
       = 1 + t\mu'_1 + \frac{t^2}{2!}\mu'_2 + \frac{t^3}{3!}\mu'_3 + \frac{t^4}{4!}\mu'_4 + \cdots

Now consider what happens when we repeatedly differentiate M_X(t). Writing:

M_X^{(r)}(t) = \frac{d^r M_X(t)}{dt^r}

we obtain:

M_X^{(1)}(t) = \mu'_1 + \frac{2t}{2!}\mu'_2 + \frac{3t^2}{3!}\mu'_3 + \frac{4t^3}{4!}\mu'_4 + \cdots = \mu'_1 + t\mu'_2 + \frac{t^2}{2!}\mu'_3 + \frac{t^3}{3!}\mu'_4 + \cdots.

Similarly:

M_X^{(2)}(t) = \mu'_2 + t\mu'_3 + \frac{t^2}{2!}\mu'_4 + \frac{t^3}{3!}\mu'_5 + \cdots,

and continuing in this way, we have in general:

M_X^{(r)}(t) = \mu'_r + t\mu'_{r+1} + \frac{t^2}{2!}\mu'_{r+2} + \frac{t^3}{3!}\mu'_{r+3} + \cdots.

If we now set t = 0 in the above expressions, we obtain \mu'_1 = \mu_X = M_X^{(1)}(0), \mu'_2 = M_X^{(2)}(0), and in general:

\mu'_r = M_X^{(r)}(0).

We thus have a procedure for determining moments by performing just one integration or summation (to get M_X(t)), and the required number of differentiations. This is often considerably


simpler than attempting to compute the moments directly by repeated integrations or summations. This only gives the uncentred moments, but the centred moments can be derived from these raw moments, using the formulae given earlier in this chapter.

The expansion for M_X(t) used above indicates that the mgf is fully determined by the moments of the distribution, and vice versa. Since the distribution is in fact fully characterized by its moments, this correspondence suggests that there is a one-to-one correspondence between mgf's and probability distributions. This argument is not a proof at this stage, but the above assertion can in fact be proved for all probability distributions whose mgf's exist in a neighbourhood of t = 0. The importance of this result is that if we can derive the mgf of a random variable, then we have in principle also found its distribution. In practice, what we do is to calculate and record the mgf's for a variety of distributional forms. Then when we find a new mgf, we can check back to see which distribution it matches. This idea will be illustrated later. We now derive mgf's for some important distributional classes.

Example (Geometric distribution): p_X(x) = pq^x for x = 0, 1, 2, . . ., where q = 1 − p. Thus:

M_X(t) = E[e^{Xt}] = \sum_{x=0}^{\infty} e^{tx} p_X(x) = \sum_{x=0}^{\infty} e^{tx} p q^x = p \sum_{x=0}^{\infty} (qe^t)^x.

Using the standard sum of a geometric series, we obtain M_X(t) = p/(1 − qe^t), provided that qe^t < 1. The mgf thus exists for all t < −ln(q), where −ln(q) is a positive upper bound since q < 1 by definition.

Exercise Obtain the mean and variance of the geometric distribution from the mgf

Example (Poisson distribution):

p_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}

The mgf is thus:

M_X(t) = \sum_{x=0}^{\infty} e^{tx}\,\frac{\lambda^x e^{-\lambda}}{x!} = \sum_{x=0}^{\infty} \frac{(\lambda e^t)^x e^{-\lambda}}{x!} = e^{-\lambda} e^{\lambda e^t} \sum_{x=0}^{\infty} \frac{(\lambda e^t)^x e^{-\lambda e^t}}{x!}

Now the term in the summation expression is the pmf of a Poisson distribution with parameter λe^t, and the summation thus evaluates to 1. The mgf is thus:

M_X(t) = e^{\lambda(e^t - 1)}.

This recognition of a term which is equivalent to the pdf or pmf of the original distribution, but with modified parameters, is often the key to evaluating mgf's.


The first two derivatives are:

M_X^{(1)}(t) = \lambda e^t e^{\lambda(e^t-1)}

and:

M_X^{(2)}(t) = \lambda e^t e^{\lambda(e^t-1)} + (\lambda e^t)^2 e^{\lambda(e^t-1)}.

Setting t = 0 in these expressions gives µ_X = λ and µ'_2 = λ + λ^2. Thus:

\sigma_X^2 = \mu'_2 - (\mu_X)^2 = \lambda.
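As a quick illustrative aside (not part of the notes), the same differentiation can be done symbolically, assuming sympy is available; it reproduces the mean λ and variance λ obtained above.

```python
# Differentiate the Poisson mgf M_X(t) = exp(lambda*(e^t - 1)) at t = 0.
import sympy as sp

t = sp.symbols("t")
lam = sp.symbols("lambda", positive=True)
M = sp.exp(lam * (sp.exp(t) - 1))
mu1 = sp.diff(M, t, 1).subs(t, 0)       # E[X] = lambda
mu2p = sp.diff(M, t, 2).subs(t, 0)      # E[X^2] = lambda + lambda^2
print(sp.simplify(mu1), sp.simplify(mu2p - mu1**2))   # lambda, lambda
```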

Exercise (Binomial distribution): Show that the mgf of the binomial distribution is given by:

M_X(t) = (q + pe^t)^n

where q = 1 − p.

Hint: Combine terms with x as exponent.

Example (Gamma distribution):

M_X(t) = \int_0^{\infty} e^{tx}\,\frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x}\, dx = \int_0^{\infty} \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-(\lambda-t)x}\, dx = \frac{\lambda^{\alpha}}{(\lambda-t)^{\alpha}} \int_0^{\infty} \frac{(\lambda-t)^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-(\lambda-t)x}\, dx

The integrand in the last line above is the pdf of the gamma distribution with parameters α and λ − t, provided that t < λ. Thus, for t < λ, the integral above evaluates to 1. Note how once again we have recognized another form of the distribution with which we began. We have thus demonstrated that for the gamma distribution:

M_X(t) = \frac{\lambda^{\alpha}}{(\lambda - t)^{\alpha}} = \left(1 - \frac{t}{\lambda}\right)^{-\alpha}

We leave it as an exercise to verify that µ_X = α/λ, and that σ_X^2 = α/λ^2, from the mgf.

Recall that the χ^2 distribution with n degrees of freedom is the gamma distribution with α = n/2 and λ = 1/2. Thus the mgf of the χ^2 distribution with n degrees of freedom is given by:

M_X(t) = (1 - 2t)^{-n/2}.

We shall make use of this result later.

We consider Examples A − F of chapter one again, to obtain some moment generating functions:

Example A: Y ∼ B(4, p); already done for homework (see above).

Examples B, C and D: mgf not useful.

Example E: Let W be a random variable of the continuous type with pdf given by

f_W(w) = 2e^{-2w}  for 0 < w < ∞.

The mgf of W is given by

M_W(t) = E[e^{wt}] = \int_0^{\infty} e^{wt} f_W(w)\, dw = \int_0^{\infty} e^{wt}\, 2e^{-2w}\, dw = \int_0^{\infty} 2e^{-w(2-t)}\, dw


To solve this integral we make the following transformation:

w(2 − t) = x, thus w = \frac{x}{2-t} and dw = \frac{1}{2-t}\, dx

thus

M_W(t) = \int_0^{\infty} 2e^{-x}\,\frac{1}{2-t}\, dx = \frac{-2}{2-t}[0 - 1] = \frac{2}{2-t}   (valid for t < 2)

Example F: Let X be a random variable of the continuous type with pdf given by

f_X(x) = \frac{1}{10}  for 2 < x < 12.

The mgf of X is given by

M_X(t) = E[e^{xt}] = \int_2^{12} e^{xt} f_X(x)\, dx = \int_2^{12} e^{xt}\,\frac{1}{10}\, dx = \frac{1}{10t}\left[e^{xt}\right]_{x=2}^{12} = \frac{1}{10t}[e^{12t} - e^{2t}] = \frac{e^{12t} - e^{2t}}{10t}

3.4 Moment generating functions for functions of random variables

Apart from its use in facilitating the finding of moments, the mgf is a particularly useful tool in deriving the distributions of functions of one or more random variables. We shall be seeing a number of examples of this application later. For now, we examine the general principles.

Suppose that we know the mgf of a random variable X, and that we are now interested in the mgf of Y = aX + b, for some constants a and b. Clearly:

E[e^{tY}] = E[e^{atX + bt}] = e^{bt} E[e^{atX}]

since b, and thus e^{bt}, is a constant for any given t. We have thus demonstrated that M_Y(t) = e^{bt} M_X(at). Note that by M_X(at), we mean the mgf of X evaluated at an argument of at (hint: think of M_X(♣) as the mgf of the random variable X evaluated at an argument of ♣).

For example, suppose that X is Gaussian distributed with mean µ and variance σ^2. We know that Z = (X − µ)/σ has the standard Gaussian distribution, i.e. M_Z(t) = e^{t^2/2}. But then X = σZ + µ, and by the above result:

M_X(t) = e^{\mu t} M_Z(\sigma t) = e^{\mu t + \sigma^2 t^2/2}.
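As an illustrative aside (not part of the notes), this rule for M_{aX+b}(t) can be checked symbolically, assuming sympy is available: starting from the standard Gaussian mgf, the mgf of X = σZ + µ differentiates to mean µ and variance σ^2, as expected.

```python
# Check the rule M_{aX+b}(t) = e^{bt} M_X(at) for the Gaussian case.
import sympy as sp

t = sp.symbols("t")
mu, sigma = sp.symbols("mu sigma", positive=True)
M_Z = sp.exp(t**2 / 2)                              # mgf of standard Gaussian Z
M_X = sp.exp(mu * t) * M_Z.subs(t, sigma * t)       # e^{mu t} M_Z(sigma t)
m1 = sp.diff(M_X, t, 1).subs(t, 0)
m2 = sp.diff(M_X, t, 2).subs(t, 0)
print(sp.simplify(m1), sp.simplify(m2 - m1**2))     # mu, sigma**2
```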


A second important property relates to the mgf's of sums of independent random variables. Suppose that X and Y are independent random variables with known mgf's. Let U = X + Y; then:

M_U(t) = E[e^{tU}] = E[e^{tX+tY}] = E[e^{tX}e^{tY}] = E[e^{tX}]E[e^{tY}]

where the last equality follows from the independence of X and Y. In other words:

M_U(t) = M_X(t) M_Y(t).

This result can be extended: for example if Z is independent of X and Y (and thus also of U = X + Y), and we define V = X + Y + Z, then V = U + Z, and thus:

M_V(t) = M_U(t) M_Z(t) = M_X(t) M_Y(t) M_Z(t).

Taking this argument further in an inductive sense, we have the following theorem:

Theorem 3.1 Suppose that X_1, X_2, X_3, . . . , X_n are independent random variables, and that M_i(t) is (for ease of notation) the mgf of X_i. Then the mgf of S = \sum_{i=1}^{n} X_i is given by:

M_S(t) = \prod_{i=1}^{n} M_i(t).

An interesting special case of the theorem is that in which the X_i are identically distributed, which implies that the moment generating functions are identical: M_X(t), say. In this case:

M_S(t) = [M_X(t)]^n.

Since the sample mean \bar{X} = S/n, we can combine our previous results to get:

M_{\bar{X}}(t) = \left[ M_X\!\left(\frac{t}{n}\right) \right]^n.

This equation is an important result which we will use again later.

We can use theorem 3.1 to prove an interesting property of the Poisson, Gamma and Gaussian distributions (which does not carry over to all distributions in general, however). (This property can be described as the closure of each of the families under addition of independent variables within the family.)

Poisson Distr.: Suppose that X_1, X_2, X_3, . . . , X_n are independent random variables, such that X_i has the Poisson distribution with parameter λ_i. Then:

M_i(t) = e^{\lambda_i(e^t - 1)}

and the mgf of S = \sum_{i=1}^{n} X_i is:

\prod_{i=1}^{n} e^{\lambda_i(e^t-1)} = \exp\left[ \sum_{i=1}^{n} \lambda_i (e^t - 1) \right]

which is the mgf of the Poisson distribution with parameter \sum_{i=1}^{n} \lambda_i. Thus S has this Poisson distribution.

Gamma Distr.: Suppose that X_1, X_2, X_3, . . . , X_n are independent random variables, such that each X_i has a Gamma distribution with a common value for the λ parameter, but with possibly different values for the α parameter, say α_i for X_i. Then:

M_i(t) = \left(1 - \frac{t}{\lambda}\right)^{-\alpha_i}


and the mgf of S = \sum_{i=1}^{n} X_i is:

\prod_{i=1}^{n} \left(1 - \frac{t}{\lambda}\right)^{-\alpha_i} = \left(1 - \frac{t}{\lambda}\right)^{-\sum_{i=1}^{n} \alpha_i}

which is the mgf of the Gamma distribution with parameters \sum_{i=1}^{n} \alpha_i and λ. Thus S has this Gamma distribution.

For the special case of the chi-squared distribution, suppose that X_i has the χ^2 distribution with r_i degrees of freedom. It follows from the above result that S then has the χ^2 distribution with \sum_{i=1}^{n} r_i degrees of freedom.

Gaussian Distr.: Suppose that X_1, X_2, X_3, . . . , X_n are independent random variables, such that X_i has a Gaussian distribution with mean µ_i and variance σ_i^2. Then:

M_i(t) = e^{\mu_i t + \sigma_i^2 t^2/2}

and the mgf of S = \sum_{i=1}^{n} X_i is:

\exp\left[ \sum_{i=1}^{n} (\mu_i t) + \sum_{i=1}^{n} (\sigma_i^2 t^2/2) \right],

which is the mgf of the Gaussian distribution with mean \sum_{i=1}^{n} \mu_i and variance \sum_{i=1}^{n} \sigma_i^2. Thus S (the sum) has the Gaussian distribution with this mean and variance.

The concept of using the mgf to derive distributions extends beyond additive transformations. The following result illustrates this, and is of sufficient importance to be classed as a theorem.

Theorem 3.2 Suppose that X has the Gaussian distribution with mean µ and variance σ^2, and let:

Y = \frac{(X - \mu)^2}{\sigma^2}.

Then Y has the χ^2 distribution with one degree of freedom.

(We will skip the proof.)

Corollary: If X_1, X_2, . . . , X_n are independent random variables from a common Gaussian distribution with mean µ and variance σ^2, then:

Y = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2}

has the χ^2 distribution with n degrees of freedom. This is a very important result!

3.5 The central limit theorem

We come now to what is perhaps the best known, and most widely used and misused, result concerning convergence in distribution, viz. the central limit theorem. We state it in the following form:


Theorem 3.3 (The Central Limit Theorem) Let X_1, X_2, X_3, . . . be an iid sequence of random variables, having finite mean (µ) and variance (σ^2). Suppose that the common mgf, say M_X(t), and its first two derivatives exist in a neighbourhood of t = 0. For each n, define:

\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i

and:

Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}.

Then the sequence Z_1, Z_2, Z_3, . . . converges in distribution to Z, which has the standard Gaussian distribution.

Comment 1: The practical implication of the central limit theorem is that for large enough n, the distribution of the sample mean can be approximated by a Gaussian distribution with mean µ and variance σ^2/n, provided only that the underlying sampling distribution satisfies the conditions of the theorem. This is very useful, as it allows the powerful statistical inferential procedures based on Gaussian theory to be applied, even when the sampling distribution itself is not Gaussian. However, the theorem can be seriously misused by application to cases with relatively small n.

Comment 2: The assumption of the existence of the twice-differentiable mgf is quite strong. However, the characteristic function (i.e. using imaginary arguments) and its first two derivatives will exist if the first two moments exist, which we have already assumed.
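A minimal simulation sketch (illustrative only, not from the notes, and assuming numpy and scipy are available) of the practical implication in Comment 1: standardized means of exponential(1) samples, a clearly non-Gaussian parent distribution, are compared with the standard Gaussian. The choices n = 50 and 20000 replications are arbitrary.

```python
# Simulate Z_n = (Xbar - mu)/(sigma/sqrt(n)) for exponential(1) data and compare with N(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 50, 20_000
samples = rng.exponential(scale=1.0, size=(reps, n))    # parent has mean 1 and variance 1
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))
print(stats.kstest(z, "norm").pvalue)                   # typically large: Z_n is close to N(0, 1)
```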

Tutorial exercises

1. A random variable X has probability density function given by

f_X(x) = \frac{1}{64} x^3,  0 ≤ x ≤ 4;   0 otherwise

(a) Find the expectation and variance of X .

(b) Is this distribution skew? Motivate your answer.

2. A random variable Y has probability density function given by

f_Y(y) = e^{-y},  0 ≤ y;   0 otherwise

(a) Find µ′r.

(b) Find the mean and variance of Y .

(c) Find the kurtosis of Y .

(d) Find the mgf, M_Y(t), of Y.

(e) Use the mgf to derive the mean and variance of Y .

(f) Let W = 2Y , find the mgf of W . Can you identify the distribution of W ?


3. A random variable X has probability density function given by

f_X(x) = 2x,  0 ≤ x ≤ 1;   0 otherwise

(a) Find µ'_r.

(b) Find the mean and variance of X .

(c) Find the kurtosis of X .

(d) Comment on the distribution of X .

4. A random variable X has probability mass function given by

p_X(x) = \frac{4^x e^{-4}}{x!},  x = 0, 1, 2, . . .;   0 otherwise

(a) Find µ'_r.
(b) Find the mean and variance of X.

(c) Find the mgf of X .

(d) Use this mgf to find the first two moments of X .

5. A random variable T has probability mass function given by

p_T(t) = \frac{\binom{5}{t}\binom{4}{2-t}}{36},  t = 0, 1, 2;   0 otherwise

Find the mean and variance of T.

6. A random variable X has probability density function given by

f_X(x) = \frac{10}{x},  x > 10;   0 for x ≤ 10

Find the mean and variance of X.

7. The following questions are from Hogg and Craig (1978):

(a) Let \bar{X} denote the mean of a random sample of size 75 from the distribution that has the pdf

f_X(x) = 1 for 0 < x < 1;   0 elsewhere

Use the central limit theorem to show that the approximate probability

Pr[0.45 < \bar{X} < 0.55] = 0.866.

(b) Let \bar{X} denote the mean of a random sample of size 100 from a distribution that is χ^2_{50} (chi-square with 50 degrees of freedom). Compute an approximate value of Pr[49 < \bar{X} < 51].

Hogg R.V. and Craig A.T. (1978): Introduction to mathematical statistics (fourth edition). Macmillan Publishing Co., Inc., New York, 438 p.


Chapter 4

Moments of bivariate distributions

4.1 Assumed statistical background

• Chapter two and three of this notes.

4.2 Moments of bivariate distributions: covariance and cor-relation

The concept of moments is directly generalizable to bivariate (or any multivariate) distributions. If (X, Y) is a bivariate random variable, then we define the joint (r, s)-th moment by:

\mu'_{rs} = E[X^r Y^s]

which for continuous random variables is given by:

\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^r y^s f_{XY}(x, y)\, dy\, dx

and for a discrete random variable by:

\sum_{x}\sum_{y} x^r y^s p_{XY}(x, y).

As with univariate moments, it is useful to subtract out the means for higher order moments. Thus the (r, s)-th central moment of (X, Y) is defined by:

\mu_{rs} = E[(X - \mu_X)^r (Y - \mu_Y)^s]

The simplest such case is when r = 1 and s = 1, thus

\mu_{11} = E[(X - \mu_X)(Y - \mu_Y)]

which is termed the covariance of X and Y, written as Covar(X, Y) or as σ_{XY}. While variance measures the extent of dispersion of a single variable about its mean, covariance measures the extent to which two variables vary together around their means. If large values of X (i.e. X > µ_X)


tend to be associated with large values of Y (i.e. Y > µ_Y), and vice versa, then (X − µ_X)(Y − µ_Y) will tend to take on positive values more often than negative values, and we will have σ_{XY} > 0. X and Y will then be said to be positively correlated. Conversely, if large values of the one variable tend to be associated with small values of the other, then we have σ_{XY} < 0, and the variables are said to be negatively correlated. If σ_{XY} = 0, then we say that X and Y are uncorrelated.

Covariance, or the sample estimate of covariance, is an extremely important concept in statistical practice. Two important uses are the following:

Exploring possible causal links: For example, early observational studies had shown that the level of cigarette smoking and the incidence of lung cancer were positively correlated. This suggested a plausible hypothesis that cigarette smoking was a causative factor in lung cancer. This was not yet a proof, but suggested important lines of future research.

Prediction: Whether or not a causal link exists, it remains true that if σ_{XY} > 0, and we observe X >> µ_X, then we would be led to predict that Y >> µ_Y. Thus, even without proof of a causal link between cigarette smoking and lung cancer, the actuary would be justified in classifying a heavy smoker as a high risk for lung cancer (even if, for example, it is propensity to cancer that causes addiction to cigarettes).

As with other moments, it is often easier to calculate the uncentred moment µ'_{11} than to calculate the covariance by direct integration. The following is thus a useful result, worth remembering:

\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]
            = E[XY] - \mu_X E[Y] - \mu_Y E[X] + \mu_X\mu_Y
            = \mu'_{11} - \mu_X\mu_Y

We continue with the examples (G and I) from chapter two:

Example G: The joint probability mass function of X and Y is:

             x = 0     x = 1     x = 2     x = 3
y = 0       0.03125   0.06250   0.03125   0
y = 1       0.06250   0.15625   0.12500   0.03125
y = 2       0.03125   0.12500   0.15625   0.06250
y = 3       0         0.03125   0.06250   0.03125

In order to obtain the covariance, we first calculate:

E[XY] = \sum_{x=0}^{3}\sum_{y=0}^{3} x y\, p_{XY}(x, y) = 0 × 0 × 0.03125 + \cdots + 3 × 3 × 0.03125 = 2.5 = \frac{5}{2}

E[X] = \sum_{x=0}^{3} x\, p_X(x) = 0 × 0.125 + 1 × 0.375 + 2 × 0.375 + 3 × 0.125 = 1.5 = \frac{3}{2}


It is left as an exercise to show that µ_Y = 3/2, σ_X^2 = 3/4 and σ_Y^2 = 3/4. The covariance is:

\sigma_{XY} = E[XY] - \mu_X\mu_Y = \frac{5}{2} - \frac{3}{2}\times\frac{3}{2} = \frac{1}{4}
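These Example G results can be checked numerically. The sketch below (not part of the original notes, and assuming numpy is available) enters the joint pmf table as a matrix, with rows indexed by the values of one variable and columns by the other (the table is symmetric, so the orientation does not affect the answers), and recovers the covariance 1/4 (and the correlation 1/3 used later in this chapter).

```python
import numpy as np

p = np.array([[0.03125, 0.06250, 0.03125, 0.00000],
              [0.06250, 0.15625, 0.12500, 0.03125],
              [0.03125, 0.12500, 0.15625, 0.06250],
              [0.00000, 0.03125, 0.06250, 0.03125]])
vals = np.arange(4)
px, py = p.sum(axis=1), p.sum(axis=0)          # marginal pmfs
EX, EY = vals @ px, vals @ py                  # both 1.5
EXY = vals @ p @ vals                          # E[XY] = 2.5
cov = EXY - EX * EY                            # 0.25
varx = (vals**2) @ px - EX**2                  # 0.75
vary = (vals**2) @ py - EY**2                  # 0.75
print(cov, cov / np.sqrt(varx * vary))         # 0.25, 0.333...
```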

Example H: Recall that the joint pdf of X, Y is given by:

f_{XY}(x, y) = \frac{3}{28}(x + y^2)  for 0 ≤ x ≤ 2; 0 ≤ y ≤ 2

It is left as an exercise to show that µ_X = 16/14, µ_Y = 9/7, σ_X^2 = 46/147 and σ_Y^2 = 71/245. In order to obtain the covariance, we first calculate:

\mu'_{11} = E[XY] = \int_{x=0}^{2}\int_{y=0}^{2} x y\, f_{XY}(x, y)\, dy\, dx
          = \int_{x=0}^{2}\int_{y=0}^{2} xy\,\frac{3}{28}(x + y^2)\, dy\, dx
          = \frac{3}{28}\int_{x=0}^{2}\int_{y=0}^{2} (x^2 y + x y^3)\, dy\, dx
          = \frac{3}{28}\int_{x=0}^{2}\left[\frac{x^2y^2}{2} + \frac{xy^4}{4}\right]_{y=0}^{2} dx
          = \frac{3}{28}\int_{x=0}^{2} (2x^2 + 4x)\, dx
          = \frac{3}{28}\left[\frac{2x^3}{3} + \frac{4x^2}{2}\right]_{x=0}^{2}
          = \frac{3}{28}\left(\frac{2 \cdot 2^3}{3} + \frac{4 \cdot 2^2}{2}\right) = 10/7

Thus the covariance is:

\sigma_{XY} = \frac{10}{7} - \frac{16}{14}\cdot\frac{9}{7} = -\frac{2}{49}

We now state and prove a few important results concerning bivariate random variables, which depend on the covariance. We start, however, with a more general result:

Theorem 4.1 If X and Y are independent random variables, then for any real valued functions g(x) and h(y), E[g(X )h(Y )] = E[g(X )] · E[h(Y )].

Proof: We shall give the proof for continuous distributions only. The discrete case follows analogously.


Since, by independence, we have that f_{XY}(x, y) = f_X(x) f_Y(y), it follows that:

E[g(X)h(Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x)h(y) f_X(x) f_Y(y)\, dy\, dx
            = \int_{-\infty}^{\infty} g(x) f_X(x) \left[\int_{-\infty}^{\infty} h(y) f_Y(y)\, dy\right] dx
            = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx \int_{-\infty}^{\infty} h(y) f_Y(y)\, dy
            = E[g(X)] \cdot E[h(Y)]

Note in particular that this result implies that if X and Y are independent, then E[XY] = µ'_{11} = µ_X µ_Y, and thus that σ_{XY} = 0. We record this result as a theorem:

Theorem 4.2 If X and Y are independent random variables, then σ_{XY} = 0.

The converse of this theorem is not true in general (i.e. we cannot in general conclude that X and Y are independent if σ_{XY} = 0), although an interesting special case does arise with the Gaussian distribution (see property 3 of the bivariate Gaussian distribution). That the converse is not true is demonstrated by the following simple discrete example:

Example of uncorrelated but dependent variables: Suppose that X and Y are discrete random variables, with p_{XY}(x, y) = 0, except for the four cases indicated below:

p_{XY}(0, -1) = p_{XY}(1, 0) = p_{XY}(0, 1) = p_{XY}(-1, 0) = \frac{1}{4}

Note that p_X(-1) = p_X(1) = 1/4 and p_X(0) = 1/2, and similarly for Y. Thus X and Y are not independent, because (for example) p_{XY}(0, 0) = 0, while p_X(0)\, p_Y(0) = 1/4.

We see easily that µ_X = µ_Y = 0, and thus σ_{XY} = E[XY]. But XY = 0 for all cases with non-zero probability, and thus E[XY] = 0. Thus the variables are uncorrelated, but dependent.

Theorem 4.3 For any real numbers a and b:

Var[aX + bY] = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\sigma_{XY}

Proof: Clearly:

E[aX + bY] = a E[X] + b E[Y] = a\mu_X + b\mu_Y

Thus the variance of aX + bY is given by:

E[(aX + bY - a\mu_X - b\mu_Y)^2]
  = E[a^2(X - \mu_X)^2 + b^2(Y - \mu_Y)^2 + 2ab(X - \mu_X)(Y - \mu_Y)]
  = a^2 E[(X - \mu_X)^2] + b^2 E[(Y - \mu_Y)^2] + 2ab\, E[(X - \mu_X)(Y - \mu_Y)]
  = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\sigma_{XY}

Special Cases: The following are useful special cases:

Var[X + Y] = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY}
Var[X - Y] = \sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}

If X and Y are independent (or even if only uncorrelated), these two cases reduce to:

Var[X + Y] = Var[X - Y] = \sigma_X^2 + \sigma_Y^2


As with the interpretation of third and fourth moments, the interpretation of covariance is confounded by the fact that the magnitude of σ_{XY} is influenced by the spreads of X and Y themselves. As before, we can eliminate the effects of the variances by defining an appropriate correlation coefficient, namely

\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X\sigma_Y} = \frac{\mathrm{Covar}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}

To summarize:

1. Note that ρ_{XY} has the same sign as σ_{XY}, and takes on the value zero when X and Y are uncorrelated.

2. If X and Y are precisely linearly related, then |ρ_{XY}| = 1, with the sign being determined by the slope of the line; and

3. If X and Y are independent, then ρ_{XY} = 0.

The magnitude of the correlation coefficient is thus a measure of the degree to which the two variables are linearly related, while its sign indicates the direction of this relationship.

Example G: In example G, we had σ_X^2 = 3/4, σ_Y^2 = 3/4 and σ_{XY} = 1/4. Thus:

\rho_{XY} = \frac{1/4}{\sqrt{(3/4) \times (3/4)}} = \frac{1}{3}

Example H: In example H, we had σ_X^2 = 46/147, σ_Y^2 = 71/245 and σ_{XY} = -2/49. Thus:

\rho_{XY} = \frac{-2/49}{\sqrt{0.31293 \times 0.28980}} = -0.13554

Example I: For homework show that ρXY = 0.

4.3 Conditional moments and regression of the mean

An alternative to covariance or correlation as a means of investigating the relationships between two random variables is to examine the means of the two conditional distributions. The conditional pdf's f_{X|Y}(x|y) and f_{Y|X}(y|x) (or, for discrete distributions, the conditional probability mass functions p_{X|Y}(x|y) and p_{Y|X}(y|x)) define proper probability distributions, for which moments, and in particular means, can be calculated in the usual manner. We shall write µ_{X|y}, or E[X|Y = y], to represent the mean of X conditional on Y = y, and similarly µ_{Y|x}, or E[Y|X = x], for Y conditional on X = x.

Note that initially we have a sample space that includes all possible values of X and Y. But when we condition, we consider only a fragment of the original space.

The conditional mean E[X|Y = y] is the expectation (or “long-run average”) of X, amongst all outcomes for which Y = y (or, more correctly for continuous random variables, for y ≤ Y ≤ y + h in the limit as h → 0). Note that E[X|Y = y] is a function of the real number y only; it is not a random variable, and it does not depend on any observed value of X, since it is an average of all X's within a stated class. The conditional expectation of X given Y = y is also termed the regression (of the mean of) X on Y, which is often plotted graphically (i.e. E[X|Y = y] versus y).


We can, of course, also compute the regression of Y on X, and it is worth emphasizing that the two regressions will not in general give identical plots.

In first year, you were introduced to the concept of linear regression, which can be viewed as a best linear approximation to the relationship between E[Y|X = x] and x. In the previous chapter, we noted that for the bivariate Gaussian distribution, the means of the two conditional distributions, i.e. the regressions, are truly linear. In general, however, regressions will not be linear for other distributions. This contrast is illustrated in the following example:

Example H (cont.) We continue further with example H from the previous chapter, viz. that for which:

f_{XY}(x, y) = \frac{3}{28}(x + y^2)  for 0 ≤ x ≤ 2; 0 ≤ y ≤ 2

From our previous results, it is easy to confirm that:

f_{Y|X}(y|x) = \frac{x + y^2}{2x + \frac{8}{3}}

for 0 ≤ y ≤ 2, and zero elsewhere (but only defined for 0 ≤ x ≤ 2).

Thus the regression of Y on X is given by:

\mu_{Y|x} = \int_{y=0}^{2} y\, f_{Y|X}(y|x)\, dy
          = \int_{y=0}^{2} y\,\frac{x + y^2}{2x + \frac{8}{3}}\, dy
          = \frac{1}{2x + \frac{8}{3}} \int_{y=0}^{2} (xy + y^3)\, dy
          = \frac{1}{2x + \frac{8}{3}} \left[\frac{xy^2}{2} + \frac{y^4}{4}\right]_{y=0}^{2}
          = \frac{2x + 4}{2x + \frac{8}{3}}

It is left as an exercise to find the regression of X on Y .

Note that the two regressions are non-linear, and are not inverse functions of each other.
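As a quick numerical sketch (not part of the notes, assuming scipy is available), the regression formula derived above can be checked by integrating the conditional pdf at a single, arbitrarily chosen point x = 1.

```python
# Verify E[Y | X = x] = (2x + 4)/(2x + 8/3) for Example H at x = 1.
from scipy import integrate

def f_y_given_x(y, x):
    # conditional pdf (x + y^2)/(2x + 8/3) on 0 <= y <= 2
    return (x + y**2) / (2*x + 8/3)

x0 = 1.0
num, _ = integrate.quad(lambda y: y * f_y_given_x(y, x0), 0, 2)
print(num, (2*x0 + 4) / (2*x0 + 8/3))   # both approximately 1.2857 (= 9/7)
```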

Tutorial Exercises

1. For each of the tutorial exercises (1 -6) of Chapter 2 calculate:

i) The marginal means and marginal variances.

ii) µ′11.

iii) The covariance.

iv) The correlation coefficient.

v) The regression of Y on X .

vi) The regression of X on Y .

2. Let E[X ] = 3, Var[X ] = 5, E[Y ] = 0 and Var[Y ] = 1. If X and Y are independent, find


(a) E[2 + X ]

(b) E[(2 + X )2]

(c) Var[2 + X ]

(d) E[2X + 1]

(e) Var[2X + 1]

(f) E[XY ]

(g) Var[XY ]

(h) E[X + 3Y ]

(i) Var[X + 3Y ]


Chapter 5

Distributions of Sample Statistics

Assumed statistical background

• Introstat - Chapter 8,9 and 10

• Chapter 1-4 of this course

5.1 Random samples and statistics

In applying probability concepts to real world situations (e.g. planning or decision making), we usually need to know the distributions of the underlying random variables, such as heights of people in a population (for a clothing manufacturer, say), sizes of insurance claims (for the actuary), or numbers of calls through a telephone switchboard (for the telecommunications engineer). Each random variable is defined in terms of a “population”, which may, in fact, be somewhat hypothetical, and will almost always be very large or even infinite. We cannot then determine the distribution of the random variable by total census or enumeration of the population, and we resort to sampling. Typically, this involves three conceptual steps:

1. Conceptualize or visualize a convenient hypothetical population defined on possible values of the random variable itself (rather than the real observational units). For example, the population may be viewed as all non-negative integers, or all real numbers (rather than hours of the day or students). Sometimes a “convenient” population may include strictly impossible situations; for example, we may view the population of student heights as all real numbers from −∞ to ∞!

2. Use a combination of professional judgment and mathematical modeling to postulate a distributional form on this population, e.g.:

• Persons' heights assumed N(µ, σ^2) on all real numbers;

• Actuarial claims assumed to be Gamma(α, λ) distributed on positive real numbers;

• Number of calls through a switchboard in any given period assumed Poisson(λ).

Note that this will usually leave a small number of parameters unspecified at this point, to be “estimated” from data.

3. Observe an often quite small number of actual instances (outcomes of random experiments, or realizations of random variables), the “sample”, and use the assumed distributional forms to generalize sample results to the entire assumed population, by estimating the unknown parameters.


The critical issue here is the choice of the sample. In order to make the extension of sample results to the whole population in any way justifiable, the sample needs to be “representative” of the population. We need now to make this concept precise. Consider for example:

• Are the heights of students in the front row of the lecture room representative of all UCT students?

• Would the number of calls through a company switchboard during 09h00-10h00 on Monday morning be representative?

• Would 10 successive insurance claims be representative of claims over the entire year?

The above examples suggest two possible sources of non-representativeness, viz. (i) observing certain parts of the population preferentially, and (ii) observing outcomes which are not independent of each other. One way to ensure representativeness is to deliberately impose randomness on the selected population, in a way which ensures some uniformity of coverage. If X is the random variable in whose distribution we are interested (e.g. heights of male students) then we attempt to design a scheme whereby each male student is equally likely to be chosen, independently of all others. Quite simply, each observation is then precisely a realization of X. Practical issues of ensuring this randomness and independence will be covered in later courses in this department, but it is useful to think critically of any sampling scheme in these terms. In an accompanying tutorial, there are a few examples to think about. Discuss them amongst yourselves.

For this course, we need to define the above concepts in a rigorous mathematical way. For this purpose, we shall refer to the random variable X in which we are interested as the population random variable, with distribution function given by F_X(x), etc. Any observation, or single realization of X, will usually be denoted by subscripts, e.g. X_1, X_2, . . . , X_i, . . .. The key concept is that of a random sample, defined as follows:

Definition: A random sample of size n of a population random variable X is a collection of n iid random variables X_1, X_2, . . . , X_n all having the same distribution as X.

It is usual to summarize the results of a random sample by a small number of functions of X_1, X_2, . . . , X_n. You will be familiar with “5-number” summaries, and with the sample mean and variance (or even higher order sample moments). All of these summaries have the property that they can be calculated from the sample, without any knowledge of the distribution of X. Any summary which satisfies this property is called a “statistic”. Formally:

Definition: A statistic is any function of a random sample which does not depend on any unknown parameters of the distribution of the population random variable.

Thus, a function such as \left(\sum_{i=1}^{n} X_i^4\right) \big/ \left(\sum_{i=1}^{n} X_i^2\right)^2 would be a statistic, but \sum_{i=1}^{n}(X_i - \mu_X)^2 would generally not be (unless the population mean µ_X were known for sure a priori).

Let T(X_1, X_2, . . . , X_n) be a statistic. It is important to realize that T(X_1, X_2, . . . , X_n) is a random variable, one which takes on the numerical value T(x_1, x_2, . . . , x_n) whenever we observe the joint event defined by X_1 = x_1, X_2 = x_2, . . . , X_n = x_n. If we are to use this statistic to draw inferences about the distribution of X, then it is important to understand how observed values of T(X_1, X_2, . . . , X_n) vary from one sample to the next: in other words, we need to know the probability distribution of T(X_1, X_2, . . . , X_n). This we are able to do, using the results of the previous chapters, and we shall be doing so for a number of well-known cases in the remainder of this chapter. In doing so, we shall largely be restricting ourselves to the case in which the distribution of the population random variable X is normal with mean µ and variance σ^2, one or both of which are unknown. The central limit theorem will allow us to use the same results as an approximation for non-normal samples in some cases.


5.2 Distributions of sample mean and variance for Gaussian distributed populations

Assume we draw a sample from a Gaussian population (e.g. X_i ∼ N(µ_i, σ_i^2)); recall that the moment generating function of X_i is given by

M_{X_i}(t) = \exp\left\{\mu_i t + \frac{1}{2}\sigma_i^2 t^2\right\},

remember the t is keeping a space:

M_{X_i}(♠) = \exp\left\{\mu_i ♠ + \frac{1}{2}\sigma_i^2 ♠^2\right\}

The sample mean is the statistic:

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.

Assuming that the sample does in fact satisfy our definition of a random sample, the mgf of \bar{X} is:

M_{\bar{X}}(t) = E[e^{t\bar{X}}]
  = E\left[\exp\left(t\,\frac{\sum X_i}{n}\right)\right]
  = E[\exp(\tfrac{t}{n}X_1)]\,E[\exp(\tfrac{t}{n}X_2)] \cdots E[\exp(\tfrac{t}{n}X_n)]   (given independence)
  = M_{X_1}(\tfrac{t}{n})\,M_{X_2}(\tfrac{t}{n}) \cdots M_{X_n}(\tfrac{t}{n})
  = [M_X(\tfrac{t}{n})]^n   (given identical distributions)
  = [M_X(♠)]^n   (♠ = \tfrac{t}{n})
  = \left[\exp\left\{\mu♠ + \tfrac{1}{2}\sigma^2♠^2\right\}\right]^n
  = \exp\left\{n\mu♠ + n\,\tfrac{1}{2}\sigma^2♠^2\right\}
  = \exp\left\{n\mu\,\tfrac{t}{n} + n\,\tfrac{1}{2}\sigma^2\left(\tfrac{t}{n}\right)^2\right\}
  = \exp\left\{\mu t + \tfrac{1}{2}\,\tfrac{\sigma^2}{n}\,t^2\right\}

which is the mgf of the Gaussian distribution with mean µ and variance σ^2/n. The central limit theorem told us that this was true asymptotically for all well-behaved population distributions, but for Gaussian sampling this is an exact result for all n.

The distribution of the sample variance is a little more complicated, and we shall have to approach this stepwise. As a first step, let U_i = (X_i − µ)/σ, and consider:

\sum_{i=1}^{n} U_i^2 = \frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma^2}.

By the corollary to Theorem 3.2, we know that this has the χ^2 distribution with n degrees of freedom. In principle we know its pdf (from the gamma distribution), and we can compute


as many moments as we wish. Integration of the density to obtain cumulative probabilities is more difficult, but fortunately this is what is given in tables of the χ^2 distribution. For any α (0 < α < 1), let us denote by χ^2_{n;α} the value which is exceeded with probability α. In other words, if V has the χ^2 distribution with n degrees of freedom, then:

Pr[V > \chi^2_{n;\alpha}] = \alpha.

Now \sum_{i=1}^{n}(X_i - \mu)^2/\sigma^2 is not a statistic, but in the special case in which µ is known, it is the ratio of a statistic to the remaining unknown parameter (σ^2), and can thus be used to draw inferences about σ^2. Let us briefly look further at this case, to see how knowledge of the distribution of \sum_{i=1}^{n}(X_i - \mu)^2/\sigma^2 allows us to draw inferences. Throughout this chapter, we will be illustrating how the results which we derive apply to concepts of point and interval estimation and of hypothesis tests, with which you should really be familiar. Further examination of the underlying theoretical principles and philosophies of these concepts will be dealt with in STA3030F.

Point estimation of σ: From the properties of the χ^2 (gamma) distributions, we know that:

E\left[\frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma^2}\right] = n.

Re-arranging terms, noting that σ is a constant (and not a random variable), even though it is unknown, we get:

E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] = \sigma^2.

Suppose now that, based on the observed values from a random sample X_1 = x_1, X_2 = x_2, . . . , X_n = x_n, we propose to use the following as an estimate of σ^2:

\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2.

From one sample, we can make no definite assertions about how good this estimate is. But if samples are repeated many times, and the same estimation procedure applied every time, then we know that we will average out at the correct answer in the long run. We say that the estimate is thus unbiased.

Hypothesis tests on σ: Now suppose that a claim is made that σ^2 = σ_0^2, where σ_0^2 is a given positive real number. Is this true? We might make this the “null” hypothesis, with an alternative given by σ^2 > σ_0^2. If the null hypothesis is true, then from the properties of the χ^2 distribution, we know that:

Pr\left[\frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma_0^2} \ge \chi^2_{n;\alpha}\right] = \alpha.

Suppose now that for a specific observed sample, we find that

\frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma_0^2}

is in fact larger than χ^2_{n;α} for some suitably small value of α. What do we conclude? We cannot be sure whether the null hypothesis is true or false. But we do know that either the null hypothesis is false, or we have observed a low probability event (one that occurs with probability less than α). For sufficiently small α, we would be led to “reject” the null hypothesis at a 100α% significance level.


Confidence interval for σ: Whatever the true value of σ, we know from the properties of the χ^2 distribution that:

Pr\left[\chi^2_{n;1-\alpha/2} \le \frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma^2} \le \chi^2_{n;\alpha/2}\right] = 1 - \alpha

or, after re-arranging terms, that:

Pr\left[\frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\chi^2_{n;\alpha/2}} \le \sigma^2 \le \frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\chi^2_{n;1-\alpha/2}}\right] = 1 - \alpha.

An important point to note is that this is a statement of probabilities concerning the outcome of a random sample from a normal distribution with known mean µ, and fixed (but unknown) variance σ^2. It is NOT a probability statement about values of σ^2.

In view of this result, we could state on the basis of a specific observed random sample, that we have a substantial degree of confidence in the claim that σ^2 is included in the interval

\left[\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{\chi^2_{n;\alpha/2}}\ ;\ \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{\chi^2_{n;1-\alpha/2}}\right].

In any one specific case, the assertion is either true or false, but we don't know which. However, if we repeat the procedure many times, forming intervals in this way every time, then in the long run, on average, we will be correct a proportion 100(1 − α)% of the time, which is why the above interval is conventionally called a 100(1 − α)% confidence interval for σ^2.

Example: A factory manufactures ball bearings of a known mean diameter of 9 mm, but individual ball bearings are assumed to be normally distributed with a mean of 9 mm and a standard deviation of σ.

A sample of 10 ball bearings is taken, and their diameters in mm (x_1, x_2, . . . , x_10) are carefully measured. It turns out that Σ_{i=1}^{10} (x_i − 9)² = 0.00161.

Our estimate of σ² is thus 0.00161/10 = 0.000161, while that of the standard deviation is √0.000161 = 0.0127. A 95% confidence interval can be calculated by looking up in tables that χ²_{10;0.025} = 20.483 and χ²_{10;0.975} = 3.247. The confidence interval for σ² is thus:

[ 0.00161/20.483 ; 0.00161/3.247 ] = [ 7.86 × 10⁻⁵ ; 49.58 × 10⁻⁵ ].

The corresponding confidence interval for σ is obtained by taking square roots, to give[0.0089 ; 0.0223].

Suppose, however, that the whole point of taking the sample was because of our skepticism about a claim made by the factory manager that σ ≤ 0.009. We test this against the alternative hypothesis that σ > 0.009. For this purpose, we calculate the ratio 0.00161/(0.009)², which comes to 19.88. From χ² tables we find that χ²_{10;0.05} = 18.307 while χ²_{10;0.025} = 20.483. We can thus say that we would reject the factory manager's claim at the 5% significance level, but not at the 2½% level (or, alternatively, that the p-value for this test is between 0.025 and 0.05). There is evidently reason for our skepticism about the manager's claim!
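These table look-ups and the point estimate are easy to reproduce with software rather than printed tables. The following is a minimal sketch, assuming Python with numpy and scipy (a package choice made here for illustration, not prescribed by the course); it reproduces the known-mean (n = 10 degrees of freedom) calculations of this example.

    import numpy as np
    from scipy.stats import chi2

    n = 10                      # sample size (mean known, so df = n)
    ss = 0.00161                # observed sum of (x_i - 9)^2

    sigma2_hat = ss / n         # point estimate of sigma^2
    # 95% confidence interval for sigma^2 (note: chi2.ppf(0.975, n) is the
    # value the notes write as chi^2_{10;0.025}, i.e. the upper 2.5% point)
    lower = ss / chi2.ppf(0.975, df=n)
    upper = ss / chi2.ppf(0.025, df=n)
    print(sigma2_hat, (lower, upper), (np.sqrt(lower), np.sqrt(upper)))

    # One-sided test of H0: sigma <= 0.009 against H1: sigma > 0.009
    test_stat = ss / 0.009**2            # = 19.88
    p_value = chi2.sf(test_stat, df=n)   # upper-tail probability
    print(test_stat, p_value)            # p-value between 0.025 and 0.05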

The problem, however, is that in most practical situations, if we don't know the variance, we also do not know the mean. The “obvious” solution is to replace the population mean by the sample mean, i.e. to base the same sorts of inferential procedures as we had above on:

Σ_{i=1}^n (X_i − X̄)² / σ² = (n − 1)S² / σ²


where S² is the usual sample variance defined by:

S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)².

Certainly, if we define V_i = (X_i − X̄)/σ, then the above is Σ_{i=1}^n V_i², and it is easily shown that the V_i's are normally distributed with zero mean. The variance of V_i can be shown to be (n − 1)/n, which is slightly less than 1, but the real problem is that the V_i are not independent, since Σ_{i=1}^n V_i = 0, and thus Theorem 3.2 and its corollary no longer apply. And yet it seems intuitively that not too much can have changed. We now proceed to examine this case further.

Firstly, however, we need the following theorem:

Theorem 5.1 For random samples from the normal distribution, the statistics X and S 2 are independent random variables.

The proof is not part of this course.

Theorem 5.2 For random samples from the normal distribution, (n − 1)S 2/σ2 has the χ2 distri-bution with n − 1 degrees of freedom

The proof is not part of this course.

Comment: The only effect of replacing the population mean by the sample mean is to changethe distribution from the χ2 with n degrees of freedom to one with n − 1 degrees of freedom.The one linear function relating the n terms has “lost” us one degree of freedom.

Note that the expectation of (n − 1)S²/σ² is thus n − 1, and that E[S²] = σ², i.e. S² is now the unbiased estimator of σ².

Proof: The proof is not part of this course.

Example: Suppose, in the context of the previous example, that we discover that it is x̄ which is equal to 9 mm, and not µ, and that in fact Σ_{i=1}^{10} (x_i − x̄)² = 0.00161. The unbiased estimator of σ² is thus 0.00161/9 = 0.000179.

For the significance test, we now need to compare the observed value (19.88) of the statistic with critical points of the χ² distribution with 9 (not 10) degrees of freedom. We find now that χ²_{9;0.025} = 19.023, so that the hypothesis is rejected at the 2½% significance level. Similarly, the 95% confidence interval needs to be based on χ²_{9;0.025} and χ²_{9;0.975} = 2.70, which gives [8.46 × 10⁻⁵ ; 59.6 × 10⁻⁵] as the confidence interval for σ², or [0.0092 ; 0.024] for σ.

5.3 Application to χ2 goodness-of-fit tests

The previous section shows that if a sequence of random variables Z_1, Z_2, . . . , Z_n are approximately normally distributed with mean 0 and variance 1, and are “nearly” independent in the sense that the only relationship between them is of the form Σ_{i=1}^n Z_i = 0, then the χ² distributional result remains correct, but we “lose” one degree of freedom. Strictly speaking we needed Var[Z_i] = 1 − 1/n, but that is often a small effect. This suggests an intuitive rule, that the sum of squares of approximately standardized normal random variables will have a χ² distribution, but with one degree of freedom lost for each relationship between them. This intuition serves us well in many circumstances.

Recall the χ2 goodness of fit test from first year. We could view this in the following way. Suppose

we have a random sample X 1, X 2, . . . , X n, and a hypothesis that these come from a specified


distribution described by F(x). We partitioned the real line into a fixed number (say K) of contiguous intervals, and calculated the theoretical probability, say p_k, that X would fall in each interval k = 1, 2, . . . , K, assuming that the hypothesis were true, as follows:

p_k = ∫_{x ∈ interval k} f(x) dx

where f(x) is the pdf corresponding to F(x). Let the random variable N_k be the number of observations in which the random variable is observed to fall into interval k. Evidently, Σ_{k=1}^K N_k = n.

Any one N_k taken on its own is binomially distributed with parameters n and p_k. For “sufficiently large” n, the distribution of N_k can be approximated by the normal distribution with mean np_k and variance np_k(1 − p_k). (Conventional “wisdom” suggests that the approximation is reasonable if np_k > 5, or some would say > 10.) Thus:

(N_k − np_k) / √(np_k(1 − p_k))

has approximately the standard normal distribution. It is perhaps a little neater to work in terms of:

Z_k = (N_k − np_k) / √(np_k)

which is also then approximately normal with zero mean and variance equal to 1 − p_k. Note that the Z_k are related by the constraint that:

Σ_{k=1}^K √(np_k) Z_k = Σ_{k=1}^K (N_k − np_k) = 0.

Thus

Σ_{k=1}^K Z_k² = Σ_{k=1}^K (N_k − np_k)² / (np_k)

is a sum of squares of K terms which are approximately normally distributed with mean 0. If the choice of categories is reasonably balanced, then the p_k will be approximately equal, i.e. p_k ≈ 1/K, in which case each term has variance of approximately 1 − 1/K. This is then fully analogous to the situation in Theorem 5.2 (with K playing the role of n there), and we would expect the same result to occur, viz. that the above sum has the χ² distribution with K − 1 degrees of freedom. It does, in fact, turn out that this is a good approximation, which is thus the basis for the χ² test.

If the distribution function F(x) is not fully specified at the start, but involves parameters to be estimated from the data, this imposes further relationships between the Z_k, leading to further losses of degrees of freedom.
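As a concrete illustration of this statistic, here is a minimal sketch in Python with numpy/scipy (an assumed package choice; the counts below are hypothetical). It tests whether n = 100 observations spread over K = 5 equally likely categories are consistent with the hypothesized distribution; since no parameters are estimated, the reference distribution is χ² with K − 1 = 4 degrees of freedom.

    import numpy as np
    from scipy.stats import chi2

    observed = np.array([18, 24, 22, 15, 21])   # hypothetical counts N_k
    n = observed.sum()
    p = np.full(5, 1/5)                         # hypothesised probabilities p_k
    expected = n * p                            # n p_k

    # sum over k of (N_k - n p_k)^2 / (n p_k)
    chi2_stat = np.sum((observed - expected)**2 / expected)
    p_value = chi2.sf(chi2_stat, df=len(observed) - 1)   # K - 1 df
    print(chi2_stat, p_value)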

5.4 Student’s t distribution

For a random sample of size n from the N(µ, σ²) distribution, we know that:

(X̄ − µ) / (σ/√n)

has the standard normal distribution, and this fact can be used to draw inferences about µ if σ is known. For example, a test of the hypothesis that µ = µ_0 can be based on the fact (see normal tables) that:

Pr[ (X̄ − µ_0)/(σ/√n) > 1.645 ] = 0.05


if the hypothesis is true. Thus, if the observed value of this expression exceeds 1.645, then we could reject the hypothesis (in favour of µ > µ_0) at the 5% significance level. Similarly, a confidence interval for µ for known σ can be based on the fact that:

Pr[ −1.96 < (X̄ − µ)/(σ/√n) < +1.96 ] = 0.95

which after re-arrangement of the terms gives:

Pr[ X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n ] = 0.95

We must re-emphasize that the probability refers to random (sampling) variation in X , and notto µ which is viewed as a constant in this formulation.

In practice, however, the population variance is seldom known for sure. The “obvious” thing todo is to replace the population variance by the sample variance, i.e. to base inferences on:

T = (X̄ − µ) / (S/√n)

This is now a function of two statistics, viz. X̄ and S. Large values of the ratio can be due to large deviations in X̄ from the population mean or to values of S below the population standard deviation. Fortunately, we do know that X̄ and S are independent, and we know their distributions, and thus we should be able to derive the distribution of T. It is useful to approach this in the following manner. Let us first define:

Z = (X̄ − µ) / (σ/√n)

(which has the standard normal distribution) and:

U = (n − 1)S² / σ²

which has the χ² distribution with n − 1 degrees of freedom. Z and U are independent, and:

T = Z / √(U/(n − 1)).

In the following theorem, we derive the pdf of T in a slightly more general context which will proveuseful later.

Theorem 5.3 Suppose that Z and U are independent random variables having the standard normal distribution and the χ² distribution with m degrees of freedom respectively. Then the pdf of

T = Z / √(U/m)

is given by:

f_T(t) = [ Γ((m + 1)/2) / (√(mπ) Γ(m/2)) ] (1 + t²/m)^{−(m+1)/2}

Comments: The pdf for T defines the t-distribution, or more correctly Student's t distribution, with m degrees of freedom. It is not hard to see from the functional form that the pdf has a “bell-shaped” distribution around t = 0, superficially rather like that of the normal distribution. The t-distribution has higher kurtosis than the normal distribution, although it tends to the normal distribution as m → ∞.


For m > 2, the variance of T is m/(m − 2). The variance does not exist for m ≤ 2, and infact for m = 1, even the integral defining E[T ] is not defined (although the median is still atT = 0). The t-distribution with 1 degree of freedom is also termed the Cauchy distribution.

As you should know, tables of the t-distribution are widely available. Values in these tables can be expressed as numbers t_{m;α}, such that if T has the t-distribution with m degrees of freedom, then:

Pr[T > t_{m;α}] = α.

Hypothesis tests and confidence intervals can thus be based upon observed values of the ratio:

T = (X̄ − µ) / (S/√n)

and critical values of the t-distribution with n − 1 degrees of freedom. This should be very familiar to you.
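For completeness, a minimal sketch of this one-sample t procedure in Python (numpy/scipy assumed; the data vector and the hypothesized mean µ_0 = 9.0 are hypothetical values chosen only to illustrate the calls):

    import numpy as np
    from scipy.stats import t

    x = np.array([9.1, 8.8, 9.4, 9.0, 8.7, 9.3, 9.2, 8.9])   # hypothetical sample
    mu0 = 9.0
    n = len(x)

    t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
    p_two_sided = 2 * t.sf(abs(t_stat), df=n - 1)
    ci = x.mean() + np.array([-1, 1]) * t.ppf(0.975, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
    print(t_stat, p_two_sided, ci)   # test statistic, p-value, 95% CI for mu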

5.5 Applications of the t distribution to two-sample tests

A greater understanding of Theorem 5.3 can be developed by looking at the various types of two-sample tests, and the manner in which different t-tests occur in each of these. These were allcovered in first year, but we need to examine the origins of these tests.

Suppose that X_{A1}, X_{A2}, . . . , X_{Am} is a random sample of size m from the N(µ_A, σ²_A) distribution, and that X_{B1}, X_{B2}, . . . , X_{Bn} is a random sample of size n from the N(µ_B, σ²_B) distribution. We suppose further that the two samples are independent. Typically, we are interested in drawing inferences about the difference in population means:

∆_{AB} = µ_A − µ_B.

Let X̄_A and X̄_B be the corresponding sample means. We know that X̄_A is normally distributed with mean µ_A and variance σ²_A/m, while X̄_B is normally distributed with mean µ_B and variance σ²_B/n. Since X̄_A and X̄_B are independent (since the samples are independent), we know further that X̄_A − X̄_B is normally distributed with mean ∆_{AB} and variance:

σ²_A/m + σ²_B/n

and thus the term:

Z_{AB} = (X̄_A − X̄_B − ∆_{AB}) / √(σ²_A/m + σ²_B/n)

has the standard normal distribution.

If the variances are known, then we can immediately move to inferences about ∆_{AB}. If they are not known, we will wish to use the sample variances S²_A and S²_B. The trick inspired by Theorem 5.3 is to look for a ratio of a standard normal variate to the square root of a χ² variate, and hope that the unknown population variances will cancel. Certainly, (m − 1)S²_A/σ²_A has the χ² distribution with m − 1 degrees of freedom and (n − 1)S²_B/σ²_B has the χ² distribution with n − 1 degrees of freedom. Furthermore, we know that their sum:

U_{AB} = (m − 1)S²_A/σ²_A + (n − 1)S²_B/σ²_B

has the χ² distribution with m + n − 2 degrees of freedom. (Why?) This does not, however, seem to lead to any useful simplification in general.


But see what happens if σ²_A = σ²_B = σ², say. In this case Z_{AB} and U_{AB} become:

Z_{AB} = (X̄_A − X̄_B − ∆_{AB}) / (σ √(1/m + 1/n))

and

U_{AB} = [ (m − 1)S²_A + (n − 1)S²_B ] / σ² = (m + n − 2) S²_pool / σ²

where the “pooled” variance estimator is defined by:

S²_pool = [ (m − 1)S²_A + (n − 1)S²_B ] / (m + n − 2).

Now if we take T = Z_{AB} / √(U_{AB}/(m + n − 2)), then we get:

T = (X̄_A − X̄_B − ∆_{AB}) / (S_pool √(1/m + 1/n))

which by Theorem 5.3 has the t-distribution with m + n − 2 degrees of freedom. We can thus carry out hypothesis tests, or construct confidence intervals for ∆_{AB}.

Example: Suppose we have observed the results of two random samples as follows:

m = 8,  x̄_A = 61,  Σ_{i=1}^8 (x_{Ai} − x̄_A)² = 1550
n = 6,  x̄_B = 49,  Σ_{i=1}^6 (x_{Bi} − x̄_B)² = 690

We are required to test the null hypothesis that µ_A − µ_B ≤ 5, against the one-sided alternative that µ_A − µ_B > 5, at the 5% significance level, under the assumption that the variances of the two populations are the same. The pooled variance estimate is:

s²_pool = (1550 + 690) / (8 + 6 − 2) = 186.67

and thus, under the null hypothesis, the t-statistic works out to be:

t = (61 − 49 − 5) / √(186.67 (1/8 + 1/6)) = 0.949

The 5% critical value for the t-distribution with 8 + 6 − 2 = 12 degrees of freedom is 1.782, and we thus cannot reject the null hypothesis.
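The same computation can be checked numerically. The sketch below (Python with numpy/scipy, an assumed choice of tools) works directly from the summary statistics given above, since the raw observations are not listed:

    import numpy as np
    from scipy.stats import t

    m, xbar_A, ssq_A = 8, 61.0, 1550.0   # size, mean, sum of squared deviations (A)
    n, xbar_B, ssq_B = 6, 49.0, 690.0    # same for sample B

    s2_pool = (ssq_A + ssq_B) / (m + n - 2)                            # 186.67
    t_stat = (xbar_A - xbar_B - 5) / np.sqrt(s2_pool * (1/m + 1/n))    # 0.949
    crit = t.ppf(0.95, df=m + n - 2)                                   # 1.782
    print(s2_pool, t_stat, crit, t_stat > crit)   # False: cannot reject H0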

In general, when variances are not equal, there appears to be no way in which a ratio of a normal to the square root of a χ² can be constructed in such a way that both unknown population variances cancel out. This is called the Behrens-Fisher problem. Nevertheless, we would expect that a ratio of the form

(X̄_A − X̄_B − ∆_{AB}) / √(S²_A/m + S²_B/n)

“should have something like” a t-distribution. Empirical studies (e.g. computer simulation) have shown that this is indeed true, but the relevant “degrees of freedom” giving the best approximation to the true distribution of the ratio depend on the problem structure in a rather complicated manner (and usually turn out to be a fractional number, making it hard to interpret). A number of approximations have been suggested on the basis of numerical studies, one of which is incorporated into the STATISTICA package.


There is one further special case, however, which is interesting because it allows, ironically, the relaxation of other assumptions. This is the case in which m = n. We can now pair the observations (at this stage in any way we like), and form the differences Y_i = X_{Ai} − X_{Bi}, say. The Y_i will be iid normally distributed with mean ∆_{AB} and with unknown variance σ²_Y = σ²_A + σ²_B. The problem thus reduces to the problem of drawing inferences about the population mean for a single sample, when the variance is unknown. Note that we only need to estimate σ²_Y, and not the individual variances σ²_A and σ²_B.

In order to apply this idea, we only need the Y_i to be iid. It is perfectly permissible to allow X_{Ai} and X_{Bi} to share some dependencies for the same i. They might be correlated, or both of their means may be shifted from the relevant population means by the same amount. This allows us to apply the differencing technique to “paired” samples, i.e. when each pair X_{Ai}, X_{Bi} relates to observations on the same subject under two different conditions. For example, each “i” may relate to a specific hospital patient chosen at random, while X_{Ai} and X_{Bi} refer to responses to two different drugs tried at different times. All we need to verify is that the differences are iid normal, after which we use one-sample tests.

Example: A random sample of ten students is taken, and their results in economics and statistics are recorded in each case as follows.

Student   Economics   Statistics   Difference (Y)
1         73          66           7
2         69          71           -2
3         64          49           15
4         87          81           6
5         58          61           -3
6         84          74           10
7         96          89           7
8         58          60           -2
9         90          85           5
10        82          76           6

If care was taken in the selection of the random sample of students, then the statistics resultsand the economics results taken separately would represent random samples. But the twomarks for the same student are unlikely to be independent, as a good student in one subjectis usually likely to perform well in another. But the last column above represents a singlerandom sample of the random variable defined by the amount by which the economics markexceeds the statistics mark, and these seem to be plausibly independent. For example, if thereis no true mean difference (across the entire population) between the two sets of marks, thenthere is no reason to suggest that knowing that one student scored 5% more on economicsthan on statistics has anything to do with the difference experienced by another student,whatever their absolute marks.

The test of the hypothesis of no difference between the two courses is equivalent to the null hypothesis that E[Y] = 0. The sample mean and sample variance of the differences above are 4.9 and 32.99 respectively. The standard (one-sample) t-statistic is thus 4.9/√(32.99/10) = 2.698. Relevant critical values of the t-distribution with 9 degrees of freedom are t_{9;0.025} = 2.262 and t_{9;0.01} = 2.821. Presumably a two-sided test is relevant (as we have not been given any reason why differences in one direction should be favoured over the other), and thus the “p-value” lies between 5% (2 × 0.025) and 2% (2 × 0.01). Alternatively, we can reject the hypothesis at the 5%, but not at the 2% significance level.
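The paired analysis is simply a one-sample t-test applied to the column of differences. A minimal sketch in Python (numpy/scipy assumed) using the marks tabulated above:

    import numpy as np
    from scipy.stats import ttest_rel

    economics = np.array([73, 69, 64, 87, 58, 84, 96, 58, 90, 82])
    stats_marks = np.array([66, 71, 49, 81, 61, 74, 89, 60, 85, 76])

    # paired (related-samples) t-test; two-sided by default
    t_stat, p_value = ttest_rel(economics, stats_marks)
    print(t_stat, p_value)   # t = 2.698; p-value between 0.02 and 0.05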


5.6 The F distribution

In the previous section, we have looked at the concept of a “two-sample” problem. Our concernthere was with comparing the means of two populations. Now, however, let us look at comparing

their variances. There are at least two reasons why we may wish to do this:

1. We have a real interest in knowing whether one population is more or less variable thananother. For example, we may wish to compare the variability in two production processes,or in two laboratory measurement procedures.

2. We may merely wish to know whether we can use a pooled variance estimate for the t-testfor comparing the means.

In the case of variances, it is convenient to work in terms of ratios, i.e. σ²_A/σ²_B. Equality of variances means that this ratio is 1. We have available to us the sample variances S²_A and S²_B, and we might presumably wish to base inferences upon S²_A/S²_B. The important question is: what is the probability distribution of S²_A/S²_B for any given population ratio σ²_A/σ²_B?

A and V = (n − 1)S 2B/σ2B have χ2 distributions with m − 1 and

n − 1 degrees of freedom respectively. Thus let us consider the function:

U/(m − 1)

V /(n − 1) =

S 2A/S 2Bσ2

A/σ2B

.

Since we know the distributions of U and V , we can derive the distribution of the above ratio,which will give us a measure of the manner in which the sample variance ratio departs from thepopulation variance ratio. The derivation of this distribution is quite simple in principle, althoughit becomes algebraically messy. We shall not give the derivation here, but recommend it as anexcellent exercise for the student. We simply state the result in the following theorem.

Theorem 5.4 Suppose that U and V are independent random variables, having χ² distributions with r and s degrees of freedom respectively. Then the probability density function of

Z = (U/r) / (V/s)

is given by:

f_Z(z) = [ Γ((r + s)/2) / (Γ(r/2)Γ(s/2)) ] (r/s)^{r/2} z^{r/2−1} (1 + rz/s)^{−(r+s)/2}   for z > 0.

The distribution defined by the above pdf is called the F-distribution with r and s degrees of freedom (often called the numerator and denominator degrees of freedom respectively). One wordof caution: when using tables of the F-distribution, be careful to read the headings, to see whetherthe numerator degree of freedom is shown as the column or the row. Tables are not consistent inthis sense.

As with the other distributions we have looked at, we shall use the symbol F_{r,s;α} to represent the upper 100α% critical value for the F-distribution with r and s degrees of freedom. In other words, with Z defined as above:

Pr[Z > F_{r,s;α}] = α.

Tables are generally given separately for a number of values of α, each of which give F r,s;α forvarious combinations of r and s.


Example: We have two alternative laboratory procedures for carrying out the same analysis. Let us call these A and B. Seven analyses have been conducted using procedure A (giving measurements X_{A1}, . . . , X_{A7}), and six using procedure B (giving measurements X_{B1}, . . . , X_{B6}). We wish to test the null hypothesis that σ²_A = σ²_B, against the alternative that σ²_A > σ²_B, at the 5% significance level. Under the null hypothesis, S²_A/S²_B has the F-distribution with 6 and 5 degrees of freedom. Since F_{6,5;0.05} = 4.95, it follows that:

Pr[ S²_A/S²_B > 4.95 ] = 0.05

Suppose now that we observe S 2A = 5.14 and S 2B = 1.08. This may look convincing, but theratio is only 4.76, which is less than the critical value. We can’t at this stage reject the nullhypothesis (although I would not be inclined to “accept” it either!).

You may have noticed that tables of the F-distribution are only provided for smaller values of α,e.g. 10%, 5%, 2.5% and 1%, all of which correspond to variance ratios greater than 1. For one-sided hypothesis tests, it is always possible to define the problem in such a way that the alternativeinvolves a ratio greater than 1. But for two-sided tests, or for confidence intervals, one does needboth ends of the distribution. There is no problem! Since

Z = (U/r) / (V/s)

has (as we have seen above) the F-distribution with r and s degrees of freedom, it is evident that:

Y = 1/Z = (V/s) / (U/r)

Now, by definition:

1 − α = Pr[Z > F r,s;1−α] = Pr

Y <

1

F r,s;1

−α and thus:

Pr

Y ≥ 1

F r,s;1−α

= α.

Since Y is continuous, and has the F-distribution with s and r degrees of freedom, it follows therefore that:

F_{s,r;α} = 1/F_{r,s;1−α}

i.e. for smaller values of α, and thus larger values of 1 − α, we have:

F_{r,s;1−α} = 1/F_{s,r;α}.

Example (Confidence Intervals): Suppose that we wish to find a 95% confidence interval for the ratio σ²_A/σ²_B in the previous example. We now know that:

Pr[ F_{6,5;0.975} < (S²_A/σ²_A) / (S²_B/σ²_B) < F_{6,5;0.025} ] = 0.95

which after some algebraic re-arrangement gives:

Pr[ (1/F_{6,5;0.025}) (S²_A/S²_B) < σ²_A/σ²_B < (1/F_{6,5;0.975}) (S²_A/S²_B) ] = 0.95.

The tables give us F 6,5;0.025 = 6.98 directly, and thus 1/F 6,5;0.025 = 0.143. Using the aboverelationship, we also know that 1/F 6,5;0.975 = F 5,6;0.025 = 5.99 (from tables). Since theobserved value of S 2A/S 2B is 5.14/1.08=4.76, it follows that the required 95% confidenceinterval for the variance ratio is [0.682 ; 28.51].
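The F quantiles used here can also be obtained from software rather than tables. A minimal sketch (Python, scipy assumed) reproducing this interval:

    import numpy as np
    from scipy.stats import f

    s2_A, s2_B = 5.14, 1.08
    r, s = 6, 5                      # numerator and denominator degrees of freedom
    ratio = s2_A / s2_B              # observed ratio, about 4.76

    F_upper = f.ppf(0.975, r, s)     # F_{6,5;0.025} in the notes' notation (= 6.98)
    F_lower = f.ppf(0.025, r, s)     # F_{6,5;0.975} = 1/F_{5,6;0.025}
    ci = (ratio / F_upper, ratio / F_lower)
    print(ci)                        # approximately [0.682 ; 28.5]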


Tutorial exercises

1. Suppose we are interested in students' smoking habits (e.g. the distribution of weekly smoking (number of cigarettes during one week, say)). The correct procedure would be to draw a sample of students at random (from student numbers, say) and to interview each in order to establish their smoking patterns. This may be a difficult or expensive procedure. Comment on the extent to which the following alternative procedures also satisfy our definition of a random sample:

(a) Interview every 100th student entering Cafe Nescafe during one week.

(b) Interview all residents of Smuts Hall, or of Smuts and Fuller Halls.

(c) E-mail questionnaires to all registered students and analyze responses received.

(d) Include all relevant questions on next year's STA204F sample questionnaire.

(e) Interview all students at the Heidelberg next Friday night.

2. Laboratory measurements on the strength of some material are supposed to be distributed normally around a true mean material strength µ, with variance σ². Let X_1, X_2, . . . denote individual measurements. Based on a random sample of size 16, the following statistic was computed: Σ_{i=1}^{16} (x_i − x̄)² = 133. Can you reject the hypothesis: σ² = 4.80?

3. In the problem of Question 2, suppose that for a sample of size 21, a value of 5.1 for thestatistic s2 was observed. Construct a 95% confidence interval for σ2.

4. Let X ∼ N(3, 9), W ∼ N(0, 4), Y ∼ N(−3, 25) and Z ∼ χ²_23 (chi-squared with 23 degrees of freedom) be independent random variables. Write down the distributions of the following random variables:

(a) T = (1/3)X − 1
(b) D = T²
(c) A = W/2
(d) B = (Y + 3)²/25
(e) K = (X + Y)²/34
(f) E = K + Z
(g) G = 6(W²)/E
(h) S = X + W + Y

5. W, X, Y, Z are independent random variables, where W, X, and Y have the following normal distributions:

W ∼ N(0, 1),  X ∼ N(0, 1/9)  and  Y ∼ N(0, 1/16)

and Z ∼ χ²_40 (chi-squared distribution with 40 degrees of freedom). We assert that

T = cY / √(bX² + W² + Z)

has a t distribution with m degrees of freedom. For what values of c, b and m is the assertion true, and what results justify the assertion?


6. Eight operators in a manufacturing plant were sent on a special training course. The times, in minutes, that each took for a particular activity were measured before and after the course; these were as follows:

Operator               1   2   3   4   5   6   7   8
Time (before course)   23  18  16  15  19  21  31  22
Time (after course)    17  14  13  13  12  20  14  17

Would you conclude that the course has speeded up their times? Is there evidence for theclaim that the course, on average, leads to a reduction of at least one minute per activity?

7. One of the occupational hazards of being an airplane pilot is the hearing loss that results from being exposed to high noise levels. To document the magnitude of the problem, a team of researchers measured the cockpit noise levels in 18 commercial aircraft. The results (in decibels) are as follows:

Plane  Noise level (dB)   Plane  Noise level (dB)   Plane  Noise level (dB)
1      74                 7      80                 13     73
2      77                 8      75                 14     83
3      80                 9      75                 15     86
4      82                 10     72                 16     83
5      82                 11     90                 17     83
6      85                 12     87                 18     80

(a) Find a 95% confidence interval for µ by firstly assuming that σ² = 27 and secondly by assuming that σ² is unknown and that you have to estimate it from the sample. (Assume that you are sampling from a normal population.)

(b) Find a 95% confidence interval for σ² by firstly assuming that µ = 80.5 and secondly by assuming that µ is unknown and that you have to estimate it from the sample. (Assume that you are sampling from a normal population.)

8. Two procedures for refining oil have been tested in a laboratory. Independent tests with each procedure yielded the following recoveries (in ml per l oil):

Procedure A: 800.9; 799.1; 824.7; 814.1; 805.9; 798.7; 808.0; 811.8; 796.6; 820.5
Procedure B: 812.6; 818.3; 523.0; 911.2; 823.9; 841.1; 834.7; 824.5; 841.8; 819.4; 809.9; 837.5; 826.3; 817.5

We assume that recovery per test is distributed N(µ_A, σ²_A) for procedure A, and N(µ_B, σ²_B) for procedure B.

(a) If we assume σ²_A = σ²_B, test whether Procedure B (the more expensive procedure) has higher recovery than A. Construct a 95% confidence interval for ∆_{AB} = µ_B − µ_A.

(b) What if we cannot assume σ²_A = σ²_B?


Chapter 6

Regression analysis

Assumed statistical background

• Introstat - Chapter 12

• Bivariate distributions and their moments - Chapters 2 and 4 of these notes

Maths Toolbox

• Maths Toolbox B.5.

Assumed software background

You can use any package with which you are comfortable, e.g. Excel, SPSS or Statistica. You need to be able to (a) complete a regression analysis, (b) supply residual diagnostics and (c) interpret the output of the package.

6.1 Introduction

In Chapter two we considered the situations in which we simultaneously observed two (or more) variables on each member/object of a sample (or observed two or more outcomes in each independent trial of a random experiment).

Example RA: We draw a sample of size 10 from the STA2030S class and measure the height and weight of each member in the sample:

          1     2     3     4     5     6     7     8     9     10
height    1.65  1.62  1.67  1.74  1.65  1.69  1.81  1.90  1.79  1.58
weight    65    63    64    67    63    68    73    85    65    54

We can produce a scatter diagram (scatterplot, bivariate scatter) of the two variables (Figure 6.1). A random scatter of dots (points) would indicate that no relationship exists between the two variables; however, in figure 6.1 it is clear that we have a distinct correlation between weight and height (as height increases, the corresponding average weight increases).

In first year statistics (Introstat, Chapter 12) you considered two questions:


[Figure 6.1: Example RA - a bivariate scatterplot of weight (y-axis) versus height (x-axis) for a sample of size 10 from the STA2030S class]

• Is there a relationship between the variables (the correlation problem)?

If the answer to this question is yes, you continued with regression.

• How do we predict values for one variable, given particular values for the other variable(s)?(the regression problem)

In this chapter we are going to study (linear) regression: how one or more variables (explanatory variables, predictor variables, regressors or X variables) affect another variable (dependent variable, outcome variable, response variable or Y). In particular we are going to apply linear regression (fitting a straight line through the data). Note that in some instances we can apply a transformation to either the Y's or the X's (to force linearity) before we fit a straight line. Linearity refers to the fact that the left-hand side values (Y's) are a linear function of the parameters (β_0 and β_1) and a function of the X values.

6.2 Simple (linear) regression - model, assumptions

The term Simple (linear) regression refers to the case when we only have one explanatory variableX. The (simple linear) regression model 6.1 is:

Y_i = β_0 + β_1 X_i + ε_i    for i = 1, · · ·, n    (6.1)

where:

• Y i is the value of the response (observed, dependent) variable in the ith trial


• β_0 and β_1 are parameters

• X_i is a fixed (known) value of the explanatory variable in the ith trial

• ε_i is a random error term with the following properties:

ε_i ∼ (0, σ²) (that is, E[ε_i] = 0 and Var[ε_i] = σ²)

ε_i and ε_j are uncorrelated, that is

Cov[ε_i, ε_j] = 0 for all i, j with i ≠ j

Model 6.1 is simple in the sense that it only includes one explanatory variable and it is linear in the parameters. Y_i is a random variable, and the expectation and variance of Y_i are given by

E[Y_i] = E[β_0 + β_1 X_i + ε_i]
       = β_0 + β_1 X_i + E[ε_i]
       = β_0 + β_1 X_i

Var[Y_i] = Var[β_0 + β_1 X_i + ε_i]
         = Var[ε_i]
         = σ²

Furthermore, any Y_i and Y_j are uncorrelated.

If we assume that the error terms are also Gaussian (ε_i ∼ N[0, σ²]) then

Y_i ∼ N[β_0 + β_1 X_i, σ²].

(e.g. compare to the properties of the conditional distribution discussed in Chapter 2).

The parameters β_0 and β_1 in the simple regression model are also known as the regression parameters. β_0 is called the intercept of the regression line (the value Y takes when X is 0). β_1 is the slope of the regression line (the change in the conditional mean of Y per unit increase in X). The parameters are usually unknown and in this course we are going to estimate them by using the method of least squares, sometimes called ordinary least squares (OLS). The method of least squares minimizes the sum of the squared residuals (observed − fitted), Q = Σ_{i=1}^n (Y_i − Ŷ_i)².

We want to find estimates for β_0 and β_1 that minimize Q:

Q = Σ_{i=1}^n (Y_i − Ŷ_i)² = Σ_{i=1}^n (Y_i − β_0 − β_1 X_i)²

The values of β_0 and β_1 which minimize Q can be derived by differentiating Q with respect to β_0 and β_1:

∂Q/∂β_0 = −2 Σ_{i=1}^n (Y_i − β_0 − β_1 X_i)

∂Q/∂β_1 = −2 Σ_{i=1}^n X_i (Y_i − β_0 − β_1 X_i)


Denote the values of β_0 and β_1 that minimize Q by β̂_0 and β̂_1. If we set these partial derivatives equal to zero we obtain the following pair of simultaneous equations:

Σ_{i=1}^n (Y_i − β̂_0 − β̂_1 X_i) = 0

and

Σ_{i=1}^n X_i (Y_i − β̂_0 − β̂_1 X_i) = 0.

or, by bringing in the summation, we write

Σ_{i=1}^n Y_i − n β̂_0 − β̂_1 Σ_{i=1}^n X_i = 0

and

Σ_{i=1}^n X_i Y_i − β̂_0 Σ_{i=1}^n X_i − β̂_1 Σ_{i=1}^n X_i² = 0.

By rearranging the terms we obtain the so-called normal equations:

Σ_{i=1}^n Y_i = n β̂_0 + β̂_1 Σ_{i=1}^n X_i

and

Σ_{i=1}^n X_i Y_i = β̂_0 Σ_{i=1}^n X_i + β̂_1 Σ_{i=1}^n X_i²    (6.2)

By solving the equations simultaneously we obtain the following OLS estimates for β_0 and β_1:

β̂_1 = [ Σ_{i=1}^n X_i Y_i − (Σ_{i=1}^n X_i)(Σ_{i=1}^n Y_i)/n ] / [ Σ_{i=1}^n X_i² − (Σ_{i=1}^n X_i)²/n ]

β̂_0 = Ȳ − β̂_1 X̄.

For the example in figure 6.1, the regression line is given in figure 6.2.
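The sketch below applies these formulas to the Example RA data from the table in section 6.1; it is written in Python with numpy (an assumed, illustrative choice of software rather than the prescribed packages). Small differences from the coefficients printed in figure 6.2 can arise from rounding in the tabulated data.

    import numpy as np

    height = np.array([1.65, 1.62, 1.67, 1.74, 1.65, 1.69, 1.81, 1.90, 1.79, 1.58])
    weight = np.array([65, 63, 64, 67, 63, 68, 73, 85, 65, 54], dtype=float)
    n = len(height)

    # OLS slope and intercept from the normal equations
    b1 = (np.sum(height * weight) - height.sum() * weight.sum() / n) / \
         (np.sum(height**2) - height.sum()**2 / n)
    b0 = weight.mean() - b1 * height.mean()
    print(b0, b1)   # compare with the fitted line shown in figure 6.2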

The estimated regression line (figure 6.2) , fitted by OLS has the following properties:

1. The sum of the residuals is zero.


[Figure 6.2: Least-squares regression line for example RA: weight versus height for a sample of size 10 from the STA2030S class. Fitted line: y = −59.398 + 73.858x, R² = 0.7848.]

Σ_{i=1}^n ε̂_i = Σ_{i=1}^n (Y_i − Ŷ_i)
             = Σ_{i=1}^n (Y_i − β̂_0 − β̂_1 X_i)
             = Σ_{i=1}^n Y_i − Σ_{i=1}^n β̂_0 − Σ_{i=1}^n β̂_1 X_i
             = Σ_{i=1}^n Y_i − n β̂_0 − β̂_1 Σ_{i=1}^n X_i
             = Σ_{i=1}^n Y_i − Σ_{i=1}^n Ŷ_i    (6.3)
             = 0

2. The sum of squared errors (or the residual sum of squares), denoted by SSE, is

SSE = Σ_{i=1}^n (Y_i − Ŷ_i)² = Σ_{i=1}^n ε̂_i²

(SSE is equivalent to the minimal value of Q used before; under OLS it is a minimum.) The error mean square or residual mean square (MSE) is defined here as the SSE divided by n − 2 degrees of freedom, thus

MSE = SSE/(n − 2) = Σ_{i=1}^n ε̂_i² / (n − 2)


MSE is an unbiased estimator for σ2, thus E [M SE ] = σ2 (Homework: Can you prove this?)

3. Σ_{i=1}^n Ŷ_i = Σ_{i=1}^n Y_i (easily seen from (6.3))

4. Σ_{i=1}^n X_i ε̂_i = Σ_{i=1}^n X_i (Y_i − Ŷ_i) = Σ_{i=1}^n X_i Y_i − β̂_0 Σ_{i=1}^n X_i − β̂_1 Σ_{i=1}^n X_i² = 0 (from the normal equations)

5. Σ_{i=1}^n Ŷ_i ε̂_i = 0

The formula’s for the estimators of β 0 and β 1 are quite messy, imagine how they are going to looklike if we have more than one explanatory variables (multivariate regression). Before we continuewe will move to matrix notation. Matrix notation is very useful in statistics. Without matrixnotation you cannot read most statistical handbooks, it is an elegant mathematical way of writingcomplicated models and the matrix notation makes the manipulation of these models much easier.

6.3 Matrix notation for simple regression

Model 6.1 can be written in matrix notation as:

Y = X β + ε (6.4)

where

• Y is an n × 1 observed response vector,

• ε is an n × 1 vector of uncorrelated random error variables with expectation E[ε] = 0, and variance matrix Var(ε) = σ²I,

• β is a 2 × 1 vector of regression coefficients that must be estimated, and

• X is an n × 2 matrix of fixed regressors or explanatory variables, whose rank is 2 (we will assume that n > 2). That is

Y = ( Y_1 )     X = ( 1  X_1 )     β = ( β_0 )     ε = ( ε_1 )
    ( Y_2 )         ( 1  X_2 )         ( β_1 )         ( ε_2 )
    (  ⋮  )         ( ⋮   ⋮  )                         (  ⋮  )
    ( Y_n )         ( 1  X_n )                         ( ε_n )

or we may rewrite 6.4 as:

( Y_1 )   ( 1  X_1 )            ( ε_1 )   ( β_0 + β_1 X_1 + ε_1 )
( Y_2 ) = ( 1  X_2 ) ( β_0 )  + ( ε_2 ) = ( β_0 + β_1 X_2 + ε_2 )
(  ⋮  )   ( ⋮   ⋮  ) ( β_1 )    (  ⋮  )   (          ⋮          )
( Y_n )   ( 1  X_n )            ( ε_n )   ( β_0 + β_1 X_n + ε_n )

The expectation and variance of the error vector ε can be written as:


E[ε] = ( E[ε_1], E[ε_2], · · ·, E[ε_n] )′ = ( 0, 0, · · ·, 0 )′ ;    Var[ε] = σ²I, where I is the n × n identity matrix.

The normal equations (6.2) can be written in terms of matrix notation as

X′X β̂ = X′Y    (6.5)

where

β̂ = ( β̂_0, β̂_1 )′,   X′X = ( n      ΣX_i  ),   X′Y = ( ΣY_i    )
                            ( ΣX_i   ΣX_i² )         ( ΣX_iY_i )

thus

( n      ΣX_i  ) ( β̂_0 )   ( n β̂_0 + β̂_1 ΣX_i     )   ( ΣY_i    )
( ΣX_i   ΣX_i² ) ( β̂_1 ) = ( β̂_0 ΣX_i + β̂_1 ΣX_i² ) = ( ΣX_iY_i )

To obtain the OLS estimates we pre-multiply 6.5 by the inverse of X′X (one of our assumptions was that X is of full column rank, so that (X′X)⁻¹ exists):

(X′X)⁻¹X′X β̂ = (X′X)⁻¹X′Y

thus

β̂ = (X′X)⁻¹X′Y

We consider the data of Example RA. At this stage we assume the X variable is height and the Y variable is weight; the matrices are then:

Y = ( 65 )     X = ( 1  1.65 )     β = ( β_0 )     ε = ( ε_1  )
    ( 64 )         ( 1  1.62 )         ( β_1 )         ( ε_2  )
    (  ⋮ )         ( ⋮   ⋮   )                         (  ⋮   )
    ( 54 )         ( 1  1.58 )                         ( ε_10 )

X′X = ( 10     17.1    )        X′Y = ( 669     )
      ( 17.1   29.3286 )              ( 1150.46 )

thus

(X′X)⁻¹ = (1/0.876) ( 29.3286   −17.1 )
                    ( −17.1      10   )


and

β̂ = (X′X)⁻¹X′Y = (1/0.876) ( 29.3286   −17.1 ) ( 669     ) = ( −59.3979 ) = ( β̂_0 )
                           ( −17.1      10   ) ( 1150.46 )   (  73.8584 )   ( β̂_1 )

and the fitted values are given by

β 1 and the fitted values are given by

Y = X β =

1 1.651 1.62· ·· ·· ·1 1.58

−59.3979

73.8584

=

62.4684931560.25273973

···

57.29840183

.

The residuals (ε̂) are then

ε̂ = ( 65 )   ( 62.46849315 )   (  2.531506849 )
    ( 64 ) − ( 60.25273973 ) = (  3.747260274 )
    (  ⋮ )   (      ⋮      )   (       ⋮      )
    ( 54 )   ( 57.29840183 )   ( −3.298401826 )

Note: Σ_{i=1}^{10} ε̂_i = 0.
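The matrix arithmetic above is exactly what a few lines of numpy reproduce. A minimal sketch (Python with numpy, an assumed choice of software), building X and Y as in the worked example:

    import numpy as np

    height = np.array([1.65, 1.62, 1.67, 1.74, 1.65, 1.69, 1.81, 1.90, 1.79, 1.58])
    weight = np.array([65, 63, 64, 67, 63, 68, 73, 85, 65, 54], dtype=float)

    X = np.column_stack([np.ones_like(height), height])   # n x 2 design matrix
    Y = weight

    XtX = X.T @ X                         # compare with the 2 x 2 matrix above
    XtY = X.T @ Y
    beta_hat = np.linalg.solve(XtX, XtY)  # (X'X)^{-1} X'Y

    fitted = X @ beta_hat
    residuals = Y - fitted
    print(beta_hat, residuals.sum())      # residuals sum to (numerically) zero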

6.4 Multivariate regression - model, assumptions

In the previous section we only considered one explanatory variable. We can generalize/extend the regression models discussed so far to the more general models where we simultaneously consider more than one explanatory variable. We will consider (p − 1) regressor variables.

The linear regression model is given by

Y = X β + ε (6.6)

where

• Y is an n × 1 observed response vector (the same as in the simple case),

• ε is an n × 1 vector of uncorrelated random error variables with expectation E[ε] = 0, and variance matrix Var(ε) = σ²I (the same as in the simple case),

• β is a p × 1 vector of regression coefficients that must be estimated (note the additional regression parameters), and

• X is an n × p matrix of fixed regressors or explanatory variables, whose rank is p (we will assume that n > p). The X matrix includes a column of ones (for the intercept term) as well as a column for each of the (p − 1) X variables.


The X matrix and the β vector are now:

X = ( 1  X_11  X_12  · · ·  X_1,p−1 )        β = ( β_0   )
    ( 1  X_21  X_22  · · ·  X_2,p−1 )            ( β_1   )
    ( ⋮   ⋮     ⋮            ⋮      )            (   ⋮   )
    ( 1  X_n1  X_n2  · · ·  X_n,p−1 )            ( β_p−1 )

The row subscript in the X matrix identifies the trial/case, and the column subscript identifies the X variable.

If β̂ is the ordinary least squares estimator (OLSE) of β, obtained by minimizing (Y − Xβ)′(Y − Xβ) over all β, then

β̂ = (X′X)⁻¹X′Y

The fitted values are Ŷ = X β̂ = X(X′X)⁻¹X′Y = HY, and the residual terms are given by the vector

ε̂ = Y − Ŷ
  = Y − X β̂
  = Y − X(X′X)⁻¹X′Y
  = Y − HY,    where H = X(X′X)⁻¹X′
  = (I − H)Y

H is called the hat matrix. It is an n × n square matrix, symmetric and idempotent (see Appendix B.5). Note that (I − H) is therefore also idempotent.

The variance-covariance matrix of the residuals is:

Var[ε̂] = Var[(I − H)Y]
       = (I − H) Var[Y] (I − H)′
       = (I − H) σ²I (I − H)′
       = σ²(I − H)

Properties of the regression coefficients:

1. E[β̂] = β (unbiased)

E[β̂] = E[(X′X)⁻¹X′Y]
      = (X′X)⁻¹X′ E[Y]
      = (X′X)⁻¹X′Xβ
      = β


2. Var[β̂] = σ²(X′X)⁻¹

Var[β̂] = Var[(X′X)⁻¹X′Y]
        = (X′X)⁻¹X′ Var[Y] X(X′X)⁻¹
        = (X′X)⁻¹X′ σ²I X(X′X)⁻¹
        = σ²(X′X)⁻¹

3. The estimated variance-covariance matrix of β̂ is MSE·(X′X)⁻¹ (MSE estimates σ²).

4. β̂ is the best linear unbiased estimator (BLUE) of β.

5. If we assume that the ε are Gaussian (normal) in distribution, then the BLUEs and the maximum likelihood estimators (MLEs) coincide (proof in STA3030).

6.5 Graphical residual analysis

To recap, a residual is the difference between an observed value and a fitted value, thus:

ε̂_i = Y_i − Ŷ_i

or, in matrix notation, the (n × 1) residual vector is

ε̂ = Y − Ŷ

Some properties of residuals were discussed before, e.g. the residuals sum to zero (implying that the mean of the residuals is 0). Thus the residuals are not independent random variables: if we know n − 1 residuals we can determine the last one. The estimated variance of the residuals is SSE/(n − 2), which is an unbiased estimator of the variance (σ²) (see STA3030F for properties of estimators).

The standardized residual is

ε̂_i / √MSE = ε̂_i / σ̂

Informally (by graphical analysis of residuals and raw data) we can investigate ways in which the data depart from the regression model (6.1 and 6.4):

To investigate whether the (linear) model is appropriate, we should plot the explanatory variable(s) on the X axis and the residuals on the Y axis. The residuals should fall in a horizontal band centered around 0 (with no obvious pattern, as in Figure 6.3 (a)). Figures 6.3 (b), (c) and (d) suggest non-standard properties of the residuals:

1. Non-linearity of regression function

We start off, in the case of the simple regression model, by using a scatterplot. Does this scatterplot suggest strongly that a linear regression function is appropriate? (e.g. see Figure 6.1). Once the model is fitted, we use the resulting residuals, plotting the explanatory variable(s) on the X axis and the residuals on the Y axis. Do the residuals depart from 0 in a systematic way? (Figure 6.3 (b) is an example that suggests a curvilinear regression function).

If the data depart from linearity we can consider some transformations of the data in order to create/force a linear model for the new resulting forms of the variables.

Logarithmic transformation: Consider the model


[Figure 6.3: Prototype residual plots, panels (a)-(d): residuals plotted against X (Neter et al. (1996), Figure 3.4)]

Y = β_0 β_1^X ε    (6.7)

However, this model (6.7) is intrinsically linear since it can be transformed to a linear format by taking log_10 (or log_e = ln) on both sides:

log_10 Y = log_10 β_0 + X log_10 β_1 + log_10 ε,   i.e.   Y′ = β′_0 + X β′_1 + ε′    (6.8)

Note that model (6.8) is now in the standard linear regression form, whereas (6.7) was not. Homework: Choose two functions like (6.7). Graph both functions, then transform them to obtain forms like (6.8). Plot these new relationships. What do you see?

Note sometimes it may only be necessary to take the logarithm of either X or Y .

Reciprocal Transformation

Consider the following model, which is not linear in X:

Y = β_0 + β_1 (1/X) + ε    (6.9)

The transformation


X′ = 1/X

makes model (6.9) linear.

Homework: Choose two functions like (6.9). Graph both functions, then transform them (reciprocal). Plot these new relationships. What do you see?

2. Non-constancy of error variance

Figure 6.3 (c) is an example where the error variance increases with X. In the case of multiple regression, a plot of residuals (on the Y-axis) versus fitted values (Ŷ on the X-axis) is an effective way to check whether or not the error variance is constant over the fitted values for the mean response.

Heteroscedasticity: error variance is not constant over all observations (the plot of the residuals fans out).

Homoscedasticity: error variance is constant (horizontal band of residuals).

When we have heteroscedasticity the OLSE parameter estimators are still unbiased but theydo not have minimum variance.

In the case of Poisson distributions (mean equals the variance) a useful transformation to stabilize the variance (and improve normality) is the square-root transformation (√y).

Standard deviation proportional to X

When Var[ε_i] = kX_i², an appropriate transformation (in the case of Gaussian error terms) is to divide by X_i:

Y_i = β_0 + β_1 X_i + ε_i;   ε_i ∼ N[0, kX_i²]

Y_i/X_i = β_0/X_i + β_1 + ε_i/X_i,   i.e.   Y′ = β′_1 X′_i + β′_0 + ε′

V ar[ε′] = V ar[ εi

X i] =

V ar[εi]

X 2i=

kX 2iX 2i

= k

3. Error terms are not independent

When the residuals are independent we should see a 'random' distribution of residuals around the base line (Figure 6.3 (a)). When data are obtained in a time order, plot the residuals against time (time does not need to be a variable in the model) (Figure 6.3 (d)).

4. Outliers are extreme observations. In a graph, a particular data point does not fall into the random scatter of residuals but outside it (Figure 6.4). Outliers need to be investigated carefully. Did we make a typing error? Did we measure it wrong? Outliers can be excluded from the analysis, but be careful - you only exclude an observation if you have cause!

In this course we are only going to examine outliers graphically and explore the standardized residuals (residual divided by its standard error, √MSE). We are not going to study measures of influence (most of them are based on the hat matrix) such as the studentized residual, DFFITS, Cook's distance and so on.


[Figure 6.4: Residual plot (against X) with one outlier]

5. The error terms are not Gaussian in distribution

Normality of the error terms will be investigated graphically (see specific graph in software).We expect to see a straight line. Departures from a straight line might be an indication thatthe residuals are not Gaussian.

Sometimes a transformation of the variables will bring the residuals nearer to normality(some variance stabilizing transformations will also improve normality).

6. Important explanatory variables omitted from the model

Plot your residuals against any variables omitted from the model. Do the residuals vary systematically?

6.6 Variable diagnostics

6.6.1 Analysis of variance (ANOVA)

The deviation Y_i − Ȳ (a quantity measuring the variation of the observation Y_i from the overall mean) can be decomposed as follows:

Y_i − Ȳ  =  (Ŷ_i − Ȳ)  +  (Y_i − Ŷ_i)
   I     =     II      +      III

where I is the total deviation, II is the deviation of the fitted OLS regression value from the overall mean, and III is the deviation of the observed value from the fitted value on the regression line (Figure 6.5).

The sum of the squared deviations (the cross-product terms are zero) is given by


[Figure 6.5: Decomposition of the deviation Y_i − Ȳ for one observation into (Ŷ_i − Ȳ) and (Y_i − Ŷ_i), relative to the fitted line Ŷ = β̂_0 + β̂_1 X - ANOVA]

Figure 6.5: Decomposition for one observation - ANOVA

ni=1

(Y i − Y )2

=

ni=1

(Y i − Y )2

+n

i=1(Y i − Y i)

2

SSTO = SSR SSE

where SSTO is the total sum of squares (corrected for the mean) with n − 1 degrees of freedom (df), SSR is the regression sum of squares with p − 1 degrees of freedom (p − 1 explanatory regressor variables), and SSE denotes the error sum of squares (defined before) with n − p degrees of freedom (p parameters are fitted).

In matrix notation, and for any value of p, the sums of squares are

SSTO = Y′Y − nȲ²

SSR = β̂′X′Y − nȲ² = Y′X β̂ − nȲ²

SSE = Y′Y − β̂′X′Y = Y′Y − Y′X β̂

A sum of squares divided by its degrees of freedom is called a mean square (MS). The breakdown of the total sum of squares and the associated degrees of freedom are displayed in an analysis of variance table (ANOVA table):

ANOVA Table


Source       SS                       df       MS
Regression   SSR = β̂′X′Y − nȲ²        p − 1    MSR = SSR/(p − 1)
Error        SSE = Y′Y − β̂′X′Y        n − p    MSE = SSE/(n − p)
Total        SSTO = Y′Y − nȲ²         n − 1

The coefficient of multiple determination is denoted by R² and is defined as

R² = SSR/SSTO = 1 − SSE/SSTO

Note that 0 ≤ R² ≤ 1.

R2 measures the proportionate reduction in the (squared) variation of Y achieved by the introduc-tion of the entire set of X variables considered in the model. The coefficient of multiple correlationR is the positive square root of R2. In the case of simple regression, R is the absolute value of thecoefficient of correlation.

Adding more explanatory variables to the model will increase R² (SSR becomes larger, SSTO does not change). A modified measure that adjusts for the increase in regressor variables is the adjusted coefficient of multiple determination, denoted by R²_a. It is defined by:

R²_a = 1 − [(n − 1)/(n − p)] (SSE/SSTO)

Note that R²_a < R².
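Continuing the numpy sketch from section 6.3 (again an illustrative, assumed choice of software rather than the prescribed packages), the ANOVA quantities and the two R² measures can be computed directly from the fitted model:

    import numpy as np

    height = np.array([1.65, 1.62, 1.67, 1.74, 1.65, 1.69, 1.81, 1.90, 1.79, 1.58])
    Y = np.array([65, 63, 64, 67, 63, 68, 73, 85, 65, 54], dtype=float)
    X = np.column_stack([np.ones_like(height), height])   # simple regression, p = 2
    n, p = X.shape

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    fitted = X @ beta_hat

    SSTO = np.sum((Y - Y.mean())**2)
    SSE = np.sum((Y - fitted)**2)
    SSR = SSTO - SSE
    MSR, MSE = SSR / (p - 1), SSE / (n - p)

    R2 = SSR / SSTO
    R2_adj = 1 - (n - 1) / (n - p) * SSE / SSTO
    print(R2, R2_adj, MSE)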

6.7 Subset selection of regressor variables - building the regression model

Although p regressor variables are available, not all of them may be necessary for an adequatefit of the model to the data. After the functional form of each regressor variable is obtained (i.e.X 2i , log(X j), X iX j , and so on), we seek a ’best’ subset of regressor variables. This ’best’ subsetis not necessarily unique but may be one of a unique set of ’best’ subsets.

To find a subset there are basically two strategies, all possible regressions and stepwise regression(which we take to include the special cases of forward selection and backward elimination).


6.7.1 All possible regressions

In the all possible regressions search procedure, all possible subsets (equations) are computed and a pool of 'best' models (subsets of variables) is selected under some criterion (R², adjusted R², MSE, C_p). The purpose of the all possible regressions approach is to identify a small pool of potential models (based on a specific criterion). Once the 'pool' is identified, the models are scrutinized and a best model is selected.

If there are (p − 1) = k explanatory variables and one intercept term, there will be 2^k possible models. For example, if p = 3 (constant, X_1, X_2) the following 2² = 4 models are possible:

E[Y] = β_0
E[Y] = β_0 + β_1 X_1
E[Y] = β_0 + β_1 X_2
E[Y] = β_0 + β_1 X_1 + β_2 X_2

where the meaning and the values of the coefficients β_0, β_1, β_2 are different in each model.

The criterion (one of R2, Adjusted R2, MSE, C p or others) that is used to find the all possiblesubsets does not form part of this course. It is expected that students must be able to apply this inpractical applications. As a general guideline one might prefer either Mallows C p or the AdjustedR2 (go and explore/experiment with some of them!)

6.7.2 Stepwise regression

Some practitioners prefer stepwise regression because this technique requires less computation thanall-possible subsets regression. This search method computes a sequence of regression equations.At each step an explanatory variable is added or deleted. The common criterion for adding (ordeleting) some regressor variable examines the effect of that particular variable which producesthe greatest reduction (or smallest increase) in the error sums of squares, at each step. Understepwise regression we can distinguish basically three procedures (i) forward selection, (ii) backwardelimination procedure and (iii) forward selection with a view back.

The theory involved in stepwise regression will not be discussed further. It is expected that the student should explore/experiment with these methods and compare the outcomes with the set of 'best' models found under all-possible-regressions. Do the two mainstream regression methods produce the same result? Most software programs require some input values (for stepwise regression, e.g. the F-to-include value, etc.). How do the models with different starting parameters differ?

6.8 Further residual analysis

In this chapter we did not explore a number of refined diagnostics for identifying the optimal model; instead we only discussed graphical methods to identify and correct model problems. Other topics (or problems) that help us to refine regression models are outliers, influential observations and collinearity (Neter et al. (1996)); these do not form part of this course.


6.9 Inference about regression parameters and predicting

6.9.1 Inference on regression parameters

Earlier in this chapter we noted that

• the regression parameter estimator β̂ is unbiased, that is

E[β̂] = β

• and that the variance-covariance matrix of β̂ is

Var[β̂] = σ²(X′X)⁻¹.

The estimated covariance matrix is σ̂²(X′X)⁻¹. The diagonal elements of this estimated covariance matrix report the variances of β̂_0, β̂_1, · · ·, β̂_{p−1}. Thus

V̂ar[β̂_j] = σ̂²(X′X)⁻¹_{jj}

(where (X′X)⁻¹_{jj} is the jth diagonal element of (X′X)⁻¹).

If we further assume that the error terms are Gaussian (N(0, σ²I)), and the population variance σ² is unknown, then using the results of Chapter 5:

(β̂_j − β_j) / (σ̂ √((X′X)⁻¹_{jj})) ∼ t_{n−p}    for j = 0, 1, · · ·, p − 1    (6.10)

Now we consider the null hypothesis:

H_0 : β_j = 0   vs   H_1 : β_j ≠ 0.

Then under H_0 the test statistic is

t_0 = β̂_j / (σ̂ √((X′X)⁻¹_{jj}))

and the underlying distribution is the t distribution with (n − p) degrees of freedom.

The 100(1 − α)% confidence interval for β_j is given by

β̂_j − t_{(n−p), α/2} σ̂ √((X′X)⁻¹_{jj}) ≤ β_j ≤ β̂_j + t_{(n−p), α/2} σ̂ √((X′X)⁻¹_{jj}).

This interval is sometimes written as

β̂_j ± t_{(n−p), α/2} σ̂ √((X′X)⁻¹_{jj}).
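A minimal sketch of these coefficient tests and intervals (Python with numpy/scipy assumed), reusing the simple-regression fit of Example RA so that p = 2 and the t distribution has n − 2 degrees of freedom:

    import numpy as np
    from scipy.stats import t

    height = np.array([1.65, 1.62, 1.67, 1.74, 1.65, 1.69, 1.81, 1.90, 1.79, 1.58])
    Y = np.array([65, 63, 64, 67, 63, 68, 73, 85, 65, 54], dtype=float)
    X = np.column_stack([np.ones_like(height), height])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    residuals = Y - X @ beta_hat
    sigma2_hat = residuals @ residuals / (n - p)          # MSE

    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))           # standard errors
    t_stats = beta_hat / se                               # tests of H0: beta_j = 0
    p_values = 2 * t.sf(np.abs(t_stats), df=n - p)
    ci = np.column_stack([beta_hat - t.ppf(0.975, n - p) * se,
                          beta_hat + t.ppf(0.975, n - p) * se])
    print(t_stats, p_values, ci)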


6.9.2 Drawing inferences about E[Y | x_h]

Define the (1 × p) row vector

X′_h = [1  X_{h1}  X_{h2}  · · ·  X_{h(p−1)}]

The corresponding prediction for the Y-value at X′_h is denoted by Ŷ_h and is

Ŷ_h = X′_h β̂.

E[Ŷ_h] = E[X′_h β̂] = X′_h E[β̂] = X′_h β

so that X′_h β̂ is unbiased, with variance

Var[Ŷ_h] = Var[X′_h β̂] = X′_h Var[β̂] X_h = σ² X′_h(X′X)⁻¹X_h

and the estimated variance is given by

σ̂² X′_h(X′X)⁻¹X_h

Thus we can construct a t random variable based on Ŷ_h, namely

(Ŷ_h − X′_h β) / (σ̂ √(X′_h(X′X)⁻¹X_h))    (6.11)

and a confidence interval of the form

Ŷ_h ± t_{(n−p), α/2} σ̂ √(X′_h(X′X)⁻¹X_h).

6.9.3 Drawing inferences about future observations

Define a new observation Y_{h(new)} corresponding to X′_h. Y_{h(new)} is assumed to be independent of the n Y_i's observed. Its prediction is

Ŷ_{h(new)} = X′_h β̂.

E[Ŷ_{h(new)}] = E[X′_h β̂] = X′_h E[β̂] = X′_h β

E[Y_{h(new)} − Ŷ_{h(new)}] = X′_h β − X′_h β = 0

and

Var[Y_{h(new)} − Ŷ_{h(new)}] = Var[Ŷ_{h(new)}] + Var[Y_{h(new)}]
                             = σ² X′_h(X′X)⁻¹X_h + σ²
                             = σ² [1 + X′_h(X′X)⁻¹X_h]


and the estimated variance is

σ̂² [1 + X′_h(X′X)⁻¹X_h]

Thus, just as in the previous section, we can construct a t random variable with n − p degrees of freedom, and use this random variable to set up a prediction interval for a new observation:

Ŷ_{h(new)} ± t_{(n−p), α/2} σ̂ √(1 + X′_h(X′X)⁻¹X_h)
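A sketch of this prediction interval (Python with numpy/scipy assumed), again for the simple-regression fit of Example RA; the new case x_h = 1.75 m is a hypothetical value chosen only for illustration:

    import numpy as np
    from scipy.stats import t

    height = np.array([1.65, 1.62, 1.67, 1.74, 1.65, 1.69, 1.81, 1.90, 1.79, 1.58])
    Y = np.array([65, 63, 64, 67, 63, 68, 73, 85, 65, 54], dtype=float)
    X = np.column_stack([np.ones_like(height), height])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    sigma2_hat = np.sum((Y - X @ beta_hat)**2) / (n - p)   # MSE

    x_h = np.array([1.0, 1.75])                            # [1, X_h] for the new case
    y_hat = x_h @ beta_hat
    se_pred = np.sqrt(sigma2_hat * (1 + x_h @ XtX_inv @ x_h))
    tcrit = t.ppf(0.975, df=n - p)
    print(y_hat - tcrit * se_pred, y_hat + tcrit * se_pred)  # 95% prediction interval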

Tutorial Exercises

The tutorial exercises for this chapter are different from those of the previous chapters. This tutorial involves hands-on lab work. Under Resources on the course Vula site you will find a folder called Regression tutorial. The data sets are from problems taken from Neter et al. (1996).

1. This question is based on the data in the file Gradepointaverage.dat (Neter et al. (1996), p 38).

A new entrance test for 20 students selected at random from a first-year class was administered to determine whether a student's grade point average (GPA) at the end of the first year (Y) can be predicted from the entrance test score (X).

     1     2     3     · · ·   18    19    20
X    5.5   4.8   4.7   · · ·   5.9   4.1   4.7
Y    3.1   2.3   3.0   · · ·   3.8   2.2   1.5

(a) Assume a linear regression model and use any statistical software package to answer the following questions:

Fit an appropriate model.
Obtain an estimate of the GPA for a student with an entrance test score X = 5.0.
Obtain the residuals. Do the residuals sum to zero?
Estimate σ².

(b) Using matrix methods, find:

i. Y'Y; X'X; X'Y

ii. (X'X)^{-1}, and hence the vector of estimated regression coefficients.

iii. The ANOVA table (using matrix notation).

iv. The estimated variance-covariance matrix of \hat{\beta}.

v. The estimated value of \hat{Y}_{h(new)} when X_h = [1, 5.0].

vi. The estimated variance of \hat{Y}_{h(new)}.

vii. The hat matrix H.

viii. The estimate of \sigma^2.

(c) Did you encounter any violations of model assumptions? Comment.

For the following exercises: fit a linear model, analyse the residuals (graphically) and come up with a 'best' model. Also follow the additional guidelines from class. Interpret all the results and report back on the best model.

2. The data file Kidneyfunction.dat (Neter et al. (1996), Problem 8.15, p 358).

3. The data file roofingshingles.dat (Neter et al. (1996), Problem 8.9, p 356).

Reference: Neter, J., Kutner, M.H., Nachtsheim, C.J. and Wasserman, W. (1996). Applied Linear Statistical Models (fourth edition).


Appendix A

Attitude

• Stats comes slowly; every day builds on the previous day. I do not miss a class.
• I come to every tutorial, and I come prepared (leave the problem cases for the tutorial).
• Exercise, exercise, exercise. (Every day.)
• I am paying for this course, so I use all the resources (classes, tutorials, hotseats); the lecturers are there to answer your questions, and they are available!
• Questions need to be specific! (Show that you are doing your bit.)


Appendix B

Maths Toolbox

The maths toolbox gives a summary (not necessarily complete) of some of the maths tools that you need to understand and succeed in this course. It is important to realize that this is a statistics course and not a maths course; we do not teach maths, but we do assume that your maths 'tools' are sharp.

B.1 Differentiation (e.g. ComMath, chapter 3)

B.2 Integration (ComMath, chapter 7)

B.3 General

1. Binomial theorem (ComMath, chapter 2)

   (a + b)^n = \sum_{i=0}^{n} \binom{n}{i} a^i b^{n-i}

2. Geometric series

   \sum_{j=0}^{n-1} a r^j = a \frac{1 - r^n}{1 - r}

   \sum_{j=0}^{\infty} a r^j = a \frac{1}{1 - r}, for |r| < 1

3. Expansion of e^x (ComMath, chapter 4)

   e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{j=0}^{\infty} \frac{x^j}{j!}, for -\infty < x < \infty

   e^{-\infty} = 0
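If you want to convince yourself of these identities, they can also be checked symbolically, for instance in Python with the sympy package (the particular exponent and numbers below are arbitrary choices for illustration only):

import sympy as sp

a, b, r, x, j = sp.symbols('a b r x j')
n = 5  # a small illustrative exponent

# 1. Binomial theorem: expand both sides and check the difference is identically zero.
lhs = sp.expand((a + b) ** n)
rhs = sp.expand(sum(sp.binomial(n, i) * a**i * b**(n - i) for i in range(n + 1)))
print(sp.expand(lhs - rhs) == 0)            # True

# 2. Finite geometric series: check the closed form at exact rational values a = 3, r = 1/4.
finite_sum = sum(sp.Rational(3) * sp.Rational(1, 4)**k for k in range(n))
closed_form = sp.Rational(3) * (1 - sp.Rational(1, 4)**n) / (1 - sp.Rational(1, 4))
print(finite_sum == closed_form)            # True

# 3. Infinite geometric series and the exponential series via symbolic summation.
print(sp.summation(sp.Rational(1, 2)**j, (j, 0, sp.oo)))      # 2, i.e. 1/(1 - 1/2)
print(sp.summation(x**j / sp.factorial(j), (j, 0, sp.oo)))    # exp(x)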


B.4 Double integrals

Now let us turn to double integrals (a special case of multiple integrals), which are written as follows:

  \int_{x=a}^{b} \int_{y=c}^{d} f(x, y) \, dy \, dx

The evaluation of the double integral (or any multiple integral) is in principle very simple: it is the repeated application of the rules for single integrals. So in the above example, we would first evaluate:

  \int_{y=c}^{d} f(x, y) \, dy

treating x as if it were a constant. The result will be a function of x only, say k(x), and thus in the second stage we need to evaluate:

  \int_{x=a}^{b} k(x) \, dx.

In the same way that the single integral can be viewed as measuring area under a curve, the double integral measures volume under a two-dimensional surface.

There is a theorem which states that (for any situations of interest to us here) the order of integration is immaterial: we could just as well have first integrated with respect to x (treating y as a constant), and then integrated the resulting function of y. There is one word of caution required in applying this theorem, however, and that relates to the limits of integration. In evaluating the above integral in the way in which we described initially, there is no reason why the limits on y should not be functions of the "outer" variable x, which is (remember) being treated as a constant; the result will still be a function of x, and thereafter we can go on to the second step. When integrating over x, however, the limits must be true constants. Now what happens if we reverse the order of integration? The outer integral in y cannot any longer have limits depending on x, so there seems to be a problem! It is not the theorem that is wrong; it is only that we have to be careful to understand what we mean by the limits.

The limits must describe a region (an area) in the X-Y plane over which we wish to integrate, which can be described in many ways. We will at a later stage of the course encounter a number of examples of this, but consider one (quite typical) case: suppose we wish to integrate f(x, y) over all x and y satisfying x ≥ 0, y ≥ 0 and x + y ≤ 1. The region over which we wish to integrate is the shaded area in figure B.1; the idea is to find the volume under the surface defined by f(x, y), in the column whose base is the shaded triangle in the figure. If we first integrate w.r.t. y, treating x as a constant, then the limits on y must be 0 and 1 − x; and since this integration can be done for any x between 0 and 1, these become the limits for x, i.e. we would view the integral as:

  \int_{x=0}^{1} \int_{y=0}^{1-x} f(x, y) \, dy \, dx.

But if we change the order around, and first integrate w.r.t. x, treating y as a constant, then the limits on x must be 0 and 1 − y, and this integration can be done for any y between 0 and 1, in which case we would view the integral as:

  \int_{y=0}^{1} \int_{x=0}^{1-y} f(x, y) \, dx \, dy.
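The following sketch (Python with sympy; the integrand 24xy is an arbitrary illustrative choice, which happens to integrate to 1 over this triangle and so is also a valid joint density there) confirms that both orders of integration give the same answer:

import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)

# Hypothetical integrand on the triangle x >= 0, y >= 0, x + y <= 1.
f = 24 * x * y

# Inner integral over y (limits depend on the outer variable x), then outer over x.
order_yx = sp.integrate(sp.integrate(f, (y, 0, 1 - x)), (x, 0, 1))

# Reversed order: inner integral over x (limits depend on y), then outer over y.
order_xy = sp.integrate(sp.integrate(f, (x, 0, 1 - y)), (y, 0, 1))

print(order_yx, order_xy)   # both equal 1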

We may also wish to transform or change the variables of integration in a double integral. Thus suppose that we wish to convert from x and y to variables u and v defined by the continuously differentiable functions:

  u = g(x, y)    v = h(x, y).


Figure B.1: Region of double integration (the shaded triangle bounded by the x-axis, the y-axis and the line x + y = 1).

We shall suppose that these functions define a 1-1 mapping, i.e. for any given u and v we can find a unique solution for x and y in terms of u and v, which we could describe as "inverse functions", say:

  x = \phi(u, v)    y = \psi(u, v).

We now define the Jacobian of the (bivariate) transformation from (x, y) to (u, v) by the absolute value of the determinant of the matrix of all partial derivatives of the original variables (i.e. x and y) with respect to the new variables, in other words:

  |J| = \left| \begin{array}{cc} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{array} \right|
      = \left| \frac{\partial x}{\partial u}\frac{\partial y}{\partial v} - \frac{\partial y}{\partial u}\frac{\partial x}{\partial v} \right|

Note that the Jacobian is a function of u and v.

Theorem B.1 Suppose that the continuously differentiable functions g(x, y) and h(x, y) define a one-to-one transformation of the variables of integration, with inverse functions \phi(u, v) and \psi(u, v). Then:

  \int_{x=a}^{b} \int_{y=c}^{d} f(x, y) \, dy \, dx = \int_{u=a'}^{b'} \int_{v=c'}^{d'} f[\phi(u, v), \psi(u, v)] \, |J| \, dv \, du

where a' \le u \le b'; c' \le v \le d' describes the region in the transformed space corresponding to a \le x \le b; c \le y \le d in the original variables.

This equation then defines a procedure for changing the variables of integration from x and y to u = g(x, y) and v = h(x, y):


1. Solve for x and y in terms of u and v to get the inverse functions φ(u, v) and ψ(u, v).

2. Obtain all partial derivatives of \phi(u, v) and \psi(u, v) with respect to u and v, and hence obtain the Jacobian |J|.

3. Calculate the minimum and maximum values for u and v (where the ranges for the variable in the inner integral may depend on the variable in the outer integral).

4. Write down the new integral, as given by the theorem.

Example: Evaluate

  \int_{x=0}^{\infty} \int_{y=0}^{\infty} e^{-x} e^{-2y} \, dy \, dx.

Let us do the transformation V = X (we sometimes refer to this as a dummy transformation) and U = 2Y. The inverse transformation is easy to write down directly as X = V and Y = U/2. The Jacobian is given by:

  |J| = \left| \begin{array}{cc} 0 & 1 \\ \tfrac{1}{2} & 0 \end{array} \right| = \left| 0 - \tfrac{1}{2} \right| = \tfrac{1}{2}

Clearly the ranges for the new variables are 0 \le u < \infty and 0 \le v < \infty. The integral is thus given by:

  \int_{u=0}^{\infty} \int_{v=0}^{\infty} e^{-v} e^{-u} \tfrac{1}{2} \, dv \, du

thus

  \int_{u=0}^{\infty} \int_{v=0}^{\infty} e^{-v} e^{-u} \tfrac{1}{2} \, dv \, du
    = \tfrac{1}{2} \int_{u=0}^{\infty} e^{-u} \left[ -e^{-v} \right]_{v=0}^{\infty} du
    = \tfrac{1}{2} \left[ -e^{-u} \right]_{u=0}^{\infty}
    = \tfrac{1}{2}

Note: It is possible to start by evaluating the derivatives of u = g(x, y) and v = h(x, y) w.r.t. x and y. But it is incorrect simply to invert each of the four derivatives (as functions of x and y), and to substitute them in the above. What you have to do is to evaluate the corresponding determinant first, and then to invert the determinant. This will still be a function of x and y, and you will then have to further substitute these out. This seems to be a more complicated route, and is not advised, although it is done in some textbooks.
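The whole calculation, including the Jacobian, can be checked symbolically; the sketch below (Python with sympy, assuming that package is available) reproduces the value 1/2 both by direct integration and via the transformation:

import sympy as sp

x, y, u, v = sp.symbols('x y u v', positive=True)

# Direct evaluation of the original double integral.
direct = sp.integrate(sp.exp(-x) * sp.exp(-2*y), (y, 0, sp.oo), (x, 0, sp.oo))

# Change of variables u = 2y, v = x, i.e. inverse functions x = v, y = u/2.
x_of = v
y_of = u / 2
J = sp.Abs(sp.Matrix([[sp.diff(x_of, u), sp.diff(x_of, v)],
                      [sp.diff(y_of, u), sp.diff(y_of, v)]]).det())

integrand = sp.exp(-x_of) * sp.exp(-2*y_of) * J     # e^{-v} e^{-u} * 1/2
transformed = sp.integrate(integrand, (v, 0, sp.oo), (u, 0, sp.oo))

print(direct, J, transformed)    # 1/2, 1/2, 1/2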


B.5 Matrices: e.g. ComMath, Chapter 5

B5.1.1 You need to know the following definitions and matrix operators

• Matrices (n × p)
• Vectors (n × 1)

• Manipulating rows and columns of a matrix

• Matrix operations (add, subtract, multiply)

• Square, symmetric matrices

• The transpose of a matrix

• The trace of a matrix

• The inverse of a non-singular matrix

• The rank of a matrix

B5.1.2 Expectation of random matrix.

For a random matrix T with dimension m × k, the expectation is

  E[T] = \begin{bmatrix} E[T_{11}] & E[T_{12}] & \cdots & E[T_{1k}] \\ E[T_{21}] & E[T_{22}] & \cdots & E[T_{2k}] \\ \vdots & \vdots & & \vdots \\ E[T_{m1}] & E[T_{m2}] & \cdots & E[T_{mk}] \end{bmatrix}

If we consider W = AT, where T is a random matrix (m × k) and A is a constant matrix (r × m), then

  E[A] = A
  E[W] = E[AT] = AE[T]

B5.2.3 The variance-covariance matrix of a random vector T (k × 1) is:

  Var[T] = \begin{bmatrix} Var[T_{11}] & Cov[T_{12}] & \cdots & Cov[T_{1k}] \\ Cov[T_{21}] & Var[T_{22}] & \cdots & Cov[T_{2k}] \\ \vdots & \vdots & & \vdots \\ Cov[T_{k1}] & Cov[T_{k2}] & \cdots & Var[T_{kk}] \end{bmatrix}

The variance-covariance matrix is a symmetric matrix: Cov[T_{ij}] = Cov[T_{ji}] for i \neq j.

If we consider W = AT, where T is a random vector (k × 1) and A is a constant matrix of dimension (m × k), then

  Var[A] = 0
  Var[W] = Var[AT] = A Var[T] A'
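The rule Var[AT] = A Var[T] A' is easy to verify by simulation; the sketch below (Python with numpy; the particular covariance matrix Sigma and constant matrix A are arbitrary illustrative choices) compares the theoretical and the empirical covariance matrices of W = AT:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: T is a 3x1 random vector with known covariance Sigma, A is 2x3.
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])

T = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=200_000)  # rows are draws of T'
W = T @ A.T                                                             # each row is (A T)'

print(A @ Sigma @ A.T)          # theoretical Var[AT]
print(np.cov(W, rowvar=False))  # empirical covariance, close to the above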


B5.2.4 A matrix S is idempotent if

  S^2 = SS = S.

An example of an idempotent matrix is the hat matrix. The hat matrix is

  H = X(X'X)^{-1}X'

for which

  H^2 = HH = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = H.
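A quick numerical check of the idempotency (and symmetry) of the hat matrix, in Python with numpy and an arbitrary made-up design matrix:

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design matrix: intercept plus two predictors for n = 8 observations.
n = 8
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix H = X (X'X)^{-1} X'

print(np.allclose(H @ H, H))                # True: H is idempotent
print(np.allclose(H, H.T))                  # True: H is also symmetric
print(np.isclose(np.trace(H), X.shape[1]))  # trace(H) = p, the number of parameters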