Medical Statistics
and
Demography Made Easy®
Devashish Sharma
MSc (Gold Medalist), PhD (Statistics)
Professor, Statistics and Demography
MLN Medical College
Allahabad Central University
Allahabad, India
JAYPEE BROTHERS MEDICAL PUBLISHERS (P) LTD
New Delhi • Ahmedabad • Bengaluru • Chennai • Hyderabad
Kochi • Kolkata • Lucknow • Mumbai • Nagpur
Published by
Jitendar P Vij
Jaypee Brothers Medical Publishers (P) Ltd
Corporate Office
4838/24 Ansari Road, Daryaganj, New Delhi - 110002, India, Phone: +91-11-43574357
Registered Office
B-3 EMCA House, 23/23B Ansari Road, Daryaganj, New Delhi - 110 002, India
Phones: +91-11-23272143, +91-11-23272703, +91-11-23282021
+91-11-23245672, Rel: +91-11-32558559, Fax: +91-11-23276490, +91-11-23245683
e-mail:
[email protected], Visit our website: www.jaypeebrothers.com
Branches
❑
2/B, Akruti Society, Jodhpur Gam Road Satellite
Ahmedabad 380 015, Phones: +91-79-26926233, Rel: +91-79-32988717
Fax: +91-79-26927094, e-mail:
[email protected]
❑
202 Batavia Chambers, 8 Kumara Krupa Road, Kumara Park East
Bengaluru 560 001, Phones: +91-80-22285971, +91-80-22382956, 91-80-22372664
Rel: +91-80-32714073, Fax: +91-80-22281761 e-mail:
[email protected]
❑
282 IIIrd Floor, Khaleel Shirazi Estate, Fountain Plaza, Pantheon Road
Chennai 600 008, Phones: +91-44-28193265, +91-44-28194897, Rel: +91-44-32972089
Fax: +91-44-28193231 e-mail:
[email protected]
❑
4-2-1067/1-3, 1st Floor, Balaji Building, Ramkote Cross Road,
Hyderabad 500 095, Phones: +91-40-66610020, +91-40-24758498
Rel:+91-40-32940929, Fax:+91-40-24758499 e-mail:
[email protected]
❑
No. 41/3098, B & B1, Kuruvi Building, St. Vincent Road
Kochi 682 018, Kerala, Phones: +91-484-4036109, +91-484-2395739
+91-484-2395740 e-mail:
[email protected]
❑
1-A Indian Mirror Street, Wellington Square
Kolkata 700 013, Phones: +91-33-22651926, +91-33-22276404, +91-33-22276415
Rel: +91-33-32901926, Fax: +91-33-22656075 e-mail:
[email protected]
❑
Lekhraj Market III, B-2, Sector-4, Faizabad Road, Indira Nagar
Lucknow 226 016, Phones: +91-522-3040553, +91-522-3040554
e-mail:
[email protected]
❑
106 Amit Industrial Estate, 61 Dr SS Rao Road, Near MGM Hospital, Parel
Mumbai 400 012, Phones: +91-22-24124863, +91-22-24104532,
Rel: +91-22-32926896, Fax: +91-22-24160828 e-mail:
[email protected]
❑
“KAMALPUSHPA” 38, Reshimbag, Opp. Mohota Science College, Umred Road
Nagpur 440 009 (MS), Phone: Rel: +91-712-3245220, Fax: +91-712-2704275
e-mail:
[email protected]
USA Office
1745, Pheasant Run Drive, Maryland Heights (Missouri), MO 63043, USA
Ph: 001-636-6279734 e-mail:
[email protected],
[email protected]
Medical Statistics and Demography Made Easy
© 2008, Devashish Sharma
All rights reserved. No part of this publication and CD ROM should be reproduced, stored in a
retrieval system, or transmitted in any form or by any means: electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the author and the publisher.
This book has been published in good faith that the material provided by author is original.
Every effort is made to ensure accuracy of material, but the publisher, printer and author will not
be held responsible for any inadvertent error(s). In case of any dispute, all legal matters are to
be settled under Delhi jurisdiction only.
First Edition:
2008
ISBN 978-81-8448-353-6
Typeset at JPBMP typesetting unit
Printed at Ajanta Offset & Packagins Ltd., New Delhi
This book is dedicated to
My Parents
Late Dr BK Sharma and Mrs Kusum Sharma
for being the constant
source of enlightenment in
the path of my mundane life
My Teacher
Professor MK Singh
for moulding my inner-self
and
outer appearance to make
me what I am
Preface
There are many books on general applied statistics,
assuming various levels of mathematical knowledge, but no
book is available that is specially designed for medical
students at the undergraduate level. The main feature of this
book is that it will help medical students at undergraduate
and postgraduate levels, as well as students who are
preparing for various PGME examinations.
The present book, which is explicitly directed towards
medical applications, has two special aspects. First, the
examples are almost entirely related to medical problems,
which, I think, helps research workers and students to
understand the underlying computational points. Second,
the choice of statistical topics reflects the extent of their
usage in medical research. Several topics, such as vital
statistics, statistical methods in epidemiology and health
information, would not normally be included in a general
book on applied statistics.
This book is intended to be useful both to medical
research workers with very little mathematical expertise and
to students who are preparing for various PGME
examinations. The emphasis throughout is on the general
concepts underlying statistical techniques. Proofs are
regarded as of secondary importance, and are usually
omitted. There are many mathematical formulae, but these
are necessary for computations and for showing the
relationship between various methods. They rarely involve
anything other than very simple algebraic manipulations. Some
computational steps, such as those involved in probability
and significance tests, are perhaps more difficult. I have given
some solved examples, clearly mentioning every step
involved in the computation.
Nearly 50 unsolved questions, mainly related to medical
problems, are included, which will help undergraduate
students in their professional examinations. For students
preparing for PGME examinations, nearly 300 MCQs related
to various topics are included in this book. These include
questions asked in various competitive examinations as well
as questions which I think are important for such tests.
Going through these questions will help them to solve
problems related to Statistics and Demography in their
competitive examinations.
I owe thanks to my colleagues, especially in the Departments
of Obstetrics and Gynaecology and of Community
Medicine. Special thanks to my wife, Mrs. Anita Sharma,
and my son, Dr. Pulak Sharma, who helped me a lot by
suggesting that I frame this work according to the problems
which he and his friends are facing.
I express my deep sense of gratitude to my publisher
Jaypee Brothers Medical Publishers (P) Ltd for their untiring
efforts in bringing out this book in such an elegant form.
Suggestions and criticism for further improvement of this
book, as well as reports of errors and misprints, will be most
gratefully received and duly acknowledged.
Devashish Sharma
Contents
1. Classification and Tabulation
2. Measure of Central Tendency
3. Measure of Dispersion
4. Theoretical Discrete and Continuous Distribution
5. Correlation and Regression
6. Probability
7. Sampling and Design of Experiments
8. Testing of Hypothesis
9. Non-parametric Tests
10. Statistical Methods in Epidemiology
11. Vital Statistics (Demography)
12. Health Information
13. A Report on Census 2001
14. National Population Policy
Unsolved Questions
Answers of MCQs and Unsolved Questions
Appendix: Statistical Tables
Index
Chapter 1
Classification
and Tabulation
There are two types of data: (1) Primary data and (2)
Secondary data. Primary data are those originated by the
investigator, and secondary data are those which the
investigator does not originate but obtains from someone
else's records.
Both primary and secondary data are broadly divided into
two categories:
1. Attributes (Qualitative data).
2. Variables (Quantitative data).
Attributes: These are qualitative characteristics which cannot
be described numerically, or data obtained by classifying
individuals by the presence or absence of an attribute, e.g. sex,
nationality, colour of eyes, socioeconomic status. They can
be further divided into two groups: (a) Nominal (b) Ordinal.
(a) Nominal: A quality that can be easily differentiated
by means of some natural or physical line of demarcation,
e.g. some physical characteristic such as colour of eyes,
sex, physical status of a person, etc.
(b) Ordinal: An ordered set is known as ordinal, i.e. when
the data are classified according to some criteria which
can be given an order such as socioeconomic status.
Variables: These are quantitative characteristics which can be
described numerically. Variables may be discrete or
continuous.
Discrete variables: These can take only certain exact values, e.g.
number of family members, number of living children, etc.
Continuous variables: A variable that can take any numerical
value within a certain range is called a continuous variable,
e.g. height in cm, weight in kg, etc.
REPRESENTATION OF DATA
Data may be represented either by means of graphs or
diagrams or by means of tables.
Tables
Tables are of two types: (1) Simple or complex tables, depending
on whether measurements of a single set or of multiple sets of
items are shown, and (2) Frequency distribution tables.
There are certain general principles which should be
followed while presenting data in tabulated form:
1. A table should be numbered.
2. A title should be given; the title should be brief and
self-explanatory.
3. Headings of columns and rows should be clear.
4. Data must be presented according to size and
importance.
5. If percentages or averages are to be compared, they should
be placed as close together as possible.
6. Footnotes may be given where necessary.
Simple Table
Table 1.1: Number of patients attending hospital in winter season*
Months        Male              Female
              No.      %        No.      %
November      250    25.00      150    30.00
December      350    35.00      100    20.00
January       100    10.00       70    14.00
February      400    40.00      180    36.00
*Source: Hospital outdoor attendance
Frequency Distribution Table
In a frequency distribution table, the data is first split up into
convenient groups (class interval) and the number of items
(frequencies) which occur in each group is shown in adjacent
column.
Following are the ages of 23 cases admitted to a hospital:
20, 35, 46, 10, 5, 25, 48, 33, 37, 41, 26, 29, 15, 6, 29, 56, 69, 66, 64,
25, 26, 56, 42.
Age group     Tally marks     Frequencies
0 – 10        ||              2
10 – 20       ||              2
20 – 30       |||| ||         7
30 – 40       |||             3
40 – 50       ||||            4
50 – 60       ||              2
60 – 70       |||             3
Table 1.2: Age distribution of admitted cases
Age group       Cases admitted
(in years)      No.      %
0 – 10           2      8.70
10 – 20          2      8.70
20 – 30          7     30.43
30 – 40          3     13.04
40 – 50          4     17.39
50 – 60          2      8.70
60 – 70          3     13.04
Total           23    100.00
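The tallying step above can be sketched in Python. This is a minimal illustration, not the book's own procedure: the function name `frequency_table` is hypothetical, and it assumes each class includes its lower limit and excludes its upper limit (so an age of 10 falls in 10 – 20), which is the convention that reproduces the frequencies in the table.

```python
# Ages of the 23 admitted cases listed in the text.
ages = [20, 35, 46, 10, 5, 25, 48, 33, 37, 41, 26, 29,
        15, 6, 29, 56, 69, 66, 64, 25, 26, 56, 42]


def frequency_table(values, start, stop, width):
    """Count values in classes [start, start+width), ..., up to stop.

    Assumption: a class includes its lower limit and excludes its upper limit.
    Returns rows of (lower, upper, frequency, percentage).
    """
    lowers = list(range(start, stop, width))
    counts = [0] * len(lowers)
    for v in values:
        if not start <= v < stop:
            raise ValueError(f"{v} lies outside {start}-{stop}")
        counts[(v - start) // width] += 1
    total = len(values)
    return [(lo, lo + width, c, round(100 * c / total, 2))
            for lo, c in zip(lowers, counts)]


table = frequency_table(ages, 0, 70, 10)
for lo, hi, count, pct in table:
    print(f"{lo:>2} - {hi:<2}  {count:>2}  {pct:6.2f}")
```

Each percentage is simply the class frequency divided by the total of 23 cases, multiplied by 100.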
In constructing a frequency distribution table, the question
that arises is: into how many groups should the data be split?
As a rule of thumb, a maximum of 20 groups may be taken when
the data set is large, and a minimum of 5 groups when it is
small.
As far as possible, class intervals should be equal.
GRAPHS OR DIAGRAMS
Bar chart: This is a simple way of representing data. In a bar
diagram the length of each bar is proportional to the magnitude
to be represented. Bar charts are of three types: (a) Simple bar
chart, (b) Multiple bar chart, (c) Component bar chart.
Figure 1.1: (a) Simple bar diagram, (b) Multiple bar diagram, (c) Component bar diagram
Pie chart: In a pie chart the area of each segment of the circle
represents a frequency. The total frequency corresponds to 360°.
The area of each segment depends upon the angle corresponding
to the frequency of each group. A pie diagram is particularly
useful when the data are expressed as percentages. In such cases
1% is equal to 3.6°.
Figure 1.2
Pictogram: Small pictures or symbols are used to present data
Figure 1.3
Cumulative Frequency Curve or Ogive: Cumulative
frequencies are obtained by successively adding the frequencies
of the values of the variable. The cumulative frequency table is
obtained as follows:
Age in years    Frequencies    Cumulative frequency
20               5             5
21               3             5 + 3 = 8
23               7             8 + 7 = 15
35              10             15 + 10 = 25
36               3             25 + 3 = 28
45               5             28 + 5 = 33
67               8             33 + 8 = 41
Total           41
Less than Cumulative Frequency Curve: The less than
cumulative frequency table is expressed as:
Age in years    Frequencies    Cumulative frequency
20               5             Less than or equal to 20 = 5
21               3             Less than or equal to 21 = 8
23               7             Less than or equal to 23 = 15
35              10             Less than or equal to 35 = 25
36               3             Less than or equal to 36 = 28
45               5             Less than or equal to 45 = 33
67               8             Less than or equal to 67 = 41
Total           41
Figure 1.4
More than Cumulative Frequency Curve: The more than
cumulative frequency table is expressed as:
Age in years    Frequencies    Cumulative frequency
20               5             More than or equal to 20 = 41
21               3             More than or equal to 21 = 36
23               7             More than or equal to 23 = 33
35              10             More than or equal to 35 = 26
36               3             More than or equal to 36 = 16
45               5             More than or equal to 45 = 13
67               8             More than or equal to 67 = 8
Total           41
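Both kinds of cumulative frequency can be sketched with running sums. The example below is an illustration only: the variable names are chosen here, and the frequencies are those of the cumulative frequency table (5, 3, 7, 10, 3, 5, 8, total 41).

```python
from itertools import accumulate

ages = [20, 21, 23, 35, 36, 45, 67]
freqs = [5, 3, 7, 10, 3, 5, 8]

# "Less than or equal to" cumulative frequencies: running sum from the top.
less_than = list(accumulate(freqs))

# "More than or equal to" cumulative frequencies: running sum from the bottom.
more_than = list(accumulate(reversed(freqs)))[::-1]

for age, f, lt, mt in zip(ages, freqs, less_than, more_than):
    print(f"{age:>3}  f={f:>2}  <= {age}: {lt:>2}   >= {age}: {mt:>2}")
```

The first "more than" value and the last "less than" value both equal the total frequency, which is a quick check on the arithmetic.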
Figure 1.5
Line Diagram: Line diagrams are used to show a trend
with the passage of time. Time is the independent variable,
represented on the X-axis, and the dependent variable is
represented on the Y-axis. It is essential to show the zero point
on the Y-axis.
Figure 1.6
Histogram: A histogram is used to represent a continuous
frequency distribution. It is essentially an area chart in which
the area of each bar represents the frequency associated with
the corresponding interval. It is not essential to show the zero
point on the X-axis (horizontal axis), but it is necessary to show
it on the vertical axis.
Figure 1.7
Frequency Polygon: It is obtained by joining the upper
mid points of the histogram blocks by straight lines.
Frequency Curve: It is obtained by joining the upper mid
points of the histogram blocks by a smooth curve.
Figures 1.8A and B
Scatter Diagram: A scatter diagram is used to
represent two variables simultaneously. Each point represents
one individual.
Figure 1.9
Comparison between Bar diagram and Histogram:
1. A bar diagram is used to represent frequencies of
qualitative or discrete variables, while a histogram is
used to represent frequencies of a continuous variable.
2. In a bar diagram the length of the bar represents the
frequency, while in a histogram the area of the bar
represents the frequency.
MULTIPLE CHOICE QUESTIONS
1. Scatter diagram shows:
(a) Trend of an event with the passage of time
(b) Frequency distribution of a continuous variable
(c) The relation between maximum and minimum
values
(d) Relation between two variables
(AI,90)
2. Sex composition can be demonstrated in which of the
following:
(a) Age pyramid
(b) Pie chart
(c) Component bar chart (d) Multiple bar chart
(JIPMER, 91)
3. Quantitative data can be best represented by:
(a) Pie chart
(b) Pictogram
(c) Histogram
(d) Bar diagram
(PGI, 80; AMC, 83, 87)
4. Percentage of data can be shown in:
(a) Graph presentation (b) Pie chart
(c) Bar diagram
(d) Histogram
(PGI, 79; Delhi, 87)
5. Graph showing relation between 2 variables is a:
(a) Scatter diagram
(b) Frequency polygon
(c) Picture chart
(d) Histogram
(AI, 96)
6. Weight in kg is a:
(a) Discrete variable
(b) Continuous variable
(c) Nominal scale
(d) None of the above
(AI, 96)
7. All are the example of nominal scale except:
(a) Age
(b) Sex
(c) Body weight
(d) Socioeconomic status
(AI, 96)
8. The average birth weights in a hospital are to be
demonstrated by statistical representation. This is best
done by:
(a) Bar chart
(b) Histogram
(c) Pie chart
(d) Frequency polygon
(AIIMS 95)
9. All are included in the nominal scale except:
(a) Colour of eye
(b) Sex
(c) Socioeconomic status (d) Occupation
(MP, 98)
10. Age and sex distribution is best represented by:
(a) Histogram
(b) Pie chart
(c) Bar diagram
(d) Age pyramid
(DNB, 2001)
11. Continuous quantitative variables are expressed by:
(a) Bar chart
(b) Histogram
(c) Frequency polygon
(d) Ogive
(e) Pie chart
(PGI, 2002)
12. Cumulative frequencies are represented by:
(a) Histogram
(b) Line diagram
(c) Pictogram
(d) Ogive
13. In which type of graphical representation frequencies
are represented by area of a rectangle
(a) Bar diagram
(b) Component bar diagram
(c) Age pyramid
(d) Histogram
14. Two variables can be plotted together by:
(a) Pie chart
(b) Histogram
(c) Frequency polygon
(d) Scatter diagram (AI,95)
15. Which of the following statements is false:
(a) Primary data is originated by the investigator
(b) Primary data originated by an investigator may be
used as secondary data by other investigator
(c) Data obtained from records of Hospitals are
secondary data
(d) None of the above statements are true
16. Best way to study relationship between two variables
is:
(a) Scatter diagram
(b) Histogram
(c) Bar chart
(d) Pie chart
(AI,92)
17. All are the examples of nominal scale except:
(a) Race
(b) Sex
(c) Iris colour
(d) Socioeconomic status
(AI,96)
18. Low birth weight statistics of a hospital is best shown
by:
(a) Bar charts
(b) Histogram
(c) Pictogram
(d) Frequency polygon
(AIIMS, Dec 95)
19. Categorical values are:
(a) Age
(b) Weight
(c) Gender
(Manipal, 2002)
20. If the grading of diabetes is classified as “mild”,
“moderate” and “severe” the scale of measurement
used is:
(a) Interval
(b) Nominal
(c) Ordinal
(d) Ratio
21. The best method to show the association between
height and weight of children in a class is by:
(a) Bar chart
(b) Line diagram
(c) Scatter diagram
(d) Histogram (AI, 2002)
22. Mean and standard deviation can be worked out only
if data is on:
(a) Interval/Ratio scale (b) Dichotomous scale
(c) Nominal scale
(d) Ordinal scale
(AIIMS, 2005)
Chapter 2
Measure of
Central Tendency
Measures of central tendency are statistical constants which
give us an idea about the concentration of values in the central
part of the distribution.
The following are five measures of central tendency:
1. Arithmetic Mean or simply Mean.
2. Median.
3. Mode.
4. Geometric Mean.
5. Harmonic Mean.
Arithmetic Mean: The arithmetic mean (A.M.) of a set of
observations is their sum divided by the number of observations.
The arithmetic mean $\bar{X}$ of n observations X1, X2, ..., Xn is
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
In case of a frequency distribution where the variable and
frequencies are:
Variable       x1    x2    x3    x4    ...    xn
Frequencies    f1    f2    f3    f4    ...    fn
The arithmetic mean is
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{n} f_i x_i$$
where i = 1, 2, 3, 4, ..., n
and $N = \sum_{i=1}^{n} f_i$.
Short Cut Method: Let ui = xi – A, where A is any arbitrary
constant; then
$$\bar{X} = A + \bar{U}, \qquad \bar{U} = \frac{1}{N}\sum_{i=1}^{n} f_i u_i$$
In case of continuous variables forming a grouped
frequency distribution, the xi are taken as the mid values of the
class intervals, i.e. xi = (Lower limit + Upper limit)/2, and the
mean is then calculated as above.
In the short cut method we generate a variable
ui = (xi – A)/h, where h is the length of the class interval or class
width, and the mean of the variable x will be
$$\bar{X} = A + h\,\bar{U}$$
Properties of arithmetic mean:
1. Sum of deviations of a set of values from their arithmetic
mean is zero.
2. Sum of squares of deviation of a set of values is minimum
when taken about mean.
Merits and Demerits of Arithmetic Mean
Merits
1. It is based on all observations.
2. Of all averages, arithmetic mean is affected least by
fluctuations of samples, i.e. arithmetic mean is a stable
average.
3. If $\bar{X}_1$ is the mean of n1 observations and $\bar{X}_2$ is the
mean of n2 observations, then the combined mean of the two series is
$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$$
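The combined mean of two series is the weighted average of the two means, with the numbers of observations as weights: (n1 × mean1 + n2 × mean2)/(n1 + n2). A minimal sketch follows; the function name and the sample figures (4 observations with mean 10, 6 observations with mean 20) are made up for illustration and do not come from the book.

```python
def combined_mean(n1, mean1, n2, mean2):
    """Mean of two series pooled together, weighting each mean by its size."""
    if n1 + n2 == 0:
        raise ValueError("at least one observation is required")
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


# Hypothetical figures: 4 observations with mean 10 and 6 with mean 20.
pooled = combined_mean(4, 10, 6, 20)
print(pooled)
```

The pooled mean lies closer to the mean of the larger series, because that series carries more weight.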
Demerits
1. AM cannot be used if we are dealing with qualitative
data.
2. AM cannot be obtained if a single observation is missing.
3. AM is affected very much by extreme values.
4. AM cannot be calculated if extreme class is open, i.e.
below 10 or above 90.
5. In extremely asymmetrical (Skewed) distribution usually
AM is not a suitable measure of location.
Median: The median of a distribution is the value of the
variable which divides it into two equal parts.
If there are n observations, arrange the values in either
ascending or descending order. If n is odd, the
$\frac{n+1}{2}$th value is the median, and if n is even, the median
will be the average of the $\frac{n}{2}$th and
$\left(\frac{n}{2}+1\right)$th observations.
For example, if there are 9 (i.e. odd) values, arrange these
values in either ascending or descending order; the
median is the $\frac{9+1}{2}$th, i.e. 5th, value. If the number of
observations is even, say 10, the median lies between the 5th
and 6th values.
In case of a discrete frequency distribution the median is
calculated by forming a cumulative frequency table. The steps
for calculating the median are:
(i) Find $\frac{N}{2}$, where $N = \sum f_i$.
(ii) See the cumulative frequency just greater than $\frac{N}{2}$.
(iii) The value of x corresponding to the cumulative frequency
just greater than $\frac{N}{2}$ is the median.
In case of a continuous frequency distribution, the class
corresponding to the cumulative frequency just greater than
$\frac{N}{2}$ (or, in the rare case where a C.F. is exactly equal to
$\frac{N}{2}$, equal to it) is called the median class, and the value
of the median is obtained by the following formula:
$$\text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right)$$
where l is the lower limit of the median class, h is the class
width, $N = \sum f_i$, C is the cumulative frequency preceding the
median class and f is the frequency of the median class.
The median can also be obtained from the less than and greater
than cumulative frequency curves (ogives). The value of the
variable at the point of intersection of the less than and greater
than cumulative frequency curves is the median.
Figure 2.1
Merits and Demerits of Median
Merits
1. It is not at all affected by extreme values.
2. It can be calculated for distribution with open end class.
3. Median is the only average that can be used while dealing
with qualitative data which cannot be measured
quantitatively but can still be arranged in ascending or
descending order.
Demerits
1. In case of even number of observations median cannot
be determined exactly.
2. It is not based on all observations.
Mode: Mode is the value which occurs most frequently
in a set of observations.
In the following set of 10 observations, "5, 20, 16, 10, 20,
5, 16, 16, 18, 14", 16 is the most frequently occurring value;
therefore 16 is the mode of the set of observations.
In case of a discrete frequency distribution, the mode is the
value of x corresponding to the maximum frequency.
The mode is determined by method of grouping if :
(i) The maximum frequency is repeated
(ii) If the maximum frequency occurs in the very beginning
or at the end of the distribution.
In case of a continuous distribution, the mode can be
determined by the following formula:
$$\text{Mode} = l + \frac{h\,(f_1 - f_0)}{2f_1 - f_0 - f_2}$$
where f1 is the maximum frequency (the group corresponding to
the maximum frequency is called the modal group), l is the lower
limit of the modal group, h is the class width, and f0 and f2 are the
frequencies of the groups preceding and following the modal group.
Mode can also be obtained by Histogram:
Figure 2.2
Merits and Demerits of Mode
Merits
1. Mode is not affected by extreme values.
Demerits
1. Mode is ill-defined. It is not always possible to find a
clearly defined mode. A distribution that has two
modes is called bimodal.
2. It is not based on all observations.
3. As compared to the mean, the mode is affected to a greater
extent by fluctuations of sampling.
Relationship between Mean, Median and Mode:
If a distribution is moderately asymmetrical then
Mode = 3 Median – 2 Mean
EXAMPLE FOR CALCULATING MEAN, MEDIAN AND
MODE
In case of a discrete distribution:
Table 2.1
Variable    Frequency    Cumulative    ui = xi – A    ui.fi
(xi)        (fi)         frequency     (A = 47)
25           5            5            –22            –110
28           7           12            –19            –133
34          10           22            –13            –130
47          12           34              0               0
52           6           40              5              30
55           4           44              8              32
60           6           50             13              78
Total       50                                        –233
$N = \sum f_i = 50$
Mean
Mean = [(25×5)+(28×7)+(34×10)+(47×12)+(52×6)+(55×4)+
(60×6)]/50
= 2117/50 = 42.34
Short Cut Method
Let ui = xi – A, where A = 47. Then
$$\bar{U} = \frac{\sum f_i u_i}{N} = \frac{-233}{50} = -4.66$$
$$\text{Mean } \bar{X} = A + \bar{U} = 47 - 4.66 = 42.34$$
Median
In this example total frequency N = 50, therefore $\frac{N}{2} = 25$.
The cumulative frequency just greater than 25 is 34. The value of
xi corresponding to 34 is 47. Therefore the median of this set of
data is 47.
Mode
The maximum frequency in the above table is 12. The value
of xi corresponding to the maximum frequency is also 47. The
mode of this set of data is 47.
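The three calculations for the discrete distribution of Table 2.1 can be checked with a short Python sketch. The function names are chosen for this illustration, the values of xi are assumed to be listed in ascending order, and the median follows the rule of taking the value whose cumulative frequency is just greater than N/2.

```python
xs = [25, 28, 34, 47, 52, 55, 60]   # values of the variable (Table 2.1)
fs = [5, 7, 10, 12, 6, 4, 6]        # corresponding frequencies


def mean_direct(xs, fs):
    """Direct method: sum of f*x divided by N."""
    return sum(x * f for x, f in zip(xs, fs)) / sum(fs)


def mean_shortcut(xs, fs, a):
    """Short cut method: A plus the mean of the deviations u = x - A."""
    n = sum(fs)
    return a + sum((x - a) * f for x, f in zip(xs, fs)) / n


def median_discrete(xs, fs):
    """Value of x whose cumulative frequency is just greater than N/2."""
    half = sum(fs) / 2
    cumulative = 0
    for x, f in zip(xs, fs):
        cumulative += f
        if cumulative > half:
            return x
    raise ValueError("empty distribution")


def mode_discrete(xs, fs):
    """Value of x with the maximum frequency (first one if tied)."""
    return xs[fs.index(max(fs))]


print(mean_direct(xs, fs), mean_shortcut(xs, fs, 47))
print(median_discrete(xs, fs), mode_discrete(xs, fs))
```

The direct and short cut methods give the same mean whatever constant A is chosen; A = 47 only keeps the deviations small.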
In case of continuous frequency distribution:
Table 2.2
Groups    fi    Cumu.    xi =        xi.fi    ui =         ui.fi
                freq.    (U+L)/2              (xi–A)/h
10-20      5     5       15            75     –3           –15
20-30      3     8       25            75     –2            –6
30-40      7    15       35           245     –1            –7
40-50     10    25       45           450      0             0
50-60     12    37       55           660      1            12
60-70      7    44       65           455      2            14
70-80      6    50       75           450      3            18
Total     50                         2410                   16
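A = 45, h = 10, N = 50,
U = upper limit of class interval, L = lower limit of class
interval
Mean
$$\text{Mean} = \frac{\sum f_i x_i}{N} = \frac{2410}{50} = 48.2$$
Short Cut Method:
Mean of ui is
$$\bar{U} = \frac{\sum f_i u_i}{N} = \frac{16}{50} = 0.32$$
Mean of xi is
$$\bar{X} = A + h\,\bar{U} = 45 + 10 \times 0.32 = 45 + 3.2 = 48.2$$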
Median
In this example N = 50, therefore $\frac{N}{2} = 25$. The cumulative
frequency 25 lies in the group 40 – 50 (this is a rare case
where the C.F. of a group is equal to $\frac{N}{2}$); therefore 40 – 50
is the median group.
The lower limit of the median group is 40, i.e. l = 40, the frequency
of the median group is 10, i.e. f = 10, the cumulative frequency
preceding the median group is 15, i.e. C = 15, and the class width
is 10, i.e. h = 10.
Then the median is calculated by the formula
$$\text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right) = 40 + \frac{10\,(25 - 15)}{10} = 50.0$$
Therefore, the median of this set of data is 50.0
Mode
The maximum frequency in the above table is 12; therefore the
modal group is 50 – 60. The formula for calculating the mode in a
grouped frequency distribution is:
$$\text{Mode} = l + \frac{h\,(f_1 - f_0)}{2f_1 - f_0 - f_2}$$
In this example, l, the lower limit of the modal group,
is 50, the frequency of the modal group is f1 = 12, the width of the
class interval is h = 10, and the frequencies preceding and following
the modal group are 10 and 7 respectively, i.e. f0 = 10 and f2 = 7.
Then the mode is calculated as
$$\text{Mode} = 50 + \frac{10\,(12 - 10)}{24 - 10 - 7} = 50 + \frac{20}{7} = 50 + 2.86 = 52.86$$
Thus the mode of the data represented in Table 2.2 is 52.86.
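The grouped calculations for Table 2.2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than a general routine: the function names are hypothetical, all classes are taken to have the same width h, the median class is the first class whose cumulative frequency reaches N/2, the median is l + h(N/2 – C)/f, and the mode is l + h(f1 – f0)/(2f1 – f0 – f2), with a missing neighbouring class counted as frequency 0.

```python
lowers = [10, 20, 30, 40, 50, 60, 70]   # lower class limits (Table 2.2)
freqs = [5, 3, 7, 10, 12, 7, 6]         # class frequencies
h = 10                                  # common class width


def grouped_mean(lowers, freqs, h):
    """Mean using class mid values: sum of f*x divided by N."""
    mids = [lo + h / 2 for lo in lowers]
    return sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)


def grouped_median(lowers, freqs, h):
    """Median = l + h * (N/2 - C) / f for the median class."""
    half = sum(freqs) / 2
    preceding = 0                       # C: cumulative frequency before class
    for lo, f in zip(lowers, freqs):
        if f > 0 and preceding + f >= half:
            return lo + h * (half - preceding) / f
        preceding += f
    raise ValueError("empty distribution")


def grouped_mode(lowers, freqs, h):
    """Mode = l + h * (f1 - f0) / (2*f1 - f0 - f2) for the modal class."""
    i = freqs.index(max(freqs))
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0
    return lowers[i] + h * (f1 - f0) / (2 * f1 - f0 - f2)


print(grouped_mean(lowers, freqs, h))
print(grouped_median(lowers, freqs, h))
print(round(grouped_mode(lowers, freqs, h), 2))
```

The sketch does not handle a repeated maximum frequency; the text recommends the method of grouping in that situation.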
Geometric Mean: The geometric mean G of n
observations xi, i = 1, 2, ..., n, is the nth root of their product.
$$G = (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n}$$
Properties of geometric mean:
1. If any observation is zero, geometric mean becomes zero.
2. If any observation is negative, geometric mean becomes
imaginary, regardless of the magnitude of other
observations.
3. Geometric mean is used to find out the rate of population
growth.
Harmonic Mean: The harmonic mean is the reciprocal of the
arithmetic mean of the reciprocals of the observations.
$$HM = \frac{1}{\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{1}{x_i}}, \quad \text{where } i = 1, 2, 3, \ldots, n$$
Relationship between Arithmetic, Geometric and
Harmonic Mean:
For positive observations, HM ≤ GM ≤ AM, with equality only when all
observations are equal; for two observations, GM² = AM × HM.
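The definitions of the three means can be compared numerically. The sketch below is illustrative only: the function names and the two observations (2 and 8) are chosen here, and positive observations are assumed. `math.prod` requires Python 3.8 or later.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    """nth root of the product of n positive observations."""
    return math.prod(xs) ** (1 / len(xs))


def harmonic_mean(xs):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(xs) / sum(1 / x for x in xs)


values = [2, 8]   # two hypothetical positive observations
am = arithmetic_mean(values)
gm = geometric_mean(values)
hm = harmonic_mean(values)
print(am, gm, hm)
```

For these two observations AM = 5, GM = 4 and HM = 3.2, so the ordering HM < GM < AM holds and GM² = 16 = AM × HM.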
MULTIPLE CHOICE QUESTIONS
1. What is the mode in statistics:
(a) Value of middle observation
(b) Arithmetic average
(c) Most commonly occurring value
(d) Difference between the highest and lowest value
(AI, 88; AIIMS, 86)
2. The most frequently occurring value in a data set is:
(a) Median
(b) Mode
(c) Standard deviation
(d) Mean
(TN, 91)
3. Mean incubation period of leprosy is calculated by:
(a) Median
(b) Harmonic mean
(c) Mode
(d) Geometric mean
(PGI, 81, AMC, 86, 87)
4. Calculate the mode of 70, 71, 72, 70, 70:
(a) 70
(b) 71
(c) 71.5
(d) 72
(PGI 79, AMC 85,88)
5. Arranging the values in a serial order is required to determine:
(a) Mean
(b) Mode
(c) Median
(d) Range
(AIIMS, 94)
6. Determination of which statistical parameter requires
quantities to be arranged in ascending or descending
orders:
(a) Mean
(b) Median
(c) Mode
(d) SD
(AIIMS, Dec 95)
7. 10 babies were born in a hospital, 5 were less than 2.5
kg and 5 were greater than 2.5 kg, the average is:
(a) Arithmetic mean
(b) Geometric mean
(c) Median
(d) Mode average
(AIIMS, 97)
8. The mean of 10 observations is 25, but later on it was
found that an observation 24 was wrongly written as
14. What will be the mean of the corrected sample:
(a) 24.5
(b) 25.5
(c) 26
(d) 26.5
9. Mean height of 10 female students of a class is 150 cm,
and the mean height of 20 male students is 175 cm.
What will be the mean height of all the 30 students of
the class:
(a) 166
(b) 166.6
(c) 168
(d) 166.8
10. If mean of a series is 10 and median is 15, what will be
the mode of the series:
(a) 20
(b) 25
(c) 30
(d) 35
11. Which of the following measures of central tendency
will be calculated when the class interval is not closed:
(a) Mean
(b) Median
(c) Mode
(d) Geometric mean
12. Which measure of central tendency is most suitable to
determine the rate of population growth:
(a) Arithmetic mean
(b) Geometric mean
(c) Harmonic mean
(d) Median
13. Relation between arithmetic mean, geometric mean and
harmonic mean is:
(a) GM < HM< AM
(b) HM< GM < AM
(c) AM < GM< HM
(d) GM< AM< HM
14. Complete the following relation:
Mode – Median = ? (Median – Mean)
(a) 2
(b) 3
(c) 1
(d) 1.5
15. Which of the following measure of central tendency is
extensively used in microbiological research:
(a) Harmonic mean
(b) Arithmetic mean
(c) Geometric mean
(d) None of the above
16. The most suitable average to be used while dealing
with socioeconomic status is:
(a) Arithmetic mean
(b) Median
(c) Geometric mean
(d) Harmonic mean
17. The geometric mean of the following set of data is:
Data: 15, 23, 45, 0, 34, 10, 9
(a) 19.4
(b) 0
(c) 45
(d) 17
18. The mean and median of 100 items are 50 and 52
respectively. The value of the largest item is 100. It was
later found that it is actually 110. Therefore, the true
mean is ——— and true median is ———.
(a) 50 and 52
(b) 50.10 and 52.5
(c) 50.10 and 52
(d) 50 and 52.5
19. The point of intersection of the 'less than' and 'greater
than' ogives corresponds to:
(a) The mean
(b) The median
(c) The geometric mean (d) None of these
20. Which measure of central tendency can be calculated
from a frequency distribution with open end interval:
(a) Mean
(b) Geometric mean
(c) Harmonic mean
(d) Median
21. The relationship between AM, GM, and HM is:
(a) GM² = AM × HM
(b) HM² = AM × GM
(c) AM = ½ (GM × HM) (d) None of the above
22. Which measure of central tendency is not
influenced by extreme values:
(a) Mode
(b) Mean
(c) Median
(d) Harmonic mean
23. Values are arranged in ascending and descending
order to calculate:
(a) Mode
(b) Mean
(c) Median
(d) Standard deviation
(AI,98)
24. Number of cases of malaria detected in 10 years are
100, 160, 190, 250, 300, 300, 320, 320, 550, 380. How to
calculate the average number of cases per year:
(a) Arithmetic mean
(b) Geometric mean
(c) Mode
(d) Median
(AIIMS, June 2000)
25. Calculate the median from the following values;
1.9, 1.9, 1.9, 1.9, 2.1, 2.4, 2.5, 2.5, 2.5, 2.9
(a) 1.2
(b) 1.9
(c) 2.25
(d) 2.5 (AIIMS, Nov 2000)
26. Malaria incidence in village in the year 2000 is 430,
500, 410, 160, 270, 210, 300, 350, 4000, 430, 480, 540,
which of the following is the best indicator for
assessment of malaria incidence in that village by the
epidemiologist:
(a) Arithmetic mean
(b) Geometric mean
(c) Median
(d) Mode
(AIIMS, May 2001)
27. The median of values 2,5,7,10,10,13,25 is:
(a) 10
(b) 13
(c) 25
(d) 5
(AIIMS,Nov 2001)
28. The incidence of malaria in an area is: 250, 300, 320,
300, 5000, 200, 350. The best value to give an idea of the
incidence in the past 7 years is:
(a) Median
(b) Mode
(c) Arithmetic mean
(d) Geometric mean
(AIIMS, Nov 2001)
29. Which of the following statements is/are correct
regarding mean, median and mode:
(a) Mode nominal value
(b) Mean is sensitive to extreme values
(c) Median is not sensitive to extreme values
(Manipal, 2002)
30. For negatively skewed data, the mean will be:
(a) Less than median
(b) More than median
(c) Equal to median
(d) One
(AIIMS, 2005)
Chapter 3
Measure of Dispersion
DISPERSION
Dispersion means “scatteredness”. Dispersion gives an idea
about the homogeneity (less dispersed) or heterogeneity (more
scattered) of the distribution.
Measure of Dispersion
Range: The range is the difference between two extreme
observations. If A and B are greatest and smallest observations
respectively then
Range = A – B
Range is a simple but crude measure of dispersion.
Quartile Deviation or Semi-Inter Quartile Range: Quartiles
divide the total frequency into four equal parts.
Figure 3.1
Q1 = First Quartile (The frequency between first quartile and
origin is 25% of total frequency).
Q2 = Second Quartile (The frequency between second
quartile and origin is 50% of total frequency).
Q3 = Third Quartile (The frequency between third quartile
and origin is 75% of total frequency).
Quartile deviation = $\dfrac{Q_3 - Q_1}{2}$
Quartile deviation is a better index than range because it makes
use of 50% of observations.
In case of continuous frequency distribution the quartile
is calculated by the following formula:
$Q_i = l + \dfrac{h}{f}\left(\dfrac{iN}{4} - C\right)$
Where l is the lower limit of quartile class, h is the class
width, $N = \sum f_i$, C is the cumulative frequency preceding the
quartile class and f is the frequency of quartile class. For first
quartile i = 1, for second quartile i = 2 and for third quartile
i = 3.
It is to be noted that the second quartile is equal to the median.
Decile: Deciles divide the total frequency into 10 equal
parts. The formula for calculating a decile is:
$D_i = l + \dfrac{h}{f}\left(\dfrac{iN}{10} - C\right)$
Where l is the lower limit of decile class, h is the class
width, $N = \sum f_i$, C is the cumulative frequency preceding the
decile class and f is the frequency of decile class. For first
decile i = 1, for second decile i = 2 and for third decile i = 3 ….
and for 9th decile i = 9.
Percentile: Percentiles divide the total frequency into 100
equal parts. The formula for calculating a percentile is:
$P_i = l + \dfrac{h}{f}\left(\dfrac{iN}{100} - C\right)$
Where l is the lower limit of percentile class, h is the class
width, $N = \sum f_i$, C is the cumulative frequency preceding the
percentile class and f is the frequency of percentile class. For
first percentile i = 1, for second percentile i = 2 and for third
percentile i = 3…. and for 99th percentile i = 99.
Mean Deviation: If xi; fi, i = 1, 2, 3, .... n is a frequency
distribution then mean deviation from the average A (usually
Mean, Median, Mode) is given by:
Mean Deviation = $\dfrac{1}{N}\sum f_i \,|x_i - A|$
Where $\sum f_i = N$
Mean deviation is least when taken from the Median.
Standard Deviation and Root Mean Square Deviation:
Standard deviation (σ) is the positive square root of the
arithmetic mean of the squares of deviations of the given values
from their arithmetic mean:
$\sigma = \sqrt{\dfrac{1}{N}\sum f_i (x_i - \bar{x})^2}$
Where $N = \sum f_i$ and $\bar{x}$ = Mean
Square of Standard Deviation is known as Variance.
Root Mean Square Deviation: Root mean square deviation S
is given by:
$S = \sqrt{\dfrac{1}{N}\sum f_i (x_i - A)^2}$
where $N = \sum f_i$ and A is any arbitrary number
Relation between σ and S:
$S^2 = \sigma^2 + (\bar{x} - A)^2$, so Standard Deviation is the minimum value of Root Mean
Square Deviation: $\sigma \le S$, with equality when $A = \bar{x}$.
Relation between Mean Deviation from Mean and SD
Mean deviation from mean ≤ SD
Coefficient of Dispersion
When we want to compare the variability of two series which
differ widely in their averages or which are measured in
different units, we calculate the coefficient of dispersion,
which is a pure number independent of units.
The coefficient of dispersion based on different measure
of dispersion:
Based on Range
CD = (A – B) / (A + B)
Where A and B are the maximum and minimum values.
Based on Quartile Deviation:
CD = (Q3 – Q1) / (Q3 + Q1)
Where Q1 and Q3 are first and third quartiles respectively.
Based on Standard Deviation:
CD = SD / Mean
Coefficient of Variation
The coefficient of variation is 100 times the coefficient of
dispersion based on standard deviation:
CV = (SD / Mean) × 100
The series having the greater CV is said to be more variable
than the series having the smaller CV; in other words, a series is
more homogeneous if its CV is smaller.
Examples for Calculating Standard Deviation, Quartile,
Coefficient of Dispersion and Coefficient of Variation:
In case of Discrete Data:
Simple Method
Variable xi    (xi – x̄)    (xi – x̄)²
18             –12          144
45             15           225
34             4            16
22             –8           64
35             5            25
39             9            81
17             –13          169
Total 210                   724
No. of cases = 7; Mean x̄ = 210/7 = 30
SD = $\sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\dfrac{724}{7}} = \sqrt{103.43} = 10.17$
Range: Max (A) = 45; Min (B) = 17; Range = A – B = 28
Coefficient of Dispersion (Based on Range) = (A – B)/(A + B) = 28/62 = 0.45
Coefficient of dispersion (Based on SD) = SD/Mean = 10.17/30 = 0.339
Coefficient of variation = (SD/Mean) × 100 = 33.9
Short-cut Method:
Variable xi    ui = (xi – A)    ui²
18             –17              289
45             10               100
34             –1               1
22             –13              169
35             0                0
39             4                16
17             –18              324
Total          –35              899
No. of cases = 7; Let A = 35
Mean u = –35/7 = –5; therefore Mean x = A + Mean u = 35 – 5 = 30
SD (x) = SD (u) = $\sqrt{\dfrac{\sum u_i^2}{n} - \bar{u}^2} = \sqrt{\dfrac{899}{7} - (-5)^2} = \sqrt{128.43 - 25} = \sqrt{103.43} = 10.17$
(In this case we simply change the origin and SD is
independent of Origin)
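The discrete-data calculation can be checked with a short script. This is only a sketch: the function name `dispersion_summary` and the dictionary keys are illustrative choices, not part of the text, and the SD divides by n (not n – 1) to match the formula used in the worked example.

```python
import math

def dispersion_summary(values):
    """Return the mean, range, SD and related coefficients for ungrouped data."""
    n = len(values)
    mean = sum(values) / n
    # SD as in the worked example: divide the sum of squared deviations by n.
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    largest, smallest = max(values), min(values)
    return {
        "mean": mean,
        "range": largest - smallest,
        "sd": sd,
        "cd_range": (largest - smallest) / (largest + smallest),
        "cd_sd": sd / mean,
        "cv": 100 * sd / mean,
    }

# Data from the worked example.
data = [18, 45, 34, 22, 35, 39, 17]
summary = dispersion_summary(data)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The script works at full precision, so its figures may differ in the last decimal from a hand calculation that rounds intermediate values.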
In case of continuous frequency distribution:
Age group   fi   Cum. freq.   xi = (U+L)/2   fi . xi          xi²    fi . xi²
20 – 30     5    5            25             25 × 5 = 125     625    625 × 5 = 3125
30 – 40     22   27           35             22 × 35 = 770    1225   1225 × 22 = 26950
40 – 50     20   47           45             20 × 45 = 900    2025   2025 × 20 = 40500
50 – 60     10   57           55             10 × 55 = 550    3025   3025 × 10 = 30250
60 – 70     3    60           65             65 × 3 = 195     4225   4225 × 3 = 12675
Total       N = 60                           2540                    113500
U = Upper limit of class interval; L = Lower limit of class
interval
Mean $\bar{x} = \dfrac{\sum f_i x_i}{N} = \dfrac{2540}{60} = 42.33$
Standard Deviation
$\sigma = \sqrt{\dfrac{\sum f_i x_i^2}{N} - \bar{x}^2} = \sqrt{\dfrac{113500}{60} - \left(\dfrac{2540}{60}\right)^2} = \sqrt{1891.67 - 1792.11} = \sqrt{99.56} = 9.98$
Quartiles
Quartile $Q_i = l + \dfrac{h}{f}\left(\dfrac{iN}{4} - C\right)$, where i = 1, 2, 3
First Quartile (Q1): N = 60; for first quartile i = 1; iN/4 = 60/4 = 15
Cumulative frequency just above 15 is 27, therefore 30 – 40 is
the first quartile group.
Thus in the above formula: l = 30, h = 10, C = 5 and f = 22, i = 1.
$Q_1 = 30 + \dfrac{10\,(15 - 5)}{22} = 30 + \dfrac{100}{22} = 30 + 4.55 = 34.55$
Second Quartile or Median (Q2):
N = 60; for second quartile i = 2; iN/4 = (2 × 60)/4 = 60/2 = 30
Cumulative frequency just above 30 is 47, therefore 40 – 50 is
the second quartile group.
Thus in the formula: l = 40, h = 10, C = 27 and f = 20, i = 2.
$Q_2 = 40 + \dfrac{10\,(30 - 27)}{20} = 40 + 1.5 = 41.5$
Third Quartile (Q3): N = 60; for third quartile i = 3; iN/4 = (3 × 60)/4 = 180/4 = 45
Cumulative frequency just above 45 is 47, therefore 40 – 50 is
the third quartile group.
Thus in the formula: l = 40, h = 10, C = 27 and f = 20, i = 3.
$Q_3 = 40 + \dfrac{10\,(45 - 27)}{20} = 40 + \dfrac{180}{20} = 40 + 9 = 49$
Coefficient of Dispersion (Based on Quartile)
$\dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{49 - 34.55}{49 + 34.55} = \dfrac{14.45}{83.55} = 0.173$
Coefficient of Dispersion (Based on Standard Deviation)
SD/Mean = 9.98/42.33 = 0.236
Coefficient of Variation
0.236 × 100 = 23.6
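The grouped-data steps can be sketched in code as follows. The helper names `grouped_mean_sd` and `grouped_quantile` are hypothetical; the quantile helper applies the formula l + h(iN/parts – C)/f, so the same function gives quartiles (`parts=4`), deciles (`parts=10`) or percentiles (`parts=100`).

```python
import math

# Class intervals (lower, upper) and frequencies from the worked example.
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
freqs = [5, 22, 20, 10, 3]

def grouped_mean_sd(classes, freqs):
    """Mean and SD of a continuous frequency distribution using class mid-points."""
    n = sum(freqs)
    mids = [(lo + hi) / 2 for lo, hi in classes]
    mean = sum(f * x for f, x in zip(freqs, mids)) / n
    mean_sq = sum(f * x * x for f, x in zip(freqs, mids)) / n
    return mean, math.sqrt(mean_sq - mean ** 2)

def grouped_quantile(classes, freqs, i, parts=4):
    """i-th quantile when the total frequency is divided into `parts` equal parts."""
    n = sum(freqs)
    target = i * n / parts
    cum = 0  # cumulative frequency preceding the current class (C in the formula)
    for (lo, hi), f in zip(classes, freqs):
        if cum + f >= target:
            return lo + (hi - lo) * (target - cum) / f
        cum += f
    raise ValueError("target lies beyond the last class")

mean, sd = grouped_mean_sd(classes, freqs)
q1, q2, q3 = (grouped_quantile(classes, freqs, i) for i in (1, 2, 3))
print(f"mean={mean:.2f} sd={sd:.2f} Q1={q1:.2f} Q2={q2:.2f} Q3={q3:.2f}")
print(f"CD (quartile) = {(q3 - q1) / (q3 + q1):.3f}, CV = {100 * sd / mean:.1f}")
```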
Short Cut Method:
Age group   fi   xi = (U + L)/2   ui = (xi – A)/h   fi × ui   ui²   fi × ui²
20 – 30     5    25               –2                –10       4     20
30 – 40     22   35               –1                –22       1     22
40 – 50     20   45               0                 0         0     0
50 – 60     10   55               1                 10        1     10
60 – 70     3    65               2                 6         4     12
Total       60                                      –16             64
U = Upper limit of class interval; L = Lower limit of class
interval
A (Arbitrary constant) = 45; h (Class width) = 10
Mean u = $\dfrac{\sum f_i u_i}{N} = \dfrac{-16}{60} = -0.267$
Mean x = A + h × Mean u = 45 + 10 × (–0.267) = 45 – 2.67 = 42.33
SD (u) = $\sqrt{\dfrac{\sum f_i u_i^2}{N} - \bar{u}^2} = \sqrt{\dfrac{64}{60} - (0.267)^2} = \sqrt{1.067 - 0.071} = \sqrt{0.996} = 0.998$
SD (x) = h × SD(u) = 10 × 0.998 = 9.98
(In this case we change the origin as well as scale while
creating a new variable ui; therefore we have to multiply SD
of ui by ‘h’ to obtain the Standard deviation of xi).
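The effect of changing origin and scale on the SD can be illustrated numerically. The sketch below reuses the discrete data from the earlier example; the values A = 35 and h = 5 are arbitrary choices made only for illustration.

```python
import math

def pop_sd(values):
    """Standard deviation with divisor n, as used throughout this chapter."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

x = [18, 45, 34, 22, 35, 39, 17]
A, h = 35, 5  # illustrative origin and scale

shifted = [xi - A for xi in x]        # change of origin only
scaled = [(xi - A) / h for xi in x]   # change of origin and scale

sd_x = pop_sd(x)
sd_shifted = pop_sd(shifted)          # same as sd_x: SD ignores the origin
sd_rescaled = h * pop_sd(scaled)      # multiply back by h to recover sd_x
print(sd_x, sd_shifted, sd_rescaled)
```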
SKEWNESS
Skewness means lack of symmetry. A distribution is said to
be skewed if
Mean ≠ Median ≠ Mode
Measure of Skewness
Skewness of a distribution can be measured by following
formulae:
1. Sk = Mean – Median
2. Sk = Mean – Mode
For comparing two series we calculate coefficient of
skewness
Karl Pearson’s Coefficient of Skewness:
$S_k = \dfrac{\text{Mean} - \text{Mode}}{\sigma}$
If mode is ill defined then
$S_k = \dfrac{3\,(\text{Mean} - \text{Median})}{\sigma}$
The limits for Karl Pearson’s coefficient of skewness are ±
3. In practice these limits are rarely attained.
Skewness is positive if Mean > Mode or Mean > Median,
and negative if Mean (M) < Mode (Mo) or M < Md.
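As a small illustration of Karl Pearson's coefficient, the sketch below uses the mode when it is supplied and falls back to the median form otherwise. The function name `pearson_skewness` and its keyword arguments are assumptions made for this example.

```python
def pearson_skewness(mean, sd, mode=None, median=None):
    """Karl Pearson's coefficient of skewness.

    Uses (mean - mode) / sd when the mode is given; otherwise uses
    3 * (mean - median) / sd, the form for an ill-defined mode.
    """
    if mode is not None:
        return (mean - mode) / sd
    if median is None:
        raise ValueError("supply either the mode or the median")
    return 3 * (mean - median) / sd

# Mean above the mode gives a positive coefficient (positively skewed).
print(pearson_skewness(50, 10, mode=45))
# Mean below the median gives a negative coefficient (negatively skewed).
print(pearson_skewness(40, 10, median=50))
```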
Figure 3.2
Figure 3.3
KURTOSIS
Kurtosis (curvature of the curve) gives us an idea about the
flatness or peakedness of the curve. It is measured by the coefficient β2.
Figure 3.4
A - is called normal curve or Mesokurtic curve (β2 = 3).
B - which is flatter than normal curve is called Platykurtic
curve (β2 < 3).
C - which is more peaked than normal curve is called
Leptokurtic curve (β2 > 3).
MULTIPLE CHOICE QUESTIONS
1. In statistics, spread or dispersion is described by the:
(a) Median
(b) Mode
(c) Standard deviation (d) Mean
(Kerala, 88)
2. In statistical analysis what is used to mention the
dispersion of data:
(a) Mode
(b) Range
(c) Standard error of mean
(d) Geometric mean
(PGI, 81, AMC 87, 92)
3. Measure of dispersion is:
(a) Mean
(b) Mode
(c) Standard deviation (d) Median
(Kerala, 94)
4. Among the measure of dispersion which is most
frequently used:
(a) Range
(b) Mean
(c) Median
(d) Standard deviation
(Karn, 94)
5. Best index to detect deviation is:
(a) Variation
(b) Range
(c) Mean deviation
(d) Standard deviation
(AIIMS, 96)
6. Mean weight of 100 children was 12 kg. The standard
deviation was 3. Calculate the percent coefficient of
variation:
(a) 25%
(b) 35%
(c) 45%
(d) 55% (AIIMS, Nov 2000)
7. Mean square deviation will be minimum when taken
from ————.
(a) Mean
(b) Median
(c) Arbitrary constant
(d) Mode
8. Sum of absolute deviation about median is:
(a) Least
(b) Greatest
(c) Zero
(d) Equal
9. If mean and mode of the given distribution is equal
then its coefficient of skewness is ————-.
(a) 3
(b) Zero
(c) 1
(d) None of the above
10. Least value of root mean square of deviation is:
(a) Mean deviation from median
(b) Mean deviation
(c) Standard deviation
(d) Mean deviation from arbitrary constant
11. If the mean of a distribution is 40 and the median is 50, find
the mode and the nature of the distribution:
(a) 70 and positively skewed
(b) 70 and negatively skewed
(c) 60 and negatively skewed
(d) 60 and positively skewed
12. If each of a set of observations of a variable is multiplied
by a constant (non-zero), the standard deviation of the
resultant variable:
(a) Is unaltered
(b) Increases
(c) Decreases
(d) Is unknown
13. Mean, SD and Variance have the same units:
(a) True
(b) False
14. Which quartile divides the total frequencies in 3: 1 ratio:
(a) First quartile
(b) Second quartile
(c) Third quartile
(d) Inter quartile range
(AI, 2003)
15. If 25% of the items are less than 10 and 25% are more
than 40, the quartile deviation is:
(a) 20
(b) 15
(c) 10
(d) 40
16. If in a frequency curve of scores, the value of the mode was
found to be lower than the mean, the distribution is:
(a) Symmetric
(b) Negatively skewed
(c) Positively skewed
(d) Normal
17. In any discrete distribution (when all the values are
not same) the relations between Mean deviation (MD)
and standard deviation (SD) is:
(a) MD = SD
(b) MD > SD
(c) MD < SD
(d) None of these
18. If the maximum value of a distribution is 60 and the minimum
value is 40, the coefficient of dispersion is:
(a) 0.5
(b) 0.3
(c) 0.25
(d) 0.2
19. In a perfectly symmetrical distribution 50% of items
are above 60 and 75% items are below 75. Therefore
the values of quartile deviation and coefficient of skewness
are:
(a) 15 and 0.5
(b) 15 and 0.25
(c) 30 and 0.5
(d) 30 and 0.25
20. Match the following:
(1) Range                       (a)
(2) Quartile deviation          (b)
(3) Coefficient of variation    (c) $X_{max} - X_{min}$
(4) Mean deviation              (d) $\dfrac{1}{N}\sum f_i \,|x_i - \bar{x}|$
(a) 1-A, 2-B, 3-C, 4-D
(b) 1-C, 2-A, 3-B, 4-D
(c) 1-C, 2-B, 3-A, 4-D
(d) 1-C, 2-D, 3-A, 4-B
21. Root mean square deviation is:
(a) Standard deviation
(b) Standard error
(c) Standard variation
(d) Standard error of proportion
(AI,97)
22. Right sided skewed deviation causes:
(a) Median is more than mean
(b) SD more than variance
(c) Tail to the right
(d) Not affected at all
(AI, 98)
23. In a hospital, 10 babies were born on same day. All of
them had birth weight 2.8 kg. The standard deviation
will be:
(a) Zero
(b) One
(c) –1
(d) 0.28
(AI,2001)
24. Median incubation period means:
(a) Time for 50% cases to occur
(b) Time between primary case and secondary cases
(c) Time between onset of infection and period of
maximum infectivity
(JIPMER, 2003)
25. If the systolic blood pressure in a population has a mean
of 130 mm Hg and a median of 140 mm Hg, the
distribution is said to be:
(a) Symmetrical
(b) Positively skewed
(c) Negatively skewed
(d) Either positively or negatively skewed depending
on the standard deviation
26. If each value of a given group of observations is
multiplied by 10, the standard deviation of the resulting
observations is:
(a) Original std. deviation × 10
(b) Original std. deviation/10
(c) Original std. deviation – 10
(d) Original std. deviation itself
Chapter 4
Theoretical Discrete and Continuous Distribution
THEORETICAL DISCRETE DISTRIBUTION
Binomial Distribution
Let a random experiment be performed repeatedly, and let
the occurrence of an event in a trial be called a success and its
non-occurrence a failure. Consider a set of n independent
trials (‘n’ being finite), in which the probability ‘p’ of success
in any trial is constant for each trial. Then q = 1 – p is the
probability of failure in any trial.
If there are ‘x’ successes in ‘n’ trials, then the number of
failures will be (n – x).
But ‘x’ successes in n trials can occur in $^nC_x$ ways and the
probability for each of these ways is $p^x q^{n-x}$. Hence, the
probability of ‘x’ successes in ‘n’ trials in any order whatsoever
is given by the expression:
$^nC_x \, p^x q^{n-x}$
The probability distribution of the number of successes so
obtained is called the binomial probability distribution.
A random variable X is said to follow binomial distribution if
it assumes only non-negative values and its probability mass
function is given by:
$P(X = x) = {}^nC_x \, p^x q^{n-x}$, x = 0, 1, 2, ..., n
= 0 otherwise
The two independent constants ‘n’ and ‘p’ in the distribution are
known as parameters. ‘n’ is also sometimes known as the
degree of the binomial distribution.
Physical Conditions for Binomial Distribution
We get binomial distribution under the following
experimental conditions:
1. Each trial results in two mutually exclusive disjoint
outcomes, termed as success and failure.
2. The number of trials ‘n’ is finite.
3. The trials are independent of each other.
4. The probability of success ‘p’ is constant for each trial.
Mean and Standard Deviation of Binomial Distribution
If a random variable X follows a binomial distribution with
parameters ‘n’ and ‘p’ then its mean is np and variance is
npq
Mean = np
Variance = npq
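The mean np and variance npq can be verified numerically from the probability mass function. In this sketch the values n = 10 and p = 0.3 are arbitrary illustrative choices, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3  # illustrative parameters
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(x * pr for x, pr in enumerate(probs))
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probs))

print(f"total probability = {sum(probs):.6f}")
print(f"mean = {mean:.4f} (np = {n * p:.4f})")
print(f"variance = {variance:.4f} (npq = {n * p * (1 - p):.4f})")
```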
POISSON DISTRIBUTION
Poisson distribution is a limiting case of binomial distribution
under the following conditions:
1. ‘n’, the number of trials, is indefinitely large, i.e. n → ∞
2. ‘p’, the constant probability of success for each trial,
is indefinitely small, i.e. p → 0
3. np = λ (say) is finite. Thus p = λ/n and q = 1 – λ/n, where
λ is a positive real number.
A random variable X is said to follow a Poisson distribution
if it assumes only non-negative values and its probability mass
function is given by:
$P(X = x) = \dfrac{e^{-\lambda}\lambda^x}{x!}$, x = 0, 1, 2, ...
= 0 otherwise
Here λ is known as the parameter of the distribution.
Remarks
Poisson distribution occurs when there are events which do
not occur as outcomes of a definite number of trials (unlike
binomial distribution) of an experiment but which occur at
random points of time and space, wherein our interest lies
only in the number of occurrences of the event, not in its non-occurrences.
For example: Number of deaths from a disease (not in
the form of an epidemic) such as heart attack or cancer, or due to
snake bite.
Mean and Variance of Poisson Distribution
For the Poisson distribution the mean and the variance are equal;
both are equal to λ.
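The limiting relationship with the binomial distribution can be seen numerically. The sketch below holds np = λ fixed (λ = 2 is an arbitrary choice) and lets n grow; the helper names are illustrative.

```python
from math import comb, exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variate with parameter lam."""
    return exp(-lam) * lam ** x / factorial(x)

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial variate with parameters n and p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

lam = 2.0  # illustrative value of np
for n in (10, 100, 10000):
    p = lam / n  # p shrinks as n grows, keeping np = lam
    print(n, round(binomial_pmf(3, n, p), 5), round(poisson_pmf(3, lam), 5))

# Mean and variance from the pmf; terms beyond x = 59 are negligible here.
terms = [(x, poisson_pmf(x, lam)) for x in range(60)]
mean = sum(x * pr for x, pr in terms)
variance = sum((x - mean) ** 2 * pr for x, pr in terms)
print(mean, variance)
```

As n increases the binomial probability approaches the Poisson value, and the computed mean and variance both come out close to λ.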
THEORETICAL CONTINUOUS DISTRIBUTION
Normal (or Gaussian) Distribution
The Binomial and Poisson distributions both relate to a
discrete random variable. The most important continuous
distribution is the Gaussian (C. F. Gauss, 1777-1855), or as it is
frequently called, the normal distribution.
Chief Characteristics of the Normal Distribution
The normal probability curve with mean μ and standard
deviation σ is given by the equation
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma > 0$
1. The curve is bell shaped and symmetrical about the line
x = μ.
2. Mean, median and mode of distribution coincide.
3. As x increases numerically, f(x) decreases rapidly, the
maximum probability occurring at the point x = μ and
is given by $[f(x)]_{max} = \dfrac{1}{\sigma\sqrt{2\pi}}$
4.
5. Since f(x) being the probability, can never be negative,
no portion of the curve lies below x-axis.
6. x-axis is an asymptote to curve.
7. The points of inflexion of the curve, where it changes its shape
from concave to convex, are given by x = μ ± σ.
8. Relation between Quartile deviation, Mean deviation
and Standard deviation is given by:
QD : MD : SD = (2/3)σ : (4/5)σ : σ (approximately), i.e. 10 : 12 : 15
9. The total area under normal probability curve is unity.
Shape of Curve
Figure 4.1
A variable X is said to be a normal variate if it follows a
normal probability distribution with mean μ and variance σ²
and is represented as X ~ N (μ, σ²).
If X ~ N (μ1, σ1²) and Y ~ N (μ2, σ2²) and X and Y are independent,
then X + Y ~ N (μ1 + μ2, σ1² + σ2²).
The sum as well as the difference of two independent
normal variates is also a normal variate.
If X ~ N (μ, σ²) then kX will be distributed normally with
mean kμ and variance k²σ², i.e. kX ~ N (kμ, k²σ²); also X + a
will be distributed normally with mean μ + a and variance σ²,
i.e. X + a ~ N (μ + a, σ²)
STANDARD NORMAL VARIATE
If x ~ N (μ, σ²), then $z = \dfrac{x - \mu}{\sigma}$ is a standard normal variate
with mean 0 and variance 1.
Area Properties
Standardized variable z
Figure 4.2
The above curve of normal distribution shows the
scale of the original variable at values which differ from μ by ±σ, ±2σ
and ±3σ. From the above Figure it is clear that a relatively
small proportion of the area under the curve lies outside the
pair of values x = μ + 2σ and x = μ – 2σ. In fact the probability
that x lies within μ ± 2σ is very nearly 0.95 and the probability
that it lies outside this range is correspondingly 0.05.
If X and Y are two independent standard normal variates
then U = X + Y and V = X – Y are also independently distributed
as normal variates with mean 0 and variance 2.
The following table gives the area under the normal probability curve for some important values of the normal variate x.
Distance from mean ordinate in terms of ± σ    Area under normal curve
x̄ ± 1 σ                                        68.3%
x̄ ± 1.96 σ                                     95%
x̄ ± 2 σ                                        95.4%
x̄ ± 2.58 σ                                     99%
x̄ ± 3 σ                                        99.7%
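The tabulated areas can be reproduced from the error function in Python's standard library. This is a sketch: `area_within` and `standardize` are hypothetical helper names, and the identity used is that the area within k standard deviations of the mean equals erf(k/√2).

```python
from math import erf, sqrt

def standardize(x, mu, sigma):
    """Standard normal variate z = (x - mu) / sigma."""
    return (x - mu) / sigma

def area_within(k):
    """Area under the normal curve between mean - k*sigma and mean + k*sigma."""
    return erf(k / sqrt(2))

for k in (1, 1.96, 2, 2.58, 3):
    print(f"mean ± {k} sigma: {100 * area_within(k):.1f}%")

# A value of 130 in a population with mean 120 and SD 10 lies one SD above the mean.
print(standardize(130, 120, 10))
```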
Importance of Normal Distribution
1. Most of the distributions occurring in practice, e.g.
Binomial and Poisson, can be approximated by the Normal
distribution.
2. The distributions of many sample statistics tend to normal for
large samples and as such they can be studied with the
help of the normal distribution.
3. The entire theory of small samples tests viz. ‘t’, ‘F’, χ2
tests is based on the fundamental assumption that the
parent population from which the sample is drawn
follows a normal distribution.
MULTIPLE CHOICE QUESTIONS
1. In a standard normal curve the area between one
standard deviation on either side will be:
(a) 68%
(b) 85%
(c) 99.7%
(d) None of the above
(AI, 88, AIIMS, 86)
2. Normal distribution curve depends on:
(a) Mean and sample
(b) Mean and median
(c) Median and standard deviation
(d) Mean and standard deviation
(AI, 90)
3. The area under a normal distribution curve for SD of 2
is:
(a) 68%
(b) 95%
(c) 97.5%
(d) 100%
(AI, 93)
4. Mean ± 1.96 SD includes the following % of values in a
distribution:
(a) 68%
(b) 99.5%
(c) 88.7%
(d) 95%
(AI, 96)
5. Shape of normal curve is:
(a) Symmetrical
(b) Curvilinear
(c) Linear
(d) Parabolic (Assam, 95)
6. SD is 1.96 the confidence limits is:
(a) 63.6%
(b) 66.6%
(c) 95%
(d) 99%
7. 95% of confidence limits exist between:
(a) ± 1 SD
(b) ± 2 SD
(c) ± 3 SD
(d) ± 4 SD
[Hint: 1.96 is approximately equal to 2]
(AI,98)
(AI,99)
8. All are true regarding standard distribution curve
except:
(a) One standard deviation including 95% of the values
(b) Median is the mid point
(c) Mode is the common value recurrently occurring
(d) Mean and mode coincides
(AI, 2000)
9. The relation between mean deviation about mean and
quartile deviation is:
(a) Mean deviation is less than quartile deviation
(b) Mean deviation is more than quartile deviation
(c) Mean deviation is equal to quartile deviation
(d) They are not related to each other
10. The points of inflexion of the normal curve are:
(a) Mean ± SD
(b) Mean ± 2 SD
(c) Mean ± 3 SD
(d) Mean ± 2/3 SD
11. If X and Y are two independent normal variates then X – Y
is also a normal variate:
(a) True
(b) False
12. The mean and variance of a normal distribution:
(a) Are same
(b) Cannot be same
(c) Are sometimes equal
(d) Are equal in the limiting case, as n → ∞
13. For a normal distribution:
(a) Mean> Median > Mode
(b) Mean < Median < Mode
(c) Mean > Median < Mode
(d) Mean = Mode = Median
14. The standard normal distribution is represented by:
(a) N (0,0)
(b) N (0,1)
(c) N (1,0)
(d) N (1,1)
15. If in a normal distribution the standard deviation is
equal to 45, then the mean deviation from mean is
equal to:
(a) 45
(b) 40
(c) 36
(d) 30
16. In a normal distribution the number of observations
less than divided by mean are included in the range:
(a) Mean ± 3 SD
(b) Mean ± 1 SD
(c) Mean ± 2 SD
(d) Mean ± 0.67 SD
[Hint: As mean divides the total area into two equal parts (i.e.
50% of observations will lie below mean and 50% of
observations lie above mean). The first quartile of normal
distribution is μ – 0.6745σ and the third quartile is μ + 0.6745σ.
These limits will include 50% of
observations. Therefore number of observations included
within limits Mean ± 0.67 SD will be less than that divided
by mean].
17. Normal distribution is:
(a) Very flat
(b) Very peaked
(c) Smooth
(d) Bell shaped symmetrical distribution about mean
18. There are two independent normal variate X and Y. X
~ N (6, 3) and Y ~ N (3, 6). Then the distribution of X–Y
is:
(a) N (3,3)
(b) N (3,6)
(c) N (–3, 9)
(d) N (3,9)
19. Total area under the normal probability curve is:
(a) 100
(b) 10
(c) 1
(d) 0.05
20. Binomial distribution tends to normal distribution if:
(a) n → ∞ and neither p nor q is very small
(b) n →∞ and p → 0
(c) n →∞ and q → 0
(d) None of the above
21. Normal distribution is symmetrical only for some
specified values of X:
(a) True
(b) False
22. For a normal distribution, quartile deviation, mean
deviation and standard deviation are in the ratio:
(a) 4/5 : 2/3: 1
(b) 2/3: 4/5: 1
(c) 1: 4/5 : 2/3
(d) 4/5: 1: 2/3
23. The mean deviation about mean of a normal
distribution is:
(a)
(b)
(c)
(d)
[Hint:
is approximately equal to
]
24. If X is distributed Normally with mean μ and variance
σ², then a linear combination of X, i.e. aX + b will also
be a Normal Variate with:
(a) Mean aμ and variance a²σ²
(b) Mean aμ + b and variance a²σ²
(c) Mean μ + b and variance b²σ²
(d) Mean bμ + a and variance b²σ²
25. In the estimation of standard probability, Z Score is
applicable to:
(a) Normal distribution
(b) Skewed distribution
(c) Binomial distribution
(d) Poisson distribution
(UPSC, 2001)
26. A non-symmetric frequency distribution is known as:
(a) Normal distribution
(b) Skewed distribution
(c) Cumulative frequency distribution
(d) None of the above
(Orissa, 99)
27. The area between one standard deviation on either
side of mean in a normal distribution is:
(a) 62%
(b) 68%
(c) 90%
(d) 99% (AIIMS, May 95)
28. True about normal distribution curve is all except:
(a) Mean, median and mode coincides
(b) Total area of the curve is one
(c) Standard deviation is one
(d) Mean of the curve is hundred
(AIIMS, Dec. 97)
[SD of standard normal curve is 1]
29. Which statement is true about standard normal
distribution curve:
(a) Mean 1 and standard deviation 0
(b) Mean 0 and standard deviation 1
(c) Curve skews towards left
(d) Curve skews towards right
(AIIMS, Nov 99)
30. In a normal distribution curve, True statement is:
(a) Mean = SD
(b) Median = SD
(c) Mean = 2 Median
(d) Mean = Mode
(AIIMS, May 2001)
31. Systolic BP of a group of persons follows a normal
distribution curve. The mean BP is 120. The values
above 120 are:
(a) 25%
(b) 75%
(c) 50%
(d) 100% (AIIMS, Nov 2001)
32. All are true in normal distribution curve except:
(a) Is bell shaped , symmetrical and on the x axis
(b) Occurs only in normal people
(c) Median=mode=mean
(Manipal, 2002)
33. A population study showed a mean glucose of 86 mg/
dL. In a sample of 100 showing normal curve
distribution, what percentage of people have glucose
above 86?
(a) 65
(b) 50
(c) 75
(d) 60
(AI, 2002)
34. The standard normal distribution:
(a) Is skewed to the left
(b) Has mean = 1.0
(c) Has standard deviation = 0.0
(d) Has variance = 1
(AI, 2002)
Chapter 5
Correlation and Regression
ASSOCIATION AND CORRELATION
Association
Association may be defined as the concurrence of two random
variables when they occur more frequently together than one
would expect by chance.
Correlation
Correlation indicates the degree of association between two
random variables
CORRELATION
Consider a series in which each term assumes values of two
or more variables. For example, if we measure the heights
and weights of a certain group of persons, we get a
distribution known as a bivariate distribution.
If the two variables deviate in the same direction then the
correlation is said to be positive. But if they deviate in opposite
directions then the correlation is said to be negative.
A scatter diagram is the simplest way to represent a bivariate
distribution.
Karl Pearson Coefficient of Correlation
Correlation coefficient between two random variables x and
y, usually denoted by $r_{xy}$, is a numerical measure of linear
relationship between them:
$r_{xy} = \dfrac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} = \dfrac{\frac{1}{n}\sum xy - \bar{x}\,\bar{y}}{\sigma_x \sigma_y}$
Graphical representation of the standard data for
different values of r.
Figure 5.1
Properties of Correlation Coefficient
1. Correlation coefficient ‘r’ lies between –1 and +1
2. Correlation coefficient is independent of change of origin
and scale.
3. Two independent variables are uncorrelated. If x and
y are two independent variables then rxy = 0.
4. But two uncorrelated variables may or may not be
independent; rxy = 0 merely implies the absence of any
linear relationship.
Standard Error of Correlation Coefficient
If ‘r’ is the correlation coefficient in a sample of n pairs of
observations, then its standard error is given by:
$SE(r) = \dfrac{1 - r^2}{\sqrt{n}}$
REGRESSION
Regression Analysis
Regression analysis is a mathematical measure of the average
relationship between two or more variables in terms of original
units of the data.
The line of regression is obtained by the principles of least
square.
Let us suppose that in a bi-variate distribution (xi, yi); (i = 1, 2,
...n); y is dependent variable and x is independent variable.
Let the line of regression of y on x be given by:
y = a + bx
Where a and b are constants, estimated by the method of least
squares.
‘b’ is the slope of the regression equation of y on x.
The line of regression of y on x is given by:
$(y - \bar{y}) = r\,\dfrac{\sigma_y}{\sigma_x}\,(x - \bar{x})$
The line of regression of x on y is given by:
$(x - \bar{x}) = r\,\dfrac{\sigma_x}{\sigma_y}\,(y - \bar{y})$
The two regression coefficients will never be of different signs.
The correlation coefficient can also be calculated on the basis
of the regression coefficients:
$r = \pm\sqrt{b_{yx} \cdot b_{xy}}$
Where $b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x}$ and $b_{xy} = r\,\dfrac{\sigma_x}{\sigma_y}$
Hence, $b_{yx} \cdot b_{xy} = r^2$
It may be noted that the sign of correlation coefficient is the
same as that of regression coefficient, since the sign of each
depends upon the co-variance term. Thus if regression
coefficients are positive, ‘r’ is positive and if the regression
coefficients are negative, ‘r’ is negative.
Solved Example
Find the correlation coefficient and line of regression between
height and weight of 10 individuals:
Case no.   1    2    3    4    5    6    7    8    9    10
Height     175  166  182  167  176  169  182  190  187  151
Weight     65   56   78   66   72   69   81   87   84   60
Correlation Coefficient
Height (xi)   Weight (yi)   ui = (xi – 170)   vi = (yi – 70)   ui²    vi²    ui.vi
175           65            +5                –5               25     25     –25
166           56            –4                –14              16     196    +56
182           78            +12               +8               144    64     +96
167           66            –3                –4               9      16     +12
176           72            +6                +2               36     4      +12
169           69            –1                –1               1      1      +1
182           81            +12               +11              144    121    +132
190           87            +20               +17              400    289    +340
187           84            +17               +14              289    196    +238
151           60            –19               –10              361    100    +190
Total (N = 10)              +45               +18              1425   1012   1052
Mean u = 45/10 = 4.5; Mean v = 18/10 = 1.8
SD (ui) = $\sqrt{\dfrac{1425}{10} - (4.5)^2} = \sqrt{142.5 - 20.25} = \sqrt{122.25} = 11.06$
SD (vi) = $\sqrt{\dfrac{1012}{10} - (1.8)^2} = \sqrt{101.2 - 3.24} = \sqrt{97.96} = 9.90$
‘r’ = $\dfrac{\sum u_i v_i / N - \bar{u}\,\bar{v}}{\sigma_u \sigma_v} = \dfrac{1052/10 - 4.5 \times 1.8}{11.06 \times 9.90} = \dfrac{105.2 - 8.1}{109.49} = 0.887 \approx 0.89$
Mean of x = 170 + 4.5 = 174.5; Mean of y = 70 + 1.8 = 71.8
SD (x) = SD (u) = 11.06 and SD (y) = SD (v) = 9.90
Regression of x on y: $(x - 174.5) = 0.887 \times \dfrac{11.06}{9.90}\,(y - 71.8)$
x – 174.5 = 0.99 (y – 71.8)
x – 174.5 = 0.99y – 71.08 or, x = 103.42 + 0.99y
Similarly, regression of y on x: $(y - 71.8) = 0.887 \times \dfrac{9.90}{11.06}\,(x - 174.5)$
y – 71.8 = 0.79 (x – 174.5) or, y = 0.79x – 66.06
Thus by putting the value of one variable in the regression
equation we can predict the value of the other variable.
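The solved example can be cross-checked with a short script. This is a sketch under the chapter's conventions (variances and covariance divided by n); the function name `pearson_and_regression` and the returned keys are illustrative. Because it works at full precision, its output can differ in the second decimal place from a hand calculation that rounds intermediate values.

```python
import math

heights = [175, 166, 182, 167, 176, 169, 182, 190, 187, 151]
weights = [65, 56, 78, 66, 72, 69, 81, 87, 84, 60]

def pearson_and_regression(x, y):
    """Correlation coefficient and both regression coefficients for paired data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    r = cov / (sd_x * sd_y)
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "r": r,
        "b_yx": r * sd_y / sd_x,  # slope of the regression of y on x
        "b_xy": r * sd_x / sd_y,  # slope of the regression of x on y
    }

result = pearson_and_regression(heights, weights)
print(f"r = {result['r']:.3f}")
# Regression of y on x: y = mean_y + b_yx * (x - mean_x)
intercept_y = result["mean_y"] - result["b_yx"] * result["mean_x"]
print(f"y = {result['b_yx']:.3f} x + ({intercept_y:.2f})")
# Regression of x on y: x = mean_x + b_xy * (y - mean_y)
intercept_x = result["mean_x"] - result["b_xy"] * result["mean_y"]
print(f"x = {result['b_xy']:.3f} y + ({intercept_x:.2f})")
```

The product of the two slopes equals r², which is a quick consistency check on any hand calculation.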
MULTIPLE CHOICE QUESTIONS
1. Correlation between two variables is a numerical
measure of:
(a) Relationship between them
(b) Linear relationship between them
(c) Quadratic relationship between them
(d) All the above
2. If the correlation coefficient between two variables is
zero, then:
(a) Two variables are independent
(b) Two variables are linearly related
(c) There is a perfect correlation between the two
variables
(d) There may be a non-linear relation between the two
variables
3. The correlation coefficient between X and Y will have
positive sign when:
(a) X is increasing and Y is decreasing
(b) Both X and Y are increasing
(c) X is decreasing and Y is increasing
(d) There is no change in X and Y
4. The coefficient of correlation:
(a) Can take any value between –1 and +1
(b) Is always less than –1
(c) Is always greater than +1
(d) Cannot be zero
5. The coefficient of correlation between X and Y is +0.24.
Their covariance is 3.5 and the variance of X is 16. The
SD of Y is:
(a)
(c)
0.24
4 3.5
(b)
16
3.5 0.24
(d)
3.5
0.24 4
6. The coefficient of correlation is independent of:
(a) Change of scale only
(b) Change of origin only
(c) Both change of origin and scale
(d) Neither change of origin nor change of scale
7. Probable error of r is:
(a)
(c) 0.6745
(1 r 2 )
n
(b) 0.6745
(1 r 2 )
n
(d) 0.6745
(1 r 2 )
n
8. If one of the regression coefficient is greater than unity
then the other will be:
(a) Also greater than unity
(b) less than unity
(c) will equal to 1
(d) All the above
9. If two variables are uncorrelated then the two line of
regression, i.e. X on Y and Y on X will:
(a) Coincides
(b) Perpendicular
(c) The angle between will be equal to 45°
(d) The two lines are parallel to each other
10. If one of the regression coefficient is positive then the
other will be:
(a) Also positive
(b) Will be negative
(c) May or may not be positive
(d) Not depends on the sign of the regression coefficient
11. The correlation coefficient between two variables X
and Y is 0.63. All the values of X and Y are multiplied
by a non-zero constant 6. The correlation between the
new variables will be:
(a) More than 0.63
(b) Less than 0.63
(c) 0.63
(d) Cannot be calculated
12. Regression coefficient is independent of:
(a) Change of scale only
(b) Change of origin only
(c) Change of origin as well as scale
(d) Neither change of origin nor scale
13. If the two lines of regression X on Y and Y on X coincides
then the correlation will be:
(a) r = + 1
(b) r = 0
(c) r = +0.5
(d) – 1 < r < 1
14. If the lines of regression are given as x + 2y – 5 = 0 and
2x + 3y = 8. Then the mean of x and y respectively are:
(a) 1, 2
(b) 1, 2
(c) 2, 5
(d) 2, 3
[Hint: The lines of regression pass through the point of means
($\bar{x}$, $\bar{y}$). Therefore, at this point the lines of regression
become $\bar{x} + 2\bar{y} - 5 = 0$ and $2\bar{x} + 3\bar{y} = 8$; by solving
these two equations we can calculate the means of x and y]
15. The following statistic is used to measure the linear
association between two characteristics in the same
individuals:
(a) Coefficient of variation
(b) Coefficient of correlation
(c) Chi-square
(d) Standard error
(Karnat, 96)
16. All are features of the coefficient of correlation except:
(a) Cause effect association cannot be shown
(b) Risk association can be revealed
(c) Correlation risk to disease
(d) Indicates linear relationship
(AIIMS, 97)
17. When height and weight are perfectly correlated, the
coefficient of correlation is:
(a) +1
(b) –1
(c) 0
(d) More than 1
(AIIMS, 2000)
18. Height to weight is a/an:
(a) Association
(b) Correlation
(c) Proportion
(d) Index
(AIIMS, 96)
[Hint: Association is the relationship between two random
variables and correlation coefficient shows the degree of
association].
19. Correlation coefficient tends to lie between:
(a) Zero to –1.0
(b) –1.0 to +1.0
(c) +1.0 to zero
(d) +2.0 to –2.0
(AIIMS, June 97)
20. If the correlation between height and weight is 2.6, the
true statement is:
(a) Positive correlation
(b) No association
(c) Negative correlation
(d) Calculation of coefficient is wrong
(AIIMS, June 2000)
21. If the regression between height and age follows y = a +
bx, the curve is:
(a) Hyperbola
(b) Sigmoid
(c) Straight line
(d) Parabola
(AIIMS, Nov 2001)
22. The correlation between IMR and socioeconomic
status is best depicted by:
(a) Correlation (+1)
(b) Correlation (+0.5)
(c) Correlation (– 1)
(d) Correlation (– 0.8)
(AIIMS, Nov 2001)
[Hint: The IMR decreases with an increase in socioeconomic
status, but the two are not perfectly correlated].
23. The correlation between variables A and B in a study
was found to be 1.1. This indicates:
(a) Very strong correlation
(b) Moderately strong correlation
(c) Weak correlation
(d) Computational mistake in calculating correlation
(AI, 2002)
24. A cardiologist found a highly significant correlation
coefficient (r = 0.90, p = 0.01) between the systolic blood
pressure values and serum cholesterol values of the
patients attending his clinic. Which of the following
statements is a wrong interpretation of the correlation?
(a) Since there is a high correlation the magnitudes of
both the measurements are likely to be close to each
other.
(b) A patient with a high level of systolic BP is also
likely to have a high level of serum cholesterol.
(c) A patient with a low level of systolic BP is also likely
to have a low level of serum cholesterol.
(d) About 80% of the variation in systolic blood pressure
among his patients can be explained by their serum
cholesterol values and vice versa.
(AI, 2005)
25. Total Cholesterol level = a + b (calorific intake) + c
(physical activity) + d (body mass index); is an example
of:
(a) Simple linear regression
(b) Simple curvilinear regression
(c) Multiple linear regression
(d) Multiple logistic regression
(AI, 2005)
Chapter 6
Probability
Random Series: If a coin is tossed a very large number of times,
and the result of each toss is written down, the result may be
something like the following (H standing for heads and T for
tails):
H, H, T, T, T, H, T, H, H, H, T, T, H, H, T, H, .......................
Such a sequence is called a Random Sequence or Random
Series.
Trial and Events: Each toss in the above series is called a
Trial and each result is called an Outcome or Event.
In the above series, the outcome of the first trial is a head.
Exhaustive Events: The total number of possible events
in any trial is known as Exhaustive Events or Exhaustive
Cases. Thus, in tossing a coin there are only two events,
Head and Tail, and in throwing a die there are six exhaustive
cases, since one of the six faces 1, 2, 3, ..., 6 will come
uppermost.
Mutually Exclusive Events: Events are said to be mutually
exclusive if the happening of one precludes the happening of
all the others. For example, in throwing a die all 6 faces 1 to
6 are mutually exclusive, since if one of these faces comes up,
the possibility of all the other faces in the same trial is ruled
out.
Equally Likely Events: Events are said to be equally likely
if all the events in a trial have an equal chance of taking
place, so that there is no reason to expect one in preference
to the others. For example, in throwing an unbiased die, all
six faces are equally likely to come up.
Independent Events: Several events are said to be
independent if the happening of any one of them is not
affected by knowledge concerning the occurrence of any
number of the remaining events. For example, in tossing an
unbiased coin, the event of getting a head in the first toss is
independent of getting a head in the second, third and
subsequent tosses.
MATHEMATICAL OR CLASSICAL PROBABILITY
If a trial results in ‘n’ exhaustive, mutually exclusive
and equally likely cases and out of them ‘m’ are favourable to
the happening of an event ‘E’, then the probability of
happening of the event ‘E’ is:
$$p = P(E) = \frac{m}{n}$$
and the probability of non-occurrence of the event E is:
$$q = \frac{n - m}{n} = 1 - \frac{m}{n} = 1 - p$$
Thus, p + q = 1
Obviously, p and q are non-negative and cannot exceed
1, i.e. 0 ≤ p ≤ 1 and 0 ≤ q ≤ 1.
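The classical definition can be checked by direct counting. The following is a minimal sketch; the event chosen (an even face on an unbiased die) and the function name `classical_probability` are illustrative assumptions, not part of the text.

```python
from fractions import Fraction

def classical_probability(favourable, exhaustive):
    """Return p = m/n for m favourable cases out of n exhaustive,
    mutually exclusive and equally likely cases."""
    if exhaustive <= 0 or not 0 <= favourable <= exhaustive:
        raise ValueError("need n > 0 and 0 <= m <= n")
    return Fraction(favourable, exhaustive)

# Illustrative event (an assumption): E = "an even face turns up"
# when an unbiased die is thrown.
faces = [1, 2, 3, 4, 5, 6]
m = sum(1 for face in faces if face % 2 == 0)  # favourable cases
n = len(faces)                                  # exhaustive cases

p = classical_probability(m, n)  # probability that E happens
q = 1 - p                        # probability that E does not happen
print(p, q, p + q)  # prints: 1/2 1/2 1
```

Exact fractions are used so that p + q = 1 holds without rounding error.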
Sure Event: If the probability of occurrence of an event is
1, i.e. p = P(E) = 1, then E is called a Sure Event.
Impossible Event: If the probability of occurrence of
an event ‘E’ is zero, i.e. p = P(E) = 0, then E is called an
Impossible Event.
ADDITIVE AND MULTIPLICATIVE PROPERTY OF
PROBABILITY
Here we will consider the two basic laws of probability, i.e.
the addition and multiplication operation of probability.
Addition Rule
Suppose that, in a population of doctors, the probability that
a doctor is male is 0.8 and the probability that a doctor is a
surgeon is 0.4. If ‘A’ is the event that a doctor is male, then
P (A) = 0.8; similarly, if ‘B’ is the event that a doctor is a
surgeon, then P (B) = 0.4.
If the two separate probabilities are added, the result
is 0.8 + 0.4 = 1.2, which is wrong because the probability of
an event cannot exceed 1. The error arises because the
double event, a doctor who is both male and a surgeon, is
counted twice: once as part of the male doctors and again as
part of the surgeons. The probability of the double event
must therefore be subtracted.
This can be made clear by the following diagrams:
Figure 6.1
Figure 6.2
In Figure 6.1 the shaded portion is included in circle A as
well as in circle B, i.e. while calculating the probability of
male doctors, the surgeons who are male are included,
and while calculating the probability of surgeons, the males
who are surgeons are included again.
Therefore, in the additive law the probability of the double
event is subtracted, as shown in Figure 6.2.
The additive property of probability states that:
If A and B are two events, the probability that at least one
of them occurs is given by:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
i.e. Prob (A or B or both) = Prob (A) + Prob (B) – Prob (A and B)
In case of Mutually Exclusive Events, P (A and B) = 0, so the
rule reduces to:
P (A or B) = Prob (A) + Prob (B)
For example (Fig. 6.3), a doctor cannot be both a male
surgeon and a female surgeon, so the occurrence of one of
these events rules out the other.
Figure 6.3
Thus, if the probability of a male surgeon in a population of
doctors is P (A) = 0.3 and the probability of a female surgeon
is P (B) = 0.1, then the probability of a surgeon in the
population of doctors is:
P (A or B) = P (A) + P (B) = 0.3 + 0.1 = 0.4
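The addition rule can be sketched in a few lines. The example below reuses the figures quoted in this section; treating the male-surgeon probability of 0.3 as the overlap of the "male" (0.8) and "surgeon" (0.4) events is an assumption made here to tie the two examples together, and the function name `prob_a_or_b` is hypothetical.

```python
from fractions import Fraction

def prob_a_or_b(p_a, p_b, p_a_and_b=0):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
    For mutually exclusive events the overlap is 0 (the default)."""
    return p_a + p_b - p_a_and_b

p_male = Fraction(8, 10)
p_surgeon = Fraction(4, 10)
p_male_surgeon = Fraction(3, 10)    # assumed overlap of the two events
p_female_surgeon = Fraction(1, 10)

# Events that can occur together: subtract the double event.
print(prob_a_or_b(p_male, p_surgeon, p_male_surgeon))   # prints: 9/10

# Mutually exclusive events: simple addition.
print(prob_a_or_b(p_male_surgeon, p_female_surgeon))    # prints: 2/5
```

Without the subtraction the first result would be 1.2, which is the impossible value discussed above.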
Multiplication Rule
When the events are not mutually exclusive:
Figure 6.4
Suppose that in Figure 6.4 there are n points in the square,
and let m1 be the number of points in circle A, m2 the number
of points in circle B and m3 the number of points common
to both A and B (assume m1 > 0 and m2 > 0).
Then the probability that both the events A and B occur
is given by:
P (A and B) = P (A) × P (B given A)
Or
P (A and B) = P (B) × P (A given B)
P (B given A) is known as the conditional probability of
occurrence of B given that A has already
occurred, and P (A given B) is the conditional probability of
occurrence of A given that B has already occurred.
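In the above example,
$$P(A) = \frac{m_1}{n}; \quad P(B) = \frac{m_2}{n}; \quad P(B \text{ given } A) = \frac{m_3}{m_1}; \quad P(A \text{ given } B) = \frac{m_3}{m_2}$$
Thus,
$$P(A \text{ and } B) = \frac{m_1}{n} \times \frac{m_3}{m_1} = \frac{m_3}{n}$$
Or
$$P(A \text{ and } B) = \frac{m_2}{n} \times \frac{m_3}{m_2} = \frac{m_3}{n}$$
which is equal to the ratio of the number of points common to
both A and B to the total number of points, i.e. n.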
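The two forms of the multiplication rule can be verified numerically. This is a small sketch in which the counts n, m1, m2 and m3 are hypothetical values chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical counts (assumptions, not from the text).
n = 100   # points in the square
m1 = 40   # points in circle A
m2 = 30   # points in circle B
m3 = 12   # points common to A and B

p_a = Fraction(m1, n)
p_b = Fraction(m2, n)
p_b_given_a = Fraction(m3, m1)  # conditional probability of B given A
p_a_given_b = Fraction(m3, m2)  # conditional probability of A given B

via_a = p_a * p_b_given_a  # P(A) x P(B given A)
via_b = p_b * p_a_given_b  # P(B) x P(A given B)
print(via_a, via_b)  # prints: 3/25 3/25
```

Both routes give m3/n, the proportion of points common to A and B.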
In case of independent events:
The multiplication rule is:
P (A and B) = P (A) . P (B)
Suppose that two random sequences of trials are
proceeding simultaneously; for example, at each stage a coin
may be tossed and a die thrown. What is the probability of
a particular combination of results, for example a head (H) on
the coin and a 5 on the die? The result is given by the simple
multiplication rule:
P (H and 5) = P (H) × P (5)
In this example, the probability of a 5 on the die is not
affected by whether or not H occurred on the coin. In other
words, the two events are independent, and by the
multiplication rule the probability of H and 5 is equal to:
$$P(H \text{ and } 5) = P(H) \cdot P(5) = \frac{1}{2} \times \frac{1}{6} = \frac{1}{12}$$
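The same result can be obtained by listing every equally likely (coin, die) pair and counting. The sketch below assumes an unbiased coin and an unbiased die; the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]
sample_space = list(product(coin, die))  # 12 equally likely pairs

def prob(event):
    """Classical probability: favourable outcomes / all outcomes."""
    favourable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favourable, len(sample_space))

p_h = prob(lambda o: o[0] == "H")          # head on the coin
p_5 = prob(lambda o: o[1] == 5)            # five on the die
p_h_and_5 = prob(lambda o: o == ("H", 5))  # both together
print(p_h, p_5, p_h_and_5)  # prints: 1/2 1/6 1/12
```

Counting the joint outcome directly agrees with the product P (H) × P (5), which is what independence means here.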
MULTIPLE CHOICE QUESTIONS
1. The probability of a sure event is:
(a) 0
(b) 0.5
(c) – 1
(d) + 1
2. Out of 1000 individuals surveyed, it was observed that
260 were suffering from respiratory disorders and 470
from diabetes, and 170 were suffering from
diabetes as well as respiratory disorders. The
probability of persons suffering from respiratory
problems is:
(a) 0.26
(b) 0.43
(c) 0.17
(d) 0.47
[Hint: The total number of persons suffering from respiratory
disorders also includes those who are suffering from both
respiratory disorders and diabetes].
3. In the above problem the probability of individuals
who are suffering from diabetes alone is:
(a) 0.47
(b) 0.17
(c) 0.26
(d) 0.43
4. Find the probability of persons suffering from
respiratory disorders, diabetes as well as both diabetes
and respiratory disorders:
(a) 1.07
(b) 0.17
(c) 0.90
(d) 0.69
5. Find the probability of persons suffering from diabetes
as well as respiratory disorders:
(a) 0.90
(b) 0.17
(c) 1.17
(d) 0.47
6. The probability of any event can never
exceed:
(a) 0.5
(b) 0.9
(c) –1
(d) 1
7. The probability of any event lies between:
(a) – 1 < P < 1
(b) 0 < p < 1
(c) 0 < P < 1
(d) –1 < P < 0
8. In a population, the incidence of ocular deficiency in males
is 20%, and in females is 25%. What is the probability
of ocular disease in the population:
(a) 0.05
(b) 0.25
(c) 0.45
(d) None of the above
9. In question no. (8) what is the probability of diabetes
in the population:
(a) 0
(b) 0.25
(c) 0.20
(d) None of the above
10. The events A and B are mutually exclusive, so:
(a) Prob. (A or B) = Prob (A) + Prob (B)
(b) Prob (A and B) = Prob (A) . Prob (B)
(c) Prob (A) = Prob (B)
(d) Prob (A) + Prob (B) = 1
(AI, 2005)
Chapter 7
Sampling and Design
of Experiments
POPULATION
The group of individuals under study is called population or
universe. The population may be finite or infinite.
SAMPLE
A finite subset of individuals in a population is called a
sample and the number of individuals in a sample is called
sample size.
Sample characteristics are used to estimate, approximately,
the corresponding characteristics of the population. The error
involved in such approximation is known as sampling error,
which is inherent and unavoidable in every sampling scheme.
Types of Sampling
Some of the commonly known and frequently used sampling
techniques are:
1. Random sampling
2. Stratified sampling
3. Systematic sampling
4. Cluster sampling
Random Sampling
In this case the sampling units are selected at random. A
random sample is one in which each unit of population has
an equal chance of being included in the sample.
Suppose we take a sample of size n from a finite population
of size N. Then there are $^{N}C_{n}$ possible samples. A sampling
technique in which each of the $^{N}C_{n}$ samples has an equal
chance of being selected is known as Random Sampling, and the
sample obtained by this technique is termed a random sample.
In simple random sampling each unit of the population
has an equal chance of being included in the sample, and
this probability is independent of the previous drawings. To
ensure that sampling is simple, it must be done with replacement
if the population is finite. However, in the case of an infinite
population replacement is not necessary.
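As a rough illustration, the number of possible samples and one random draw can be obtained with Python's standard library. The sizes N = 10 and n = 3 and the seed are assumptions chosen only to keep the sketch small and repeatable.

```python
import math
import random

N, n = 10, 3                         # hypothetical population and sample sizes
population = list(range(1, N + 1))   # units numbered 1..N

# Number of distinct samples of size n that can be drawn, NCn.
num_possible_samples = math.comb(N, n)
print(num_possible_samples)  # prints: 120

rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# Drawing without replacement: every one of the 120 samples is equally likely.
without_replacement = rng.sample(population, n)

# Drawing with replacement: each draw picks any unit with probability 1/N,
# independently of the previous draws, so a unit may appear more than once.
with_replacement = rng.choices(population, k=n)
```

`random.sample` corresponds to counting $^{N}C_{n}$ distinct samples, while `random.choices` corresponds to the with-replacement scheme described in the text.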
Stratified Sampling
If the population is not homogeneous, then the entire
heterogeneous population is divided into a number of
homogeneous groups, usually called strata. Units are
sampled at random from each of these strata; the sample
size in each stratum varies according to the relative importance
of the stratum in the population.
The sample which is the aggregate of the sampled units
of each stratum is termed a stratified sample.
Such a sample is a good representative of the population
when the population considered is heterogeneous.
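A minimal sketch of stratified sampling follows. It assumes proportional allocation, i.e. that the "relative importance" of a stratum is its share of the population; the strata, their sizes and the function name `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, total_n, seed=0):
    """Draw a simple random sample from every stratum, with the stratum
    sample size proportional to the stratum's share of the population.
    Rounding may make the total differ slightly from total_n in general."""
    rng = random.Random(seed)
    population_size = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        k = round(total_n * len(units) / population_size)
        sample[name] = rng.sample(units, k)
    return sample

# Hypothetical heterogeneous population split into two homogeneous strata.
strata = {
    "urban": [f"U{i}" for i in range(600)],
    "rural": [f"R{i}" for i in range(400)],
}
chosen = stratified_sample(strata, total_n=50)
print({name: len(units) for name, units in chosen.items()})
# prints: {'urban': 30, 'rural': 20}
```

The stratified sample is the aggregate of the units drawn from each stratum.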
Systematic Sampling
In systematic sampling the number of units in the population
should be a multiple of the number of units in the sample (i.e.
the sample size). Suppose there are N units in the population,
numbered in some order, and we want to draw a sample of n
units from this population. Then there should be a constant k
which, when multiplied by the sample size (n), equals the
population size (N), i.e. n . k = N or k = N/n. We divide the N
units of the population into n groups of k units each as follows:
Group 1:   1              2              3              4              ...   i              ...   k
Group 2:   k + 1          k + 2          k + 3          k + 4          ...   i + k          ...   2k
Group 3:   2k + 1         2k + 2         2k + 3         2k + 4         ...   i + 2k         ...   3k
...
Group n:   (n – 1)k + 1   (n – 1)k + 2   (n – 1)k + 3   (n – 1)k + 4   ...   i + (n – 1)k   ...   (n – 1)k + k = nk = N
In systematic sampling, to select a sample of n units, if k =
N/n then every kth unit is selected, commencing with a
randomly chosen number between 1 and k. Hence, the
selection of the first unit determines the whole sample. Let
the ith unit be selected at random from the first k units; then the
sample will consist of the ith, (i + k)th, (i + 2k)th, ..., [i + (n – 1)k]th
units of the population.
In systematic sampling the first unit is drawn at random
and the remaining units follow a systematic pattern.
Example: Suppose from a population of size N = 5,000, we
want to draw a sample of size 250 (i.e. n = 250), then
k = 5,000/250 = 20. Therefore, in systematic sampling the first
unit of the sample is selected at random from the first 20 units
of the population. Suppose the 6th unit is drawn from the first
20 units. Then the first unit of the sample will be the 6th unit of
the population, the second unit of the sample will be the 26th unit
of the population, the next unit will be the 46th unit of the
population and so on. In this way we can draw a sample of size 250.
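The example can be reproduced with a short function. This is a sketch that assumes N is an exact multiple of n, as required above; the function name `systematic_sample` and its parameters are illustrative.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Return the unit numbers i, i + k, i + 2k, ..., i + (n - 1)k,
    where k = N / n and i is a random start between 1 and k."""
    if N % n != 0:
        raise ValueError("this sketch assumes N is an exact multiple of n")
    k = N // n
    if start is None:
        start = random.Random(seed).randint(1, k)  # random start in 1..k
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and k")
    return [start + j * k for j in range(n)]

# The example from the text: N = 5,000, n = 250, random start = 6.
units = systematic_sample(5000, 250, start=6)
print(units[:3], units[-1], len(units))  # prints: [6, 26, 46] 4986 250
```

Only the first unit is random; once it is fixed, the remaining 249 units follow automatically.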
Advantages of Systematic Sampling
1. Easier to draw without mistakes.
2. More precise than simple random sampling, as the sample
is more evenly spread over the population.
Disadvantages of Systematic Sampling
1. If the list has a periodic arrangement, then it can fare very
badly.
Cluster Sampling
Contrary to simple random sampling and stratified
sampling, where single subjects are selected from the
population, in cluster sampling the subjects are selected in
groups or clusters.
Cluster sampling is used when ‘natural’ groupings are
evident in the population. The total population is divided
into groups or clusters. Elements within a cluster should be
as heterogeneous as possible, but there should be
homogeneity between clusters. The clusters must be mutually
exclusive and collectively exhaustive. A random sampling
technique is then used on the relevant clusters to choose which
clusters to include in the study.
In single-stage cluster sampling, all the elements from
each of the selected clusters are used. In two-stage cluster
sampling a random sampling technique is applied to the
elements from each of the selected clusters.
One version of cluster sampling is area sampling or
geographical cluster sampling. Clusters consist of
geographical areas. A geographically dispersed population
can be expensive to survey. Greater economy than simple
random sampling can be achieved by treating several
respondents within a local area as a cluster.
Example: Suppose we want to conduct interviews with hotel
managers in a major city about their training needs. We could decide
that each hotel in the city represents one cluster, and then randomly
select a small number, say 10. Then we can contact the managers
of these 10 hotels for interview. When all the managers of the selected
10 hotels are interviewed, this is referred to as ‘one-stage
cluster sampling’.
If the subjects to be interviewed are selected randomly within
the selected clusters, it is called ‘two-stage cluster sampling’.
This technique might be more appropriate if the number of subjects
within a unit is very large (e.g. instead of interviewing managers,
we want to interview employees).
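The hotel example can be sketched as follows. The sampling frame (40 hotels with 25 employees each), the second-stage sample size of 5 and the seed are hypothetical values used only for illustration.

```python
import random

rng = random.Random(1)  # fixed seed only so the sketch is repeatable

# Hypothetical frame: 40 hotels (clusters), each with 25 employees.
hotels = {
    f"hotel_{h}": [f"h{h}_emp{e}" for e in range(25)]
    for h in range(40)
}

# Stage 1: randomly select 10 clusters (hotels).
selected_hotels = rng.sample(sorted(hotels), 10)

# One-stage cluster sampling: take every unit in the selected clusters.
one_stage = [person for h in selected_hotels for person in hotels[h]]

# Two-stage cluster sampling: draw a random sample within each selected cluster.
two_stage = [
    person
    for h in selected_hotels
    for person in rng.sample(hotels[h], 5)
]
print(len(one_stage), len(two_stage))  # prints: 250 50
```

Only a list of clusters is needed for the first stage; lists of individual units are required only for the clusters actually selected.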
Advantages of Cluster Sampling
1. The main objective of cluster sampling is to reduce
costs, i.e. cluster sampling reduces field costs.
2. Applicable where no complete list of units is available
(special lists need be formed only for the selected clusters).
Disadvantages of Cluster Sampling
1. Clusters may not be representative of whole population
but may be too alike.
2. Analysis is more complicated than for simple random
sampling.
Difference between Cluster Sampling
and Random Sampling
1. In simple random sampling single subjects are selected
from the population, while in cluster sampling the
subjects are selected in groups or clusters.
2. As compared to random sampling the cluster sampling
is more evenly spread over the population.
Difference between Stratified and Cluster Sampling
1. Unlike strata, clusters are thought of as being typical of
the population rather than subsections of it; in stratified
sampling we divide the heterogeneous population into
homogeneous subsections (strata).
2. In stratified sampling subjects are selected randomly
within strata, while in cluster sampling all units of the
selected clusters are interviewed (one-stage cluster
sampling).
3. In stratified sampling the strata should be homogeneous,
i.e. there should be maximum homogeneity within strata.
In cluster sampling, however, the clusters should be as
heterogeneous as possible; each cluster should be a small-
scale version of the population. In other words, there
should be maximum heterogeneity within clusters and
minimum heterogeneity between clusters.
Multistage Sampling
We can also combine cluster sampling with stratified
sampling. For example, suppose we want to interview employees
in randomly selected clusters of hotels (as in the above example
of cluster sampling). We might stratify employees based on
some characteristic (e.g. seniority, job function, etc.) and then
randomly select employees from each of these strata. This
type of sampling is referred to as Multistage Sampling.
Parameter and Statistic
In order to avoid verbal confusion with the statistical constants
of the population, viz. mean (μ), standard deviation (σ), etc.,
which are usually referred to as parameters, statistical
measures computed from the sample observations alone, e.g.
mean ($\bar{x}$) and standard deviation (s), etc., have been termed
statistics.
Sampling Distribution
If we draw a sample of size n from a population of size N,
then the total number of possible samples will be $^{N}C_{n}$ = k
(say). For each of these k samples we can compute the mean and
standard deviation; there will then be k values of the mean as
well as of the standard deviation. The set of values of a statistic
so obtained, one for each sample, is called its sampling distribution.
Standard Error
The standard deviation of sampling distribution is known as
its standard error (SE).
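For a very small population the whole sampling distribution of the mean can be listed. The sketch below uses a hypothetical population of four values and samples of size 2 drawn without replacement; these numbers are assumptions for illustration only.

```python
from itertools import combinations
from statistics import mean, pstdev

population = [2, 4, 6, 8]  # hypothetical population, N = 4
n = 2                      # sample size

# All NCn = 6 possible samples and the mean of each one.
samples = list(combinations(population, n))
sample_means = [mean(s) for s in samples]

# The standard error is the standard deviation of this sampling distribution.
standard_error = pstdev(sample_means)

print(len(samples))                             # prints: 6
print(mean(sample_means) == mean(population))   # prints: True
print(round(standard_error, 3))                 # prints: 1.291
```

The k sample means form the sampling distribution of the mean; in this example their average equals the population mean.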
The standard errors of some well known statistics, for large
samples, are given below, where n is the sample size, σ is the
population standard deviation, P the population
proportion and Q = 1 – P; n1 and n2 represent the sizes of
two independent random samples respectively drawn from
the population(s), and the subscripts 1 and 2 on σ, P and Q
refer to the corresponding populations.
Statistic                                                        Standard error
Sample mean $\bar{x}$                                            $\sigma/\sqrt{n}$
Sample proportion p                                              $\sqrt{PQ/n}$
Difference between two sample means ($\bar{x}_1 - \bar{x}_2$)    $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$
Difference between two sample proportions (p1 – p2)              $\sqrt{P_1 Q_1/n_1 + P_2 Q_2/n_2}$
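The large-sample standard errors can be written as small functions. This sketch assumes the standard formulas σ/√n, √(PQ/n), √(σ1²/n1 + σ2²/n2) and √(P1Q1/n1 + P2Q2/n2); the function names and the numerical inputs are hypothetical.

```python
import math

def se_mean(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def se_proportion(P, n):
    """Standard error of a sample proportion: sqrt(PQ / n), with Q = 1 - P."""
    return math.sqrt(P * (1 - P) / n)

def se_diff_means(sigma1, n1, sigma2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)

def se_diff_proportions(P1, n1, P2, n2):
    """Standard error of the difference between two independent sample proportions."""
    return math.sqrt(P1 * (1 - P1) / n1 + P2 * (1 - P2) / n2)

# Hypothetical inputs, chosen only for illustration.
print(se_mean(12, 36))                       # prints: 2.0
print(round(se_proportion(0.5, 100), 4))     # prints: 0.05
print(round(se_diff_means(3, 9, 4, 16), 4))  # prints: 1.4142
```

All four functions apply to large samples, as stated in the text.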
Utility of Standard Error
Standard error plays a very important role in the large sample
theory and forms the basis of testing of hypothesis.
The magnitude of standard error gives an index of the
precision of the estimate of the parameter. The reciprocal of
standard error is taken as the measure of reliability or
precision of statistic.
Thus, in order to double the precision, which amounts to
reducing the standard error to half, the sample size has to be
increased four times.
SE enables us to determine the probable limits within which the
population parameters may be expected to lie. The probable
limits for the population proportion P are given by:
$$p \pm 3\sqrt{\frac{pq}{n}}$$
Confidence Limits based on Mean and Standard Error
95% confidence limits     Mean ± 2 SE
99% confidence limits     Mean ± 3 SE
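The limits Mean ± 2 SE and Mean ± 3 SE quoted above can be computed directly. In this sketch the mean, standard deviation and sample size are hypothetical, and the multipliers 2 and 3 follow the text's rule of thumb.

```python
import math

def confidence_limits(mean, se, multiplier):
    """Return (lower, upper) = mean -/+ multiplier x SE."""
    return (mean - multiplier * se, mean + multiplier * se)

# Hypothetical sample summary (assumptions for illustration).
mean, sigma, n = 50.0, 10.0, 25
se = sigma / math.sqrt(n)  # standard error of the mean

print(confidence_limits(mean, se, 2))  # prints: (46.0, 54.0)
print(confidence_limits(mean, se, 3))  # prints: (44.0, 56.0)

# Quadrupling the sample size halves the standard error,
# i.e. doubles the precision.
se_4n = sigma / math.sqrt(4 * n)
print(se, se_4n)  # prints: 2.0 1.0
```

The last two lines illustrate why doubling the precision requires four times as many observations.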
Size of a Statistical Investigation
One question most commonly asked about the planning of a
statistical study is: how many observations should be made?
In any review of this problem at the planning stage, it is likely
to be important to relate the sample size to a specified degree of
precision.
Suppose we want to compare the means of two
populations, μ1 and μ2, assuming that they have the same known
standard deviation, σ, and that two equal samples of size ‘n’ are
to be taken. If the standard deviations are known to be different,
the present result may be thought of as an approximation
(taking σ to be the mean of the two values). If the comparison is of
two proportions, π1 and π2, σ may be taken approximately to
be the pooled value
$$\sigma = \sqrt{\tfrac{1}{2}\left[\pi_1(1 - \pi_1) + \pi_2(1 - \pi_2)\right]}$$
We now consider two ways in which the precision may
be specified.
Given Standard Error
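Suppose it is required that the standard error of the difference
between the observed means $\bar{x}_1$ and $\bar{x}_2$ is less than ε;
equivalently, the width of the 95% confidence interval might
be specified to be not wider than ±2ε. This implies
$$\sigma\sqrt{\frac{2}{n}} < \varepsilon \quad \text{or} \quad n > \frac{2\sigma^2}{\varepsilon^2}$$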
Given Difference to be Significant
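We might require that if $\bar{x}_1 - \bar{x}_2$ is greater in absolute value
than some value d0, then it shall be significant at some
specified level (say, in a two-sided test at the 2α level). Denote by
u2α the standardized normal deviate exceeded in absolute value
with probability 2α (for 2α = 0.05, u2α = 1.96). Then
$$\frac{d_0}{\sigma\sqrt{2/n}} > u_{2\alpha} \quad \text{or} \quad n > 2\left(\frac{u_{2\alpha}\,\sigma}{d_0}\right)^2$$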
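Both ways of specifying precision lead to a minimum sample size per group. The sketch below assumes two equal samples of size n with a common known σ, so that the standard error of the difference between the two means is σ√(2/n); the function names and the numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist

def n_for_standard_error(sigma, eps):
    """Smallest n per group with sigma * sqrt(2 / n) < eps,
    i.e. n > 2 * sigma**2 / eps**2."""
    bound = 2 * sigma ** 2 / eps ** 2
    return math.floor(bound) + 1

def n_for_significant_difference(sigma, d0, two_sided_level=0.05):
    """Smallest n per group such that a difference of d0 is significant
    at the given two-sided level, i.e. n > 2 * (u * sigma / d0)**2."""
    u = NormalDist().inv_cdf(1 - two_sided_level / 2)  # about 1.96 for 0.05
    bound = 2 * (u * sigma / d0) ** 2
    return math.floor(bound) + 1

# Hypothetical planning values.
print(n_for_standard_error(sigma=10, eps=2))         # prints: 51
print(n_for_significant_difference(sigma=10, d0=5))  # prints: 31
```

In both cases the required n grows with the square of σ and shrinks with the square of the tolerated error or difference.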
DESIGN OF EXPERIMENTS
While planning a clinical experiment to compare the effects
of various treatments on some type of experimental unit,
the problem is how the treatments should be allotted to
these units.
The allotment of treatments to experimental units should
be such that the disparity between the characteristics of units
receiving different treatments is eliminated. This disparity
cannot be eliminated completely, but it can be reduced if the
groups of experimental units to which the treatments are to be
applied are made alike in various relevant respects.
The three basic principles for doing this are:
1. Randomization
2. Replication
3. Local Control.
Randomization
In its simplest form, randomization means that the choice of
treatment for each unit should be made by an independent
act of randomization (by the toss of a coin or by using a random
number table).
In clinical trials the total number of patients is often not
known in advance, since many patients may become available
for inclusion in the trial some time after it has started. The simplest
method is then to allocate treatment by an independent
random choice for each patient.
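Simple randomization can be sketched as an independent random choice made for each patient as he or she enters the trial. The treatment labels "A" and "B", the patient identifiers and the seed are assumptions for illustration; a pseudo-random generator stands in for the coin or random number table.

```python
import random

def allocate(patients, treatments=("A", "B"), seed=None):
    """Allocate each patient to a treatment by an independent random
    choice, the software equivalent of tossing a coin per patient."""
    rng = random.Random(seed)
    return {patient: rng.choice(treatments) for patient in patients}

# Hypothetical patients entering the trial one after another.
patients = [f"patient_{i}" for i in range(1, 9)]
allocation = allocate(patients, seed=7)
for patient, treatment in allocation.items():
    print(patient, treatment)
```

Because each choice is independent, the two groups need not end up exactly equal in size; this is a known property of simple randomization.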
Replication
An important principle of experimental design is Replication,
the use of more than one experimental unit for each treatment.
Various purposes are served by replication:
(a) An appropriate amount of replication ensures that the
comparisons between treatments are sufficiently precise;
the sampling error of the difference between two means
decreases as the amount of replication in each group increases.
(b) The effect of sampling variation can be estimated only if
there is an adequate degree of replication. For
example, in the comparison of the means of two groups,
if both sample sizes were as low as 2, the degrees of
freedom for a ‘t’ test would be only 2. The critical points of
‘t’ at 2 degrees of freedom are very high, and the test
therefore loses a great deal in effectiveness merely
because of the inadequacy of the estimate of within-group
variation.
(c) Replication may be useful in enabling observations to be
spread over a wide variety of experimental conditions.
Local Control
The third basic principle, Local Control, concerns the
reduction of random variation between experimental units. The
formula for the standard error of a mean, $\sigma/\sqrt{n}$, shows
that the effect of random error can be reduced
either by increasing ‘n’ (the number of replications) or by
decreasing ‘σ’. This suggests that experimental units should
be as homogeneous as possible in their response to treatment.
In clinical trials, for example, it may be that a precise
comparison could be effected by restricting the age, sex, clinical
condition and other features of the patients, but these
restrictions may make it too difficult to generalize the
results. A useful solution to this dilemma is to subdivide the
units into relatively homogeneous groups called blocks.
Treatments can then be allocated randomly within blocks so
that each block provides a small experiment. The
precision of the overall comparison between treatments is
then determined by random variability within blocks rather
than between different blocks. This is called a Randomized
Block Design.
There are some more complex designs allowing
simultaneously comparing more than one set of treatments.
But they are beyond the scope of this book.
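A randomized block allocation can be sketched as a separate random shuffle of the treatments within each block. The blocks (age groups), patient labels and treatments below are hypothetical, and the sketch assumes exactly one unit per treatment in every block.

```python
import random

def randomized_block_design(blocks, treatments, seed=None):
    """Within every block, assign the treatments to the units in a
    freshly shuffled order, so each treatment occurs once per block."""
    rng = random.Random(seed)
    design = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("this sketch assumes one unit per treatment in each block")
        order = list(treatments)
        rng.shuffle(order)  # independent randomization inside the block
        design[block] = dict(zip(units, order))
    return design

# Hypothetical blocks of relatively homogeneous patients.
blocks = {
    "age 20-39": ["p1", "p2", "p3"],
    "age 40-59": ["p4", "p5", "p6"],
}
design = randomized_block_design(blocks, ["A", "B", "C"], seed=3)
for block, assignment in design.items():
    print(block, assignment)
```

Since every treatment appears in every block, differences between blocks do not affect the comparison between treatments.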
MULTIPLE CHOICE QUESTIONS
1. If the mean is 230 and the standard error is 10, the 95%
confidence limits would be:
(a) 210 to 250
(b) 220 to 240
(c) 225 to 235
(d) 230 to 210
(AI, 89)
2. All of the following are examples of random sampling
method except:
(a) Stratified sampling
(b) Quota sampling
(c) Systematic sampling
(d) Simple random sampling
(AI, 96, AIIMS, 2000)
3. Area under 2SD of normal curve is:
(a) 66%
(b) 95%
(c) 97%
(d) 99%
(AI, 93)
4. True regarding a “double blind” study:
(a) The participant is not aware whether he or she is in the
study or the control group
(b) Neither the doctor nor the participant is aware of
the group allocation and the treatment received
(c) The participants, the investigator and the person
analyzing the data are all blind
(d) All the above
(AI, 96)
5. Sampling error is:
(a)
(b)
(c)
(d) None
(AI, 2001)
[There are only two types of error in testing a hypothesis:
α-error is Type-I error and β-error is Type-II error. Sampling
error is inherent in a sample when estimating population
parameters on the basis of the samples drawn; proper sampling
will reduce the sampling error].
6. Which is true in cluster sampling:
(a) Every nth case is chosen for study
(b) Natural group is taken as sampling unit
(c) Stratification of the population is done
(d) Involves use of random number
[In cluster sampling, clusters are selected by natural demarcation
and every unit of a selected cluster is taken as a sampling unit]
(AIIMS, 92)
7. The sampling method adopted for a VIP coverage
evaluation survey of a district is:
(a) Random sampling
(b) Cluster sampling
(c) Stratified sampling
(d) Multistage sampling
(JIPMER, 80, Orissa 91)
8. If, in a survey of a village, you divide the
population into lanes and rows, select 5 lanes at random
and survey all houses in each selected lane, this is a type of:
(a) Simple random sampling
(b) Stratified sampling
(c) Systematic sampling
(d) Cluster sampling
[Hint: In cluster sampling we divide the population into
clusters according to geographical criteria and then take all
units of the selected clusters, at least in one-stage cluster sampling].
9. True about simple random sampling is:
(a) Adjacent numbers are considered while taking the sample
(b) Each unit has an equal chance of being drawn in
the sample
(c) Each portion of the sample represents a corresponding
stratum of the universe
(d) None of the above
(AIIMS, 2001)
10. For a survey, a village is divided into 5 lanes then each
lane is sampled randomly. It is an example of:
(a) Simple random sample
(b) Stratified random sampling
(c) Systematic random sampling
(d) All of the above
(AIIMS, 96)
11. True about simple random sampling is:
(a) All persons have an equal right to be selected
(b) Only selected persons have the right to be selected
(c) Technique provides the least number of possible
samples
(d) Every fixed unit is taken for sampling
(AIIMS, June 98)
12. If sample size is bigger in random sampling, which of
the following is/are true:
(a) It approaches maximum samples
(b) Reduces non-sampling error
(c) Increases the precision of the result
(d) Decreases standard error
[Hint: Precision is inversely proportional to standard error,
to double the precision we have to reduce the standard error to
half, thus increasing the sample size four times].
(AIIMS, June 99)
13. In a random sample the chance of a unit being picked is:
(a) Same and known
(b) Not same and not known
(c) Same and not known
(d) Not same but known
[Hint: If a sample of size ‘n’ is drawn from a population of size
N, the probability of selecting any given unit at a single draw is 1/N].
(AIIMS,Nov 99)
14. While calculating the incubation period for measles in
a group of 25 children, the standard deviation is 2 and
mean incubation period is 8 days. Calculate standard
error:
(a) 0.4
(b) 1
(c) 2
(d) 0.5
15. In a population of pregnant females, Hb is estimated in
100 women with a standard deviation of 1 gm. The
standard error is:
(a) 1
(b) 0.1
(c) 0.01
(d) 10
(AIIMS, Nov 2001)
16. In a controlled trial to compare two treatments, the main
purpose of randomization is to ensure that:
(a) Two groups will be similar in prognostic factors
(b) The clinician does not know which treatment the
subjects will receive
(c) The sample may be referred to a known population
(d) The clinician can predict in advance which
treatment the subjects will receive
(AIIMS, 2002)
17. Mean hemoglobin of a sample of 100 pregnant women
was found to be 10 mg% with a standard deviation
1.0mg%. The standard error of the estimate would be:
(a) 0.01
(b) 0.1
(c) 1.0
(d) 10.0
(AIIMS, 2004)
18. Which sampling method is used in assessing
immunization status of children under an
immunization programme:
(a) Quota sampling
(b) Multistage sampling
(c) Stratified random sampling
(d) Cluster sampling
[Hint: In cluster sampling we divide the population into small
clusters which are representative of the population. Cluster
sampling involves less time and cost].
(AIIMS, 2004)
Chapter 8
Testing of Hypothesis
Statistical Hypothesis
A statement about population which we want to verify on the
basis of information available from a sample.
Test a Statistical Hypothesis
It is a two-action decision problem: after the experimental
sample values have been obtained, the two actions are
acceptance or rejection of the hypothesis under consideration.
Null Hypothesis
Null hypothesis is the hypothesis of no difference, which is
usually denoted by H0.
Alternative Hypothesis
Every statistical hypothesis is tested to decide whether the
null hypothesis is to be accepted or rejected. This is meaningful
only when it is tested against a rival hypothesis, called the
alternative hypothesis and denoted by H1.
Wrongly rejecting a null hypothesis seems to be more
serious error than wrongly accepting it.
Critical Region
Let x1, x2, ........ xn be the sample observations, denoted by “O”.
All the possible values of “O” form the aggregate of samples and they
constitute a space called the sample space. We consider (x1, x2,
........ xn) as a point in the ‘n’ dimensional sample space.
We divide the sample space into two disjoint parts, ω and its
complement ω̄.
We reject the null hypothesis H0 if the observed sample
point falls in ω. The region ω is known as the critical region.
Figure 8.1
Types of Errors
Table relating the decision taken from the sample to the true state of the hypothesis:

True statement    Decision from sample
                  Accept H0                Reject H0
H0 True           Correct                  Wrong (Type-I error)
H0 False          Wrong (Type-II error)    Correct

The probabilities of Type-I and Type-II errors are denoted
by α and β respectively.
α = Probability of Type-I error, i.e. probability of rejecting
H0 when it is true.
β = Probability of Type-II error, i.e. probability of
accepting H0 when H0 is false.
Level of Significance
α, the probability of Type-I error, is known as the level of
significance. It is also called the size of the critical region.
Power of Test
(1 – β) is called the power of the test of the hypothesis H0
against the alternative hypothesis H1.
Since a Type-I error is deemed to be more serious than a
Type-II error, the usual practice is to control the Type-I error
at a predetermined level α and choose a test which
minimizes β.
Steps in Solving Testing of Hypothesis Problem
1. Explicit knowledge about the nature of the population about
which the hypotheses are set up.
2. Setting up the null and alternative hypotheses.
3. Choosing a suitable statistic, called the test statistic, which will
reflect the probability of H0 and H1.
4. On the basis of the test statistic, rejecting or accepting the null
hypothesis.
Test of Significance
A very important aspect of sampling theory is the study of the
test of significance, which enables us to decide, on the basis of
sample results, whether
(i) the deviation between the observed sample statistic and
the hypothetical parameter value, or
(ii) the deviation between two independent sample statistics
is significant or might be attributed to chance or
fluctuations of sampling.
One Tailed and Two Tailed Tests
In any test, the critical region is represented by a portion of
the area under the probability curve of the sampling
distribution of the test statistic.
A test of a statistical hypothesis where the alternative hypothesis
is one tailed (right tailed or left tailed) is called a one tailed
test.
For example, testing the mean of a population
H0 : μ = μ0
against the alternative
H1 : μ > μ0 (right tailed) or H1 : μ < μ0 (left tailed)
is called a one tailed test.
A test where the alternative hypothesis is two tailed, such
as:
H0 : μ = μ0
against the alternative
H1 : μ ≠ μ0
is called a two tailed test.
Critical Values or Significant Values
The value of the test statistic which separates the critical
region (rejection region) and the acceptance region is called
critical value or significant value.
It depends upon:
(i) The level of significance used.
(ii) The alternative hypothesis, whether it is two tailed or
single tailed.
Suppose that Zα is the critical value of the test statistic at a
level of significance α for a two tailed test. The value of Zα
is such that the area to the right of Zα is α/2 and the area to the
left of –Zα is also α/2. Thus, the total
area α is divided into two equal parts.
Two Tailed Test (Level of Significance α)
Figure 8.2
In case of a single-tail test, the critical value Zα is
determined so that the total area to the right of it (for right tailed
test) is α, and for a left tailed test the total area to the left of
–Zα is α.
Figure 8.3
Figure 8.4
Thus, the critical value of Z for a single tailed test (left or
right) at a level of significance ‘α’ is the same as the critical value of Z for a two
tailed test at a level of significance ‘2α’.
Critical values (Zα) of ‘Z’

Critical values (Zα)    Level of significance
                        1%             5%             10%
Two tailed test         |Zα| = 2.58    |Zα| = 1.96    |Zα| = 1.64
Right tailed test       Zα = 2.33      Zα = 1.64      Zα = 1.28
Left tailed test        Zα = –2.33     Zα = –1.64     Zα = –1.28
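The tabulated critical values of Z can be reproduced from the standard normal distribution. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the helper name `z_critical` is an illustrative assumption and not part of the text.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1


def z_critical(alpha, two_tailed):
    """Positive critical value of Z at level of significance alpha."""
    # A two tailed test puts alpha/2 in each tail; a one tailed test puts alpha in one tail.
    tail_area = alpha / 2 if two_tailed else alpha
    return std_normal.inv_cdf(1 - tail_area)


for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: two tailed {z_critical(alpha, True):.2f}, "
          f"one tailed {z_critical(alpha, False):.2f}")
```

The one tailed value at level α coincides with the two tailed value at level 2α, as stated above. For a left tailed test the critical value is the negative of the one tailed value.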
TEST OF SIGNIFICANCE FOR LARGE SAMPLES
For large values of n, almost all distributions are very
closely approximated by the normal distribution. Thus we can
apply the normal test, which is based upon the fundamental
properties of the normal probability curve (area property).
1. Compute the test statistic Z under H0.
2. If |Z| > 3, H0 is always rejected.
3. If |Z| ≤ 3, we test its significance at a certain level of
significance, usually at 5% and sometimes at 1% level of
significance.
Thus for a two tailed test, if |Z| > 1.96, H0 is rejected at 5%
level of significance. Similarly, if |Z| > 2.58, H0 is rejected at
1% level of significance.
For practical purposes, a sample may be regarded as large if
n > 30.
Sampling of Attributes
In sampling of attributes, the population is divided into two mutually
exclusive classes: one class possessing a particular attribute,
say ‘A’, and the other class not possessing that attribute (‘not A’).
The presence of the attribute in a sampling unit may be termed
a success and its absence a failure.
Test for Single Proportion
If x is the number of successes in n independent trials with
constant probability of success ‘P’, then the observed proportion of successes is
p = x/n
and the SE of the proportion is
$$SE(p) = \sqrt{\frac{PQ}{n}}$$, where Q = 1 – P.
Then the test statistic, for large n, is
$$Z = \frac{p - P}{\sqrt{PQ/n}} \sim N(0, 1)$$
under the null hypothesis that the sample proportion is
equal to the population proportion, i.e. the sample is drawn from
a population with proportion of success P.
The probable limits for the observed
proportion of success are:
$$P \pm 3\sqrt{\frac{PQ}{n}}$$
If P is not known, then taking p (the sample proportion)
as an estimate of P, the probable limits for the
proportion in the population are:
$$p \pm 3\,SE(p), \text{ i.e. } p \pm 3\sqrt{\frac{pq}{n}}, \text{ where } q = 1 - p$$
In particular, 95% confidence limits for P are $p \pm 1.96\sqrt{pq/n}$,
and 99% confidence limits for P are $p \pm 2.58\sqrt{pq/n}$.
TEST OF SIGNIFICANCE FOR DIFFERENCE
OF PROPORTION
Let x1 and x2 be the number of persons possessing a certain
characteristic (attribute), say A, in random samples of sizes n1
and n2 from the two populations respectively.
Then the sample proportions are given by:
p1 = x1/n1 and p2 = x2/n2
If P1 and P2 are the population proportions, then under
the null hypothesis H0 : P1 = P2 = P, the test statistic for the difference
of proportions is
$$Z = \frac{p_1 - p_2}{\sqrt{PQ\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \sim N(0, 1)$$
Generally we do not have any information about the
proportion of “A” in the population. In such circumstances an
estimate of the population proportion under the null hypothesis
H0 : P1 = P2 = P (say) is calculated. The estimate
of P is
$$\hat{P} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} \quad \text{and} \quad \hat{Q} = 1 - \hat{P}$$
Then the test statistic is
$$Z = \frac{p_1 - p_2}{\sqrt{\hat{P}\hat{Q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
Solved Examples
Test for Single Proportion
QUESTION: Thirty people were attacked by a viral disease in a
village and 28 survived. If the survival rate of this viral
infection is reported to be 85%, test whether the survival rate
in this village is more than the reported survival
rate at 5% level of significance.
SOLUTION:
Setting of Hypothesis
Null hypothesis: The survival rate in this village is equal to
the reported survival rate, i.e.
H0 : P = 0.85
Alternative hypothesis: Survival rate in this village is more than
85%, i.e. H1 : P > 0.85 (One tail test)
Total number of persons survived x = 28
Total number of persons attacked by infection n = 30
Proportion of persons survived: p = x/n = 28/30 = 0.93.
The reported survival rate = 85%, i.e. P = 0.85;
therefore
Q = 1 – 0.85 = 0.15
The Test Statistic:
$$Z = \frac{p - P}{\sqrt{PQ/n}} \sim N(0, 1)$$
$$Z = \frac{0.93 - 0.85}{\sqrt{\dfrac{0.85 \times 0.15}{30}}} = \frac{0.08}{\sqrt{0.00425}} = \frac{0.08}{0.065} = 1.23$$
Tabulated value of Z at 0.05 (i.e. critical value) = 1.64 (for one
tailed test).
Because Zcal < Ztab, the null hypothesis is accepted.
Conclusion: The survival rate in the village is not significantly more than
the reported survival rate.
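The single proportion calculation can be checked with a short script. This is a minimal sketch using only the figures of the example (x = 28, n = 30, P = 0.85); the function name `z_single_proportion` is an illustrative assumption. Because p is not rounded to 0.93 here, the result (about 1.28) differs slightly from the hand calculation, without changing the decision.

```python
from math import sqrt


def z_single_proportion(x, n, P):
    """Large sample Z statistic for H0: population proportion equals P."""
    p = x / n                    # observed proportion of successes
    se = sqrt(P * (1 - P) / n)   # SE(p) = sqrt(PQ/n) under H0
    return (p - P) / se


z_cal = z_single_proportion(x=28, n=30, P=0.85)
z_tab = 1.64  # one tailed critical value at 5% level, from the table of critical values

print(f"Z calculated = {z_cal:.2f}")
print("Reject H0" if z_cal > z_tab else "Accept H0")
```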
Test of Significance of Difference of Proportion
(When population proportion is not known):
QUESTION: In a survey conducted by a health agency, it was found
that in Town A out of 876 births 45% were male, while in Town B
out of 690 births 473 were male.
Is there any significant difference in the proportion of male
children in the two towns?
SOLUTION:
Proportion of male children in Town A: p1 = 0.45;
therefore
q1 = (1 – p1) = (1 – 0.45) = 0.55
Total number of births in Town A is 876, i.e. n1 = 876
In Town B out of 690 births 473 were male, therefore
p2 = 473/690 = 0.685; q2 = (1 – 0.685) = 0.315 and n2 = 690
Setting of Hypothesis
Null hypothesis: There is no significant difference between the
proportions of male children in the two towns, i.e. H0 : P1 = P2
Alternative hypothesis: H1 : P1 ≠ P2 (Two tail test).
Because the population proportion is not known, we
have to estimate it from the sample proportions:
$$\hat{P} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{876 \times 0.45 + 690 \times 0.685}{876 + 690} = 0.55$$
therefore,
Q = 1 – 0.55 = 0.45
Test statistic:
$$Z = \frac{p_1 - p_2}{\sqrt{\hat{P}\hat{Q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.45 - 0.685}{\sqrt{0.55 \times 0.45 \left(\frac{1}{876} + \frac{1}{690}\right)}}$$
$$Z = \frac{-0.235}{\sqrt{0.2475 \times 0.00259}} = \frac{-0.235}{0.0253} = -9.3$$
Note that 1/876 + 1/690 = 0.00259, so the standard error is 0.0253.
Critical value of Z at 5% level of significance (for two tail
test) = 1.96, which is less than |Zcal| = 9.3. Thus the null hypothesis is
rejected.
Conclusion: There is a significant difference between the
proportions of male births in the two towns.
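The arithmetic of the two proportion test is easy to get wrong by hand, in particular the term 1/n1 + 1/n2, so a short check may help. This is a minimal sketch with the example's figures (p1 = 0.45 of 876 births, 473 of 690 births); the function name `z_two_proportions` is an illustrative assumption.

```python
from math import sqrt


def z_two_proportions(p1, n1, p2, n2):
    """Large sample Z statistic for H0: P1 = P2, using the pooled estimate of P."""
    P = (n1 * p1 + n2 * p2) / (n1 + n2)    # pooled estimate of the common proportion
    Q = 1 - P
    se = sqrt(P * Q * (1 / n1 + 1 / n2))   # standard error of (p1 - p2) under H0
    return (p1 - p2) / se


z_cal = z_two_proportions(p1=0.45, n1=876, p2=473 / 690, n2=690)

print(f"Z calculated = {z_cal:.2f}")
print("Reject H0" if abs(z_cal) > 1.96 else "Accept H0")  # two tailed test, 5% level
```

With unrounded proportions Z is about –9.3, so the null hypothesis is rejected at the 5% level. Its square is close to the χ² value obtained for the same data in the fourfold table example later in this chapter.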
Test of Significance for Single Mean
If x1, x2, ........... xn is a random sample from a normal population
with mean μ and SD σ, then for large samples the statistic
$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
under the null hypothesis H0 that the sample is
drawn from the population with mean μ.
If the population standard deviation is unknown, then we
use the sample standard deviation S as an estimate of σ.
Confidence limits for μ:
95% confidence limits for μ are $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$
and 99% confidence limits for μ are $\bar{x} \pm 2.58\,\sigma/\sqrt{n}$
Test of Significance for Difference of Means
Let x̄1 be the mean of a random sample of size n1 from a
population with mean μ1 and SD σ1, and x̄2 be the mean of an
independent random sample of size n2 from another
population with mean μ2 and SD σ2.
Under the null hypothesis H0 : μ1 = μ2, the test
statistic becomes (for large samples):
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1)$$
Remarks:
1. If σ1² = σ2² = σ², i.e. the samples have been drawn from
populations with a common SD σ, then under H0 : μ1 = μ2
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim N(0, 1)$$
2. If σ is not known, then its estimate based on the sample
variances is used. The unbiased estimate of σ² is given
by:
$$\hat{\sigma}^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
3. If σ1² ≠ σ2², and σ1 and σ2 are not known, then they
can be estimated on the basis of the samples. This results in
some error, which will be very small and can be ignored if the
samples are large. Their estimates for large samples are
given by
σ̂1² = S1² and σ̂2² = S2²
In this case the test statistic is:
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \sim N(0, 1)$$
However, if the sample sizes are small, then a small sample
test, the ‘t-test’ for difference of means, should be used.
Solved Example
Test of Significance for Single Mean
QUESTION: A sample of 900 individuals has a mean haemoglobin
of 12.7 mg%. Is the sample drawn from a population with mean
13.6 mg% and SD 2.70?
SOLUTION:
Setting of Hypothesis
Null hypothesis: The sample is drawn from the population
with mean 13.6, i.e. H0 : μ = 13.6.
Alternative hypothesis: H1 : μ ≠ 13.6 (Two tail test).
The Test Statistic:
$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{12.7 - 13.6}{2.70/\sqrt{900}} = \frac{-0.9}{2.70/30} = \frac{-0.9}{0.09} = -10$$
|Z| = 10
Critical value of Z at 5% level of significance (for two tail test)
= 1.96, i.e. Ztab = 1.96, which is less than the calculated
value of |Z|. Hence we reject the null hypothesis.
Conclusion: The sample cannot be regarded as drawn from a population with
haemoglobin level 13.6 and SD 2.70.
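The single mean statistic can be verified numerically. This is a minimal sketch with the example's figures (n = 900, sample mean 12.7, μ = 13.6, σ = 2.70); the function name `z_single_mean` is an illustrative assumption. The standard error is 2.70/30 = 0.09, which gives Z = –10.

```python
from math import sqrt


def z_single_mean(sample_mean, mu, sigma, n):
    """Large sample Z statistic for H0: the population mean equals mu."""
    se = sigma / sqrt(n)   # standard error of the mean
    return (sample_mean - mu) / se


z_cal = z_single_mean(sample_mean=12.7, mu=13.6, sigma=2.70, n=900)

print(f"Z calculated = {z_cal:.2f}")
print("Reject H0" if abs(z_cal) > 1.96 else "Accept H0")  # two tailed test, 5% level
```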
Test of Significance for Difference of Mean
QUESTION: Random samples were drawn from two hospitals and the
following data on blood pressure of adult male hospital
workers were obtained:

                       Hospital A      Hospital B
Mean blood pressure    127.56 mmHg     140.78 mmHg
Standard deviation     10.37 mmHg      13.77 mmHg
No. of cases           700             360

Is the blood pressure of male workers of Hospital B
significantly higher than that of those working in Hospital A?
SOLUTION:
Setting of Hypothesis
Null hypothesis: There is no significant difference between
the blood pressure of workers working in the two hospitals, i.e.
H0 : μ1 = μ2
Alternative hypothesis: H1 : μ1 < μ2 (one tail test).
Test statistic:
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$
In this example
x̄1 = 127.56; S1 = 10.37; n1 = 700
x̄2 = 140.78; S2 = 13.77 and n2 = 360
Putting these values in the test statistic:
$$Z = \frac{127.56 - 140.78}{\sqrt{\dfrac{(10.37)^2}{700} + \dfrac{(13.77)^2}{360}}} = \frac{-13.22}{\sqrt{0.153 + 0.526}} = \frac{-13.22}{0.82} = -16.12$$
The calculated value of |Z| is much higher than the
tabulated value of Z (1.64 at 5% level of significance for a one tail test). Thus we can reject the null hypothesis.
Conclusion: The difference in the mean values of blood
pressure of workers of the two hospitals is highly significant.
Thus we can say that the mean blood pressure of workers working in
Hospital B is significantly higher than that of those working in
Hospital A.
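The difference of means statistic for the two hospitals can be recomputed as follows. This is a minimal sketch using the sample SDs as estimates of the population SDs, as in the large sample case; the function name `z_two_means` is an illustrative assumption. Without intermediate rounding the value is about –16.0, close to the hand calculation.

```python
from math import sqrt


def z_two_means(mean1, sd1, n1, mean2, sd2, n2):
    """Large sample Z statistic for H0: mu1 = mu2 with unequal, estimated variances."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)   # SE of the difference of means
    return (mean1 - mean2) / se


z_cal = z_two_means(127.56, 10.37, 700, 140.78, 13.77, 360)

print(f"Z calculated = {z_cal:.2f}")
# Left tailed test (H1: mu1 < mu2) at 5% level: reject H0 if Z < -1.64
print("Reject H0" if z_cal < -1.64 else "Accept H0")
```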
EXACT SAMPLING DISTRIBUTION
Chi-Square Distribution (χ² – Distribution)
The square of a standard normal variate is known as a Chi-Square variate with 1 degree of freedom.
If x ~ N(μ, σ²), then $Z = \dfrac{x - \mu}{\sigma}$ is a standard
normal variate, and
$$Z^2 = \left(\frac{x - \mu}{\sigma}\right)^2$$
is a Chi-Square variate with 1 degree of freedom.
In general, if xi (i = 1, 2, ........n) are n independent normal
variates with mean μi and variance σi² (i = 1, 2, ........n), then
$$\chi^2 = \sum_{i=1}^{n} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2$$
is a Chi-Square variate with ‘n’
degrees of freedom.
Remarks:
1. For n = 1, the χ² variate is simply the square of a standard
normal variate.
2. The χ² - distribution tends to the normal distribution for large
degrees of freedom. In practice, for n > 30, the
approximation to the normal distribution is fairly good.
Degree of Freedom
The number of independent variates which make up the statistic
(e.g. χ²) is known as the degrees of freedom and is usually
represented by ν (nu).
In general, the number of degrees of freedom is the total
number of observations less the number of independent
constraints.
In a set of n observations, the degrees of freedom
(df) for χ² are usually (n – 1) because of the one linear constraint
ΣOi = ΣEi = N on the
frequencies.
Mean and Standard Deviation of χ² - Distribution
Mean and SD of the χ² - distribution with “n” degrees of freedom
are ‘n’ and $\sqrt{2n}$ respectively.
Mode and Skewness of χ² - Distribution
Mode of the χ² - distribution with n degrees of freedom is (n – 2).
$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{SD}} = \frac{n - (n - 2)}{\sqrt{2n}} = \sqrt{\frac{2}{n}}$$
Skewness is greater than zero for n > 1, thus the χ² -
distribution is positively skewed.
Further, since skewness is inversely proportional to the square
root of the df, the distribution rapidly tends to symmetry as the df ‘n’
increases.
Figure 8.5
For n = 2 the curve meets the y-axis, y = f(x), at x = 0, where
f(x) = 0.5.
For n = 1, it will be an inverted J-shaped curve.
Conditions for the Validity of χ² - Test
For the validity of the Chi-Square test for “goodness of fit” between
theory and experiment, the following conditions must be
satisfied:
1. Sample observations should be independent.
2. N, the total frequency, should be reasonably large, say greater
than 50.
3. No theoretical cell frequency should be less than 5.
Critical Values
Figure 8.6
The value χ²n(α), known as the upper (right-tailed) α-point,
or critical value, can be obtained from the χ² – table for
different values of n and α.
The value of χ²n(α) increases as ‘n’ (df) increases and as
the level of significance decreases.
Application of χ² - Distribution
The χ² - distribution has a large number of applications, some of
which are: (1) to test the ‘Goodness of fit’ and (2) to test the
independence of ‘attributes’.
1. Goodness of fit: A very powerful test for testing the
significance of the discrepancy between theory and
experiment. It enables us to find whether the deviation of the
experiment from theory is just due to chance or is really due
to the inadequacy of the theory to fit the observed data.
If Oi (i = 1, 2, ........ n) is the set of observed (experimental)
frequencies and Ei (i = 1, 2, ........ n) is the corresponding
set of expected frequencies (theoretical or hypothetical),
then Chi-Square is given by:
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
which follows a χ² distribution with (n – 1) degrees of freedom.
2. Independence of attributes:
Four-fold classification:
Comparison of two proportions (2 × 2 contingency table):
An alternative method of representing the proportions
is a 2 × 2 contingency table or fourfold classification.
The total frequency or grand total is split into different
dichotomies represented by the two ‘horizontal’ rows and
the two ‘vertical’ columns. There are four combinations
(2 × 2) of row and column categories and the
corresponding frequencies occupy the four inner cells of
the body of the table. The comparison can be done by
applying χ² significance tests (discussed for comparing
several proportions).
The 2 × 2 contingency table is described as:

            Group 1    Group 2    Group 1 + Group 2
Positive    r1         r2         R (r1 + r2)
Negative    n1 – r1    n2 – r2    N – R
Total       n1         n2         N (n1 + n2)
Manifold Classification
Comparison of several proportions (2 × k contingency table):
The comparison of two proportions was considered from two
points of view: the sampling error of the difference of
proportions and the χ² significance test.
When more than two proportions are compared, the
calculation of standard errors between pairs of proportions
requires several comparisons, and an undue number of
significant differences may arise. χ² provides a method by
which we can compare several proportions.
Suppose there are k groups of observations and that in
the ith group ni individuals have been observed, of whom ri
show a certain characteristic (say being positive). The
proportion of positives, ri/ni, is denoted by pi. The data may be
described as follows:

Group                  1          2          ...   i          ...   k          All groups
Positive               r1         r2         ...   ri         ...   rk         R
Negative               n1 – r1    n2 – r2    ...   ni – ri    ...   nk – rk    N – R
Total                  n1         n2         ...   ni         ...   nk         N
Proportion positive    p1         p2         ...   pi         ...   pk         P = R/N

The frequencies form a 2 × k contingency table (there being
2 rows and k columns). The χ² test requires, for each observed
frequency Oi, an expected frequency which is calculated by
the formula:
$$E_i = \frac{\text{Row total} \times \text{Column total}}{N}$$
The quantity $(O_i - E_i)^2 / E_i$ is calculated and finally
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
The summation is over the 2k cells in the table.
On the null hypothesis that all k samples are drawn
randomly from populations with the same proportion of
positives, χ² is distributed approximately as χ² with (k – 1)(2 – 1) = (k – 1)
df.
General Contingency Table (r × s)
Let us consider two attributes A and B. A is divided into r
classes A1, A2, ........ Ar and B is divided into s classes B1, B2
........ Bs.
The cell frequencies can be expressed as an (r × s) manifold
contingency table.

        A1        A2        A3        ...    Ar
B1      (A1B1)    (A2B1)    (A3B1)    ...    (ArB1)
B2      (A1B2)    (A2B2)    (A3B2)    ...    (ArB2)
B3      (A1B3)    (A2B3)    (A3B3)    ...    (ArB3)
...     ...       ...       ...       ...    ...
Bs      (A1Bs)    (A2Bs)    (A3Bs)    ...    (ArBs)

(Ai Bj) is the number of persons possessing the attributes
(Ai) and (Bj) [i = 1, 2, ....... r; j = 1, 2, ...... s].
Also
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{s} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where
$$E_{ij} = \frac{(A_i)(B_j)}{N}$$
(Oij is the observed frequency of “Col i” and “Row j”,
Eij is the corresponding expected frequency, and (Ai) and (Bj) are the marginal totals.)
Under the null hypothesis that the attributes are independent,
χ² is distributed as a
χ²-variate with (r – 1)(s – 1) degrees of freedom.
SOLVED EXAMPLE
Fourfold Contingency Table
Comparison of Two Proportion (2 × 2 Contingency Table)
The same question mentioned while calculating the difference of
proportions can also be expressed as follows:

                Town A    Town B    Total
Male            394       473       867
Female          482       217       699
Total Births    876       690       1566

Two proportions can also be compared by applying the χ²
test.
Setting of Hypothesis
Null hypothesis: There is no significant difference between the
proportions of male births of the two towns.
The test statistic is:
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where Oi are the observed values and Ei are the expected
values.
In this example there are four observed values: two values
for males corresponding to Town A and B and two for females
for Town A and B (i.e. 394, 473, 482 and 217 respectively).
The expected values for these four observed values are
calculated as follows:
E(394) = (867 × 876)/1566 = 484.98
Similarly:
E(473) = (867 × 690)/1566 = 382.01
E(482) = (699 × 876)/1566 = 391.01
E(217) = (699 × 690)/1566 = 307.98
$$\chi^2 = \frac{(394 - 484.98)^2}{484.98} + \frac{(473 - 382.01)^2}{382.01} + \frac{(482 - 391.01)^2}{391.01} + \frac{(217 - 307.98)^2}{307.98}$$
χ² = 17.06 + 21.67 + 21.17 + 26.87 = 86.77
The calculated value of χ² is much more than the tabulated value
of χ² at (2 – 1) × (2 – 1) = 1 degree of freedom (3.84 at 5% level of significance). Hence we reject the
null hypothesis.
Conclusion: The proportion of male births in the two towns is not the
same. In Town B the proportion of male births is much higher
when compared with Town A.
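The expected frequencies and the χ² value for the fourfold table can be reproduced with a few lines of code. This is a minimal sketch in plain Python; the function name `chi_square_2x2` and the variable names are illustrative assumptions. Each expected frequency is computed as (row total × column total)/N.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]] (rows by columns)."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    grand_total = a + b + c + d
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Rows: Male, Female; columns: Town A, Town B
chi2_cal = chi_square_2x2(394, 473, 482, 217)
chi2_tab = 3.84  # tabulated value at 1 df and 5% level of significance

print(f"Chi-square calculated = {chi2_cal:.2f}")
print("Reject H0" if chi2_cal > chi2_tab else "Accept H0")
```

The value obtained is about 86.8, in agreement with the hand calculation up to rounding.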
Manifold Contingency Table
Comparison of Several Proportions: The 2 × k Contingency
Table:
QUESTION: The following table shows the persons suffering
from respiratory illness in different groups:

                                   Children    Adolescents    Adult    Elderly people    Total
Presence of respiratory illness    76          47             65       79                267
Absence                            54          67             89       46                256
Total                              130         114            154      125               523

Test whether the proportion of persons suffering from respiratory
illness in the different categories is the same.
SOLUTION: In the above table there are eight observed values
corresponding to four columns and two rows. Therefore this
is a (2 × 4) contingency table.
The expected values corresponding to each of the observed
values are calculated as follows:
E(76) = (267 × 130)/523 = 66.36;    E(47) = (267 × 114)/523 = 58.19
E(65) = (267 × 154)/523 = 78.61;    E(79) = (267 × 125)/523 = 63.81
E(54) = (256 × 130)/523 = 63.63;    E(67) = (256 × 114)/523 = 55.80
E(89) = (256 × 154)/523 = 75.38;    E(46) = (256 × 125)/523 = 61.19
$$\chi^2 = \frac{(76 - 66.36)^2}{66.36} + \frac{(47 - 58.19)^2}{58.19} + \frac{(65 - 78.61)^2}{78.61} + \frac{(79 - 63.81)^2}{63.81} + \frac{(54 - 63.63)^2}{63.63} + \frac{(67 - 55.80)^2}{55.80} + \frac{(89 - 75.38)^2}{75.38} + \frac{(46 - 61.19)^2}{61.19}$$
χ² = 1.40 + 2.15 + 2.36 + 3.62 + 1.46 + 2.25 + 2.46 + 3.77 = 19.46
The critical value of χ² at (2 – 1) × (4 – 1) = 3 degrees of freedom
and 5% level of significance is χ²tab = 7.81. Hence χ²tab is less
than the calculated value of χ²; therefore, we reject the null
hypothesis.
Conclusion: The incidence of respiratory illness in the different
groups is not the same.
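For a table with more cells it is convenient to compute all expected frequencies at once. This is a minimal sketch in plain Python for a general r × s table; the function name `expected_and_chi_square` is an illustrative assumption. The tabulated value 7.81 (3 df, 5% level) is taken from a standard χ² table.

```python
def expected_and_chi_square(table):
    """Return (expected frequencies, chi-square, degrees of freedom) for an r x s table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected frequency of each cell = row total x column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, chi2, df


# Rows: presence, absence; columns: children, adolescents, adults, elderly people
illness = [[76, 47, 65, 79],
           [54, 67, 89, 46]]
expected, chi2_cal, df = expected_and_chi_square(illness)

for row in expected:
    print([round(e, 2) for e in row])
print(f"Chi-square = {chi2_cal:.2f} with {df} df")
print("Reject H0" if chi2_cal > 7.81 else "Accept H0")
```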
Exact Sampling Distribution
The entire large sample theory was based on the application of the
normal test. However, if the sample size ‘n’ is small, the normal
test cannot be applied. For such cases exact sample tests were
developed. Some of these tests are:
1. t-test;
2. F-test;
3. Fisher Z transformation.
The exact sample tests can, however, be applied to large
samples also, though the converse is not true.
In all the exact sample tests, the basic assumption is
that “the population(s) from which the sample(s) are drawn
is (are) normal”.
Student’s ‘t’ distribution: Let xi (i = 1, 2, .......... n) be a random
sample of size n from a normal population with mean μ and
variance σ². Then Student’s t is defined by the statistic:
$$t = \frac{\bar{x} - \mu}{S/\sqrt{n}}$$
where $\bar{x} = \dfrac{\sum x_i}{n}$ is the sample mean and
$S^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$ is the unbiased estimate
of the population variance.
Application of ‘t’-Distribution
The ‘t’-distribution has a wide range of applications, some of
which are:
1. To test if the sample mean x̄ differs significantly from
the hypothetical value of the population mean μ.
2. To test the significance of the difference between two sample
means.
3. To test the significance of an observed sample correlation
coefficient.
Assumptions for Student’s ‘t’ Test
1. The parent population from which the sample is drawn
is normal.
2. The sample observations are independent, i.e. the sample
is random.
3. The population SD σ is unknown.
‘t’- Test for Single Mean
If x1, x2, ..........xn is a random sample drawn from a population
with a specified mean μ0, then under the null hypothesis H0 : μ = μ0 the statistic
$$t = \frac{\bar{x} - \mu_0}{S/\sqrt{n}}, \quad \text{where } S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
follows a ‘t’ distribution with (n – 1) degrees of freedom.
If calculated |t| > tabulated t, the null hypothesis will be
rejected at the level of significance adopted.
‘t’ - Test for Difference of Means
Suppose we want to test if
(a) Two samples xi; (i = 1, 2, ........... n1) and yj; (j = 1, 2, ...........
n2) have been drawn from populations with the same
mean, or
(b) The two sample means x̄ and ȳ differ significantly or not.
Under the null hypothesis
(a) The samples have been drawn from populations with the
same mean, i.e. μx = μy, or
(b) The sample means x̄ and ȳ do not differ significantly.
The Test Statistic
$$t = \frac{\bar{x} - \bar{y}}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
where
$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
follows a ‘t’ distribution with (n1 + n2 – 2) degrees of freedom.
Assumptions of ‘t’- Test for Difference of Means
1. Parent population from which the samples have been
drawn are normally distributed.
2. The population variances are equal and unknown, i.e.
σx² = σy² = σ².
3. The two samples are random and independent of each
other.
Paired ‘t’-Test for Difference of Means
Paired ‘t’-test is applied when
(i) The sample sizes are equal.
(ii) The two samples are not independent but the sample
observations are paired together, the pair of observations
(xi, yi); (i = 1, 2, ........... n) corresponding to the ith unit of the
sample.
Here, instead of considering the difference of means, we
consider the increments di = xi – yi.
Under the null hypothesis H0 : μd = 0, i.e. the increments
are due to fluctuations of sampling.
The Test Statistic:
$$t = \frac{\bar{d}}{S/\sqrt{n}}$$
follows a ‘t’ distribution with (n – 1) degrees of freedom, where
$$\bar{d} = \frac{\sum d_i}{n} \quad \text{and} \quad S^2 = \frac{\sum (d_i - \bar{d})^2}{n - 1}$$
‘t’- Test for Testing Significance of Correlation Coefficient
If r is the observed correlation coefficient in a sample of n pairs
of observations from a bi-variate normal population, then
under the null hypothesis that the population correlation
coefficient is zero, the test statistic
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
follows a Student’s ‘t’ distribution with (n – 2) degrees of
freedom. If t comes out to be significant then we reject H0.
SOLVED EXAMPLES
Test for Significance for Single Mean (For Small
Sample)
QUESTION: A random sample of 10 students has the following IQ:
67, 110, 115, 75, 63, 117, 120, 115, 100 and 94.
Do these data support that the sample is drawn from a population
of medical students with IQ = 100?
SOLUTION:
Setting of hypothesis
Null hypothesis: The sample is drawn from a population
of medical students with IQ = 100, i.e. H0 : μ = 100.
Alternative hypothesis: H1 : μ ≠ 100 (Two Tail Test)
The Test Statistic is:
$$t = \frac{\bar{x} - 100}{S/\sqrt{n}}$$
where
$$S^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1}$$
is an unbiased estimate of σ².
From the above data we can calculate the mean and SD ‘S’,
which are equal to:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{976}{10} = 97.6$$
and
$$S = \sqrt{\frac{99558 - 10(97.6)^2}{10 - 1}} = \sqrt{\frac{99558 - 95257.6}{9}} = 21.86$$
By putting these values in the test statistic we can calculate
the value of ‘t’:
$$t = \frac{97.6 - 100}{21.86/\sqrt{10}} = \frac{-2.4}{21.86/3.16} = \frac{-2.4}{6.91} = -0.35$$
The tabulated value of ‘t’ at (n – 1) = 9 degrees of freedom
at 5% level of significance is 2.26.
The tabulated value of ‘t’ is more than the calculated
value |t| = 0.35; hence we accept the null hypothesis.
Conclusion: The sample may be regarded as drawn from the population of
medical students with IQ = 100.
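The single mean ‘t’ statistic can be checked with Python's standard `statistics` module. This is a minimal sketch; the last IQ value is taken as 94, an assumption made because it is the value consistent with the totals Σx = 976 and Σx² = 99558 used in the worked solution. The tabulated value 2.26 (9 df, 5% level, two tailed) is taken from a standard t table.

```python
from math import sqrt
from statistics import mean, stdev

# Last value assumed to be 94 so that the sum is 976, as in the worked solution
iq = [67, 110, 115, 75, 63, 117, 120, 115, 100, 94]
mu_0 = 100                 # hypothetical population mean under H0

n = len(iq)
x_bar = mean(iq)
s = stdev(iq)              # sample SD with divisor (n - 1)
t_cal = (x_bar - mu_0) / (s / sqrt(n))

print(f"mean = {x_bar:.1f}, S = {s:.2f}, t = {t_cal:.2f}, df = {n - 1}")
print("Reject H0" if abs(t_cal) > 2.26 else "Accept H0")
```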
“t” Test for Difference of Mean between Two
Independent Groups
QUESTION: Two groups of rats were placed on diets with high
and low protein contents and the gains in weight were recorded after
2 months. The gains in weight are as follows:
Group A (high protein diet): 140, 146, 117, 160, 107, 102, 123, 114, 145, 121, 127, 132, 107, 153
Group B (low protein diet): 97, 120, 63, 110, 115, 120, 120, 150, 96, 74, 86
Find out whether there is any significant difference between
the weight gain in rats of the two groups.
SOLUTION:
Setting of hypothesis
Null hypothesis: H0 : μ1 = μ2; and
Alternative hypothesis: H1 : μ1 ≠ μ2 (Two tail test)
The mean and SD of the two groups can be calculated, and are
equal to:
Group A: n1 = 14; x̄1 = 128.11 and S1 = 18.33
Group B: n2 = 11; x̄2 = 104.63 and S2 = 24.68
The Test Statistic
$$t = \frac{\bar{x}_1 - \bar{x}_2}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
where S² is the pooled estimate of variance and is equal to
$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
In this problem S² = 454.73 (by putting the values of n1,
n2, S1 and S2 in the above formula).
Thus
S = √454.73 = 21.32.
The test statistic will be equal to:
$$t = \frac{128.11 - 104.63}{21.32\sqrt{\dfrac{1}{14} + \dfrac{1}{11}}} = \frac{23.48}{21.32\sqrt{0.071 + 0.091}} = \frac{23.48}{8.58} = 2.74$$
Tabulated value of ‘t’ at (n1 + n2 – 2) degrees of freedom,
i.e. 23 df, at 5% level of significance is 2.07, which is less than the calculated value of ‘t’.
Hence, we reject the null hypothesis.
Conclusion: Weight gain of rats in Group A (high protein diet)
is significantly more than that of rats which are on a low protein
diet.
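The pooled variance ‘t’ statistic for two independent groups can be computed directly from the raw observations. This is a minimal sketch; the assignment of the first 14 weight gains to Group A and the remaining 11 to Group B is an assumption, chosen because it reproduces the group sizes and, to within rounding, the means used in the worked solution. The tabulated value 2.07 (23 df, 5% level, two tailed) is taken from a standard t table.

```python
from math import sqrt
from statistics import mean, variance

# Assumed split of the 25 recorded weight gains into the two groups
group_a = [140, 146, 117, 160, 107, 102, 123, 114, 145, 121, 127, 132, 107, 153]
group_b = [97, 120, 63, 110, 115, 120, 120, 150, 96, 74, 86]

n1, n2 = len(group_a), len(group_b)
# Pooled estimate of the common variance; variance() uses divisor (n - 1)
pooled_var = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
se = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)
t_cal = (mean(group_a) - mean(group_b)) / se
df = n1 + n2 - 2

print(f"t = {t_cal:.2f} with {df} df")
print("Reject H0" if abs(t_cal) > 2.07 else "Accept H0")
```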
Paired ‘t’ Test” for Difference of Mean
QUESTION: In a clinical trial the anxiety scores of 10 patients were
recorded (baseline values). A new tranquillizer was given to each
patient for one month. After one month the anxiety scores were
again recorded. They are as follows:

Case number             1     2     3     4     5     6     7     8     9     10
Baseline values (xi)    23    21    24    19    17    26    22    17    12    15
After one month (yi)    15    20    26    17    17    21    16    12    12    11

Find out whether the new tranquillizer is effective in
psychoneurotic patients.
SOLUTION:
Setting of hypothesis
Null hypothesis: There is no difference in mean anxiety score,
i.e.
H0 : μ1 = μ2
Alternative hypothesis: H1 : μ1 ≠ μ2 (Two tail test)
The Test Statistic
$$t = \frac{\bar{d}}{S/\sqrt{n}}$$
where di = xi – yi,
d̄ is the mean of di and S is the standard deviation of di.
The mean and SD of di are calculated as follows:

Case No.    Baseline values (xi)    After one month (yi)    di = xi – yi    di²
1           23                      15                      8               64
2           21                      20                      1               1
3           24                      26                      –2              4
4           19                      17                      2               4
5           17                      17                      0               0
6           26                      21                      5               25
7           22                      16                      6               36
8           17                      12                      5               25
9           12                      12                      0               0
10          15                      11                      4               16
Total                                                       31 – 2 = 29     175

$$\bar{d} = \frac{29}{10} = 2.9; \quad S = \sqrt{\frac{\sum d_i^2 - n\bar{d}^2}{n - 1}} = \sqrt{\frac{175 - 84.1}{9}} = 3.17$$
Putting these values in the test statistic we get the value of ‘t’:
$$t = \frac{2.9}{3.17/\sqrt{10}} = \frac{2.9}{1.003} = 2.89$$
Tabulated value of ‘t’ at (n – 1) degrees of freedom, i.e. 9
degrees of freedom, is 2.26, which is less than the calculated value
of t = 2.89. Hence we reject the null hypothesis.
Conclusion: We can safely say that the new tranquillizer is
effective on psychoneurotic patients.
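The paired ‘t’ calculation reduces to a single mean test on the differences di = xi – yi. This is a minimal sketch using the anxiety scores of the example; the variable names are illustrative assumptions, and the tabulated value 2.26 (9 df, 5% level) is the one quoted in the solution.

```python
from math import sqrt
from statistics import mean, stdev

baseline = [23, 21, 24, 19, 17, 26, 22, 17, 12, 15]    # xi
after = [15, 20, 26, 17, 17, 21, 16, 12, 12, 11]       # yi

d = [x - y for x, y in zip(baseline, after)]           # paired differences di
n = len(d)
d_bar = mean(d)
s_d = stdev(d)                                         # SD of di with divisor (n - 1)
t_cal = d_bar / (s_d / sqrt(n))

print(f"mean difference = {d_bar:.1f}, S = {s_d:.2f}, t = {t_cal:.2f}, df = {n - 1}")
print("Reject H0" if abs(t_cal) > 2.26 else "Accept H0")
```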
‘t’ Test for Significance of Correlation Coefficient
QUESTION: In a sample of 30 individuals, the correlation
coefficient between height and weight is r = +0.46. Find out whether
this correlation coefficient is significant in the population.
SOLUTION:
Setting of hypothesis
Null hypothesis: H0 : ρ = 0, where ρ is the population correlation
coefficient, i.e. the observed sample correlation is not
indicative of any correlation in the population.
Alternative hypothesis: H1 : ρ ≠ 0 (Two tail test)
The Test Statistic
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
is distributed as a ‘t’ distribution with (n
– 2) degrees of freedom.
In this problem r = +0.46; n = 30. Putting these values in
the formula we get
$$t = \frac{0.46\sqrt{30 - 2}}{\sqrt{1 - (0.46)^2}} = \frac{0.46 \times 5.29}{0.888} = \frac{2.43}{0.888} = 2.74$$
Tabulated value of ‘t’ at 28 degrees of freedom and 5%
level of significance is 2.048, which is less than the calculated value
of ‘t’. Thus we reject the null hypothesis.
Conclusion: On the basis of this sample we can say that there
is a significant positive correlation between height and weight
of individuals.
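The ‘t’ statistic for a correlation coefficient needs only r and n. This is a minimal sketch with the example's figures (r = 0.46, n = 30); the function name `t_for_correlation` is an illustrative assumption, and the tabulated value 2.048 is the one quoted in the solution.

```python
from math import sqrt


def t_for_correlation(r, n):
    """t statistic for H0: population correlation coefficient is zero; df = n - 2."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


t_cal = t_for_correlation(r=0.46, n=30)

print(f"t = {t_cal:.2f} with {30 - 2} df")
print("Reject H0" if abs(t_cal) > 2.048 else "Accept H0")
```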
F - Statistic
If X and Y are two independent Chi-Square variates with ν1
and ν2 degrees of freedom, then the F-statistic is defined by:
$$F = \frac{X/\nu_1}{Y/\nu_2}$$
Thus F is defined as the ratio of two independent Chi-Square variates, each divided by its respective degrees of freedom,
and it follows an F-distribution with (ν1, ν2) degrees of freedom.
Mode of F - Distribution
1. Since F > 0, the mode exists if and only if ν1 > 2.
2. The mode of the F-distribution is always < 1.
Skewness of F - Distribution
The coefficient of skewness is given by:
$$S_k = \frac{\text{Mean} - \text{Mode}}{\text{SD}}$$
Since mean > 1 and mode < 1, the skewness is positive. Hence the F-distribution is highly
positively skewed.
Critical values of F - distribution
Figure 8.7
Application of F - Distribution
F-test for Equality of Population Variance
Suppose we want to test
(i) Whether two independent samples xi; (i = 1, 2, ...... n1)
and yj; (j = 1, 2, ...... n2) have been drawn from normal
populations with the same variance σ² (say), or
(ii) Whether the two independent estimates of the population
variance are homogeneous or not.
Under the null hypothesis H0 : σx² = σy² = σ², the statistic
$$F = \frac{S_x^2}{S_y^2}$$
where
$$S_x^2 = \frac{\sum (x_i - \bar{x})^2}{n_1 - 1} \quad \text{and} \quad S_y^2 = \frac{\sum (y_j - \bar{y})^2}{n_2 - 1}$$
follows an F-distribution with (ν1, ν2) degrees of freedom,
where ν1 = n1 – 1 and ν2 = n2 – 1.
F-test for Equality of Several Means
F-test can be used for testing equality of several means using
the technique of Analysis of Variance (ANOVA).
COMPARISON OF SEVERAL GROUPS
One-way Analysis of Variance
The technique ‘analysis of variance’ forms a powerful method
of analyzing the way in which the mean value of a variable
is affected by classifications of the data of various sorts. This
technique is concerned with the comparison of means rather
than variances.
The ‘t’ distribution is used for the comparison of the means of two
groups of data, distinguishing between the paired and
unpaired cases. The ‘analysis of variance’ is a generalization
of the unpaired ‘t’ test, appropriate for any number of groups. It
is entirely equivalent to the unpaired ‘t’ test when there are just
two groups.
Some examples of a one-way classification of data into
several groups are as follows:
(a) The reduction in blood sugar recorded for groups of
individuals given different doses.
(b) The values of certain lung function test recorded for men
of the same age group in a number of different
occupational categories.
Suppose there are k groups of observations on a variable
y, and that the ith group contains ni observations. The
numbering of the groups from 1 to k is quite arbitrary, although
if there is a simple ordering of groups it will be natural to use
this in the numbering.
Groups            1     2     ........  i     ........  k     All groups combined
Number of cases   n1    n2    ........  ni    ........  nk    N = Σ ni
Mean of y         ȳ1    ȳ2    ........  ȳi    ........  ȳk    ȳ = T/N
Sum of y          T1    T2    ........  Ti    ........  Tk    T = Σ Ti
Sum of y²         S1    S2    ........  Si    ........  Sk    S = Σ Si
Note that the entries N, T and S in the final column are the sums along the corresponding rows, but ȳ is not the sum of the ȳi. (ȳ will be the mean of the ȳi if all the ni are equal; otherwise it is the weighted mean ȳ = Σ ni ȳi / N.)
In one-way analysis of variance the total sum of squares about the mean of the N values of y can be partitioned into two parts:
(1) The sum of squares of each reading about its own group mean, and
(2) The sum of squares of the deviations of each group mean about the grand mean:
$$\sum_{i}\sum_{j}(y_{ij} - \bar{y})^2 = \sum_{i}\sum_{j}(y_{ij} - \bar{y}_i)^2 + \sum_{i}\sum_{j}(\bar{y}_i - \bar{y})^2$$
We can write this result as:
Total SSq = Within group SSq + Between SSq
where SSq stands for sum of squares.
Now, if there are very large differences between the group means as compared with the within-group variation, the between SSq is likely to be larger than the within-group SSq. If, on the other hand, all the group means are nearly equal while there is considerable variation within groups, the reverse will be true. The relative sizes of the between and within group SSq should therefore provide an opportunity to assess the variation between group means in comparison with that within groups.
The total sum of squares, as well as the sums of squares between and within groups, can be obtained by the following formulae:
Total Sum of Squares:
$$\sum_{i}\sum_{j}(y_{ij} - \bar{y})^2 = S - \frac{T^2}{N}$$
Within Sum of Squares:
For the ith group:
$$\sum_{j}(y_{ij} - \bar{y}_i)^2 = S_i - \frac{T_i^2}{n_i}$$
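Summing over the k groups, therefore:
$$\sum_{i}\sum_{j}(y_{ij} - \bar{y}_i)^2 = \left(S_1 - \frac{T_1^2}{n_1}\right) + \left(S_2 - \frac{T_2^2}{n_2}\right) + \dots + \left(S_k - \frac{T_k^2}{n_k}\right) = \sum_i S_i - \sum_i \frac{T_i^2}{n_i} = S - \sum_i \frac{T_i^2}{n_i}$$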
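Between Sum of Squares:
$$\sum_{i}\sum_{j}(\bar{y}_i - \bar{y})^2 = \text{Total SSq} - \text{Within group SSq} = \left(S - \frac{T^2}{N}\right) - \left(S - \sum_i \frac{T_i^2}{n_i}\right) = \sum_i \frac{T_i^2}{n_i} - \frac{T^2}{N}$$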
Summarizing the results, we have the following formulae for partitioning the total sum of squares:
Between groups:  $\sum_i T_i^2/n_i - T^2/N$
Within groups:   $S - \sum_i T_i^2/n_i$
Total:           $S - T^2/N$
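The computing formulae above can be sketched in Python as follows. This is only an illustrative sketch: the function name `anova_sums` and the three small groups are hypothetical, chosen so that the partition Total SSq = Within group SSq + Between SSq can be checked by hand.

```python
def anova_sums(groups):
    """Return (between, within, total) sums of squares for a list of groups."""
    n = [len(g) for g in groups]                   # n_i
    t = [sum(g) for g in groups]                   # T_i
    s = [sum(v * v for v in g) for g in groups]    # S_i
    big_n, big_t, big_s = sum(n), sum(t), sum(s)   # N, T, S
    correction = big_t ** 2 / big_n                # T^2 / N
    group_term = sum(ti ** 2 / ni for ti, ni in zip(t, n))  # sum of T_i^2 / n_i
    between = group_term - correction
    within = big_s - group_term
    total = big_s - correction
    return between, within, total


# Hypothetical data: three groups of unequal size.
groups = [[3, 5, 4], [6, 8, 7, 7], [10, 9]]
between, within, total = anova_sums(groups)
print(f"between = {between:.4f}, within = {within:.4f}, total = {total:.4f}")
```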
Testing for difference between means of more than two groups (i.e. k > 2):
Suppose that the ni observations in the ith group form a random sample from a population with mean μi and variance σ². As in the two-sample t-test, we assume that σ² is the same for all groups. To examine the evidence for differences between the μi we shall test the null hypothesis that the μi do not vary, being equal to some common value μ. There are three ways of estimating σ². These are as follows:
From total sum of squares: The whole collection of N observations may be regarded as a random sample of size N, and consequently:
$$S_T^2 = \frac{\text{Total SSq}}{N - 1}$$
is an estimate of σ².
From within group SSq: Separate unbiased estimates may be obtained for each group in turn:
$$s_i^2 = \frac{\sum_j (y_{ij} - \bar{y}_i)^2}{n_i - 1}$$
A combined estimate based purely on variation within groups may be derived by adding the numerators and denominators of these ratios to give the within group mean sum of squares (or MSSq):
$$S_W^2 = \frac{\text{Within group SSq}}{\sum_i (n_i - 1)} = \frac{\text{Within group SSq}}{N - k}$$
From between groups SSq: Since both S_T² and S_W² are unbiased estimates of σ² under the null hypothesis, a third unbiased estimate can be obtained by subtraction. It is the between groups mean square:
$$S_B^2 = \frac{\text{Between SSq}}{k - 1}$$
Thus we can form the analysis of variance table:
Source           df      Sum of squares           Mean sum of squares      F-ratio
Between groups   k – 1   B = Σ Ti²/ni – T²/N      S_B² = B/(k – 1)         S_B²/S_W²
Within groups    N – k   A – B = S – Σ Ti²/ni     S_W² = (A – B)/(N – k)
Total            N – 1   A = S – T²/N
The test for a difference between means depends largely on the F-test in the analysis of variance, with ν1 = (k – 1) and ν2 = (N – k) degrees of freedom.
If k = 2, the situation considered above is precisely that for which the unpaired (or two-sample) t test is appropriate. The variance ratio F will have 1 and N – 2 degrees of freedom, and t will have n1 + n2 – 2, i.e. (N – 2), degrees of freedom.
The value of F is equal to the square of the value of ‘t’. The distribution of F on 1 and N – 2 degrees of freedom is precisely the same as the distribution of the square of a variable following the ‘t’ distribution on N – 2 degrees of freedom.
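The equivalence F = t² for k = 2 can be checked numerically. The sketch below is an illustration only: the helper names `unpaired_t` and `f_two_groups` and the two small samples are hypothetical, and both functions use the pooled within-group variance on n1 + n2 – 2 degrees of freedom.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def within_ssq(x, y):
    """Sum of squares of each reading about its own group mean."""
    m1, m2 = mean(x), mean(y)
    return sum((v - m1) ** 2 for v in x) + sum((v - m2) ** 2 for v in y)


def unpaired_t(x, y):
    """Two-sample t with pooled variance on n1 + n2 - 2 degrees of freedom."""
    n1, n2 = len(x), len(y)
    s2 = within_ssq(x, y) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(s2 * (1 / n1 + 1 / n2))


def f_two_groups(x, y):
    """One-way ANOVA variance ratio for k = 2 groups (1 and N - 2 df)."""
    n1, n2 = len(x), len(y)
    grand = mean(list(x) + list(y))
    between = n1 * (mean(x) - grand) ** 2 + n2 * (mean(y) - grand) ** 2
    return (between / 1) / (within_ssq(x, y) / (n1 + n2 - 2))


# Hypothetical samples, for illustration only.
x = [5.1, 4.8, 6.0, 5.5, 5.2]
y = [6.3, 5.9, 6.8, 6.1]
t = unpaired_t(x, y)
f = f_two_groups(x, y)
print(f"t^2 = {t ** 2:.6f}, F = {f:.6f}")
```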
If k > 2 we may examine the difference between a particular pair of means, chosen because the contrast between these particular groups is of logical interest.
The standard error of the difference between two means, say ȳg and ȳh, may be estimated by:
$$SE(\bar{y}_g - \bar{y}_h) = \sqrt{s^2\left(\frac{1}{n_g} + \frac{1}{n_h}\right)}$$
where s² is the within group mean square, and the difference is tested by referring:
$$t = \frac{\bar{y}_g - \bar{y}_h}{SE(\bar{y}_g - \bar{y}_h)}$$
to the ‘t’ distribution with N – k degrees of freedom (since this is the number of degrees of freedom associated with the estimated variance s²). Confidence limits for the difference in means may be set in the usual way, using tabulated percentiles of ‘t’ on N – k degrees of freedom. The only function of the analysis of variance in this particular comparison has been to replace the estimate of variance on ng + nh – 2 degrees of freedom (which would be used in the two-sample ‘t’ test) by the within group mean square on N – k degrees of freedom.
Solved Example
Comparison of Several Means (ANOVA)
QUESTION: In a clinical trial, twenty patients undergoing operation were divided into four groups. Four different anaesthetic drugs were tested. The drugs were allotted at random to these groups. The blood pressure was recorded just after induction. The results of this trial were as follows:
Group 1   Group 2   Group 3   Group 4
179       178       172       181
138       175       135       186
134       112       135       180
198       165       182       171
103       186       150       178
Find the effect of the different drugs on blood pressure in patients.
SOLUTION:
Setting of hypothesis
Null hypothesis: There is no significant difference between the mean blood pressure values of the groups, i.e. H0: μ1 = μ2 = μ3 = μ4
Alternative hypothesis: H1: the μi are not all equal.
One-way analysis of variance:
                               Group 1    Group 2    Group 3    Group 4    All groups
Observations                   179        178        172        181
                               138        175        135        186
                               134        112        135        180
                               198        165        182        171
                               103        186        150        178
Total (Ti)                     752        816        774        896        T = 3238
Number of cases (ni)           5          5          5          5          N = 20
Mean (ȳi)                      150.4      163.2      154.8      179.2
Sum of squares (Si = Σ yij²)   118854     136674     121658     160682     S = 537868
Ti²/ni                         113100.8   133171.2   119815.2   160563.2
Sum of squares between groups = Σ Ti²/ni – T²/N = 526650.4 – 524232.2 = 2418.2
Total sum of squares = S – T²/N = 537868 – 524232.2 = 13635.8
Analysis of variance table:
Source                          Degrees of freedom   Sum of squares                 Mean sum of squares          F-value
Sum of squares between groups   4 – 1 = 3            2418.2                         S_B² = 2418.2/3 = 806.07     806.07/701.10 = 1.15
Error sum of squares            19 – 3 = 16          13635.8 – 2418.2 = 11217.6     S_W² = 11217.6/16 = 701.10
Total sum of squares            20 – 1 = 19          13635.8
The critical value of F (from the F table) at 3 and 16 degrees of freedom is Ftab = 3.24, which is more than the calculated value of F (F = 1.15, from the analysis of variance table). Hence we accept the null hypothesis, i.e. there is no significant difference between the mean blood pressure values in the four groups.
Conclusion: There is no significant difference between the blood pressure values just after induction with the different drugs. The four drugs have the same effect on the blood pressure of patients.
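As a hedged check of the arithmetic in this solved example, the sketch below recomputes the sums of squares and the F-ratio from the group totals Ti and sums of squares Si reported in the solution, rather than from the raw readings. The function name `anova_from_summaries` is hypothetical; the critical value 3.24 is the tabulated value quoted in the text.

```python
def anova_from_summaries(totals, sizes, sq_sums):
    """Return (between, within, total, F, (df1, df2)) from T_i, n_i and S_i."""
    k = len(totals)
    big_n, big_t, big_s = sum(sizes), sum(totals), sum(sq_sums)
    correction = big_t ** 2 / big_n                              # T^2 / N
    group_term = sum(t ** 2 / n for t, n in zip(totals, sizes))  # sum T_i^2 / n_i
    between = group_term - correction
    total = big_s - correction
    within = total - between
    df1, df2 = k - 1, big_n - k
    f = (between / df1) / (within / df2)
    return between, within, total, f, (df1, df2)


group_totals = [752, 816, 774, 896]                 # T_i from the solution table
group_sizes = [5, 5, 5, 5]                          # n_i
group_sq_sums = [118854, 136674, 121658, 160682]    # S_i from the solution table

between, within, total, f, df = anova_from_summaries(
    group_totals, group_sizes, group_sq_sums
)
F_TAB = 3.24  # tabulated critical value at (3, 16) df, as quoted in the text
print(f"between = {between:.1f}, within = {within:.1f}, total = {total:.1f}")
print(f"F = {f:.2f} on {df} df; significant: {f > F_TAB}")
```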
Comparison of mean values of blood pressure in Group 1 and Group 4 on the basis of the analysis of variance table:
Mean blood pressure of patients in Group 1 = 150.4
Mean blood pressure of patients in Group 4 = 179.2
Number of cases in each group = 5
Standard error of the difference = √[S_W² (1/5 + 1/5)] ≈ 16.75
t = (179.2 – 150.4)/16.75 ≈ 1.72
The critical value of ‘t’ at (N – k), i.e. 16 degrees of freedom, is 2.12, which is more than the calculated value of ‘t’. Hence, we accept the null hypothesis that there is no significant difference between the blood pressure values of Group 1 and Group 4.
Thus, by the use of the analysis of variance table, we can compare the mean values of two groups also.
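The pairwise comparison can be sketched in the same way. The snippet below is a minimal illustration, assuming the within group mean square is obtained as (S – Σ Ti²/ni)/(N – k) from the totals in the solution table; the function name `pairwise_t` is hypothetical.

```python
from math import sqrt


def pairwise_t(mean_g, mean_h, n_g, n_h, s2_within):
    """Return (standard error, t) for the difference between two group means."""
    se = sqrt(s2_within * (1 / n_g + 1 / n_h))
    return se, (mean_h - mean_g) / se


# Within group mean square from the solution table totals (assumed inputs):
# S = sum of S_i, group term = sum of T_i^2 / n_i, N - k = 20 - 4 = 16.
s_total = 118854 + 136674 + 121658 + 160682
group_term = (752 ** 2 + 816 ** 2 + 774 ** 2 + 896 ** 2) / 5
s2_within = (s_total - group_term) / 16

se, t = pairwise_t(150.4, 179.2, 5, 5, s2_within)
print(f"SE = {se:.2f}, t = {t:.2f} on 16 degrees of freedom")
```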
MULTIPLE CHOICE QUESTIONS
1. √(pq/n) indicates:
(a) Standard error of proportion
(b) Difference between proportion
(c) Standard error of mean
(d) Standard deviation from the mean
(AI, 93)
2. The number of degrees of freedom in a table of (4 × 4) is:
(a) 4
(b) 8
(c) 9
(d) 16
(AI,95)
3. Confidence limits is:
(a) Range and standard deviation
(b) Median and standard error
(c) Mean and standard error
(d) Mode and standard deviation
(AI,99)
4. All are true regarding Student’s t-test except:
(a) Standard error of mean is not estimated
(b) Standard population is selected
(c) Two samples are compared
(d) Student’s t- map (table) is required for calculation
(AI, 2000)
5. A community has a population of 10,000 individuals,
beta carotene was given to 6,000 individuals and the
remaining population was not given beta carotene.
After some time 3 in the first group developed lung
cancer and 2 in the second group also developed lung
cancer. The correct statement is:
(a) Beta carotene and lung cancer have no association
(b) The P-value is not significant
(c) The study is not designed properly
(d) Beta carotene is associated with lung cancer
(AI, 2001)
6. If the mean is 230 and the standard error is 10, the 95%
confidence limits would be:
(a) 210 to 250
(b) 220 to 240
(c) 225 to 235
(d) 230 to 210
(AI, 89)
7. Significant ‘p’ value is all except:
(a) 0.005
(b) 0.05
(c) 0.01
(d) 0.1
8. The mean BP of a group of persons was determined and, after an interventional trial, the mean BP was estimated again. The test to be applied to determine the significance of the intervention is:
(a) Chi-Square
(b) Paired ‘t’ test
(c) Correlation coefficient
(d) Mean deviation
(AIIMS, 95)
9. Which of the following is a pre-requisite for the Chi-Square test to compare:
(a) Both samples should be mutually exclusive
(b) Both sample need not be mutually exclusive
(c) Normal distribution
(d) All of the above
(UPSC 2000)
10. In a group of persons taking part in a controlled trial of an anti-hypertensive drug, the blood pressures were measured before and after giving the drug. Which of the following tests will you use for comparison:
(a) Paired t-test
(b) F test
(c) ’t’-test
(d) Chi-Square test
(AIIMS,2000, Dec 97)
11. About the test of significance between two large populations, one of the following statements is true:
(a) Null hypothesis states that two means are equal
(b) Standard error of difference is the sum of the
standard error of 2 means
(c) Standard error of means are equal
(d) Standard error of difference between population is
calculated
[Hint: The null hypothesis is usually the hypothesis of no difference, which is tested for possible rejection under the assumption that it is true. The denominator for the test of difference between two populations is the standard error of the difference of means or proportions, not the standard error of the difference between populations].
(AIIMS, Dec 98)
12. True about Chi-Square test is:
(a) Null hypothesis is equal
(b) Does not measure the significance
(c) Measures the significant difference between two
proportions
(d) Test correlation and regression
(AIIMS, June 99)
13. For 95% confidence limits true is:
(a) 1.95 of standard error of mean
(b) Reduces 95% of values
(c) 2.95 of standard error of mean
(d) Normal distribution + 2.5 SD
(AIIMS, June 95)
14. Standard error of mean indicates:
(a) Dispersion
(b) Distribution
(c) Variation
(d) Deviation
[Hint: Standard error is merely the standard deviation of some statistic calculated from a sample (in this case, the mean) in an indefinitely long series of repeated sampling].
(AIIMS, Nov. 99)
15. In a ‘p’ test p indicates the probability:
(a) Accepting null when it is false
(b) Accepting when it is true
(c) Rejecting null when it is true
(d) Rejecting null when it is false
[Hint: Level of significance is also the critical region]
(AIIMS,June 2000)
16. In a group of 100 children, the mean weight of a child is 15 kg. The standard error is 1.5 kg. Which one of the following is true:
(a) 95% of all children weigh between 12 and 18 kg
(b) 95% of all children weigh between 13.5 and 16.5
(c) 99% of all children weigh between 12 and 18
(d) 99% of all children weigh between 13.5 and 16.5
(AIIMS,May 2001)
17. A group tested for a drug shows 60% improvement as
against a standard group showing 40% improvement.
The best test to test the significance of result is:
(a) Student’s ‘t’ test
(b) Chi-Square test
(c) Paired ‘t’ test
(d) Test for variance
(AIIMS, Nov 2001)
18. A test was done to compare serum cholesterol levels in
obese and non-obese women. The test for significance
of difference is:
(a) Paired ‘t’ test
(b) Students ‘t’ test for independent variables
(c) Chi-Square test
(d) Fisher test
(AIIMS, Nov 2001)
19. Which of the following is a parametric test of
significance:
(a) U test
(b) ‘t’ test
(JIPMER, 2003)
20. For testing the statistical significance of the difference
in heights of school children among three
socioeconomic groups, the most appropriate statistical
test is :
(a) Student’s ‘t’ test
(b) Chi-Square test
(c) Paired ‘t’ test
(d) One way analysis of variance (one way ANOVA)
(AI, 2002)
21. In a study, variation in cholesterol was seen before and
after giving a drug. The test which would give its
significance is
(a) Unpaired ‘t’ test
(b) Paired ‘t’ test
(c) Chi-Square test
(d) Fisher’s test
(AI, 2002)
22. An investigator wants to study the association between
maternal intake of iron supplements (Yes/ No) and
birth weights (in gm) of newborn babies. He collects
relevant data from 100 pregnant women and their
newborns. What statistical test of hypothesis would you
advise for the investigator in this situation ?
(a) Chi-Square test
(b) Unpaired or independent t-test
(c) Analysis of variance
(d) Paired t-test
[Hint: The investigator classifies the pregnant women into two groups depending upon intake of iron supplements. Thus there are two independent groups and the mean birth weights of the babies can be compared].
(AIIMS, 2003)
23. A randomized trial comparing the efficacy of two drugs showed a difference between the two with a ‘p’ value of < 0.005. In reality, however, the two drugs do not differ. This is therefore an example of:
(a) Type-I error (α-error)
(b) Type-II error (β-error)
(c) 1 – α
(d) 1 – β
[Hint: Rejecting the null hypothesis when it is true is called a type-I error]
(AIIMS, 2002)
24. Rejecting the null hypothesis when it is actually true is known as:
(a) Type –I error
(b) Type II error
(c) Power
(d) Specificity
(AIIMS, 2004)
25. A randomized trial comparing the efficacy of two drugs showed a difference between the two (with a p value < 0.05). Assume that in reality, however, the two drugs do not differ. This is therefore an example of:
(a) Type I error (α error)
(b) Type II error (β error)
(c) 1 – α
(d) Power of Test.
(AIIMS, 2004)
26. The mean Hb level in healthy women is 13.5 g/dl and the standard deviation is 1.5 g/dl. What is the Z score for a woman with Hb level 15.0 g/dl:
(a) 9.0
(b) 10.0
(c) 2.0
(d) 1.0
(AIIMS, 2004)
Chapter 9
Non-parametric Tests
Non-parametric (NP) tests do not depend on the particular form of the basic frequency function from which the samples are drawn.
Non-parametric tests do not make any assumption regarding the form of the population.
Advantages of Non-parametric Tests
1. Non-parametric methods are very simple and easy to
apply.
2. No assumption is made about the form of frequency
function of the parent population from which the sample
is drawn.
3. NP tests can be applied to data which are mere classifications (i.e. which are measured on a nominal scale).
4. NP tests are available to deal with data which are given in ranks, or whose seemingly numerical scores have the strength of ranks (e.g. scores given as grades A–, A, A+, B, B+).
Disadvantages of Non-parametric Tests
1. NP tests can only be used if the measurements are
nominal or ordinal. If a parametric test exists it is more
powerful than NP tests.
Remarks
Since no assumption is made about the parent population, non-parametric methods are sometimes referred to as distribution-free methods.
These tests are based on the theory of order statistics. A sample x1, x2, ........., xn is an ordered sample if x1 < x2 < x3 < ......... < xn.
The whole structure of NP methods rests on a simple but fundamental property of order statistics.
Run Test
Suppose x1, x2, ............, xn1 is an ordered sample from one population and y1, y2, ............, yn2 is an independent ordered sample from another population. We want to test whether the samples have been drawn from the same population or from different populations.
Let us combine the two samples and arrange the observations in order of magnitude to give the combined ordered sample:
x1, x2 | y1, y2, y3 | x3, x4, x5 | y4, y5 | x6 ............
Run 1 (l = 2), Run 2 (l = 3), Run 3 (l = 3), Run 4 (l = 2), ............
Run: A run is defined as a sequence of one kind surrounded by sequences of the other kind, and the number of elements in a run is usually referred to as the length ‘l’ of the run.
If both samples came from the same population, there would be a thorough mingling of the xi and yj in the combined sample and the number of runs in the combined sample would be large. On the other hand, if the samples came from two different populations whose ranges do not overlap, there would be only two runs, of the type x1, x2, ............, xn1 and y1, y2, ............, yn2.
Generally, any difference in mean or variance would tend to reduce the number of runs. Thus the alternative hypothesis will entail too few runs.
Procedure: In order to test the null hypothesis that the samples have come from the same population, we count the number of runs ‘U’ in the combined ordered sample.
When n1 and n2 are large, then under the null hypothesis ‘U’ is asymptotically normal with
$$\text{Mean}(U) = \frac{2n_1n_2}{n_1 + n_2} + 1$$
and
$$\text{Variance}(U) = \frac{2n_1n_2\,(2n_1n_2 - n_1 - n_2)}{(n_1 + n_2)^2\,(n_1 + n_2 - 1)}$$
Thus we can use the normal test:
$$Z = \frac{U - \text{Mean}(U)}{\sqrt{\text{Variance}(U)}} \sim N(0, 1)$$
This approximation is fairly good if each of n1 and n2 is greater than 10. Since the alternative hypothesis is ‘too few runs’, the test is ordinarily one-tailed, with only negative values leading to rejection.
OTHER NON-PARAMETRIC TESTS
Median Test
The median test is a statistical procedure for testing whether two independent ordered samples differ in their central tendencies.
Let x1, x2, ........, xn1 and y1, y2, ........, yn2 be two independent ordered samples and z1, z2, ........, zn1+n2 be the combined ordered sample.
Let m1 be the number of x’s and m2 be the number of y’s exceeding the median value of the combined series.
                               Sample 1    Sample 2    Total
No. of observations > Median   m1          m2          m1 + m2
No. of observations < Median   n1 – m1     n2 – m2     (n1 + n2) – (m1 + m2)
Total                          n1          n2          n1 + n2
If the frequencies are small we can compute the exact probabilities. However, if the frequencies are large, we may use the χ2 test with 1 degree of freedom for testing H0 (the null hypothesis that the samples came from the same population). The approximation is fairly good if both n1 and n2 exceed 10.
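The construction of the 2 × 2 table and the χ2 statistic can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `median_test` and the two samples are hypothetical, observations equal to the combined median are counted as not exceeding it, and the usual 2 × 2 χ2 formula is used without a continuity correction.

```python
from statistics import median


def median_test(x, y):
    """Return the 2 x 2 median-test table (a, b, c, d) and chi-square (1 df).

    a, b: numbers of x's and y's exceeding the combined median (m1, m2)
    c, d: numbers of x's and y's not exceeding it (n1 - m1, n2 - m2)
    """
    med = median(list(x) + list(y))
    a = sum(1 for v in x if v > med)
    b = sum(1 for v in y if v > med)
    c, d = len(x) - a, len(y) - b
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return (a, b, c, d), chi2


# Hypothetical samples of size 12 each, for illustration only.
x = list(range(1, 13))    # 1, 2, ..., 12
y = list(range(7, 19))    # 7, 8, ..., 18
table, chi2 = median_test(x, y)
print(table, round(chi2, 2))
```

The calculated value is then compared with the tabulated χ2 value for 1 degree of freedom.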
Sign Test
Sign test is used under the following conditions:
(a) In any given pair of observations, two things are being compared.
(b) For any pair, each of the two observations is made under similar extraneous conditions.
(c) Different pairs are observed under different conditions.
The third condition (condition ‘c’) implies that the di = (xi – yi), i = 1, 2, 3, ........, n, have different variances and thus renders the paired ‘t’ test invalid, which would otherwise have been used unless there was obvious non-normality.
Sign test is based on the sign (plus or minus) of the
deviation di = (xi – yi). No assumptions are made regarding
the parent population. The only assumptions are:
(1) Measurements are such that the deviations di = (xi – yi)
can be expressed in term of positive or negative.
(2) Variables have continuous distribution.
(3) di’s are independent.
Different pairs (xi, yi) may be from different populations (say with respect to age, weight, stature, education). The only requirement is that within each pair there is matching with respect to relevant extraneous factors.
Procedure:
Let (xi, yi), i = 1, 2, 3, ........, n be n paired observations drawn from the two populations. The null hypothesis is that the two populations are equal. Find the difference between each pair of observations, i.e. di = xi – yi.
Let us define Ui such that Ui = 1 if xi > yi (i.e. positive sign) and Ui = 0 if xi < yi (i.e. negative sign).
Since the Ui, i = 1, 2, 3, ........, n, are independent, the total number of positive signs
$$U = \sum_{i=1}^{n} U_i$$
follows, under the null hypothesis, a binomial distribution with parameters n and p = 1/2.
For large samples (n > 30), we may regard U as asymptotically normal (under the null hypothesis) with mean and variance equal to:
Mean of U = n/2 and Variance of U = n/4
Thus,
$$Z = \frac{U - n/2}{\sqrt{n/4}} \sim N(0, 1)$$
and we may use the Normal test.
Mann-Whitney Wilcoxon ‘U’ Test
This is the most widely used non-parametric test for two samples when we do not make any assumption about the parent population.
Let x1, x2, ........, xn1 and y1, y2, ........, yn2 be two independent ordered samples of size n1 and n2.
The Mann-Whitney test is based on the pattern of the x’s and y’s in the combined ordered sample:
x1, x2, y1, y2, y3, x3, x4, x5, y4, y5, x6 ........
Let ‘T’ denote the sum of the ranks of the y’s in the combined sample. The ranks of the y’s in the combined sample are: 3, 4, 5, 9, 10, ........
Then T = 3 + 4 + 5 + 9 + 10 + ........ and
$$U = n_1n_2 + \frac{n_2(n_2 + 1)}{2} - T$$
If ‘T’ is significantly large or small then H0 will be rejected.
It has been established that under the null hypothesis U is asymptotically normally distributed as N(μ, σ²), where
$$\mu = \frac{n_1n_2}{2} \quad\text{and}\quad \sigma^2 = \frac{n_1n_2(n_1 + n_2 + 1)}{12}$$
Hence
$$Z = \frac{U - \mu}{\sigma} \sim N(0, 1)$$
A normal test can be used if both n1 and n2 are greater than 8.
Solved Example
Run Test
QUESTION: For the given sets of data drawn from two populations, apply the Run test to test the hypothesis that the samples are drawn from populations with the same distribution function:
xi: 15 77 01 65 69 69 58 40 81 16 20 20 00 84 22
yj: 28 26 46 66 36 86 66 17 43 49 85 40 51 40 10
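SOLUTION:
Setting the Hypothesis
Null hypothesis: The two populations have the same distribution function. H0: f1(.) = f2(.)
Alternative hypothesis: H1: f1(.) ≠ f2(.)
The Test Statistic:
$$Z = \frac{U - \text{Mean}(U)}{\sqrt{\text{Variance}(U)}}$$
where
$$\text{Mean}(U) = \frac{2n_1n_2}{n_1 + n_2} + 1 \quad\text{and}\quad \text{Variance}(U) = \frac{2n_1n_2\,(2n_1n_2 - n_1 - n_2)}{(n_1 + n_2)^2\,(n_1 + n_2 - 1)}$$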
Calculate the number of runs in the combined ordered series. For this, first arrange xi and yj in ascending order:
S.No.  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
xi     00 01 15 16 16 20 22 40 58 65 69 69 77 81 84
yj     10 17 26 28 36 40 40 43 46 49 51 66 66 85 86
Combine the two series in ordered form in terms of xi and yj:
Run 1:  x1, x2
Run 2:  y1
Run 3:  x3, x4, x5
Run 4:  y2
Run 5:  x6, x7
Run 6:  y3, y4, y5
Run 7:  x8
Run 8:  y6, y7, y8, y9, y10, y11
Run 9:  x9, x10
Run 10: y12, y13
Run 11: x11, x12, x13, x14, x15
Run 12: y14, y15
Thus, we can see that in the combined series there are 12 runs (sequences of one kind of series). Therefore U = 12 (total number of runs).
The mean and variance of U:
$$\text{Mean}(U) = \frac{2 \times 15 \times 15}{15 + 15} + 1 = 15 + 1 = 16$$
and
$$\text{Variance}(U) = \frac{2 \times 15 \times 15\,(2 \times 15 \times 15 - 15 - 15)}{(15 + 15)^2\,(15 + 15 - 1)} = \frac{450\,(450 - 30)}{30^2 \times 29} = \frac{450 \times 420}{900 \times 29} = \frac{189000}{26100} = 7.24$$
Thus the test statistic Z is
$$Z = \frac{U - \text{Mean}(U)}{\sqrt{\text{Variance}(U)}} = \frac{12 - 16}{\sqrt{7.24}} = \frac{-4}{2.69} = -1.49$$
The tabulated value of Z is more than the absolute value of the calculated Z (i.e. |Z| = 1.49). Hence, we accept the null hypothesis that the distributions of the two populations are the same.
Conclusion: The distributions of the two populations from which the two samples are drawn are the same.
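The run count and the normal approximation for this example can be reproduced with a short Python sketch. The function names are hypothetical, and one assumption is made explicit: when an x and a y are tied (here at the value 40), the x is placed first in the combined ordering, as in the worked solution.

```python
from math import sqrt


def count_runs(x, y):
    """Count runs of x's and y's in the combined ordered sample.

    Ties between the two samples are broken by placing x before y (assumption).
    """
    labelled = [(v, 0) for v in x] + [(v, 1) for v in y]   # 0 = x, 1 = y
    labelled.sort()
    labels = [label for _, label in labelled]
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def run_test(x, y):
    """Return (U, Mean(U), Variance(U), Z) for the run test."""
    n1, n2 = len(x), len(y)
    u = count_runs(x, y)
    mean_u = 2 * n1 * n2 / (n1 + n2) + 1
    var_u = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / (
        (n1 + n2) ** 2 * (n1 + n2 - 1)
    )
    return u, mean_u, var_u, (u - mean_u) / sqrt(var_u)


x = [15, 77, 1, 65, 69, 69, 58, 40, 81, 16, 20, 20, 0, 84, 22]
y = [28, 26, 46, 66, 36, 86, 66, 17, 43, 49, 85, 40, 51, 40, 10]
u, mean_u, var_u, z = run_test(x, y)
print(f"U = {u}, mean = {mean_u}, variance = {var_u:.2f}, Z = {z:.2f}")
```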
Sign Test
QUESTION: In the above example, let (xi, yi) be pairs of observations drawn from the two populations. Apply the sign test to find out whether the distributions of the two populations are equal:
xi: 15 77 01 65 69 69 58 40 81 16 20 20 00 84 22
yi: 28 26 46 66 36 86 66 17 43 49 85 40 51 40 10
SOLUTION:
Setting of Hypothesis
Null hypothesis: The two populations have the same distribution function. H0: f1(.) = f2(.)
Alternative hypothesis: H1: f1(.) ≠ f2(.)
The Test Statistic is
$$Z = \frac{U - n/2}{\sqrt{n/4}}$$
S.No.            1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
xi               15 77 01 65 69 69 58 40 81 16 20 20 00 84 22
yi               28 26 46 66 36 86 66 17 43 49 85 40 51 40 10
di = (xi – yi)   –  +  –  –  +  –  –  +  +  –  –  –  –  +  +
Ui = 1 if xi > yi (i.e. positive sign) and 0 if xi < yi (i.e. negative sign)
U = Σ Ui = 6 (There are in total 6 pairs in which xi > yi).
Thus the test statistic Z is:
$$Z = \frac{6 - 15/2}{\sqrt{15/4}} = \frac{-1.5}{1.94} = -0.77$$
The tabulated value of Z is more than the absolute value of the calculated Z. Hence, we accept the null hypothesis, i.e. the distribution functions of the two populations are the same.
Conclusion: The two samples are drawn from the same population.
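A minimal sketch of the sign test calculation for these paired data is given below. The function name `sign_test` is hypothetical, and the rule of dropping tied pairs (xi = yi) from n is an assumption; no ties occur in this example.

```python
from math import sqrt


def sign_test(x, y):
    """Return (U, n, Z): positive signs, usable pairs and the normal deviate."""
    u = sum(1 for a, b in zip(x, y) if a > b)    # number of positive signs
    n = sum(1 for a, b in zip(x, y) if a != b)   # tied pairs dropped (assumption)
    z = (u - n / 2) / sqrt(n / 4)
    return u, n, z


x = [15, 77, 1, 65, 69, 69, 58, 40, 81, 16, 20, 20, 0, 84, 22]
y = [28, 26, 46, 66, 36, 86, 66, 17, 43, 49, 85, 40, 51, 40, 10]
u, n, z = sign_test(x, y)
print(f"U = {u}, n = {n}, Z = {z:.2f}")
```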
Mann-Whitney U Test
QUESTION: For the same set of data, apply the Mann-Whitney U test to compare the distribution functions of the populations.
The combined observations of the two series are arranged in ascending order (as in the Run Test):
Rank     1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
Series   x1  x2  y1  x3  x4  x5  y2  x6  x7  y3  y4  y5  x8  y6  y7
Rank     16  17  18  19  20  21  22  23  24  25  26  27  28  29  30
Series   y8  y9  y10 y11 x9  x10 y12 y13 x11 x12 x13 x14 x15 y14 y15
T (the sum of the ranks of the y’s in the combined ordered series) is calculated from the above table:
T = 3 + 7 + 10 + 11 + 12 + 14 + 15 + 16 + 17 + 18 + 19 + 22 + 23 + 29 + 30 = 246
$$U = n_1n_2 + \frac{n_2(n_2 + 1)}{2} - T = 225 + \frac{15\,(15 + 1)}{2} - 246 = 225 + 120 - 246 = 99$$
The mean and variance of ‘U’ are:
$$\text{Mean}(U) = \frac{n_1n_2}{2} = \frac{15 \times 15}{2} = 112.5$$
$$\text{Variance}(U) = \frac{n_1n_2(n_1 + n_2 + 1)}{12} = \frac{15 \times 15\,(15 + 15 + 1)}{12} = \frac{225 \times 31}{12} = 581.25$$
Thus, the test statistic Z is
$$Z = \frac{99 - 112.5}{\sqrt{581.25}} = \frac{-13.5}{24.11} = -0.56$$
The tabulated value of Z is more than the absolute value of the calculated Z. Hence, we accept the null hypothesis, i.e. the two samples are drawn from the same population.
Conclusion: The distribution functions of the populations from which the two samples are drawn are the same.
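The calculation of U and Z from the rank sum T can be sketched as below. The function names are hypothetical. The helper `rank_sum_of_y` assumes there are no ties between the samples and is demonstrated on made-up values that reproduce the pattern x1, x2, y1, y2, y3, x3, x4, x5, y4, y5, x6; the Z value is computed from the rank sum T = 246 obtained in the solved example.

```python
from math import sqrt


def rank_sum_of_y(x, y):
    """Sum of the ranks of the y's in the combined ordered sample (no ties assumed)."""
    combined = sorted([(v, "x") for v in x] + [(v, "y") for v in y])
    return sum(
        rank for rank, (_, label) in enumerate(combined, start=1) if label == "y"
    )


def mann_whitney(t, n1, n2):
    """Return (U, Z) from the rank sum T of the y's."""
    u = n1 * n2 + n2 * (n2 + 1) / 2 - t
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 * (n1 + n2 + 1) / 12
    return u, (u - mean_u) / sqrt(var_u)


# Hypothetical values arranged as x1, x2, y1, y2, y3, x3, x4, x5, y4, y5, x6.
pattern_t = rank_sum_of_y([1, 2, 6, 7, 8, 11], [3, 4, 5, 9, 10])

# Rank sum from the solved example (T = 246, n1 = n2 = 15).
u, z = mann_whitney(246, 15, 15)
print(f"pattern T = {pattern_t}, U = {u}, Z = {z:.2f}")
```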
MULTIPLE CHOICE QUESTIONS
1. Statistical tests that are non-parametric include:
(a) Regression
(b) Correlation
(c) The student’s test
(d) Rank correlation
(e) Wilcoxon rank sum test
(PGI, 80, AIIMS 80)
2. If the distribution of population is not known which of
the following test will be used:
(a) F-test
(b) Students ‘t’ test
(c) ANOVA
(d) Sign test
3. For large sample sizes in the Mann-Whitney U test, the test statistic “U” is normally distributed as:
(a) N (μ, 1)
(b) N (μ, σ2)
(c) N (0, σ2)
(d) N (0, 1)
Chapter 10
Statistical Methods in Epidemiology
Epidemiology is the study of the distribution and determinants of health-related states or events in a specified population.
Epidemiology is by definition concerned with problems affecting groups of individuals rather than single subjects.
In broad terms, epidemiology is concerned with the distribution of disease, both chronic diseases and the communicable diseases which give rise to epidemics of the classical sort.
Some important terms used in epidemiological studies:
Baseline: Health state (disease severity, confounding
condition) of individuals at the beginning of a prospective
study. A difference (asymmetry) in the distribution of baseline
values between groups will bias the results.
Blinding (Masking): Blinding is a method to reduce bias
by preventing observers and/or experimental subjects
involved in any analytic study from knowing the hypothesis
being investigated, the case control classification, the
assignment of individuals or groups, or the different treatment
being provided. Blinding reduces bias by preserving symmetry
in the observer’s measurements and assessment. This bias is
usually not due to deliberate deception but due to human
nature and prior held belief about the area of study.
Placebo: A placebo is the dummy treatment used in a control group in place of the actual treatment. If a drug is being evaluated, the inactive carrier is used without the active drug, so that the placebo is as similar as possible in appearance and in administration to the active drug. Placebos are used to blind observers and, in human trials, the patients as to which group a patient is allocated.
Case definition: The set of history, clinical signs and laboratory findings that are used to classify an individual as a case or not for an epidemiological study. Case definitions are needed to exclude individuals with other conditions that occur at an endemic background rate in a population, or with other characteristics that will confuse or reduce the precision of a clinical trial.
Cohort: A group of individuals identified on the basis of
a common experience or characteristic that is usually
monitored over time from the point of assembly.
Experimental unit: In an experiment, the experimental units are the units that are randomly selected or allocated to a treatment. They are the units on which the sample size calculation and the subsequent data analysis must be based.
Prospective study (Data): Data collection and the events of interest occur after individuals are enrolled (e.g. clinical trials or cohort studies). Prospective collection enables the use of more solid, consistent criteria and avoids the potential biases of retrospective recall. Prospective studies are limited to conditions that occur relatively frequently and to studies with relatively short follow-up periods, so that a sufficient number of eligible individuals can be enrolled and followed within a reasonable period.
Retrospective study (Data): All events of interest have
already occurred and data are generated from historical
records (secondary data) or from recall (which may result in
the presence of significant recall bias). Retrospective data is
relatively inexpensive compared to prospective studies
because of the use of available information and is typically
used in case-control studies. Retrospective studies of rare
conditions are much more efficient than prospective studies.
Basic Measures of Epidemiology
Measurements in epidemiology include the following:
1. Measurement of mortality, morbidity, etc.
2. Measurement of the presence or absence or distribution
of the characteristic or attributes of the disease.
3. Measurements of demographic variables.
4. Measurement of the presence, absence or distribution of
the environmental and other factors suspected of causing
the disease.
Parameters of Measurements
Epidemiologists usually express disease magnitude as a rate, ratio or proportion. These three are the basic parameters of measurement in epidemiology.
Rate: A rate measures the occurrence of some particular event (occurrence of death or disease) in a population during a given period of time. The rate is usually expressed per thousand.
For example:
$$\text{Death rate} = \frac{\text{Total number of deaths in a year}}{\text{Mid-year population}} \times 1000$$
The rates can be broadly classified as
(1) Crude rate.
(2) Specific rates.
(3) Standardized rates.
Ratio: In a ratio the numerator is not a component of the denominator. The numerator and denominator may involve an interval of time or may be instantaneous in time.
For example:
In the sex ratio (Male : Female), the numerator is the number of males in the population during a given period, and the denominator is the number of females in the population during the same period. If the number of males = ‘a’ and the number of females = ‘b’, then
$$\text{Ratio} = \frac{a}{b}$$
Thus we can see that the numerator is not a component of the denominator.
Proportion: A proportion is a ratio which indicates the relation in magnitude of a part to the whole. The numerator is always included in the denominator. A proportion is usually expressed as a percentage.
In the above example, the proportion of males in the population is:
$$\text{Proportion of males in the population} = \frac{a}{a + b} \times 100$$
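The three measures can be illustrated with a short Python sketch. The function names and all counts used below are hypothetical and are given only to show how the numerator and denominator differ between a rate, a ratio and a proportion.

```python
def rate_per_1000(events, mid_year_population):
    """Rate: events in a period per 1000 of the mid-year population."""
    return events / mid_year_population * 1000


def ratio(a, b):
    """Ratio: the numerator is not part of the denominator."""
    return a / b


def proportion_percent(part, whole):
    """Proportion: the numerator is included in the denominator."""
    return part / whole * 100


# Hypothetical counts, for illustration only.
deaths, population = 850, 100_000
males, females = 520, 480

death_rate = rate_per_1000(deaths, population)
sex_ratio = ratio(males, females)
male_share = proportion_percent(males, males + females)
print(f"death rate = {death_rate:.1f} per 1000")
print(f"sex ratio (M:F) = {sex_ratio:.3f}")
print(f"proportion of males = {male_share:.1f}%")
```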
Numerator: Numerator refers to the number of times an event
occurs. The numerator is a component of the denominator in
calculating rates but not in a ratio.
Denominator: The literal meaning of denominator is the
number below the line in a fraction. In epidemiology we
generally use three types of denominator.
Mid-year population: While calculating rates (death, birth)
the denominator comprises the mid-year population.
Because the population size changes daily due to births,
deaths and migration, we use the mid-year population
as the denominator for calculating rates. The mid-year
population refers to the population estimated as on 1st July of the year.
Population at risk: For calculating morbidity statistics the
population exposed to risk is used as the denominator. The term
is applied to all those to whom an event could have happened,
whether it did or not.
For example: While calculating the general fertility rate, the
women of the reproductive age group (15-49 years) are taken as the
denominator, because women < 15 years and > 49 years of
age generally do not give birth and are therefore not
exposed to risk.
Related to events: In some situations the denominator may
be related to total events instead of total population.
For example: While calculating the maternal mortality rate
the denominator will be the number of live births.
Measurements of Mortality
The measures of mortality are the Crude Death Rate, Age Specific
Death Rates and Standardized Death Rates, which will be
discussed in detail later.
Measurements of Morbidity
Morbidity is defined as ‘any departure, subjective or objective
from the state of physiological well-being’.
Morbidity can be measured in terms of three units:
(a) Persons who were ill.
(b) The illnesses (periods or spells) that these persons
experienced.
(c) The duration (days or weeks, etc.) of illness.
Disease is frequently measured by incidence and
prevalence rates (although prevalence is referred to as a rate, it is
actually a ratio).
Incidence rate (Person): The number of new cases occurring
in a defined population during a specified period of time.
$$\text{Incidence rate (Persons)} = \frac{\text{No. of new cases of a specified disease during a given period of time}}{\text{Population at risk during that period}} \times 1000$$
Example: If there are 1,000 new cases of illness in a
population of 50,000 in a year then incidence rate is:
$$\text{Incidence rate} = \frac{1000}{50{,}000} \times 1000 = 20 \text{ per thousand per year}$$
An incidence rate must include the unit of time; the incidence of
disease in the above example is 20 per 1000 per year.
Features of incidence rate:
1. Only new cases
2. During a given period of time
3. In a specified population (population at risk)
4. Unit of time should be mentioned.
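The worked example above can be checked with a few lines of Python. This is a minimal sketch; the function name and its `per` parameter are assumptions introduced here for illustration.

```python
def incidence_rate(new_cases, population_at_risk, per=1000):
    """New cases per `per` persons at risk during the stated period."""
    if population_at_risk <= 0:
        raise ValueError("population at risk must be positive")
    return new_cases * per / population_at_risk


# Figures from the example in the text: 1,000 new cases among 50,000 in a year.
rate = incidence_rate(1000, 50_000)
print(f"{rate:g} per 1000 per year")  # prints "20 per 1000 per year"
```

The unit of time is carried in the printed label rather than in the number itself, which is why it must always be stated alongside the rate.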
Incidence rate (Spells): The number of new spells of illness
in a defined population during a specified time.
$$\text{Incidence rate (Spells)} = \frac{\text{No. of spells of sickness starting in a defined period of time}}{\text{Mean number of persons exposed to risk in that period}} \times 1000$$
Incidence measures the rate at which new cases are
occurring in a population. It is not influenced by the duration
of disease.
Use of incidence rate: Incidence rates are useful in
determining the causality of diseases.
The incidence rate is useful for taking action
(a) To control disease.
(b) To study the distribution of disease and the efficacy of
preventive and therapeutic measures.
If the incidence rate is increasing, it might indicate failure
or ineffectiveness of the current control programme and
a need for a new disease control programme.
Prevalence: The total number of all individuals who have
an attribute or disease at a particular time (or during a
particular period) divided by the population at risk of having
the attribute or disease at this point of time or midway through
the period.
Prevalence refers specifically to all current cases (old and
new) existing at a given point of time, or over a given period
of time, in a given population.
Prevalence is of two types:
(1) Point prevalence.
(2) Period prevalence.
Point prevalence: Point prevalence of a disease is a
measure of all cases (old and new) of a disease at one point of
time in relation to a defined population.
$$\text{Point prevalence} = \frac{\text{No. of all current cases (old and new) of a specified disease existing at a given point of time}}{\text{Estimated population at the same point of time}} \times 100$$
In point prevalence the ‘point’ may be a day, several days or even
a few weeks, depending upon the time it takes to examine the
population.
Period prevalence: It includes cases arising before but
existing into or through the year as well as those cases arising
during the year.
Period prevalence is thus a combination of point prevalence and
incidence.
$$\text{Period prevalence} = \frac{\text{No. of existing cases (old and new) of a specified disease during a given period of time}}{\text{Estimated mid-interval population at risk}} \times 100$$
Incidence and prevalence can best be explained by the following
figure:
Figure 10.1
From the above figure, the number of new cases in the given
period (January 2000 to December 2000) is 3 (cases 2, 5 and 8).
Therefore, for incidence, the number of new cases will be 3.
For point prevalence at January 2000, three cases will be
included (cases 3, 6 and 7), while for point prevalence at
December 2000, 2 cases will be included (cases 5 and 8).
For period prevalence (during the period from January 2000
to December 2000) 6 cases will be included (cases 2, 3, 5, 6,
7 and 8; cases 2, 5 and 8 are new cases and 3, 6 and 7 are old
cases). Cases 1 and 4 are excluded because these two cases
fell outside the given period.
Use of Prevalence
Prevalence helps to estimate the magnitude of health/disease
problems in the community and to identify potential high risk
populations.
Prevalence data provide an indication of the extent of a
condition and may have implications for the provision of
services needed in a community.
Prevalence is especially useful for administrative and
planning purposes.
Both measures of prevalence are proportions; as such
they are dimensionless and should not be described as rates
(Friis and Sellers, 1999).
• Friis RH, Sellers TA. Epidemiology for Public Health
Practice. 2nd ed. Aspen Publishers, Inc.; 1999.
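$$\text{Incidence} = \frac{\text{No. of new cases*}}{\text{Population at risk*}}$$
$$\text{Prevalence} = \frac{\text{No. of all cases (old and new)*}}{\text{Population at risk*}}$$
* During specified time period
Remember, incidence means NEW. Prevalence means ALL.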
Relation between Incidence and Prevalence
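If the population is stable and incidence and duration are
unchanging, then
$$\text{Prevalence} = \text{Incidence} \times \text{Duration}$$
or
$$\text{Incidence} = \frac{\text{Prevalence}}{\text{Duration}}$$
and
$$\text{Duration} = \frac{\text{Prevalence}}{\text{Incidence}}$$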
From the above relation we can say that the longer the
duration of disease, the higher the prevalence will be in relation
to incidence.
If the disease is acute and of short duration (either because
of rapid recovery or death), the prevalence will be relatively
low as compared to incidence.
A decrease in prevalence may result not only from a
decrease in incidence but also from a decrease in the duration of
illness, through either more rapid recovery or more rapid death.
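The relation between incidence, duration and prevalence can be tried out numerically. The sketch below assumes a stable population, as the relation requires; the function names and the figures are hypothetical and serve only as illustration.

```python
def prevalence(incidence, duration):
    """Prevalence = incidence x mean duration (stable population assumed)."""
    return incidence * duration


def mean_duration(prevalence_value, incidence):
    """Duration = prevalence / incidence."""
    return prevalence_value / incidence


# Hypothetical: the same incidence (10 per 1000 per year) for two diseases.
chronic = prevalence(10, 5)    # mean duration 5 years
acute = prevalence(10, 0.5)    # mean duration half a year
print(chronic, acute)          # 50 per 1000 versus 5.0 per 1000
print(mean_duration(chronic, 10))
```

With identical incidence, the chronic condition shows ten times the prevalence of the acute one, purely because of its longer duration.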
Epidemiological Studies
Epidemiological studies can be classified as observational
studies and experimental studies.
Observational studies are further divided into
descriptive studies and analytical studies, while
experimental studies are divided into randomized
controlled trials, field trials and community trials.
Observational Studies
In observational studies the allocation or assignment of factors
is not under the control of the investigator. In an observational study,
the combinations are self-selected or are ‘experiments of
nature’. Observational studies provide weaker empirical
evidence because of the potential for large confounding biases
to be present where there is an unknown association between
a factor and an outcome.
The greatest value of these types of studies is that they
provide preliminary evidence that can be used as the basis
for hypotheses in stronger experimental studies.
Descriptive studies: The objective of descriptive studies
is to describe the distribution of variables in a group. Statistics
serve only to describe the precision of those measurements or
to make statistical inferences about the values in the
population from which the sample is drawn. Such studies
ask questions about:
(a) When the disease is occurring: time distribution.
(b) Where it is occurring: place distribution.
(c) Who is getting the disease: person distribution.
Measurement of morbidity in descriptive studies:
Measurement of morbidity has two aspects – Incidence and
Prevalence. Incidence can be obtained from longitudinal
studies and prevalence from cross-sectional studies. Besides
case series and case reports, descriptive studies may use
cross-sectional and longitudinal studies to obtain estimates
of the health and disease problems of the population.
Case series: A descriptive, observational study of a series
of cases, typically describing the manifestations, clinical
course and prognosis of a condition. A case series provides
weak empirical evidence because of the lack of comparability,
unless the findings are dramatically different from
expectations. Case series are best used as a source of
hypotheses for investigation by stronger study designs.
Unfortunately, the case series is the most commonly used
study type in clinical research.
Case report: A description of a single case, typically
describing the manifestations, clinical course and prognosis
of that case. Due to the wide range of natural biologic variation
in these aspects, a single case report provides little empirical
evidence to the clinician. Case reports do describe how others
diagnosed and treated the condition and what the clinical
outcome was.
Longitudinal studies (Incidence Study): Longitudinal
studies are those studies in which the observations are
repeated in the same population over a prolonged period of
time by means of follow-up examinations. Longitudinal
studies are useful for (a) identifying the risk factors of disease
and (b) finding out the incidence rate, or rate of occurrence
of new cases, of the disease in the community.
Cross-sectional studies (Prevalence Study): A
descriptive study of the relationship between disease and
other factors at one point of time (usually) in a defined
population. Cross-sectional studies lack any information on
timing of exposure and outcome relationship and include
only prevalent cases. Cross-sectional studies are more useful
for chronic than for short-lived diseases. This type of study
tells us about the distribution of a disease in a population rather
than its aetiology.
Analytical studies: In analytical studies, the subject of
interest is the individual within the population. The object is
not to formulate but to test hypotheses. Although individuals
are evaluated in analytical studies, the inference is not to the
individual but to the population from which they are selected.
Measurement of morbidity in analytical studies: Analytical
studies comprise two distinct types of observational studies:
(a) cohort studies and (b) case control studies. From
these studies we can determine (1) whether or not a statistical
association exists between a disease and a suspected factor
and (2) if it exists, the strength of the association.
Cohort study: A prospective, analytical, observational
study, based on data, usually primary, from a follow-up
period of a group in which some have had, have or will have
the exposure of interest, carried out to determine the association
between the exposure and an outcome.
‘Cohort’ is defined as a group of people who share a
common characteristic or experience within a defined period.
In a cohort study a population of individuals is selected,
usually by geographical or occupational criteria rather than
on medical grounds. The population is classified by the factor
or factors of interest and followed prospectively in time so
that the rates of occurrence of various manifestations of disease
can be observed and related to the classification by aetiological
factors.
Because of their prospective nature, cohort studies are
stronger than case-control studies when well executed but
they are more expensive.
Case control study: A retrospective, analytical,
observational study, often based on secondary data, in which
the proportion of cases with a potential risk factor is
compared to the proportion of controls (individuals without
the disease) with the same risk factor.
The method is appropriate when the classification by the
disease is simple (i.e. presence or absence of a specific
condition). A further advantage is that, by means of the
retrospective enquiry, the relevant information can be
obtained comparatively quickly.
A central problem in a case control study is the method
by which the controls are chosen. Ideally, they should be on
average similar to the cases in all respects except in the medical
condition under study and in associated aetiological factors.
These studies are commonly used for initial, inexpensive
evaluation of risk factors with long induction periods.
Unfortunately, due to the potential for many forms of bias
in this study type, case control studies provide relatively weak
empirical evidence even when properly executed.
Case control studies are often called retrospective studies
while cohort studies are called prospective studies.
Experimental Studies
The hallmark of the experimental study is that the allocation
or assignment of individuals is under the control of the investigator
and thus can be randomized. The key is that the investigator
controls the assignment of the exposure or the treatment, but
otherwise symmetry of potential unknown confounders is
maintained through randomization. Properly executed
experimental studies provide the strongest empirical
evidence. The randomization also provides a better
foundation for statistical procedures than do the observational
studies.
The following are some important types of randomized controlled
study:
Randomized controlled clinical trial (RCT): A
prospective, analytical, experimental study using primary
data generated in the clinical environment. Individuals similar
at the beginning are randomly allocated to two or more
treatment groups and the outcomes of the groups are compared after
sufficient follow-up time.
Properly executed, the RCT provides the strongest evidence of
the clinical efficacy of preventive and therapeutic procedures
in the clinical setting.
Randomized cross-over clinical trial: A prospective,
analytical, experimental study using primary data generated
in the clinical environment. Individuals with a chronic
condition are randomly allocated to one of two treatment
groups and, after a sufficient treatment period and often a
washout period, are switched to the other treatment for the same
period.
In this type of study design each patient serves as his
own control. The patients are randomly assigned to a study
group and a control group. The study group receives the treatment
under consideration. The control group receives some
alternative form of active treatment or placebo. The two groups
are observed over a period of time. The patients in each group are
then taken off their medication or placebo to allow for possible
elimination of the medication from the body and for the
possibility of any ‘carry-over’ effects. After this period the two
groups are switched. Those who received the treatment under
study are changed to the control group therapy or placebo, and
vice versa.
Cross-over studies have the advantage that, during the
course of the investigation, all patients will receive the new therapy.
But this design is susceptible to bias if carry-over effects of the
first treatment occur.
Randomized controlled laboratory study: A prospective,
analytical, experimental study using primary data generated
in the laboratory environment. Laboratory studies are very
powerful tools for doing basic research because all extraneous
factors other than those of interest can be controlled or
accounted for (e.g. age, gender, genetics, nutrition,
environment, etc.). However, this control of other factors is
also the weakness of this type of study.
If any interaction occurs between these factors and the
outcome of interest, which is usually the case, the laboratory
results are not directly applicable to the clinical setting unless
the impact of these interactions is also investigated.
Bias in Studies
Systematic Error
Almost all studies have bias, but to varying degrees. Bias can
be reduced only by proper study design and execution and
not by increasing the sample size (which increases the
precision by reducing the opportunity for a random chance
deviation from the truth). The critical question is whether or
not the results could be due in large part to bias, thus making
the conclusion invalid.
Observational study designs are inherently more
susceptible to bias than are experimental study designs.
The following are some biases which can occur in any study:
Confounding bias: Confounding is the distortion of the
effect of one risk factor by the presence of another.
Confounding occurs when another risk factor for a disease is
also associated with the risk factor being studied but acts
separately. Age, gender and breed are often confounding risk
factors. Confounding can be controlled by restriction or by
matching on the confounding variable.
It is a systematic error due to the failure to account for the effect
of one or more variables that are related to both the causal
factor being studied and the outcome, and are not distributed
in the same manner between the groups being studied.
Confounding can be accounted for if the confounding
variables are measured and are included in the statistical
model of the cause-effect relationships.
Ecological (Aggregation) bias: Systematic error that occurs
when an association observed between variables representing
group averages is mistakenly taken to represent the actual
association that exists between these variables for individuals.
This bias occurs when the nature of the association at the
individual level is different from the association observed at
the group level.
Measurement bias: Systematic error that occurs when, because
of the lack of blinding or related reasons such as diagnostic
suspicion, the measurement methods (instrument or observer
of instrument) are consistently different between groups in
the study. Screening bias is one of the most important
measurement biases.
Screening bias: The bias that occurs when the presence
of a disease is detected earlier during its latent period by
screening tests but the course of the disease is not changed
by earlier intervention. Because the survival after screening
detection is longer than the survival after detection of clinical
signs, ineffective interventions appear to be effective unless
they are compared appropriately in clinical trials.
Reader's bias: Systematic errors of interpretation made
during inference by the users or readers of clinical information.
Such biases are due to clinical experience, tradition, prejudice
and human nature. The human tendency is to accept
information that supports preconceived opinions and to reject
that which does not support preconceived opinions.
Sampling (Selection) bias: Systematic error that occurs
when, because of design and execution errors in sampling,
selection, or allocation methods, the study comparisons are
between groups that differ with respect to the outcome of
interest for reasons other than those under study.
Analysis of Epidemiological Studies
Analysis of Cohort Study
In a cohort study the data are analyzed in terms of:
(a) Incidence rate of outcome among exposed and non-exposed.
(b) Estimation of risk.
(a) Incidence Rates
In a cohort study, we can determine incidence directly in those
exposed and those not exposed.
The framework of the cohort study can be represented as
follows:
Cohort          Disease                          Total
                Positive        Negative
Exposed         a               b                (a + b) = H1
Non-exposed     c               d                (c + d) = H2
Total           (a + c) = V1    (b + d) = V2     N
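Then the incidence rates are:
$$\text{Incidence among exposed} = \frac{a}{a + b} = \frac{a}{H_1}$$
$$\text{Incidence among non-exposed} = \frac{c}{c + d} = \frac{c}{H_2}$$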
(b) Estimation of Risk
The risk of the outcome (disease or death) in the exposed and non-exposed cohorts is determined by two indices: (a) relative risk
and (b) attributable risk.
Relative Risk
Relative risk is the ratio of the incidence of the disease (or
death) among the exposed to the incidence among the non-exposed.
This may also be referred to as the risk ratio.
Estimation of relative risk is important in aetiological
studies. It directly measures the ‘strength’ of the association
between suspected cause and effect.
A relative risk of 1 indicates no association; a relative risk
greater than 1 suggests a ‘positive’ association between
exposure and the disease under study.
The larger the relative risk, the greater the strength of the
association between the suspected factor and the disease.
$$\text{Relative risk (RR)} = \frac{a / H_1}{c / H_2}$$
where H1 = (a + b) and H2 = (c + d).
Attributable Risk
Attributable risk (AR) is the difference in the incidence rates of
disease (or death) between the exposed group and the non-exposed
group. This may also be referred to as the “risk difference”.
Attributable risk is often expressed as a percentage.
Attributable risk indicates to what extent the disease
under study can be attributed to exposure.
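Both indices follow directly from the four cells of the cohort table, as the following Python sketch suggests. The function name and the cohort figures are hypothetical. Expressing the attributable risk as a percentage of the incidence among the exposed is one common convention and is an assumption here, since the text does not give the formula.

```python
def cohort_measures(a, b, c, d):
    """Incidence, relative risk and attributable risk from a cohort table.

    a, b: exposed with / without disease (row total H1 = a + b)
    c, d: non-exposed with / without disease (row total H2 = c + d)
    """
    incidence_exposed = a / (a + b)
    incidence_non_exposed = c / (c + d)
    relative_risk = incidence_exposed / incidence_non_exposed
    risk_difference = incidence_exposed - incidence_non_exposed
    # Assumed convention: difference as a percentage of incidence among exposed.
    attributable_risk_percent = risk_difference * 100 / incidence_exposed
    return {
        "incidence_exposed": incidence_exposed,
        "incidence_non_exposed": incidence_non_exposed,
        "relative_risk": relative_risk,
        "risk_difference": risk_difference,
        "attributable_risk_percent": attributable_risk_percent,
    }


# Hypothetical cohort: 70 cases among 7000 exposed, 3 cases among 3000 non-exposed.
m = cohort_measures(70, 6930, 3, 2997)
print(round(m["relative_risk"], 2))              # 10.0
print(round(m["attributable_risk_percent"], 1))  # 90.0
```

In this hypothetical cohort the exposed are ten times as likely to develop the disease, and about 90 percent of the disease among the exposed can be attributed to the exposure.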
Relative Risk vs Attributable Risk
Relative risk is important in aetiological enquiries; the larger the
relative risk, the stronger the association between cause and
effect.
Attributable risk gives a better idea than relative risk of
the impact of a successful preventive or public health
programme.
Analysis of Case Control Study
In a case control study the data are analyzed in terms of:
(a) Exposure rates among cases and controls to the suspected
factor.
(b) Estimation of disease risk associated with exposure
(Odds ratio).
Exposure Rates
A case control study provides a direct estimate of the exposure
rate (frequency of exposure) to a suspected factor in a disease
group and a non-disease group.
The framework of a case control study takes the form of a 2 × 2
contingency table.
Factor          Case            Control          Total
Exposed         a               b                (a + b) = H1
Non-exposed     c               d                (c + d) = H2
Total           (a + c) = V1    (b + d) = V2     N
$$\text{Exposure rate for cases} = \frac{a}{a + c}$$
$$\text{Exposure rate for controls} = \frac{b}{b + d}$$
The exposure rates for cases and controls can be
compared by applying suitable statistical tests (comparing
the proportions of the two groups by the z-test for proportions, or
testing the association between groups and factor by the Chi-Square
test).
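As an illustration of the z-test for two proportions mentioned above, the following Python sketch compares the exposure rates of cases and controls using a pooled standard error and a normal approximation. The function name and the study figures are hypothetical and are not taken from the text.

```python
import math


def two_proportion_z_test(a, b, c, d):
    """Compare exposure rates a/(a+c) in cases and b/(b+d) in controls.

    a, c: exposed / non-exposed cases; b, d: exposed / non-exposed controls.
    Returns the z statistic and a two-sided p-value (normal approximation).
    """
    n_cases, n_controls = a + c, b + d
    p_cases, p_controls = a / n_cases, b / n_controls
    pooled = (a + b) / (n_cases + n_controls)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    z = (p_cases - p_controls) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area
    return z, p_value


# Hypothetical study: 33 of 35 cases and 55 of 82 controls were exposed.
z, p = two_proportion_z_test(33, 55, 2, 27)
print(round(z, 2), p < 0.05)
```

For a 2 × 2 table the square of this z statistic equals the Chi-Square statistic computed without continuity correction, so the two tests lead to the same conclusion.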
Estimation of Risk Associated with Exposure
A typical case control study does not provide incidence rates
from which a relative risk (RR) can be directly calculated. The
common measure of association for a case control study is the
Odds Ratio.
Odds Ratio
Odds ratio is a measure of the strength of association between
risk factor and outcome. Cases must be representative of
those with the disease and controls of those without the disease.
It is the ratio of a/b to c/d; these two quantities can be
thought of as the odds in favour of having the disease (among the
exposed and the non-exposed respectively).
$$\text{Odds ratio} = \frac{a / b}{c / d} = \frac{ad}{bc}$$
Odds ratio is a key parameter in the analysis of a case
control study.
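The cross-product form ad/bc is simple to compute from the case control table. In the Python sketch below the function name and the study figures are hypothetical, used only to illustrate the calculation.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a/b) / (c/d) = ad / bc for a 2 x 2 case control table.

    a, b: exposed cases / exposed controls
    c, d: non-exposed cases / non-exposed controls
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical study: 33 exposed and 2 non-exposed cases,
# 55 exposed and 27 non-exposed controls.
print(odds_ratio(33, 55, 2, 27))  # 8.1
```

An odds ratio of 8.1 would mean that the odds of disease among the exposed are about eight times those among the non-exposed.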
Important Features of Relative Risk
(Risk Ratio) and Odds Ratio:
(a) The odds ratio is used in the retrospective design called the case
control study, while the risk ratio is used in the cohort
(prospective) study design.
(b) Both the odds ratio and the relative risk compare the
likelihood of an event between two groups. The odds
ratio compares the relative odds of death (disease) in
each group, while the relative risk (risk ratio) compares
the probability of death (disease) in each group rather
than odds.
(c) Both the odds ratio and the relative risk are computed by
division and are relative measures.
(d) Both the risk ratio and the odds ratio take on values
between zero (0) and infinity (∞). One is the neutral
value and means that there is no difference between the
groups compared; values close to zero or infinity indicate a
large difference. A risk ratio/odds ratio larger than 1
means that group one has a larger proportion than
group two; if the opposite is true the risk ratio/odds
ratio will be smaller than 1. If we swap the two
proportions the risk ratio/odds ratio will take on its
inverse (1/RR; 1/OR).
(e) The odds ratio can be compared with the risk ratio. The risk
ratio is easier to interpret than the odds ratio. However, in
practice the odds ratio is used more often. This has to do
with the fact that the odds ratio is more closely related to
frequently used statistical techniques such as logistic
regression.
(f) The risk ratio gives the percentage difference in
classification between group one and group two, while the
odds ratio gives the ratio of the odds of suffering some
fate. The odds themselves are also a ratio.
(g) Both the odds ratio and the risk ratio are non-negative values
and lie between 0 and ∞ (0 < OR < ∞; 0 < RR < ∞).
(h) The significance of the odds ratio can be tested by using the
95% confidence interval. If the value 1 is not included
within the 95% CI, then the odds ratio is significant at the 5% level
(p<0.05).
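The text does not say how the 95% confidence interval is obtained. One widely used approximation, assumed here, works on the logarithm of the odds ratio, whose standard error is taken as the square root of 1/a + 1/b + 1/c + 1/d. The function name and the figures in this Python sketch are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% CI on the log scale.

    Assumes all four cells are non-zero; z = 1.96 gives a 95% interval.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical case control table: a = 33, b = 55, c = 2, d = 27.
or_value, lower, upper = odds_ratio_ci(33, 55, 2, 27)
significant = not (lower <= 1 <= upper)  # significant if 1 lies outside the CI
print(round(or_value, 1), round(lower, 2), round(upper, 1), significant)
```

Here the interval runs from roughly 1.8 to roughly 36 and excludes 1, so this hypothetical odds ratio would be judged significant at the 5% level, although the wide interval reflects the small number of non-exposed cases.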
Diagnostic Tests
In epidemiological studies much use is made of diagnostic
tests, based either on clinical observations or on laboratory
techniques, by means of which individuals are classified as
healthy or as falling into one of a number of disease categories.
Such tests are, of course, important throughout the whole of
medicine, and in particular form the basis of screening
programmes for the early diagnosis of disease.
Most such tests are imperfect instruments, in the sense
that healthy individuals will occasionally be classified
wrongly as being ill, while some individuals who are really
ill may fail to be detected. How should we measure the ability of a
particular diagnostic test to give the correct diagnosis both
for healthy and for ill subjects?
Properties of diagnostic tests have traditionally been
described using sensitivity, specificity, positive and negative
predictive values. These measures, however, reflect population
characteristics and do not easily translate to individual
patients.
In clinical practice, physicians are often faced with
interpreting the results of diagnostic tests. These results are
not absolute. A negative test does not always rule out disease
and some positive results can be false.
Clinical epidemiology has long focused on sensitivity
and specificity, as well as positive and negative predictive
values, as a way of measuring diagnostic utility. The test is
compared against a reference (gold) standard, and the results
are tabulated in a 2 × 2 contingency table.
The gold standard is a test that is considered to be the
most accurate among all known tests. All other tests should be
compared with this test in order to indicate whether they are
reliable, so that less accurate tests are not preferred.
Sensitivity: Sensitivity is the proportion of those with the
disease who test positive. Sensitivity is a measure of how
well the test detects disease when it is really there; a sensitive
test has few false negatives.
Specificity: Specificity is the proportion of those without
disease who test negative. It measures how well the test rules
out disease when it is really absent; a specific test has few
false positives.
Predictive values: Sensitivity and specificity help us to choose
a test, but in practice the most important measures are the
predictive values. The result of a test can be positive
or negative.
When the test is positive or abnormal, it is necessary to
know how likely it is that the disease is really present. The
positive predictive value expresses the proportion of those with
a positive test result who truly have the disease.
On the other hand, the negative predictive value is the
probability that a negative result really corresponds to a
disease-free person.
Thus we can summarize these diagnostic tests as:
Sensitivity: is disease focused–i.e. the percentage of people
with the disease that the test correctly identifies.
Specificity: is wellbeing or normal focused–i.e. the
percentage of normal people the test correctly identifies as
normal.
Positive predictive value: focuses on the positive results–i.e.
the percentage of positive results that are correct.
Negative predictive value: focuses on the negative results–
i.e. the percentage of negative results that are correct.
Early Diagnostic and Screening Test
Defining normality and abnormality:
One of the central concerns in clinical medicine is
differentiating the normal from the abnormal. How does one,
for instance, decide that somebody has hypertension? This
would not be a big problem if the frequency distributions of BP in
hypertensive and non-hypertensive people were completely
different and did not overlap. In reality, they overlap (Figure
10.2), and no matter which cut-off point is used for diagnosis,
some hypertensives will be wrongly labelled as normotensive,
while some normotensives will be diagnosed as hypertensive.
If the cut-off point is moved to the left, the number of false
negatives will decrease at the expense of more false positives. If
the cut-off point is shifted to the right, the reverse will happen.
Figure 10.2
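The trade-off between false negatives and false positives as the cut-off point moves can be demonstrated with a small Python sketch. The blood pressure readings below are invented for illustration only, and the rule that a reading at or above the cut-off counts as test-positive is an assumption.

```python
def misclassified(diseased, healthy, cutoff):
    """Count false negatives and false positives for a given cut-off.

    A value at or above the cut-off is treated as test-positive.
    """
    false_negatives = sum(1 for x in diseased if x < cutoff)
    false_positives = sum(1 for x in healthy if x >= cutoff)
    return false_negatives, false_positives


# Hypothetical systolic BP readings (mmHg); the two groups overlap.
hypertensive = [135, 142, 148, 150, 155, 160, 165, 172]
normotensive = [110, 115, 120, 125, 130, 136, 140, 145]

for cutoff in (130, 140, 150):
    fn, fp = misclassified(hypertensive, normotensive, cutoff)
    print(cutoff, fn, fp)
# 130 0 4
# 140 1 2
# 150 3 0
```

Lowering the cut-off (moving it to the left) removes the false negatives but adds false positives, and raising it does the reverse.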
An ideal test would completely separate the diseased and
the disease-free groups and there would be no overlap (Fig.
10.3). Such ideal tests are very rare. Overlap is almost always seen and
this makes it difficult to validate tests.
A test with complete separation of groups results in a perfect diagnostic performance
A test with partial separation of groups results in an intermediate diagnostic performance
A test with no separation of groups results in no diagnostic information
Figure 10.3
Validity of Test
A diagnostic test is valid if it detects most people with the
target disorder and excludes most people without disorder,
and if a positive test usually indicates that the disorder is
present. To understand this, we need to understand the need
to validate tests against a gold standard.
Using a 2 × 2 table, we could compute the sensitivity,
specificity, positive predictive value, and the negative
predictive value of the test.
It is important that all new tests should be validated by
comparison against a test which is established and considered
a gold standard. Diagnostic tests are generally not 100%
accurate. If the sensitivity is very high, the specificity tends to
be low.
Suppose the data are classified as:
Test            Gold standard*                   Total
Result          Positive        Negative
+               a               b                a + b
–               c               d                c + d
Total           a + c           b + d            N
[* By the gold standard we can classify the individual by
presence/absence of a particular disease]
‘a’= True positive
‘b’ = False positive
‘c’ = False negative
‘d’ = True negative
Sensitivity = The proportion of persons with the condition who
test positive.
$$\text{Sensitivity} = \frac{a}{a + c}$$
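Specificity = The proportion of persons without the condition
who test negative.
$$\text{Specificity} = \frac{d}{b + d}$$
Positive predictive value: The proportion of persons with
a positive test who have the condition.
$$\text{Positive predictive value} = \frac{a}{a + b}$$
Negative predictive value: The proportion of persons with
a negative test who do not have the condition.
$$\text{Negative predictive value} = \frac{d}{c + d}$$
Diagnostic accuracy: The proportion of all test results that
are correct gives the diagnostic accuracy of the test.
$$\text{Diagnostic accuracy} = \frac{a + d}{a + b + c + d}$$
Prevalence: Prevalence of the disease is the ratio of total positive
cases by the gold standard to total cases.
$$\text{Prevalence} = \frac{a + c}{a + b + c + d}$$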
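All six measures can be read off the same 2 × 2 table, as the following Python sketch illustrates. The function name and the cell counts are hypothetical and chosen only for illustration; the cell labels follow the table above.

```python
def diagnostic_measures(a, b, c, d):
    """a = true positive, b = false positive, c = false negative, d = true negative."""
    n = a + b + c + d
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "accuracy": (a + d) / n,
        "prevalence": (a + c) / n,
    }


# Hypothetical validation study of 1000 persons against a gold standard.
measures = diagnostic_measures(a=90, b=30, c=10, d=870)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```

In this hypothetical study the test is 90% sensitive and about 97% specific, yet only 75% of positive results are correct, because the disease is present in just 10% of those tested.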
Predictive Value in Relation to Prevalence
Positive predictive value (PPV) is a function of specificity,
sensitivity and prevalence.
The positive predictive value is expressed as percentage.
It is influenced by the sensitivity, specificity of the screening
test and the prevalence of disease.
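The dependence of PPV on prevalence can be checked numerically. The sketch below assumes the usual expression PPV = prevalence × sensitivity / [prevalence × sensitivity + (1 – prevalence) × (1 – specificity)]; the function name and the prevalence values in the loop are illustrative assumptions.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """PPV from sensitivity, specificity and prevalence (all as proportions)."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)


# Same test (sensitivity = specificity = 0.95) applied at different prevalences.
for prevalence in (0.005, 0.05, 0.20, 0.50):
    ppv = positive_predictive_value(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:.3f} -> PPV {100 * ppv:.1f}%")
```

With sensitivity and specificity held fixed, the printed PPV climbs steadily with prevalence, which is why a good test can still give mostly false positives when the disease is rare.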
SENSITIVITY AND SPECIFICITY IN TERMS OF
TYPE-I AND TYPE–II ERRORS
Table related to decision and hypothesis (Types of error).
                         True statement
Decision from sample     H0 True              H0 False
Accept H0                1 – α                β (Type-II error)
Reject H0                α (Type-I error)     1 – β = power
Total                    1                    1
Table related to diagnostic test:
                      Gold standard*
Test result     Positive    Negative    Total
+               a           b           a+b
–               c           d           c+d
Total           a+c         b+d         N
From the above tables we can see that β (the Type-II error)
corresponds to a false negative and α (the Type-I error) to a
false positive; (1 – β), the power of the test, corresponds to a
true positive and (1 – α) to a true negative.
As the sensitivity of a test is a / (a + c), the power of the
test, (1 – β), is equal to the sensitivity; similarly,
(1 – α) = d / (b + d) is equal to the specificity.
Thus there is an analogy here with the significance test.
If the null hypothesis is that an individual does not have
the disease, and a positive test is regarded as significant,
then α is analogous to the significance level and (1 – β) to
the power of the test; the alternative hypothesis is that the
individual has the disease.
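The correspondence between the error probabilities and the cells of the diagnostic table can be written out as a small calculation. This is a sketch under the convention that a positive test plays the role of rejecting H0; the function name and the counts are hypothetical.

```python
def error_rates(a: int, b: int, c: int, d: int) -> dict:
    """Map 2 x 2 diagnostic counts to the hypothesis-testing quantities.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    alpha = b / (b + d)  # false positive rate = 1 - specificity
    beta = c / (a + c)   # false negative rate = 1 - sensitivity
    return {
        "alpha": alpha,
        "beta": beta,
        "power": 1 - beta,            # equals sensitivity, a / (a + c)
        "one_minus_alpha": 1 - alpha, # equals specificity, d / (b + d)
    }


# Hypothetical counts used only for illustration.
rates = error_rates(a=400, b=200, c=100, d=600)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```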
Likelihood Ratio
A fairly new concept in diagnostic tests is the concept of
likelihood ratios. Likelihood ratios are a more practical way of
making sense of diagnostic test results and have immediate
clinical relevance. In general, a useful test provides a high
positive likelihood ratio and a small negative likelihood ratio.
Likelihood ratios are independent of disease prevalence.
They may be understood using the following analogy. Assume
that the patient tests positive on a diagnostic test; if this were a
perfect test, it would mean that the patient would certainly
have the disease (true positive). The only thing that stops us
from making this conclusion is that some patients without
disease also test positive (false positive). We therefore have
to correct the true positive (TP) rate by the false positive (FP)
rate; this is done mathematically by dividing one by the other.
Positive likelihood ratio
= (Probability of a positive test in those with disease) / (Probability of a positive test in those without disease)
= TP rate / FP rate
= [a / (a + c)] / [b / (b + d)]
Likewise, if a patient tests negative, we are still worried
about the likelihood of this being a false negative (FN) rather
than a true negative (TN). This likelihood is given
mathematically by the probability of a negative test in those
with disease, compared to the probability of a negative test
in those without disease.
Negative likelihood ratio
= (Probability of a negative test in those with disease) / (Probability of a negative test in those without disease)
= FN rate / TN rate
= [c / (a + c)] / [d / (b + d)]
Likelihood ratios have a number of useful properties:
1. Because they are based on a ratio of sensitivity and
specificity, they do not vary in different populations or
settings.
2. They can be used directly at the individual patient level.
3. They allow the clinician to quantitate the probability of
disease for any individual patient.
The interpretation of likelihood ratios is intuitive: The
larger the positive likelihood ratio, the greater the likelihood
of disease; the smaller the negative likelihood ratio, the lesser
the likelihood of disease.
For example, consider a 50-year-old male with a positive stress
test. It is known that a depression of more than 1 mm on exercise
stress testing has a sensitivity and specificity of 65% and
89% respectively for coronary artery disease when compared
with the reference standard of angiography [Ref: (Diamond GA et
al Analysis of probability as an aid in the clinical diagnosis of
coronary-artery disease. N Eng J Med 1979; 300: 1350-8)].
This means that the positive likelihood ratio
= 0.65 / (1 – 0.89) = 5.9
Thus we can say that the likelihood of this patient having
the disease has increased approximately six-fold given the
positive test result.
Thus we can say that likelihood ratios are a useful and
practical way of expressing the power of diagnostic tests in
increasing or decreasing the likelihood of disease.
Unlike sensitivity and specificity, which are population
characteristics, likelihood ratios can be used at
the individual patient level.
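The stress-test figures quoted above (sensitivity 65%, specificity 89%) can be used to sketch both likelihood ratios. The conversion to a post-test probability uses the standard odds form of Bayes' theorem, which is not spelled out in this chapter, and the pre-test probability of 0.5 is purely an assumed value for illustration.

```python
def positive_lr(sensitivity: float, specificity: float) -> float:
    """TP rate divided by FP rate."""
    return sensitivity / (1 - specificity)


def negative_lr(sensitivity: float, specificity: float) -> float:
    """FN rate divided by TN rate."""
    return (1 - sensitivity) / specificity


def post_test_probability(pre_test_probability: float, lr: float) -> float:
    """Pre-test probability -> odds, multiply by the LR, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


lr_pos = positive_lr(0.65, 0.89)
lr_neg = negative_lr(0.65, 0.89)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")

# Assumed pre-test probability of 0.5 (hypothetical patient).
print(f"post-test probability after a positive test: "
      f"{post_test_probability(0.5, lr_pos):.2f}")
```

Working in odds rather than probabilities is what lets a single likelihood ratio be applied to any individual patient, whatever the starting probability of disease.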
MULTIPLE CHOICE QUESTIONS
1. Prevalence of disease affects:
(a) Sensitivity
(b) Specificity
(c) Predictive value
(d) Repeatability
(AI, 92)
2. Sensitivity of a test:
(a) True positive/(True positive + False negative)
(b) True negative/(True negative + False positive)
(c) False negative/(True negative + True positive)
(d) False negative/(True positive + False negative)
(AI, 92, 93, 97)
3. Which of the following is not true for case control study.
(a) Easy to carry out
(b) Inexpensive
(c) Attributable risk can be measured
(d) No attrition problem
(AI, 94)
4. All is true about prevalence except:
(a) Rate
(b) Specifically for old and new cases
(c) prevalence = incidence × duration
(d) Prevalence is of two types
(AI, 96)
5. Case control study provides all except:
(a) Incidence
(b) Relative risk
(c) Odds ratio
(d) Strength of association
(AI, 97)
6. True about prevalence all except:
(a) Rate
(b) Ratio
(c) Duration of disease affects it
(d) Numerator and denominator are separate
(AI, 98)
7. Incidence rate is measured by:
(a) Case control study
(b) Cohort study
(c) Cross-sectional study
(d) Cross over study
(AI, 98)
8. Predictive value for positive test is defined as :
(a) True positive/(True positive + False negative) × 100
(b) True positive/(True positive + False positive) × 100
(c) False positive/(True positive + False positive) × 100
(d) False positive/(True positive + False negative) × 100
(AI, 99)
9. Specificity of a test means all except:
(a) Identify those without disease
(b) True positive
(c) True negative
(d) An ideal screening test should have 100% specificity
(AI, 2000)
10. ELISA test for HIV was done in a population. What
will be the result of performing double screening
ELISA test:
(a) Increased sensitivity and positive predictive value
(b) Increased sensitivity and negative predictive value
(c) Increased specificity and positive predictive value
(d) Increased specificity and negative predictive value
(AI, 2001)
[Hint: By performing double screening, the true positive will
increase and the value of false negative will decrease]
11. Incidence is calculated by:
(a) Retrospective study (b) Prospective study
(c) Cross-sectional study (d) Random study
(AIIMS, May 95)
12. Prevalence is a:
(a) Rate
(b) Ratio
(c) Proportion
(d) Mean
(AIIMS, Feb 97)
13. Incidence of disease among exposed minus that of non-exposed is equal to:
(a) Relative risk
(b) Attributable risk
(c) Odds ratio
(d) None of the above
(AIIMS, June 97)
14. Specificity is related to:
(a) True positive
(b) True negative
(c) False positive
(d) False negative
(AIIMS, Dec 97)
15. ELISA test has sensitivity of 95% and specificity of 95%.
Prevalence of HIV carriers is 5%. The predictive value
of positive test is:
(a) 95%
(b) 50%
(c) 100%
(d) 75%
[Solution: The positive predictive value is given by
PPV = (Prevalence × Sensitivity) / [Prevalence × Sensitivity + (1 – Prevalence) × (1 – Specificity)]
= (0.05 × 0.95) / [0.05 × 0.95 + (1 – 0.05) × (1 – 0.95)]
= (0.05 × 0.95) / [0.05 × 0.95 × (1 + 1)]
= 1/2 = 0.5
and is expressed in percentage = 50%]
(AIIMS, June 99)
16. All of the following are true about case control study
except:
(a) Relatively cheap
(b) Relative risk can be calculated
(c) Used for rare cases
(d) Odds ratio can be calculated
(AIIMS, June 2000, AI 2002)
17. Which of the following are best for calculating the
incidence of a disease:
(a) Case control
(b) Cohort
(c) Cross-sectional study (d) Longitudinal study
(AIIMS,Nov 2000)
18. Too many false positives in a test are due to which of the
following:
(a) High prevalence
(b) Test with high specificity
(c) Test with high sensitivity
(d) High incidence
(AIIMS, Nov 2000)
19. In a community, the specificity of ELISA test is 99%
and sensitivity is 99%. The prevalence of the disease is
5/1000. Then positive predictive value of the test is:
(a) 33%
(b) 67%
(c) 75%
(d) 99%
[Solution: The positive predictive value is given by
PPV = (Prevalence × Sensitivity) / [Prevalence × Sensitivity + (1 – Prevalence) × (1 – Specificity)]
Prevalence = 5/1000 = 0.005, sensitivity = 0.99, specificity = 0.99
PPV = (0.005 × 0.99) / [0.005 × 0.99 + (1 – 0.005) × (1 – 0.99)]
= (0.005 × 0.99) / [0.005 × 0.99 + (0.995)(0.01)]
= (0.005 × 0.99) / [0.99 × (0.005 + 0.01)]   (taking 0.995 ≈ 0.99)
= 0.005 / 0.015
= approximately 0.33 and is expressed as
percentage
= 33%]
(AIIMS, May 2001)
20. In a village of 1 lakh population, among 20,000 exposed
to smoking 200 developed cancer, and among 40,000
people unexposed 40 developed cancer. The relative
risk of smoking in the development of cancer is:
(a) 20
(b) 10
(c) 5
(d) 15
[Hint: Incidence in smokers = 200/20,000 = 0.01;
Incidence in non-smokers = 40/40,000 = 0.001
Relative Risk = 0.01/0.001 = 10]
(AIIMS, May 2001)
21. A woman exposed to multiple sex partners has 5 times
increased risk for CaCx. The attributable risk is:
(a) 20%
(b) 50%
(c) 80%
(d) 100%
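[Solution: Let the incidence rate among the non-exposed be x; the
incidence rate among the exposed is 5 times higher, therefore the
incidence rate among the exposed is 5x.
According to the definition of attributable risk
AR = (Incidence among exposed – Incidence among non-exposed) / (Incidence among exposed)
= (5x – x) / 5x = 4/5 = 0.8
and expressed in percentage = 80%]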
(AIIMS,Nov 2001)
22. All are true about case control study except:
(a) Less expensive
(b) Those with disease and not diseased compared
(c) Attributable risk is estimated
(d) None of these
(AIIMS, Nov 2001)
23. Which of the following is true about cohort study:
(a) Incidence can be calculated
(b) It is from effect to cause
(c) It is inexpensive
(d) Shorter time than case control
(JIPMER,2003)
24. For the calculation of positive predictive value of a
screening test, the denominator is comprised of:
(a) True positives +False negatives
(b) False positives + True negatives
(c) True positives + False positives
(d) True positives + True negatives
(AI, 2003)
25. The table below shows the screening test results of
disease ‘Z’ in relation to the true disease status of the
population being tested:
                         Disease
Screening test result    Yes      No       Total
Positive                 400      200      600
Negative                 100      600      700
Total                    500      800      1300
The specificity of the screening test is:
(a) 70%
(b) 75%
(c) 79%
(d) 86%
26. If prevalence of diabetes is 10%, the probability that
three people selected at random from the population
will have diabetes is:
(a) 0.01
(b) 0.03
(c) 0.001
(d) 0.003
[Hint: There are two rules of probability, the addition law and
the multiplication law.
Probability of one person having diabetes is p = 1/10 = 0.1
The probability of all 3 having diabetes can be calculated using
the multiplication law of probability. It will be
p × p × p = 0.1 × 0.1 × 0.1 = 0.001]
27. The usefulness of a screening test depends upon its:
(a) Sensitivity
(b) Specificity
(c) Reliability
(d) Predictive value
(AI, 2002)
28. In a low prevalence area for Hepatitis B, a double ELISA
test was decided to be performed in place of a single
test which used to be done. This would cause an
increase in the:
(a) Specificity and positive predictive value
(b) Sensitivity and positive predictive value
(c) Sensitivity and negative predictive value
(d) Specificity and negative predictive value (AI, 2002)
29. The association between coronary artery disease and
smoking was found to be as follows.
                              Smokers    Non-smokers
Coronary artery disease       30         20
No coronary artery disease    20         30
The Odds ratio can be estimated as
(a) 0.65
(b) 0.8
(c) 1.3
(d) 2.25
[Hint: Odds ratio = (30 × 30) / (20 × 20) = 2.25]
(AI, 2002)
30. A screening test is used in the same way in two similar
populations; but the proportion of false positive results
among those who test positive in population A is lower
than those who test positive in population B. What is
the likely explanation?
(a) The specificity of the test is lower in population A
(b) The prevalence of the disease is lower in population
A
(c) The prevalence of the disease is higher in population
A
(d) The specificity of the test is higher in population A
[Hint: When the proportion of false positive results in population
A is less than that in B, the PPV in population A is higher than
that in B; thus, by the formula, the prevalence in population A is
higher than that in B]
(AIIMS, 2003)
31. Residents of three villages with three different types of
water supply were asked to participate in a study to
identify cholera carriers. Because several cholera deaths
had occurred in the recent past, virtually everyone
present at the time submitted to examination. The
proportion of residents in each village who were carriers
was computed and compared. This study is a:
(a) Cross-sectional study.
(b) Case-control study.
(c) Concurrent cohort study.
(d) Non-concurrent cohort study.
(AIIMS, 2003)
32. A drug company is developing a new pregnancy-test
kit for use on an outpatient basis. The company used
the pregnancy test on 100 women who are known to be
pregnant. Out of 100 women, 99 showed positive test.
Upon using the same test on 100 non-pregnant women,
90 showed negative result. What is the sensitivity of
the test ?
(a) 90%
(b) 99%
(c) Average of 90 and 99
(d) Cannot be calculated from the given data
[Hint:
                 Pregnant    Non-pregnant    Total
Test positive    99          10              109
Test negative    1           90              91
Total            100         100             200
Sensitivity = 99/100 = 0.99 (expressed in percentage = 99%)]
(AIIMS, 2003)
33. Which of the following relationship between different
parameters of a performance of a test is correct:
(a) Sensitivity = 1 – specificity
(b) Positive predictive value = 1 – negative predictive
value
(c) Sensitivity is inversely proportional to specificity
(d) Sensitivity = 1 – positive predictive value
[Hint: Sensitivity and specificity cannot both be increased
simultaneously. If one increases, the other will decrease]
(AIIMS, 2004)
34. Which of the following is not an advantage of a
prospective cohort study:
(a) Precise measurement of exposure is possible
(b) Many disease outcomes can be studied
simultaneously
(c) It usually costs less than a case control study
(d) Recall bias is minimized compared with a case
control study
35. The incidence rate of a disease is five times greater in
women than in men, but the prevalence rate shows no
sex difference. The best explanation is that:
(a) The crude death rate (by all causes) is greater in
women
(b) The case fatality rate for this disease is lower in
women
(c) The case-fatality rate is greater in women
(d) Risk factors for the disease are more common in
women
36. In a study of a disease in which all cases that developed
were ascertained, if the relative risk for the association
between factor and disease is equal to or less than 1
then:
(a) The factors protect against the development of the
disease
(b) There is either no association or a negative
association between the factors and disease
(c) Either matching is not done properly
(d) There is a significant positive association between
the diseases
[Hint: The risk ratio 1 indicate that there is no difference
between two groups, and the range of Risk Ratio lies between
(0
60
4,000
12,000
6,000
8,000
36
48
66
158
3,000
20,000
4,000
3,000
30
100
48
60
1,000
4,000
3,000
2,000
Total
30,000
308
30,000
238
10,000
Find out the death rate of which district is higher.
9. The following data give the number of women of
child bearing age and yearly births in five year age
groups for a city. Calculate the general fertility rate
and total fertility rate. If the ratio of male to female births is
13:12, what is the gross reproductive rate?
Age group    Female pop    Births
15 – 19      16,000        400
20 – 24      15,000        1710
25 – 29      14,000        2100
30 – 34      13,000        1430
35 – 39      12,000        960
40 – 44      11,000        330
45 – 49      9,000         36
Total        60,000        6690
10. A total of 1,000 individuals were surveyed and
classified as:
               Hypertensive    Normotensive    Total
Smokers        250             250             500
Non-smokers    50              450             500
Total          300             700             1000
(a) Calculate the prevalence of hypertension from the
study.
(b) Calculate smoking rate among hypertensive and
normotensive.
(c) Find out whether, smoking is associated with
hypertension.
(d) Find out the risk associated with hypertension.
11. A comparative evaluation of Ziehl-Neelsen staining
and culture on Lowenstein Jensen medium was carried out in the
diagnosis of pulmonary and extrapulmonary
tuberculosis patients. The following results were obtained:
              L-J culture (Gold standard)
Z-N stain     Positive    Negative    Total
Positive      16          0           16
Negative      16          12          28
Total         32          12          44
Find out the sensitivity, specificity, positive predictive
value, negative predictive value and diagnostic accuracy
of Z.N. Stain.
12. Following are the marks obtained by students in an
examination:
Marks      No. of students
20 – 30    25
30 – 40    26
40 – 50    36
50 – 60    42
60 – 70    27
70 – 80    15
> 80       10
(a) Find out the quartile deviation
(b) Also comment about the skewness of the
distribution.
13. Form a frequency distribution table of the following
data and calculate the two most suitable measures of
central tendencies:
32
47
41
51
30
39
18
48
54
32
31
46
15
37
32
56 300
21
45
32
37
41
44
18 650
47 390
42
44
37
56
48
53
42
37
41
51
50
47
48
14. The haemoglobin levels of patients are as follows:
Hb%    No. of cases
6      14
7      23
8      26
9      30
10     130
11     110
12     70
13     50
14     12
(a) Find out the median of the above distribution by
using ogives.
(b) Also find out the mean by using short cut method.
15. A random sample of patients selected from the
Cardiology OPD of a hospital have following values
of blood pressure:
Blood pressure    No. of cases
130 – 140         14
140 – 150         24
150 – 160         54
160 – 170         23
170 – 180         40
180 – 190         32
Calculate coefficient of dispersion (based on Quartiles
and based on Mean and SD).
16. Find the correlation coefficient and line of regression
between height and weight of 10 individuals:
Case No.    Height    Weight
1           175       65
2           166       56
3           182       78
4           167       66
5           176       72
6           169       69
7           182       81
8           190       87
9           187       84
10          151       60
17. In a survey conducted by a health agency, it was found
that in Town A, out of 876 births, 46% were male, while
in Town B, out of 690 births, 473 were male.
Is there any significant difference in the proportion
of male children in the two towns? Clearly state the
hypothesis which is to be tested.
18. A sample of 900 individuals has a mean haemoglobin
of 12.7 mg%. Is the sample drawn from a population
with mean 13.6 mg% and SD 2.70?
19. A random sample was drawn from each of two hospitals and
the following data related to blood pressure of adult male
hospital workers were obtained:
                       Hospital A      Hospital B
Mean blood pressure    127.56 mmHg     140.78 mmHg
Standard deviation     13.77 mmHg      10.37 mmHg
No. of cases           360             700
Is the blood pressure of male workers of Hospital B
significantly higher than that of those working in Hospital A?
20. Two groups of rats were placed on diets with high
and low protein contents and the gain in weight (in
gms) were recorded after 2 months. The results of gain
in weight are as follows:
Group A (high protein diet):
140 117 160 123 145 127 107 146 107 102
114 121 132 153
Group B (low protein diet):
97 63 110 120 96 74 86 120 115 120
150
Find out whether there is any significant difference
between the weight gain in rats of two groups.
21. In a clinical trial the anxiety scores of 10 patients were
recorded (baseline value). A new tranquillizer was
given to each patient for one month. After one month
the anxiety scores were again recorded. They are as
follows:
Case No.    Baseline value (xi)    After one month (yi)
1           23                     15
2           21                     20
3           24                     26
4           19                     17
5           17                     17
6           26                     21
7           22                     16
8           17                     12
9           12                     12
10          15                     11
Find out whether the new tranquillizer is effective to
psychoneurotic patients.
22. Concentration of haemoglobin (xi) and bilirubin (yi)
for infants with haemolytic disease of newborn are as
follows:
Case No.    (xi)    (yi)
1           15.8    1.8
2           12.3    5.6
3           9.5     3.6
4           9.4     3.8
5           9.2     5.6
6           8.8     5.6
7           7.6     4.7
8           7.4     6.8
Calculate the correlation coefficient and comment
whether haemoglobin level is directly proportional to
bilirubin levels.
23. Most recent amount smoked by all patients other than
those with cancer of the lung, from a retrospective
survey, are as follows:
                 Cigarette daily
Dis. Group    0      1–4    5–14    15–24    > 24    Total
Cancer        236    78     237     110      57      718
RDS           42     33     128     98       34      335
CHD           22     19     64      38       23      166
GI Dis.       39     31     143     81       34      328
Others        38     31     91      44       18      215
Total         377    185    663     371      166     1762
Find out whether various disease groups are associated
with daily cigarette smoking. Also mention the degree
of freedom required in this problem.
24. Following table shows the number of individuals in
various age groups who were found in a survey to be
positive and negative for Schistosoma mansoni eggs
in the stool.
          Age in years
          0–10    10–20    20–30    30–40    > 40    Total
Test +    14      16       14       7        6       57
Test –    87      33       66       34       11      231
Total     101     49       80       41       17      288
Find out whether the presence of Schistosoma mansoni
eggs in the stool is related to age.
25. Number of children who were nasal carriers or non-carriers
of Streptococcus pyogenes, classified by size of
tonsils. The results of the survey are as follows:
               Tonsils
               Present but     Enlarged    Greatly     Total
               not enlarged                enlarged
Carrier        19              29          24          72
Non-carrier    497             560         269         1326
Total          516             589         293         1398
Find out whether nasal carrier are associated with size
of tonsils.
26. Two groups of female rats were placed on diets with
high and low protein content, and gain in weight
between the 28th and 84th days of age was measured
for each rat. The results were as follows:
High protein diet (n = 12):
134 146 104 119 124 161 107 83 113 129 97 123
Low protein diet (n = 8):
70 118 101 85 107 132 94 115
Find out whether there is any significant increase in the
weight of rats who were given high protein diet.
27. In a clinical trial to assess the value of a new method
of treatment (A) in comparison with the old method
(B), patients were divided at random into two groups.
Out of 257 patients treated by method A, 41 died; of
244 patients treated by method B, 64 died. Find out
whether the fatality rate of group A is significantly less
than that of group B.
28. Fill in the blanks:
(a) Statistical hypothesis under test is called ..................
(b) The probability of type-I error is given by ...................
(c) The probability of type-I error is also called
...................
(d) If β is the probability of type II error, then (1 – β) is
called ................ of the test.
(e) The power function is related to type .............
error.
(f) In any testing problem, the type ................... error is
considered more serious than type .................. error.
(g) The level of significance of a test is related to type
............... error and is given by .................
(h) Critical region provides a criterion for .................. Null
hypothesis.
(i) The choice of one tailed and two tailed test depends
on .................
29. Calculate standard deviation of the following two
series:
Series A:  25   30   45   60   10  100   70
Series B: 100  120  180  240   40  400  280
30. Two random samples of size 16 and 25 are drawn from
normal population and the data of abdominal skin fold
thickness are as follows:
Sample    No. of          Sum of          Sum of square
          observations    observations    observations
1         16              76              561
2         25              105             680
Find out whether there is any significant difference
between skin fold thickness of two groups.
31. Fill in the blanks:
(a) Absolute sum of deviation is minimum from
.................
(b) The sum of squares of deviation is least when
measured from .....................
(c) If 25% of the items are less than 10 and 25% are more
than 40, the coefficient of quartile deviation is
.................
(d) In a symmetric distribution, the upper and lower
quartiles are equidistant from ..................
(e) If mean and the mode of a given distribution are
equal, then its coefficient of skewness is ..................
(f) In any distribution, the standard deviation is always
..................... the mean deviation from mean.
32. A clinical researcher postulates that weight bearing
exercise prevents the development of osteoporosis by
increasing secretion of calcitonin a hormone that
inhibits bone re-absorption. He wishes to test the
hypothesis by comparing blood levels of calcitonin
in subjects who exercise to those in subjects who do
not. The mean calcitonin secretion (µg/dl) in study and
control groups of women alongwith their respective
standard deviation are given below:
Study group
No. of women
(ages 25 to 45)
Sample mean
Sample SD
Control group
100
100
0.60
0.20
0.54
0.15
Test the desired hypothesis based on the above
observation.
33. A community health director observes that exposure
to a particular pesticide results in a higher rate of
miscarriage. To test the hypothesis regarding exposure
and miscarriage, he selects 40 women experiencing a
miscarriage and 160 women experiencing a normal
pregnancy from the records of the hospital. The 200
subjects were interviewed to determine their prior
exposure to the pesticide. The results are summarized
as:
                Exposed    Not Exposed    Total
Miscarriage     30         10             40
Normal preg.    60         100            160
Explain the type of study design and find the odds in
favour of exposure to the pesticide.
34. Test whether there is any association between marital
status and breast cancer among females:
Breast Cancer
Married
Unmarried
Yes
No
26
16
9
49
35. Compute crude death rates of population A, B and C
from the table and also compare the death rate of
population A and B taking population C as standard
population.
Age Group    PA        DA     PB        DB      PC        DC
< 10         16,000    425    20,000    600     12,000    372
10 – 20      25,000    560    12,000    240     30,000    660
20 – 40      45,000    955    50,000    1250    62,000    1612
40 – 60      21,000    752    30,000    1050    15,000    525
> 60         12,000    600    10,000    550     3,000     180
36. In Allahabad city, 20% of a random sample of 900
school children had defective eye sight, while in
Kanpur city 15% of random sample of 1,600 children
had the same defect. Is the difference between two
proportions significant?
37. Draw two systematic samples of size 5 from the data
given below:
3, 4, 7, 5, 1, 6, 8, 2, 7, 4, 7, 11, 9, 3, 4, 6, 13, 11, 11, 10
38. A screening test is 90% sensitive and 60% specific.
Calculate Positive and negative likelihood ratio of the
test.
39. Two populations of women, one using oral contraceptives
and one using no contraceptive device, were followed up for
occurrence of myocardial infarction and the observations
are given below:
Myocardial
infarction
No Myocardial
infarction
25
35
40
100
OC users
Non-users
Explain what type of study design has been adopted,
also find the relative risk of myocardial infarction due
to Oral contraceptive.
40. A two stage screening programme for detecting
diabetes used blood sugar at the first stage and the glucose
tolerance test (GTT) at the second stage. Calculate the
net sensitivity and net specificity
on the basis of the following results.
I stage
            Diabetes (+)    Diabetes (–)    Total
Test (+)    425             1575            2000
Test (–)    125             7875            8000
Total       550             9450            10,000
II stage
            Diabetes (+)    Diabetes (–)    Total
Test (+)    400             175             575
Test (–)    25              1400            1425
Total       425             1575            2000
41. A random sample of 25 patients is taken from ICCU
of a hospital and the outcome cured (C) or death (D)
was recorded according to the date of admission of
the patient, which are as follows:
C C C D D D C C C C C D D C D D D D C C D C D D C
Apply a run test to test whether the sequence of
cured and death is random.
42. Two samples are drawn from two populations whose
distribution is not known. In one group (Group A, n1
= 10) a high caloric diet was given and the second
group (Group B, n2 = 10) was on a normal diet. The
weight gain in the two groups was recorded after a month:
Group A: 12 10 12 15 9 6 10 5 15 9
Group B: 7 16 18 12 9 8 6 9 10 5
Apply a suitable test to find out whether the weight gain
in the two groups is the same.
43. A coefficient of correlation of 0.4 is derived from a
random sample of 102 pairs of observations. Is the
value of ‘r’ significant?
44. In four families each containing eight persons, the
chest measurements (in cm) of these persons are given
below. Calculate whether there is any significant
difference between the chest measurement of these
families.
Family 1
Family 2
Family 3
Family 4
35
53
47
60
85
66
49
55
67
39
33
65
69
66
58
42
56
47
33
79
90
49
57
62
56
78
44
42
39
67
68
86
45. The following table gives the frequency distribution
of pulse rate of 60 normal persons:
Pulse rate    No. of persons
45 – 50       3
50 – 55       7
55 – 60       20
60 – 65       15
65 – 70       9
70 – 75       6
Calculate upper and lower quartile and the coefficient
of dispersion.
46. The value of mean and median of 100 observations
are 50 and 52 respectively. The value of the largest item
is 100. It was found later that the correct value is
actually 120. Find the correct value of mean and
median and also calculate the mode and second
quartile.
47. Two laboratories carry out independent estimates of
content of progesterone in a particular brand of oral
contraceptive. A sample is taken from each batch,
halved and the separate halved sent to two
laboratories. The following data are obtained:
No. of sample
9
Mean value of the difference of estimate
Standard deviation of difference
0.8
16
Find out whether there is significant difference between
the content of progesterone in oral contraceptive on the
basis of report of two laboratories.
48. Calculate the correlation coefficient for the following
height (in inches) of father (x) and their sons (y):
x:  65  66  67  67  68  69  70  72
y:  67  68  65  68  72  72  69  71
49. In an investigation on neonatal blood pressure in
relation to maturity following results were obtained:
Babies 9 days old       Number    Mean systolic BP    SD
1. Normal               50        75                  8
2. Neonatal asphyxia    15        69                  6
Is the difference in mean systolic BP between the two
groups statistically significant?
50. From a field area 40 females using oral contraceptives
and 60 females using other contraceptives were
randomly selected and the number of hypertensive
cases in each group was recorded as given below:
Type of contraceptive    Total    No. of hypertensive
Oral                     40       12
Others                   60       18
Find out whether the proportion of hypertensive females
differs significantly between oral contraceptive users and
users of other contraceptives.
Answers of MCQs
and
Unsolved Questions
Answers of MCQs
Chapter 1: Classification and Tabulation
1. d
2. a
3. c
4. b
5. a
6. b
7. d
8. b
9. c
10. d
11. b
12. d
13. d
14. d
15. d
16. a
17. d
18. b, d
19. c
20. c
21. c
22. a
Chapter 2: Measure of Central Tendency
1. c
2. b
3. d
4. a
5. c
6. b
7. c
13. b
8. c
14. a
9. b
15. c
10. b
16. b
11. b
17. b
12. b
18. c
19. b
25. c
20. d
26. c
21. a
27. a
22. c
28. a
23. c
29. b, c
24. a
30. a
Chapter 3: Measure of Dispersion
1. c
7. a
2. b
8. a
3. c
9. b
4. d
10. c
5. d
11. b
6. a
12. b
13. b*
19. b
14. c
20. b
15. b
21. a
16. c
22. c
17. c
23. a
18. d
24. a
25. c
26. a
* because variance is the square of standard deviation
Chapter 4: Theoretical Discrete and Continuous
Distribution
1. a
7. b
2. d
8. a
3. b
9. b
4. d
10. a
5. a
11. a
6. c
12. b
13. d
14. b
15. c
16. d
17. d
18. d
19. c
25. a
20. a
26. a
21. b
27. b
22. b
28. d
23. a
29. b
24. b
30. d
31. c
32. b
33. b
34. d
Chapter 5: Correlation and Regression
1. b    2. d    3. b    4. a    5. d    6. c
7. a    8. b    9. b    10. a   11. c   12. a
13. a   14. b   15. b   16. b   17. a   18. a
19. b   20. d   21. c   22. d   23. d   24. d
25. c
Chapter 6: Probability
1. d    2. b    3. a    4. c    5. b    6. d
7. c    8. c    9. a    10. a
Chapter 7: Sampling and Design of Experiments
1. a
2. b
3. b
4. b
5. d
6. b
7. b
8. d
9. b
10. b
11. a
12. c, d
13. a
14. a
15. b
16. a
17. b
18. d
Chapter 8: Testing of Hypothesis
1. a
2. c
3. c
4. a
5. a
6. a
7. d
8. b
9. a
10. a
11. a
12. c
13. a
14. a
15. c
16. a
17. b
18. b
19. b
20. d
21. b
22. b
23. a
24. a
25. a
26. d
Chapter 9: Non-parametric Tests
1. e
2. d
3. b
Chapter 10: Statistical Methods in Epidemiology
1. c
2. a
3. c
4. a
5. a
6. a
7. b
8. b
9. a
10. b
11. b
12. b
13. b
14. b
15. b
16. b
17. b
18. c
19. a
20. b
21. c
22. c
23. a
24. c
25. b
26. c
27. a
28. a
29. d
30. c
31. a
32. b
33. c
34. c
35. c
36. b
37. a
38. b
39. d
Chapter 11: Vital Statistics (Demography)
1. c
2. d
3. c
4. b
5. c
6. a
7. d
8. c
9. b
10. d
11. a
12. d
13. d
14. b
15. c
16. d
17. b
18. b
19. a
20. a
21. c
22. b
23. a
24. a
25. b
26. a
27. c
28. d
29. a
30. b
31. a
32. a
33. d
34. a
35. a
36. d
37. d
38. b
39. c
40. a
Chapter 12: Health Information
1. a
2. c
3. b
4. d
5. a
6. d
7. a
8. d
9. b
Chapter 13: A Report on Census 2001
1. b
2. d
3. b
4. c
5. c
6. b
7. d
8. c
9. b
10. b
11. b
12. c
13. b
14. b
Chapter 14: National Population Policy
1. c
2. b
3. b
4. a
5. a
6. d
7. b
8. a
Answers of Unsolved Questions
1. Null hypothesis H0: µA = µB; Alternative hypothesis
H1: µA ≠ µB; Mean (A) = 51.28, SD (A) = 2.28; Mean (B) =
53.14, SD (B) = 1.67; “t” = 2.95, d.f. = 12, p < 0.05.
2. H0: µA = µB; H1: µA ≠ µB; Mean (difference) = 2; SD (d)
= 2.64, “t” = 2.27, d.f. = 8, p > 0.05.
3. H0: No association between coronary artery disease and
smoking; χ2Cal = 4, d.f. = 1; p < 0.05.
4. Hint: Go through Chapter 2.
5. Mean = 132.4; Median = 131.22; Mode = 132.5;
approximately symmetrical.
6. Correlation coefficient “r” = + 0.82.
7. Regression line x on y: x = 57.4 + 0.58y
Regression line y on x: y = 26 + 0.96x
Estimate of cholesterol for blood pressure ‘x = 160’ is
179.6.
8. Crude death rate (A) = 10.26; CDR (B) = 7.93
Standardized death rate (A) = 9.7; SDR (B) = 10.6
9. GFR = 77.4; TFR = 2.56; GRR = 1.23
10. Prevalence = 300/1000; Rate of smoking for
Hypertensive = 83.33%; Rate of smoking for
Normotensive = 35.71%; χ2 = 190.46, Risk Ratio = 5.
11. Sensitivity = 50%, Specificity = 100%, PPV = 100%, NPV
= 42.85%, Diagnostic Accuracy = 63.36%.
12. Q1 = 37.78, Q3 = 135.75; Coeff. of dispersion = 0.24.
13. Median = 43.33; Mode = 43.33.
14. Median = 11; Mean = 10.52.
15. Coeff. of dispersion (based on SD) = 0.09
Coeff. of dispersion (based on Quartile) = 0.07
16. Correlation coefficient “r” = 0.79; Regression line
between Height (Ht) and weight (Wt) is Ht = 111.32 +
0.88 Wt.
17. H0: P1 = P2; H1: P1 ≠ P2; Z = 9.16; p < 0.001.
18. H0: µ = 13.6; H1 : µ ≠ 13.6; Z = 10, p < 0.001.
19. H0: µA = µB, H1 : µA < µB; Z = 15.94; p < 0.001.
20. H0: µA = µB, H1 : µA ≠ µB; Mean (A) = 128.14, SD (A) =
18.33; Mean (B) = 104.63, SD (B) = 24.60; ‘t’ = 2.27, d.f. =
23; p < 0.05.
21. H0: µx = µy, H1 : µx > µy; Mean (difference) = 2.9; SD (d)
= 3.17; ‘t’ = 2.89, d.f. = 9; p < 0.05.
22. Correlation coefficient ‘r’ = – 0.58; inversely proportional.
23. H0: No association between disease groups and
cigarette smoking; χ2 = 27.18, d.f. = 16; p < 0.05.
24. H0: No relation between age and presence of Schistosoma
mansoni eggs; χ2 = 10.35, d.f. = 4; p < 0.05.
25. H0: Nasal carrier status is not associated with size of tonsils;
χ2 = 7.85, d.f. = 2; p < 0.05.
26. H0: µ1 = µ2; H1: µ1 ≠ µ2; ‘t’ = 1.84, d.f. = 18, p > 0.05.
27. H0: µA = µB; H1: µA < µB; Z = 2.77, p < 0.01.
28. (a) Null hypothesis, (b) α; (c) Level of significance; (d)
Power; (e) Type II (f) Type I, Type II; (g) Type I, α ; (h)
Rejecting (i) Alternative hypothesis.
29. SD (A) = 30.64; SD (B) = 122.56.
30. H0: µ1 = µ2; H1: µ1 ≠ µ2; Mean (1) = 4.75, SD (1) = 3.65; Mean
(2) = 4.20, SD (2) = 3.15; ‘t’ = 0.51; d.f. = 39; p > 0.05.
31. (a) Median; (b) Mean; (c) 15; (d) Mean; (e) zero; (f) less.
32. H0: µ1 = µ2; H1: µ1 ≠ µ2; ‘Z’ = 2.5, p < 0.05.
33. Retrospective study; Odds ratio = 5.
34. H0: No association between marital status and breast
cancer; χ2 = 20.02, d.f. = 1; p < 0.001.
35. CDR (A) = 27.66; CDR (B) = 30.24; CDR (C) = 27.45
Standardized death rate (A) = 24.53; SDR (B) = 26.26.
36. H0: P1 = P2; H1: P1 ≠ P2; ‘Z’ = 3.21; p < 0.001.
37. Hint: Systematic sampling; 20 = 5 × k; k = 20/5 = 4.
38. Positive likelihood ratio = 2.25;
Negative likelihood ratio = 0.16
39. Prospective study; Risk ratio = 1.48.
40. Sensitivity = 72.2%, Specificity = 98.14%
41. H0: sequence of crude and death in this series is random,
No. of run = 11, “z” = 1.02; p > 0.05 (i.e. accept H0).
42. H0: µ1 = µ2; H1: µ1 ≠ µ2; Mann Whitney U-test, ‘Z’ =
0.01; p > 0.05.
43. ‘t’ = 4.39; d.f. = 100, p < 0.001.
44. H0: µ1 = µ2 = µ3 = µ4; H1: not all means are equal; Analysis of
variance, ‘F’ = 0.14; d.f. (3, 28); p > 0.05.
45. Q1 = 56.25; Q3 = 65.00, Coeff. of dispersion = 0.07.
46. Mean = 50.20; Median = 52, Mode = 55.6.
47. H0: d = 0; H1: d ≠ 0, ‘t’ = 0.15, d.f. = 8, p > 0.05.
48. Correlation coefficient ‘r’ = 0.60.
49. H0: µ1 = µ2; H1: µ1 ≠ µ2; ‘t’ = 2.65, d.f. = 63; p < 0.05.
50. H0: P1 = P2; H1: P1 ≠ P2; ‘Z’ = 0, p > 0.05.
Appendix
Statistical Tables
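Table 1: Areas under normal curve
Normal probability curve is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right], \quad -\infty < x < \infty$$
and standard normal probability curve is given by
$$\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}z^{2}\right), \quad -\infty < z < \infty$$
Figure A-1
The following table gives the shaded area in the diagram,
viz. P(0 < Z < z), for different values of z.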
Tables of Areas
↓Z→     0      1      2      3      4      5      6      7      8      9
0.0   .0000  .0040  .0080  .0120  .0160  .0199  .0239  .0279  .0319  .0359
0.1   .0398  .0438  .0478  .0517  .0557  .0596  .0636  .0675  .0714  .0753
0.2   .0793  .0832  .0871  .0910  .0948  .0987  .1026  .1064  .1103  .1141
0.3   .1179  .1217  .1255  .1293  .1331  .1368  .1406  .1443  .1480  .1517
0.4   .1554  .1591  .1628  .1664  .1700  .1736  .1772  .1808  .1844  .1879
0.5   .1915  .1950  .1985  .2019  .2054  .2088  .2123  .2157  .2190  .2224
0.6   .2257  .2291  .2324  .2357  .2389  .2422  .2454  .2486  .2517  .2549
0.7   .2580  .2611  .2642  .2673  .2703  .2734  .2764  .2794  .2823  .2852
0.8   .2881  .2910  .2939  .2967  .2995  .3023  .3051  .3078  .3106  .3133
0.9   .3159  .3186  .3212  .3238  .3264  .3289  .3315  .3340  .3365  .3389
1.0   .3413  .3438  .3461  .3485  .3508  .3531  .3554  .3577  .3599  .3621
1.1   .3643  .3665  .3686  .3708  .3729  .3749  .3770  .3790  .3810  .3830
1.2   .3849  .3869  .3888  .3907  .3925  .3944  .3962  .3980  .3997  .4015
1.3   .4032  .4049  .4066  .4082  .4099  .4115  .4131  .4147  .4162  .4177
1.4   .4192  .4207  .4222  .4236  .4251  .4265  .4279  .4292  .4306  .4319
1.5   .4332  .4345  .4357  .4370  .4382  .4394  .4406  .4418  .4429  .4441
1.6   .4452  .4463  .4474  .4484  .4495  .4505  .4515  .4525  .4535  .4545
1.7   .4554  .4564  .4573  .4582  .4591  .4599  .4608  .4616  .4625  .4633
1.8   .4641  .4649  .4656  .4664  .4671  .4678  .4686  .4693  .4699  .4706
1.9   .4713  .4719  .4726  .4732  .4738  .4744  .4750  .4756  .4761  .4767
2.0   .4772  .4778  .4783  .4788  .4793  .4798  .4803  .4808  .4812  .4817
2.1   .4821  .4826  .4830  .4834  .4838  .4842  .4846  .4850  .4854  .4857
2.2   .4861  .4864  .4868  .4871  .4875  .4878  .4881  .4884  .4887  .4890
2.3   .4893  .4896  .4898  .4901  .4904  .4906  .4909  .4911  .4913  .4916
2.4   .4918  .4920  .4922  .4925  .4927  .4929  .4931  .4932  .4934  .4936
2.5   .4938  .4940  .4941  .4943  .4945  .4946  .4948  .4949  .4951  .4952
2.6   .4953  .4955  .4956  .4957  .4959  .4960  .4961  .4962  .4963  .4964
2.7   .4965  .4966  .4967  .4968  .4969  .4970  .4971  .4972  .4973  .4974
2.8   .4974  .4975  .4976  .4977  .4977  .4978  .4979  .4979  .4980  .4981
2.9   .4981  .4982  .4982  .4983  .4984  .4984  .4985  .4985  .4986  .4986
3.0   .4987  .4987  .4987  .4988  .4988  .4989  .4989  .4989  .4990  .4990
3.1   .4990  .4991  .4991  .4991  .4992  .4992  .4992  .4992  .4993  .4993
3.2   .4993  .4993  .4994  .4994  .4994  .4994  .4994  .4995  .4995  .4995
3.3   .4995  .4995  .4995  .4996  .4996  .4996  .4996  .4996  .4996  .4997
3.4   .4997  .4997  .4997  .4997  .4997  .4997  .4997  .4997  .4997  .4998
3.5   .4998  .4998  .4998  .4998  .4998  .4998  .4998  .4998  .4998  .4998
3.6   .4998  .4998  .4999  .4999  .4999  .4999  .4999  .4999  .4999  .4999
3.7   .4999  .4999  .4999  .4999  .4999  .4999  .4999  .4999  .4999  .4999
3.9   .5000  .5000  .5000  .5000  .5000  .5000  .5000  .5000  .5000  .5000
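Any entry of Table 1 can be recomputed from the error function, since P(0 < Z < z) = ½ erf(z/√2) for z ≥ 0. The following is a minimal sketch using only the Python standard library; the function name `area_zero_to_z` is chosen here for illustration and does not come from the book.

```python
import math


def area_zero_to_z(z):
    """Area under the standard normal curve between 0 and z (z >= 0)."""
    return 0.5 * math.erf(z / math.sqrt(2.0))


# Print a few table entries rounded to four decimals, as in Table 1.
for z in (0.50, 1.00, 1.96, 2.58):
    print(f"z = {z:.2f}   P(0 < Z < z) = {area_zero_to_z(z):.4f}")
```

Rounding to four decimals is what makes the computed values comparable with the printed table.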
Table 2: Ordinates of the normal probability curve
The following table gives the ordinates of the standard
normal probability curve, i.e., it gives the value of
$$\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}z^{2}\right), \quad -\infty < z < \infty$$
for different values of z, where
$$Z = \frac{X - E(X)}{\sigma_X} \sim N(0, 1)$$
Obviously $\phi(-z) = \phi(z)$.
Z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0   .3989  .3989  .3989  .3988  .3986  .3984  .3982  .3980  .3977  .3973
0.1   .3970  .3965  .3961  .3956  .3951  .3945  .3939  .3932  .3925  .3918
0.2   .3910  .3902  .3894  .3885  .3876  .3867  .3857  .3847  .3836  .3825
0.3   .3814  .3802  .3790  .3778  .3765  .3752  .3739  .3725  .3712  .3697
0.4   .3683  .3668  .3653  .3637  .3621  .3605  .3589  .3572  .3555  .3538
0.5   .3521  .3503  .3485  .3467  .3448  .3429  .3410  .3391  .3372  .3352
0.6   .3332  .3312  .3292  .3271  .3251  .3230  .3209  .3187  .3166  .3144
0.7   .3123  .3101  .3079  .3056  .3034  .3011  .2989  .2966  .2943  .2920
0.8   .2897  .2874  .2850  .2827  .2803  .2780  .2756  .2732  .2709  .2685
0.9   .2661  .2637  .2613  .2589  .2565  .2541  .2516  .2492  .2468  .2444
1.0   .2420  .2396  .2371  .2347  .2323  .2299  .2275  .2251  .2227  .2203
1.1   .2179  .2155  .2131  .2107  .2083  .2059  .2036  .2012  .1989  .1965
1.2   .1942  .1919  .1895  .1872  .1849  .1826  .1804  .1781  .1758  .1736
1.3   .1714  .1691  .1669  .1647  .1626  .1604  .1582  .1561  .1539  .1518
1.4   .1497  .1476  .1456  .1435  .1415  .1394  .1374  .1354  .1334  .1315
1.5   .1295  .1276  .1257  .1238  .1219  .1200  .1182  .1163  .1145  .1127
1.6   .1109  .1092  .1074  .1057  .1040  .1023  .1006  .0989  .0973  .0957
1.7   .0940  .0925  .0909  .0893  .0878  .0863  .0848  .0833  .0818  .0804
1.8   .0790  .0775  .0761  .0748  .0734  .0721  .0707  .0694  .0681  .0669
1.9   .0656  .0644  .0632  .0620  .0608  .0596  .0584  .0573  .0562  .0551
2.0   .0540  .0529  .0519  .0508  .0498  .0488  .0478  .0468  .0459  .0449
2.1   .0440  .0431  .0422  .0413  .0404  .0396  .0387  .0379  .0371  .0363
2.2   .0355  .0347  .0339  .0332  .0325  .0317  .0310  .0303  .0297  .0290
2.3   .0283  .0277  .0270  .0264  .0258  .0252  .0246  .0241  .0235  .0229
2.4   .0224  .0219  .0213  .0208  .0203  .0198  .0194  .0189  .0184  .0180
2.5   .0175  .0171  .0167  .0163  .0158  .0154  .0151  .0147  .0143  .0139
2.6   .0136  .0132  .0129  .0126  .0122  .0119  .0116  .0113  .0110  .0107
2.7   .0104  .0101  .0099  .0096  .0093  .0091  .0088  .0086  .0084  .0081
2.8   .0079  .0077  .0075  .0073  .0071  .0069  .0067  .0065  .0063  .0061
2.9   .0060  .0058  .0056  .0055  .0053  .0051  .0050  .0048  .0047  .0046
3.0   .0044  .0043  .0042  .0040  .0039  .0038  .0037  .0036  .0035  .0034
3.1   .0033  .0032  .0031  .0030  .0029  .0028  .0027  .0026  .0025  .0025
3.2   .0024  .0023  .0022  .0022  .0021  .0020  .0020  .0019  .0018  .0018
3.3   .0017  .0017  .0016  .0016  .0015  .0015  .0014  .0014  .0013  .0013
3.4   .0012  .0012  .0012  .0011  .0011  .0010  .0010  .0010  .0009  .0009
3.5   .0009  .0008  .0008  .0008  .0008  .0007  .0007  .0007  .0007  .0006
3.6   .0006  .0006  .0006  .0005  .0005  .0005  .0005  .0005  .0005  .0004
3.7   .0004  .0004  .0004  .0004  .0004  .0004  .0003  .0003  .0003  .0003
3.8   .0003  .0003  .0003  .0003  .0003  .0002  .0002  .0002  .0002  .0002
3.9   .0002  .0002  .0002  .0002  .0002  .0002  .0002  .0002  .0001  .0001
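The ordinates in Table 2 follow directly from the density of the standard normal curve, φ(z) = exp(−z²/2)/√(2π). A minimal sketch for recomputing an entry is given below; the function name `ordinate` is illustrative only.

```python
import math


def ordinate(z):
    """Height of the standard normal probability curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


# Recompute a few tabulated ordinates; the curve is symmetric about z = 0.
for z in (0.0, 1.0, 2.0, -1.0):
    print(f"z = {z:+.2f}   ordinate = {ordinate(z):.4f}")
```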
Table 3: Significant values of t-distribution (Two tail areas)
           Probability (Level of Significance)
d.f. (v)    0.50    0.10    0.05    0.02    0.01    0.001
1           1.00    6.31   12.71   31.82   63.66   636.62
2           0.82    2.92    4.30    6.97    9.93    31.60
3           0.77    2.35    3.18    4.54    5.84    12.94
4           0.74    2.13    2.78    3.75    4.60     8.61
5           0.73    2.02    2.57    3.37    4.03     6.86
6           0.72    1.94    2.45    3.14    3.71     5.96
7           0.71    1.90    2.37    3.00    3.50     5.41
8           0.71    1.86    2.31    2.90    3.36     5.04
9           0.70    1.83    2.26    2.82    3.25     4.78
10          0.70    1.81    2.23    2.76    3.17     4.59
11          0.70    1.80    2.20    2.72    3.11     4.44
12          0.70    1.78    2.18    2.68    3.06     4.32
13          0.69    1.77    2.16    2.65    3.01     4.22
14          0.69    1.76    2.15    2.62    2.98     4.14
15          0.69    1.75    2.13    2.60    2.95     4.07
16          0.69    1.75    2.12    2.58    2.92     4.02
17          0.69    1.74    2.11    2.57    2.90     3.97
18          0.69    1.73    2.10    2.55    2.88     3.92
19          0.69    1.73    2.09    2.54    2.86     3.88
20          0.69    1.73    2.09    2.53    2.85     3.85
21          0.69    1.72    2.08    2.52    2.83     3.83
22          0.69    1.72    2.07    2.51    2.82     3.79
23          0.69    1.71    2.07    2.50    2.81     3.77
24          0.69    1.71    2.06    2.49    2.80     3.75
25          0.68    1.71    2.06    2.49    2.79     3.73
26          0.68    1.71    2.06    2.48    2.78     3.71
27          0.68    1.70    2.05    2.47    2.77     3.69
28          0.68    1.70    2.05    2.47    2.76     3.67
29          0.68    1.70    2.05    2.46    2.76     3.66
30          0.68    1.70    2.04    2.46    2.75     3.65
∞           0.67    1.65    1.96    2.33    2.58     3.29
Table 4: Significant values $\chi^2_\alpha$ of chi-square
distribution (Right tail areas for given probability α),
where $P(\chi^2 > \chi^2_\alpha) = \alpha$ and ν is degrees of freedom (d.f.)
Degree of
freedom (ν)   α = 0.99     0.95     0.50     0.10     0.05     0.02     0.01
1              .000157   .00393     .455    2.706    3.840    5.412    6.635
2              .0201     .103      1.386    4.605    5.991    7.824    9.210
3              .115      .352      2.366    6.251    7.815    9.837   11.341
4              .297      .711      3.357    7.779    9.488   11.668   13.277
5              .554     1.145      4.351    9.236   11.070   13.388   15.086
6              .872     1.635      5.348   10.645   12.592   15.033   16.812
7             1.239     2.167      6.346   12.017   14.067   16.622   18.475
8             1.646     2.733      7.344   13.362   15.507   18.168   20.090
9             2.088     3.325      8.343   14.684   16.919   19.679   21.666
10            2.558     3.940      9.340   15.987   18.307   21.161   23.209
11            3.053     4.575     10.341   17.275   19.675   22.618   24.725
12            3.571     5.226     11.340   18.549   21.026   24.054   26.217
13            4.107     5.892     12.340   19.812   22.362   25.472   27.688
14            4.660     6.571     13.339   21.064   23.685   26.873   29.141
15            5.229     7.261     14.339   22.307   24.996   28.259   30.578
16            5.812     7.962     15.338   23.542   26.296   29.633   32.000
17            6.408     8.672     16.338   24.769   27.587   30.995   33.409
18            7.015     9.390     17.338   25.989   28.869   32.346   34.805
19            7.633    10.117     18.338   27.204   30.144   33.687   36.191
20            8.260    10.851     19.337   28.412   31.410   35.020   37.566
21            8.897    11.591     20.337   29.615   32.671   36.343   38.932
22            9.542    12.338     21.337   30.813   33.924   37.659   40.289
23           10.196    13.091     22.337   32.007   35.172   38.968   41.638
24           10.856    13.848     23.337   33.196   36.415   40.270   42.980
25           11.524    14.611     24.337   34.382   37.652   41.566   44.314
26           12.198    15.379     25.336   35.563   38.885   42.856   45.642
27           12.879    16.151     26.336   36.741   40.113   44.140   46.963
28           13.565    16.928     27.336   37.916   41.337   45.419   48.278
29           14.256    17.705     28.336   39.087   42.557   46.693   49.588
30           14.953    18.493     29.336   40.256   43.773   47.962   50.892
Note: For degrees of freedom (ν) greater than 30, the
quantity $\sqrt{2\chi^2} - \sqrt{2\nu - 1}$ may be used as a normal variate with unit
variance.
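For degrees of freedom beyond the range of Table 4, the usual large-sample approximation treats √(2χ²) − √(2ν − 1) as a standard normal variate. Solving that expression for χ² gives an approximate critical value, as sketched below. The function name is hypothetical, and the one-tailed 5 percent normal value 1.645 is supplied here as an assumption; it is not printed in this table.

```python
import math


def chi_square_critical_large_df(df, z_alpha):
    """Approximate right-tail chi-square critical value for df > 30.

    Solves sqrt(2 * chi2) - sqrt(2 * df - 1) = z_alpha for chi2.
    """
    if df <= 30:
        raise ValueError("approximation is intended for df greater than 30")
    return 0.5 * (z_alpha + math.sqrt(2 * df - 1)) ** 2


# Approximate 5 percent point for 50 degrees of freedom
# (1.645 is the assumed one-tailed 5 percent standard normal value).
approx_50 = chi_square_critical_large_df(50, 1.645)
print(round(approx_50, 2))
```

The result is an approximation and can differ somewhat from an exact chi-square table.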
Table 5: Significant values of the variance ratio
F-distribution (Right tail areas 5 percent points)
ν2 \ ν1     1      2      3      4      5      6      8     12     24      ∞
1        161.4  199.5  215.7  224.6  230.2  234.0  238.9  243.9  249.0  254.3
2        18.51  19.00  19.16  19.25  19.30  19.35  19.37  19.41  19.45  19.50
3        10.13   9.55   9.28   9.12   9.01   8.94   8.84   8.74   8.64   8.53
4         7.71   6.94   6.59   6.39   6.26   6.16   6.04   5.91   5.77   5.65
5         6.61   5.79   5.41   5.19   5.05   4.95   4.82   4.68   4.53   4.36
6         5.99   5.14   4.76   4.53   4.39   4.28   4.15   4.00   3.84   3.67
7         5.59   4.74   4.35   4.12   3.97   3.87   3.78   3.57   3.41   3.23
8         5.32   4.46   4.07   3.84   3.69   3.58   3.44   3.28   3.12   2.93
9         5.12   4.26   3.86   3.63   3.48   3.37   3.23   3.07   2.90   2.71
10        4.96   4.10   3.71   3.48   3.33   3.22   3.07   2.91   2.74   2.54
11        4.84   3.98   3.59   3.36   3.20   3.09   2.95   2.79   2.61   2.40
12        4.75   3.88   3.49   3.26   3.11   3.00   2.85   2.69   2.50   2.30
13        4.67   3.80   3.41   3.18   3.02   2.92   2.77   2.60   2.42   2.21
14        4.60   3.74   3.34   3.11   2.96   2.85   2.70   2.53   2.35   2.13
15        4.54   3.68   3.29   3.06   2.90   2.79   2.64   2.48   2.29   2.07
16        4.49   3.63   3.24   3.01   2.85   2.74   2.59   2.42   2.24   2.01
17        4.45   3.59   3.20   2.96   2.81   2.70   2.55   2.38   2.19   1.96
18        4.41   3.55   3.16   2.93   2.77   2.66   2.51   2.34   2.15   1.92
19        4.38   3.52   3.13   2.90   2.74   2.63   2.48   2.31   2.11   1.88
20        4.35   3.49   3.10   2.87   2.71   2.60   2.45   2.28   2.08   1.84
21        4.32   3.47   3.07   2.84   2.68   2.57   2.42   2.25   2.05   1.81
22        4.30   3.44   3.05   2.82   2.66   2.55   2.40   2.23   2.03   1.78
23        4.28   3.42   3.03   2.80   2.64   2.53   2.38   2.20   2.00   1.76
24        4.26   3.40   3.01   2.78   2.62   2.51   2.36   2.18   1.98   1.73
25        4.24   3.38   2.99   2.76   2.60   2.49   2.34   2.16   1.96   1.71
26        4.22   3.37   2.98   2.74   2.59   2.47   2.32   2.15   1.95   1.69
27        4.21   3.35   2.96   2.73   2.57   2.46   2.30   2.13   1.93   1.67
28        4.20   3.34   2.95   2.71   2.56   2.44   2.29   2.12   1.91   1.65
29        4.18   3.33   2.93   2.70   2.54   2.43   2.28   2.10   1.90   1.64
30        4.17   3.32   2.92   2.69   2.53   2.42   2.27   2.09   1.89   1.62
40        4.08   3.23   2.84   2.61   2.45   2.34   2.18   2.00   1.79   1.51
60        4.00   3.15   2.76   2.52   2.37   2.25   2.10   1.92   1.70   1.39
120       3.92   3.07   2.68   2.45   2.29   2.17   2.02   1.83   1.62   1.25
∞         3.84   2.99   2.60   2.37   2.21   2.09   1.94   1.75   1.52   1.00
47
74
76
35
59
22
42
01
21
60
18
62
36
85
29
62
49
08
16
03
97
16
12
55
16
84
63
33
57
18
26
52
37
70
56
99
16
31
17
18
37
15
93
07
38
28
94
77
17
63
12
86
43
24
62
85
56
12
37
22
04
32
92
97
19
35
94
53
78
34
32
73
67
27
99
35
13
35
77
72
43
46
75
95
12
39
31
59
29
44
86
62
66
26
64
40
96
88
33
50
44
84
50
83
49
57
16
78
09
36
42
56
96
38
33
83
42
27
27
17
16
92
39
54
24
95
64
47
96
81
50
96
34
20
50
95
14
89
16
07
26
50
43
55
55
56
27
47
14
26
68
82
38
87
45
34
87
58
44
11
08
54
06
67
07
96
36
57
71
27
46
26
755
72
09
19
09
99
37
30
82
88
19
82
54
61
20
07
31
22
13
97
16
45
20
79
83
00
2
17
77
98
52
49
46
42
32
05
31
89
12
64
59
15
83
11
53
34
37
04
10
42
17
98
53
90
03
62
51
25
36
34
37
86
46
76
07
93
74
50
07
46
63
32
79
72
43
06
93
16
09
00
19
32
31
96
23
47
71
44
09
71
37
78
93
09
74
47
00
45
49
62
24
38
88
78
67
75
38
62
62
32
53
15
90
17
70
04
59
52
06
20
80
54
87
21
12
15
90
33
27
13
57
06
76
33
43
34
85
76
14
22
42
35
76
86
51
52
26
07
55
12
18
37
24
18
68
66
50
85
02
06
20
33
73
00
84
16
36
38
10
44
Table 6: Random sampling numbers
13
03
66
49
60
06
88
53
87
96
50
58
13
77
80
07
58
14
32
04
54
79
12
44
10
45
53
98
43
25
07
42
27
45
51
59
21
53
07
97
94
72
38
55
10
86
35
84
83
44
99
08
60
24
88
88
23
74
77
77
07
68
23
93
60
85
26
92
39
66
02
11
51
97
26
83
21
46
24
34
88
64
85
42
29
34
12
52
02
73
14
79
54
49
01
30
80
90
99
80
05
10
53
39
64
76
79
54
28
95
73
10
76
30
19
44
21
45
11
05
79
04
48
91
06
38
79
43
10
89
14
81
30
34
57
42
39
94
90
27
24
23
96
67
90
05
46
19
26
97
71
99
95
68
74
27
00
29
16
11
35
38
31
66
14
68
20
67
05
07
68
26
14
93
10
56
61
52
40
84
51
78
58
82
94
10
16
25
30
25
37
68
98
70
88
85
65
98
47
45
18
73
97
66
75
16
86
91
13
65
86
29
94
60
23
85
53
75
14
11
00
90
79
59
06
20
38
47
70
76
53
61
24
22
09
54
58
87
64
75
33
97
15
83
06
33
42
96
55
59
48
66
68
35
98
87
37
59
05
73
96
51
06
62
09
32
38
44
74
29
55
37
49
85
42
66
78
36
71
88
02
40
15
64
19
51
97
33
30
97
90
32
69
15
99
47
80
22
95
05
75
14
93
11
74
26
01
49
77
68
65
20
10
13
64
54
70
41
86
90
19
02
20
12
66
38
50
13
40
60
72
30
82
92
61
73
42
26
11
52
07
04
01
67
02
49
87
34
44
71
96
77
53
03
71
32
10
78
05
27
60
02
90
19
94
78
75
86
22
91
57
84
75
51
62
08
50
63
65
40
62
33
10
00
37
45
66
82
78
38
69
57
91
59
99
11
67
06
09
14
93
31
75
71
34
04
81
53
84
67
36
03
93
77
15
12
42
55
38
86
55
08
06
74
02
91
41
91
26
54
10
29
30
59
06
44
32
13
76
22
59
39
40
60
76
16
40
00
04
13
96
10
34
56
51
95
17
08
83
98
33
51
78
47
70
92
01
52
33
58
46
45
25
78
29
92
55
27
20
12
82
16
78
21
90
53
74
43
46
18
92
65
20
06
16
63
85
01
37
22
43
49
89
29
30
56
91
48
09
24
42
04
57
83
93
16
47
50
90
08
90
36
62
68
86
16
62
85
52
76
45
23
27
52
58
29
94
15
57
07
49
47
02
02
38
02
48
27
68
15
97
11
40
91
05
56
44
29
16
52
37
95
67
02
45
75
51
55
07
54
60
04
48
05
77
24
67
39
00
74
38
93
74
37
94
50
84
26
97
55
49
96
73
74
51
48
94
43
66
80
59
30
33
31
38
98
32
62
57
52
91
24
92
70
09
29
16
39
11
95
44
3
17
03
30
95
08
89
06
95
04
67
51
53
26
23
20
25
50
22
79
75
96
74
38
30
43
25
63
55
07
54
85
17
90
41
60
91
34
85
09
88
90
55
63
35
63
98
02
64
85
58
34
21
22
26
16
27
23
06
58
36
37
57
04
13
82
23
77
59
582
50
38
17
21
13
24
84
99
86
21
82
55
74
39
77
18
70
58
21
55
81
05
69
82
89
15
87
67
51
46
69
26
37
43
48
14
00
71
19
99
69
90
71
48
01
51
61
61
99
06
65
01
98
73
73
22
39
71
23
31
31
94
50
22
10
54
48
32
00
72
51
91
80
81
82
95
00
41
52
04
99
58
80
28
07
44
64
28
65
17
18
82
33
53
97
75
03
61
23
49
73
28
89
06
82
82
56
69
26
10
37
81
00
94
22
42
06
50
33
69
68
41
36
00
04
00
26
84
94
94
88
46
91
79
21
49
90
72
12
96
68
36
38
61
59
62
90
94
02
25
61
74
09
33
05
39
55
12
96
10
35
45
15
54
63
61
18
62
82
21
38
71
77
62
03
32
85
41
93
47
81
37
70
13
69
65
48
67
90
61
44
12
93
46
27
82
78
94
02
48
33
59
11
43
36
04
13
86
23
75
12
94
19
86
24
22
38
96
18
45
03
03
48
91
03
69
26
24
07
96
42
97
82
28
83
49
36
26
39
88
76
09
43
82
69
38
37
98
79
49
32
24
47
08
72
02
94
44
07
13
24
90
40
78
11
18
70
33
62
28
92
02
94
31
89
48
37
95
02
41
30
35
45
12
15
65
15
41
67
24
85
71
80
54
44
07
30
27
18
43
12
57
86
23
83
18
42
19
80
00
88
37
04
46
05
70
69
36
39
89
48
29
98
29
80
97
57
95
60
49
65
07
04
31
60
37
32
99
07
20
60
12
00
06
13
85
65
47
75
55
54
03
45
53
35
16
90
02
25
97
18
82
83
66
29
72
65
53
91
65
34
92
07
94
80
04
89
96
99
17
99
62
26
24
54
13
80
53
12
79
81
18
31
13
39
61
00
74
32
14
10
54
03
27
28
21
07
09
19
07
35
75
49
47
88
87
33
83
23
17
34
60
91
12
19
49
39
38
81
78
85
66
66
38
94
67
76
30
70
49
72
65
92
95
45
08
85
84
78
17
76
31
44
66
24
73
60
37
67
28
15
19
03
62
08
07
01
72
88
45
96
43
50
22
96
31
78
84
36
07
10
55
90
10
59
83
68
66
22
40
91
73
71
28
75
28
67
18
30
93
55
89
61
08
07
87
97
44
15
14
61
99
14
16
65
12
72
27
27
15
18
95
56
23
48
60
65
21
86
51
19
84
35
84
57
54
30
46
59
22
40
66
70
98
89
79
03
66
26
23
60
43
19
13
28
22
24
57
37
60
45
51
10
93
64
24
73
06
63
22
20
89
11
52
40
01
02
99
75
21
44
10
23
35
58
31
52
38
75
30
72
94
58
53
19
11
94
16
41
75
75
19
98
08
89
66
16
05
41
88
93
36
49
94
72
94
08
96
66
46
13
34
05
86
75
56
56
92
99
57
48
475
26
53
12
25
63
56
48
91
90
88
85
99
83
21
00
68
58
95
98
56
50
75
25
71
38
30
86
98
24
15
11
29
85
48
53
156
42
67
57
69
11
45
12
96
32
33
97
77
94
84
34
76
62
24
55
54
36
47
07
47
17
96
74
16
36
72
80
27
96
97
76
29
27
06
90
35
72
29
23
07
17
30
75
16
66
85
61
85
61
19
60
81
89
93
27
02
24
83
69
40
76
96
67
88
02
22
45
42
02
75
76
33
30
91
33
42
58
94
65
90
86
73
60
68
69
84
23
28
57
12
48
34
14
98
42
35
37
69
95
22
31
89
40
64
36
64
53
88
55
76
45
91
78
94
29
48
52
40
39
91
57
62
60
36
38
38
04
61
66
39
34
58
56
05
38
96
18
06
69
07
20
70
81
74
25
56
01
08
83
43
60
93
27
49
87
32
51
07
58
12
18
31
19
45
39
98
63
84
15
78
01
63
86
01
22
14
03
14
56
78
95
99
24
19
48
99
45
69
73
64
64
14
63
47
13
50
37
16
80
35
60
17
62
59
03
01
76
62
42
63
18
52
59
59
88
41
18
36
30
34
78
43
01
50
45
30
08
03
37
91
96
52
02
00
34
48
11
86
44
72
75
76
16
92
22
64
27
73
61
25
39
32
80
38
83
52
39
78
19
08
46
48
61
88
15
98
64
42
11
08
81
86
91
71
66
96
83
60
17
69
93
30
29
31
01
33
84
40
31
59
53
51
35
37
93
02
49
84
18
79
75
38
51
21
29
95
90
46
20
71
95
60
62
89
73
36
92
50
38
23
08
43
71
30
10
29
32
70
67
13
22
79
98
03
05
87
29
10
86
84
45
478
62
88
61
13
68
29
95
83
00
80
82
43
50
83
03
34
24
88
65
35
46
71
78
39
92
13
13
27
18
24
54
38
08
56
06
31
37
58
13
82
40
44
71
35
33
80
20
92
47
36
97
46
22
20
28
57
79
02
025
88
80
91
32
01
98
03
02
79
72
59
20
82
23
14
81
75
81
39
00
33
81
14
76
20
75
54
44
64
00
87
56
68
71
82
39
95
53
37
41
69
30
88
95
71
66
07
95
64
18
38
95
72
77
11
38
820
74
67
84
96
37
47
62
34
99
27
94
72
38
82
15
32
91
74
62
51
73
42
93
72
34
89
87
62
40
96
64
28
79
07
74
14
01
21
25
94
24
10
07
36
39
23
00
33
14
94
85
54
58
53
80
82
93
97
06
02
16
14
51
04
23
30
22
74
71
78
04
96
69
89
08
99
20
90
84
74
10
20
72
19
05
63
58
82
94
32
05
53
32
35
32
70
49
65
63
77
33
92
59
76
38
15
40
14
58
66
72
84
81
96
16
80
82
96
61
76
52
16
21
47
25
56
92
53
45
50
01
48
76
35
46
60
96
42
29
15
83
55
45
45
15
34
54
73
94
95
32
14
80
23
70
47
59
68
08
48
90
23
57
15
35
20
01
19
19
52
90
52
26
79
50
18
26
63
93
49
94
42
09
18
71
47
75
09
38
74
76
98
92
18
80
97
94
86
67
44
76
45
77
60
30
89
25
03
81
33
14
94
82
05
67
63
66
74
04
18
70
54
19
82
88
99
43
56
14
13
53
56
80
98
72
49
39
54
32
55
47
96
48
11
12
82
11
54
44
80
89
07
84
90
16
30
67
13
92
63
14
09
56
08
57
93
71
29
99
55
74
93
25
07
42
21
98
26
08
77
54
11
27
95
21
24
99
56
81
62
60
89
39
35
79
30
60
4
09
09
36
06
44
97
77
98
31
93
07
54
41
30
Index
A
Addition rule of probability 75
Age and sex composition 211
Age pyramid 211
Age specific fertility rate 224
Alternative hypothesis 100
Analysis of variance table 140
Analytical studies 175
Application of ‘t’ distribution 125
Arithmetic mean 16
Association 62
Assumption for Student’s ‘t’ test 125
Attributable risk 182
Attributes 2
B
Bar chart 5
Base line 164
Basic population data 256
Binomial distribution 48
Blinding (Masking) 164
C
Case control study 176
Case definition 164
Case report 174
Case series 174
Census 2001 250
Chi square distribution 114
Classical probability 75
Cluster sampling 86
Coefficient of dispersion 35
Coefficient of variation 35
Cohort 165
Cohort study 175
Comparative statistics of different indicators 279
Comparison of several proportions (2 × k contingency table) 118
Comparison of two proportions by Chi square 118
Concept of population policy 289
Conditional probability 78
Confidence limits 107
Confounding bias 179
Contingency table (2 × 2 table) 121
Continuous variable 2
Correlation 62
Country health profile 261
Critical region 100, 103
Critical value 103
Cross-sectional studies 175
Crude birth rate 224, 277
Crude death rate 214, 278
Cumulative frequency curve 7
D
Decile 33
Degree of freedom 115
Demographic cycle 210
Denominator 167
Density 252
Density of population 213
Dependency ratio 212
Descriptive studies 173
Design of experiments 92
Diagnostic accuracy 191
Direct standardization 219
Discrete variable 2
Dispersion 32
E
Ecological bias 179
Equally likely events 74
Exact sampling distribution 114
Exhaustive events 74
Experimental studies 176
Experimental unit 165
Exposure rates 183
F
Failure 106
Family size 213
Fertility trends 251
First quartile 32
Fourfold classification 118
Frequency curve 10
Frequency distribution table 4
Frequency polygon 10
F-statistic 134
F-test for equality of population variance 135
F-test for equality of several means 135
G
General contingency table (r × s) 120
General fertility rate 224
Geometric mean 24
Goals of national population policy 295
Goodness of fit 117
Gross reproductive rate 225
Growth rate 230, 252
H
Harmonic mean 25
Histogram 10
History of census 248
Hospital records 243
I
Impossible event 75
Incidence rate (person) 168
Incidence rate (spell) 169
Incidence rates 180
Independence of attributes 118
Independent events 74
Indirect standardization 221
Infant mortality rate 215, 278
Issue of the adolescents 255
K
Key population statistics of India 1901-2001 292
Kurtosis 41
L
Landmarks in the evolution of India’s national population policy 299
Level of significance 101
Life expectancy 213
Life table 227
Likelihood ratio 193
Line diagram 9
Literacy 252
Literacy rate in India 271
Local control 94
Longitudinal studies 174
M
Manifold classification 118
Mann-Whitney U test 156
Maternal mortality rate 223
Mean deviation 34
Measurement bias 179
Measurement of morbidity 168
Measurement of mortality 168
Median 17
Median test 154
Mid year population 167
Mode 20
Mode of F-distribution 134
Mortality indicators for all India, 1971-1998 293
Mortality trends 291
Multiplication rule of probability 77
Multistage sampling 89
Mutually exclusive events 74
N
Negative predictive value 187
Neonatal mortality rate 215
Net reproductive rate 226
Nominal 2
Non parametric tests 152
Normal distribution 50
Null hypothesis 100
Numerator 167
O
Observational studies 173
Odds ratio 184
One tailed test 102
One way analysis of variance 135
Ordinal 2
P
Paired ‘t’ test 127
Parameter 89
Percentile 33
Perinatal mortality rate 216
Period prevalence 170
Pictogram 6
Pie chart 6
Placebo 164
Point prevalence 170
Poisson distribution 49
Population 84
Population at risk 167
Population census 240
Positive predictive value 187
Postnatal mortality rate 215
Power of test 102
Prevalence 169, 191
Primary data 2
Proportion 167
Proportional mortality rate 217
Prospective study 165
Provisional population totals: India - part I 258
Provisional population totals: India 255
Q
Quartile deviation 32
R
Random sampling 84
Random series 74
Randomization 93
Randomized controlled laboratory study 178
Randomized controlled clinical trials 177
Randomized cross-over clinical trials 177
Range 32
Rate 166
Ratio 166
Readers bias 180
Region of acceptance 103
Region of rejection 103
Registration of Births and Deaths Act, 1969 242
Registration of vital events 241
Regression 64
Regression coefficient 64
Relative risk 181
Replication 93
Retrospective study 165
Role of targets 294
Root mean square deviation 34
Run test 153
Rural-urban distribution of population 267
S
Sample 84
Sample registration system 242
Sample size 84
Sample surveys 243
Sampling bias 180
Sampling distribution 89
Sampling of attribute 106
Scattered diagram 11
Screening bias 179
Second quartile 32
Secondary data 2
Sensitivity 186
Sex ratio 212
Sign test 155
Significant value 103
Skewness 40
Skewness of F-distribution 134
Sources of health information 240
Specificity 187
Stable population 212
Standard deviation 34
Standard error 89
Standard normal variate 52
Standardized death rate 218
State wise distribution of households 273
Stationary population 212
Statistic 89
Statistical hypothesis 100
Statistical methods in epidemiology 163
Status of children 254
Status of women’s health 253
Still birth rate 217
Stratified sampling 85
Success 106
Summary of census 2001 283
Sure event 75
Systematic error 178
Systematic sampling 85
T
t-test for difference of mean 126
t-test for significance of correlation coefficient 128
t-test for single mean 126
Tables 3
Test of significance for difference of mean 111
Test of significance for difference of proportion 107
Test of significance for large samples 105
Test of significance for single mean 111
Test for single proportion 106
Test of significance 102
Third quartile 32
Total fertility rate 225
Trials and events 74
Two tailed test 102
Type-I error 101
Type-II error 101
V
Variable 2
Vital rates per 1000 population, India 1901-1990 293