Medical Statistics And Demography Made Easy®

Medical Statistics and Demography Made Easy®

Devashish Sharma
MSc (Gold Medalist), PhD (Statistics)
Professor, Statistics and Demography
MLN Medical College
Allahabad Central University
Allahabad, India

JAYPEE BROTHERS MEDICAL PUBLISHERS (P) LTD
New Delhi • Ahmedabad • Bengaluru • Chennai • Hyderabad • Kochi • Kolkata • Lucknow • Mumbai • Nagpur

Published by
Jitendar P Vij
Jaypee Brothers Medical Publishers (P) Ltd

Corporate Office
4838/24 Ansari Road, Daryaganj, New Delhi - 110002, India, Phone: +91-11-43574357

Registered Office
B-3 EMCA House, 23/23B Ansari Road, Daryaganj, New Delhi - 110 002, India
Phones: +91-11-23272143, +91-11-23272703, +91-11-23282021, +91-11-23245672, Rel: +91-11-32558559, Fax: +91-11-23276490, +91-11-23245683
e-mail: [email protected], Visit our website: www.jaypeebrothers.com

Branches
❑ 2/B, Akruti Society, Jodhpur Gam Road Satellite, Ahmedabad 380 015, Phones: +91-79-26926233, Rel: +91-79-32988717, Fax: +91-79-26927094, e-mail: [email protected]
❑ 202 Batavia Chambers, 8 Kumara Krupa Road, Kumara Park East, Bengaluru 560 001, Phones: +91-80-22285971, +91-80-22382956, 91-80-22372664, Rel: +91-80-32714073, Fax: +91-80-22281761, e-mail: [email protected]
❑ 282 IIIrd Floor, Khaleel Shirazi Estate, Fountain Plaza, Pantheon Road, Chennai 600 008, Phones: +91-44-28193265, +91-44-28194897, Rel: +91-44-32972089, Fax: +91-44-28193231, e-mail: [email protected]
❑ 4-2-1067/1-3, 1st Floor, Balaji Building, Ramkote Cross Road, Hyderabad 500 095, Phones: +91-40-66610020, +91-40-24758498, Rel: +91-40-32940929, Fax: +91-40-24758499, e-mail: [email protected]
❑ No. 41/3098, B & B1, Kuruvi Building, St. Vincent Road, Kochi 682 018, Kerala, Phones: +91-484-4036109, +91-484-2395739, +91-484-2395740, e-mail: [email protected]
❑ 1-A Indian Mirror Street, Wellington Square, Kolkata 700 013, Phones: +91-33-22651926, +91-33-22276404, +91-33-22276415, Rel: +91-33-32901926, Fax: +91-33-22656075, e-mail: [email protected]
❑ Lekhraj Market III, B-2, Sector-4, Faizabad Road, Indira Nagar, Lucknow 226 016, Phones: +91-522-3040553, +91-522-3040554, e-mail: [email protected]
❑ 106 Amit Industrial Estate, 61 Dr SS Rao Road, Near MGM Hospital, Parel, Mumbai 400 012, Phones: +91-22-24124863, +91-22-24104532, Rel: +91-22-32926896, Fax: +91-22-24160828, e-mail: [email protected]
❑ "KAMALPUSHPA" 38, Reshimbag, Opp. Mohota Science College, Umred Road, Nagpur 440 009 (MS), Phone: Rel: +91-712-3245220, Fax: +91-712-2704275, e-mail: [email protected]

USA Office
1745, Pheasant Run Drive, Maryland Heights (Missouri), MO 63043, USA, Ph: 001-636-6279734, e-mail: [email protected], [email protected]

Medical Statistics and Demography Made Easy
© 2008, Devashish Sharma

All rights reserved. No part of this publication and CD ROM should be reproduced, stored in a retrieval system, or transmitted in any form or by any means: electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the author and the publisher.

This book has been published in good faith that the material provided by author is original. Every effort is made to ensure accuracy of material, but the publisher, printer and author will not be held responsible for any inadvertent error(s). In case of any dispute, all legal matters are to be settled under Delhi jurisdiction only.
First Edition: 2008
ISBN 978-81-8448-353-6
Typeset at JPBMP typesetting unit
Printed at Ajanta Offset & Packagins Ltd., New Delhi

This book is dedicated to

My Parents
Late Dr BK Sharma and Mrs Kusum Sharma
for being the constant source of enlightenment in the path of my mundane life

My Teacher
Professor MK Singh
for moulding my inner-self and outer appearance to make me what I am

Preface

There are many books on general applied statistics, assuming various levels of mathematical knowledge, but no book is available that is specially designed for medical students at the undergraduate level. The main feature of this book is that it will help medical students at undergraduate and postgraduate levels, as well as students who are preparing for various PGME examinations.

The present book, which is explicitly directed towards medical applications, has two special aspects. First, the examples are almost entirely related to medical problems, which I think will help research workers and students to understand the underlying computational points. Second, the choice of statistical topics reflects the extent of their usage in medical research. Several topics, such as vital statistics, statistical methods in epidemiology and health information, would not normally be included in a general book on applied statistics.

This book is intended to be useful both to medical research workers with very little mathematical expertise and to students who are preparing for various PGME examinations. The emphasis throughout is on the general concepts underlying statistical techniques. Proofs are regarded as of secondary importance and are usually omitted. There are many mathematical formulae, but these are necessary for the computations and for showing the relationship between various methods. They rarely involve more than very simple algebraic manipulations. Some computational steps, such as those involved in probability and significance tests, are perhaps more difficult.
I have given some solved examples, clearly mentioning every step involved in the computation. Nearly 50 unsolved questions, mainly related to medical problems, are included, which will help undergraduate students in their professional examinations. For students preparing for PGME examinations, nearly 300 MCQs related to various topics are included in this book. These include questions asked in various competitive examinations as well as questions which I thought are important for such tests. Going through these questions will help them to solve problems related to Statistics and Demography in their competitive examinations.

I owe thanks to my colleagues, especially in the Departments of Obstetrics and Gynaecology and of Community Medicine. Special thanks to my wife Mrs. Anita Sharma, and my son Dr. Pulak Sharma, who helped me a lot by suggesting that I frame this work according to the problems which he and his friends are facing. I express my deep sense of gratitude to my publisher Jaypee Brothers Medical Publishers (P) Ltd for their untiring efforts in bringing out this book in such an elegant form.

Suggestions and criticism for further improvement of this book, as well as reports of errors and misprints, will be most gratefully received and duly acknowledged.

Devashish Sharma

Contents

1. Classification and Tabulation
2. Measure of Central Tendency
3. Measure of Dispersion
4. Theoretical Discrete and Continuous Distribution
5. Correlation and Regression
6. Probability
7. Sampling and Design of Experiments
8. Testing of Hypothesis
9.
Non-parametric Tests
10. Statistical Methods in Epidemiology
11. Vital Statistics (Demography)
12. Health Information
13. A Report on Census 2001
14. National Population Policy
Unsolved Questions
Answers of MCQs and Unsolved Questions
Appendix: Statistical Tables
Index

Chapter 1
Classification and Tabulation

There are two types of data: (1) primary data and (2) secondary data. Primary data are data originated by the investigator. Secondary data are data which the investigator does not originate but obtains from someone else's records.

Both primary and secondary data are broadly divided into two categories:
1. Attributes (Qualitative data).
2. Variables (Quantitative data).

Attributes: are qualitative characteristics which cannot be described numerically; the data are obtained by classifying the presence or absence of the attribute, e.g. sex, nationality, colour of eyes, socioeconomic status. They can be further divided into two groups: (a) Nominal (b) Ordinal.

(a) Nominal: A quality that can be easily differentiated by means of some natural or physical line of demarcation, e.g. some physical characteristic such as colour of eyes, sex, physical status of a person, etc.

(b) Ordinal: An ordered set is known as ordinal, i.e. the data are classified according to some criterion which can be given an order, such as socioeconomic status.

Variable: are quantitative characteristics which can be numerically described.
Variables may be discrete or continuous.

Discrete variables: can take only exact (whole-number) values, e.g. number of family members, number of living children, etc.

Continuous variables: a variable that can take any numerical value within a certain range is called a continuous variable, e.g. height in cm, weight in kg, etc.

REPRESENTATION OF DATA

Data may be represented either by means of graphs or diagrams or by means of tables.

Tables

Tables are of two types: (1) Simple or complex tables, depending on the number of measurements of single or multiple sets of items, (2) Frequency distribution tables.

There are certain general principles which should be followed while presenting data in tabulated form:
1. A table should be numbered.
2. A title should be given; the title should be brief and self-explanatory.
3. Headings of columns and rows should be clear.
4. Data must be presented according to size and importance.
5. If percentages or averages are to be compared, they should be placed as close together as possible.
6. Footnotes may be given where necessary.

Simple Table

Table 1.1: Number of patients attending hospital in winter season*

Months        Male               Female
              No.      %         No.      %
November      250    25.00       150    30.00
December      350    35.00       100    20.00
January       100    10.00        70    14.00
February      400    40.00       180    36.00

*Source: Hospital outdoor attendance

Frequency Distribution Table

In a frequency distribution table, the data are first split up into convenient groups (class intervals) and the number of items (frequency) which occur in each group is shown in the adjacent column.

Following are the ages of 23 cases admitted to a hospital:
20, 35, 46, 10, 5, 25, 48, 33, 37, 41, 26, 29, 15, 6, 29, 56, 69, 66, 64, 25, 26, 56, 42.
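A minimal sketch of how these 23 ages can be tallied into class intervals, assuming equal classes of width 10 that include their lower limit and exclude their upper limit (so age 10 is counted in 10-20). The function name `frequency_table` is illustrative only and does not come from the book.

```python
# Ages of the 23 admitted cases, copied from the text.
ages = [20, 35, 46, 10, 5, 25, 48, 33, 37, 41, 26, 29,
        15, 6, 29, 56, 69, 66, 64, 25, 26, 56, 42]

WIDTH = 10                            # assumed equal class width
lower_limits = range(0, 70, WIDTH)    # classes 0-10, 10-20, ..., 60-70


def frequency_table(values, lower_limits, width):
    """Count how many values fall in each class [lower, lower + width)."""
    table = []
    for lower in lower_limits:
        count = sum(1 for v in values if lower <= v < lower + width)
        table.append((lower, lower + width, count))
    return table


table = frequency_table(ages, lower_limits, WIDTH)
total = sum(count for _, _, count in table)

# Print each class with its frequency and its percentage of all cases.
for lower, upper, count in table:
    print(f"{lower:2d} - {upper:2d}  {count:2d}  {100 * count / total:6.2f}%")
```

The percentage column is simply each class frequency divided by the total number of cases, times 100.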
Age group     Tally marks     Frequencies
0 - 10        ||              2
10 - 20       ||              2
20 - 30       |||| ||         7
30 - 40       |||             3
40 - 50       ||||            4
50 - 60       ||              2
60 - 70       |||             3

Table 1.2: Age distribution of admitted cases

Age group      Cases admitted
(in years)     No.      %
0 - 10          2      8.70
10 - 20         2      8.70
20 - 30         7     30.43
30 - 40         3     13.04
40 - 50         4     17.39
50 - 60         2      8.70
60 - 70         3     13.04
Total          23    100

In constructing a frequency distribution table, the question that arises is: into how many groups should the data be split? As a rule of thumb, when there is a large amount of data a maximum of 20 groups, and when there is not much data a minimum of 5 groups, can conveniently be taken. As far as possible the class intervals should be equal.

GRAPHS OR DIAGRAMS

Bar chart: This is a simple way of representing data. In a bar diagram the length of the bar is proportional to the magnitude to be represented. Bar charts are of three types: (a) Simple bar chart, (b) Multiple bar chart, (c) Component bar chart.

Figure 1.1: (a) Simple bar diagram, (b) Multiple bar diagram, (c) Component bar diagram

Pie chart: In a pie chart the area of a segment of the circle represents frequency. The total frequency corresponds to 360°. The area of each segment depends upon the angle corresponding to the frequency of each group. A pie diagram is particularly useful when the data are expressed in percentages. In such cases 1% is equal to 3.6°.

Figure 1.2

Pictogram: Small pictures or symbols are used to present data.

Figure 1.3

Cumulative Frequency Curve or Ogive: Cumulative frequencies are obtained by successively adding the frequencies of the values of the variable.
The cumulative frequency table is obtained as follows:

Age in years     Frequencies     Cumulative frequency
20                5              5
21                3              5 + 3 = 8
23                7              8 + 7 = 15
35               10              15 + 10 = 25
36                3              25 + 3 = 28
45                5              28 + 5 = 33
67                8              33 + 8 = 41
Total            41

Less than Cumulative Frequency Curve: The less than cumulative frequency table is expressed as:

Age in years     Frequencies     Cumulative frequency
20                5              Less than or equal to 20 = 5
21                3              Less than or equal to 21 = 8
23                7              Less than or equal to 23 = 15
35               10              Less than or equal to 35 = 25
36                3              Less than or equal to 36 = 28
45                5              Less than or equal to 45 = 33
67                8              Less than or equal to 67 = 41
Total            41

Figure 1.4

More than Cumulative Frequency Curve: The more than cumulative frequency table is expressed as:

Age in years     Frequencies     Cumulative frequency
20                5              More than or equal to 20 = 41
21                3              More than or equal to 21 = 36
23                7              More than or equal to 23 = 33
35               10              More than or equal to 35 = 26
36                3              More than or equal to 36 = 16
45                5              More than or equal to 45 = 13
67                8              More than or equal to 67 = 8
Total            41

Figure 1.5

Line Diagram: Line diagrams are used to show a trend with the passage of time. Time is the independent variable, represented on the X-axis, and the dependent variable is represented on the Y-axis. It is essential to show the zero point on the Y-axis.

Figure 1.6

Histogram: A histogram is used to represent a continuous frequency distribution. It is essentially an area chart in which the area of each bar represents the frequency associated with the corresponding interval. It is not essential to show the zero point on the X-axis (horizontal axis), but it is necessary to show it on the vertical axis.

Figure 1.7

Frequency Polygon: It is obtained by joining the upper mid points of the histogram blocks by straight lines.

Frequency Curve: It is obtained by joining the upper mid points of the histogram blocks by a smooth line.
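The "less than" and "more than" cumulative frequencies used for the ogives in this chapter can be reproduced with a short sketch. It assumes the seven ages and frequencies of the cumulative frequency tables (total 41); the variable names are illustrative.

```python
from itertools import accumulate

# Ages and frequencies from the cumulative frequency tables (total 41).
ages = [20, 21, 23, 35, 36, 45, 67]
freqs = [5, 3, 7, 10, 3, 5, 8]

# "Less than or equal to": running total from the smallest value upwards.
less_than = list(accumulate(freqs))

# "More than or equal to": running total from the largest value downwards,
# then reversed so that it lines up with the ages again.
more_than = list(accumulate(reversed(freqs)))[::-1]

for age, f, lt, mt in zip(ages, freqs, less_than, more_than):
    print(f"{age:3d}  f={f:2d}  <= {lt:2d}  >= {mt:2d}")
```

Plotting `less_than` and `more_than` against `ages` gives the two ogives; the last "less than" value and the first "more than" value both equal the total frequency.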
Figures 1.8A and B

Scatter Diagram: A scatter diagram is used to represent two variables simultaneously. Each point represents one individual.

Figure 1.9

Comparison between Bar diagram and Histogram:
1. A bar diagram is used to represent frequencies of qualitative variables and discrete variables, while a histogram is used to represent frequencies of a continuous variable.
2. In a bar diagram the length of the bar represents frequency, while in a histogram the area of the bar represents frequency.

MULTIPLE CHOICE QUESTIONS

1. Scatter diagram shows:
   (a) Trend of an event with the passage of time (b) Frequency distribution of a continuous variable (c) The relation between maximum and minimum values (d) Relation between two variables (AI, 90)
2. Sex composition can be demonstrated in which of the following:
   (a) Age pyramid (b) Pie chart (c) Component bar chart (d) Multiple bar chart (JIPMER, 91)
3. Quantitative data can be best represented by:
   (a) Pie chart (b) Pictogram (c) Histogram (d) Bar diagram (PGI, 80; AMC, 83, 87)
4. Percentage of data can be shown in:
   (a) Graph presentation (b) Pie chart (c) Bar diagram (d) Histogram (PGI, 79; Delhi, 87)
5. Graph showing relation between 2 variables is a:
   (a) Scatter diagram (b) Frequency polygon (c) Picture chart (d) Histogram (AI, 96)
6. Weight in kg is a:
   (a) Discrete variable (b) Continuous variable (c) Nominal scale (d) None of the above (AI, 96)
7. All are examples of nominal scale except:
   (a) Age (b) Sex (c) Body weight (d) Socioeconomic status (AI, 96)
8. The average birth weights in a hospital are to be demonstrated by statistical representation. This is best done by:
   (a) Bar chart (b) Histogram (c) Pie chart (d) Frequency polygon (AIIMS 95)
9. All are included in the nominal scale except:
   (a) Colour of eye (b) Sex (c) Socioeconomic status (d) Occupation (MP, 98)
10.
Age and sex distribution is best represented by:
   (a) Histogram (b) Pie chart (c) Bar diagram (d) Age pyramid (DNB, 2001)
11. Continuous quantitative variables are expressed by:
   (a) Bar chart (b) Histogram (c) Frequency polygon (d) Ogive (e) Pie chart (PGI, 2002)
12. Cumulative frequencies are represented by:
   (a) Histogram (b) Line diagram (c) Pictogram (d) Ogive
13. In which type of graphical representation are frequencies represented by the area of a rectangle:
   (a) Bar diagram (b) Component bar diagram (c) Age pyramid (d) Histogram
14. Two variables can be plotted together by:
   (a) Pie chart (b) Histogram (c) Frequency polygon (d) Scatter diagram (AI, 95)
15. Which of the following statements is false:
   (a) Primary data is originated by the investigator (b) Primary data originated by an investigator may be used as secondary data by other investigator (c) Data obtained from records of Hospitals are secondary data (d) None of the above statements are true
16. Best way to study relationship between two variables is:
   (a) Scatter diagram (b) Histogram (c) Bar chart (d) Pie chart (AI, 92)
17. All are examples of nominal scale except:
   (a) Race (b) Sex (c) Iris colour (d) Socioeconomic status (AI, 96)
18. Low birth weight statistics of a hospital is best shown by:
   (a) Bar charts (b) Histogram (c) Pictogram (d) Frequency polygon (AIIMS, Dec 95)
19. Categorical values are:
   (a) Age (b) Weight (c) Gender (Manipal, 2002)
20. If the grading of diabetes is classified as "mild", "moderate" and "severe" the scale of measurement used is:
   (a) Interval (b) Nominal (c) Ordinal (d) Ratio
21. The best method to show the association between height and weight of children in a class is by:
   (a) Bar chart (b) Line diagram (c) Scatter diagram (d) Histogram (AI, 2002)
22.
Mean and standard deviation can be worked out only if data is on:
   (a) Interval/Ratio scale (b) Dichotomous scale (c) Nominal scale (d) Ordinal scale (AIIMS, 2005)

Chapter 2
Measure of Central Tendency

Measures of central tendency are statistical constants which give us an idea about the concentration of values in the central part of the distribution. The following are five measures of central tendency:
1. Arithmetic Mean or simply Mean.
2. Median.
3. Mode.
4. Geometric Mean.
5. Harmonic Mean.

Arithmetic Mean: The A.M. of a set of observations is their sum divided by the number of observations. The arithmetic mean $\bar{X}$ of n observations $X_1, X_2, \ldots, X_n$ is

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

In case of a frequency distribution where the variable and frequencies are:

Variable        x1    x2    x3    x4    ...    xn
Frequencies     f1    f2    f3    f4    ...    fn

the arithmetic mean is

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{n} f_i x_i$$

where i = 1, 2, 3, 4, ..., n and $N = \sum f_i$.

Short Cut Method: Let $u_i = x_i - A$, where A is any arbitrary constant. Then

$$\bar{X} = A + \bar{U}, \qquad \bar{U} = \frac{1}{N}\sum_{i=1}^{n} f_i u_i$$

In case of continuous variables forming a grouped frequency distribution, the $x_i$ are taken as the mid values of the class intervals, i.e. $x_i$ = (Lower Limit + Upper Limit)/2, and the mean is then calculated as above. In the short cut method we generate a variable $u_i = (x_i - A)/h$, where h is the length of the class interval or class width, and the mean of the variable x will be

$$\bar{X} = A + h\,\bar{U}$$

Properties of arithmetic mean:
1. The sum of deviations of a set of values from their arithmetic mean is zero.
2. The sum of squares of deviations of a set of values is minimum when taken about the mean.

Merits and Demerits of Arithmetic Mean

Merits
1. It is based on all observations.
2. Of all averages, the arithmetic mean is affected least by fluctuations of sampling, i.e. the arithmetic mean is a stable average.
3. If $\bar{X}_1$ is the mean of $n_1$ observations and $\bar{X}_2$ is the mean of $n_2$ observations, then the combined mean of the two series is

$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$$

Demerits
1. AM cannot be used if we are dealing with qualitative data.
2.
AM cannot be obtained if a single observation is missing.
3. AM is affected very much by extreme values.
4. AM cannot be calculated if an extreme class is open, e.g. below 10 or above 90.
5. In an extremely asymmetrical (skewed) distribution, AM is usually not a suitable measure of location.

Median: The median of a distribution is the value of the variable which divides it into two equal parts. If there are n observations, arrange the values in either ascending or descending order. If n is odd, then the $\frac{n+1}{2}$th value is the median, and if n is even, then the median is the average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th observations. For example, if there are 9 (i.e. odd) values, arrange these values in either ascending or descending order; the median is the $\frac{9+1}{2}$th, i.e. the 5th value. If the number of observations is even, e.g. 10, then the median lies between the 5th and 6th values.

In case of a discrete frequency distribution the median is calculated by forming a cumulative frequency table. The steps for calculating the median are:
(i) Find $N/2$, where $N = \sum f_i$.
(ii) See the cumulative frequency just greater than $N/2$.
(iii) The value of x corresponding to the cumulative frequency just greater than $N/2$ is the median.

In case of a continuous frequency distribution the class corresponding to the cumulative frequency just greater than $N/2$ (or, in rare cases where the C.F. is exactly equal to $N/2$, equal to it) is called the median class, and the value of the median is obtained by the following formula:

$$\text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right)$$

where l is the lower limit of the median class, h is the class width, $N = \sum f_i$, C is the cumulative frequency preceding the median class and f is the frequency of the median class.

The median can also be obtained from the less than and greater than cumulative frequency curves (ogives). The intersection of the less than and greater than cumulative frequency curves is the median.

Figure 2.1

Merits and Demerits of Median

Merits
1. It is not at all affected by extreme values.
2. It can be calculated for a distribution with open end classes.
3.
Median is the only average to be used while dealing with qualitative data which cannot be measured quantitatively but can still be arranged in ascending or descending order.

Demerits
1. In case of an even number of observations the median cannot be determined exactly.
2. It is not based on all observations.

Mode: The mode is the value which occurs most frequently in a set of observations. In the following set of 10 observations, "5, 20, 16, 10, 20, 5, 16, 16, 18, 14", the value 16 occurs most frequently (three times); therefore 16 is the mode of the set of observations.

In case of a discrete frequency distribution, the mode is the value of x corresponding to the maximum frequency. The mode is determined by the method of grouping if:
(i) The maximum frequency is repeated.
(ii) The maximum frequency occurs at the very beginning or at the end of the distribution.

In case of a continuous distribution the mode can be determined by the following formula:

$$\text{Mode} = l + \frac{h\,(f_1 - f_0)}{2f_1 - f_0 - f_2}$$

where $f_1$ is the maximum frequency, the group corresponding to the maximum frequency is called the modal group, l is the lower limit of the modal group, h is the class width, and $f_0$ and $f_2$ are the frequencies of the groups preceding and following the modal group.

The mode can also be obtained from a histogram:

Figure 2.2

Merits and Demerits of Mode

Merits
1. Mode is not affected by extreme values.

Demerits
1. Mode is ill-defined. It is not always possible to find a clearly defined mode. A distribution that has two modes is called bimodal.
2. It is not based on all observations.
3. As compared to the mean, the mode is affected to a great extent by fluctuations of sampling.
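The grouped-data formulas for the median, Median = l + (h/f)(N/2 - C), and for the mode, Mode = l + h(f1 - f0)/(2 f1 - f0 - f2), can be sketched as below. This is an illustration under stated assumptions, not the book's own procedure: classes are assumed to have equal width, the function names `grouped_median` and `grouped_mode` are hypothetical, and the sample frequencies are those of the book's Table 2.2 (classes 10-20 to 70-80).

```python
from itertools import accumulate


def grouped_median(lower_limits, width, freqs):
    """Median = l + (h / f) * (N/2 - C) for equal-width classes."""
    n = sum(freqs)
    cum = list(accumulate(freqs))
    # Median class: first class whose cumulative frequency reaches N/2
    # ("just greater than, or in rare cases equal to" N/2).
    k = next(i for i, c in enumerate(cum) if c >= n / 2)
    c_prev = cum[k - 1] if k > 0 else 0   # C: cumulative frequency before it
    return lower_limits[k] + width * (n / 2 - c_prev) / freqs[k]


def grouped_mode(lower_limits, width, freqs):
    """Mode = l + h * (f1 - f0) / (2*f1 - f0 - f2) for equal-width classes.

    Assumes a single, clearly defined modal class; if the maximum frequency
    is repeated the text recommends the method of grouping instead.
    """
    k = freqs.index(max(freqs))            # modal class (first maximum)
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0      # frequency before the modal class
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0   # frequency after it
    return lower_limits[k] + width * (f1 - f0) / (2 * f1 - f0 - f2)


lower_limits = [10, 20, 30, 40, 50, 60, 70]
freqs = [5, 3, 7, 10, 12, 7, 6]

median = grouped_median(lower_limits, 10, freqs)
mode = grouped_mode(lower_limits, 10, freqs)
print(f"median = {median:.2f}, mode = {mode:.2f}")
```

For these frequencies the median class is 40-50 and the modal class is 50-60, matching the classes identified in the worked example.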
Relationship between Mean, Median and Mode: If a distribution is moderately asymmetrical then

Mode = 3 Median - 2 Mean

EXAMPLE FOR CALCULATING MEAN, MEDIAN AND MODE

In case of discrete distribution

Table 2.1

Variable (xi)   Frequency (fi)   Cumulative frequency   ui = xi - A (A = 47)   ui.fi
25               5                5                     -22                    -110
28               7               12                     -19                    -133
34              10               22                     -13                    -130
47              12               34                       0                       0
52               6               40                       5                      30
55               4               44                       8                      32
60               6               50                      13                      78
Total           50                                                             -233

$N = \sum f_i = 50$

Mean

Mean = [(25×5) + (28×7) + (34×10) + (47×12) + (52×6) + (55×4) + (60×6)]/50 = 2117/50 = 42.34

Short Cut Method

Let $u_i = x_i - A$, where A = 47. Then $\bar{U} = \frac{\sum f_i u_i}{N} = \frac{-233}{50} = -4.66$ and

Mean = $\bar{X} = A + \bar{U} = 47 - 4.66 = 42.34$

Median

In this example the total frequency N = 50, therefore N/2 = 25. The cumulative frequency just greater than 25 is 34. The value of xi corresponding to 34 is 47. Therefore the median of this set of data is 47.

Mode

The maximum frequency in the above table is 12. The value of xi corresponding to the maximum frequency is also 47. The mode of this set of data is 47.

In case of continuous frequency distribution:

Table 2.2

Groups    fi    Cumu. freq.   xi = (U+L)/2   xi.fi   ui = (xi - A)/h   ui.fi
10-20      5     5            15               75    -3                -15
20-30      3     8            25               75    -2                 -6
30-40      7    15            35              245    -1                 -7
40-50     10    25            45              450     0                  0
50-60     12    37            55              660     1                 12
60-70      7    44            65              455     2                 14
70-80      6    50            75              450     3                 18
Total     50                                 2410                       16

A = 45, h = 10, N = 50, U = upper limit of class interval, L = lower limit of class interval

Mean

Mean = $\frac{\sum f_i x_i}{N} = \frac{2410}{50} = 48.2$

Short Cut Method:

Mean of ui is $\bar{U} = \frac{\sum f_i u_i}{N} = \frac{16}{50} = 0.32$

Mean of xi is $\bar{X} = A + h\,\bar{U} = 45 + 10 \times 0.32 = 45 + 3.2 = 48.2$

Median

In this example N = 50, therefore N/2 = 25. The cumulative frequency 25 lies in the group 40-50 (this is a rare case where the C.F. of a group is equal to N/2); therefore 40-50 is the median group. The lower limit of the median group is 40, i.e. l = 40; the frequency of the median group is 10, i.e. f = 10; the cumulative frequency preceding the median group is 15, i.e. C = 15; and the class width is 10, i.e. h = 10.
Then the median is calculated by the formula

$$\text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right) = 40 + \frac{10}{10}\,(25 - 15) = 50.0$$

Therefore, the median of this set of data is 50.0.

Mode

The maximum frequency in the above table is 12, therefore the modal group is 50-60. The formula for calculating the mode in a grouped frequency distribution is:

$$\text{Mode} = l + \frac{h\,(f_1 - f_0)}{2f_1 - f_0 - f_2}$$

In this example the lower limit of the modal group is l = 50, the frequency of the modal group is f1 = 12, the width of the class interval is h = 10, and the frequencies preceding and following the modal group are 10 and 7 respectively, i.e. f0 = 10 and f2 = 7. Then the mode is calculated as

$$\text{Mode} = 50 + \frac{10\,(12 - 10)}{24 - 10 - 7} = 50 + \frac{20}{7} = 50 + 2.86 = 52.86$$

Thus the mode of the data represented in Table 2.2 is 52.86.

Geometric Mean: The geometric mean G of n observations $x_i$, i = 1, 2, ..., n, is the nth root of their product.

$$G = (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n}$$

Properties of geometric mean:
1. If any observation is zero, the geometric mean becomes zero.
2. If any observation is negative, the geometric mean may become imaginary, regardless of the magnitude of the other observations.
3. The geometric mean is used to find out the rate of population growth.

Harmonic Mean: The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the observations.

$$HM = \frac{1}{\frac{1}{N}\sum \frac{1}{x_i}}, \quad \text{where } i = 1, 2, 3, \ldots, n$$

Relationship between Arithmetic, Geometric and Harmonic Mean: For positive observations, HM ≤ GM ≤ AM, with equality only when all observations are equal. For two observations, GM² = AM × HM.

MULTIPLE CHOICE QUESTIONS

1. What is the mode in statistics:
   (a) Value of middle observation (b) Arithmetic average (c) Most commonly occurring value (d) Difference between the highest and lowest value (AI, 88; AIIMS, 86)
2. The frequently occurring value in a data is:
   (a) Median (b) Mode (c) Standard deviation (d) Mean (TN, 91)
3. Mean incubation period of leprosy is calculated by:
   (a) Median (b) Harmonic mean (c) Mode (d) Geometric mean (PGI, 81, AMC, 86, 87)
4.
Calculate the mode of 70, 71, 72, 70, 70:
   (a) 70 (b) 71 (c) 71.5 (d) 72 (PGI 79, AMC 85, 88)
5. Arranging the values in a serial order is required to determine:
   (a) Mean (b) Mode (c) Median (d) Range (AIIMS, 94)
6. Determination of which statistical parameter requires quantities to be arranged in ascending or descending order:
   (a) Mean (b) Median (c) Mode (d) SD (AIIMS, Dec 95)
7. 10 babies were born in a hospital, 5 were less than 2.5 kg and 5 were greater than 2.5 kg, the average is:
   (a) Arithmetic mean (b) Geometric mean (c) Median (d) Mode average (AIIMS, 97)
8. The mean of 10 observations is 25, but later on it was found that an observation 24 was wrongly written as 14. What will be the mean of the correct sample:
   (a) 24.5 (b) 25.5 (c) 26 (d) 26.5
9. Mean height of 10 female students of a class is 150 cm, and the mean height of 20 male students is 175 cm. What will be the mean height of all the 30 students of the class:
   (a) 166 (b) 166.6 (c) 168 (d) 166.8
10. If mean of a series is 10 and median is 15, what will be the mode of the series:
   (a) 20 (b) 25 (c) 30 (d) 35
11. Which of the following measures of central tendency will be calculated when the class interval is not closed:
   (a) Mean (b) Median (c) Mode (d) Geometric mean
12. Which measure of central tendency is most suitable to determine the rate of population growth:
   (a) Arithmetic mean (b) Geometric mean (c) Harmonic mean (d) Median
13. Relation between arithmetic mean, geometric mean and harmonic mean is:
   (a) GM < HM < AM (b) HM < GM < AM (c) AM < GM < HM (d) GM < AM < HM
14. Complete the following relation: Mode - Median = ? (Median - Mean)
   (a) 2 (b) 3 (c) 1 (d) 1.5
15. Which of the following measures of central tendency is extensively used in microbiological research:
   (a) Harmonic mean (b) Arithmetic mean (c) Geometric mean (d) None of the above
16.
The most suitable average to be used while dealing with socioeconomic status is:
   (a) Arithmetic mean (b) Median (c) Geometric mean (d) Harmonic mean
17. The geometric mean of the following set of data is: Data: 15, 23, 45, 0, 34, 10, 9
   (a) 19.4 (b) 0 (c) 45 (d) 17
18. The mean and median of 100 items are 50 and 52 respectively. The value of the largest item is 100. It was later found that it is actually 110. Therefore, the true mean is ____ and true median is ____.
   (a) 50 and 52 (b) 50.10 and 52.5 (c) 50.10 and 52 (d) 50 and 52.5
19. The point of intersection of the 'less than' and 'greater than' ogives corresponds to:
   (a) The mean (b) The median (c) The geometric mean (d) None of these
20. Which measure of central tendency can be calculated from a frequency distribution with open end interval:
   (a) Mean (b) Geometric mean (c) Harmonic mean (d) Median
21. The relationship between AM, GM, and HM is:
   (a) GM² = AM × HM (b) HM² = AM × GM (c) AM = ½ (GM × HM) (d) None of the above
22. Which measure of central tendency is not influenced by extreme values:
   (a) Mode (b) Mean (c) Median (d) Harmonic mean
23. Values are arranged in ascending and descending order to calculate:
   (a) Mode (b) Mean (c) Median (d) Standard deviation (AI, 98)
24. Number of cases of malaria detected in 10 years are 100, 160, 190, 250, 300, 300, 320, 320, 550, 380. How to calculate the average number of cases per year:
   (a) Arithmetic mean (b) Geometric mean (c) Mode (d) Median (AIIMS, June 2000)
25. Calculate the median from the following values: 1.9, 1.9, 1.9, 1.9, 2.1, 2.4, 2.5, 2.5, 2.5, 2.9
   (a) 1.2 (b) 1.9 (c) 2.25 (d) 2.5 (AIIMS, Nov 2000)
26.
Malaria incidence in village in the year 2000 is 430, 500, 410, 160, 270, 210, 300, 350, 4000, 430, 480, 540, which of the following is the best indicator for assessment of malaria incidence in that village by the epidemiologist:
   (a) Arithmetic mean (b) Geometric mean (c) Median (d) Mode (AIIMS, May 2001)
27. The median of values 2, 5, 7, 10, 10, 13, 25 is:
   (a) 10 (b) 13 (c) 25 (d) 5 (AIIMS, Nov 2001)
28. The incidence of malaria in an area is: 250, 300, 320, 300, 5000, 200, 350. The best value to give idea of incidence in past 7 years:
   (a) Median (b) Mode (c) Arithmetic mean (d) Geometric mean (AIIMS, Nov 2001)
29. Which of the following statements is/are correct regarding mean, median and mode:
   (a) Mode nominal value (b) Mean is sensitive to extreme values (c) Median is not sensitive to extreme values (Manipal, 2002)
30. For a negatively skewed data mean will be:
   (a) Less than median (b) More than median (c) Equal to median (d) One (AIIMS, 2005)

Chapter 3
Measure of Dispersion

DISPERSION

Dispersion means "scatteredness". Dispersion gives an idea about the homogeneity (less dispersed) or heterogeneity (more scattered) of the distribution.

Measure of Dispersion

Range: The range is the difference between the two extreme observations. If A and B are the greatest and smallest observations respectively, then

Range = A - B

Range is a simple but crude measure of dispersion.

Quartile Deviation or Semi-Inter Quartile Range: Quartiles divide the total frequency into four equal parts.

Figure 3.1

Q1 = First Quartile (The frequency between first quartile and origin is 25% of total frequency).
Q2 = Second Quartile (The frequency between second quartile and origin is 50% of total frequency).
Q3 = Third Quartile (The frequency between third quartile and origin is 75% of total frequency).
$$\text{Quartile deviation} = \frac{Q_3 - Q_1}{2}$$

Quartile deviation is a better index than range because it makes use of 50% of the observations. In the case of a continuous frequency distribution the quartile is calculated by the following formula:

$$Q_i = l + \frac{h}{f}\left(\frac{iN}{4} - C\right)$$

where l is the lower limit of the quartile class, h is the class width, $N = \sum f_i$, C is the cumulative frequency preceding the quartile class and f is the frequency of the quartile class. For the first quartile i = 1, for the second quartile i = 2 and for the third quartile i = 3. It is to be noted that the second quartile is equal to the median.

Decile: Deciles divide the total frequency into 10 equal parts. The formula for calculating a decile is:

$$D_i = l + \frac{h}{f}\left(\frac{iN}{10} - C\right)$$

where l is the lower limit of the decile class, h is the class width, $N = \sum f_i$, C is the cumulative frequency preceding the decile class and f is the frequency of the decile class. For the first decile i = 1, for the second decile i = 2, for the third decile i = 3, and so on up to the 9th decile, for which i = 9.

Percentile: Percentiles divide the total frequency into 100 equal parts. The formula for calculating a percentile is:

$$P_i = l + \frac{h}{f}\left(\frac{iN}{100} - C\right)$$

where l is the lower limit of the percentile class, h is the class width, $N = \sum f_i$, C is the cumulative frequency preceding the percentile class and f is the frequency of the percentile class. For the first percentile i = 1, for the second percentile i = 2, for the third percentile i = 3, and so on up to the 99th percentile, for which i = 99.

Mean Deviation: If $x_i; f_i$, i = 1, 2, 3, ..., n is a frequency distribution, then the mean deviation from the average A (usually the mean, median or mode) is given by:

$$\text{Mean deviation} = \frac{1}{N}\sum f_i\,|x_i - A|, \quad \text{where } \sum f_i = N$$

Mean deviation is least when taken from the median.

Standard Deviation and Root Mean Square Deviation: Standard deviation is the positive square root of the arithmetic mean of the squares of the deviations of the given values from their arithmetic mean:

$$\sigma = \sqrt{\frac{1}{N}\sum f_i\,(x_i - \bar{x})^2}, \quad \text{where } N = \sum f_i \text{ and } \bar{x} = \text{Mean}$$

The square of the standard deviation is known as the variance.
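The grouped-data formulas for quartiles, deciles, percentiles and the standard deviation can be sketched in Python as follows. The frequency table is the age-group table used in this chapter's worked example; the helper names `partition_value` and `grouped_mean_and_sd` are hypothetical and introduced only for this illustration. Because the code carries full precision, small differences from hand-rounded figures are to be expected.

```python
import math

# Age-group frequency table from the chapter's worked example:
# (lower limit, upper limit, frequency)
classes = [(20, 30, 5), (30, 40, 22), (40, 50, 20), (50, 60, 10), (60, 70, 3)]

N = sum(f for _, _, f in classes)  # N = sum of f_i


def partition_value(i, parts):
    """Return l + (h / f) * (i*N/parts - C) for grouped data.

    parts=4 gives quartiles, parts=10 deciles, parts=100 percentiles.
    """
    target = i * N / parts
    cumulative = 0  # C: cumulative frequency preceding the current class
    for lower, upper, f in classes:
        if cumulative + f >= target:
            h = upper - lower  # class width
            return lower + h * (target - cumulative) / f
        cumulative += f
    raise ValueError("i * N / parts exceeds the total frequency")


def grouped_mean_and_sd():
    """Mean and standard deviation, using class mid-points as x_i."""
    mids = [((lo + up) / 2, f) for lo, up, f in classes]
    mean = sum(f * x for x, f in mids) / N
    variance = sum(f * (x - mean) ** 2 for x, f in mids) / N
    return mean, math.sqrt(variance)


q1, q2, q3 = (partition_value(i, 4) for i in (1, 2, 3))
mean, sd = grouped_mean_and_sd()
quartile_deviation = (q3 - q1) / 2

print(f"Q1={q1:.2f}  Q2={q2:.2f}  Q3={q3:.2f}")
print(f"Quartile deviation={quartile_deviation:.2f}")
print(f"Mean={mean:.2f}  SD={sd:.2f}")
```

A single function covers quartiles, deciles and percentiles because the three formulas differ only in the divisor (4, 10 or 100); for instance, `partition_value(5, 10)` returns the same value as the second quartile.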
Root Mean Square Deviation: Root mean square deviation S is given by:

$$S = \sqrt{\frac{1}{N}\sum f_i\,(x_i - A)^2}, \quad \text{where } N = \sum f_i \text{ and } A \text{ is any arbitrary number}$$

Relation between σ and S: Standard deviation is the minimum value of the root mean square deviation S.

Relation between Mean Deviation from Mean and SD: Mean deviation from mean ≤ SD; the two are equal only when every observation lies at the same distance from the mean.

Coefficient of Dispersion
When we want to compare the variability of two series which differ widely in their averages or which are measured in different units, we calculate the coefficient of dispersion, which is a pure number independent of units. The coefficient of dispersion can be based on different measures of dispersion:
Based on Range: CD = (A – B) / (A + B), where A and B are the maximum and minimum values.
Based on Quartile Deviation: CD = (Q3 – Q1) / (Q3 + Q1), where Q1 and Q3 are the first and third quartiles respectively.
Based on Standard Deviation: CD = SD / Mean

Coefficient of Variation
100 times the coefficient of dispersion based on standard deviation is called the coefficient of variation:
CV = (SD / Mean) × 100
The series having the greater CV is said to be more variable than the series having the smaller CV; in other words, the series is more homogeneous if the CV is less.

Examples for Calculating Standard Deviation, Quartile, Coefficient of Dispersion and Coefficient of Variation

In case of Discrete Data: Simple Method

| Variable $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|---|---|---|
| 18 | –12 | 144 |
| 45 | 15 | 225 |
| 34 | 4 | 16 |
| 22 | –8 | 64 |
| 35 | 5 | 25 |
| 39 | 9 | 81 |
| 17 | –13 | 169 |
| Total 210 | | 724 |

No. of cases = 7; Mean = 210 / 7 = 30

$$SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\frac{724}{7}} = \sqrt{103.43} = 10.17$$

Max (A) = 45; Min (B) = 17; Range = A – B = 28
Coefficient of Dispersion (Based on Range) = (A – B) / (A + B) = 28 / 62 = 0.45
Coefficient of Dispersion (Based on SD) = SD / Mean = 10.17 / 30 = 0.339
Coefficient of Variation = (SD / Mean) × 100 = 33.9

Short-cut Method:

| Variable $x_i$ | $u_i = x_i - A$ | $u_i^2$ |
|---|---|---|
| 18 | –17 | 289 |
| 45 | 10 | 100 |
| 34 | –1 | 1 |
| 22 | –13 | 169 |
| 35 | 0 | 0 |
| 39 | 4 | 16 |
| 17 | –18 | 324 |
| Total | –35 | 899 |

No. of cases = 7; Let A = 35
Mean u = –35 / 7 = –5; therefore Mean x = A + Mean u = 35 – 5 = 30

$$SD = \sqrt{\frac{\sum u_i^2}{n} - \bar{u}^2} = \sqrt{\frac{899}{7} - 25} = \sqrt{103.43} = 10.17$$

(In this case we simply change the origin, and SD is independent of origin.)

In case of continuous frequency distribution:

| Age group | $f_i$ | Cumulative frequency | $x_i = (U+L)/2$ | $f_i x_i$ | $x_i^2$ | $f_i x_i^2$ |
|---|---|---|---|---|---|---|
| 20 – 30 | 5 | 5 | 25 | 25 × 5 = 125 | 625 | 625 × 5 = 3125 |
| 30 – 40 | 22 | 27 | 35 | 22 × 35 = 770 | 1225 | 1225 × 22 = 26950 |
| 40 – 50 | 20 | 47 | 45 | 20 × 45 = 900 | 2025 | 2025 × 20 = 40500 |
| 50 – 60 | 10 | 57 | 55 | 10 × 55 = 550 | 3025 | 3025 × 10 = 30250 |
| 60 – 70 | 3 | 60 | 65 | 65 × 3 = 195 | 4225 | 4225 × 3 = 12675 |
| Total | N = 60 | | | 2540 | | 113500 |

U = Upper limit of class interval; L = Lower limit of class interval

$$\text{Mean } \bar{x} = \frac{\sum f_i x_i}{N} = \frac{2540}{60} = 42.33$$

$$\text{Standard Deviation } (\sigma) = \sqrt{\frac{\sum f_i x_i^2}{N} - \bar{x}^2} = \sqrt{\frac{113500}{60} - \left(\frac{2540}{60}\right)^2} = \sqrt{1891.67 - 1792.11} = \sqrt{99.56} = 9.98$$

Quartiles

$$Q_i = l + \frac{h}{f}\left(\frac{iN}{4} - C\right), \quad i = 1, 2, 3$$

First Quartile (Q1): N = 60; for the first quartile i = 1; iN/4 = 60/4 = 15.
The cumulative frequency just above 15 is 27, therefore 30 – 40 is the first quartile group. Thus in the above formula: l = 30, h = 10, C = 5 and f = 22, i = 1.

$$Q_1 = 30 + \frac{10\,(15 - 5)}{22} = 30 + \frac{100}{22} = 30 + 4.55 = 34.55$$

Second Quartile or Median (Q2): N = 60; for the second quartile i = 2; iN/4 = (2 × 60)/4 = 30.
The cumulative frequency just above 30 is 47, therefore 40 – 50 is the second quartile group. Thus in the formula: l = 40, h = 10, C = 27 and f = 20, i = 2.

$$Q_2 = 40 + \frac{10\,(30 - 27)}{20} = 40 + \frac{30}{20} = 40 + 1.5 = 41.5$$

Third Quartile (Q3): N = 60; for the third quartile i = 3; iN/4 = (3 × 60)/4 = 180/4 = 45.
The cumulative frequency just above 45 is 47, therefore 40 – 50 is the third quartile group. Thus in the formula: l = 40, h = 10, C = 27 and f = 20, i = 3.

$$Q_3 = 40 + \frac{10\,(45 - 27)}{20} = 40 + \frac{180}{20} = 40 + 9 = 49$$

Coefficient of Dispersion (Based on Quartile) = (Q3 – Q1) / (Q3 + Q1) = (49 – 34.55) / (49 + 34.55) = 14.45 / 83.55 = 0.173
Coefficient of Dispersion (Based on Standard Deviation) = SD / Mean = 9.98 / 42.33 = 0.2358
Coefficient of Variation = 0.2358 × 100 = 23.58

Short Cut Method:

| Age group | $f_i$ | $x_i = (U+L)/2$ | $u_i = (x_i - A)/h$ | $f_i u_i$ | $u_i^2$ | $f_i u_i^2$ |
|---|---|---|---|---|---|---|
| 20 – 30 | 5 | 25 | –2 | –10 | 4 | 20 |
| 30 – 40 | 22 | 35 | –1 | –22 | 1 | 22 |
| 40 – 50 | 20 | 45 | 0 | 0 | 0 | 0 |
| 50 – 60 | 10 | 55 | 1 | 10 | 1 | 10 |
| 60 – 70 | 3 | 65 | 2 | 6 | 4 | 12 |
| Total | 60 | | | –16 | | 64 |

U = Upper limit of class interval; L = Lower limit of class interval
A (Arbitrary constant) = 45; h (Class width) = 10

$$\text{Mean } \bar{x} = A + h\bar{u} = 45 + 10 \times (-0.267) = 45 - 2.67 = 42.33$$

$$SD(u) = \sqrt{\frac{\sum f_i u_i^2}{N} - \bar{u}^2} = \sqrt{\frac{64}{60} - (0.267)^2} = \sqrt{1.067 - 0.071} = \sqrt{0.996} = 0.998$$

SD (x) = h × SD (u) = 10 × 0.998 = 9.98
(In this case we change the origin as well as the scale while creating the new variable $u_i$; therefore we have to multiply the SD of $u_i$ by h to obtain the standard deviation of $x_i$.)

SKEWNESS
Skewness means lack of symmetry. A distribution is said to be skewed if Mean ≠ Median ≠ Mode.

Measure of Skewness
Skewness of a distribution can be measured by the following formulae:
1. Sk = Mean – Median
2. Sk = Mean – Mode
For comparing two series we calculate the coefficient of skewness.

Karl Pearson’s Coefficient of Skewness:

$$Sk = \frac{\text{Mean} - \text{Mode}}{\sigma}$$

If the mode is ill defined, then

$$Sk = \frac{3\,(\text{Mean} - \text{Median})}{\sigma}$$

The limits for Karl Pearson’s coefficient of skewness are ± 3. In practice these limits are rarely attained. Skewness is positive if Mean > Mode or Mean > Median, and negative if Mean (M) < Mode (Mo) or M < Md.
Figure 3.2
Figure 3.3

KURTOSIS
Kurtosis (curvature of the curve) gives us an idea about the flatness of the curve. It is measured by the coefficient β2.
Figure 3.4
A - is called the normal curve or Mesokurtic curve.
B - which is flatter than the normal curve, is called the Platykurtic curve.
C - which is more peaked than the normal curve, is called the Leptokurtic curve.

MULTIPLE CHOICE QUESTIONS
1. In statistics, spread of dispersion is described by the:
(a) Median (b) Mode (c) Standard deviation (d) Mean (Kerala, 88)

2. In statistical analysis what is used to mention the dispersion of data:
(a) Mode (b) Range (c) Standard error of mean (d) Geometric mean (PGI, 81, AMC 87, 92)

3. Measure of dispersion is:
(a) Mean (b) Mode (c) Standard deviation (d) Median (Kerala, 94)

4. Among the measures of dispersion, which is most frequently used:
(a) Range (b) Mean (c) Median (d) Standard deviation (Karn, 94)

5. Best index to detect deviation is:
(a) Variation (b) Range (c) Mean deviation (d) Standard deviation (AIIMS, 96)

6. Mean weight of 100 children was 12 kg. The standard deviation was 3. Calculate the percent coefficient of variation:
(a) 25% (b) 35% (c) 45% (d) 55% (AIIMS, Nov 2000)

7. Mean square deviation will be minimum when taken from ______.
(a) Mean (b) Median (c) Arbitrary constant (d) Mode

8. Sum of absolute deviations about the median is:
(a) Least (b) Greatest (c) Zero (d) Equal

9. If the mean and mode of the given distribution are equal then its coefficient of skewness is ______.
(a) 3 (b) Zero (c) 1 (d) None of the above

10. Least value of root mean square deviation is:
(a) Mean deviation from median (b) Mean deviation (c) Standard deviation (d) Mean deviation from arbitrary constant

11. If the mean of the distribution is 40 and the median is 50, find the mode and the nature of the distribution:
(a) 70 and positively skewed (b) 70 and negatively skewed (c) 60 and negatively skewed (d) 60 and positively skewed

12. If each of a set of observations of a variable is multiplied by a constant (non-zero), the standard deviation of the resultant variable:
(a) Is unaltered (b) Increases (c) Decreases (d) Is unknown

13.
Mean, SD and variance have the same units:
(a) True (b) False

14. Which quartile divides the total frequencies in a 3 : 1 ratio:
(a) First quartile (b) Second quartile (c) Third quartile (d) Interquartile range (AI, 2003)

15. If 25% of the items are less than 10 and 25% are more than 40, the quartile deviation is:
(a) 20 (b) 15 (c) 10 (d) 40

16. If in a frequency curve of scores the mode was found to be lower than the mean, the distribution is:
(a) Symmetric (b) Negatively skewed (c) Positively skewed (d) Normal

17. In any discrete distribution (when all the values are not the same) the relation between mean deviation (MD) and standard deviation (SD) is:
(a) MD = SD (b) MD > SD (c) MD < SD (d) None of these

18. If the maximum value of a distribution is 60 and the minimum value is 40, the coefficient of dispersion is:
(a) 0.5 (b) 0.3 (c) 0.25 (d) 0.2

19. In a perfectly symmetrical distribution 50% of items are above 60 and 75% of items are below 75. Therefore the quartile deviation and the coefficient of skewness are:
(a) 15 and 0.5 (b) 15 and 0.25 (c) 30 and 0.5 (d) 30 and 0.25

20. Match the following:
(1) Range (a)
(2) Quartile deviation (b)
(3) Coefficient of variation (c) $X_{max} - X_{min}$
(4) Mean deviation (d) $\sum f_i\,|x_i - \bar{x}| / N$
(a) 1-A, 2-B, 3-C, 4-D (b) 1-C, 2-A, 3-B, 4-D (c) 1-C, 2-B, 3-A, 4-D (d) 1-C, 2-D, 3-A, 4-B

21. Root mean square deviation is:
(a) Standard deviation (b) Standard error (c) Standard variation (d) Standard error of proportion (AI, 97)

22. A right-sided skewed distribution causes:
(a) Median is more than mean (b) SD more than variance (c) Tail to the right (d) Not affected at all (AI, 98)

23. In a hospital, 10 babies were born on the same day. All of them had a birth weight of 2.8 kg. The standard deviation will be:
(a) Zero (b) One (c) –1 (d) 0.28 (AI, 2001)

24.
Median incubation period means:
(a) Time for 50% cases to occur (b) Time between primary case and secondary cases (c) Time between onset of infection and period of maximum infectivity (JIPMER, 2003)

25. If the systolic blood pressure in a population has a mean of 130 mm Hg and a median of 140 mm Hg, the distribution is said to be:
(a) Symmetrical (b) Positively skewed (c) Negatively skewed (d) Either positively or negatively skewed depending on the standard deviation

26. If each value of a given group of observations is multiplied by 10, the standard deviation of the resulting observations is:
(a) Original std. deviation × 10 (b) Original std. deviation / 10 (c) Original std. deviation – 10 (d) Original std. deviation itself

Chapter 4 Theoretical Discrete and Continuous Distribution

THEORETICAL DISCRETE DISTRIBUTION
Binomial Distribution
Let a random experiment be performed repeatedly, and let the occurrence of an event in a trial be called a success and its non-occurrence a failure. Consider a set of n independent trials (n being finite), in which the probability p of success in any trial is constant for each trial. Then q = 1 – p is the probability of failure in any trial.

If there are x successes in n trials, then the number of failures will be (n – x). But x successes in n trials can occur in $\binom{n}{x}$ ways, and the probability for each of these ways is $p^x q^{n-x}$. Hence, the probability of x successes in n trials in any order whatsoever is given by the expression:

$$P(X = x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, 2, \ldots, n$$

The probability distribution of the number of successes so obtained is called the binomial probability distribution. A random variable is said to follow a binomial distribution if it assumes only non-negative values and its probability mass function is given by the expression above. The two independent constants n and p in the distribution are known as its parameters. n is also sometimes known as the degree of the binomial distribution.
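As a small numerical check of the binomial expression, the sketch below tabulates the probabilities for hypothetical parameters n = 10 and p = 0.3 (chosen only for illustration) and confirms that they add up to 1. It also computes the mean and variance directly from the probabilities so they can be compared with the standard results np and npq. The function name `binomial_pmf` is an assumption of this sketch.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * q**(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)


# Hypothetical parameters chosen only for illustration.
n, p = 10, 0.3

probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(probabilities)
mean = sum(x * pr for x, pr in enumerate(probabilities))
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probabilities))

print(f"Sum of probabilities: {total:.4f}")
print(f"Mean: {mean:.4f}  (n*p = {n * p:.4f})")
print(f"Variance: {variance:.4f}  (n*p*q = {n * p * (1 - p):.4f})")
```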
Physical Conditions for Binomial Distribution
We get a binomial distribution under the following experimental conditions:
1. Each trial results in two mutually exclusive (disjoint) outcomes, termed success and failure.
2. The number of trials n is finite.
3. The trials are independent of each other.
4. The probability of success p is constant for each trial.

Mean and Standard Deviation of Binomial Distribution
If a random variable X follows a binomial distribution with parameters n and p, then its mean is np and its variance is npq:
Mean = np
Variance = npq

POISSON DISTRIBUTION
The Poisson distribution is a limiting case of the binomial distribution under the following conditions:
1. n, the number of trials, is indefinitely large, i.e. $n \to \infty$.
2. p, the constant probability of success for each trial, is indefinitely small, i.e. $p \to 0$.
3. $np = \lambda$ (say) is finite. Thus $p = \lambda / n$ and $q = 1 - \lambda / n$, where λ is a positive real number.

A random variable is said to follow a Poisson distribution if it assumes only non-negative values and its probability mass function is given by:

$$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots; \qquad = 0 \text{ otherwise}$$

Here λ is known as the parameter of the distribution.

Remarks
The Poisson distribution occurs when events do not occur as outcomes of a definite number of trials of an experiment (unlike the binomial distribution) but occur at random points of time and space, and our interest lies only in the number of occurrences of the event, not in its non-occurrences. For example: the number of deaths from a disease (not in the form of an epidemic) such as heart attack or cancer, or due to snake bite.

Mean and Variance of Poisson Distribution
In the Poisson distribution the mean and the variance are equal, both being λ.

THEORETICAL CONTINUOUS DISTRIBUTION
Normal (or Gaussian) Distribution
The binomial and Poisson distributions both relate to a discrete random variable.
The most important continuous distribution is the Gaussian (CF Gauss, 1777-1855) or, as it is frequently called, the normal distribution.

Chief Characteristics of the Normal Distribution
The normal probability curve with mean μ and standard deviation σ is given by the equation

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty,\ \sigma > 0$$

1. The curve is bell shaped and symmetrical about the line x = μ.
2. The mean, median and mode of the distribution coincide.
3. As x increases numerically, f(x) decreases rapidly, the maximum probability occurring at the point x = μ and given by $[f(x)]_{max} = \dfrac{1}{\sigma\sqrt{2\pi}}$.
4.
5. Since f(x), being a probability, can never be negative, no portion of the curve lies below the x-axis.
6. The x-axis is an asymptote to the curve.
7. The points of inflexion of the curve, where it changes its shape from concave to convex, are given by x = μ ± σ.
8. The relation between quartile deviation, mean deviation and standard deviation is given by: QD : MD : SD ≈ 2/3 : 4/5 : 1, i.e. QD ≈ (2/3)σ and MD ≈ (4/5)σ.
9. The total area under the normal probability curve is unity.

Shape of Curve
Figure 4.1

A variable X is said to be a normal variate if it follows a normal probability distribution with mean μ and variance σ², and is represented as X ~ N (μ, σ²).
If X ~ N (μ1, σ1²) and Y ~ N (μ2, σ2²) and X and Y are independent, then X + Y ~ N (μ1 + μ2, σ1² + σ2²).
The sum as well as the difference of two independent normal variates is also a normal variate.
If X ~ N (μ, σ²), then kX will be distributed normally with mean kμ and variance k²σ², i.e. kX ~ N (kμ, k²σ²); also X + a will be distributed normally with mean μ + a and variance σ², i.e. X + a ~ N (μ + a, σ²).

STANDARD NORMAL VARIATE
If x ~ N (μ, σ²), then $z = \dfrac{x - \mu}{\sigma}$ is a standard normal variate with mean 0 and variance 1.

Area Properties
Figure 4.2 (horizontal axis: standardized variable z)
The above curve of the normal distribution shows the scales of the original variable at the points which differ from μ by ± σ, ± 2σ and ± 3σ.
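The standardization z = (x − μ)/σ and the areas enclosed within a given number of standard deviations of the mean can be checked numerically. The sketch below uses the error function from Python's `math` module to obtain the standard normal cumulative area; the blood-pressure figures and the helper names `standard_normal_cdf` and `area_within` are hypothetical, introduced only for illustration.

```python
import math


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def area_within(k):
    """Probability that a normal variate lies within mean ± k standard deviations."""
    return standard_normal_cdf(k) - standard_normal_cdf(-k)


# Hypothetical example: an observation of 150 when mu = 130 and sigma = 10.
x, mu, sigma = 150, 130, 10
z = (x - mu) / sigma  # standard normal variate

print(f"z = {z}")
for k in (1, 1.96, 2, 2.58, 3):
    print(f"within ±{k} SD: {area_within(k):.1%}")
```

Because the area depends only on z, the same function serves any normal variate once it has been standardized.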
From the above figure it is clear that a relatively small proportion of the area under the curve lies outside the pair of values x = μ + 2σ and x = μ – 2σ. In fact the probability that x lies within μ ± 2σ is very nearly 0.95, and the probability that it lies outside this range is correspondingly 0.05.

If X and Y are two independent standard normal variates, then U = X + Y and V = X – Y are also independently distributed as normal variates with mean 0 and variance 2.

The following table gives the area under the normal probability curve for some important values of the normal variate x.

| Distance from mean ordinate in terms of ± σ | Area under normal curve |
|---|---|
| x̄ ± 1 σ | 68.3% |
| x̄ ± 1.96 σ | 95% |
| x̄ ± 2 σ | 95.4% |
| x̄ ± 2.58 σ | 99% |
| x̄ ± 3 σ | 99.7% |

Importance of Normal Distribution
1. Most of the distributions occurring in practice, e.g. binomial and Poisson, can be approximated by the normal distribution.
2. Many distributions of sample statistics tend to normality for large samples, and as such they can be studied with the help of the normal distribution.
3. The entire theory of small-sample tests, viz. ‘t’, ‘F’ and χ² tests, is based on the fundamental assumption that the parent population from which the sample is drawn follows a normal distribution.

MULTIPLE CHOICE QUESTIONS
1. In a standard normal curve the area between one standard deviation on either side will be:
(a) 68% (b) 85% (c) 99.7% (d) None of the above (AI, 88, AIIMS, 86)

2. Normal distribution curve depends on:
(a) Mean and sample (b) Mean and median (c) Median and standard deviation (d) Mean and standard deviation (AI, 90)

3. The area under a normal distribution curve for SD of 2 is:
(a) 68% (b) 95% (c) 97.5% (d) 100% (AI, 93)

4. Mean ± 1.96 SD includes the following % of values in a distribution:
(a) 68% (b) 99.5% (c) 88.7% (d) 95% (AI, 96)

5. Shape of the normal curve is:
(a) Symmetrical (b) Curvilinear (c) Linear (d) Parabolic (Assam, 95)

6. If SD is 1.96, the confidence limit is:
(a) 63.6% (b) 66.6% (c) 95% (d) 99%

7.
95% of confidence limits exist between:
(a) ± 1 SD (b) ± 2 SD (c) ± 3 SD (d) ± 4 SD [Hint: 1.96 is approximately equal to 2] (AI, 98) (AI, 99)

8. All are true regarding the standard distribution curve except:
(a) One standard deviation includes 95% of the values (b) Median is the mid point (c) Mode is the most frequently occurring value (d) Mean and mode coincide (AI, 2000)

9. The relation between mean deviation about mean and quartile deviation is:
(a) Mean deviation is less than quartile deviation (b) Mean deviation is more than quartile deviation (c) Mean deviation is equal to quartile deviation (d) They are not related to each other

10. The points of inflexion of the normal curve are:
(a) Mean ± SD (b) Mean ± 2 SD (c) Mean ± 3 SD (d) Mean ± 2/3 SD

11. If X and Y are two independent normal variates then X – Y is also a normal variate:
(a) True (b) False

12. The mean and variance of a normal distribution:
(a) Are same (b) Cannot be same (c) Are sometimes equal (d) Are equal in the limiting case, as n → ∞

13. For a normal distribution:
(a) Mean > Median > Mode (b) Mean < Median < Mode (c) Mean > Median < Mode (d) Mean = Mode = Median

14. The standard normal distribution is represented by:
(a) N (0, 0) (b) N (0, 1) (c) N (1, 0) (d) N (1, 1)

15. If in a normal distribution the standard deviation is equal to 45, then the mean deviation from mean is equal to:
(a) 45 (b) 40 (c) 36 (d) 30

16. In a normal distribution the number of observations less than that divided by the mean is included in the range:
(a) Mean ± 3 SD (b) Mean ± 1 SD (c) Mean ± 2 SD (d) Mean ± 0.67 SD
[Hint: The mean divides the total area into two equal parts (i.e. 50% of observations lie below the mean and 50% of observations lie above the mean). The first quartile of the normal distribution is μ – 0.6745σ. These limits will include 50% of observations.
Therefore the number of observations included within the limits Mean ± 0.67 SD will be less than that divided by the mean].

17. Normal distribution is:
(a) Very flat (b) Very peaked (c) Smooth (d) Bell shaped symmetrical distribution about mean

18. There are two independent normal variates X and Y. X ~ N (6, 3) and Y ~ N (3, 6). Then the distribution of X – Y is:
(a) N (3, 3) (b) N (3, 6) (c) N (–3, 9) (d) N (3, 9)

19. Total area under the normal probability curve is:
(a) 100 (b) 10 (c) 1 (d) 0.05

20. Binomial distribution tends to normal distribution if:
(a) n → ∞ and neither p nor q is very small (b) n → ∞ and p → 0 (c) n → ∞ and q → 0 (d) None of the above

21. Normal distribution is symmetrical only for some specified values of X:
(a) True (b) False

22. For a normal distribution, quartile deviation, mean deviation and standard deviation are in the ratio:
(a) 4/5 : 2/3 : 1 (b) 2/3 : 4/5 : 1 (c) 1 : 4/5 : 2/3 (d) 4/5 : 1 : 2/3

23. The mean deviation about mean of a normal distribution is:
(a) (b) (c) (d) [Hint: is approximately equal to ]

24. If X is distributed normally with mean μ and variance σ², then a linear combination of X, i.e. aX + b, will also be a normal variate with:
(a) Mean aμ and variance a²σ² (b) Mean aμ + b and variance a²σ² (c) Mean μ + b and variance b²σ² (d) Mean bμ + a and variance b²σ²

25. In the estimation of standard probability, Z score is applicable to:
(a) Normal distribution (b) Skewed distribution (c) Binomial distribution (d) Poisson distribution (UPSC, 2001)

26. A non-symmetric frequency distribution is known as:
(a) Normal distribution (b) Skewed distribution (c) Cumulative frequency distribution (d) None of the above (Orissa, 99)

27. The area between one standard deviation on either side of the mean in a normal distribution is:
(a) 62% (b) 68% (c) 90% (d) 99% (AIIMS, May 95)

28.
True about the normal distribution curve is all except:
(a) Mean, median and mode coincide (b) Total area of the curve is one (c) Standard deviation is one (d) Mean of the curve is hundred (AIIMS, Dec 97) [Hint: SD of the standard normal curve is 1]

29. Which statement is true about the standard normal distribution curve:
(a) Mean 1 and standard deviation 0 (b) Mean 0 and standard deviation 1 (c) Curve skews towards left (d) Curve skews towards right (AIIMS, Nov 99)

30. In a normal distribution curve, the true statement is:
(a) Mean = SD (b) Median = SD (c) Mean = 2 Median (d) Mean = Mode (AIIMS, May 2001)

31. Systolic BP of a group of persons follows a normal distribution curve. The mean BP is 120. The values above 120 are:
(a) 25% (b) 75% (c) 50% (d) 100% (AIIMS, Nov 2001)

32. All are true in the normal distribution curve except:
(a) Is bell shaped, symmetrical and on the x axis (b) Occurs only in normal people (c) Median = mode = mean (Manipal, 2002)

33. A population study showed a mean glucose of 86 mg/dL. In a sample of 100 showing normal curve distribution, what percentage of people have glucose above 86?
(a) 65 (b) 50 (c) 75 (d) 60 (AI, 2002)

34. The standard normal distribution:
(a) Is skewed to the left (b) Has mean = 1.0 (c) Has standard deviation = 0.0 (d) Has variance = 1 (AI, 2002)

Chapter 5 Correlation and Regression

ASSOCIATION AND CORRELATION
Association
Association may be defined as the concurrence of two random variables when they occur together more frequently than one would expect by chance.

Correlation
Correlation indicates the degree of association between two random variables.

CORRELATION
Consider a series in which each term may assume the values of two or more variables. For example, if we measure the heights and weights of a certain group of persons, we will get a distribution known as a bivariate distribution.
If the two variables deviate in the same direction, the correlation is said to be positive. But if they deviate in opposite directions, the correlation is said to be negative. A scatter diagram is the simplest way to represent a bivariate distribution.

Karl Pearson’s Coefficient of Correlation
The correlation coefficient between two random variables x and y, usually denoted by $r_{xy}$, is a numerical measure of the linear relationship between them:

$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{\frac{1}{n}\sum xy - \bar{x}\,\bar{y}}{\sigma_x \sigma_y}$$

Figure 5.1: Graphical representation of the standard data for different values of r.

Properties of Correlation Coefficient
1. The correlation coefficient r lies between –1 and +1.
2. The correlation coefficient is independent of change of origin and scale.
3. Two independent variables are uncorrelated. If x and y are two independent variables then $r_{xy}$ = 0.
4. But two uncorrelated variables may or may not be independent; $r_{xy}$ = 0 merely implies the absence of any linear relationship.

Standard Error of Correlation Coefficient
If r is the correlation coefficient in a sample of n pairs of observations, then its standard error is given by:

$$SE(r) = \frac{1 - r^2}{\sqrt{n}}$$

REGRESSION
Regression Analysis
Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data. The line of regression is obtained by the principle of least squares.

Let us suppose that in a bivariate distribution $(x_i, y_i)$; (i = 1, 2, ..., n), y is the dependent variable and x is the independent variable. Let the line of regression of y on x be given by:
y = a + bx
where a and b are constants estimated by the method of least squares; b is the slope of the regression equation of y on x.

The line of regression of y on x is given by:

$$(y - \bar{y}) = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$

The line of regression of x on y is given by:

$$(x - \bar{x}) = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y})$$

The two regression coefficients will never be of different signs.
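The correlation coefficient, its standard error and the two regression coefficients can be computed directly from the definitions. The sketch below is a minimal illustration using the heights and weights of the ten individuals in this chapter's solved example; the variable names are assumptions of the sketch, and because the code keeps full precision its results may differ slightly from hand-rounded working.

```python
import math

# Heights (x) and weights (y) of the ten individuals in the solved example.
heights = [175, 166, 182, 167, 176, 169, 182, 190, 187, 151]
weights = [65, 56, 78, 66, 72, 69, 81, 87, 84, 60]

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n

# Cov(x, y) = (1/n) * sum(x*y) - mean_x * mean_y
cov_xy = sum(x * y for x, y in zip(heights, weights)) / n - mean_x * mean_y
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in heights) / n)
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in weights) / n)

r = cov_xy / (sd_x * sd_y)        # Karl Pearson's coefficient
se_r = (1 - r**2) / math.sqrt(n)  # standard error of r
b_yx = r * sd_y / sd_x            # slope of the regression of y on x
b_xy = r * sd_x / sd_y            # slope of the regression of x on y

print(f"r = {r:.3f}, SE(r) = {se_r:.3f}")
print(f"y on x: y - {mean_y:.1f} = {b_yx:.3f} (x - {mean_x:.1f})")
print(f"x on y: x - {mean_x:.1f} = {b_xy:.3f} (y - {mean_y:.1f})")
```

Both slopes carry the sign of the covariance, which is why the two regression coefficients can never differ in sign, and their product equals r².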
The correlation coefficient can also be calculated on the basis of the regression coefficients:

$$r = \pm\sqrt{b_{yx} \cdot b_{xy}}, \quad \text{where } b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \text{ and } b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$$

Hence, $b_{yx} \cdot b_{xy} = r^2$.

It may be noted that the sign of the correlation coefficient is the same as that of the regression coefficients, since the sign of each depends upon the covariance term. Thus if the regression coefficients are positive, r is positive, and if the regression coefficients are negative, r is negative.

Solved Example
Find the correlation coefficient and the lines of regression between height and weight of 10 individuals:

| Case no. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Height | 175 | 166 | 182 | 167 | 176 | 169 | 182 | 190 | 187 | 151 |
| Weight | 65 | 56 | 78 | 66 | 72 | 69 | 81 | 87 | 84 | 60 |

Correlation Coefficient

| Height ($x_i$) | Weight ($y_i$) | $u_i = x_i - 170$ | $v_i = y_i - 70$ | $u_i^2$ | $v_i^2$ | $u_i v_i$ |
|---|---|---|---|---|---|---|
| 175 | 65 | +5 | –5 | 25 | 25 | –25 |
| 166 | 56 | –4 | –14 | 16 | 196 | +56 |
| 182 | 78 | +12 | +8 | 144 | 64 | +96 |
| 167 | 66 | –3 | –4 | 9 | 16 | +12 |
| 176 | 72 | +6 | +2 | 36 | 4 | +12 |
| 169 | 69 | –1 | –1 | 1 | 1 | +1 |
| 182 | 81 | +12 | +11 | 144 | 121 | +132 |
| 190 | 87 | +20 | +17 | 400 | 289 | +340 |
| 187 | 84 | +17 | +14 | 289 | 196 | +238 |
| 151 | 60 | –19 | –10 | 361 | 100 | +190 |
| Total N = 10 | | +45 | +18 | 1425 | 1012 | 1052 |

Mean of u = 45 / 10 = 4.5; Mean of v = 18 / 10 = 1.8

$$SD(u_i) = \sqrt{\frac{1425}{10} - (4.5)^2} = \sqrt{142.5 - 20.25} = \sqrt{122.25} = 11.06$$

$$SD(v_i) = \sqrt{\frac{1012}{10} - (1.8)^2} = \sqrt{101.2 - 3.24} = \sqrt{97.96} = 9.90$$

$$r = \frac{\sum u_i v_i / N - \bar{u}\,\bar{v}}{\sigma_u\,\sigma_v} = \frac{1052/10 - 4.5 \times 1.8}{11.06 \times 9.90} = \frac{105.2 - 8.1}{109.49} = 0.887$$

Mean of x = 170 + 4.5 = 174.5; Mean of y = 70 + 1.8 = 71.8
SD (x) = SD (u) = 11.06 and SD (y) = SD (v) = 9.90

Regression of x on y:

$$(x - 174.5) = 0.887 \times \frac{11.06}{9.90}\,(y - 71.8)$$

x – 174.5 = 0.99 (y – 71.8)
x – 174.5 = 0.99 y – 71.08
or, x = 103.42 + 0.99 y

Similarly, the regression of y on x is y – 71.8 = 0.79 (x – 174.5), i.e. y = 0.79 x – 66.06.

Thus by putting the value of one variable into the regression equation we can predict the value of the other variable.

MULTIPLE CHOICE QUESTIONS
1. Correlation between two variables is a numerical measure of:
(a) Relationship between them (b) Linear relationship between them (c) Quadratic relationship between them (d) All the above

2.
If the correlation coefficient between two variables is zero, then:
(a) The two variables are independent (b) The two variables are linearly related (c) There is a perfect correlation between the two variables (d) There may be a non-linear relation between the two variables

3. The correlation coefficient between X and Y will have a positive sign when:
(a) X is increasing and Y is decreasing (b) Both X and Y are increasing (c) X is decreasing and Y is increasing (d) There is no change in X and Y

4. The coefficient of correlation:
(a) Can take any value between –1 and +1 (b) Is always less than –1 (c) Is always greater than +1 (d) Cannot be zero

5. The coefficient of correlation between X and Y is +0.24. Their covariance is 3.5 and the variance of X is 16. The SD of Y is:
(a) (b) 16 / (3.5 × 0.24) (c) 0.24 / (4 × 3.5) (d) 3.5 / (0.24 × 4)

6. The coefficient of correlation is independent of:
(a) Change of scale only (b) Change of origin only (c) Both change of origin and scale (d) Neither change of origin nor change of scale

7. Probable error of r is:
(a) (b) 0.6745 (1 – r²) n (c) 0.6745 (1 – r²) n (d) 0.6745 (1 – r²) n

8. If one of the regression coefficients is greater than unity then the other will be:
(a) Also greater than unity (b) Less than unity (c) Equal to 1 (d) All the above

9. If two variables are uncorrelated then the two lines of regression, i.e. X on Y and Y on X, will:
(a) Coincide (b) Be perpendicular (c) Make an angle of 45° with each other (d) Be parallel to each other

10. If one of the regression coefficients is positive then the other will be:
(a) Also positive (b) Negative (c) May or may not be positive (d) Independent of the sign of the regression coefficient

11. The correlation coefficient between two variables X and Y is 0.63. All the values of X and Y are multiplied by a non-zero constant 6.
The correlation between the new variables will be:
(a) More than 0.63 (b) Less than 0.63 (c) 0.63 (d) Cannot be calculated

12. Regression coefficient is independent of:
(a) Change of scale only (b) Change of origin only (c) Change of origin as well as scale (d) Neither change of origin nor scale

13. If the two lines of regression X on Y and Y on X coincide then the correlation will be:
(a) r = ± 1 (b) r = 0 (c) r = +0.5 (d) –1 < r < 1

14. If the lines of regression are given as x + 2y – 5 = 0 and 2x + 3y = 8, then the means of x and y respectively are:
(a) 1, 2 (b) 1, 2 (c) 2, 5 (d) 2, 3
[Hint: The lines of regression pass through the point (x̄, ȳ), and therefore at that point the lines of regression will be x̄ + 2ȳ – 5 = 0 and 2x̄ + 3ȳ = 8. By solving these two equations we can calculate the values of the means of x and y.]

15. The following statistic is used to measure the linear association between two characteristics in the same individuals:
(a) Coefficient of variation (b) Coefficient of correlation (c) Chi-square (d) Standard error (Karnat, 96)

16. All are features of the coefficient of correlation except:
(a) Cause effect association cannot be shown (b) Risk association can be revealed (c) Correlation risk to disease (d) Indicates linear relationship (AIIMS, 97)

17. When height and weight are perfectly correlated, the coefficient of correlation is:
(a) +1 (b) –1 (c) 0 (d) More than 1 (AIIMS, 2000)

18. Height to weight is a/an:
(a) Association (b) Correlation (c) Proportion (d) Index (AIIMS, 96)
[Hint: Association is the relationship between two random variables and the correlation coefficient shows the degree of association.]

19. Correlation coefficient tends to lie between:
(a) Zero to –1.0 (b) –1.0 to +1.0 (c) +1.0 to zero (d) +2.0 to –2.0 (AIIMS, June 97)

20. If the correlation between height and weight is 2.6.
The true statement is:
(a) Positive correlation (b) No association (c) Negative correlation (d) Calculation of coefficient is wrong (AIIMS, June 2000)
21. A regression between height and age follows y = a + bx. The curve is:
(a) Hyperbola (b) Sigmoid (c) Straight line (d) Parabola (AIIMS, Nov 2001)
22. The correlation between IMR and socioeconomic status is best depicted by:
(a) Correlation (+1) (b) Correlation (+0.5) (c) Correlation (– 1) (d) Correlation (– 0.8) (AIIMS, Nov 2001)
[Hint: The IMR decreases with the increase in socioeconomic status, but the two are not perfectly correlated].
23. The correlation between variables A and B in a study was found to be 1.1. This indicates:
(a) Very strong correlation (b) Moderately strong correlation (c) Weak correlation (d) Computational mistake in calculating correlation (AI, 2002)
24. A cardiologist found a highly significant correlation coefficient (r = 0.90, p = 0.01) between the systolic blood pressure values and serum cholesterol values of the patients attending his clinic. Which of the following statements is a wrong interpretation of the correlation?
(a) Since there is a high correlation the magnitudes of both the measurements are likely to be close to each other.
(b) A patient with a high level of systolic BP is also likely to have a high level of serum cholesterol.
(c) A patient with a low level of systolic BP is also likely to have a low level of serum cholesterol.
(d) About 80% of the variation in systolic blood pressure among his patients can be explained by their serum cholesterol values and vice versa. (AI, 2005)
25.
Total cholesterol level = a + b (calorific intake) + c (physical activity) + d (body mass index) is an example of:
(a) Simple linear regression (b) Simple curvilinear regression (c) Multiple linear regression (d) Multiple logistic regression (AI, 2005)

Chapter 6 Probability

Random Series: If a coin is tossed a very large number of times, and the result of each toss is written down, the result may be something like the following (H standing for heads and T for tails):
H, H, T, T, T, H, T, H, H, H, T, T, H, H, T, H, ...
Such a sequence is called a Random Sequence or Random Series.

Trial and Events: Each toss in the above series is called a Trial and each result is called an Outcome or Event. In the above series the outcome of the first trial is a head.

Exhaustive Events: The total number of possible events in any trial is known as Exhaustive Events or Exhaustive Cases. Thus in tossing a coin there are only two events, Head and Tail. In throwing a die there are six exhaustive cases, since one of the six faces 1, 2, 3, ..., 6 will come uppermost.

Mutually Exclusive Events: Events are said to be mutually exclusive if the happening of one precludes the happening of all the others. For example, in throwing a die all 6 faces 1 to 6 are mutually exclusive, since if one of these faces comes, the possibility of all the other faces in the same trial is ruled out.

Equally Likely Events: Events are said to be equally likely if all of them have an equal chance of taking place, i.e. there is no reason to expect one in preference to the others. For example, in throwing an unbiased die, all the six faces are equally likely to come.

Independent Events: Several events are said to be independent if the happening of an event is not affected by the supplementary knowledge concerning the occurrence of any number of the remaining events.
For example, in tossing an unbiased coin the event of getting a head in the first toss is independent of getting a head in the second, third and subsequent tosses.

MATHEMATICAL OR CLASSICAL PROBABILITY
If a trial results in 'n' exhaustive, mutually exclusive and equally likely cases, and 'm' of them are favourable to the happening of an event 'E', then the probability of happening of the event 'E' is:
$p = P(E) = \frac{m}{n}$
and the probability of non-occurrence of the event E is:
$q = \frac{n - m}{n} = 1 - \frac{m}{n} = 1 - p$
Thus, p + q = 1. Obviously, p and q are non-negative and cannot exceed 1, i.e. $0 \le p \le 1$ and $0 \le q \le 1$.

Sure Event: If the probability of occurrence of an event is 1, i.e. p = P(E) = 1, then E is called a Sure Event.

Impossible Event: If the probability of occurrence of an event 'E' is zero, i.e. p = P(E) = 0, then E is called an Impossible Event.

ADDITIVE AND MULTIPLICATIVE PROPERTY OF PROBABILITY
Here we will consider the two basic laws of probability, i.e. the addition and multiplication operations of probability.

Addition Rule
Suppose that in a population of doctors the probability that a doctor is male is 0.8 and the probability that a doctor is a surgeon is 0.4. If A is the event that a doctor is male, then P (A) = 0.8; similarly, if B is the event that the doctor is a surgeon, then P (B) = 0.4. If the two separate probabilities are added, the result is 0.8 + 0.4 = 1.2, which is wrong because the probability of occurrence of an event cannot exceed 1. This happens because the double event, a doctor who is male and also a surgeon, is counted twice: once while calculating the probability of a male doctor and again as a part of the surgeons. The probability of the double event must therefore be subtracted. This is made clear by the following diagrams:
Figure 6.1
Figure 6.2
In Figure 6.1 the shaded portion is included in circle A as well as in circle B, i.e.
while calculating the probability of male doctors the surgeons who are male are included in it, and while calculating the probability of surgeons, the males who are surgeons are also included. Therefore in the additive law the probability of the double event is subtracted, as shown in Figure 6.2.

The additive property of probability states that if A and B are two events, the combined probability of the two events is given by:
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
i.e. Prob (A or B or both) = Prob (A) + Prob (B) – Prob (A and B)
In case of mutually exclusive events:
$P(A \cup B) = P(A) + P(B)$
i.e. P (A or B) = Prob (A) + Prob (B)
Mutually exclusive events (Fig. 6.3) cannot occur together: for example, a doctor cannot be both a male surgeon and a female surgeon.
Figure 6.3
Thus if the probability of a male surgeon in a population of doctors is P (A) = 0.3 and the probability of a female surgeon is P (B) = 0.1, then the probability of a surgeon in the population of doctors is:
P (A or B) = P (A) + P (B) = 0.3 + 0.1 = 0.4

Multiplication Rule
When the events are not mutually exclusive:
Figure 6.4
Suppose that in Figure 6.4 there are n points in the square, m1 points in circle A, m2 points in circle B, and m3 points common to both A and B (assume m1 > 0 and m2 > 0). Then the probability that both the events A and B occur is given by:
P (A and B) = P (A) × P (B given A)
or
P (A and B) = P (B) × P (A given B)
P (B given A) is known as the conditional probability of occurrence of B given that A has already occurred, and P (A given B) is the conditional probability of occurrence of A given that B has already occurred.
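
The two forms of the multiplication rule can be checked numerically with the counting picture of Figure 6.4. The following is only a minimal sketch: the counts n, m1, m2 and m3 are hypothetical values chosen for illustration and are not taken from the text.

```python
from fractions import Fraction

# Hypothetical counts for the picture in Figure 6.4 (illustrative values only)
n = 100   # points in the square
m1 = 40   # points in circle A
m2 = 25   # points in circle B
m3 = 10   # points common to both A and B

p_a = Fraction(m1, n)            # P(A)
p_b = Fraction(m2, n)            # P(B)
p_b_given_a = Fraction(m3, m1)   # P(B given A): common points among the points of A
p_a_given_b = Fraction(m3, m2)   # P(A given B): common points among the points of B

# Both forms of the multiplication rule
p_ab_via_a = p_a * p_b_given_a   # P(A) x P(B given A)
p_ab_via_b = p_b * p_a_given_b   # P(B) x P(A given B)

print(p_ab_via_a, p_ab_via_b, Fraction(m3, n))
```

With exact fractions, both products reduce to the same value as the direct count of common points divided by n, whichever event is conditioned on.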
In the above example,
$P(A) = \frac{m_1}{n}$; $P(B) = \frac{m_2}{n}$; $P(B \text{ given } A) = \frac{m_3}{m_1}$; $P(A \text{ given } B) = \frac{m_3}{m_2}$
Thus,
$P(A \text{ and } B) = \left(\frac{m_1}{n}\right)\left(\frac{m_3}{m_1}\right) = \frac{m_3}{n}$
or
$P(A \text{ and } B) = \left(\frac{m_2}{n}\right)\left(\frac{m_3}{m_2}\right) = \frac{m_3}{n}$
which is the ratio of the number of points common to both A and B to the total number of points, n.

In case of independent events the multiplication rule is:
P (A and B) = P (A) . P (B)
Suppose that two random sequences of trials are proceeding simultaneously; for example, at each stage a coin may be tossed and a die thrown. What is the probability of a particular combination of results, for example a head (H) on the coin and a 5 on the die? The result is given by the simple multiplication rule:
P (H and 5) = P (H) × P (5)
In this example, the probability of a 5 on the die is not affected by whether or not H occurred on the coin. In other words the two events are independent, and by the multiplication rule the probability of H and 5 is:
$P(H \text{ and } 5) = P(H) \cdot P(5) = \left(\frac{1}{2}\right)\left(\frac{1}{6}\right) = \frac{1}{12}$

MULTIPLE CHOICE QUESTIONS
1. The probability of a sure event is:
(a) 0 (b) 0.5 (c) – 1 (d) + 1
2. Out of 1000 individuals surveyed, it was observed that 260 were suffering from respiratory disorders and 470 from diabetes, and 170 were suffering from diabetes as well as respiratory disorders. The probability of persons suffering from respiratory problems is:
(a) 0.26 (b) 0.43 (c) 0.17 (d) 0.47
[Hint: The total number of persons suffering from respiratory disorders also includes those who are suffering from respiratory disorders as well as diabetes].
3. In the above problem the probability of individuals who are suffering from diabetes alone is:
(a) 0.47 (b) 0.17 (c) 0.26 (d) 0.43
4. Find the probability of persons suffering from respiratory disorders, diabetes as well as both diabetes and respiratory disorders:
(a) 1.07 (b) 0.17 (c) 0.90 (d) 0.69
5.
Find the probability of persons suffering from diabetes as well as respiratory disorders:
(a) 0.90 (b) 0.17 (c) 1.17 (d) 0.47
6. The probability of any event in any case does not exceed:
(a) 0.5 (b) 0.9 (c) –1 (d) 1
7. The probability of any event lies between:
(a) – 1 < P < 1 (b) 0 < p < 1 (c) 0 < P < 1 (d) –1 < P < 0
8. In a population the incidence of ocular deficiency in males is 20%, and in females is 25%. What is the probability of ocular disease in the population:
(a) 0.05 (b) 0.25 (c) 0.45 (d) None of the above
9. In question no. (8) what is the probability of diabetes in the population:
(a) 0 (b) 0.25 (c) 0.20 (d) None of the above
10. The events A and B are mutually exclusive, so:
(a) Prob. (A or B) = Prob (A) + Prob (B) (b) Prob (A and B) = Prob (A) . Prob (B) (c) Prob (A) = Prob (B) (d) Prob (A) + Prob (B) = 1 (AI, 2005)

Chapter 7 Sampling and Design of Experiments

POPULATION
The group of individuals under study is called the population or universe. The population may be finite or infinite.

SAMPLE
A finite subset of individuals in a population is called a sample, and the number of individuals in a sample is called the sample size. The sample characteristics are used to determine approximately, or estimate, the characteristics of the population. The error involved in such approximation is known as sampling error, which is inherent and unavoidable in any and every sampling scheme.

Types of Sampling
Some of the commonly known and frequently used sampling techniques are:
1. Random sampling
2. Stratified sampling
3. Systematic sampling
4. Cluster sampling

Random Sampling
In this case the sampling units are selected at random. A random sample is one in which each unit of the population has an equal chance of being included in the sample. Suppose we take a sample of size n from a finite population of size N. Then there are $^{N}C_{n}$ possible samples.
A sampling technique in which each of the $^{N}C_{n}$ samples has an equal chance of being selected is known as Random Sampling, and the sample obtained by this technique is termed a random sample.
In simple random sampling each unit of the population has an equal chance of being included in the sample, and this probability is independent of the previous drawings. To ensure that sampling is simple, it must be done with replacement if the population is finite. However, in case of an infinite population replacement is not necessary.

Stratified Sampling
If the population is not homogeneous, the entire heterogeneous population is divided into a number of homogeneous groups, usually called strata. The units are sampled at random from each of these strata; the sample size in each stratum varies according to the relative importance of the stratum in the population. The sample which is the aggregate of the sampled units of each stratum is termed a stratified sample. Such a sample is a good representative of the population when the population considered is heterogeneous.

Systematic Sampling
In systematic sampling the number of units in the population should be a multiple of the number of units in the sample (i.e. the sample size). Suppose there are N units in the population, numbered in some order, and we want to draw a sample of n units from this population. Then there should be a constant k which when multiplied by the sample size (n) equals the population size (N), i.e. n . k = N or k = N/n. We divide the N units of the population into n groups of k units each as follows:
1             2             3             4             ...  i             ...  k
k + 1         k + 2         k + 3         k + 4         ...  i + k         ...  2k
2k + 1        2k + 2        2k + 3        2k + 4        ...  i + 2k        ...  3k
...
(n – 1)k + 1  (n – 1)k + 2  (n – 1)k + 3  (n – 1)k + 4  ...  i + (n – 1)k  ...  (n – 1)k + k = nk = N
In systematic sampling, to select a sample of n units, if k = N/n then every kth unit is selected commencing with a randomly chosen number between 1 and k.
Hence, the selection of the first unit determines the whole sample. Let the ith unit be selected at random from the first k units; then the sample will consist of the ith, (i + k)th, (i + 2k)th, ..., [i + (n – 1)k]th units of the population. In systematic sampling the first unit is drawn at random and the remaining units follow a systematic pattern.
Example: Suppose from a population of size N = 5,000 we want to draw a sample of size 250 (i.e. n = 250); then $k = \frac{5000}{250} = 20$. Therefore, in systematic sampling the first unit of the sample is selected at random from the first 20 units of the population. Let us draw the 6th unit from the first 20 units. Then the first unit of the sample will be the 6th unit of the population, the second unit of the sample will be the 26th unit of the population, the next unit will be the 46th unit of the population, and so on. In this way we can draw a sample of size 250.

Advantages of Systematic Sampling
1. Easier to draw without mistake.
2. More precise than simple random sampling, as it is more evenly spread over the population.

Disadvantages of Systematic Sampling
1. If the list has a periodic arrangement then it can fare very badly.

Cluster Sampling
Contrary to simple random sampling and stratified sampling, where single subjects are selected from the population, in cluster sampling the subjects are selected in groups or clusters. Cluster sampling is used when 'natural' groupings are evident in the population. The total population is divided into groups or clusters. Elements within a cluster should be as heterogeneous as possible, but there should be homogeneity between clusters. Clusters must be mutually exclusive and collectively exhaustive. A random sampling technique is then used on the relevant clusters to choose which clusters to include in the study. In single-stage cluster sampling, all the elements from each of the selected clusters are used.
In two-stage cluster sampling a random sampling technique is applied to the elements from each of the selected clusters.
One version of cluster sampling is area sampling or geographical cluster sampling, in which clusters consist of geographical areas. A geographically dispersed population can be expensive to survey. Greater economy than simple random sampling can be achieved by treating several respondents within a local area as a cluster.
Example: Suppose we want to conduct interviews with hotel managers in a major city about their training needs. We could decide that each hotel in the city represents one cluster, and then randomly select a small number, say 10. Then we can contact the managers of these 10 hotels for interview. When all the managers of the selected 10 hotels are interviewed, this is referred to as 'one-stage cluster sampling'. If the subjects to be interviewed are selected randomly within the selected clusters, it is called 'two-stage cluster sampling'. This technique might be more appropriate if the number of subjects within a unit is very large (e.g. if instead of interviewing managers, we want to interview employees).

Advantages of Cluster Sampling
1. The main objective of cluster sampling is to reduce costs, i.e. cluster sampling reduces field costs.
2. Applicable where no complete list of units is available (special lists need only be formed for the selected clusters).

Disadvantages of Cluster Sampling
1. Clusters may not be representative of the whole population but may be too alike.
2. Analysis is more complicated than for simple random sampling.

Difference between Cluster Sampling and Random Sampling
1. In simple random sampling single subjects are selected from the population, while in cluster sampling the subjects are selected in groups or clusters.
2. As compared to random sampling, cluster sampling is more evenly spread over the population.

Difference between Stratified and Cluster Sampling
1.
Unlike stratified sampling, in which we divide the heterogeneous population into homogeneous subsections (strata), in cluster sampling the clusters are thought of as being typical of the population rather than as subsections of it.
2. In stratified sampling subjects are selected randomly within strata, while in cluster sampling all units of the selected clusters are interviewed (one-stage cluster sampling).
3. In stratified sampling the strata should be homogeneous; there should be maximum homogeneity within strata. In cluster sampling the clusters should be as heterogeneous as possible; each cluster should be a small-scale version of the population. In other words there should be maximum heterogeneity within clusters and minimum between clusters.

Multistage Sampling
We can also combine cluster sampling with stratified sampling. For example, suppose we want to interview employees in randomly selected clusters of hotels (as in the above example of cluster sampling). We might stratify employees based on some characteristic (e.g. seniority, job function, etc.) and then randomly select employees from each of these strata. This type of sampling is referred to as Multistage Sampling.

Parameter and Statistic
The statistical constants of the population, viz. mean (μ), standard deviation (σ), etc., are usually referred to as parameters. In order to avoid verbal confusion with these, statistical measures computed from the sample observations alone, e.g. mean ($\bar{x}$), standard deviation (s), etc., are termed statistics.

Sampling Distribution
If we draw a sample of size n from a population of size N, then the total number of possible samples will be $^{N}C_{n}$ = k (say). For each of these k samples we compute the mean and standard deviation; there will then be k values of the mean as well as of the standard deviation. The set of values so obtained, one for each sample, is called the sampling distribution.
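
The idea of a sampling distribution can be made concrete by listing every possible sample from a very small population. The following is only a minimal sketch: the five population values and the sample size n = 2 are hypothetical, chosen so that all possible samples can be enumerated by hand.

```python
from itertools import combinations
from math import comb
from statistics import mean

# Hypothetical population of N = 5 units (illustrative values only)
population = [2, 4, 6, 8, 10]
n = 2  # sample size

# All possible samples of size n drawn without regard to order
samples = list(combinations(population, n))
k = len(samples)  # number of possible samples, equal to N-choose-n

# One sample mean per possible sample: together they form
# the sampling distribution of the mean
sample_means = [mean(s) for s in samples]

print(k, comb(len(population), n))
print(sorted(sample_means))
print(mean(sample_means), mean(population))
```

In this small case the k sample means average out to the population mean, which is one reason the sample mean is used to estimate it.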

Standard Error
The standard deviation of the sampling distribution of a statistic is known as its standard error (SE).
The standard errors of some well known statistics, for large samples, are given below, where n is the sample size, σ is the population standard deviation, P is the population proportion, Q = 1 – P, and n1 and n2 represent the sizes of two independent random samples respectively drawn from the population(s); subscripts 1 and 2 refer to the two populations.
Statistic                                                               Standard error
Sample mean $\bar{x}$                                                   $\sigma/\sqrt{n}$
Sample proportion p                                                     $\sqrt{PQ/n}$
Difference between two sample means ($\bar{x}_1 - \bar{x}_2$)           $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$
Difference between two sample proportions (p1 – p2)                     $\sqrt{P_1 Q_1/n_1 + P_2 Q_2/n_2}$

Utility of Standard Error
Standard error plays a very important role in large sample theory and forms the basis of testing of hypothesis.
The magnitude of the standard error gives an index of the precision of the estimate of the parameter. The reciprocal of the standard error is taken as the measure of reliability or precision of the statistic. Thus, in order to double the precision, which amounts to reducing the standard error to half, the sample size has to be increased four times.
SE enables us to determine the probable limits within which the population parameters may be expected to lie. The probable limits for the population proportion P are given by:
$p \pm 3\sqrt{pq/n}$

Confidence Limits based on Mean and Standard Error
95% confidence limits: Mean ± 2 SE
99% confidence limits: Mean ± 3 SE

Size of a Statistical Investigation
One question most commonly asked about the planning of a statistical study is how many observations should be made. In any review of this problem at the planning stage it is likely to be important to relate the sample size to a specified degree of precision.
Suppose we want to compare the means of two populations, μ1 and μ2, assuming that they have the same known standard deviation, σ, and that two equal samples of size 'n' are to be taken.
If the standard deviations are known to be different, the present result may be thought of as an approximation (taking σ to be the mean of the two values). If the comparison is of two proportions, π1 and π2, σ may be taken approximately to be the pooled value:
$\sigma = \sqrt{\tfrac{1}{2}\left[\pi_1(1 - \pi_1) + \pi_2(1 - \pi_2)\right]}$
We now consider two ways in which the precision may be specified.

Given Standard Error
Suppose it is required that the standard error of the difference between the observed means $\bar{x}_1$ and $\bar{x}_2$ is less than ε; equivalently, the width of the 95% confidence interval might be specified to be not wider than ± 2ε. This implies:
$\sigma\sqrt{2/n} < \varepsilon$, i.e. $n > 2\sigma^2/\varepsilon^2$

Given Difference to be Significant
We might require that if $\bar{x}_1 - \bar{x}_2$ is greater in absolute value than some value d0, then it shall be significant at some specified level (say at the two-sided 2α level). Denote by $u_{2\alpha}$ the standardized normal deviate exceeded in absolute value with probability 2α (for 2α = 0.05, $u_{2\alpha}$ = 1.96). Then:
$d_0 > u_{2\alpha}\,\sigma\sqrt{2/n}$, i.e. $n > 2\left(u_{2\alpha}\,\sigma/d_0\right)^2$

DESIGN OF EXPERIMENTS
While planning a clinical experiment to compare the effects of various treatments on some type of experimental unit, the problem is how the treatments should be allotted to these units. The allotment of treatments to experimental units should be such that the disparity between the characteristics of units receiving different treatments is eliminated. This cannot be eliminated completely, but it can be reduced if the groups of experimental units to which treatments are to be applied are made alike in various relevant respects. The three basic principles for doing this are:
1. Randomization
2. Replication
3. Local Control.

Randomization
In its simplest form randomization means that the choice of treatment for each unit should be made by an independent act of randomization (by the toss of a coin or by using a random number table). In clinical trials the total number of patients is often not known in advance, since many patients may become available for inclusion in the trial some time after it has started.
The simplest method is then to allocate treatment by an independent random choice for each patient.

Replication
An important principle of experimental design is replication, the use of more than one experimental unit for each treatment. Various purposes are served by replication:
(a) An appropriate amount of replication ensures that the comparisons between treatments are sufficiently precise; the sampling error of the difference between two means decreases as the amount of replication in each group increases.
(b) The effect of sampling variation can be estimated only if there is an adequate degree of replication. For example, in a comparison of the means of two groups, if both sample sizes were as low as 2, the degrees of freedom for a 't' test would be only 2. The critical points of 't' at 2 degrees of freedom are very high, and the test therefore loses a great deal in effectiveness merely because of the inadequacy of the estimate of within-group variation.
(c) Replication may be useful in enabling observations to be spread over a wide variety of experimental conditions.

Local Control
The third basic principle, local control, concerns the reduction of random variation between experimental units. The formula for the standard error of a mean, $\sigma/\sqrt{n}$, shows that the effect of random error can be reduced either by increasing 'n' (the number of replications) or by decreasing 'σ'. This suggests that experimental units should be as homogeneous as possible in their response to treatment. In clinical trials, for example, it may be that a precise comparison could be effected by restricting the age, sex, clinical condition and other features of the patients, but these restrictions may make it too difficult to generalize from the result. A useful solution to this dilemma is to subdivide the units into relatively homogeneous groups called blocks.
Treatments can then be allocated randomly within blocks so that each block provides a small experiment. The precision of the overall comparison between treatments is then determined by the random variability within blocks rather than between different blocks. This is called a Randomized Block Design.
There are some more complex designs that allow more than one set of treatments to be compared simultaneously, but they are beyond the scope of this book.

MULTIPLE CHOICE QUESTIONS
1. If the mean is 230 and the standard error is 10, the 95% confidence limits would be:
(a) 210 to 250 (b) 220 to 240 (c) 225 to 235 (d) 230 to 210 (AI, 89)
2. All of the following are examples of random sampling methods except:
(a) Stratified sampling (b) Quota sampling (c) Systematic sampling (d) Simple random sampling (AI, 96, AIIMS, 2000)
3. Area under 2 SD of normal curve is:
(a) 66% (b) 95% (c) 97% (d) 99% (AI, 93)
4. True regarding a "double blind" study:
(a) Participant is not aware of being in the study or control group (b) Neither the doctor nor the participant is aware of the group allocation and the treatment received (c) The participants, the investigator and the person analyzing the data are all blind (d) All the above (AI, 96)
5. Sampling error is:
(a) (b) (c) (d) None (AI, 2001)
[There are only two types of error in testing a hypothesis: α-error is Type-I error and β-error is Type-II error. Sampling error is inherent in a sample when population parameters are estimated on the basis of the samples drawn; proper sampling will reduce the sampling error].
6. Which is true in cluster sampling:
(a) Every nth case is chosen for study (b) Natural group is taken as sampling unit (c) Stratification of the population is done (d) Involves use of random number
[In cluster sampling, clusters are selected by natural demarcation and every unit of a selected cluster is taken as a sampling unit] (AIIMS, 92)
7.
The sampling method adopted for VIP coverage evaluation survey of a district is:
(a) Random sampling (b) Cluster sampling (c) Stratified sampling (d) Multistage sampling (JIPMER, 80, Orissa 91)
8. If, in a survey of a village, you divide the population into lanes and rows, select 5 lanes at random and survey all the houses of each selected lane, this is a type of:
(a) Simple random sampling (b) Stratified sampling (c) Systematic sampling (d) Cluster sampling
[Hint: In cluster sampling we divide the population into clusters according to geographical criteria and then take all units of the cluster; at least in first stage cluster sampling].
9. Simple random sampling. True is:
(a) Adjacent number is considered while taking sample (b) Each unit has an equal chance of being drawn in the sample (c) Each portion of sample represents a corresponding stratum of universe (d) None of the above (AIIMS, 2001)
10. For a survey, a village is divided into 5 lanes, then each lane is sampled randomly. It is an example of:
(a) Simple random sample (b) Stratified random sampling (c) Systematic random sampling (d) All of the above (AIIMS, 96)
11. True about simple random sampling is:
(a) All persons have equal right to be selected (b) Only selected persons have right to be selected (c) Technique provides least number of possible samples (d) Every fixed unit is taken for sampling (AIIMS, June 98)
12. If sample size is bigger in random sampling, which of the following is/are true:
(a) It approaches maximum samples (b) Reduces non-sampling error (c) Increases the precision of the result (d) Decreases standard error
[Hint: Precision is inversely proportional to standard error; to double the precision we have to reduce the standard error to half, which means increasing the sample size four times]. (AIIMS, June 99)
13.
In a random sample the chance of being picked up is:
(a) Same and known (b) Not same and not known (c) Same and not known (d) Not same but known
[Hint: If a sample of size 'n' is drawn from a population of size N the probability of selection of each unit is 1/N]. (AIIMS, Nov 99)
14. While calculating the incubation period for measles in a group of 25 children, the standard deviation is 2 and the mean incubation period is 8 days. Calculate the standard error:
(a) 0.4 (b) 1 (c) 2 (d) 0.5
15. In a population of pregnant females, Hb is estimated on 100 women with a standard deviation of 1 gm. The standard error is:
(a) 1 (b) 0.1 (c) 0.01 (d) 10 (AIIMS, Nov 2001)
16. In a controlled trial to compare two treatments, the main purpose of randomization is to ensure that:
(a) Two groups will be similar in prognostic factors (b) The clinician does not know which treatment the subjects will receive (c) The sample may be referred to a known population (d) The clinician can predict in advance which treatment the subjects will receive (AIIMS, 2002)
17. Mean hemoglobin of a sample of 100 pregnant women was found to be 10 mg% with a standard deviation of 1.0 mg%. The standard error of the estimate would be:
(a) 0.01 (b) 0.1 (c) 1.0 (d) 10.0 (AIIMS, 2004)
18. Which sampling method is used in assessing the immunization status of children under an immunization programme:
(a) Quota sampling (b) Multistage sampling (c) Stratified random sampling (d) Cluster sampling
[Hint: In cluster sampling we divide the population into small clusters which are representative of the population. Cluster sampling involves less time and cost]. (AIIMS, 2004)

Chapter 8 Testing of Hypothesis

Statistical Hypothesis
A statement about a population which we want to verify on the basis of information available from a sample.

Test of a Statistical Hypothesis
It is a two-action decision problem after the experimental sample values have been obtained, the two actions being acceptance or rejection of the hypothesis under consideration.

Null Hypothesis
The null hypothesis is the hypothesis of no difference, and is usually denoted by H0.

Alternative Hypothesis
Every statistical hypothesis is tested to see whether the null hypothesis is accepted or rejected. This is meaningful only when it is tested against a rival hypothesis, called the alternative hypothesis and denoted by H1.
Wrongly rejecting a null hypothesis seems to be a more serious error than wrongly accepting it.

Critical Region
Let x1, x2, ..., xn be the sample observations, denoted by 'O'. All the possible values of 'O' form the aggregate of samples and constitute a space called the sample space.
We consider (x1, x2, ..., xn) as a point in the n-dimensional sample space. We divide the sample space into two disjoint parts ω and $\bar{\omega}$. We reject the null hypothesis H0 if the observed sample point falls in ω. The region ω is known as the critical region.
Figure 8.1

Types of Errors
Table relating the decision to the hypothesis:
True statement    Decision from sample: Accept H0    Decision from sample: Reject H0
H0 true           Correct                            Wrong (Type-I error)
H0 false          Wrong (Type-II error)              Correct
The probabilities of Type-I and Type-II errors are denoted by α and β respectively.
α = Probability of Type-I error, i.e. probability of rejecting H0 when it is true.
β = Probability of Type-II error, i.e. probability of accepting H0 when H0 is false.

Level of Significance
α, the probability of Type-I error, is known as the level of significance. It is also called the size of the critical region.

Power of Test
(1 – β) is called the power of the test for testing the hypothesis H0 against the alternative hypothesis H1.
Type-I error is deemed to be more serious than Type-II error.
The usual practice is therefore to control Type-I error at a predetermined level and to choose a test which minimizes β.

Steps in Solving Testing of Hypothesis Problem
1. Obtain explicit knowledge about the nature of the population about which the hypotheses are set up.
2. Set up the null and alternative hypotheses.
3. Choose a suitable statistic, called the test statistic, which will reflect the probability of H0 and H1.
4. On the basis of the test statistic, reject or accept the null hypothesis.

Test of Significance
A very important aspect of sampling theory is the study of tests of significance, which enable us to decide, on the basis of sample results, whether
(i) the deviation between the observed sample statistic and the hypothetical parameter value, or
(ii) the deviation between two independent sample statistics
is significant or might be attributed to chance, i.e. to fluctuations of sampling.

One Tailed and Two Tailed Tests
In any test, the critical region is represented by a portion of the area under the probability curve of the sampling distribution of the test statistic.

A test of a statistical hypothesis in which the alternative hypothesis is one tailed (right tailed or left tailed) is called a one tailed test. For example, testing the mean of a population
$$H_0 : \mu = \mu_0$$
against the alternative
$$H_1 : \mu > \mu_0 \ \text{(right tailed)} \quad \text{or} \quad H_1 : \mu < \mu_0 \ \text{(left tailed)}$$
is a one tailed test. A test in which the alternative hypothesis is two tailed, such as
$$H_0 : \mu = \mu_0$$
against the alternative
$$H_1 : \mu \neq \mu_0 ,$$
is called a two tailed test.

Critical Values or Significant Values
The value of the test statistic which separates the critical region (rejection region) from the acceptance region is called the critical value or significant value. It depends upon:
(i) the level of significance used;
(ii) the alternative hypothesis, whether it is two tailed or single tailed.

Suppose that Zα is the critical value of the test statistic at level of significance α. The value of Zα for a two tailed test is such that the area to the left of –Zα is α/2 and the area to the right of Zα is also α/2. Thus, the total area α is divided into two equal parts.

Two Tailed Test (Level of Significance α)
Figure 8.2

In the case of a single tailed test, the critical value Zα is determined so that the total area to the right of Zα (for a right tailed test) is α, and for a left tailed test the total area to the left of –Zα is α.

Figure 8.3
Figure 8.4

Thus, the critical value of Z for a single tailed test (left or right) at level of significance α is the same as the critical value of Z for a two tailed test at level of significance 2α.

Critical values (Zα) of Z

Critical values (Zα)    Level of significance
                        1%             5%             10%
Two tailed test         |Zα| = 2.58    |Zα| = 1.96    |Zα| = 1.64
Right tailed test       Zα = 2.33      Zα = 1.64      Zα = 1.28
Left tailed test        Zα = –2.33     Zα = –1.64     Zα = –1.28

TEST OF SIGNIFICANCE FOR LARGE SAMPLES
For large values of n, almost all distributions are very closely approximated by the normal distribution. Thus we can apply the normal test, which is based upon the fundamental area property of the normal probability curve.
1. Compute the test statistic Z under H0.
2. If |Z| > 3, H0 is always rejected.
3. If |Z| ≤ 3, we test its significance at a certain level of significance, usually at the 5% and sometimes at the 1% level. Thus for a two tailed test, if |Z| > 1.96, H0 is rejected at the 5% level of significance. Similarly, if |Z| > 2.58, H0 is rejected at the 1% level of significance.
For practical purposes, a sample may be regarded as large if n > 30.

Sampling of Attributes
In sampling of attributes the population is divided into two mutually exclusive classes: one class possessing a particular attribute, say A, and the other class not possessing that attribute (not-A). The presence of the attribute in a sampling unit may be termed a success and its absence a failure.

Test for Single Proportion
If x is the number of successes in n independent trials with constant probability of success P, then the observed proportion of successes is p = x/n, and the standard error of p is
$$SE(p) = \sqrt{\frac{PQ}{n}}, \quad \text{where } Q = 1 - P.$$
The test statistic for large n is then
$$Z = \frac{p - P}{\sqrt{PQ/n}} \sim N(0, 1)$$
under the null hypothesis that the sample proportion is equal to the population proportion, i.e. that the sample is drawn from a population with proportion of success P.

The probable limits for the observed proportion of success are
$$P \pm 3\sqrt{\frac{PQ}{n}}.$$
If P is not known, then taking p (the sample proportion) as an estimate of P, the probable limits for the proportion in the population are
$$p \pm 3\,SE(p), \quad \text{i.e. } p \pm 3\sqrt{\frac{pq}{n}}, \quad \text{where } q = 1 - p.$$
In particular, the 95% confidence limits for P are $p \pm 1.96\sqrt{pq/n}$, and the 99% confidence limits for P are $p \pm 2.58\sqrt{pq/n}$.

TEST OF SIGNIFICANCE FOR DIFFERENCE OF PROPORTION
Let x1 and x2 be the numbers of persons possessing a certain characteristic (attribute), say A, in random samples of sizes n1 and n2 from two populations respectively. Then the sample proportions are given by
$$p_1 = \frac{x_1}{n_1}, \qquad p_2 = \frac{x_2}{n_2}.$$
If P1 and P2 are the population proportions, then under the null hypothesis H0 : P1 = P2 = P, the test statistic for the difference of proportions is
$$Z = \frac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim N(0, 1).$$
Generally we do not have any information about the proportion of A in the population. In such circumstances an estimate of the population proportion under the null hypothesis H0 : P1 = P2 = P (say) is calculated:
$$\hat{P} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}, \qquad \hat{Q} = 1 - \hat{P}.$$
Then the test statistic is
$$Z = \frac{p_1 - p_2}{\sqrt{\hat{P}\hat{Q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}.$$

Solved Examples
Test for Single Proportion
QUESTION: Thirty people were attacked by a viral disease in a village and 28 survived. The survival rate of this viral infection is reported to be 85%. Test whether the survival rate in this village is more than the reported survival rate at the 5% level of significance.
SOLUTION:
Setting of Hypothesis
Null hypothesis: The survival rate in this village is equal to the reported survival rate, i.e.
H0 : P = 0.85
Alternative hypothesis: The survival rate in this village is more than 85%, i.e. H1 : P > 0.85 (one tailed test).
Total number of persons who survived: x = 28
Total number of persons attacked by the infection: n = 30
Proportion of persons who survived: p = x/n = 28/30 = 0.93
The reported survival rate is 85%, i.e. P = 0.85; therefore Q = 1 – 0.85 = 0.15.
The Test Statistic:
$$Z = \frac{p - P}{\sqrt{PQ/n}} \sim N(0, 1)$$
$$Z = \frac{0.93 - 0.85}{\sqrt{\dfrac{0.85 \times 0.15}{30}}} = \frac{0.08}{\sqrt{0.0042}} = \frac{0.08}{0.064} = 1.25$$
(Without intermediate rounding the value is about 1.28, which leads to the same decision.)
Tabulated value of Z at the 0.05 level (i.e. the critical value) = 1.64 (for a one tailed test). Because Zcal < Ztab, the null hypothesis is accepted.
Conclusion: The survival rate in the village is not significantly more than the reported survival rate.

Test of Significance of Difference of Proportion (When Population Proportion is not Known)
QUESTION: In a survey conducted by a health agency, it was found that in Town A, out of 876 births, 45% were male, while in Town B, out of 690 births, 473 were male. Is there any significant difference in the proportion of male children in the two towns?
SOLUTION:
Proportion of male children in Town A: p1 = 0.45; therefore q1 = 1 – p1 = 1 – 0.45 = 0.55.
Total number of births in Town A is 876, i.e. n1 = 876.
In Town B, out of 690 births 473 were male; therefore p2 = 473/690 ≈ 0.68, q2 = 1 – 0.68 = 0.32 and n2 = 690.
Setting of Hypothesis
Null hypothesis: There is no significant difference between the proportions of male children in the two towns, i.e. H0 : P1 = P2
Alternative hypothesis: H1 : P1 ≠ P2 (two tailed test).
Because the population proportion is not known, we have to estimate it from the sample proportions:
$$\hat{P} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{876 \times 0.45 + 690 \times 0.68}{876 + 690} = 0.55; \quad \text{therefore } \hat{Q} = 1 - 0.55 = 0.45.$$
Test statistic:
$$Z = \frac{p_1 - p_2}{\sqrt{\hat{P}\hat{Q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.45 - 0.68}{\sqrt{0.55 \times 0.45 \left(\dfrac{1}{876} + \dfrac{1}{690}\right)}} = \frac{-0.23}{\sqrt{0.2475 \times 0.00259}} = \frac{-0.23}{0.0253} = -9.09$$
so that |Zcal| = 9.09.
Critical value of Z at the 5% level of significance (for a two tailed test) = 1.96, which is less than |Zcal|. Thus the null hypothesis is rejected.
Conclusion: There is a significant difference between the proportions of male births in the two towns.

Test of Significance for Single Mean
If x1, x2, ..., xn is a random sample from a normal population with mean μ and SD σ, then for large samples the statistic
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1)$$
under the null hypothesis that the sample is drawn from the population with mean μ. If the population standard deviation is unknown, then we use the sample standard deviation as an estimate of σ.
Confidence limits for μ: the 95% confidence limits for μ are $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ and the 99% confidence limits for μ are $\bar{x} \pm 2.58\,\sigma/\sqrt{n}$.

Test of Significance for Difference of Means
Let $\bar{x}_1$ be the mean of a random sample of size n1 from a population with mean μ1 and SD σ1, and let $\bar{x}_2$ be the mean of an independent random sample of size n2 from another population with mean μ2 and SD σ2. Under the null hypothesis H0 : μ1 = μ2, the test statistic becomes (for large samples)
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1).$$
Remarks:
1. If σ1² = σ2² = σ², i.e. the samples have been drawn from populations with a common SD σ, then under H0 : μ1 = μ2
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim N(0, 1).$$
2. If σ is not known, then its estimate based on the sample variances is used. The unbiased estimate of σ² is given by
$$\hat{\sigma}^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}.$$
3. If σ1² ≠ σ2² and σ1 and σ2 are not known, then they can be estimated from the samples. This results in some error, which will be very small and can be ignored if the samples are large. The estimates for large samples are given by $\hat{\sigma}_1^2 = S_1^2$ and $\hat{\sigma}_2^2 = S_2^2$. In this case the test statistic is
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \sim N(0, 1).$$
However, if the sample sizes are small, then a small sample test (the 't'-test for difference of means) should be used.

Solved Example
Test of Significance for Single Mean
QUESTION: A sample of 900 individuals has a mean haemoglobin of 12.7 mg%. Is the sample drawn from a population with mean 13.6 mg% and SD 2.70?
SOLUTION:
Setting of Hypothesis
Null hypothesis: The sample is drawn from the population with mean 13.6, i.e. H0 : μ = 13.6.
Alternative hypothesis: H1 : μ ≠ 13.6 (two tailed test).
The Test Statistic:
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{12.7 - 13.6}{2.70 / \sqrt{900}} = \frac{-0.9}{2.70 / 30} = \frac{-0.9}{0.09} = -10, \quad \text{so } |Z| = 10.$$
Critical value of Z at the 5% level of significance (for a two tailed test) = 1.96, i.e. Ztab = 1.96, which is less than the calculated value |Z| = 10. Hence we reject the null hypothesis.
Conclusion: The sample cannot be regarded as drawn from a population with mean haemoglobin level 13.6 and SD 2.70.

Test of Significance for Difference of Mean
QUESTION: Random samples were drawn from two hospitals and the following data on the blood pressure of adult male hospital workers were obtained:

                       Hospital A      Hospital B
Mean blood pressure    127.56 mmHg     140.78 mmHg
Standard deviation     10.37 mmHg      13.77 mmHg
No. of cases           700             360

Is the blood pressure of male workers of Hospital B significantly higher than that of those working in Hospital A?
SOLUTION:
Setting of Hypothesis
Null hypothesis: There is no significant difference between the blood pressure of workers in the two hospitals, i.e. H0 : μ1 = μ2
Alternative hypothesis: H1 : μ2 > μ1 (one tailed test).
Test statistic:
$$Z = \frac{\bar{x}_2 - \bar{x}_1}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$
In this example $\bar{x}_1$ = 127.56, S1 = 10.37, n1 = 700; $\bar{x}_2$ = 140.78, S2 = 13.77 and n2 = 360. Putting these values in the test statistic,
$$Z = \frac{13.22}{\sqrt{0.153 + 0.526}} = \frac{13.22}{0.82} = 16.12$$
The calculated value of Z is much higher than the tabulated value of Z (1.64 for a one tailed test at the 5% level). Thus we can reject the null hypothesis.
Conclusion: The difference in the mean values of blood pressure of workers of the two hospitals is highly significant. Thus we can say that the mean blood pressure of workers in Hospital B is significantly higher than that of those working in Hospital A.

EXACT SAMPLING DISTRIBUTION
Chi-Square Distribution (χ²-Distribution)
The square of a standard normal variate is known as a Chi-Square variate with 1 degree of freedom. If x ~ N(μ, σ²), then $Z = \dfrac{x - \mu}{\sigma}$ is a standard normal variate, and $Z^2 = \left(\dfrac{x - \mu}{\sigma}\right)^2$ is a Chi-Square variate with 1 degree of freedom.
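The link between a standard normal variate and a Chi-Square variate with 1 degree of freedom can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it relies on the standard identity P(|Z| > z) = erfc(z/√2) for a standard normal Z. It shows that the two tailed 5% critical value of Z, when squared, gives the 5% critical value of Chi-Square with 1 degree of freedom.

```python
import math

def normal_two_tail(z):
    """P(|Z| > z) for a standard normal variate Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def chi_square_1df_upper_tail(x):
    """P(X > x) for a Chi-Square variate X with 1 degree of freedom.

    Because X = Z**2, the event X > x is the same event as |Z| > sqrt(x).
    """
    return normal_two_tail(math.sqrt(x))

z_crit = 1.96              # two tailed 5% critical value of Z
chi_crit = z_crit ** 2     # corresponding Chi-Square (1 df) value

print(round(chi_crit, 2))                             # 3.84
print(round(chi_square_1df_upper_tail(chi_crit), 3))  # 0.05
```

This is why a two tailed Z test and a Chi-Square test with 1 degree of freedom on the same data lead to the same decision.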
In general, if xi (i = 1, 2, ..., n) are n independent normal variates with means μi and variances σi² (i = 1, 2, ..., n), then
$$\chi^2 = \sum_{i=1}^{n} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2$$
is a Chi-Square variate with n degrees of freedom.
Remarks:
1. For n = 1, the χ²-variate is simply the square of a standard normal variate.
2. The χ²-distribution tends to the normal distribution for large degrees of freedom. In practice, for n > 30 the approximation to the normal distribution is fairly good.

Degree of Freedom
The number of independent variates which make up the statistic (e.g. χ²) is known as the degrees of freedom and is usually denoted by ν (nu). In general, the number of degrees of freedom is the total number of observations less the number of independent constraints. In a set of n observations the degrees of freedom (df) for χ² are usually (n – 1), because of one linear constraint on the frequencies.

Mean and Standard Deviation of χ²-Distribution
The mean and SD of the χ²-distribution with n degrees of freedom are n and $\sqrt{2n}$ respectively.

Mode and Skewness of χ²-Distribution
The mode of the χ²-distribution with n degrees of freedom is (n – 2).
$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{SD}} = \frac{n - (n - 2)}{\sqrt{2n}} = \sqrt{\frac{2}{n}}$$
Skewness is greater than zero for n ≥ 1; thus the χ²-distribution is positively skewed. Further, since skewness is inversely proportional to the square root of the df, the distribution rapidly tends to symmetry as the df (n) increases.

Figure 8.5

For n = 2 the curve meets the y = f(x) axis at x = 0, i.e. at f(x) = 0.5. For n = 1, it is an inverted J-shaped curve.

Conditions for the Validity of χ²-Test
For the validity of the Chi-Square test of "goodness of fit" between theory and experiment, the following conditions must be satisfied:
1. Sample observations should be independent.
2. N, the total frequency, should be reasonably large, say greater than 50.
3. No theoretical cell frequency should be less than 5.
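Conditions 2 and 3 above can be screened mechanically before a Chi-Square test is run. The following is a minimal sketch, not a prescribed procedure: the function name and the example frequencies are hypothetical, and the thresholds (50 and 5) are simply the ones quoted in the list.

```python
def chi_square_conditions_met(expected, min_total=50, min_expected=5):
    """Check conditions 2 and 3 for a list of expected cell frequencies.

    Independence of the observations (condition 1) depends on the study
    design and cannot be checked from the frequencies alone.
    """
    total = sum(expected)
    large_enough = total > min_total                      # condition 2
    cells_ok = all(e >= min_expected for e in expected)   # condition 3
    return large_enough and cells_ok

# Hypothetical expected frequencies, for illustration only.
print(chi_square_conditions_met([20.0, 30.0, 25.0, 25.0]))  # True
print(chi_square_conditions_met([20.0, 30.0, 3.0, 47.0]))   # False: one cell below 5
print(chi_square_conditions_met([10.0, 10.0, 10.0]))        # False: total not above 50
```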
Critical Values
Figure 8.6
The value χ²n(α), known as the upper (right-tailed) α-point or critical value, can be obtained from the χ² table for different values of n and α. The value of χ²n(α) increases as n (df) increases and as the level of significance α decreases.

Application of χ²-Distribution
The χ²-distribution has a large number of applications, two of which are: (1) to test 'goodness of fit' and (2) to test the independence of attributes.

1. Goodness of fit: This is a very powerful test of the significance of the discrepancy between theory and experiment. It enables us to find whether the deviation of the experiment from theory is just due to chance or is really due to the inadequacy of the theory to fit the observed data. If Oi (i = 1, 2, ..., n) is the set of observed (experimental) frequencies and Ei (i = 1, 2, ..., n) the corresponding set of expected (theoretical or hypothetical) frequencies, then Chi-Square is given by
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i},$$
which follows a χ²-distribution with (n – 1) degrees of freedom.

2. Independence of attributes:
Four-fold Classification
Comparison of two proportions (2 × 2 contingency table): An alternative method of representing two proportions is a 2 × 2 contingency table or fourfold classification. The total frequency, or grand total, is split into two dichotomies represented by the two horizontal rows and the two vertical columns. There are four combinations (2 × 2) of row and column categories, and the corresponding frequencies occupy the four inner cells of the body of the table. The comparison can be done by applying the significance test discussed under the comparison of several proportions.
The 2 × 2 contingency table is described as:

            Group 1      Group 2      Group 1 + Group 2
Positive    r1           r2           R (= r1 + r2)
Negative    n1 – r1      n2 – r2      N – R
Total       n1           n2           N (= n1 + n2)

Manifold Classification
Comparison of several proportions (2 × k contingency table): The comparison of two proportions was considered from two points of view: the sampling error of the difference of proportions, and the significance test.
When more than two proportions are compared, the calculation of standard errors between pairs of proportions requires several comparisons, and an undue number of significant differences may arise. The χ² test provides a method by which we can compare several proportions together.
Suppose there are k groups of observations and that in the ith group ni individuals have been observed, of whom ri show a certain characteristic (say, being positive). The proportion of positives, ri/ni, is denoted by pi. The data may be described as follows:

Group                  1          2          ...   i          ...   k          All groups
Positive               r1         r2         ...   ri         ...   rk         R
Negative               n1 – r1    n2 – r2    ...   ni – ri    ...   nk – rk    N – R
Total                  n1         n2         ...   ni         ...   nk         N
Proportion positive    p1         p2         ...   pi         ...   pk         P = R/N

The frequencies form a 2 × k contingency table (there being 2 rows and k columns).
The χ² test requires, for each observed frequency Oi, an expected frequency Ei, which is calculated by the formula
$$E = \frac{\text{row total} \times \text{column total}}{N}.$$
The quantity $(O_i - E_i)^2 / E_i$ is calculated for each cell and finally
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}.$$
The summation is over the 2k cells in the table. On the null hypothesis that all k samples are drawn randomly from populations with the same proportion of positives, χ² is distributed approximately as Chi-Square with (k – 1)(2 – 1) = (k – 1) df.

General Contingency Table (r × s)
Let us consider two attributes A and B. A is divided into r classes A1, A2, ..., Ar and B is divided into s classes B1, B2, ..., Bs. The cell frequencies can be expressed as an (r × s) manifold contingency table.
        A1         A2         A3         ...   Ar
B1      (A1B1)     (A2B1)     (A3B1)     ...   (ArB1)
B2      (A1B2)     (A2B2)     (A3B2)     ...   (ArB2)
B3      (A1B3)     (A2B3)     (A3B3)     ...   (ArB3)
...     ...        ...        ...        ...   ...
Bs      (A1Bs)     (A2Bs)     (A3Bs)     ...   (ArBs)

(AiBj) is the number of persons possessing both the attributes Ai and Bj [i = 1, 2, ..., r; j = 1, 2, ..., s]. Also,
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{s} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$
where Oij is the observed frequency in column i and row j, and Eij is the corresponding expected frequency. Under the null hypothesis that the attributes are independent, this statistic is distributed as a χ²-variate with (r – 1)(s – 1) degrees of freedom.

SOLVED EXAMPLE
Fourfold Contingency Table
Comparison of Two Proportions (2 × 2 Contingency Table)
The same question considered while testing the difference of proportions can also be expressed as follows:

                Town A    Town B    Total
Male            394       473       867
Female          482       217       699
Total births    876       690       1566

Two proportions can also be compared by applying the χ² test.
Setting of Hypothesis
Null hypothesis: There is no significant difference between the proportions of male births in the two towns.
The test statistic is:
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where Oi are the observed values and Ei are the expected values. In this example there are four observed values: two for males, corresponding to Town A and Town B, and two for females (i.e. 394, 473, 482 and 217 respectively). The expected values for these four observed values are calculated as follows:
$$E(394) = \frac{867 \times 876}{1566} = 484.98$$
Similarly:
$$E(473) = \frac{867 \times 690}{1566} = 382.01, \quad E(482) = \frac{699 \times 876}{1566} = 391.01, \quad E(217) = \frac{699 \times 690}{1566} = 307.98$$
$$\chi^2 = \frac{(394 - 484.98)^2}{484.98} + \frac{(473 - 382.01)^2}{382.01} + \frac{(482 - 391.01)^2}{391.01} + \frac{(217 - 307.98)^2}{307.98}$$
$$\chi^2 = 17.06 + 21.67 + 21.17 + 26.87 = 86.77$$
The calculated value of χ² is much more than the tabulated value of χ² at (2 – 1) × (2 – 1) = 1 degree of freedom (3.84 at the 5% level of significance). Hence we reject the null hypothesis.
Conclusion: The proportion of male births in the two towns is not the same.
In Town B the proportion of male births is much higher than in Town A.

Manifold Contingency Table
Comparison of Several Proportions: The 2 × k Contingency Table
QUESTION: The following table shows the persons suffering from respiratory illness in different groups:

                                   Children    Adolescents    Adult    Elderly people    Total
Presence of respiratory illness    76          47             65       79                267
Absence                            54          67             89       46                256
Total                              130         114            154      125               523

Find out whether the proportion of persons suffering from respiratory illness is the same in the different categories.
SOLUTION: In the above table there are eight observed values, corresponding to four columns and two rows. Therefore this is a (2 × 4) contingency table.
The expected values corresponding to each observed value are calculated as follows:
$$E(76) = \frac{267 \times 130}{523} = 66.36; \quad E(47) = \frac{267 \times 114}{523} = 58.19$$
$$E(65) = \frac{267 \times 154}{523} = 78.61; \quad E(79) = \frac{267 \times 125}{523} = 63.81$$
$$E(54) = \frac{256 \times 130}{523} = 63.63; \quad E(67) = \frac{256 \times 114}{523} = 55.80$$
$$E(89) = \frac{256 \times 154}{523} = 75.38; \quad E(46) = \frac{256 \times 125}{523} = 61.19$$
$$\chi^2 = \frac{(76 - 66.36)^2}{66.36} + \frac{(47 - 58.19)^2}{58.19} + \frac{(65 - 78.61)^2}{78.61} + \frac{(79 - 63.81)^2}{63.81} + \frac{(54 - 63.63)^2}{63.63} + \frac{(67 - 55.80)^2}{55.80} + \frac{(89 - 75.38)^2}{75.38} + \frac{(46 - 61.19)^2}{61.19} = 19.46$$
The critical value of χ² at (2 – 1) × (4 – 1) = 3 degrees of freedom and the 5% level of significance is χ²tab = 7.81. Hence χ²tab is less than the calculated value of χ²; therefore, we reject the null hypothesis.
Conclusion: The incidence of respiratory illness in the different groups is not the same.

Exact Sampling Distribution
The sampling theory discussed so far was based on the application of the normal test. However, if the sample size n is small, the normal test cannot be applied. For such cases exact sample tests were developed. Some of these tests are:
1. t-test;
2. F-test;
3. Fisher Z transformation.
The exact sample tests can, however, be applied to large samples also, though the converse is not true.
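The contingency-table calculations in the two solved Chi-Square examples above (towns and respiratory illness) can be reproduced with a short routine. This is a sketch under stated assumptions: the function name is hypothetical, expected frequencies are taken as (row total × column total)/N as in the text, and the 5% critical value for 3 degrees of freedom (7.815) is the standard table value, entered here by hand rather than computed.

```python
def chi_square_statistic(table):
    """Chi-Square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows of observed frequencies. Each expected
    frequency is (row total x column total) / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_sq += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi_sq, df

# Respiratory illness example (2 x 4 table).
observed = [
    [76, 47, 65, 79],   # respiratory illness present
    [54, 67, 89, 46],   # respiratory illness absent
]
chi_sq, df = chi_square_statistic(observed)
CRITICAL_5_PERCENT_3_DF = 7.815   # standard table value (assumed, not computed)

print(round(chi_sq, 2), df)               # 19.46 3
print(chi_sq > CRITICAL_5_PERCENT_3_DF)   # True, so H0 is rejected
```

Applied to the 2 × 2 table of male and female births, `chi_square_statistic([[394, 473], [482, 217]])` gives a value of about 86.8 with 1 degree of freedom, in line with the hand calculation.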
In all the exact sample tests, the basic assumption is that "the population(s) from which the sample(s) are drawn is (are) normal".

Student's 't' Distribution
Let xi (i = 1, 2, ..., n) be a random sample of size n from a normal population with mean μ and variance σ². Then Student's t is defined by the statistic
$$t = \frac{\bar{x} - \mu}{S / \sqrt{n}},$$
where $\bar{x} = \dfrac{\sum x_i}{n}$ and $S^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$ is the unbiased estimate of the population variance σ².

Application of 't'-Distribution
The 't'-distribution has a wide range of applications, some of which are:
1. To test whether the sample mean $\bar{x}$ differs significantly from the hypothetical value μ of the population mean.
2. To test the significance of the difference between two sample means.
3. To test the significance of an observed sample correlation coefficient.

Assumptions for Student's 't' Test
1. The parent population from which the sample is drawn is normal.
2. The sample observations are independent, i.e. the sample is random.
3. The population SD σ is unknown.

't'-Test for Single Mean
If x1, x2, ..., xn is a random sample drawn from a population with a specified mean μ0, then under the null hypothesis H0 : μ = μ0 the statistic
$$t = \frac{\bar{x} - \mu_0}{S / \sqrt{n}}, \quad \text{where } S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1},$$
follows a 't' distribution with (n – 1) degrees of freedom. If the calculated |t| > tabulated t, the null hypothesis is rejected at the level of significance adopted.

't'-Test for Difference of Means
Suppose we want to test whether
(a) two samples xi (i = 1, 2, ..., n1) and yj (j = 1, 2, ..., n2) have been drawn from populations with the same mean, or
(b) the two sample means $\bar{x}$ and $\bar{y}$ differ significantly or not.
Under the null hypothesis that
(a) the samples have been drawn from populations with the same mean, i.e. μx = μy, or
(b) the sample means $\bar{x}$ and $\bar{y}$ do not differ significantly,
the test statistic
$$t = \frac{\bar{x} - \bar{y}}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \quad \text{where } S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2},$$
follows a 't' distribution with (n1 + n2 – 2) degrees of freedom.
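The pooled 't' statistic defined above can be sketched in a few lines. The function name and the two small data sets are hypothetical and chosen only so that the arithmetic is easy to follow by hand; the sums of squared deviations computed in the code are (n1 – 1)S1² and (n2 – 1)S2², so their sum divided by (n1 + n2 – 2) is the pooled variance S².

```python
import math

def pooled_t(x, y):
    """Two-sample 't' statistic with pooled variance, and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    mean_x = sum(x) / n1
    mean_y = sum(y) / n2
    ss_x = sum((v - mean_x) ** 2 for v in x)   # equals (n1 - 1) * S1^2
    ss_y = sum((v - mean_y) ** 2 for v in y)   # equals (n2 - 1) * S2^2
    pooled_var = (ss_x + ss_y) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean_x - mean_y) / std_error, n1 + n2 - 2

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
t, df = pooled_t(x, y)
print(round(t, 3), df)   # -1.897 8
```

Here the pooled variance is (10 + 40)/8 = 6.25, so S = 2.5 and t = (3 – 6)/(2.5 × √0.4). The calculated |t| would then be compared with the tabulated 't' at 8 degrees of freedom.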
Assumptions of 't'-Test for Difference of Means
1. The parent populations from which the samples have been drawn are normally distributed.
2. The population variances are equal and unknown, i.e. σx² = σy² = σ².
3. The two samples are random and independent of each other.

Paired 't'-Test for Difference of Means
The paired 't'-test is applied when
(i) the sample sizes are equal, and
(ii) the two samples are not independent but the sample observations are paired together, the pair of observations (xi, yi) (i = 1, 2, ..., n) corresponding to the ith unit of the sample.
Here, instead of applying the test for difference of means, we consider the increments di = xi – yi.
Under the null hypothesis H0 : μd = 0, i.e. that the increments are due to fluctuations of sampling, the test statistic
$$t = \frac{\bar{d}}{S / \sqrt{n}}, \quad \text{where } \bar{d} = \frac{\sum d_i}{n} \text{ and } S^2 = \frac{\sum (d_i - \bar{d})^2}{n - 1},$$
follows a 't' distribution with (n – 1) degrees of freedom.

't'-Test for Testing Significance of Correlation Coefficient
If r is the observed correlation coefficient in a sample of n pairs of observations from a bivariate normal population, then under the null hypothesis that the population correlation coefficient is zero, the test statistic
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
follows a Student's 't' distribution with (n – 2) degrees of freedom. If t comes out to be significant, then we reject H0.

SOLVED EXAMPLES
Test of Significance for Single Mean (For Small Sample)
QUESTION: A random sample of 10 students has the following IQs: 67, 110, 115, 75, 63, 117, 120, 115, 100 and 94. Do these data support the assumption that the sample is drawn from a population of medical students with IQ = 100?
SOLUTION:
Setting of hypothesis
Null hypothesis: The sample is drawn from a population of medical students with IQ = 100, i.e. H0 : μ = 100.
Alternative hypothesis: H1 : μ ≠ 100 (two tailed test)
The Test Statistic is:
$$t = \frac{\bar{x} - 100}{S / \sqrt{n}}, \quad \text{where } S = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$$
and S² is an unbiased estimate of σ².
From the above data we can calculate the mean and the SD S, which are equal to:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{976}{10} = 97.6 \quad \text{and} \quad S = \sqrt{\frac{99558 - 10(97.6)^2}{10 - 1}} = \sqrt{\frac{99558 - 95257.6}{9}} = 21.85$$
Putting these values in the test statistic, we can calculate the value of 't':
$$t = \frac{97.6 - 100}{21.85 / \sqrt{10}} = \frac{-2.4}{21.85 / 3.16} = \frac{-2.4}{6.91} = -0.34$$
The tabulated value of 't' at (n – 1) = 9 degrees of freedom and the 5% level of significance is 2.26. The tabulated value of 't' is more than the calculated value |t| = 0.34; hence we accept the null hypothesis.
Conclusion: The sample is drawn from the population of medical students with IQ = 100.

't' Test for Difference of Mean between Two Independent Groups
QUESTION: Two groups of rats were placed on diets with high and low protein contents and the gain in weight was recorded after 2 months. The gains in weight are as follows:
Group A (high protein diet): 140 146 117 160 107 102 123 114 145 121 127 132 107 153
Group B (low protein diet): 97 120 63 110 115 120 120 150 96 74 86
Find out whether there is any significant difference between the weight gains of rats in the two groups.
SOLUTION:
Setting of hypothesis
Null hypothesis: H0 : μ1 = μ2; and
Alternative hypothesis: H1 : μ1 ≠ μ2
The mean and SD of the two groups can be calculated, and are equal to:
Group A: n1 = 14; $\bar{x}_1$ = 128.11 and S1 = 18.33
Group B: n2 = 11; $\bar{x}_2$ = 104.63 and S2 = 24.68
The Test Statistic:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
where S² is the pooled estimate of variance and is equal to
$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
In this problem S² = 454.73 (by putting the values of n1, n2, S1 and S2 in the above formula). Thus $S = \sqrt{454.73} = 21.32$.
The test statistic will be equal to:
$$t = \frac{128.11 - 104.63}{21.32\sqrt{\dfrac{1}{14} + \dfrac{1}{11}}} = \frac{23.48}{21.32\sqrt{0.071 + 0.091}} = \frac{23.48}{8.58} = 2.74$$
The tabulated value of 't' at (n1 + n2 – 2) degrees of freedom, i.e. 23 df, and the 5% level of significance is 2.07, which is less than the calculated value of 't'. Hence, we reject the null hypothesis.
Conclusion: The weight gain of rats in Group A (high protein diet) is significantly more than that of rats on the low protein diet.

Paired 't' Test for Difference of Mean
QUESTION: In a clinical trial the anxiety scores of 10 patients were recorded (baseline values). A new tranquillizer was given to each patient for one month. After one month the anxiety scores were recorded again. The scores are as follows:

Case number                1    2    3    4    5    6    7    8    9    10
Baseline values (xi)       23   21   24   19   17   26   22   17   12   15
After one month (yi)       15   20   26   17   17   21   16   12   12   11

Find out whether the new tranquillizer is effective in psychoneurotic patients.
SOLUTION:
Setting of hypothesis
Null hypothesis: There is no difference in mean anxiety score, i.e. H0 : μ1 = μ2
Alternative hypothesis: H1 : μ1 ≠ μ2
The Test Statistic:
$$t = \frac{\bar{d}}{S / \sqrt{n}}$$
where di = xi – yi, $\bar{d}$ is the mean of the di and S is the standard deviation of the di.
The mean and SD of the di are calculated as follows:

Case No.    Baseline values (xi)    After one month (yi)    di = xi – yi    di²
1           23                      15                      8               64
2           21                      20                      1               1
3           24                      26                      –2              4
4           19                      17                      2               4
5           17                      17                      0               0
6           26                      21                      5               25
7           22                      16                      6               36
8           17                      12                      5               25
9           12                      12                      0               0
10          15                      11                      4               16
Total                                                       31 – 2 = 29     175

$$\bar{d} = \frac{29}{10} = 2.9, \qquad S = \sqrt{\frac{\sum d_i^2 - n\bar{d}^2}{n - 1}} = \sqrt{\frac{175 - 84.1}{9}} = 3.18$$
Putting these values in the test statistic, we get the value of 't':
$$t = \frac{2.9}{3.18 / \sqrt{10}} = \frac{2.9}{1.005} = 2.89$$
The tabulated value of 't' at (n – 1) degrees of freedom, i.e. 9 degrees of freedom, is 2.26, which is less than the calculated value t = 2.89. Hence we reject the null hypothesis.
Conclusion: We can safely say that the new tranquillizer is effective in psychoneurotic patients.
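The paired 't' calculation for the tranquillizer trial can be verified directly from the ten pairs of scores. This is a minimal sketch: the function and variable names are hypothetical, and only the data and the formula t = d̄/(S/√n) come from the example above.

```python
import math

# Anxiety scores from the solved example above.
baseline = [23, 21, 24, 19, 17, 26, 22, 17, 12, 15]
after_one_month = [15, 20, 26, 17, 17, 21, 16, 12, 12, 11]

def paired_t(x, y):
    """Paired 't' statistic and degrees of freedom for paired samples x and y."""
    d = [a - b for a, b in zip(x, y)]          # increments d_i = x_i - y_i
    n = len(d)
    d_bar = sum(d) / n
    s = math.sqrt(sum((v - d_bar) ** 2 for v in d) / (n - 1))
    return d_bar / (s / math.sqrt(n)), n - 1

t, df = paired_t(baseline, after_one_month)
print(round(t, 2), df)   # 2.89 9
```

Since 2.89 exceeds the tabulated value 2.26 at 9 degrees of freedom, the code reproduces the decision reached by hand.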
't' Test for Significance of Correlation Coefficient
QUESTION: In a sample of 30 individuals, the correlation coefficient between height and weight is r = +0.46. Find out whether this correlation coefficient is significant in the population.
SOLUTION:
Setting of hypothesis
Null hypothesis: H0 : ρ = 0, where ρ is the population correlation coefficient, i.e. the observed sample correlation does not indicate any correlation in the population.
Alternative hypothesis: H1 : ρ ≠ 0
The Test Statistic:
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
is distributed as a 't' distribution with (n – 2) degrees of freedom. In this problem r = +0.46 and n = 30; putting these values in the formula we get
$$t = \frac{0.46\sqrt{30 - 2}}{\sqrt{1 - (0.46)^2}} = \frac{0.46 \times 5.29}{0.888} = \frac{2.43}{0.888} = 2.74$$
The tabulated value of 't' at 28 degrees of freedom and the 5% level of significance is 2.048, which is less than the calculated value of 't'. Thus we reject the null hypothesis.
Conclusion: On the basis of this sample we can say that there is a significant positive correlation between the height and weight of individuals.

F-Statistic
If X and Y are two independent Chi-Square variates with ν1 and ν2 degrees of freedom, then the F-statistic is defined by
$$F = \frac{X / \nu_1}{Y / \nu_2}.$$
Thus F is defined as the ratio of two independent Chi-Square variates, each divided by its respective degrees of freedom, and it follows an F-distribution with (ν1, ν2) degrees of freedom.

Mode of F-Distribution
1. Since F > 0, the mode exists if and only if ν1 > 2.
2. The mode of the F-distribution is always less than 1.

Skewness of F-Distribution
The coefficient of skewness is given by
$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{SD}} > 0,$$
since mean > 1 and mode < 1. Hence the F-distribution is highly positively skewed.

Critical Values of F-Distribution
Figure 8.7

Application of F-Distribution
F-test for Equality of Population Variance
Suppose we want to test
(i) whether two independent samples xi (i = 1, 2, ..., n1) and yj (j = 1, 2, ...,
n2) have drawn from normal population with same variance  2 . (ii) Whether the two independent estimates of the population variance are homogenous or not. Under the null hypothesis 2 Where: Sx    xi  x  (say) 2  n 1  1 2 and Sy    yj  y  2  n 2  1 Follows F-distribution with  1 , 2  degree of freedom; where and . F-test for Equality of Several Means F-test can be used for testing equality of several means using the technique of Analysis of Variance (ANOVA). COMPARISON OF SEVERAL GROUPS One-way Analysis of Variance The technique ‘analysis of variance’ forms a powerful method of analyzing the way in which the mean values of a variable is affected by classifications of the data of various sorts. This technique concerned with the comparison of means rather than variances. 136 Medical Statistics and Demography Made Easy ‘t’ distribution for the comparison of the means of two groups of data, distinguishing between the paired and unpaired cases. The analysis of variance’, is a generalization of unpaired ‘t’ test, appropriate for any number of groups, It is entirely equivalent to unpaired ‘t’ test when there are just two groups. Some examples of a one-way classification of data into several groups are as follows: (a) The reduction in blood sugar recorded for groups of individuals given different doses. (b) The values of certain lung function test recorded for men of the same age group in a number of different occupational categories. Suppose there are k groups of observations on a variable y, and that the ith group contains n i observations. The numbering of the groups from 1 to k is quite arbitrary, although if there is a simple ordering of groups it will be natural to use this in the numbering. Groups 1 2 ........ i ........ k All group combined Number of cases n1 n2 ........ ni ........ nk N= Mean of y ........ Sum of y Sum of y2 ........ ni = T/N T1 T2 ........ Ti ........ Tk T= Ti S1 S2 ........ Si ........ 
Sk S= Si Note that the entries N, T and S in the final column are the sum along the corresponding rows, but is not the sum of . ( will be the mean of ) if all the ni are equal otherwise Testing of Hypothesis 137 In one way analysis of variance total sum of squares about the mean of N values of y can be portioned into two parts: (1) The sum of squares of each reading about its own mean and (2) The sum of squares of the deviations of each group mean about the grand mean (y ij  y)2  (y ij  y)2  (y i  y)2 We can write this result as: Total SSq = Within group SSq + Between SSq Where SSq stands for sum of squares. Now, if there are very large differences between group means, as compared to with the within-group variation, the between SSq is likely to be larger than within-group SSq. If on the other hand, all the group means are nearly equal then there is a considerable variation within groups. The relative sizes of the between and within group SSq should be therefore, provide an opportunity to assess the variation between group means in comparison with that within groups. The total sum of squares as well as sum of squares between and within groups can be obtained by the following formulae: Total Sum of Squares:   y ij  y  2 ij   T2  S     N      Within Sum of Squares: For the ith group   yij  y i  j 2   S i  T2  i n  i      138 Medical Statistics and Demography Made Easy Summing over k groups, therefore:   y ij  y i  ij 2  T2  S1   1 n   1    T22    S 2      n2 T2   Si    i i i  ni T2  S    i i  ni    Tk 2   ...... 
 S k       nk          Between Sum of Squares:   yi  y  2  Total SSQ  Within group SSQ ij   T2  S     n T2    i i  ni    Ti 2      S      i  n i       T2    N Summarizing the results, we have the following formulae for portioning the total sum of squares:  T 2  T2    1  N Between groups   n1 Within groups T2 S   1  i  n1 Total S i T2 N    Testing of Hypothesis 139 Testing for difference between mean of more than two groups (i.e. k > 2): Suppose that the ni observations in the ith group from a random sample from a population with mean μ i and variance  2 , As in two sample t-test we assume that is same for all groups. To examine the evidence for the difference between the μi we shall test the null hypothesis that the μi do not vary, being equal to some common value μ. There are three ways for estimating . These are as follows: From total sum of squares: The whole collection of N observations may be regarded as a random sample of size N, and consequently: Is an estimate of  2 . From within group SSq: Separates unbiased estimated may got for each group in turn: A combined estimate based purely on variation within groups may be derived by adding the numerator and denominator of these ratio to gibe within group mean sum of squares (or MSSq): S2W  Within group SSq within group SSq    n i  1 N k From between groups SSq: Since both S2T and S2w are unbiased estimate of  2 . By subtracting them we can get the third unbiased estimate by the between groups mean square. 
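The partition of the total sum of squares can be checked numerically. The following is a minimal Python sketch, not taken from the book: the function name `partition_ssq` and the three small groups of readings are hypothetical, chosen only so that the arithmetic is easy to follow by hand. It computes $n_i$, $T_i$ and $S_i$ for each group and then applies the three formulae given above.

```python
def partition_ssq(groups):
    """Return (total, within, between) sums of squares for a one-way layout.

    Uses the computing formulae from the text:
      Total   = S - T^2/N
      Within  = S - sum(T_i^2 / n_i)
      Between = sum(T_i^2 / n_i) - T^2/N
    """
    n = [len(g) for g in groups]                   # n_i: cases per group
    t = [sum(g) for g in groups]                   # T_i: sum of y per group
    s = [sum(y * y for y in g) for g in groups]    # S_i: sum of y^2 per group

    N, T, S = sum(n), sum(t), sum(s)
    correction = T * T / N                         # T^2 / N
    group_term = sum(ti * ti / ni for ti, ni in zip(t, n))

    total = S - correction
    within = S - group_term
    between = group_term - correction
    return total, within, between


# Hypothetical readings for k = 3 groups of unequal size (illustration only).
groups = [[4, 6, 8], [5, 7, 9, 11], [10, 12]]
total, within, between = partition_ssq(groups)

k = len(groups)
N = sum(len(g) for g in groups)
ms_within = within / (N - k)     # s_W^2, within group mean square
ms_between = between / (k - 1)   # s_B^2, between groups mean square

print(total, within, between)    # total equals within + between
print(ms_within, ms_between)
```

For these illustrative readings the total sum of squares splits evenly between the two components, which makes the identity Total SSq = Within group SSq + Between SSq easy to confirm by hand.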
Thus we can form the analysis of variance table:

| Source | df | Sum of squares | Mean sum of squares | F-ratio |
|---|---|---|---|---|
| Between groups | k – 1 | $\sum_i T_i^2/n_i - T^2/N = B$ | $s_B^2$ | $s_B^2 / s_W^2$ |
| Within groups | N – k | $S - \sum_i T_i^2/n_i = A - B$ | $s_W^2$ | |
| Total | N – 1 | $S - T^2/N = A$ | | |

The test for differences between the means is based on the F-ratio in the analysis of variance table, referred to the F-distribution with $\nu_1$ = (k – 1) and $\nu_2$ = (N – k) degrees of freedom.

If k = 2, the situation considered above is precisely that for which the unpaired (or two sample) ‘t’ test is appropriate. The variance ratio F will have 1 and N – 2 degrees of freedom, and ‘t’ will have $n_1 + n_2 - 2$, i.e. (N – 2), degrees of freedom. The value of F is equal to the square of the value of ‘t’. The distribution of F on 1 and N – 2 degrees of freedom is precisely the same as the distribution of the square of a variable following the ‘t’ distribution on N – 2 degrees of freedom.

If k > 2 we may examine the difference between a particular pair of means, chosen because the contrast between these particular groups is of logical interest. The standard error of the difference between two means, say $\bar{y}_i$ and $\bar{y}_j$, may be estimated by:

\[ SE(\bar{y}_i - \bar{y}_j) = \sqrt{s_W^2 \left( \frac{1}{n_i} + \frac{1}{n_j} \right)} \]

and the difference is tested by referring:

\[ t = \frac{\bar{y}_i - \bar{y}_j}{SE(\bar{y}_i - \bar{y}_j)} \]

to the ‘t’ distribution with N – k degrees of freedom (since this is the number of degrees of freedom associated with the estimated variance $s_W^2$). Confidence limits for the difference in means may be set in the usual way, using tabulated percentiles of ‘t’ on N – k degrees of freedom.

The only function of the analysis of variance in this particular comparison has been to replace the estimate of variance on $n_i + n_j - 2$ degrees of freedom (which would be used in the two-sample ‘t’ test) by the pooled within group mean square on N – k degrees of freedom.

Solved Example

Comparison of Several Means (ANOVA)

QUESTION: In a clinical trial, twenty patients undergoing operation were divided into four groups. Four different anaesthetic drugs were tested. The drugs were allotted at random to these groups.
The blood pressure was recorded just after induction. The results of this trial were as follows:

| Group 1 | Group 2 | Group 3 | Group 4 |
|---|---|---|---|
| 179 | 178 | 172 | 181 |
| 138 | 175 | 135 | 186 |
| 134 | 112 | 135 | 180 |
| 198 | 165 | 182 | 171 |
| 103 | 186 | 150 | 178 |

Find the effect of the different drugs on blood pressure in patients.

SOLUTION: Setting of hypothesis

Null hypothesis: There is no significant difference in the mean values of blood pressure between groups, i.e.

\[ H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu \]

Alternative hypothesis: $H_1$: the $\mu_i$ are not all equal.

One way analysis of variance:

| | Group 1 | Group 2 | Group 3 | Group 4 | All groups |
|---|---|---|---|---|---|
| Total ($T_i$) | 752 | 816 | 774 | 896 | T = 3238 |
| Number of cases ($n_i$) | 5 | 5 | 5 | 5 | N = 20 |
| Mean ($\bar{y}_i$) | 150.4 | 163.2 | 154.8 | 179.2 | |
| Sum of squares ($S_i = \sum y_{ij}^2$) | 118,854 | 136,674 | 121,658 | 160,682 | S = 537,868 |
| $T_i^2/n_i$ | 113,100.8 | 133,171.2 | 119,815.2 | 160,563.2 | 526,650.4 |

Sum of squares between groups:

\[ \sum_i \frac{T_i^2}{n_i} - \frac{T^2}{N} = 526650.4 - \frac{(3238)^2}{20} = 526650.4 - 524232.2 = 2418.2 \]

Total sum of squares:

\[ S - \frac{T^2}{N} = 537868 - 524232.2 = 13635.8 \]

Analysis of variance table:

| Source | Degrees of freedom | Sum of squares | Mean sum of squares | F-value |
|---|---|---|---|---|
| Between groups | 4 – 1 = 3 | 2418.2 | $s_B^2$ = 2418.2/3 = 806.07 | 806.07/701.10 = 1.15 |
| Error (within groups) | 19 – 3 = 16 | 13635.8 – 2418.2 = 11217.6 | $s_W^2$ = 11217.6/16 = 701.10 | |
| Total | 20 – 1 = 19 | 13635.8 | | |

The critical value of F (from the F table) at 3 and 16 degrees of freedom is $F_{tab}$ = 3.24, which is more than the calculated value of F (from the analysis of variance table). Hence we do not reject the null hypothesis, i.e. there is no significant difference between the mean blood pressure values in the four groups.

Conclusion: There is no significant difference between the blood pressure values recorded just after induction with the different drugs. The data give no evidence that the four drugs differ in their effect on the blood pressure of patients.
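The arithmetic of this solved example can be re-run from the group summaries alone. The sketch below is an illustration rather than the book's own procedure: it takes the group sizes $n_i$, totals $T_i$ and sums of squares $S_i$ from the one-way table above, and the 5% critical value 3.24 quoted in the text; the variable names are assumptions made for this example. Because the sums are recomputed exactly, the later decimal places may differ slightly from hand-rounded figures, while the F-ratio stays close to 1.15.

```python
# Group summaries taken from the one-way analysis of variance table.
n = [5, 5, 5, 5]                          # n_i: cases per group
T = [752, 816, 774, 896]                  # T_i: sum of readings per group
S = [118854, 136674, 121658, 160682]      # S_i: sum of squared readings

k = len(n)                                # number of groups
N = sum(n)                                # total number of cases
grand_T = sum(T)
grand_S = sum(S)

correction = grand_T ** 2 / N             # T^2 / N
group_term = sum(t * t / m for t, m in zip(T, n))   # sum of T_i^2 / n_i

between_ssq = group_term - correction
within_ssq = grand_S - group_term
total_ssq = grand_S - correction

ms_between = between_ssq / (k - 1)        # s_B^2 on k - 1 df
ms_within = within_ssq / (N - k)          # s_W^2 on N - k df
f_ratio = ms_between / ms_within

F_CRITICAL = 3.24   # tabulated 5% value at (3, 16) df, as quoted in the text
reject_null = f_ratio > F_CRITICAL

print(f"Between SSq = {between_ssq:.1f}, Within SSq = {within_ssq:.1f}")
print(f"F = {f_ratio:.2f} on ({k - 1}, {N - k}) df; reject H0: {reject_null}")
```

Working from $T_i$ and $S_i$ rather than the raw readings mirrors the computing formulae of the text and keeps the check independent of how the individual readings are transcribed.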
Comparison of mean values of blood pressure in Group 1 and Group 4 on the basis of the analysis of variance table:

Mean blood pressure of patients in Group 1 = 150.4
Mean blood pressure of patients in Group 4 = 179.2
Number of cases in each group = 5

Standard error of the difference:

\[ SE = \sqrt{s_W^2 \left( \frac{1}{5} + \frac{1}{5} \right)} \approx 16.75 \]

\[ t = \frac{179.2 - 150.4}{16.75} \approx 1.72 \]

The critical value of ‘t’ at (N – k), i.e. 16 degrees of freedom, is 2.12, which is more than the calculated value of ‘t’. Hence, we do not reject the null hypothesis that there is no significant difference between the blood pressure values of Group 1 and Group 4. Thus by the use of the analysis of variance table we can compare the mean values of two groups also.

MULTIPLE CHOICE QUESTIONS

1. $\sqrt{pq/n}$ indicates:
(a) Standard error of proportion
(b) Difference between proportion
(c) Standard error of mean
(d) Standard deviation from the mean
(AI, 93)

2. The number of degrees of freedom in a table of (4 × 4) is:
(a) 4
(b) 8
(c) 9
(d) 16
(AI, 95)

3. Confidence limits is:
(a) Range and standard deviation
(b) Median and standard error
(c) Mean and standard error
(d) Mode and standard deviation
(AI, 99)

4. All are true regarding Student’s t-test except:
(a) Standard error of mean is not estimated
(b) Standard population is selected
(c) Two samples are compared
(d) Student’s t-map (table) is required for calculation
(AI, 2000)

5. A community has a population of 10,000 individuals. Beta carotene was given to 6,000 individuals and the remaining population was not given beta carotene. After some time 3 in the first group developed lung cancer and 2 in the second group also developed lung cancer. The correct statement is:
(a) Beta carotene and lung cancer have no association
(b) The P-value is not significant
(c) The study is not designed properly
(d) Beta carotene is associated with lung cancer
(AI, 2001)

6. If the mean is 230 and the standard error is 10, the 95% confidence limits would be:
(a) 210 to 250
(b) 220 to 240
(c) 225 to 235
(d) 230 to 210
(AI, 89)

7. Significant ‘p’ value is all except:
(a) 0.005
(b) 0.05
(c) 0.01
(d) 0.1

8. The mean BP of a group of persons was determined and, after an interventional trial, the mean BP was estimated again. The test to be applied to determine the significance of the intervention is:
(a) Chi-Square
(b) Paired ‘t’ test
(c) Correlation coefficient
(d) Mean deviation
(AIIMS, 95)

9. Which of the following is a pre-requisite for the Chi-Square test to compare:
(a) Both samples should be mutually exclusive
(b) Both samples need not be mutually exclusive
(c) Normal distribution
(d) All of the above
(UPSC 2000)

10. In a group of persons taking part in a controlled trial of an anti-hypertensive drug, the blood pressures were measured before and after giving the drug. Which of the following tests will you use for comparison:
(a) Paired t-test
(b) F test
(c) ‘t’-test
(d) Chi-Square test
(AIIMS, 2000, Dec 97)

11. About the test of significance between two large populations, one of the following statements is true:
(a) Null hypothesis states that two means are equal
(b) Standard error of difference is the sum of the standard errors of the 2 means
(c) Standard errors of means are equal
(d) Standard error of difference between population is calculated
[Hint: The null hypothesis is usually the hypothesis of no difference; it is tested for possible rejection under the assumption that it is true. The denominator of the test for a difference between two populations is the standard error of the difference of means or proportions, not the standard error of difference between populations.]
(AIIMS, Dec 98)

12. True about Chi-Square test is:
(a) Null hypothesis is equal
(b) Doesn’t measure the significance
(c) Measures the significant difference between two proportions
(d) Tests correlation and regression
(AIIMS, June 99)

13. For 95% confidence limits true is:
(a) 1.95 of standard error of mean
(b) Reduces 95% of values
(c) 2.95 of standard error of mean
(d) Normal distribution + 2.5 SD
(AIIMS, June 95)

14. Standard error of mean indicates:
(a) Dispersion
(b) Distribution
(c) Variation
(d) Deviation
[Hint: Standard error is merely the standard deviation of some statistic calculated from a sample (in this case, the mean) in an indefinitely long series of repeated sampling.]
(AIIMS, Nov. 99)

15. In a ‘p’ test, p indicates the probability of:
(a) Accepting null when it is false
(b) Accepting when it is true
(c) Rejecting null when it is true
(d) Rejecting null when it is false
[Hint: Level of significance is also the critical region]
(AIIMS, June 2000)

16. In a group of 100 children, the mean weight of a child is 15 kg. The standard error is 1.5 kg. Which one of the following is true:
(a) 95% of all children weigh between 12 and 18 kg
(b) 95% of all children weigh between 13.5 and 16.5
(c) 99% of all children weigh between 12 and 18
(d) 99% of all children weigh between 13.5 and 16.5
(AIIMS, May 2001)

17. A group tested for a drug shows 60% improvement as against a standard group showing 40% improvement. The best test to test the significance of the result is:
(a) Student’s ‘t’ test
(b) Chi-Square test
(c) Paired ‘t’ test
(d) Test for variance
(AIIMS, Nov 2001)

18. A test was done to compare serum cholesterol levels in obese and non-obese women. The test for significance of difference is:
(a) Paired ‘t’ test
(b) Student’s ‘t’ test for independent variables
(c) Chi-Square test
(d) Fisher test
(AIIMS, Nov 2001)

19. Which of the following is a parametric test of significance:
(a) U test
(b) ‘t’ test
(JIPMER, 2003)

20. For testing the statistical significance of the difference in heights of school children among three socioeconomic groups, the most appropriate statistical test is:
(a) Student’s ‘t’ test
(b) Chi-Square test
(c) Paired ‘t’ test
(d) One way analysis of variance (one way ANOVA)
(AI, 2002)

21. In a study, variation in cholesterol was seen before and after giving a drug. The test which would give its significance is:
(a) Unpaired ‘t’ test
(b) Paired ‘t’ test
(c) Chi-Square test
(d) Fisher’s test
(AI, 2002)

22. An investigator wants to study the association between maternal intake of iron supplements (Yes/No) and birth weights (in gm) of newborn babies. He collects relevant data from 100 pregnant women and their newborns. What statistical test of hypothesis would you advise for the investigator in this situation?
(a) Chi-Square test
(b) Unpaired or independent t-test
(c) Analysis of variance
(d) Paired t-test
[Hint: The investigator classifies the pregnant women into two groups depending upon intake of iron supplement. Thus there are two independent groups and the mean birth weights of the babies can be compared.]
(AIIMS, 2003)

23. A randomized trial comparing the efficacy of two drugs showed a difference between the two with a ‘p’ value of < 0.005. In reality, however, the two drugs do not differ. This is therefore an example of:
(a) Type-I error (α-error)
(b) Type-II error (β-error)
(c) 1 – α
(d) 1 – β
[Hint: Rejecting the null hypothesis when it is true is called type-I error]
(AIIMS, 2002)

24. If we reject the null hypothesis when it is actually true, it is known as:
(a) Type-I error
(b) Type-II error
(c) Power
(d) Specificity
(AIIMS, 2004)

25. A randomized trial comparing the efficacy of two drugs showed a difference between the two (with a p value < 0.05). Assume that in reality, however, the two drugs do not differ. This is therefore an example of:
(a) Type I error (α error)
(b) Type II error (β error)
(c) 1 – α
(d) Power of test
(AIIMS, 2004)

26. The mean Hb level in healthy women is 13.5 g/dl and the standard deviation is 1.5 g/dl. What is the Z score for a woman with Hb level 15.0:
(a) 9.0
(b) 10.0
(c) 2.0
(d) 1.0
(AIIMS, 2004)

Chapter 9 Non-parametric Tests

Non-parametric (NP) tests do not depend on the particular form of the basic frequency function from which the samples are drawn. Non-parametric tests do not make any assumption regarding the form of the population.

Advantages of Non-parametric Tests
1. Non-parametric methods are very simple and easy to apply.
2. No assumption is made about the form of the frequency function of the parent population from which the sample is drawn.
3. NP tests can be applied to data which are mere classifications (i.e. which are measured on a nominal scale).
4. NP tests are available to deal with data which are given in ranks, or whose seemingly numerical scores have the strength of ranks (e.g. scores given in grades such as A–, A, A+, B, B+).

Disadvantages of Non-parametric Tests
1. NP tests can only be used if the measurements are nominal or ordinal. Even then, if a parametric test exists it is more powerful than the NP test.

Remarks
Since no assumption is made about the parent population, the non-parametric methods are sometimes referred to as distribution free methods. These tests are based on the theory of order statistics. A sample $x_1, x_2, \ldots, x_n$ is an ordered sample if $x_1 < x_2 < x_3 < \cdots < x_n$. The whole structure of NP methods rests on a simple but fundamental property of order statistics.

Run Test

Suppose $x_1, x_2, \ldots, x_{n_1}$ is an ordered sample from one population and $y_1, y_2, \ldots, y_{n_2}$ is an independent ordered sample from another population. We want to test whether the samples have been drawn from the same population or from different populations.
Let us combine the two samples and arrange the observations in order of magnitude to give the combined ordered sample:

| Run | 1 | 2 | 3 | 4 | ... |
|---|---|---|---|---|---|
| Observations | $x_1, x_2$ | $y_1, y_2, y_3$ | $x_3, x_4, x_5$ | $y_4, y_5$ | $x_6$ ... |
| Length (l) | 2 | 3 | 3 | 2 | |

Run: A run is defined as a sequence of observations of one kind surrounded by sequences of the other kind, and the number of elements in a run is usually referred to as the length ‘l’ of the run.

If both samples came from the same population, there would be a thorough mingling of the $x_i$ and $y_j$ in the combined sample and the number of runs in the combined sample would be large. On the other hand, if the samples came from two different populations whose ranges do not overlap, there would be only two runs, of the type $x_1, x_2, \ldots, x_{n_1}$ and $y_1, y_2, \ldots, y_{n_2}$. Generally, any difference in mean and variance would tend to reduce the number of runs. Thus the alternative hypothesis will entail too few runs.

Procedure: In order to test the null hypothesis that the samples have come from the same population, we count the number of runs ‘U’ in the combined ordered sample. When $n_1$ and $n_2$ are large, then under the null hypothesis ‘U’ is asymptotically normal with

\[ \text{Mean}(U) = \frac{2 n_1 n_2}{n_1 + n_2} + 1 \]

and

\[ \text{Variance}(U) = \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)} \]

Thus we can use the normal test:

\[ Z = \frac{U - \text{Mean}(U)}{\sqrt{\text{Variance}(U)}} \sim N(0, 1) \]

This approximation is fairly good if each of $n_1$ and $n_2$ is greater than 10. Since the alternative hypothesis is ‘too few runs’, the test is ordinarily one tailed, with only negative values leading to rejection.

OTHER NON-PARAMETRIC TESTS

Median Test

The median test is a statistical procedure for testing whether two independent ordered samples differ in their central tendencies. Let $x_1, x_2, \ldots, x_{n_1}$ and $y_1, y_2, \ldots, y_{n_2}$ be two independent ordered samples and $z_1, z_2, \ldots, z_{n_1 + n_2}$ be the combined ordered sample.
Let $m_1$ be the number of x’s and $m_2$ be the number of y’s exceeding the median value of the combined series.

| | Sample 1 | Sample 2 | Total |
|---|---|---|---|
| No. of observations > Median | $m_1$ | $m_2$ | $m_1 + m_2$ |
| No. of observations < Median | $n_1 - m_1$ | $n_2 - m_2$ | $(n_1 + n_2) - (m_1 + m_2)$ |
| Total | $n_1$ | $n_2$ | $n_1 + n_2$ |

If the frequencies are small we can compute the exact probabilities. However, if the frequencies are large, we may use the $\chi^2$ test with 1 degree of freedom for testing $H_0$ (the null hypothesis that the samples came from the same population). The approximation is fairly good if both $n_1$ and $n_2$ exceed 10.

Sign Test

The sign test is used under the following conditions:
(a) There are pairs of observations on the two things being compared.
(b) For any pair, each of the two observations is made under similar extraneous conditions.
(c) Different pairs are observed under different conditions.

The third condition (condition ‘c’) implies that the $d_i = (x_i - y_i)$; i = 1, 2, 3, ..., n have different variances and thus renders the paired ‘t’ test invalid, which would otherwise have been used unless there was obvious non-normality.

The sign test is based on the sign (plus or minus) of the deviations $d_i = (x_i - y_i)$. No assumptions are made regarding the parent population. The only assumptions are:
(1) Measurements are such that the deviations $d_i = (x_i - y_i)$ can be expressed in terms of positive or negative signs.
(2) Variables have a continuous distribution.
(3) The $d_i$’s are independent.

Different pairs $(x_i, y_i)$ may be from different populations (say with respect to age, weight, stature, education). The only requirement is that within each pair, there is matching with respect to relevant extraneous factors.

Procedure: Let $(x_i, y_i)$, i = 1, 2, 3, ..., n be n paired observations drawn from the two populations. We test the null hypothesis that the two populations are equal. Find the difference between each pair of observations, i.e. $d_i = x_i - y_i$. Let us define $U_i$ such that $U_i = 1$ if $x_i > y_i$ (i.e. positive sign) and $U_i = 0$ if $x_i < y_i$ (i.e. negative sign). Since the $U_i$; i = 1, 2, 3, ..., n are independent, under the null hypothesis

\[ U = \sum_i U_i \]

follows the binomial distribution with parameters n and p = 1/2. For large samples (n > 30), we may regard U as asymptotically normal (under the null hypothesis) with mean and variance equal to:

\[ \text{Mean of } U = \frac{n}{2} \quad \text{and} \quad \text{Variance of } U = \frac{n}{4} \]

Thus,

\[ Z = \frac{U - n/2}{\sqrt{n/4}} \sim N(0, 1) \]

and we may use the normal test.

Mann-Whitney Wilcoxon ‘U’ Test

This non-parametric test for two samples is the most widely used test when we do not make any assumption about the parent population. Let $x_1, x_2, \ldots, x_{n_1}$ and $y_1, y_2, \ldots, y_{n_2}$ be two independent ordered samples of size $n_1$ and $n_2$.

The Mann-Whitney test is based on the pattern of x’s and y’s in the combined ordered sample:

$x_1, x_2, y_1, y_2, y_3, x_3, x_4, x_5, y_4, y_5, x_6, \ldots$

Let ‘T’ denote the sum of ranks of the y’s in the combined sample. The ranks of the y’s in the combined sample are 3, 4, 5, 9, 10, ... Then T = 3 + 4 + 5 + 9 + 10 + ... and

\[ U = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - T \]

If ‘T’ is significantly large or small then $H_0$ will be rejected. It has been established that under the null hypothesis U is asymptotically normally distributed as $N(\mu, \sigma^2)$, where

\[ \mu = \frac{n_1 n_2}{2} \quad \text{and} \quad \sigma^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} \]

Hence

\[ Z = \frac{U - \mu}{\sigma} \sim N(0, 1) \]

A normal test can be used if both $n_1$ and $n_2$ are greater than 8.

Solved Example

Run Test

QUESTION: For the given set of data drawn from two populations, apply the run test and test the hypothesis that the samples are drawn from populations with the same distribution function:

$x_i$: 15 77 01 65 69 69 58 40 81 16 20 20 00 84 22
$y_j$: 28 26 46 66 36 86 66 17 43 49 85 40 51 40 10

SOLUTION: Setting the Hypothesis

Null hypothesis: The two populations have the same distribution function. $H_0: f_1(\cdot) = f_2(\cdot)$

Alternative hypothesis: $H_1: f_1(\cdot) \neq f_2(\cdot)$

The test statistic:

\[ Z = \frac{U - \text{Mean}(U)}{\sqrt{\text{Variance}(U)}} \]

where

\[ \text{Mean}(U) = \frac{2 n_1 n_2}{n_1 + n_2} + 1 \quad \text{and} \quad \text{Variance}(U) = \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)} \]

Calculate the number of runs in the combined ordered series.
For this, first arrange the $x_i$ and $y_j$ in ascending order:

| S.No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $x_i$ | 00 | 01 | 15 | 16 | 16 | 20 | 22 | 40 | 58 | 65 | 69 | 69 | 77 | 81 | 84 |
| $y_j$ | 10 | 17 | 26 | 28 | 36 | 40 | 40 | 43 | 46 | 49 | 51 | 66 | 66 | 85 | 86 |

Combine the two series in ordered form in terms of $x_i$ and $y_j$:

| Run | Observations |
|---|---|
| 1 | $x_1, x_2$ |
| 2 | $y_1$ |
| 3 | $x_3, x_4, x_5$ |
| 4 | $y_2$ |
| 5 | $x_6, x_7$ |
| 6 | $y_3, y_4, y_5$ |
| 7 | $x_8$ |
| 8 | $y_6, y_7, y_8, y_9, y_{10}, y_{11}$ |
| 9 | $x_9, x_{10}$ |
| 10 | $y_{12}, y_{13}$ |
| 11 | $x_{11}, x_{12}, x_{13}, x_{14}, x_{15}$ |
| 12 | $y_{14}, y_{15}$ |

Thus, we can see that in the combined series there are 12 runs (sequences of one kind of observation). Therefore U = 12 (total number of runs).

The mean and variance of U:

\[ \text{Mean}(U) = \frac{2 \times 15 \times 15}{15 + 15} + 1 = 15 + 1 = 16 \]

\[ \text{Variance}(U) = \frac{2 \times 15 \times 15\,(2 \times 15 \times 15 - 15 - 15)}{(15 + 15)^2 (15 + 15 - 1)} = \frac{450 \times (450 - 30)}{(30)^2 \times 29} = \frac{450 \times 420}{900 \times 29} = \frac{189000}{26100} = 7.24 \]

Thus the test statistic Z is

\[ Z = \frac{U - \text{Mean}(U)}{\sqrt{\text{Variance}(U)}} = \frac{12 - 16}{\sqrt{7.24}} = \frac{-4}{2.69} = -1.49 \]

The tabulated value of Z is more than the calculated value (i.e. |Z| = 1.49). Hence, we do not reject the null hypothesis that the distributions of the two populations are the same.

Conclusion: The distributions of the two populations from which the two samples are drawn are the same.

Sign Test

QUESTION: In the above example, if $(x_i, y_i)$ are pairs of observations drawn from the two populations, apply the sign test and find out whether the distributions of the two populations are equal:

$x_i$: 15 77 01 65 69 69 58 40 81 16 20 20 00 84 22
$y_j$: 28 26 46 66 36 86 66 17 43 49 85 40 51 40 10

SOLUTION: Setting of Hypothesis

Null hypothesis: The two populations have the same distribution function. $H_0: f_1(\cdot) = f_2(\cdot)$

Alternative hypothesis: $H_1: f_1(\cdot) \neq f_2(\cdot)$

The test statistic is

\[ Z = \frac{U - n/2}{\sqrt{n/4}} \]

| S.No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $x_i$ | 15 | 77 | 01 | 65 | 69 | 69 | 58 | 40 | 81 | 16 | 20 | 20 | 00 | 84 | 22 |
| $y_j$ | 28 | 26 | 46 | 66 | 36 | 86 | 66 | 17 | 43 | 49 | 85 | 40 | 51 | 40 | 10 |
| Sign of $d_i = (x_i - y_i)$ | – | + | – | – | + | – | – | + | + | – | – | – | – | + | + |

$U_i$ = 1 if $x_i > y_i$ (i.e. positive sign) and 0 if $x_i < y_i$ (i.e. negative sign), so that

\[ U = \sum_i U_i = 6 \]

(there are in total 6 pairs in which $x_i > y_i$). Thus the test statistic Z is:

\[ Z = \frac{6 - 15/2}{\sqrt{15/4}} = \frac{-1.5}{1.94} = -0.77 \]

The tabulated value of Z is more than the calculated value. Hence, we do not reject the null hypothesis, i.e. the distribution functions of the two populations are the same.

Conclusion: The two samples are drawn from the same population.

Mann-Whitney U Test

QUESTION: For the same set of data, apply the Mann-Whitney U test to compare the distribution functions of the populations.

The combined observations of the two series are arranged in ascending order (as in the run test):

| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Observation | $x_1$ | $x_2$ | $y_1$ | $x_3$ | $x_4$ | $x_5$ | $y_2$ | $x_6$ | $x_7$ | $y_3$ | $y_4$ | $y_5$ | $x_8$ | $y_6$ | $y_7$ |

| Rank | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Observation | $y_8$ | $y_9$ | $y_{10}$ | $y_{11}$ | $x_9$ | $x_{10}$ | $y_{12}$ | $y_{13}$ | $x_{11}$ | $x_{12}$ | $x_{13}$ | $x_{14}$ | $x_{15}$ | $y_{14}$ | $y_{15}$ |

T (the sum of ranks of the y’s in the combined ordered series) is calculated from the above table:

T = 3 + 7 + 10 + 11 + 12 + 14 + 15 + 16 + 17 + 18 + 19 + 22 + 23 + 29 + 30 = 246

\[ U = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - T = 225 + \frac{15 (15 + 1)}{2} - 246 = 225 + 120 - 246 = 99 \]

The mean and variance of ‘U’ are:

\[ \text{Mean}(U) = \frac{n_1 n_2}{2} = \frac{225}{2} = 112.5 \]

\[ \text{Variance}(U) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} = \frac{15 \times 15\,(15 + 15 + 1)}{12} = \frac{225 \times 31}{12} = 581.25 \]

Thus, the test statistic Z is

\[ Z = \frac{99 - 112.5}{\sqrt{581.25}} = \frac{-13.5}{24.11} = -0.56 \]

The tabulated value of Z is more than the calculated value of Z. Hence, we do not reject the null hypothesis, i.e. the two samples are drawn from the same population.

Conclusion: The distribution functions of the populations from which the two samples are drawn are the same.

MULTIPLE CHOICE QUESTIONS

1. Statistical tests that are non-parametric include:
(a) Regression
(b) Correlation
(c) The Student’s test
(d) Rank correlation
(e) Wilcoxon rank sum test
(PGI, 80, AIIMS 80)

2. If the distribution of the population is not known, which of the following tests will be used:
(a) F-test
(b) Student’s ‘t’ test
(c) ANOVA
(d) Sign test

3. For large sample sizes in the Mann-Whitney U test, the test statistic “U” is normally distributed as:
(a) N(μ, 1)
(b) N(μ, σ²)
(c) N(0, σ²)
(d) N(0, 1)

Chapter 10 Statistical Methods in Epidemiology

Epidemiology is the study of the distribution and determinants of health related states or events in a specified population. Epidemiology is by definition concerned with problems affecting groups of individuals rather than single subjects. In broad terms, epidemiology is concerned with the distribution of disease, including chronic diseases as well as the communicable diseases which give rise to epidemics of the classical sort.

Some important terms used in epidemiological studies:

Baseline: The health state (disease severity, confounding conditions) of individuals at the beginning of a prospective study. A difference (asymmetry) in the distribution of baseline values between groups will bias the results.

Blinding (Masking): Blinding is a method to reduce bias by preventing observers and/or experimental subjects involved in an analytic study from knowing the hypothesis being investigated, the case-control classification, the assignment of individuals or groups, or the different treatments being provided. Blinding reduces bias by preserving symmetry in the observers’ measurements and assessments. This bias is usually not due to deliberate deception but due to human nature and previously held beliefs about the area of study.

Placebo: A placebo is the dummy treatment used in a control group in place of the actual treatment. If a drug is being evaluated, the inactive carrier is used without the active drug, so that the placebo is as similar as possible in appearance and in administration to the active drug. Placebos are used to blind observers and, in human trials, patients to the group to which the patient is allocated.

Case definition: The set of history, clinical signs and laboratory findings that are used to classify an individual as a case or not for an epidemiological study.
Case definitions are needed to exclude individuals with other conditions that occur at an endemic background rate in a population, or with other characteristics that would confuse or reduce the precision of a clinical trial.

Cohort: A group of individuals identified on the basis of a common experience or characteristic, usually monitored over time from the point of assembly.

Experimental unit: In an experiment, the experimental units are the units that are randomly selected or allocated to a treatment; they are the units upon which the sample size calculation and the subsequent data analysis must be based.

Prospective study (Data): Data collection and the events of interest occur after individuals are enrolled (e.g. clinical trials or cohort studies). Prospective collection enables the use of more solid, consistent criteria and avoids the potential biases of retrospective recall. Prospective studies are limited to conditions that occur relatively frequently and to studies with relatively short follow-up periods, so that a sufficient number of eligible individuals can be enrolled and followed within a reasonable period.

Retrospective study (Data): All events of interest have already occurred and data are generated from historical records (secondary data) or from recall (which may result in significant recall bias). Retrospective data are relatively inexpensive compared to prospective data because they use available information, and they are typically used in case-control studies. Retrospective studies of rare conditions are much more efficient than prospective studies.

Basic Measures of Epidemiology

Measurements in epidemiology include the following:
1. Measurement of mortality, morbidity, etc.
2. Measurement of the presence, absence or distribution of the characteristics or attributes of the disease.
3. Measurement of demographic variables.
4. Measurement of the presence, absence or distribution of the environmental and other factors suspected of causing the disease.

Parameters of Measurements

Epidemiologists usually express disease magnitude as a rate, ratio or proportion. These three are the basic parameters of measurement in epidemiology.

Rate: A rate measures the occurrence of some particular event (occurrence of death or disease) in a population during a given period of time. The rate is usually expressed per thousand. For example:

\[ \text{Death rate} = \frac{\text{Total number of deaths in a year}}{\text{Mid year population}} \times 1000 \]

Rates can be broadly classified as (1) Crude rates. (2) Specific rates. (3) Standardized rates.

Ratio: In a ratio the numerator is not a component of the denominator. The numerator and denominator may involve an interval of time or may be instantaneous in time. For example: In the sex ratio (Male : Female), the numerator is the number of males in the population during a given period, and the denominator is the number of females in the population during the same period. If the number of males = ‘a’ and the number of females = ‘b’, then

\[ \text{Ratio} = \frac{a}{b} \]

Thus we can see that the numerator is not a component of the denominator.

Proportion: A proportion is a ratio which indicates the relation in magnitude of a part to the whole. The numerator is always included in the denominator. A proportion is usually expressed as a percentage. In the above example the proportion of males is:

\[ \text{Proportion of males in the population} = \frac{a}{a + b} \times 100 \]

Numerator: The numerator refers to the number of times an event occurs. The numerator is a component of the denominator in calculating rates but not in a ratio.

Denominator: The literal meaning of denominator is the number below the line in a fraction. In epidemiology we generally use three types of denominator.

Mid year population: While calculating rates (death, birth) the denominator comprises the mid-year population.
Because the population size changes daily due to births, deaths and migration, the mid-year population is used as the denominator for calculating rates. The mid-year population refers to the population estimated as on 1st July.

Population at risk: For calculating morbidity statistics the population exposed to risk is used as the denominator. The term is applied to all those to whom an event could have happened, whether it did or not. For example, while calculating the general fertility rate, the women of reproductive age (15-49 years) are taken as the denominator, because women below 15 and above 49 years of age generally do not give birth and are therefore not exposed to the risk.

Related to events: In some situations the denominator may be related to total events instead of the total population. For example, while calculating the maternal mortality rate the denominator is the number of live births.

Measurements of Mortality

The measures of mortality are the Crude Death Rate, Age Specific Death Rates and Standardized Death Rates, which will be discussed in detail under the following headings.

Measurements of Morbidity

Morbidity is defined as 'any departure, subjective or objective, from the state of physiological well-being'. Morbidity can be measured in terms of three units:
(a) Persons who were ill.
(b) The illnesses (periods or spells) that these persons experienced.
(c) The duration (days or weeks, etc.) of illness.

Disease frequency is usually measured by incidence and prevalence rates (although prevalence is referred to as a rate, it is actually a ratio).

Incidence rate (Persons): The number of new cases occurring in a defined population during a specified period of time.

$$\text{Incidence rate (persons)} = \frac{\text{No. of new cases of a specified disease during a given period of time}}{\text{Population at risk during that period}} \times 1000$$

Example: If there are 1,000 new cases of illness in a population of 50,000 in a year, then the incidence rate is:

$$\text{Incidence rate} = \frac{1000}{50{,}000} \times 1000 = 20 \text{ per thousand per year}$$

The incidence rate must include the unit of time; the incidence of disease in the above example is 20 per 1000 per year.

Features of incidence rate:
1. Only new cases.
2. During a given period of time.
3. In a specified population (population at risk).
4. The unit of time should be mentioned.

Incidence rate (Spells): The number of new spells of illness in a defined population during a specified time.

$$\text{Incidence rate (spells)} = \frac{\text{No. of spells of sickness starting in a defined period of time}}{\text{Mean number of persons exposed to risk in that period}} \times 1000$$

Incidence measures the rate at which new cases are occurring in a population. It is not influenced by the duration of disease.

Use of incidence rate: Incidence rates are useful in determining the causality of diseases. The incidence rate is useful for taking action (a) to control disease, and (b) to study the distribution of disease and the efficacy of preventive and therapeutic measures. If the incidence rate is increasing, it might indicate failure or ineffectiveness of the current control programme and the need for a new disease control programme.

Prevalence: The total number of all individuals who have an attribute or disease at a particular time (or during a particular period) divided by the population at risk of having the attribute or disease at that point of time or midway through the period.

Prevalence refers specifically to all current cases (old and new) existing at a given point of time, or over a given period of time, in a given population. Prevalence is of two types: (1) Point prevalence. (2) Period prevalence.
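The incidence rate formula and its worked example (1,000 new cases in a population of 50,000) can be checked with a few lines of Python. This is only an illustrative sketch: the function name `rate_per` and its `multiplier` parameter are assumptions introduced here, not notation from the text.

```python
def rate_per(events: int, population_at_risk: float, multiplier: int = 1000) -> float:
    """Return the number of events per `multiplier` persons at risk.

    `rate_per` is a hypothetical helper name used only for this illustration.
    """
    if population_at_risk <= 0:
        raise ValueError("population at risk must be positive")
    # Multiply first so that the textbook example gives an exact result.
    return events * multiplier / population_at_risk


# Worked example from the text: 1,000 new cases among 50,000 people in one year.
incidence = rate_per(1000, 50_000)
print(f"Incidence rate: {incidence:.0f} per 1000 per year")
```

The same helper applies to the spells version of the formula: pass the number of spells that started in the period as `events` and the mean number of persons exposed to risk as `population_at_risk`.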
Point prevalence: The point prevalence of a disease is a measure of all cases (old and new) of the disease at one point of time in relation to a defined population.

$$\text{Point prevalence} = \frac{\text{No. of all current cases (old and new) of a specified disease existing at a given point of time}}{\text{Estimated population at the same point of time}} \times 100$$

In point prevalence the 'point' may be a day, several days or even a few weeks, depending upon the time it takes to examine the population.

Period prevalence: It includes cases arising before but extending into or through the year, as well as those cases arising during the year. Period prevalence is a combination of point prevalence and incidence.

$$\text{Period prevalence} = \frac{\text{No. of all current cases (old and new) of a specified disease existing during a given period of time}}{\text{Estimated mid-interval population at risk}} \times 100$$

Incidence and prevalence can best be explained by the following figure.

Figure 10.1

From the above figure, the number of new cases in the given period (January 2000 to December 2000) is 3 (cases 2, 5 and 8). Therefore, for incidence, the number of new cases is 3. For point prevalence at January 2000, three cases are included (cases 3, 6 and 7), while for point prevalence at December 2000 two cases are included (cases 5 and 8). For period prevalence (during the period from January 2000 to December 2000) six cases are included (cases 2, 3, 5, 6, 7 and 8; cases 2, 5 and 8 are new cases and cases 3, 6 and 7 are old cases). Cases 1 and 4 are excluded because they fall outside the given period.

Use of Prevalence

Prevalence helps to estimate the magnitude of health/disease problems in the community and to identify potential high-risk populations. Prevalence data provide an indication of the extent of a condition and may have implications for the provision of services needed in a community.
Prevalence rate is especially useful for administrative and planning purposes. Both measures of prevalence are proportions; as such they are dimensionless and should not be described as rates (Friis and Sellers, 1999).

• Friis RH and Sellers TA. Epidemiology for public health practice, 2nd ed., Aspen Publishers, Inc. (1999).

$$\text{Incidence} = \frac{\text{Number of new cases}^{*}}{\text{Population at risk}^{*}} \qquad \text{Prevalence} = \frac{\text{Number of all cases (old and new)}}{\text{Population at risk}}$$

* During the specified time period.

Remember, incidence means NEW. Prevalence means ALL.

Relation between Incidence and Prevalence

If the population is stable and incidence and duration are unchanging, then

$$\text{Prevalence} = \text{Incidence} \times \text{Duration}$$

or

$$\text{Incidence} = \frac{\text{Prevalence}}{\text{Duration}} \qquad \text{and} \qquad \text{Duration} = \frac{\text{Prevalence}}{\text{Incidence}}$$

From the above relation we can say that the longer the duration of the disease, the higher the prevalence will be in relation to the incidence. If the disease is acute and of short duration (because of either rapid recovery or death), the prevalence will be relatively low compared with the incidence. A decrease in prevalence may result not only from a decrease in incidence but also from a decrease in the duration of illness, through either more rapid recovery or more rapid death.

Epidemiological Studies

Epidemiological studies can be classified as observational studies and experimental studies. Observational studies are further divided into descriptive studies and analytical studies, while experimental studies are divided into randomized controlled trials, field trials and community trials.

Observational Studies

In observational studies the allocation or assignment of factors is not under the control of the investigator. In an observational study, the combinations are self-selected or are 'experiments of nature'. Observational studies provide weaker empirical evidence because large confounding biases may be present where there is an unknown association between a factor and an outcome.
The greatest value of these types of studies is that they provide preliminary evidence that can be used as the basis for hypotheses in stronger experimental studies.

Descriptive studies: The objective of descriptive studies is to describe the distribution of variables in a group. Statistics serve only to describe the precision of those measurements or to make statistical inferences about the values in the population from which the sample is drawn. Such studies ask questions about:
(a) When is the disease occurring? (time distribution)
(b) Where is it occurring? (place distribution)
(c) Who is getting the disease? (person distribution)

Measurement of morbidity in descriptive studies: Measurement of morbidity has two aspects, incidence and prevalence. Incidence can be obtained from longitudinal studies and prevalence from cross-sectional studies. Besides case series and case reports, descriptive studies may use cross-sectional and longitudinal studies to obtain estimates of the health and disease problems of the population.

Case series: A descriptive, observational study of a series of cases, typically describing the manifestations, clinical course and prognosis of a condition. A case series provides weak empirical evidence because of the lack of comparability, unless the findings are dramatically different from expectations. Case series are best used as a source of hypotheses for investigation by stronger study designs. Unfortunately, the case series is the most commonly used study design in the clinical literature.

Case report: A description of a single case, typically describing the manifestations, clinical course and prognosis of that case. Because of the wide range of natural biological variation in these aspects, a single case report provides little empirical evidence to the clinician. Case reports do describe how others diagnosed and treated the condition and what the clinical outcome was.
Longitudinal studies (Incidence Study): Longitudinal studies are those in which the observations are repeated in the same population over a prolonged period of time by means of follow-up examinations. Longitudinal studies are useful for (a) identifying the risk factors of disease and (b) finding the incidence rate, or rate of occurrence of new cases of the disease, in the community.

Cross-sectional studies (Prevalence Study): A descriptive study of the relationship between disease and other factors at one point of time (usually) in a defined population. Cross-sectional studies lack any information on the timing of the exposure and outcome relationship and include only prevalent cases. Cross-sectional studies are more useful for chronic than for short-lived diseases. This type of study tells us about the distribution of a disease in a population rather than its aetiology.

Analytical studies: In analytical studies, the subject of interest is the individual within the population. The object is not to formulate but to test hypotheses. Although individuals are evaluated in analytical studies, the inference is not to the individual but to the population from which they are selected.

Measurement of morbidity in analytical studies: Analytical studies comprise two distinct types of observational study: (a) the cohort study and (b) the case control study. From these studies we can determine (1) whether or not a statistical association exists between a disease and a suspected factor and (2) if it exists, the strength of the association.

Cohort study: A prospective, analytical, observational study, based on data, usually primary, from a follow-up period of a group in which some have had, have or will have the exposure of interest, carried out to determine the association between the exposure and an outcome. A 'cohort' is defined as a group of people who share a common characteristic or experience within a defined period.
In a cohort study a population of individuals is selected, usually by geographical or occupational criteria rather than on medical grounds. The population is classified by the factor or factors of interest and followed prospectively in time, so that the rates of occurrence of various manifestations of disease can be observed and related to the classification by aetiological factors. Because of their prospective nature, cohort studies are stronger than case-control studies when well executed, but they are more expensive.

Case control study: A retrospective, analytical, observational study, often based on secondary data, in which the proportion of cases with a potential risk factor is compared with the proportion of controls (individuals without the disease) with the same risk factor. The method is appropriate when the classification by disease is simple (i.e. presence or absence of a specific condition). A further advantage is that, by means of the retrospective enquiry, the relevant information can be obtained comparatively quickly. A central problem in a case control study is the method by which the controls are chosen. Ideally, they should be on average similar to the cases in all respects except the medical condition under study and the associated aetiological factors. These studies are commonly used for the initial, inexpensive evaluation of risk factors with long induction periods. Unfortunately, because of the potential for many forms of bias in this study type, case control studies provide relatively weak empirical evidence even when properly executed.

Case control studies are often called retrospective studies, while cohort studies are called prospective studies.

Experimental Studies

The hallmark of the experimental study is that the allocation or assignment of individuals is under the control of the investigator and thus can be randomized.
The key is that the investigator controls the assignment of the exposure or the treatment, while symmetry of potential unknown confounders is otherwise maintained through randomization. Properly executed experimental studies provide the strongest empirical evidence. Randomization also provides a better foundation for statistical procedures than observational studies do. The following are some important randomized controlled study designs:

Randomized controlled clinical trial (RCT): A prospective, analytical, experimental study using primary data generated in the clinical environment. Individuals who are similar at the beginning are randomly allocated to two or more treatment groups, and the outcomes of the groups are compared after a sufficient follow-up time. Properly executed, the RCT is the strongest evidence of the clinical efficacy of preventive and therapeutic procedures in the clinical setting.

Randomized cross-over clinical trial: A prospective, analytical, experimental study using primary data generated in the clinical environment. Individuals with a chronic condition are randomly allocated to one of two treatment groups and, after a sufficient treatment period and often a washout period, are switched to the other treatment for the same period. In this type of study design each patient serves as his own control.

The patients are randomly assigned to a study group and a control group. The study group receives the treatment under consideration. The control group receives some alternative form of active treatment or a placebo. The two groups are observed over time. The patients in each group are then taken off their medication or placebo to allow for elimination of the medication from the body and for the possibility of any 'carry over' effects. After this period the two groups are switched: those who received the treatment under study are changed to the control group therapy or placebo, and vice versa.
Cross-over studies have the advantage that, during the course of the investigation, all patients will receive the new therapy. But this design is susceptible to bias if carry over effects of the first treatment occur.

Randomized controlled laboratory study: A prospective, analytical, experimental study using primary data generated in the laboratory environment. Laboratory studies are very powerful tools for basic research because all extraneous factors other than those of interest can be controlled or accounted for (e.g. age, gender, genetics, nutrition, environment, etc.). However, this control of other factors is also the weakness of this type of study. If any interaction occurs between these factors and the outcome of interest, which is usually the case, the laboratory results are not directly applicable to the clinical setting unless the impact of these interactions is also investigated.

Bias in Studies (Systematic Error)

Almost all studies have bias, but to varying degrees. Bias can be reduced only by proper study design and execution, and not by increasing the sample size (which only increases precision by reducing the opportunity for a random chance deviation from the truth). The critical question is whether or not the results could be due in large part to bias, thus making the conclusion invalid. Observational study designs are inherently more susceptible to bias than experimental study designs. The following are some biases that can occur in any study:

Confounding bias: Confounding is the distortion of the effect of one risk factor by the presence of another. Confounding occurs when another risk factor for a disease is also associated with the risk factor being studied but acts separately. Age, gender and breed are often confounding risk factors. Confounding can be controlled by restriction or by matching on the confounding variable.
Confounding bias is a systematic error due to the failure to account for the effect of one or more variables that are related to both the causal factor being studied and the outcome, and are not distributed in the same manner between the groups being studied. Confounding can be accounted for if the confounding variables are measured and included in the statistical model of the cause-effect relationship.

Ecological (Aggregation) bias: Systematic error that occurs when an association observed between variables representing group averages is mistakenly taken to represent the actual association that exists between these variables for individuals. This bias occurs when the nature of the association at the individual level is different from the association observed at the group level.

Measurement bias: Systematic error that occurs when, because of the lack of blinding or related reasons such as diagnostic suspicion, the measurement methods (instrument, or observer of the instrument) are consistently different between groups in the study. Screening bias is one of the most important measurement biases.

Screening bias: The bias that occurs when the presence of a disease is detected earlier, during its latent period, by screening tests, but the course of the disease is not changed by earlier intervention. Because survival after screening detection is longer than survival after detection of clinical signs, ineffective interventions appear to be effective unless they are compared appropriately in clinical trials.

Readers bias: Systematic errors of interpretation made during inference by the user or reader of clinical information. Such biases are due to clinical experience, tradition, prejudice and human nature. The human tendency is to accept information that supports preconceived opinions and to reject information that does not support them.
Sampling (Selection) bias: Systematic error that occurs when, because of design and execution errors in sampling, selection or allocation methods, the study comparisons are between groups that differ with respect to the outcome of interest for reasons other than those under study.

Analysis of Epidemiological Studies

Analysis of Cohort Study

In a cohort study the data are analyzed in terms of:
(a) Incidence rates of the outcome among the exposed and the non-exposed.
(b) Estimation of risk.

(a) Incidence Rates

In a cohort study, we can determine incidence directly in those exposed and those not exposed. The framework of the cohort study can be represented as follows:

                 Disease
Cohort           Positive        Negative        Total
Exposed          a               b               (a + b) = H1
Non-exposed      c               d               (c + d) = H2
Total            (a + c) = V1    (b + d) = V2    N

Then the incidence rates are:

$$\text{Incidence among exposed} = \frac{a}{a + b} = \frac{a}{H_1} \qquad \text{Incidence among non-exposed} = \frac{c}{c + d} = \frac{c}{H_2}$$

(b) Estimation of Risk

The risk of the outcome (disease or death) in the exposed and non-exposed cohorts is determined by two indices: (a) relative risk and (b) attributable risk.

Relative Risk

Relative risk is the ratio of the incidence of the disease (or death) among the exposed to the incidence among the non-exposed. It may also be referred to as the risk ratio. Estimation of relative risk is important in aetiological studies. It directly measures the 'strength' of the association between the suspected cause and the effect. A relative risk of 1 indicates no association; a relative risk greater than 1 suggests a 'positive' association between the exposure and the disease under study. The larger the relative risk, the greater the strength of the association between the suspected factor and the disease.

$$\text{Relative risk (RR)} = \frac{a / H_1}{c / H_2}$$

Attributable Risk

Attributable risk (AR) is the difference in the incidence rates of disease (or death) between the exposed group and the non-exposed group. It may also be referred to as the "risk difference".

$$\text{Attributable risk (AR)} = \frac{a}{H_1} - \frac{c}{H_2}$$
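As an illustration of the cohort calculations defined above, the following Python sketch computes the two incidence rates, the relative risk and the attributable risk (risk difference) from the cells a, b, c and d of the cohort table. The function name `cohort_measures` and the counts used are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def cohort_measures(a: int, b: int, c: int, d: int) -> dict:
    """Incidence, relative risk and risk difference from a cohort 2 x 2 table.

    a, b = exposed with / without the disease (row total H1)
    c, d = non-exposed with / without the disease (row total H2)
    """
    h1, h2 = a + b, c + d
    if h1 == 0 or h2 == 0:
        raise ValueError("both cohorts must contain at least one person")
    if c == 0:
        raise ValueError("relative risk is undefined when no non-exposed person is diseased")
    incidence_exposed = a / h1
    incidence_non_exposed = c / h2
    return {
        "incidence_exposed": incidence_exposed,
        "incidence_non_exposed": incidence_non_exposed,
        "relative_risk": incidence_exposed / incidence_non_exposed,
        "attributable_risk": incidence_exposed - incidence_non_exposed,
    }


# Hypothetical cohort: 30 of 100 exposed and 10 of 100 non-exposed develop the disease.
result = cohort_measures(a=30, b=70, c=10, d=90)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With these invented counts the exposed group has three times the incidence of the non-exposed group, while the risk difference is 20 cases per 100 persons, which shows how the two indices answer different questions about the same table.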
Attributable risk is often expressed as a percentage. It indicates to what extent the disease under study can be attributed to the exposure.

Relative Risk vs Attributable Risk

Relative risk is important in aetiological enquiries: the larger the relative risk, the stronger the association between cause and effect. Attributable risk gives a better idea than relative risk of the impact of a successful preventive or public health programme.

Analysis of Case Control Study

In a case control study the data are analyzed in terms of:
(a) Exposure rates among cases and controls to the suspected factor.
(b) Estimation of the disease risk associated with exposure (odds ratio).

Exposure Rates

A case control study provides a direct estimate of the exposure rate (frequency of exposure) to a suspected factor in the disease and non-disease groups. The framework of a case control study takes the form of a 2 × 2 contingency table:

Factor           Case            Control         Total
Exposed          a               b               (a + b) = H1
Non-exposed      c               d               (c + d) = H2
Total            (a + c) = V1    (b + d) = V2    N

$$\text{Exposure rate for cases} = \frac{a}{a + c} \qquad \text{Exposure rate for controls} = \frac{b}{b + d}$$

The exposure rates of cases and controls can be compared by applying suitable statistical tests (the z-test for comparing the proportions of two groups, or the Chi-square test for the association between group and factor).

Estimation of Risk Associated with Exposure

A typical case control study does not provide incidence rates from which a relative risk (RR) can be directly calculated. The common measure of association for a case control study is the odds ratio.

Odds Ratio

The odds ratio is a measure of the strength of association between a risk factor and an outcome. Cases must be representative of those with the disease and controls of those without the disease. The odds ratio is the ratio of a/b to c/d; these two quantities can be thought of as the odds in favour of having the disease among the exposed and the non-exposed respectively.

$$\text{Odds ratio (OR)} = \frac{a/b}{c/d} = \frac{ad}{bc}$$

The odds ratio is a key parameter in the analysis of a case control study.

Important Features of Relative Risk (Risk Ratio) and Odds Ratio:
(a) The odds ratio is used in the retrospective design called the case control study, while the risk ratio is useful in the cohort (prospective) study design.
(b) Both the odds ratio and the relative risk compare the likelihood of an event between two groups. The odds ratio compares the odds of death (disease) in each group, while the relative risk (risk ratio) compares the probability of death (disease) in each group.
(c) Both the odds ratio and the relative risk are computed by division and are relative measures.
(d) Both the risk ratio and the odds ratio take values between zero (0) and infinity (∞). One is the neutral value and means that there is no difference between the groups compared; values close to zero or infinity indicate a large difference. A risk ratio/odds ratio larger than 1 means that group one has a larger proportion than group two; if the opposite is true, the risk ratio/odds ratio will be smaller than 1. If we swap the two proportions, the risk ratio/odds ratio takes its inverse (1/RR; 1/OR).
(e) The odds ratio can be compared with the risk ratio. The risk ratio is easier to interpret than the odds ratio. However, in practice the odds ratio is used more often, because it is more closely related to frequently used statistical techniques such as logistic regression.
(f) The risk ratio gives the percentage difference in classification between group one and group two, while the odds ratio gives the ratio of the odds of suffering some fate. The odds themselves are also ratios.
(g) Both the odds ratio and the risk ratio are non-negative values and lie between 0 and ∞ (0 < OR < ∞; 0 < RR < ∞).
(h) The significance of the odds ratio can be tested by using the 95% confidence interval.
If the value 1 is not included within the 95% CI, then the odds ratio is significant at the 5% level (p < 0.05).

Diagnostic Tests

In epidemiological studies much use is made of diagnostic tests, based either on clinical observations or on laboratory techniques, by means of which individuals are classified as healthy or as falling into one of a number of disease categories. Such tests are, of course, important throughout the whole of medicine, and in particular form the basis of screening programmes for the early diagnosis of disease. Most such tests are imperfect instruments, in the sense that healthy individuals will occasionally be classified wrongly as being ill, while some individuals who are really ill may fail to be detected. How should we measure the ability of a particular diagnostic test to give the correct diagnosis both for healthy and for ill subjects?

Properties of diagnostic tests have traditionally been described using sensitivity, specificity, and positive and negative predictive values. These measures, however, reflect population characteristics and do not easily translate to individual patients.

In clinical practice, physicians are often faced with interpreting the results of diagnostic tests. These results are not absolute. A negative test does not always rule out disease, and some positive results can be false. Clinical epidemiology has long focused on sensitivity and specificity, as well as positive and negative predictive values, as a way of measuring diagnostic utility. The test is compared against a reference (gold) standard, and the results are tabulated in a 2 × 2 contingency table.

The gold standard is the test that is considered to be the most accurate among all known tests. All other tests should be compared with it in order to indicate whether they are reliable, so that less accurate tests are not preferred.

Sensitivity: Sensitivity is the proportion of those with the disease who test positive.
Sensitivity is a measure of how well the test detects disease when it is really there; a sensitive test has few false negatives.

Specificity: Specificity is the proportion of those without the disease who test negative. It measures how well the test rules out disease when it is really absent; a specific test has few false positives.

Predictive values: Considering sensitivity and specificity we can choose what is necessary or helpful, but the most important measure is the predictive value. The result of a test can be positive or negative. If the test is positive or abnormal, it is necessary to know how likely it is that the disease is really present. The positive predictive value expresses how often a positive test result really represents disease, i.e. the proportion of those with positive test results who truly have the disease. The negative predictive value, on the other hand, is the probability that a negative result really corresponds to a disease-free person.

Thus we can summarize these measures of diagnostic tests as:

Sensitivity: is disease focused, i.e. the percentage of people with the disease that the test correctly identifies.

Specificity: is wellbeing or normal focused, i.e. the percentage of normal people that the test correctly identifies as normal.

Positive predictive value: focuses on the positive results, i.e. the percentage of positive results that are correct.

Negative predictive value: focuses on the negative results, i.e. the percentage of negative results that are correct.

Early Diagnostic and Screening Test

Defining normality and abnormality: One of the central concerns in clinical medicine is differentiating the normal from the abnormal. How does one, for instance, decide that somebody has hypertension? This would not be a big problem if the frequency distributions of BP in hypertensive and non-hypertensive people were completely different and did not overlap.
In reality, they overlap (Figure 10.2), and no matter which cut-off point is used for diagnosis, some hypertensives will be wrongly labelled as normotensive, while some normotensives will be diagnosed as hypertensive. If the cut-off point is moved to the left, the number of false negatives will decrease at the expense of more false positives. If the cut-off point is shifted to the right, the reverse will happen.

Figure 10.2

An ideal test would completely separate the diseased and the disease-free groups, with no overlap (Figure 10.3). Such ideal tests are very rare. Overlap is almost always seen, and this makes it difficult to validate tests.

Figure 10.3 (three panels): a test with complete separation of groups results in perfect diagnostic performance; a test with partial separation of groups results in intermediate diagnostic performance; a test with no separation of groups gives no diagnostic information.

Validity of Test

A diagnostic test is valid if it detects most people with the target disorder and excludes most people without the disorder, and if a positive test usually indicates that the disorder is present. To understand this, we need to understand the need to validate tests against a gold standard. Using a 2 × 2 table, we can compute the sensitivity, specificity, positive predictive value and negative predictive value of the test. It is important that all new tests are validated by comparison against a test that is established and considered a gold standard.

Diagnostic tests are generally not 100% accurate. If the sensitivity is very high, the specificity tends to be low.
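The trade-off described above, in which moving the cut-off exchanges false negatives for false positives, can be made concrete with a short Python sketch. It assumes, purely for illustration, that blood pressure is normally distributed in each group; the means and standard deviations below are invented values, not figures from this book, and the function name is hypothetical.

```python
from statistics import NormalDist

# Hypothetical systolic BP distributions (mmHg). The means and standard
# deviations are illustrative assumptions, not data from the text.
normotensive = NormalDist(mu=120, sigma=12)
hypertensive = NormalDist(mu=150, sigma=15)


def sensitivity_specificity(cutoff: float) -> tuple:
    """Label a reading as hypertensive when it is at or above `cutoff`."""
    sensitivity = 1 - hypertensive.cdf(cutoff)  # hypertensives correctly labelled
    specificity = normotensive.cdf(cutoff)      # normotensives correctly labelled
    return sensitivity, specificity


for cutoff in (125, 135, 145):
    sens, spec = sensitivity_specificity(cutoff)
    print(f"cut-off {cutoff} mmHg: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Lowering the cut-off (moving it to the left) raises sensitivity and lowers specificity, and raising it does the reverse, which is the behaviour the overlapping curves in Figure 10.2 are meant to convey.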
Suppose the data are classified as:

                 Gold standard*
Test result      Positive        Negative        Total
+                a               b               a + b
-                c               d               c + d
Total            a + c           b + d           N

[* By the gold standard we can classify the individual according to the presence/absence of a particular disease]

'a' = True positive
'b' = False positive
'c' = False negative
'd' = True negative

Sensitivity: The proportion of persons with the condition who test positive.

$$\text{Sensitivity} = \frac{a}{a + c}$$

Specificity: The proportion of persons without the condition who test negative.

$$\text{Specificity} = \frac{d}{b + d}$$

Positive predictive value: The proportion of persons with a positive test who have the condition.

$$\text{Positive predictive value} = \frac{a}{a + b}$$

Negative predictive value: The proportion of persons with a negative test who do not have the condition.

$$\text{Negative predictive value} = \frac{d}{c + d}$$

Diagnostic accuracy: The following expression gives the diagnostic accuracy of the test.

$$\text{Diagnostic accuracy} = \frac{a + d}{a + b + c + d}$$

Prevalence: The prevalence of the disease is the ratio of the total positive cases by the gold standard to the total number of cases.

$$\text{Prevalence} = \frac{a + c}{a + b + c + d}$$

Predictive Value in Relation to Prevalence

Positive predictive value (PPV) is a function of specificity, sensitivity and prevalence. The positive predictive value is expressed as a percentage. It is influenced by the sensitivity and specificity of the screening test and by the prevalence of the disease.

SENSITIVITY AND SPECIFICITY IN TERMS OF TYPE-I AND TYPE-II ERRORS

Table related to decision and hypothesis (types of error):

                 Decision from sample
True statement   Accept H0              Reject H0              Total
H0 true          1 - α                  α (Type-I error)       1
H0 false         β (Type-II error)      1 - β (power)          1

Table related to diagnostic test:

                 Gold standard*
Test result      Positive        Negative        Total
+                a               b               a + b
-                c               d               c + d
Total            a + c           b + d           N

From the above tables we can see that β (Type-II error) corresponds to a false negative and α (Type-I error) to a false positive; (1 - β), the power of the test, corresponds to a true positive and (1 - α) to a true negative.

As we know, the sensitivity of a test is a/(a + c); therefore (1 - β), the power of the test, is equal to the sensitivity. Similarly, (1 - α) is equal to the specificity, d/(b + d).
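The measures defined from the 2 × 2 table can be computed directly, as in the Python sketch below. The counts are hypothetical. The second function expresses the positive predictive value in terms of sensitivity, specificity and prevalence using Bayes' theorem; this form is supplied here as a standard identity and is an assumption about what the section intends, since no explicit formula is given in the text above. All function names are illustrative.

```python
def diagnostic_measures(a: int, b: int, c: int, d: int) -> dict:
    """a = true positives, b = false positives, c = false negatives, d = true negatives."""
    n = a + b + c + d
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "accuracy": (a + d) / n,
        "prevalence": (a + c) / n,
    }


def ppv_from_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' theorem (standard identity, not quoted from the text)."""
    true_positive_share = sensitivity * prevalence
    false_positive_share = (1 - specificity) * (1 - prevalence)
    return true_positive_share / (true_positive_share + false_positive_share)


# Hypothetical validation study of 1,000 people against a gold standard.
measures = diagnostic_measures(a=90, b=30, c=10, d=870)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")

# The same test applied to populations with different disease prevalence.
for prevalence in (0.01, 0.10, 0.30):
    ppv = ppv_from_prevalence(measures["sensitivity"], measures["specificity"], prevalence)
    print(f"prevalence {prevalence:.2f}: PPV {ppv:.3f}")
```

Holding sensitivity and specificity fixed, the loop shows the positive predictive value rising as prevalence rises, which is why the same screening test performs differently in low-prevalence and high-prevalence populations.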
Thus we see that there is an analogy here with significance tests. If the null hypothesis is that an individual is a true negative and a positive test is regarded as significant, then α is analogous to the significance level and 1 – β is analogous to the power of the test against the alternative hypothesis that the individual is a true positive.

Likelihood Ratio

A fairly new concept in diagnostic tests is that of likelihood ratios. Likelihood ratios are a more practical way of making sense of diagnostic test results and have immediate clinical relevance. In general, a useful test has a high positive likelihood ratio and a small negative likelihood ratio. Likelihood ratios are independent of disease prevalence.

They may be understood using the following analogy. Assume that the patient tests positive on a diagnostic test; if this were a perfect test, it would mean that the patient certainly has the disease (true positive). The only thing that stops us from drawing this conclusion is that some patients without the disease also test positive (false positive). We therefore have to correct the true positive (TP) rate by the false positive (FP) rate; this is done mathematically by dividing one by the other.

$\text{Positive likelihood ratio} = \dfrac{\text{Probability of a positive test in those with disease}}{\text{Probability of a positive test in those without disease}} = \dfrac{\text{TP rate}}{\text{FP rate}} = \dfrac{a/(a + c)}{b/(b + d)}$

Likewise, if a patient tests negative, we are still worried about the likelihood of this being a false negative (FN) rather than a true negative (TN). This likelihood is given mathematically by the probability of a negative test in those with disease, compared to the probability of a negative test in those without disease.
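Both ratios described so far can be written in terms of sensitivity and specificity, since the TP rate is the sensitivity, the FP rate is 1 – specificity, the FN rate is 1 – sensitivity and the TN rate is the specificity. The following is a minimal sketch; the function name `likelihood_ratios` is hypothetical, and the input values are the sensitivity (65%) and specificity (89%) quoted for exercise stress testing in the worked example of this section.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive_lr, negative_lr) for a test whose sensitivity
    and specificity are given as proportions between 0 and 1."""
    positive_lr = sensitivity / (1 - specificity)   # TP rate / FP rate
    negative_lr = (1 - sensitivity) / specificity   # FN rate / TN rate
    return positive_lr, negative_lr


positive_lr, negative_lr = likelihood_ratios(sensitivity=0.65, specificity=0.89)
print(f"Positive likelihood ratio: {positive_lr:.1f}")
print(f"Negative likelihood ratio: {negative_lr:.2f}")
```

A perfectly specific test would make the denominator of the positive ratio zero, so a fuller implementation would need to handle that case separately.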
$\text{Negative likelihood ratio} = \dfrac{\text{Probability of a negative test in those with disease}}{\text{Probability of a negative test in those without disease}} = \dfrac{\text{FN rate}}{\text{TN rate}} = \dfrac{c/(a + c)}{d/(b + d)}$

Likelihood ratios have a number of useful properties:
1. Because they are based on a ratio of sensitivity and specificity, they do not vary across populations or settings.
2. They can be used directly at the individual patient level.
3. They allow the clinician to quantify the probability of disease for any individual patient.

The interpretation of likelihood ratios is intuitive: the larger the positive likelihood ratio, the greater the likelihood of disease; the smaller the negative likelihood ratio, the lesser the likelihood of disease.

For example, consider a 50-year-old male with a positive stress test. It is known that a depression of more than 1 mm on exercise stress testing has a sensitivity and specificity of 65% and 89% respectively for coronary artery disease when compared with the reference standard of angiography [Ref: Diamond GA et al. Analysis of probability as an aid in the clinical diagnosis of coronary-artery disease. N Engl J Med 1979; 300: 1350-8]. This means that

$\text{Positive likelihood ratio} = \dfrac{0.65}{1 - 0.89} = 5.9$

Thus the odds of this patient having the disease have increased approximately six-fold given the positive test result.

Likelihood ratios are therefore a useful and practical way of expressing the power of diagnostic tests to increase or decrease the likelihood of disease. Unlike sensitivity and specificity, which are population characteristics, likelihood ratios can be used at the individual patient level.

MULTIPLE CHOICE QUESTIONS

1. Prevalence of disease affects:
(a) Sensitivity
(b) Specificity
(c) Predictive value
(d) Repeatability
(AI, 92)
2.
Sensitivity of a test:
(a) True positive/(True positive + False negative)
(b) True negative/(True negative + False positive)
(c) False negative/(True negative + True positive)
(d) False negative/(True positive + False negative)
(AI, 92, 93, 97)
3. Which of the following is not true for a case control study?
(a) Easy to carry out
(b) Inexpensive
(c) Attributable risk can be measured
(d) No attrition problem
(AI, 94)
4. All are true about prevalence except:
(a) Rate
(b) Specifically for old and new cases
(c) Prevalence = incidence × duration
(d) Prevalence is of two types
(AI, 96)
5. Case control study provides all except:
(a) Incidence
(b) Relative risk
(c) Odds ratio
(d) Strength of association
(AI, 97)
6. All are true about prevalence except:
(a) Rate
(b) Ratio
(c) Duration of disease affects it
(d) Numerator and denominator are separate
(AI, 98)
7. Incidence rate is measured by:
(a) Case control study
(b) Cohort study
(c) Cross-sectional study
(d) Cross over study
(AI, 98)
8. Predictive value for positive test is defined as:
(a) True positive/(True positive + False negative) × 100
(b) True positive/(True positive + False positive) × 100
(c) False positive/(True positive + False positive) × 100
(d) False positive/(True positive + False negative) × 100
(AI, 99)
9. Specificity of a test means all except:
(a) Identify those without disease
(b) True positive
(c) True negative
(d) An ideal screening test should have 100% specificity
(AI, 2000)
10. ELISA test for HIV was done in a population. What will be the result of performing double screening ELISA test:
(a) Increased sensitivity and positive predictive value
(b) Increased sensitivity and negative predictive value
(c) Increased specificity and positive predictive value
(d) Increased specificity and negative predictive value
(AI, 2001)
[Hint: By performing double screening, the true positives will increase and the false negatives will decrease]
11.
Incidence is calculated by:
(a) Retrospective study
(b) Prospective study
(c) Cross-sectional study
(d) Random study
(AIIMS, May 95)
12. Prevalence is a:
(a) Rate
(b) Ratio
(c) Proportion
(d) Mean
(AIIMS, Feb 97)
13. Incidence of disease among exposed minus that of non-exposed is equal to:
(a) Relative risk
(b) Attributable risk
(c) Odds ratio
(d) None of the above
(AIIMS, June 97)
14. Specificity is related to:
(a) True positive
(b) True negative
(c) False positive
(d) False negative
(AIIMS, Dec 97)
15. ELISA test has sensitivity of 95% and specificity of 95%. Prevalence of HIV carriers is 5%. The predictive value of positive test is:
(a) 95%
(b) 50%
(c) 100%
(d) 75%
[Solution: The positive predictive value is given by
$\text{PPV} = \dfrac{\text{Prevalence} \times \text{sensitivity}}{\text{Prevalence} \times \text{sensitivity} + (1 - \text{Prevalence})(1 - \text{specificity})} = \dfrac{0.05 \times 0.95}{0.05 \times 0.95 + (1 - 0.05)(1 - 0.95)} = \dfrac{0.05 \times 0.95}{0.05 \times 0.95\,(1 + 1)} = \dfrac{1}{2} = 0.5$
and, expressed as a percentage, is 50%]
(AIIMS, June 99)
16. All of the following are true about case control study except:
(a) Relatively cheap
(b) Relative risk can be calculated
(c) Used for rare cases
(d) Odds ratio can be calculated
(AIIMS, June 2000, AI 2002)
17. Which of the following is best for calculating the incidence of a disease:
(a) Case control
(b) Cohort
(c) Cross-sectional study
(d) Longitudinal study
(AIIMS, Nov 2000)
18. Too many false positives in a test are due to which of the following:
(a) High prevalence
(b) Test with high specificity
(c) Test with high sensitivity
(d) High incidence
(AIIMS, Nov 2000)
19. In a community, the specificity of ELISA test is 99% and sensitivity is 99%. The prevalence of the disease is 5/1000.
Then positive predictive value of the test is:
(a) 33%
(b) 67%
(c) 75%
(d) 99%
[Solution: The positive predictive value is given by
$\text{PPV} = \dfrac{\text{Prevalence} \times \text{sensitivity}}{\text{Prevalence} \times \text{sensitivity} + (1 - \text{Prevalence})(1 - \text{specificity})}$
Here Prevalence $= \dfrac{5}{1000} = 0.005$ and sensitivity = specificity = 0.99, so
$\text{PPV} = \dfrac{0.005 \times 0.99}{0.005 \times 0.99 + (1 - 0.005)(1 - 0.99)} = \dfrac{0.005 \times 0.99}{0.005 \times 0.99 + (0.995)(0.01)} \approx \dfrac{0.005 \times 0.99}{0.99\,(0.005 + 0.01)}$ (taking $0.995 \approx 0.99$) $= \dfrac{0.005}{0.015}$
= approximately 0.33 and, expressed as a percentage, is 33%]
(AIIMS, May 2001)
20. In a village of 1 lakh population, among 20,000 exposed to smoking 200 developed cancer, and among 40,000 people unexposed 40 developed cancer. The relative risk of smoking in the development of cancer is:
(a) 20
(b) 10
(c) 5
(d) 15
[Hint: Incidence among smokers $= \dfrac{200}{20{,}000} = 0.01$; incidence among non-smokers $= \dfrac{40}{40{,}000} = 0.001$; Relative risk $= \dfrac{0.01}{0.001} = 10$]
(AIIMS, May 2001)
21. A woman exposed to multiple sex partners has 5 times increased risk for CaCx. The attributable risk is:
(a) 20%
(b) 50%
(c) 80%
(d) 100%
[Solution: Let the incidence rate among the non-exposed be x; the incidence rate among the exposed is 5 times higher, that is, 5x. According to the definition of attributable risk,
$\text{AR} = \dfrac{\text{Incidence among exposed} - \text{Incidence among non-exposed}}{\text{Incidence among exposed}} = \dfrac{5x - x}{5x} = \dfrac{4}{5} = 0.8$
and, expressed as a percentage, is 80%]
(AIIMS, Nov 2001)
22. All are true about case control study except:
(a) Less expensive
(b) Those with disease and not diseased compared
(c) Attributed risk is estimated
(d) None of these
(AIIMS, Nov 2001)
23. Which of the following is true about cohort study:
(a) Incidence can be calculated
(b) It is from effect to cause
(c) It is inexpensive
(d) Shorter time than case control
(JIPMER, 2003)
24. For the calculation of positive predictive value of a screening test, the denominator is comprised of:
(a) True positives + False negatives
(b) False positives + True negatives
(c) True positives + False positives
(d) True positives + True negatives
(AI, 2003)
25.
The table below shows the screening test results of disease ‘Z’ in relation to the true disease status of the population being tested:

                             Disease
Screening test results   Yes      No       Total
Positive                 400      200      600
Negative                 100      600      700
Total                    500      800      1300

The specificity of the screening test is:
(a) 70%
(b) 75%
(c) 79%
(d) 86%
26. If prevalence of diabetes is 10%, the probability that three people selected at random from the population will have diabetes is:
(a) 0.01
(b) 0.03
(c) 0.001
(d) 0.003
[Hint: There are two rules of probability, the addition law and the multiplication law. The probability of one person having diabetes is $p = \dfrac{1}{10} = 0.1$. The probability of all 3 having diabetes can be calculated using the multiplication law of probability. It will be $p \times p \times p = 0.1 \times 0.1 \times 0.1 = 0.001$]
27. The usefulness of a screening test depends upon its:
(a) Sensitivity
(b) Specificity
(c) Reliability
(d) Predictive value
(AI, 2002)
28. In a low prevalence area for Hepatitis B, it was decided to perform a double ELISA test in place of the single test that used to be done. This would cause an increase in the:
(a) Specificity and positive predictive value
(b) Sensitivity and positive predictive value
(c) Sensitivity and negative predictive value
(d) Specificity and negative predictive value
(AI, 2002)
29. The association between coronary artery disease and smoking was found to be as follows:

                               Smokers     Non-smokers
Coronary artery disease        30          20
No coronary artery disease     20          30

The odds ratio can be estimated as:
(a) 0.65
(b) 0.8
(c) 1.3
(d) 2.25
[Hint: Odds ratio $= \dfrac{30 \times 30}{20 \times 20} = 2.25$]
(AI, 2002)
30. A screening test is used in the same way in two similar populations, but the proportion of false positive results among those who test positive in population A is lower than among those who test positive in population B. What is the likely explanation?
(a) The specificity of the test is lower in population A
(b) The prevalence of the disease is lower in population A
(c) The prevalence of the disease is higher in population A
(d) The specificity of the test is higher in population A
[Hint: When the proportion of false positive results in population A is less than that in B, the PPV in population A is higher than that in B; thus, by the formula, the prevalence in population A is higher than that in B]
(AIIMS, 2003)
31. Residents of three villages with three different types of water supply were asked to participate in a study to identify cholera carriers. Because several cholera deaths had occurred in the recent past, virtually everyone present at the time submitted to examination. The proportion of residents in each village who were carriers was computed and compared. This study is a:
(a) Cross-sectional study.
(b) Case-control study.
(c) Concurrent cohort study.
(d) Non-concurrent.
(AIIMS, 2003)
32. A drug company is developing a new pregnancy-test kit for use on an outpatient basis. The company used the pregnancy test on 100 women who were known to be pregnant. Out of 100 women, 99 showed a positive test. Upon using the same test on 100 non-pregnant women, 90 showed a negative result. What is the sensitivity of the test?
(a) 90%
(b) 99%
(c) Average of 90 and 99
(d) Cannot be calculated from the given data
[Hint:

                  Pregnant     Non-pregnant     Total
Test positive     99           10               109
Test negative     1            90               91
Total             100          100              200

Sensitivity $= \dfrac{99}{100} = 0.99$ (expressed as a percentage = 99%)]
(AIIMS, 2003)
33. Which of the following relationships between different parameters of the performance of a test is correct:
(a) Sensitivity = 1 – specificity
(b) Positive predictive value = 1 – negative predictive value
(c) Sensitivity is inversely proportional to specificity
(d) Sensitivity = 1 – positive predictive value
[Hint: Sensitivity and specificity cannot both be increased simultaneously.
If one increases, the other will decrease]
(AIIMS, 2004)
34. Which of the following is not an advantage of a prospective cohort study:
(a) Precise measurement of exposure is possible
(b) Many disease outcomes can be studied simultaneously
(c) It usually costs less than a case control study
(d) Recall bias is minimized compared with a case control study
35. The incidence rate of a disease is five times greater in women than in men, but the prevalence rate shows no sex difference. The best explanation is that:
(a) The crude death rate (by all causes) is greater in women
(b) The case fatality rate for this disease is lower in women
(c) The case fatality rate is greater in women
(d) Risk factors for the disease are more common in women
36. In a study of a disease in which all cases that developed were ascertained, if the relative risk for the association between factor and disease is equal to or less than 1 then:
(a) The factors protect against the development of the disease
(b) There is either no association or a negative association between the factors and disease
(c) Either matching is not done properly
(d) There is a significant positive association between the diseases
[Hint: A risk ratio of 1 indicates that there is no difference between the two groups, and the range of the risk ratio lies between (0
60 4,000 12,000 6,000 8,000 36 48 66 158 3,000 20,000 4,000 3,000 30 100 48 60 1,000 4,000 3,000 2,000 Total 30,000 308 30,000 238 10,000
Find out the death rate of which district is higher.
9. The following data give the number of women of child-bearing age and the yearly births in five-year age groups for a city. Calculate the general fertility rate and the total fertility rate. If the ratio of male to female is 13:12, what is the gross reproductive rate?
310 Medical Statistics and Demography Made Easy Age Group Female pop Births Age Group Female pop Births 15 – 19 20 – 24 25 – 29 30 – 34 16,000 15,000 14,000 13,000 400 1710 2100 1430 35 – 39 40 – 44 45 – 49 Total 12,000 11,000 9,000 60,000 960 330 36 6690 10. A total of 1,000 individuals were surveyed and classified as: Hypertensive Normotensive Total Smokers Non-smokers 250 50 250 450 500 500 Total 300 700 1000 (a) Calculate the prevalence of hypertension from the study. (b) Calculate smoking rate among hypertensive and normotensive. (c) Find out whether, smoking is associated with hypertension. (d) Find out the risk associated with hypertension. 11. A comparative evaluation of Ziehl-Neelsen staining and culture on Lowenstein Jensen medium in the diagnosis of pulmonary and extrapulmonary tuberculosis patients. Following results were obtained: Unsolved Questions 311 Z-N stain L-J culture (Gold standard) Positive Negative Total Positive Negative 16 16 0 12 16 28 Total 32 12 44 Find out the sensitivity, specificity, positive predictive value, negative predictive value and diagnostic accuracy of Z.N. Stain. 12. Following are the marks obtained by students in an examination: Marks No. of students Marks No. of students 20 – 30 25 60 – 70 27 30 – 40 26 70 – 80 15 40 – 50 36 > 80 10 50 – 60 42 (a) Find out the quartile deviation (b) Also comment about the skewness of the distribution. 13. Form a frequency distribution table of the following data and calculate the two most suitable measures of central tendencies: 32 47 41 51 30 39 18 48 54 32 31 46 15 37 32 56 300 21 45 32 37 41 44 18 650 47 390 42 44 37 56 48 53 42 37 41 51 50 47 48 312 Medical Statistics and Demography Made Easy 14. The haemoglobin levels of patients are as follows: Hb% No. of cases Hb% No. of cases 6 7 8 9 10 14 23 26 30 130 11 12 13 14 110 70 50 12 (a) Find out the median of the above distribution by using ogives. (b) Also find out the mean by using short cut method. 15. 
A random sample of patients selected from the Cardiology OPD of a hospital have following values of blood pressure: Blood pressure No. of cases Blood pressure No. of cases 130 – 140 14 160 – 170 23 140 – 150 24 170 – 180 40 150 – 160 54 180 – 190 32 Calculate coefficient of dispersion (based on Quartiles and based on Mean and SD). 16. Find the correlation coefficient and line of regression between height and weight of 10 individuals: Unsolved Questions 313 Case No. Height Weight Case No. Height Weight 1 175 65 6 169 69 2 166 56 7 182 81 3 182 78 8 190 87 4 167 66 9 187 84 5 176 72 10 151 60 17. A survey conducted by a health agency, it was found that in Town A out of 876 birth 46% were male, while in Town B out of 690 birth 473 were males. Is there any significant difference in the proportion of male child in the two towns. Clearly state the hypothesis which is to be tested. 18. A sample of 900 individuals has a mean haemoglobin of 12.7 mg%. Is the sample drawn from a population with mean 13.6 mg% and SD 2.70. 19. A random sample is drawn from two hospitals and following data related to blood pressure of adult males hospital workers were obtained: Hospital A Hospital B Mean blood pressure 127.56 mmHg 140.78 mmHg Standard deviation 13.77 mmHg 10.37 mmHg No. of cases 360 700 Is the blood pressure of male workers of Hospital B is significantly higher than those working in Hospital A. 314 Medical Statistics and Demography Made Easy 20. Two groups of rats were placed on diets with high and low protein contents and the gain in weight (in gms) were recorded after 2 months. The results of gain in weight are as follows: Group A (high protein diet): 140 117 160 123 145 127 107 146 107 102 114 121 132 153 Group B (low protein diet): 97 63 110 120 96 74 86 120 115 120 150 Find out whether there is any significant difference between the weight gain in rats of two groups. 21. In a clinical trial the anxiety score of 10 patients were recorded (baseline value). 
A new tranquillizer was given to each period for one month. After one month the anxiety scores were again recorded. Which are as follows: Case No. Baseline value (xi) After one month (yi) Case No. Baseline value (xi) After one month (yi) 1 2 3 4 5 23 21 24 19 17 15 20 26 17 17 6 7 8 9 10 26 22 17 12 15 21 16 12 12 11 Find out whether the new tranquillizer is effective to psychoneurotic patients. Unsolved Questions 315 22. Concentration of haemoglobin (xi) and bilirubin (yi) for infants with haemolytic disease of newborn are as follows: Case No. (xi) (yi) Case No. (xi) (yi) 1 2 3 4 15.8 12.3 9.5 9.4 1.8 5.6 3.6 3.8 5 6 7 8 9.2 8.8 7.6 7.4 5.6 5.6 4.7 6.8 Calculate the correlation coefficient and comment whether haemoglobin level is directly proportional to bilirubin levels. 23. Most recent amount smoked by all patients other than those with cancer of the lung, from a retrospective survey, are as follows: Dis. Cigarette daily Total Group 0 1–4 5–14 15–24 > 24 Cancer RDS CHD GI Dis. Others 236 42 22 39 38 78 33 19 31 31 237 128 64 143 91 110 98 38 81 44 57 34 23 34 18 718 335 166 328 215 Total 377 185 663 371 166 1762 Find out whether various disease groups are associated with daily cigarette smoking. Also mention the degree of freedom required in this problem. 316 Medical Statistics and Demography Made Easy 24. Following table shows the number of individuals in various age groups who were found in a survey to be positive and negative for Schistosoma mansoni eggs in the stool. Age in yesrs 0–10 10–20 20–30 Total 30–40 > 40 Test + Test – 14 87 16 33 14 66 7 34 6 11 57 231 Total 101 49 80 41 17 288 Find out whether the presence of Schistosoma mansoni eggs in the stool is related to age. 25. Number of children who were nasal carrier or noncarrier of Streptococcus pyogenes, classified by size of tonsils. 
The results of survey as follows: Present but not enlarged Tonsils Enlarged Total Greatly enlarged Carrier Non-carrier 19 497 29 560 24 269 72 1326 Total 516 589 293 1398 Find out whether nasal carrier are associated with size of tonsils. 26. Two groups of female rats were placed on diets with high and low protein content, and gain in weight between the 28th and 84th days of age was measured for each rat. The results were as follows: Unsolved Questions 317 High protein diet (n – 12) 134 146 104 119 124 161 107 83 Low protein diet (n – 8) 113 129 97 123 70 118 101 85 107 132 94 115 Find out whether there is any significant increase in the weight of rats who were given high protein diet. 27. In a clinical trial to assess the value of a new method of treatment (A) in comparison with the old method (B). patients were divided at random into two groups. Out of 257 patients treated by method A. 41 died, of 244 patients treated by method B, 64 died. Find out whether difference in fatality rate of group A is less than group B. 28. Fill in the blanks: (a) Statistical hypothesis under test is called .................. (b) The probability of type-I error is given by ................... (c) The probability of type-I error is also called ................... (d) If β is the probability of type II error, the (1–b) is called ................ of the test. (e) The power of function is related to type ............. error. (f) In any testing problem, the type ................... error is considered more serious then type .................. error. (g) The level of significance of a test is related to type ............... error and is given by ................. 318 Medical Statistics and Demography Made Easy (h) Critical region provides a criteria for .................. Null hypothesis. (i) The choice of one tailed and two tailed test depends on ................. 29. 
Calculate standard deviation of the following two series: Series A 25 30 45 60 10 100 70 Series B 100 120 180 240 40 400 280 30. Two random samples of size 16 and 25 are drawn from normal population and the data of abdominal skin fold thickness are as follows: Sample No. of observation Sum of observation Sum of square observations 1 2 16 25 76 105 561 680 Find out whether there is any significant difference between skin fold thickness of two groups. 31. Fill in the blanks: (a) Absolute sum of deviation is minimum from ................. (b) The sum of squares of deviation is least when measured from ..................... (c) If 25% of the items are less than 10 and 25% are more than 40, the coefficient of quartile deviation is ................. Unsolved Questions 319 (d) In a symmetric, distribution the upper and lower quartile are equidistant from .................. (e) If mean and the mode of a given distribution are equal, then its coefficient of skewness is .................. (f) In any distribution, the standard deviation is always ..................... the mean deviation from mean. 32. A clinical researcher postulates that weight bearing exercise prevents the development of osteoporosis by increasing secretion of calcitonin a hormone that inhibits bone re-absorption. He wishes to test the hypothesis by comparing blood levels of calcitonin in subjects who exercise to those in subjects who do not. The mean calcitonin secretion (µg/dl) in study and control groups of women alongwith their respective standard deviation are given below: Study group No. of women (ages 25 to 45) Sample mean Sample SD Control group 100 100 0.60 0.20 0.54 0.15 Test the desired hypothesis based on the above observation. 33. A community health director observes that exposure of a particular pesticide results in a higher rate of miscarriage. 
To test the hypothesis regarding exposure and miscarriage, he selects 40 women experiencing a miscarriage and 160 women experiences a normal pregnancy from the records of the hospital. The 200 subjects were interviewed to determine their prior exposure to the pesticide. The results are summarized as: 320 Medical Statistics and Demography Made Easy Exposed Not Exposed Total 30 60 10 100 40 160 Miscarriage Normal preg. Explain the type of study design and finds odds in favour of exposure pesticide. 34. Test whether there is any association between marital status and breast cancer among females: Breast Cancer Married Unmarried Yes No 26 16 9 49 35. Compute crude death rates of population A, B and C from the table and also compare the death rate of population A and B taking population C as standard population. Age Group PA DA PB DB < 10 10 – 20 20 – 40 40 – 60 > 60 16,000 25,000 45,000 21,000 12,000 425 560 955 752 600 20,000 12,000 50,000 30,000 10,000 600 240 1250 1050 550 PC DC 12,000 372 30,000 660 62,000 1612 15,000 525 3,000 180 36. In Allahabad city, 20% of a random sample of 900 school children had defective eye sight, while in Kanpur city 15% of random sample of 1,600 children had the same defect. Is the difference between two proportions significant? 37. Draw two systemic samples of size 5 from the data given below: 3, 4, 7, 5, 1, 6, 8, 2, 7, 4, 7, 11, 9, 3, 4, 6, 13, 11, 11, 10 Unsolved Questions 321 38. A screening test is 90% sensitive and 60% specific. Calculate Positive and negative likelihood ratio of the test. 39. Two population of women using oral contraceptives and no contraceptive device were followed-up for occurrence of myocardial infarction and observation are given below: Myocardial infarction No Myocardial infarction 25 35 40 100 OC users Non-users Explain what type of study design has been adopted, also find the relative risk of myocardial infarction due to Oral contraceptive. 40. 
On the basis of two stage screening programme adopted blood sugar at first stage and glucose tolerance test (GTT) at second stage for detecting diabetes. Calculate net sensitivity and net specificity on the basis of following results. I stage Diabetes (+) Diabetes (–) Total Test (+) Test (–) 425 125 1575 7875 2000 8000 Total 550 9450 10,000 II stage Diabetes (+) Diabetes (–) Total Test (+) Test (–) 400 25 175 1400 575 1425 Total 425 1575 2000 322 Medical Statistics and Demography Made Easy 41. A random sample of 25 patients is taken from ICCU of a hospital and the outcome cured (C) or death (D) was recorded according to the date of admission of the patient, which are as follows: C C C D D D C C C C C D D C D D D D C C D C D D C Apply a run test to test that whether the sequence of cured and death is random. 42. Two samples are drawn from a two populations whose distribution is not known. In one group (Group A, n1 = 10) a high caloric diet was given and the second group (Group B, n2 = 10) was on normal diet. The weight gain in two groups were recorded after a month and the increase in weight was recorded in these group: Group A 12 10 12 15 9 6 10 5 15 9 7 16 18 12 9 8 6 9 10 5 Group B Apply suitable test to find out whether the weight gain in two groups are same. 43. A coefficient of correlation of 0.4 is derived from a random sample of size 102 pairs of observation. Is the value of ‘r’ is significant. 44. In four families each containing eight persons, the chest measurements (in cm) of these persons are given below. Calculate whether there is any significant difference between the chest measurement of these families. Unsolved Questions 323 Family 1 Family 2 Family 3 Family 4 35 53 47 60 85 66 49 55 67 39 33 65 69 66 58 42 56 47 33 79 90 49 57 62 56 78 44 42 39 67 68 86 45. The following table gives the frequency distribution of pulse rate of 60 normal persons: Pulse rate No. of persons Pulse rate 45 – 50 50 – 55 55 – 60 3 7 20 60 – 65 65 – 70 70 – 75 No. 
of persons 15 9 6 Calculate upper and lower quartile and the coefficient of dispersion. 46. The value of mean and median of 100 observations are 50 and 52 respectively. The value of the largest item is 100. It was found later that the correct value is actually 120. Find the correct value of mean and median and also calculate the mode and second quartile. 47. Two laboratories carry out independent estimates of content of progesterone in a particular brand of oral contraceptive. A sample is taken from each batch, halved and the separate halved sent to two laboratories. The following data are obtained: 324 Medical Statistics and Demography Made Easy No. of sample 9 Mean value of the difference of estimate Standard deviation of difference 0.8 16 Find out whether there is significant difference between the content of progesterone in oral contraceptive on the basis of report of two laboratories. 48. Calculate the correlation coefficient for the following height (in inches) of father (x) and their sons (y): x 65 66 67 67 68 69 70 72 y 67 68 65 68 72 72 69 71 49. In an investigation on neonatal blood pressure in relation to maturity following results were obtained: Babies 9 days old 1. Normal 2. Neonatal asphyxia Number 50 15 Mean systolic SD BP 75 69 8 6 Is the difference in mean systolic BP between the two groups statistically significant? 50. From a field area 40 females using oral contraceptive and 60 females using other contraceptive were randomly selected and the number of hypertensive cases from the groups were recorded as given below: Unsolved Questions 325 Type of Contraceptive Total No. of hypertensive Oral Others 40 60 12 18 Find whether there is any significant difference between Oral contraceptive users in Hypertensive and normotensive females. Answers of MCQs and Unsolved Questions 327 Answers of MCQs and Unsolved Questions 328 Medical Statistics and Demography Made Easy Answers of MCQs Chapter 1: Classification and Tabulation 1. d 2. a 3. c 4. b 5. a 6. b 7. 
d 8. b 9. c 10. d 11. b 12. d 13. d 14. d 15. d 16. a 17. d 18. b, d 19. c 20. c 21. c 22. a Chapter 2: Measure of Central Tendency 1. c 2. b 3. d 4. a 5. c 6. b 7. c 13. b 8. c 14. a 9. b 15. c 10. b 16. b 11. b 17. b 12. b 18. c 19. b 25. c 20. d 26. c 21. a 27. a 22. c 28. a 23. c 29. b, c 24. a 30. a Chapter 3: Measure of Dispersion 1. c 7. a 2. b 8. a 3. c 9. b 4. d 10. c 5. d 11. b 6. a 12. b 13. b* 19. b 14. c 20. b 15. b 21. a 16. c 22. c 17. c 23. a 18. d 24. a 25. c 26. a * because variance is the square of standard deviation Chapter 4: Theoretical Discrete and Continuous Distribution 1. a 7. b 2. d 8. a 3. b 9. b 4. d 10. a 5. a 11. a 6. c 12. b Answers of MCQs and Unsolved Questions 329 13. d 14. b 15. c 16. d 17. d 18. d 19. c 25. a 20. a 26. a 21. b 27. b 22. b 28. d 23. a 29. b 24. b 30. d 31. c 32. b 33. b 34. d Chapter 5: Correlation and Regression 1. b 2. d 3. b 4. a 5. 7. a 8. b 9. b 10. a 11. 13. a 14. b 15. b 16. b 17. 19. b 20. d 21. c 22. d 23. 25. c Chapter 6: Probability 1. d 2. b 3. a 7. c 8. c 9. a 4. c 10. a d c a d 5. b 6. 12. 18. 24. c a a d 6. d Chapter 7: Sampling and Design of Experiments 1. a 2. b 3. b 4. b 5. d 6. b 7. b 8. d 9. b 10. b 11. a 12. c, d 13. a 14. a 15. b 16. a 17. b 18. d Chapter 8: Testing of Hypothesis 1. a 2. c 3. c 4. a 7. d 8. b 9. a 10. a 13. a 14. a 15. c 16. a 19. b 20. d 21. b 22. b 25. a 26. d 5. 11. 17. 23. a a b a 6. 12. 18. 24. a c b a Chapter 9: Non-parametric Tests 1. e 2. d 3. b Chapter 10: Statistical Methods in Epidemiology 1. c 2. a 3. c 4. a 5. a 6. a 330 Medical Statistics and Demography Made Easy 7. 13. 19. 25. 31. 37. b b a b a a 8. 14. 20. 26. 32. 38. b b b c b b 9. 15. 21. 27. 33. 39. a b c a c d 10. 16. 22. 28. 34. b b c a c 11. 17. 23. 29. 35. b b a d c Chapter 11: Vital Statistics (Demography) 1. c 2. d 3. c 4. b 5. c 7. d 8. c 9. b 10. d 11. a 12. 18. 24. 30. 36. b c c c b 6. a 12. d 13. d 14. b 15. c 16. d 17. b 18. b 19. a 20. a 21. c 22. b 23. a 24. a 25. b 26. a 27. c 28. d 29. a 30. 
b 31. a 32. a 33. d 34. a 35. a 36. d 37. d 38. b 39. c 40. a

Chapter 12: Health Information
1. a 2. c 3. b 4. d 5. a 6. d 7. a 8. d 9. b

Chapter 13: A Report on Census 2001
1. b 2. d 3. b 4. c 5. c 6. b 7. d 8. c 9. b 10. b 11. b 12. c 13. b 14. b

Chapter 14: National Population Policy
1. c 2. b 3. b 4. a 5. a 6. d 7. b 8. a

Answers of Unsolved Questions
1. Null hypothesis H0: µA = µB; alternative hypothesis H1: µA ≠ µB; Mean (A) = 51.28, SD (A) = 2.28; Mean (B) = 53.14, SD (B) = 1.67; “t” = 2.95, d.f. = 12, p < 0.05.
2. H0: µA = µB; H1: µA ≠ µB; Mean (difference) = 2; SD (d) = 2.64; “t” = 2.27, d.f. = 8, p > 0.05.
3. H0: No association between coronary artery disease and smoking; χ² (calculated) = 4, d.f. = 1; p < 0.05.
4. Hint: Go through Chapter 2.
5. Mean = 132.4; Median = 131.22; Mode = 132.5; approximately symmetrical.
6. Correlation coefficient “r” = +0.82.
7. Regression line x on y: x = 57.4 + 0.58y; regression line y on x: y = 26 + 0.96x; estimate of cholesterol for blood pressure ‘x = 160’ is 179.6.
8. Crude death rate (A) = 10.26; CDR (B) = 7.93; standardized death rate (A) = 9.7; SDR (B) = 10.6.
9. GFR = 77.4; TFR = 2.56; GRR = 1.23.
10. Prevalence = 300/1000; rate of smoking for hypertensives = 83.33%; rate of smoking for normotensives = 35.71%; χ² = 190.46; risk ratio = 5.
11. Sensitivity = 50%, Specificity = 100%, PPV = 100%, NPV = 42.85%, Diagnostic Accuracy = 63.36%.
12. Q1 = 37.78, Q3 = 135.75; Coeff. of dispersion = 0.24.
13. Median = 43.33; Mode = 43.33.
14. Median = 11; Mean = 10.52.
15. Coeff. of dispersion (based on SD) = 0.09; Coeff. of dispersion (based on quartiles) = 0.07.
16. Correlation coefficient “r” = 0.79; regression line between height (Ht) and weight (Wt) is Ht = 111.32 + 0.88 Wt.
17. H0: P1 = P2; H1: P1 ≠ P2; Z = 9.16; p < 0.001.
18. H0: µ = 13.6; H1: µ ≠ 13.6; Z = 10, p < 0.001.
19. H0: µA = µB, H1: µA < µB; Z = 15.94; p < 0.001.
20.
H0: µA = µB, H1: µA ≠ µB; Mean (A) = 128.14, SD (A) = 18.33; Mean (B) = 104.63, SD (B) = 24.60; ‘t’ = 2.27, d.f. = 23; p < 0.05.
21. H0: µx = µy, H1: µx > µy; Mean (difference) = 2.9; SD (d) = 3.17; ‘t’ = 2.89, d.f. = 9; p < 0.05.
22. Correlation coefficient ‘r’ = -0.58; inversely proportional.
23. H0: No association between disease groups and cigarette smoking; χ² = 27.18, d.f. = 16; p < 0.05.
24. H0: No relation between age and presence of Schistosoma mansoni eggs; χ² = 10.35, d.f. = 4; p < 0.05.
25. H0: Nasal carrier status is not associated with size of tonsils; χ² = 7.85, d.f. = 2; p < 0.05.
26. H0: µ1 = µ2; H1: µ1 ≠ µ2; ‘t’ = 1.84, d.f. = 18, p > 0.05.
27. H0: µA = µB; H1: µA < µB; Z = 2.77, p < 0.01.
28. (a) Null hypothesis; (b) α; (c) Level of significance; (d) Power; (e) Type II; (f) Type I, Type II; (g) Type I, α; (h) Rejecting; (i) Alternative hypothesis.
29. SD (A) = 30.64; SD (B) = 122.56.
30. H0: µ1 = µ2; H1: µ1 ≠ µ2; Mean (1) = 4.75, SD (1) = 3.65; Mean (2) = 4.20, SD (2) = 3.15; ‘t’ = 0.51; d.f. = 39; p > 0.05.
31. (a) Median; (b) Mean; (c) 15; (d) Mean; (e) zero; (f) less.
32. H0: µ1 = µ2; H1: µ1 ≠ µ2; ‘Z’ = 2.5, p < 0.05.
33. Retrospective study; Odds ratio = 5.
34. H0: No association between marital status and breast cancer; χ² = 20.02, d.f. = 1; p < 0.001.
35. CDR (A) = 27.66; CDR (B) = 30.24; CDR (C) = 27.45; standardized death rate (A) = 24.53; SDR (B) = 26.26.
36. H0: P1 = P2; H1: P1 ≠ P2; ‘Z’ = 3.21; p < 0.001.
37. Hint: Systematic sampling; 20 = 5 × k; k = 20/5 = 4.
38. Positive likelihood ratio = 2.25; Negative likelihood ratio = 0.16.
39. Prospective study; Risk ratio = 1.48.
40. Sensitivity = 72.2%, Specificity = 98.14%.
41. H0: sequence of crude and death in this series is random; No. of runs = 11, ‘Z’ = 1.02; p > 0.05 (i.e. accept H0).
42. H0: µ1 = µ2; H1: µ1 ≠ µ2; Mann-Whitney U test, ‘Z’ = 0.01; p > 0.05.
43. ‘t’ = 4.39; d.f. = 100, p < 0.001.
44.
H0: µ1 = µ2 = µ3 = µ4; H1: at least one mean differs; Analysis of variance, ‘F’ = 0.14; d.f. (3, 28); p > 0.05.
45. Q1 = 56.25; Q3 = 65.00; Coeff. of dispersion = 0.07.
46. Mean = 50.20; Median = 52; Mode = 55.6.
47. H0: d = 0; H1: d ≠ 0; ‘t’ = 0.15, d.f. = 8, p > 0.05.
48. Correlation coefficient ‘r’ = 0.60.
49. H0: µ1 = µ2; H1: µ1 ≠ µ2; ‘t’ = 2.65, d.f. = 63, p < 0.05.
50. H0: P1 = P2; H1: P1 ≠ P2; ‘Z’ = 0, p > 0.05.

Appendix
Statistical Tables

Table 1: Areas under normal curve

Normal probability curve is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right\}, \quad -\infty < x < \infty$$

and standard normal probability curve is given by

$$\varphi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z^{2}\right), \quad -\infty < z < \infty$$

Figure A-1

The following table gives the shaded area in the diagram, viz. P(0 < Z < z) for different values of z.

Tables of Areas

Z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0  .0000  .0040  .0080  .0120  .0160  .0199  .0239  .0279  .0319  .0359
0.1  .0398  .0438  .0478  .0517  .0557  .0596  .0636  .0675  .0714  .0753
0.2  .0793  .0832  .0871  .0910  .0948  .0987  .1026  .1064  .1103  .1141
0.3  .1179  .1217  .1255  .1293  .1331  .1368  .1406  .1443  .1480  .1517
0.4  .1554  .1591  .1628  .1664  .1700  .1736  .1772  .1808  .1844  .1879
0.5  .1915  .1950  .1985  .2019  .2054  .2088  .2123  .2157  .2190  .2224
0.6  .2257  .2291  .2324  .2357  .2389  .2422  .2454  .2486  .2517  .2549
0.7  .2580  .2611  .2642  .2673  .2703  .2734  .2764  .2794  .2823  .2852
0.8  .2881  .2910  .2939  .2967  .2995  .3023  .3051  .3078  .3106  .3133
0.9  .3159  .3186  .3212  .3238  .3264  .3289  .3315  .3340  .3365  .3389
1.0  .3413  .3438  .3461  .3485  .3508  .3531  .3554  .3577  .3599  .3621
1.1  .3643  .3665  .3686  .3708  .3729  .3749  .3770  .3790  .3810  .3830
1.2  .3849  .3869  .3888  .3907  .3925  .3944  .3962  .3980  .3997  .4015
1.3  .4032  .4049  .4066  .4082  .4099  .4115  .4131  .4147  .4162  .4177
1.4  .4192  .4207  .4222  .4236  .4251  .4265  .4279  .4292  .4306  .4319
1.5  .4332  .4345  .4357  .4370  .4382  .4394  .4406  .4418  .4429  .4441
1.6  .4452  .4463  .4474  .4484  .4495  .4505  .4515  .4525  .4535  .4545
1.7  .4554  .4564  .4573  .4582  .4591  .4599  .4608  .4616  .4625  .4633
1.8  .4641  .4649  .4656  .4664  .4671  .4678  .4686  .4693  .4699  .4706
1.9  .4713  .4719  .4726  .4732  .4738  .4744  .4750  .4756  .4761  .4767
2.0  .4772  .4778  .4783  .4788  .4793  .4798  .4803  .4808  .4812  .4817
2.1  .4821  .4826  .4830  .4834  .4838  .4842  .4846  .4850  .4854  .4857
2.2  .4861  .4864  .4868  .4871  .4875  .4878  .4881  .4884  .4887  .4890
2.3  .4893  .4896  .4898  .4901  .4904  .4906  .4909  .4911  .4913  .4916
2.4  .4918  .4920  .4922  .4925  .4927  .4929  .4931  .4932  .4934  .4936
2.5  .4938  .4940  .4941  .4943  .4945  .4946  .4948  .4949  .4951  .4952
2.6  .4953  .4955  .4956  .4957  .4959  .4960  .4961  .4962  .4963  .4964
2.7  .4965  .4966  .4967  .4968  .4969  .4970  .4971  .4972  .4973  .4974
2.8  .4974  .4975  .4976  .4977  .4977  .4978  .4979  .4979  .4980  .4981
2.9  .4981  .4982  .4982  .4983  .4984  .4984  .4985  .4985  .4986  .4986
3.0  .4987  .4987  .4987  .4988  .4988  .4989  .4989  .4989  .4990  .4990
3.1  .4990  .4991  .4991  .4991  .4992  .4992  .4992  .4992  .4993  .4993
3.2  .4993  .4993  .4994  .4994  .4994  .4994  .4994  .4995  .4995  .4995
3.3  .4995  .4995  .4995  .4996  .4996  .4996  .4996  .4996  .4996  .4997
3.4  .4997  .4997  .4997  .4997  .4997  .4997  .4997  .4997  .4997  .4998
3.5  .4998  .4998  .4998  .4998  .4998  .4998  .4998  .4998  .4998  .4998
3.6  .4998  .4998  .4999  .4999  .4999  .4999  .4999  .4999  .4999  .4999
3.7  .4999  .4999  .4999  .4999  .4999  .4999  .4999  .4999  .4999  .4999
3.9  .5000  .5000  .5000  .5000  .5000  .5000  .5000  .5000  .5000  .5000

Table 2: Ordinates of the normal probability curve

The following table gives the ordinates of the standard normal probability curve, i.e., it gives the value of

$$\varphi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z^{2}\right), \quad -\infty < z < \infty$$

for different values of z, where

$$Z = \frac{X - E(X)}{\sigma_X} = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

Obviously $\varphi(-z) = \varphi(z)$.

Z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0  .3989  .3989  .3989  .3988  .3986  .3984  .3982  .3980  .3977  .3973
0.1  .3970  .3965  .3961  .3956  .3951  .3945  .3939  .3932  .3925  .3918
0.2  .3910  .3902  .3894  .3885  .3876  .3867  .3857  .3847  .3836  .3825
0.3  .3814  .3802  .3790  .3778  .3765  .3752  .3739  .3725  .3712  .3697
0.4  .3683  .3668  .3653  .3637  .3621  .3605  .3589  .3572  .3555  .3538
0.5  .3521  .3503  .3485  .3467  .3448  .3429  .3410  .3391  .3372  .3352
0.6  .3332  .3312  .3292  .3271  .3251  .3230  .3209  .3187  .3166  .3144
0.7  .3123  .3101  .3079  .3056  .3034  .3011  .2989  .2966  .2943  .2920
0.8  .2897  .2874  .2850  .2827  .2803  .2780  .2756  .2732  .2709  .2685
0.9  .2661  .2637  .2613  .2589  .2565  .2541  .2516  .2492  .2468  .2444
1.0  .2420  .2396  .2371  .2347  .2323  .2299  .2275  .2251  .2227  .2203
1.1  .2179  .2155  .2131  .2107  .2083  .2059  .2036  .2012  .1989  .1965
1.2  .1942  .1919  .1895  .1872  .1849  .1826  .1804  .1781  .1758  .1736
1.3  .1714  .1691  .1669  .1647  .1626  .1604  .1582  .1561  .1539  .1518
1.4  .1497  .1476  .1456  .1435  .1415  .1394  .1374  .1354  .1334  .1315
1.5  .1295  .1276  .1257  .1238  .1219  .1200  .1182  .1163  .1145  .1127
1.6  .1109  .1092  .1074  .1057  .1040  .1023  .1006  .0989  .0973  .0957
1.7  .0940  .0925  .0909  .0893  .0878  .0863  .0848  .0833  .0818  .0804
1.8  .0790  .0775  .0761  .0748  .0734  .0721  .0707  .0694  .0681  .0669
1.9  .0656  .0644  .0632  .0620  .0608  .0596  .0584  .0573  .0562  .0551
2.0  .0540  .0529  .0519  .0508  .0498  .0488  .0478  .0468  .0459  .0449
2.1  .0440  .0431  .0422  .0413  .0404  .0396  .0387  .0379  .0371  .0363
2.2  .0355  .0347  .0339  .0332  .0325  .0317  .0310  .0303  .0297  .0290
2.3  .0283  .0277  .0270  .0264  .0258  .0252  .0246  .0241  .0235  .0229
2.4  .0224  .0219  .0213  .0208  .0203  .0198  .0194  .0189  .0184  .0180
2.5  .0175  .0171  .0167  .0163  .0158  .0154  .0151  .0147  .0143  .0139
2.6  .0136  .0132  .0129  .0126  .0122  .0119  .0116  .0113  .0110  .0107
2.7  .0104  .0101  .0099  .0096  .0093  .0091  .0088  .0086  .0084  .0081
2.8  .0079  .0077  .0075  .0073  .0071  .0069  .0067  .0065  .0063  .0061
2.9  .0060  .0058  .0056  .0055  .0053  .0051  .0050  .0048  .0047  .0046
3.0  .0044  .0043  .0042  .0040  .0039  .0038  .0037  .0036  .0035  .0034
3.1  .0033  .0032  .0031  .0030  .0029  .0028  .0027  .0026  .0025  .0025
3.2  .0024  .0023  .0022  .0022  .0021  .0020  .0020  .0019  .0018  .0018
3.3  .0017  .0017  .0016  .0016  .0015  .0015  .0014  .0014  .0013  .0013
3.4  .0012  .0012  .0012  .0011  .0011  .0010  .0010  .0010  .0009  .0009
3.5  .0009  .0008  .0008  .0008  .0008  .0007  .0007  .0007  .0007  .0006
3.6  .0006  .0006  .0006  .0005  .0005  .0005  .0005  .0005  .0005  .0004
3.7  .0004  .0004  .0004  .0004  .0004  .0004  .0003  .0003  .0003  .0003
3.8  .0003  .0003  .0003  .0003  .0003  .0002  .0002  .0002  .0002  .0002
3.9  .0002  .0002  .0002  .0002  .0002  .0002  .0002  .0002  .0001  .0001

Table 3: Significant values of t-distribution (Two tail areas)
Probability (Level of Significance) d.f.
(v) 0.50 0.10 0.005 0.02 0.01 0.001 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 1.00 0.82 0.77 0.74 0.73 0.72 0.71 0.71 0.70 0.70 0.70 0.70 0.69 0.69 0.69 0.69 0.69 0.69 0.69 0.36 0.69 0.39 0.69 0.69 0.68 2.68 0.68 0.68 0.68 0.68 0.67 6.31 0.92 0.35 2.13 2.02 1094 1.90 1080 1.83 1.81 1.80 1.78 1.77 1.76 1.75 1.75 1.74 1.73 1.73 1.73 1.72 1.72 1.71 1.71 1.71 1.71 1.70 1.70 1.70 1.70 1.65 12.71 4.30 3.18 2.78 2.57 2.45 2.37 2.31 2.26 2.23 2.20 2.18 2.16 2.15 2.13 2.12 2.11 2.10 2.09 2.09 2.08 2.07 2.07 2.06 2.06 2.06 2.05 2.05 2.05 2.04 1.96 31.82 .6397 4.54 3.75 3.37 3.14 3.00 2.92 2.82 2.76 2.72 2.68 2.05 2.62 2.60 2.58 2.57 2.55 2.54 2.53 2.52 2.51 2.50 2.49 2.49 2.48 2.47 2.47 2.46 2.46 2.33 63.66 6.93 5.84 4.60 4.03 3.71 3.50 3.36 3.25 3.17 3.11 3.06 3.01 2.98 2.95 2.92 2.90 2.88 2.86 2.85 2.83 2.82 2.81 2.80 2.79 2.78 2.77 2.76 2.76 2.75 2.58 636.62 31.60 12.94 8.61 6.86 5.96 5.41 5.04 4.48 4.59 4.44 4.32 4.22 4.14 40.7 4.02 3.97 3.92 3.88 3.85 3.83 3.79 3.77 3.75 3.73 3.71 3.69 3.67 3.66 3.65 3.29 Appendix 341 Table 4: Significant values χ  α  of chi-square distribution (Right tail areas for given probability 2 Where Degree of freedom 0 = .99 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 .000157 .0201 .115 .297 .554 .872 1.239 1.646 2.088 2.558 3.053 3.571 4.107 4.660 4.229 5.812 6.408 7.015 7.633 8.260 8.897 9.542 10.196 10.856 ) and is degrees of freedom (d f) 0.95 0.50 0.10 0.05 0.02 0.01 .00393 .103 .352 .711 1.145 1.635 2.167 2.733 3.325 3.940 4.575 5.226 5.892 6.571 7.261 7.962 8.682 9.390 10.117 10.851 11.591 11.338 13.091 13.848 .455 1.386 2.366 3.357 4.351 5.348 6.346 7.344 8.343 9.340 10.341 11.340 12.640 13.339 14.339 15.338 16.338 17.338 18.338 19.337 20.337 21.337 22.337 23.337 2.06 4.605 6.251 7.779 9.236 10.645 12.017 13.362 14.684 15.987 17.275 18.549 19.812 21.064 22.307 23.542 24.769 25.989 27.204 28.412 29.615 30.813 32.007 32.196 3.840 5.991 7.815 9.488 11.070 12.592 14.067 15.507 
16.919 18.307 19.675 21.026 22.362 23.685 24.996 26.296 27.587 28.869 30.144 31.410 32.671 33.924 35.172 36.415 5.214 7.824 9.837 11.668 13.388 15.033 16.622 18.168 19.679 21.161 22.618 24.054 25.472 26.873 28.259 29.633 30.995 32.346 33.687 35.020 36.343 37.659 38.968 40.270 6.635 9.210 11.341 13.277 15.086 16.812 18.475 20.090 21.666 23.209 24.725 26.217 27.688 29.141 30.578 32.000 33.409 34.805 36.191 37.566 38.932 40.289 41.638 42.980 Contd... 342 Medical Statistics and Demography Made Easy Contd... Degree of freedom 0 = .99 25 26 27 28 29 30 11.524 12.198 12.879 13.565 14.256 14.953 0.95 0.50 0.10 0.05 0.02 0.01 14.611 15.379 16.151 16.928 17.705 18.493 24.337 25.336 26.336 27.336 28.336 29.336 34.382 35.363 36.741 37.916 39.087 40.256 37.652 38.885 40.113 41.337 42.557 43.773 41.566 41.856 44.140 45.419 46.693 47.962 44.314 45.642 46.963 48.278 49.588 50.892 Note: For degrees of freedom quantity greater than 30, the may be used as a normal variate with unit variance. Appendix 343 Table 5: Significant values of the variance ratio F-distribution (Right tail areas 5 percent points) 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 40 60 120 240 2 3 4 5 6 8 12 24  161.4 199.5 215.7 224.6 230.2 234.0 238.9 243.9 249.0 254.3 18.51 19.00 19.16 19.25 19.30 19.35 19.37 19.41 19.45 19.50 10.13 9.55 9.28 9.12 9.01 8.94 8.84 8.74 9.64 9.55 7.71 6.94 6.59 6.39 6.26 6.16 6.04 5.91 5.77 5.65 6.61 5.79 5.41 5.19 5.05 4.95 4.82 4.68 4.53 4.96 5.99 5.14 4.76 4.53 4.39 4.28 4.15 4.00 3.84 3.67 5.59 4.74 4.35 4.12 3.97 3.87 3.78 3.57 3.41 3.23 5.32 4.46 4.07 3.84 3.69 3.58 3.44 3.28 3.12 2.93 5.12 4.26 3.865 3.63 3.48 3.37 3.23 3.07 2.90 2.71 4.96 4.10 3.71 3.48 3.33 3.22 3.07 2.91 2.74 2.54 4.84 3.98 3.59 3.365 3.20 3.09 2.95 2.79 2.61 2.40 4.75 3.88 4.49 3.26 3.11 3.00 2.85 2.69 2.50 2.30 4.67 3.80 5.51 3.18 3.02 2.92 2.7 2.60 2.42 2.21 4.60 3.74 3.51 3.11 2.96 2.85 2.70 2.53 2.35 2.13 4.54 3.68 3.29 3.06 2.90 2.79 2.64 2.48 2.29 2.07 4.49 
3.63 3.4 3.01 2.85 2.74 2.59 2.42 2.24 2.01 4.45 3.59 3.20 2.96 2.81 2.70 2.55 2.38 2.19 1.96 4.41 3.55 3.96 2.93 2.77 2.66 2.51 2.34 2.15 1.92 4.38 3.52 3.13 2.90 2.74 2.63 2.48 2.31 2.11 1.88 4.35 3.49 3.10 2.87 2.71 2.60 2.45 2.28 2.08 1.84 4.32 3.47 3.07 2.84 2.68 2.57 2.42 2.25 2.05 1.81 4.30 3.44 3.05 2.82 2.66 2.55 2.40 2.23 2.03 1.76 .28 3.42 3.03 2.80 2.64 2.53 2.38 2.20 2.00 1.76 4.26 4.40 3.01 2.78 2.62 2.51 2.36 2.18 1.98 1.73 4.24 3.38 2.99 2.76 2.60 2.49 2.34 2.16 1.96 1.71 4.22 3.37 2.98 2.74 2.59 2.47 2.32 2.15 1.95 1.60 4.21 3.35 2.96 2.73 2.57 2.46 2.30 2.13 1.93 1.67 4.20 3.34 2.95 2.71 2.56 2.44 2.29 2.12 1.91 1.65 4.18 3.33 2.93 2.70 2.54 2.43 2.28 2.10 1.90 1.64 4.17 3.32 2.92 2.69 2.53 2.42 2.27 2.09 1.89 1.62 4.08 3.23 2.84 2.61 2.45 2.34 2.18 2.00 1.79 1.51 4.00 3.15 2.76 2.52 2.37 2.25 2.10 1.92 1.70 1.30 3.92 3.87 2.68 2.45 2.29 2.17 2.02 1.83 1.62 1.25 3.84 2.99 2.60 2.37 2.21 2.09 1.94 1.75 1.52 1.00 47 74 76 35 59 22 42 01 21 60 18 62 36 85 29 62 49 08 16 03 97 16 12 55 16 84 63 33 57 18 26 52 37 70 56 99 16 31 17 18 37 15 93 07 38 28 94 77 17 63 12 86 43 24 62 85 56 12 37 22 04 32 92 97 19 35 94 53 78 34 32 73 67 27 99 35 13 35 77 72 43 46 75 95 12 39 31 59 29 44 86 62 66 26 64 40 96 88 33 50 44 84 50 83 49 57 16 78 09 36 42 56 96 38 33 83 42 27 27 17 16 92 39 54 24 95 64 47 96 81 50 96 34 20 50 95 14 89 16 07 26 50 43 55 55 56 27 47 14 26 68 82 38 87 45 34 87 58 44 11 08 54 06 67 07 96 36 57 71 27 46 26 755 72 09 19 09 99 37 30 82 88 19 82 54 61 20 07 31 22 13 97 16 45 20 79 83 00 2 17 77 98 52 49 46 42 32 05 31 89 12 64 59 15 83 11 53 34 37 04 10 42 17 98 53 90 03 62 51 25 36 34 37 86 46 76 07 93 74 50 07 46 63 32 79 72 43 06 93 16 09 00 19 32 31 96 23 47 71 44 09 71 37 78 93 09 74 47 00 45 49 62 24 38 88 78 67 75 38 62 62 32 53 15 90 17 70 04 59 52 06 20 80 54 87 21 12 15 90 33 27 13 57 06 76 33 43 34 85 76 14 22 42 35 76 86 51 52 26 07 55 12 18 37 24 18 68 66 50 85 02 06 20 33 73 00 84 16 36 38 10 44 Table 6: Random sampling 
numbers 13 03 66 49 60 06 88 53 87 96 50 58 13 77 80 07 58 14 32 04 54 79 12 44 10 45 53 98 43 25 07 42 27 45 51 59 21 53 07 97 94 72 38 55 10 86 35 84 83 44 99 08 60 24 88 88 23 74 77 77 07 68 23 93 60 85 26 92 39 66 02 11 51 97 26 83 21 46 24 34 88 64 85 42 29 34 12 52 02 73 14 79 54 49 01 30 80 90 99 80 05 10 53 39 64 76 79 54 28 95 73 10 76 30 Contd... 19 44 21 45 11 05 79 04 48 91 06 38 79 43 10 89 14 81 30 344 Medical Statistics and Demography Made Easy 34 57 42 39 94 90 27 24 23 96 67 90 05 46 19 26 97 71 99 95 68 74 27 00 29 16 11 35 38 31 66 14 68 20 67 05 07 68 26 14 Contd... 93 10 56 61 52 40 84 51 78 58 82 94 10 16 25 30 25 37 68 98 70 88 85 65 98 47 45 18 73 97 66 75 16 86 91 13 65 86 29 94 60 23 85 53 75 14 11 00 90 79 59 06 20 38 47 70 76 53 61 24 22 09 54 58 87 64 75 33 97 15 83 06 33 42 96 55 59 48 66 68 35 98 87 37 59 05 73 96 51 06 62 09 32 38 44 74 29 55 37 49 85 42 66 78 36 71 88 02 40 15 64 19 51 97 33 30 97 90 32 69 15 99 47 80 22 95 05 75 14 93 11 74 26 01 49 77 68 65 20 10 13 64 54 70 41 86 90 19 02 20 12 66 38 50 13 40 60 72 30 82 92 61 73 42 26 11 52 07 04 01 67 02 49 87 34 44 71 96 77 53 03 71 32 10 78 05 27 60 02 90 19 94 78 75 86 22 91 57 84 75 51 62 08 50 63 65 40 62 33 10 00 37 45 66 82 78 38 69 57 91 59 99 11 67 06 09 14 93 31 75 71 34 04 81 53 84 67 36 03 93 77 15 12 42 55 38 86 55 08 06 74 02 91 41 91 26 54 10 29 30 59 06 44 32 13 76 22 59 39 40 60 76 16 40 00 04 13 96 10 34 56 51 95 17 08 83 98 33 51 78 47 70 92 01 52 33 58 46 45 25 78 29 92 55 27 20 12 82 16 78 21 90 53 74 43 46 18 92 65 20 06 16 63 85 01 37 22 43 49 89 29 30 56 91 48 09 24 42 04 57 83 93 16 47 50 90 08 90 36 62 68 86 16 62 85 52 76 45 23 27 52 58 29 94 15 57 07 49 47 02 02 38 02 48 27 68 15 97 11 40 91 05 56 44 29 16 52 37 95 67 02 45 75 51 55 07 54 60 04 48 05 77 24 67 39 00 74 38 93 74 37 94 50 84 26 97 55 49 96 73 74 51 48 94 43 66 80 59 30 33 31 38 98 32 62 57 52 91 24 92 Contd... 
70 09 29 16 39 11 95 44 3 17 03 30 95 08 89 06 95 04 67 51 Appendix 345 53 26 23 20 25 50 22 79 75 96 74 38 30 43 25 63 55 07 54 85 17 90 41 60 91 34 85 09 88 90 55 63 35 63 98 02 64 85 58 34 Contd... 21 22 26 16 27 23 06 58 36 37 57 04 13 82 23 77 59 582 50 38 17 21 13 24 84 99 86 21 82 55 74 39 77 18 70 58 21 55 81 05 69 82 89 15 87 67 51 46 69 26 37 43 48 14 00 71 19 99 69 90 71 48 01 51 61 61 99 06 65 01 98 73 73 22 39 71 23 31 31 94 50 22 10 54 48 32 00 72 51 91 80 81 82 95 00 41 52 04 99 58 80 28 07 44 64 28 65 17 18 82 33 53 97 75 03 61 23 49 73 28 89 06 82 82 56 69 26 10 37 81 00 94 22 42 06 50 33 69 68 41 36 00 04 00 26 84 94 94 88 46 91 79 21 49 90 72 12 96 68 36 38 61 59 62 90 94 02 25 61 74 09 33 05 39 55 12 96 10 35 45 15 54 63 61 18 62 82 21 38 71 77 62 03 32 85 41 93 47 81 37 70 13 69 65 48 67 90 61 44 12 93 46 27 82 78 94 02 48 33 59 11 43 36 04 13 86 23 75 12 94 19 86 24 22 38 96 18 45 03 03 48 91 03 69 26 24 07 96 42 97 82 28 83 49 36 26 39 88 76 09 43 82 69 38 37 98 79 49 32 24 47 08 72 02 94 44 07 13 24 90 40 78 11 18 70 33 62 28 92 02 94 31 89 48 37 95 02 41 30 35 45 12 15 65 15 41 67 24 85 71 80 54 44 07 30 27 18 43 12 57 86 23 83 18 42 19 80 00 88 37 04 46 05 70 69 36 39 89 48 29 98 29 80 97 57 95 60 49 65 07 04 31 60 37 32 99 07 20 60 12 00 06 13 85 65 47 75 55 54 03 45 53 35 16 90 02 25 97 18 82 83 66 29 72 65 53 91 65 34 92 07 94 80 04 89 96 99 17 99 62 26 24 54 13 80 53 12 79 81 18 31 13 39 61 00 74 32 14 10 54 03 27 28 21 07 09 19 07 35 75 49 47 88 87 33 83 23 17 34 60 Contd... 91 12 19 49 39 38 81 78 85 66 66 38 94 67 76 30 70 49 72 65 346 Medical Statistics and Demography Made Easy 92 95 45 08 85 84 78 17 76 31 44 66 24 73 60 37 67 28 15 19 03 62 08 07 01 72 88 45 96 43 50 22 96 31 78 84 36 07 10 55 Contd... 
90 10 59 83 68 66 22 40 91 73 71 28 75 28 67 18 30 93 55 89 61 08 07 87 97 44 15 14 61 99 14 16 65 12 72 27 27 15 18 95 56 23 48 60 65 21 86 51 19 84 35 84 57 54 30 46 59 22 40 66 70 98 89 79 03 66 26 23 60 43 19 13 28 22 24 57 37 60 45 51 10 93 64 24 73 06 63 22 20 89 11 52 40 01 02 99 75 21 44 10 23 35 58 31 52 38 75 30 72 94 58 53 19 11 94 16 41 75 75 19 98 08 89 66 16 05 41 88 93 36 49 94 72 94 08 96 66 46 13 34 05 86 75 56 56 92 99 57 48 475 26 53 12 25 63 56 48 91 90 88 85 99 83 21 00 68 58 95 98 56 50 75 25 71 38 30 86 98 24 15 11 29 85 48 53 156 42 67 57 69 11 45 12 96 32 33 97 77 94 84 34 76 62 24 55 54 36 47 07 47 17 96 74 16 36 72 80 27 96 97 76 29 27 06 90 35 72 29 23 07 17 30 75 16 66 85 61 85 61 19 60 81 89 93 27 02 24 83 69 40 76 96 67 88 02 22 45 42 02 75 76 33 30 91 33 42 58 94 65 90 86 73 60 68 69 84 23 28 57 12 48 34 14 98 42 35 37 69 95 22 31 89 40 64 36 64 53 88 55 76 45 91 78 94 29 48 52 40 39 91 57 62 60 36 38 38 04 61 66 39 34 58 56 05 38 96 18 06 69 07 20 70 81 74 25 56 01 08 83 43 60 93 27 49 87 32 51 07 58 12 18 31 19 45 39 98 63 84 15 78 01 63 86 01 22 14 03 14 56 78 95 99 24 19 48 99 45 69 73 64 64 14 63 47 13 50 37 16 80 35 60 17 62 59 03 01 76 62 42 63 18 52 59 59 88 41 18 36 30 34 78 43 01 50 45 30 08 03 37 91 96 52 02 00 34 48 11 86 44 72 75 76 16 92 22 64 27 73 61 25 Contd... 39 32 80 38 83 52 39 78 19 08 46 48 61 88 15 98 64 42 11 08 Appendix 347 81 86 91 71 66 96 83 60 17 69 93 30 29 31 01 33 84 40 31 59 53 51 35 37 93 02 49 84 18 79 75 38 51 21 29 95 90 46 20 71 Contd... 
95 60 62 89 73 36 92 50 38 23 08 43 71 30 10 29 32 70 67 13 22 79 98 03 05 87 29 10 86 84 45 478 62 88 61 13 68 29 95 83 00 80 82 43 50 83 03 34 24 88 65 35 46 71 78 39 92 13 13 27 18 24 54 38 08 56 06 31 37 58 13 82 40 44 71 35 33 80 20 92 47 36 97 46 22 20 28 57 79 02 025 88 80 91 32 01 98 03 02 79 72 59 20 82 23 14 81 75 81 39 00 33 81 14 76 20 75 54 44 64 00 87 56 68 71 82 39 95 53 37 41 69 30 88 95 71 66 07 95 64 18 38 95 72 77 11 38 820 74 67 84 96 37 47 62 34 99 27 94 72 38 82 15 32 91 74 62 51 73 42 93 72 34 89 87 62 40 96 64 28 79 07 74 14 01 21 25 94 24 10 07 36 39 23 00 33 14 94 85 54 58 53 80 82 93 97 06 02 16 14 51 04 23 30 22 74 71 78 04 96 69 89 08 99 20 90 84 74 10 20 72 19 05 63 58 82 94 32 05 53 32 35 32 70 49 65 63 77 33 92 59 76 38 15 40 14 58 66 72 84 81 96 16 80 82 96 61 76 52 16 21 47 25 56 92 53 45 50 01 48 76 35 46 60 96 42 29 15 83 55 45 45 15 34 54 73 94 95 32 14 80 23 70 47 59 68 08 48 90 23 57 15 35 20 01 19 19 52 90 52 26 79 50 18 26 63 93 49 94 42 09 18 71 47 75 09 38 74 76 98 92 18 80 97 94 86 67 44 76 45 77 60 30 89 25 03 81 33 14 94 82 05 67 63 66 74 04 18 70 54 19 82 88 99 43 56 14 13 53 56 80 98 72 49 39 54 32 55 47 96 48 11 12 82 11 54 44 80 89 07 84 90 16 30 67 13 92 63 14 09 56 08 57 93 71 29 99 55 74 93 25 07 42 21 98 26 08 77 54 11 27 95 21 24 99 56 81 62 60 89 39 35 79 30 60 4 09 09 36 06 44 97 77 98 31 93 07 54 41 30 348 Medical Statistics and Demography Made Easy Index A Addition rule of probability 75 Age and sex composition 211 Age pyramid 211 Age specific fertility rate 224 Alternative hypothesis 100 Analysis of variance table 140 Analytical studies 175 Application of ‘t’ distribution 125 Arithmetic mean 16 Association 62 Assumption for student’s ‘t’ test 125 Attributable risk 182 Attributes 2 B Bar chart 5 Base line 164 Basic population data 256 Binominal distribution 48 Blinding (Masking) 164 C Case control study 176 Case definition 164 Case report 174 Case series 174 Census 2001 250 Chi square distribution 114 
Classical probability 75
Cluster sampling 86
Coefficient of dispersion 35
Coefficient of variation 35
Cohort 165
Cohort study 175
Comparative statistics of different indicators 279
Comparison of several proportions (2 × k contingency table) 118
Comparison of two proportions by Chi square 118
Concept of population policy 289
Conditional probability 78
Confidence limits 107
Confounding bias 179
Contingency table (2 × 2 table) 121
Continuous variable 2
Correlation 62
Country health profile 261
Critical region 100, 103
Critical value 103
Cross-sectional studies 175
Crude birth rate 224, 277
Crude death rate 214, 278
Cumulative frequency curve 7

D
Decile 33
Degree of freedom 115
Demographic cycle 210
Denominator 167
Density 252
Density of population 213
Dependency ratio 212
Descriptive studies 173
Design of experiments 92
Diagnostic accuracy 191
Direct standardization 219
Discrete variable 2
Dispersion 32

E
Ecological bias 179
Equally likely events 74
Exact sampling distribution 114
Exhaustive events 74
Experimental studies 176
Experimental unit 165
Exposure rates 183

F
Failure 106
Family size 213
Fertility trends 251
First quartile 32
Fourfold classification 118
Frequency curve 10
Frequency distribution table 4
Frequency polygon 10
F-statistic 134
F-test for equality of population variance 135
F-test for equality of several means 135

G
General contingency table (r × s) 120
General fertility rate 224
Geometric mean 24
Goals of national population policy 295
Goodness of fit 117
Gross reproductive rate 225
Growth rate 230, 252

H
Harmonic mean 25
Histogram 10
History of census 248
Hospital records 243

I
Impossible event 75
Incidence rate (person) 168
Incidence rate (spell) 169
Incidence rates 180
Independence of attributes 118
Independent events 74
Indirect standardization 221
Infant mortality rate 215, 278
Issue of the adolescents 255

K
Key population statistics of India 1901-2001 292
Kurtosis 41

L
Landmarks in the evolution of India’s national population policy 299
Level of significance 101
Life expectancy 213
Life table 227
Likelihood ratio 193
Line diagram 9
Literacy 252
Literacy rate in India 271
Local control 94
Longitudinal studies 174

M
Manifold classification 118
Mann-Whitney U test 156
Maternal mortality rate 223
Mean deviation 34
Measurement bias 179
Measurement of morbidity 168
Measurement of mortality 168
Median 17
Median test 154
Mid year population 167
Mode 20
Mode of F-distribution 134
Mortality indicators for all India, 1971-1998 293
Mortality trends 291
Multiplication rule of probability 77
Multistage sampling 89
Mutually exclusive events 74

N
Negative predictive value 187
Neonatal mortality rate 215
Net reproductive rate 226
Nominal 2
Non-parametric tests 152
Normal distribution 50
Null hypothesis 100
Numerator 167

O
Observational studies 173
Odds ratio 184
One tailed test 102
One way analysis of variance 135
Ordinal 2

P
Paired ‘t’ test 127
Parameter 89
Percentile 33
Perinatal mortality rate 216
Period prevalence 170
Pictogram 6
Pie chart 6
Placebo 164
Point prevalence 170
Poisson distribution 49
Population 84
Population at risk 167
Population census 240
Positive predictive value 187
Postnatal mortality rate 215
Power of test 102
Prevalence 169, 191
Primary data 2
Proportion 167
Proportional mortality rate 217
Prospective study 165
Provisional population totals: India - part I 258
Provisional population totals: India 255

Q
Quartile deviation 32

R
Random sampling 84
Random series 74
Randomization 93
Randomized controlled laboratory study 178
Randomized controlled clinical trials 177
Randomized cross-over clinical trials 177
Range 32
Rate 166
Ratio 166
Readers bias 180
Region of acceptance 103
Region of rejection 103
Registration of Births and Deaths Act, 1969 242
Registration of vital events 241
Regression 64
Regression coefficient 64
Relative risk 181
Replication 93
Retrospective study 165
Role of targets 294
Root mean square deviation 34
Run test 153
Rural-urban distribution of population 267

S
Sample 84
Sample registration system 242
Sample size 84
Sample surveys 243
Sampling bias 180
Sampling distribution 89
Sampling of attribute 106
Scatter diagram 11
Screening bias 179
Second quartile 32
Secondary data 2
Sensitivity 186
Sex ratio 212
Sign test 155
Significant value 103
Skewness 40
Skewness of F-distribution 134
Sources of health information 240
Specificity 187
Stable population 212
Standard deviation 34
Standard error 89
Standard normal variate 52
Standardized death rate 218
State wise distribution of households 273
Stationary population 212
Statistic 89
Statistical hypothesis 100
Statistical methods in epidemiology 163
Status of children 254
Status of women’s health 253
Still birth rate 217
Stratified sampling 85
Success 106
Summary of census 2001 283
Sure event 75
Systematic error 178
Systematic sampling 85

T
t-test for difference of mean 126
t-test for significance of correlation coefficient 128
t-test for single mean 126
Tables 3
Test of significance for difference of mean 111
Test of significance for difference of proportion 107
Test for significance for large samples 105
Test of significance for single mean 111
Test for single proportion 106
Test of significance 102
Third quartile 32
Total fertility rate 225
Trials and events 74
Two tailed test 102
Type-I error 101
Type-II error 101

V
Variable 2
Vital rates per 1000 population, India 1901-1990 293