Objective: Numerically examine the heights of a sample of students using the median, interquartile range, and related measures. Determine whether these measures give a good description of the distribution and assess numerically how the distribution depends on gender.
Problem Description: Measurements were made on a random sample of 40 students. The raw data is given in Appendix A of Rees and is reproduced as a set of variables in StatObjects. The following variables were recorded:
The student data is used in several examples and exercises in Rees and will be used in examples here. This example steps through a sequence of analyses in preparation for the exercise.
The moments, as represented by the mean and standard deviation, may not be the most appropriate measures of location and dispersion. The quantiles (or functions of the quantiles) provide alternative measures which are less susceptible to the effects of outliers and skewness. More specifically, we can use the sample median and interquartile range to measure location and dispersion.
Quantiles are given in text reports at the bottom of the normal quantile plot. The normal quantile plot is obtained from the normalPlot menu item of the Histogram menu from the histogram plot. Thus, the height histogram must be obtained before the height quantiles can be obtained. Press on the Histogram menu of the histogram, select normalPlot, and release. A normal probability plot will appear.
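For readers who want to see what a normal quantile plot contains, the following is a minimal Python sketch, not the StatObjects implementation. It pairs each ordered value with a standard normal quantile. The plotting position (i - 0.5)/n, the function name, and the demo heights are all assumptions made for illustration; StatObjects may use a different convention.

```python
from statistics import NormalDist


def normal_quantile_points(values):
    """Pair each ordered value with a standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # assumed plotting position, not taken from StatObjects
        points.append((std_normal.inv_cdf(p), value))
    return points


# Hypothetical heights in cm, for illustration only (not the Rees sample).
demo_heights = [171.0, 158.0, 165.0, 180.0, 162.0]
points = normal_quantile_points(demo_heights)
for z, height in points:
    print(f"{z:6.3f}  {height:.1f}")
```

Plotting the second column against the first gives the normal quantile plot; points from a normal sample fall close to a straight line.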
The quantile plot controls in the tool bar are essentially the same as for histograms; they are described in the Histogram Module Example #1 and are not repeated here. Once you have generated the plot, click the triangular reveal button near the bottom of the plot (labeled "Quantiles") to show the quantiles. The following plot with quantiles is obtained:
A convenient way to numerically summarize a sample of values is to report the min, lower quartile, median, upper quartile, and max in this order. These statistics collectively are called a five-number summary and they form the basis of the (simple) box plot. The first five values in the Quantiles report give the five-number summary. Verify that the five-number height summary is given by the sequence: 152.0, 162.0, 167.5, 176.0, and 184.0 cm.
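A five-number summary can also be computed outside StatObjects. The sketch below uses Python's standard library and assumes the "inclusive" quartile rule of `statistics.quantiles`. Quartile conventions differ between packages, so the values need not match the Quantiles report exactly. The function name and the demo data are hypothetical; the demo data is not the Rees sample.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    # Assumption: the "inclusive" interpolation rule; other rules give
    # slightly different quartiles for small samples.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical heights in cm, for illustration only.
demo_heights = [152.0, 160.0, 164.0, 168.0, 172.0, 176.0, 184.0]
summary = five_number_summary(demo_heights)
print(summary)
```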
The location (or center) of the sample distribution can be estimated by the sample median. The sample median height is 167.5 cm, which is only slightly smaller than the sample mean (168.25 cm). This small difference is partially attributable to the bimodality of the distribution due to including both males and females in the sample.
The dispersion of the sample distribution can be estimated by the sample interquartile range. The lower and upper sample quartiles are 162.0 and 176.0 cm, respectively. Consequently, the estimated interquartile range (IQR) is 14.0 cm.
The sample range is another measure of dispersion. Specifically, the minimum value is 152.0 cm, the maximum value is 184.0 cm, and the resulting range is 32.0 cm. However, the min, max and range are not reliable summary measures. They are the extreme values (or in the case of the range a function of the extreme values) which can be highly variable.
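The two dispersion measures above follow directly from the five-number summary. As a small check, the arithmetic can be written in Python as follows (the variable names are illustrative only):

```python
# Five-number height summary (cm) as read from the Quantiles report.
minimum, q1, median, q3, maximum = 152.0, 162.0, 167.5, 176.0, 184.0

iqr = q3 - q1                     # interquartile range
sample_range = maximum - minimum  # range, a function of the two extremes

print(f"IQR   = {iqr:.1f} cm")
print(f"range = {sample_range:.1f} cm")
```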
Normal quantile plots are used primarily to assess normality. The points will follow a straight line (approximately) if the underlying distribution generating the data is normal. A straight line can be fit to the ordered values by selecting the Robust Fit menu item from the QuanPlot menu. In addition, the quartiles, including the median, can be viewed graphically. Select the Quantile Lines menu item from the QuanPlot menu. The following plot shows the quantile lines and robust fit:
The middle horizontal black line in the above plot is drawn at the median value of 167.5 cm. The outer black lines are drawn horizontally at the quartiles. Compare the quartile values in the Quantiles report to the y-axis intersection values to verify their equivalence. If outliers are present in the sample, outer red lines are drawn and all observations beyond these lines are considered outliers. Since no red lines were drawn, none of the student heights are outliers.
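The text does not state which rule places the red outlier lines. A common convention, assumed here purely for illustration, is Tukey's rule: flag values more than 1.5 IQRs beyond the quartiles. Under that assumption, the sketch below checks whether the extremes of the height sample fall inside the fences.

```python
# Quartiles and extremes (cm) from the five-number height summary.
minimum, q1, q3, maximum = 152.0, 162.0, 176.0, 184.0

iqr = q3 - q1
# Assumption: Tukey's 1.5 * IQR fences; StatObjects may use another rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# If the extremes lie inside the fences, no observation can be an outlier.
has_outliers = minimum < lower_fence or maximum > upper_fence
print(lower_fence, upper_fence, has_outliers)
```

With these fences the smallest and largest heights both lie inside the interval, which agrees with the absence of red lines in the plot.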
The ordered values do not fit the purple fitted line well, which confirms the non-normality. However, that is not the issue in this module. We want to relate the sample IQR and standard deviation. For a theoretical normal distribution, the IQR is equal to 1.349 standard deviations. Thus, the standard deviation can be estimated from the sample IQR as 14/1.349 = 10.38 cm. Graphically, this corresponds to the slope of the robust fitted line, which is itself an estimate of the standard deviation. It is somewhat surprising that the standard deviation estimated from the interquartile range (10.38 cm) exceeds the sample standard deviation (9.108 cm). The higher IQR-estimated standard deviation appears to be related to the upturn in points near the beginning of the male heights.
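The factor relating the IQR to the standard deviation of a normal distribution is the distance between its 25th and 75th percentiles in standard deviation units. The sketch below derives that factor with Python's `statistics.NormalDist` and applies it to the sample IQR of 14.0 cm; the variable names are illustrative.

```python
from statistics import NormalDist

# Width of the middle 50% of a standard normal distribution,
# in standard deviation units (about 1.349).
iqr_factor = 2 * NormalDist().inv_cdf(0.75)

sample_iqr = 14.0  # cm, from the quartiles 162.0 and 176.0
sd_from_iqr = sample_iqr / iqr_factor

print(f"factor = {iqr_factor:.4f}")
print(f"IQR-based SD estimate = {sd_from_iqr:.2f} cm")
```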
When height data from males and females are combined, neither the moment nor the quantile measures of location and spread will be very meaningful. The measures of location will land somewhere between the two modes and the measures of spread will include the gender shift in the distribution. As in the case of moments, the sample height quantiles need to be examined separately for males and females.
The height histogram conditioning on sex must be obtained before the gender-based normal quantile plots can be displayed. As above, press on the Histogram menu and select the normalPlot menu item to show the gender-based normal quantile plots. Also select the Robust Fit and Quantile Lines menu items from the QuantPlot menu. The following plot shows the female height normal quantile plot:
Click on the right-hand side of the conditioning slider to show the male height normal quantile plot. The following plot results:
Verify that the five-number summary for males is the sequence: 169.0, 175.0, 180.0, 182.0, and 184.0 cm. The male sample median is 180.0 cm and the interquartile range is 7.0 cm.
The sample median height for males is 180.0 cm as compared to 164.0 cm for females. This can be experienced dynamically by clicking on the slider repeatedly (right, left, right, etc.). The movement up and down, specifically of the middle black line, is a measure of the shift between the two sample distributions.
The sample height interquartile range for males is 7.0 cm as compared to 10.5 cm for the females. The change in the variability is seen dynamically by the change in the slope of the robust fitted line when toggling between the male and female normal quantile plots. The lower IQR for males may be an anomaly due to the small sample size.
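The gender comparison can be summarized numerically from the reported medians and interquartile ranges. The following sketch computes the shift in medians and, assuming each group is roughly normal, the standard deviation implied by each IQR. The dictionary layout and names are illustrative only.

```python
from statistics import NormalDist

# Reported sample medians and interquartile ranges (cm).
groups = {
    "male": {"median": 180.0, "iqr": 7.0},
    "female": {"median": 164.0, "iqr": 10.5},
}

# Shift in location between the two sample distributions.
median_shift = groups["male"]["median"] - groups["female"]["median"]

# IQR of a normal distribution in standard deviation units (about 1.349).
iqr_factor = 2 * NormalDist().inv_cdf(0.75)
# Assumption: approximate normality within each group.
sd_estimates = {sex: g["iqr"] / iqr_factor for sex, g in groups.items()}

print(f"median shift = {median_shift:.1f} cm")
for sex, sd in sd_estimates.items():
    print(f"{sex}: IQR-based SD estimate = {sd:.2f} cm")
```

The smaller estimate for males mirrors the flatter robust fitted line seen when toggling the slider.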