Objective: Visually explore the heights of a sample of students using a histogram. Characterize the shape of the distribution and assess how the distribution depends on gender.
Problem Description: Measurements were made on a random sample of 40 students. The raw data is given in Appendix A of Rees and is reproduced as a set of variables. The following variables were recorded:
The student data is used in several examples and exercises in Rees and will be used in examples here. This example steps through a sequence of analyses in preparation for Exercise #2.
The first task is to examine graphically the heights of students. The height histogram shows how the values are distributed along a single dimension.
Before interpreting the histogram, we will discuss the graph controls in the ToolBar along the top of the plot. The next control item is a popup menu which provides options for the selected plot type. Pressing on the Options menu displays menu items which act on, or add to, the plot. For example, Normal Density superimposes an estimated normal density curve on the histogram, with the center (the sample mean) and the points of inflection (the mean plus or minus the sample standard deviation) indicated by vertical lines. (The mean and standard deviation are discussed in the Moments Module.)
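The Normal Density overlay can be approximated outside the program. The sketch below is only an illustration: the function name normal_density_overlay and the height values are hypothetical (they are not the Rees data), and it assumes the sample standard deviation uses the n - 1 divisor, which the text does not state.

```python
import math
import statistics


def normal_density_overlay(values, n_points=101):
    """Return the sample mean, the sample sd, and (x, density) points
    of a normal curve with that mean and sd."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: n - 1 divisor
    lo, hi = mean - 3 * sd, mean + 3 * sd
    xs = [lo + i * (hi - lo) / (n_points - 1) for i in range(n_points)]
    ys = [
        math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        for x in xs
    ]
    return mean, sd, list(zip(xs, ys))


# Hypothetical heights in cm, NOT the Rees student data.
heights = [150.0, 160.0, 165.0, 170.0, 175.0, 180.0]
mean, sd, curve = normal_density_overlay(heights)

# The vertical lines described above: the center and the two inflection points.
center = mean
inflection_points = (mean - sd, mean + sd)
print(center, inflection_points)
```

The curve changes from concave to convex exactly one standard deviation on either side of the mean, which is why those two positions are marked.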
The next set of controls (the Color Palette) allows the color of selected points in all linked plots to be set. The Double Arrow control animates the plot by dynamically changing the number of bins. Before continuing, experiment with these controls.
The Data button opens a linked window (a Names List) containing observation labels. This is used to identify selected points in the plot or to find points in the plot by label names. Default labels (1 through n) are given to observations if labels are not supplied. The last control is the Reset button, which reverts the plot to its state when first opened.
A histogram tells us how the values are distributed along a single dimension. In particular, we try to characterize visually the center, dispersion, and shape of the distribution. Since many statistical techniques assume an underlying normal distribution, i.e., a bell-shaped curve, we will assess normality. A visual inspection suggests that normality is not tenable since there are too many high values.
A more precise method for assessing normality is to fit a normal curve to the data. This can be done by selecting the Normal Density item in the Histogram popup menu. The lack of symmetry is at least partially due to mixing males and females in the same histogram. The distribution is actually bimodal, which can be seen more easily by increasing the number of bins from 7 to 8 using the Double Arrow control.
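Why the number of bins matters can be seen by counting the same values into 7 and then 8 bins. This is a minimal sketch under stated assumptions: bin_counts is a hypothetical helper, the bins are assumed to be of equal width and to span the minimum to the maximum value (the program's actual binning rule is not given here), and the heights are made-up values rather than the Rees data.

```python
def bin_counts(values, n_bins):
    """Count values into n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than past it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical heights in cm forming two loose clusters, NOT the Rees data.
heights = [152, 156, 158, 160, 161, 163, 164, 166,
           172, 175, 177, 178, 180, 181, 183, 185]

for n_bins in (7, 8):
    print(n_bins, bin_counts(heights, n_bins))
```

Because the bin edges shift when the count changes, a dip between two groups can be hidden at one setting and visible at another, so it is worth stepping through several settings before describing the shape.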
The bars of the histogram are made up of blocks, one for each student, so clicking on a bar selects a particular observation. The observation can be identified in the Names List. Open the Names List by clicking on the Data button in the tool bar. Verify that the shortest student is #3 by clicking on the left-most bar (assuming you have 8 bars showing). By holding down the shift key, multiple observations can be selected and identified. Dragging provides a better way of selecting adjacent observations. Dragging is done by pressing the mouse button and holding it down while you move across one or more bars. Drag across the right-most bar (of 8) and identify the tallest students (#1, #29, #35, #36, #37, and #39) using the Names List. Highlight the shortest student in blue and the tallest students in red. Are the tallest students male? We need another type of histogram to determine this.
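The link between a bar and the Names List amounts to looking up which observation labels fall in a given bin. The sketch below illustrates that idea only; labels_in_bin is a hypothetical helper, the equal-width binning is an assumption, and the four heights are made-up values, not the Rees data.

```python
def labels_in_bin(values, n_bins, bin_index, labels=None):
    """Return the labels of the observations that fall in one histogram bin."""
    if labels is None:
        # Default labels 1 through n, as described for the Names List.
        labels = list(range(1, len(values) + 1))
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumption: equal-width bins over min..max
    selected = []
    for label, v in zip(labels, values):
        idx = min(int((v - lo) / width), n_bins - 1)
        if idx == bin_index:
            selected.append(label)
    return selected


# Hypothetical heights in cm, NOT the Rees data.
heights = [150.0, 180.0, 160.0, 175.0]
shortest = labels_in_bin(heights, n_bins=3, bin_index=0)  # left-most bar
tallest = labels_in_bin(heights, n_bins=3, bin_index=2)   # right-most bar
print(shortest, tallest)
```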
The males and females can be identified individually and collectively by forming a height histogram conditioning on sex. This allows histograms of females and males to be examined separately and dynamically. Initially the females are displayed. The slider in the tool bar is used to step through the levels of the categorical variable, i.e., sex in this case. Note that the shortest student is indeed a female. The distribution of female heights follows the normal distribution more closely than the overall histogram does, although some deviations from symmetry are apparent.
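Conditioning on a categorical variable means splitting the measurements by level and examining each subset on its own. A minimal sketch follows; split_by_level is a hypothetical helper, and the heights and sex codes are made-up values, not the Rees data.

```python
from collections import defaultdict


def split_by_level(values, levels):
    """Group values by the level of a categorical variable."""
    groups = defaultdict(list)
    for value, level in zip(values, levels):
        groups[level].append(value)
    return dict(groups)


# Hypothetical heights (cm) and sex codes, NOT the Rees data.
heights = [150.0, 180.0, 160.0, 175.0, 158.0]
sex = ["F", "M", "F", "M", "F"]

by_sex = split_by_level(heights, sex)
for level in sorted(by_sex):
    # Each subset would get its own histogram, one per slider position.
    print(level, by_sex[level])
```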
Click on the right part of the slider to change the histogram to males. If you still have the tallest students highlighted in the overall histogram, you will see that they are all male by toggling back and forth between the female and male conditional histograms. The male histogram has a skewness value of about -1 (skewness will be discussed in the Moments Module), which indicates extreme negative skewness. This shape is clearly evident in the male histogram.
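A skewness value can be computed in several ways, and the formula the program uses is left to the Moments Module. As a rough illustration, the sketch below assumes the simple moment coefficient (third central moment divided by the second central moment raised to the power 3/2); the program may apply a small-sample adjustment, so its values could differ slightly. The data are made-up values, not the Rees data.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (an assumed formula)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical values with a long left tail, NOT the Rees data.
left_tailed = [1, 9, 10, 10]
symmetric = [1, 2, 3]

print(skewness(left_tailed))  # negative: the tail extends toward low values
print(skewness(symmetric))    # zero for a symmetric sample
```

A negative value means the longer tail lies on the low side, so a few unusually short individuals in an otherwise tall group would produce the pattern described for the male histogram.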
In summary, Height is not normally distributed since the distribution is made up of two groups, males and females. Considered by sex, female heights are approximately normal whereas male heights show extreme negative skewness. We can ask: Are the Height distributions for males and females different, or is the apparent difference due to sampling variability? This question will be explored later.