Example #1: New Jersey County Areas

Contents: Objective, Problem Description, Stem and Leaf Plot, Histogram, Histogram Properties, Interpretation

^ Objective

Visually examine the distribution of county sizes in the State of New Jersey. Characterize the shape of the distribution and compare it to symmetrically distributed data.

^ Problem Description

Values were obtained for the sizes of counties in the State of New Jersey. The following variables were recorded:

County: New Jersey county
Area: Size of county in square miles

The 21 areas constitute the complete set of values, i.e., these areas are not sampled from some larger population of counties in New Jersey. Therefore, only descriptive summaries and graphs are meaningful.

The New Jersey county data can be displayed by a dotplot, a box and whisker plot, a stem and leaf plot, and a standard histogram, among others. These plots reveal different aspects of the same underlying data values. The dotplot and the box and whisker plot are discussed elsewhere.

^ Stem and Leaf Plot

The stem and leaf plot is easy to construct with pencil and paper for small datasets, but it lacks the flexibility of the histogram. The following stem and leaf plot displays the New Jersey county areas and shows their approximate symmetry. The stems (the numbers to the left of the vertical bars) are the values of the most significant digit (hundreds here). The original values can be reconstructed by combining each stem value with its leaf values (the least significant digit or digits). For example, stem 0 with leaf 47 gives the smallest county area, 47 sq mi.

hundreds | tens and units

0 | 47
1 | 03, 30, 92
2 | 21, 28, 34, 67
3 | 07, 12, 29, 62, 65
4 | 23, 68, 76
5 | 00, 27, 69
6 | 42
7 |
8 | 19

Burlington County has the largest area (819 sq mi) and it stands out from the other county areas. The remaining county areas are approximately symmetric. Counties can be identified by examining the data source link in the Problem Description above or by clicking on the Data button in the Histogram applet.
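The construction of the stem and leaf plot above can be sketched in a few lines of Python. This is only an illustration: the list `AREAS` is read off the plot itself, and the function names `stem_and_leaf` and `format_plot` are hypothetical helpers, not part of the Histogram applet.

```python
from collections import defaultdict

# County areas in square miles, read off the stem and leaf plot above.
AREAS = [47, 103, 130, 192, 221, 228, 234, 267, 307, 312, 329,
         362, 365, 423, 468, 476, 500, 527, 569, 642, 819]


def stem_and_leaf(values, stem_unit=100):
    """Map each stem (value // stem_unit) to its sorted leaves (remainders)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // stem_unit].append(v % stem_unit)
    # Keep empty stems so that gaps (such as 700-799) remain visible.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}


def format_plot(plot, leaf_width=2):
    """Render one 'stem | leaves' line per stem, with zero-padded leaves."""
    lines = []
    for stem, leaves in plot.items():
        text = ", ".join(f"{leaf:0{leaf_width}d}" for leaf in leaves)
        lines.append(f"{stem} | {text}".rstrip())
    return "\n".join(lines)


PLOT = stem_and_leaf(AREAS)
print(format_plot(PLOT))
```

Keeping the empty stem for 700-799 is a deliberate choice in this sketch: dropping it would hide the gap that makes the largest county stand out.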

Usually, leaf values are rounded to one significant digit in stem and leaf plots. If the above values are rounded to the nearest ten, the units value is always 0 and thus is dropped. As an illustration, the stem corresponding to 100-199 would be 1 | 0 3 9 after rounding. The resulting stem and leaf plot is:

hundreds | tens

0 | 5
1 | 0 3 9
2 | 2 3 3 7
3 | 1 1 3 6 7
4 | 2 7 8
5 | 0 3 7
6 | 4
7 |
8 | 2
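The rounding step can be sketched as follows. The sketch assumes that ties round up (so 365 becomes 370); Python's built-in `round` sends ties to the nearest even multiple instead, so an explicit helper is used. The names `round_half_up_to_ten` and `rounded_stem_and_leaf` are hypothetical.

```python
# County areas in square miles, read off the first stem and leaf plot.
AREAS = [47, 103, 130, 192, 221, 228, 234, 267, 307, 312, 329,
         362, 365, 423, 468, 476, 500, 527, 569, 642, 819]


def round_half_up_to_ten(value):
    """Round a non-negative integer to the nearest ten; ties go up."""
    return (value + 5) // 10 * 10


def rounded_stem_and_leaf(values):
    """Map each hundreds stem to the tens digits of the rounded values."""
    rounded = sorted(round_half_up_to_ten(v) for v in values)
    stems = {}
    for r in rounded:
        stems.setdefault(r // 100, []).append(r % 100 // 10)
    # Keep empty stems so that gaps in the data remain visible.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}


ROUNDED_PLOT = rounded_stem_and_leaf(AREAS)
for stem, leaves in ROUNDED_PLOT.items():
    print(f"{stem} | " + " ".join(str(leaf) for leaf in leaves))
```

Rounding can in principle move a value into the next stem (297 would round to 300); none of the 21 areas does so.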

^ Histogram

The histogram below displays the areas of the 21 counties in New Jersey. The histogram classifies each county area (and hence the county) into one of a series of contiguous groups. Verify that 9 counties have areas between 200 and 400 sq mi and 15 have areas between 200 and 600 sq mi.
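The two counts can also be checked directly against the data. The following is a minimal sketch: `AREAS` is read off the stem and leaf plot, `count_between` is a hypothetical helper, and the upper limit of each range is assumed to be included (no area equals 200, 400, or 600, so the assumption does not affect these particular counts).

```python
# County areas in square miles, read off the stem and leaf plot.
AREAS = [47, 103, 130, 192, 221, 228, 234, 267, 307, 312, 329,
         362, 365, 423, 468, 476, 500, 527, 569, 642, 819]


def count_between(values, low, high):
    """Count the values v with low < v <= high."""
    return sum(1 for v in values if low < v <= high)


COUNT_200_400 = count_between(AREAS, 200, 400)
COUNT_200_600 = count_between(AREAS, 200, 600)
print(COUNT_200_400)  # 9
print(COUNT_200_600)  # 15
```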

^ Histogram Properties

The histogram display depends on default choices for: 1) the starting value (0 sq mi), 2) the number of groups (5), and 3) the group width (200 sq mi). The starting value must be less than the smallest data value. The group widths are all equal, but this is not an absolute requirement. The default group boundaries (0 sq mi, 200 sq mi, etc.) and group width (200 sq mi) are chosen to be "nice" numbers by an algorithm in the Histogram applet. The user can change the number of groups (or bars) in the histogram by clicking on the arrow button in the tool bar. Currently, the starting value cannot be changed. However, since the outer group boundaries (0 and 1000 sq mi) remain fixed, the group width changes when the number of groups is increased or decreased.
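The relationship between the number of groups and the group width can be sketched as below. This assumes equal-width groups between the fixed outer boundaries stated above (0 and 1000 sq mi); `group_boundaries` is a hypothetical helper and not the applet's own algorithm for choosing "nice" numbers.

```python
def group_boundaries(start, end, n_groups):
    """Return the n_groups + 1 boundaries of equal-width groups on [start, end]."""
    width = (end - start) / n_groups
    return [start + i * width for i in range(n_groups + 1)]


# With the outer boundaries fixed, more groups means narrower groups.
for n in (4, 5, 8, 10):
    edges = group_boundaries(0, 1000, n)
    print(n, "groups, width", edges[1] - edges[0], "sq mi")
```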

The histogram above does not represent the areas very well. Changing the number of groups may improve the display. Increase the number of bars to 9 by clicking on the right-hand arrow tool five times. These 9 bars should correspond to the 9 stems in the stem and leaf plot above, but a slight discrepancy between the two plots is seen. What is its cause?

^ Interpretation

The displayed distribution of values (areas here) in a histogram depends strongly on the three properties mentioned above. Increasing the number of bars to 9 gives a better representation of the data. The Burlington County area clearly stands out and the remaining values are approximately symmetrically distributed. Click on the bar corresponding to the largest value and then click on red in the color palette to mark Burlington County. Then click on the Data button to display all the values and their labels. The color information is linked between the datasheet and the histogram.

The choice of boundary values causes minor changes in the displayed distribution. Specifically, Cumberland County has an area of 500 sq mi. This value falls in the 500-599 sq mi stem in the stem and leaf plot, whereas it falls in the 400-500 sq mi group in the histogram display, i.e., the groups in the histogram include the upper but not the lower limit. This accounts for the difference in frequencies between the two displays.
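The effect of the boundary convention can be reproduced with a short sketch. It assumes groups of width 100 sq mi between 0 and 1000 sq mi; `group_counts` and its `include_upper` flag are hypothetical names used only for this illustration.

```python
# County areas in square miles, read off the stem and leaf plot.
AREAS = [47, 103, 130, 192, 221, 228, 234, 267, 307, 312, 329,
         362, 365, 423, 468, 476, 500, 527, 569, 642, 819]


def group_counts(values, edges, include_upper=True):
    """Count values per group; each group includes its upper or its lower limit."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            lo, hi = edges[i], edges[i + 1]
            inside = (lo < v <= hi) if include_upper else (lo <= v < hi)
            if inside:
                counts[i] += 1
                break
    return counts


EDGES = list(range(0, 1001, 100))
HISTOGRAM_STYLE = group_counts(AREAS, EDGES, include_upper=True)
STEM_STYLE = group_counts(AREAS, EDGES, include_upper=False)

# Only the two groups that meet at 500 sq mi differ between the conventions.
print(HISTOGRAM_STYLE[4], HISTOGRAM_STYLE[5])  # 4 2
print(STEM_STYLE[4], STEM_STYLE[5])            # 3 3
```

Because 500 is the only area that falls exactly on a group boundary, the two conventions agree everywhere else.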

The dotplot and the box and whisker plot provide alternative views of the data, but these plots are not discussed here.