Chapter 3

Numerical Descriptive Measures

USING STATISTICS: Evaluating the Performance of Mutual Funds

3.1 MEASURES OF CENTRAL TENDENCY, VARIATION, AND SHAPE
  The Mean
  The Median
  The Mode
  Quartiles
  The Geometric Mean
  The Range
  The Interquartile Range
  The Variance and the Standard Deviation
  The Coefficient of Variation
  Z Scores
  Shape
  Visual Explorations: Exploring Descriptive Statistics
  Microsoft Excel Descriptive Statistics Output
  Minitab Descriptive Statistics Output
3.2 NUMERICAL DESCRIPTIVE MEASURES FOR A POPULATION
  The Population Mean
  The Population Variance and Standard Deviation
  The Empirical Rule
  The Chebychev Rule
3.3 COMPUTING NUMERICAL DESCRIPTIVE MEASURES FROM A FREQUENCY DISTRIBUTION
3.4 EXPLORATORY DATA ANALYSIS
  The Five-Number Summary
  The Box-and-Whisker Plot
3.5 THE COVARIANCE AND THE COEFFICIENT OF CORRELATION
  The Covariance
  The Coefficient of Correlation
3.6 PITFALLS IN NUMERICAL DESCRIPTIVE MEASURES AND ETHICAL ISSUES
A.3 USING SOFTWARE FOR DESCRIPTIVE STATISTICS
  A.3.1 Microsoft Excel
  A.3.2 Minitab
  A.3.3 (CD-ROM Topic) SPSS

LEARNING OBJECTIVES
In this chapter, you learn:
• To describe the properties of central tendency, variation, and shape in numerical data
• To calculate descriptive summary measures for a population
• To construct and interpret a box-and-whisker plot
• To describe the covariance and the coefficient of correlation

USING STATISTICS
Evaluating the Performance of Mutual Funds

Return to the study of mutual funds introduced in Chapter 2. You want to decide which types of mutual funds to invest in. In the last chapter you learned how to present data in tables and charts. However, when dealing with numerical data, such as the return on investments in mutual funds in 2003, you also need to summarize the data and ask statistical questions. What is the central tendency for returns of the various funds?
For example, what is the mean return in 2003 for the low-risk, average-risk, and high-risk mutual funds? How much variability is present in the returns? Are the returns for high-risk funds more variable than for average-risk funds or low-risk funds? How can you use this information when deciding what mutual funds to invest in?

For numerical variables, you need more than the visual picture of a variable that you get from the graphs discussed in Chapter 2. For example, for the 2003 returns, you would like to determine not only whether the riskier funds had a higher 2003 return, but whether they also had greater variation, and how the returns for each risk group were distributed. You also want to examine whether there is a relationship between the expense ratio and the 2003 return. Reading this chapter will allow you to learn about some of the methods to measure:

• central tendency, the extent to which all of the data values group around a central value
• variation, the amount of dispersion or scattering of values away from a central point
• shape, the pattern of the distribution of values from the lowest value to the highest value

You will also learn about the covariance and the coefficient of correlation, which help measure the strength of the association between two numerical variables.

3.1 MEASURES OF CENTRAL TENDENCY, VARIATION, AND SHAPE

You can characterize any set of data by measuring its central tendency, variation, and shape. Most sets of data show a distinct central tendency to group around a central point. When people talk about an “average value” or the “middle value” or the most popular or frequent value, they are talking informally about the mean, median, and mode, three measures of central tendency. Variation measures the spread or dispersion of values in a data set. One simple measure of variation is the range, the difference between the highest and lowest value.
More commonly used in statistics are the standard deviation and variance, two measures explained later in this section. The shape of a data set represents a pattern of all the values from the lowest to highest value. As you will learn later in this section, many data sets have a pattern that looks approximately like a bell, with a peak of values somewhere in the middle.

The Mean

The arithmetic mean (typically referred to as the mean) is the most common measure of central tendency. The mean is the only common measure in which all the values play an equal role. The mean serves as a “balance point” in a set of data (like the fulcrum on a seesaw). You calculate the mean by adding together all the values in a data set and then dividing that sum by the number of values in the data set. The symbol $\bar{X}$, called X bar, is used to represent the mean of a sample. For a sample containing n values, the equation for the mean of a sample is written as

$$\bar{X} = \frac{\text{sum of the values}}{\text{number of values}}$$

Using the series $X_1, X_2, \ldots, X_n$ to represent the set of n values and n to represent the number of values, the equation becomes:

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

By using summation notation (discussed fully in Appendix B), you replace the numerator $X_1 + X_2 + \cdots + X_n$ by the term $\sum_{i=1}^{n} X_i$, which means sum all the $X_i$ values from the first X value, $X_1$, to the last X value, $X_n$, to form Equation (3.1), a formal definition of the sample mean.

SAMPLE MEAN
The sample mean is the sum of the values divided by the number of values.

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{3.1}$$

where
  $\bar{X}$ = sample mean
  $n$ = number of values or sample size
  $X_i$ = ith value of the variable X
  $\sum_{i=1}^{n} X_i$ = summation of all $X_i$ values in the sample

Because all the values play an equal role, a mean will be greatly affected by any value that is greatly different from the others in the data set. When you have such extreme values, you should avoid using the mean.
The mean can suggest what is a “typical” or central value for a data set. For example, if you knew the typical time it takes you to get ready in the morning, you might be able to better plan your morning and minimize any excessive lateness (or earliness) going to your destination. Suppose you define the time to get ready as the time in minutes (rounded to the nearest minute) from when you get out of bed to when you leave your home. You collect the times shown below for 10 consecutive work days:

Data file: TIMES

  Day:              1   2   3   4   5   6   7   8   9  10
  Time (minutes):  39  29  43  52  39  44  40  31  44  35

The mean time is 39.6 minutes, computed as follows:

$$\bar{X} = \frac{\text{sum of the values}}{\text{number of values}} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{39 + 29 + 43 + 52 + 39 + 44 + 40 + 31 + 44 + 35}{10} = \frac{396}{10} = 39.6$$

Even though no one day in the sample actually had the value 39.6 minutes, allotting about 40 minutes to get ready would be a good rule for planning your mornings, but only because the 10 days do not contain extreme values. Contrast this to the case in which the value on day four was 102 minutes instead of 52 minutes. This extreme value would cause the mean to rise to 44.6 minutes as follows:

$$\bar{X} = \frac{\text{sum of the values}}{\text{number of values}} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{446}{10} = 44.6$$

The one extreme value has increased the mean by more than 10%, from 39.6 to 44.6 minutes. The original mean was in the “middle,” greater than 5 of the get-ready times and less than the other 5. In contrast, the new mean is greater than 9 of the 10 get-ready times. The extreme value has caused the mean to be a poor measure of central tendency.

EXAMPLE 3.1 THE MEAN 2003 RETURN FOR SMALL CAP MUTUAL FUNDS WITH HIGH RISK

The 121 mutual funds that are part of the “Using Statistics” scenario are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap).
Compute the mean 2003 return for the small cap mutual funds with high risk.

Data file: MUTUALFUNDS2004

SOLUTION The mean 2003 return for the small cap mutual funds with high risk is 51.53, calculated as follows:

$$\bar{X} = \frac{\text{sum of the values}}{\text{number of values}} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{463.8}{9} = 51.53$$

The ordered array for the nine small cap mutual funds with high risk is:

  37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5

Four of these returns are below the mean of 51.53 and five of these returns are above the mean.

The Median

The median is the middle value in a set of data that has been ordered from lowest to highest value. It splits the ranked set of data into two equal parts. The median is not affected by extreme values, so you can use the median when extreme values are present.

To calculate the median for a set of data, you first rank the values from smallest to largest. Then use Equation (3.2) to compute the rank of the value that is the median.

MEDIAN
50% of the values are smaller than the median and 50% of the values are larger than the median.

$$\text{Median} = \frac{n + 1}{2}\ \text{ranked value} \tag{3.2}$$

You compute the median value by following one of two rules:

• Rule 1 If there are an odd number of values in the data set, the median is the middle ranked value.
• Rule 2 If there are an even number of values in the data set, then the median is the average of the two middle ranked values.

To compute the median for the sample of 10 times to get ready in the morning, you rank the daily times as follows:

  Ranked values:  29  31  35  39  39  40  43  44  44  52
  Ranks:           1   2   3   4   5   6   7   8   9  10

  Median = 39.5 (between the fifth and sixth ranked values)

Because the result of dividing n + 1 by 2 is (10 + 1)/2 = 5.5 for this sample of 10, you must use Rule 2 and average the fifth and sixth ranked values, 39 and 40. Therefore, the median is 39.5.
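The mean and median calculations for the get-ready times can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original text: the helper names `sample_mean` and `median_by_rank` are hypothetical, and the second data set simply repeats the earlier scenario in which day 4 takes 102 minutes instead of 52.

```python
def sample_mean(values):
    """Equation (3.1): sum of the values divided by the number of values."""
    return sum(values) / len(values)


def median_by_rank(values):
    """Equation (3.2): locate the (n + 1)/2 ranked value."""
    ranked = sorted(values)
    n = len(ranked)
    if n % 2 == 1:
        # Rule 1: odd n, the median is the middle ranked value.
        return ranked[(n + 1) // 2 - 1]
    # Rule 2: even n, average the two middle ranked values.
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2


# Get-ready times (minutes) for 10 consecutive work days.
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

# Same data with day 4 replaced by the extreme value 102.
times_extreme = times.copy()
times_extreme[3] = 102

print(sample_mean(times), median_by_rank(times))                  # 39.6 39.5
print(sample_mean(times_extreme), median_by_rank(times_extreme))  # 44.6 39.5
```

The extreme value moves the mean from 39.6 to 44.6 minutes, while the median stays at 39.5 minutes, which is the behavior the text describes.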
The median of 39.5 means that for half of the days, the time to get ready is less than or equal to 39.5 minutes, and for half of the days the time to get ready is greater than or equal to 39.5 minutes. The median time to get ready of 39.5 minutes is very close to the mean time to get ready of 39.6 minutes.

EXAMPLE 3.2 COMPUTING THE MEDIAN FROM AN ODD-SIZED SAMPLE

The 121 mutual funds that are part of the “Using Statistics” scenario are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the median 2003 return for the nine small cap mutual funds with high risk.

Data file: MUTUALFUNDS2004

SOLUTION Because the result of dividing n + 1 by 2 is (9 + 1)/2 = 5 for this sample of nine, using Rule 1, the median is the fifth ranked value. The percentage returns in 2003 for the nine small cap mutual funds with high risk are ranked from the smallest to the largest:

  Ranked values:  37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5
  Ranks:             1     2     3     4     5     6     7     8     9

  Median = 53.8 (the fifth ranked value)

The median return is 53.8. Half the small cap high-risk mutual funds have returns equal to or below 53.8 and half have returns equal to or above 53.8.

The Mode

The mode is the value in a set of data that appears most frequently. Like the median and unlike the mean, extreme values do not affect the mode. You should use the mode only for descriptive purposes as it is more variable from sample to sample than either the mean or the median. Often there is no mode or there are several modes in a set of data. For example, consider the time to get ready data shown below.

  29  31  35  39  39  40  43  44  44  52

There are two modes, 39 minutes and 44 minutes, since each of these values occurs twice.

EXAMPLE 3.3 COMPUTING THE MODE

A systems manager in charge of a company’s network keeps track of the number of server failures that occur in a day. Compute the mode for the following data that represents the number of server failures in a day for the past two weeks.

  1  3  0  3  26  2  7  4  0  2  3  3  6  3

SOLUTION The ordered array for these data is

  0  0  1  2  2  3  3  3  3  3  4  6  7  26

Because 3 appears five times, more times than any other value, the mode is 3. Thus, the systems manager can say that the most common occurrence is having three server failures in a day. For this data set, the median is also equal to 3 while the mean is equal to 4.5. The extreme value 26 is an outlier. For these data, the median and the mode better measure central tendency than the mean.

A set of data will have no mode if none of the values is “most typical.” Example 3.4 presents a data set with no mode.

EXAMPLE 3.4 DATA WITH NO MODE

Compute the mode for the 2003 return for the small cap mutual funds with high risk.

Data file: MUTUALFUNDS2004

SOLUTION The ordered array for these data is

  37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5

These data have no mode. None of the values is most typical because each value appears once.

Quartiles

Quartiles split a set of data into four equal parts. The first quartile Q1 divides the smallest 25.0% of the values from the other 75.0% that are larger. The second quartile Q2 is the median: 50.0% of the values are smaller than the median and 50.0% are larger. The third quartile Q3 divides the smallest 75.0% of the values from the largest 25.0%. Equations (3.3) and (3.4) define the first and third quartiles.

Note: Q1, the median, and Q3 are also the 25th, 50th, and 75th percentiles, respectively. Equations (3.2), (3.3), and (3.4) can be expressed generally in terms of finding percentiles: (p × 100)th percentile = p × (n + 1) ranked value.

FIRST QUARTILE Q1
25.0% of the values are smaller than Q1, the first quartile, and 75.0% are larger than the first quartile Q1.

$$Q_1 = \frac{n + 1}{4}\ \text{ranked value} \tag{3.3}$$

THIRD QUARTILE Q3
75.0% of the values are smaller than the third quartile Q3, and 25.0% are larger than the third quartile Q3.

$$Q_3 = \frac{3(n + 1)}{4}\ \text{ranked value} \tag{3.4}$$

Use the following rules to calculate the quartiles:

• Rule 1 If the result is a whole number, then the quartile is equal to that ranked value. For example, if the sample size n = 7, the first quartile Q1 is equal to the (7 + 1)/4 = second ranked value.
• Rule 2 If the result is a fractional half (2.5, 4.5, etc.), then the quartile is equal to the average of the corresponding ranked values. For example, if the sample size n = 9, the first quartile Q1 is equal to the (9 + 1)/4 = 2.5 ranked value, halfway between the second ranked value and the third ranked value.
• Rule 3 If the result is neither a whole number nor a fractional half, you round the result to the nearest integer and select that ranked value. For example, if the sample size n = 10, the first quartile Q1 is equal to the (10 + 1)/4 = 2.75 ranked value. Round 2.75 to 3 and use the third ranked value.

To illustrate the computation of the quartiles for the time-to-get-ready data, rank the data from smallest to largest.

  Ranked values:  29  31  35  39  39  40  43  44  44  52
  Ranks:           1   2   3   4   5   6   7   8   9  10

The first quartile is the (n + 1)/4 = (10 + 1)/4 = 2.75 ranked value. Using the third rule for quartiles, you round up to the third ranked value. The third ranked value for the get-ready time data is 35 minutes. You interpret the first quartile of 35 to mean that on 25% of the days the time to get ready is less than or equal to 35 minutes, and on 75% of the days, the time to get ready is greater than or equal to 35 minutes.

The third quartile is the 3(n + 1)/4 = 3(10 + 1)/4 = 8.25 ranked value. Using the third rule for quartiles, you round this down to the eighth ranked value. The eighth ranked value for the get-ready time data is 44 minutes. You interpret this to mean that on 75% of the days, the time to get ready is less than or equal to 44 minutes, and on 25% of the days, the time to get ready is greater than or equal to 44 minutes.

EXAMPLE 3.5 COMPUTING THE QUARTILES

The 121 mutual funds that are part of the “Using Statistics” scenario are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the first quartile (Q1) and third quartile (Q3) 2003 return for the small cap mutual funds with high risk.

Data file: MUTUALFUNDS2004

SOLUTION Ranked from smallest to largest, the percentage return in 2003 for the nine small cap mutual funds with high risk is:

  Ranked values:  37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5
  Ranks:             1     2     3     4     5     6     7     8     9

For these data

$$Q_1 = \frac{n + 1}{4}\ \text{ranked value} = \frac{9 + 1}{4}\ \text{ranked value} = 2.5\ \text{ranked value}$$

Therefore, using the second rule, Q1 is the 2.5 ranked value, halfway between the second ranked value and the third ranked value. Since the second ranked value is 39.2 and the third ranked value is 44.2, the first quartile Q1 is halfway between 39.2 and 44.2. Thus,

$$Q_1 = \frac{39.2 + 44.2}{2} = 41.7$$

To find the third quartile Q3

$$Q_3 = \frac{3(n + 1)}{4}\ \text{ranked value} = \frac{3(9 + 1)}{4}\ \text{ranked value} = 7.5\ \text{ranked value}$$

Therefore, using the second rule, Q3 is the 7.5 ranked value, halfway between the seventh ranked value and the eighth ranked value. Since the seventh ranked value is 59.3 and the eighth ranked value is 62.4, the third quartile Q3 is halfway between 59.3 and 62.4. Thus,

$$Q_3 = \frac{59.3 + 62.4}{2} = 60.85$$

The first quartile of 41.7 indicates that 25% of the returns in 2003 for small cap high-risk funds are below or equal to 41.7 and 75% are greater than or equal to 41.7. The third quartile of 60.85 indicates that 75% of the returns in 2003 for small cap high-risk funds are below or equal to 60.85 and 25% are greater than or equal to 60.85.
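The mode and the three quartile rules can be sketched in Python as follows. This is an illustrative sketch only: the function names `modes` and `quartile` are hypothetical, and `quartile` deliberately follows the ranked-value rules given in this section, which can differ from the interpolation methods that statistical libraries use by default.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        # Every value appears once, so none is "most typical".
        return []
    return sorted(v for v, c in counts.items() if c == top)


def quartile(values, k):
    """Quartile k (1 or 3) using Equations (3.3)/(3.4) and Rules 1 to 3."""
    ranked = sorted(values)
    position = k * (len(ranked) + 1) / 4   # 1-based ranked position
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        # Rule 1: whole number, use that ranked value.
        return ranked[whole - 1]
    if fraction == 0.5:
        # Rule 2: fractional half, average the two neighboring ranked values.
        return (ranked[whole - 1] + ranked[whole]) / 2
    # Rule 3: otherwise round to the nearest integer rank.
    return ranked[round(position) - 1]


times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
failures = [1, 3, 0, 3, 26, 2, 7, 4, 0, 2, 3, 3, 6, 3]
returns = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]

print(modes(times), modes(failures), modes(returns))  # [39, 44] [3] []
print(quartile(times, 1), quartile(times, 3))         # 35 44
print(quartile(returns, 1), quartile(returns, 3))     # about 41.7 and 60.85
```

The positions `k * (n + 1) / 4` are always multiples of 0.25, so the comparisons against 0 and 0.5 are exact in floating point.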
The Geometric Mean

The geometric mean and the geometric rate of return measure the status of an investment over time. The geometric mean measures the rate of change of a variable over time. Equation (3.5) defines the geometric mean.

GEOMETRIC MEAN
The geometric mean is the nth root of the product of n values.

$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n} \tag{3.5}$$

Equation (3.6) defines the geometric mean rate of return.

GEOMETRIC MEAN RATE OF RETURN

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1 \tag{3.6}$$

where
  $R_i$ is the rate of return in time period i

To illustrate using these measures, consider an investment of $100,000 that declined to a value of $50,000 at the end of year 1 and then rebounded back to its original $100,000 value at the end of year 2. The rate of return for this investment for the two-year period is 0, because the starting and ending value of the investment is unchanged. However, the arithmetic mean of the yearly rates of return of this investment is

$$\bar{X} = \frac{(-0.50) + (1.00)}{2} = 0.25 \text{ or } 25\%$$

since the rate of return for year 1 is

$$R_1 = \frac{50{,}000 - 100{,}000}{100{,}000} = -0.50 \text{ or } -50\%$$

and the rate of return for year 2 is

$$R_2 = \frac{100{,}000 - 50{,}000}{50{,}000} = 1.00 \text{ or } 100\%$$

Using Equation (3.6), the geometric mean rate of return for the two years is

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2)]^{1/n} - 1 = [(1 + (-0.50)) \times (1 + (1.0))]^{1/2} - 1 = [(0.50) \times (2.0)]^{1/2} - 1 = [1.0]^{1/2} - 1 = 1 - 1 = 0$$

Thus, the geometric mean rate of return more accurately reflects the (zero) change in the value of the investment for the two-year period than does the arithmetic mean.

EXAMPLE 3.6 COMPUTING THE GEOMETRIC MEAN RATE OF RETURN

The percentage change in the NASDAQ Composite Index was −31.53% in 2002 and +50.01% in 2003. Compute the geometric rate of return.

SOLUTION Using Equation (3.6), the geometric mean rate of return in the NASDAQ Composite Index for the two years is

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2)]^{1/n} - 1 = [(1 + (-0.3153)) \times (1 + (0.5001))]^{1/2} - 1 = [(0.6847) \times (1.5001)]^{1/2} - 1 = [1.0271]^{1/2} - 1 = 1.0135 - 1 = 0.0135$$

The geometric rate of return in the NASDAQ Composite Index for the two years is 1.35%.

The Range

The range is the simplest numerical descriptive measure of variation in a set of data.

RANGE
The range is equal to the largest value minus the smallest value.

$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}} \tag{3.7}$$

To determine the range of the times to get ready, you rank the data from smallest to largest:

  29  31  35  39  39  40  43  44  44  52

Using Equation (3.7), the range is 52 − 29 = 23 minutes. The range of 23 minutes indicates that the largest difference between any two days in the time to get ready in the morning is 23 minutes.

EXAMPLE 3.7 COMPUTING THE RANGE IN THE 2003 RETURN OF SMALL CAP HIGH-RISK MUTUAL FUNDS

The 121 mutual funds that are part of the “Using Statistics” scenario are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the range of the 2003 return for the small cap mutual funds with high risk.

Data file: MUTUALFUNDS2004

SOLUTION Ranked from the smallest to the largest, the 2003 return for the nine small cap mutual funds with high risk is:

  37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5

Therefore, using Equation (3.7), the range = 66.5 − 37.3 = 29.2. The largest difference between any two returns for the small cap mutual funds with high risk is 29.2.

The range measures the total spread in the set of data. Although the range is a simple measure of total variation in the data, it does not take into account how the data are distributed between the smallest and largest values. In other words, the range does not indicate if the values are evenly distributed throughout the data set, clustered near the middle, or clustered near one or both extremes. Thus, using the range as a measure of variation when at least one value is an extreme value is misleading.

The Interquartile Range

The interquartile range (also called midspread) is the difference between the third and first quartiles in a set of data.

INTERQUARTILE RANGE
The interquartile range is the difference between the third quartile and the first quartile.

$$\text{Interquartile range} = Q_3 - Q_1 \tag{3.8}$$

The interquartile range measures the spread in the middle 50% of the data; therefore, it is not influenced by extreme values. To determine the interquartile range of the times to get ready

  29  31  35  39  39  40  43  44  44  52

you use Equation (3.8) and the earlier results, Q1 = 35 and Q3 = 44.

  Interquartile range = 44 − 35 = 9 minutes

Therefore, the interquartile range in the time to get ready is 9 minutes. The interval 35 to 44 is often referred to as the middle fifty.

EXAMPLE 3.8 COMPUTING THE INTERQUARTILE RANGE FOR THE 2003 RETURN OF SMALL CAP HIGH-RISK MUTUAL FUNDS

The 121 mutual funds that are part of the “Using Statistics” scenario are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the interquartile range of the 2003 return for the small cap mutual funds with high risk.

Data file: MUTUALFUNDS2004

SOLUTION Ranked from smallest to largest, the 2003 return for the nine small cap mutual funds with high risk is:

  37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5

Using Equation (3.8) and the earlier results from Example 3.5, Q1 = 41.7 and Q3 = 60.85.

  Interquartile range = 60.85 − 41.7 = 19.15

Therefore, the interquartile range in the 2003 return is 19.15.

Because the interquartile range does not consider any value smaller than Q1 or larger than Q3, it cannot be affected by extreme values. Summary measures such as the median, Q1, Q3, and the interquartile range, which cannot be influenced by extreme values, are called resistant measures.
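Equations (3.6), (3.7), and (3.8) translate directly into code. The following Python sketch is illustrative only: the function names are hypothetical, and the quartiles passed to `interquartile_range` are the values already computed in the text rather than recomputed here.

```python
import math


def geometric_mean_return(rates):
    """Equation (3.6): nth root of the product of (1 + R_i), minus 1."""
    growth = math.prod(1 + r for r in rates)
    return growth ** (1 / len(rates)) - 1


def value_range(values):
    """Equation (3.7): largest value minus smallest value."""
    return max(values) - min(values)


def interquartile_range(q1, q3):
    """Equation (3.8): third quartile minus first quartile."""
    return q3 - q1


# Investment that falls 50% in year 1 and gains 100% in year 2.
print(geometric_mean_return([-0.50, 1.00]))                 # 0.0

# NASDAQ Composite Index: -31.53% in 2002, +50.01% in 2003.
print(round(geometric_mean_return([-0.3153, 0.5001]), 4))   # 0.0135

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
returns = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]

print(value_range(times))              # 23
print(round(value_range(returns), 2))  # 29.2

# Quartiles taken from the worked examples in the text.
print(interquartile_range(35, 44))                 # 9
print(round(interquartile_range(41.7, 60.85), 2))  # 19.15
```

`math.prod` requires Python 3.8 or later; on older versions a simple loop that multiplies the growth factors would serve the same purpose.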
therefore.8 COMPUTING THE INTERQUARTILE RANGE FOR THE 2003 RETURN OF SMALL CAP HIGH-RISK MUTUAL FUNDS The 121 mutual funds that are part of the “Using Statistics” scenario (see page 72) are classified according to the risk level of the mutual funds (low. mid cap. In statistics. X1. although both of these statistics will be zero if there is no variation at all in a set of data and each value in the sample is the same. For a sample containing n values. X3. This sum is then divided by the number of values minus 1 (for sample data) to get the sample variance (S 2). SAMPLE VARIANCE The sample variance is the sum of the squared differences around the mean divided by the sample size minus one. n S2 = where n ∑ ( X i − X )2 i =1 n −1 (3. you would find that because the mean is the balance point in a set of data. the sample variance (given by the symbol S2) is S2 = ( X1 − X )2 + ( X 2 − X )2 + L + ( X n − X )2 n −1 Equation (3. X2. if you did that. . These statistics measure the “average” scatter around the mean—how larger values fluctuate above it and how smaller values distribute below it. this quantity is called a sum of squares (or SS). .82 CHAPTER THREE Numerical Descriptive Measures The Variance and the Standard Deviation Although the range and the interquartile range are measures of variation. Xn. n S = S2 = ∑ ( X i − X )2 i =1 n −1 (3. A simple measure of variation around the mean might take the difference between each value and the mean and then sum these differences. Because the sum of squares are a sum of squared differences that by the rules of arithmetic will always be nonnegative. . neither the variance nor the standard deviation can ever be negative. For most sets of data. 
The square root of the sample variance is the sample standard deviation (S).9) X = mean n = sample size Xi = ith value of the variable X ∑ ( X i − X )2 = summation of all the squared differences between the Xi values and X i =1 SAMPLE STANDARD DEVIATION The sample standard deviation is the square root of the sum of the squared differences around the mean divided by the sample size minus one.10) . the variance and standard deviation will be a positive value.9) expresses the equation using summation notation. they do not take into consideration how the values distribute or cluster between the extremes. Two commonly used measures of variation that take into account how all the values in the data are distributed are the variance and the standard deviation. for every set of data these differences would sum to zero. . However. One measure of variation that would differ from data set to data set would square the difference between each value and the mean and then sum these squared differences. 1 shows Step 2. However.40 −0.96 19.10)].6 Time (X) 39 29 43 52 39 44 40 31 44 35 Step 1: (Xi − X ) Step 2: (Xi − X )2 −0. Step 5: Take the square root of the sample variance to get the sample standard deviation.1 shows the first four steps for calculating the variance and standard deviation for the getting ready times data with a mean ( X ) equal to 39. the difference between dividing by n or n − 1 becomes smaller and smaller. and Shape 83 If the denominator were n instead of n − 1.16 73.36 0. Variation. The sum of the squared differences (Step 3) is shown at the bottom of Table 3.60 4.36 11.1. The third column of Table 3.60 4. Step 2: Square each difference.10)] would calculate the average of the squared differences around the mean. For almost all sets of data. the standard deviation is always a number that is in the same units as the original sample data.56 153. Unlike the sample variance.1 shows Step 1. Table 3.40 −8.16 Step 3: Sum: Step 4: Divide by (n − 1): 412.82 . 
which is a squared quantity. To hand-calculate the sample variance S2 and the sample standard deviation S: Step 1: Compute the difference between each value and the mean. Step 3: Add the squared differences. TABLE 3. You will most likely use the sample standard deviation as your measure of variation [defined in Equation (3. Step 4: Divide this total by n − 1 to get the sample variance.60 0.76 0. the majority of the observed values lie within an interval of plus and minus one standard deviation above and below the mean.3.36 21.36 19. As the sample size increases.1: Measures of Central Tendency. The standard deviation helps you to know how a set of data clusters or distributes around its mean. Therefore. knowledge of the mean and the standard deviation usually helps define where at least the majority of the data values are clustering. The second column of Table 3.40 45.60 −10.36 112.40 0.6 (see page 74 for the calculation of the mean).40 −4. This total is then divided by 10 − 1 = 9 to compute the variance (Step 4).40 12. n − 1 is used because of certain desirable mathematical properties possessed by the statistic S2 that make it appropriate for statistical inference (which will be discussed in Chapter 7). Equation (3.9) [and the inner term in Equation (3.60 3.1 Computing the Variance of the Getting Ready Times X = 39. clustering between X − 1S = 32. EXAMPLE 3..395 .2 − 51. to compute the standard deviation you take the square root of the variance.6 ) 2 + ( 29 − 39.84 CHAPTER THREE Numerical Descriptive Measures You can also calculate the variance by substituting values for the terms in Equation (3. MUTUALFUNDS2004 SOLUTION Table 3.5 − 51.82 = 6.9 COMPUTING THE VARIANCE AND STANDARD DEVIATION OF THE 2003 RETURN OF SMALL CAP HIGH-RISK MUTUAL FUNDS The 121 mutual funds that are part of the “Using Statistics” scenario (see page 72) are classified according to the risk level of the mutual funds (low.e.4 9 = 45. In fact. 
7 out of 10 get-ready times lie within this interval.37). and high) and type (small cap. average.53) 2 + ( 39.83 and X + 1S = 46. For any set of data. mid cap.6 ) 2 10 − 1 = 412. this sum will always be zero: n ∑ ( X i − X ) = 0 for all sets of data i =1 This property is one of the reasons that the mean is used as the most common measure of central tendency.6 ) 2 + L + ( 35 − 39. Using the second column of Table 3. you can also calculate the sum of the differences between each value and the mean to be zero.1.53) 2 + L + ( 66. Using Equation (3. and large cap).9): n S2 = ∑ ( X i − X )2 i =1 n −1 = ( 39 − 39.77 This indicates that the get-ready times in this sample are clustering within 6.9) on page 82 n S2 = ∑ ( X i − X )2 i =1 n −1 = ( 44.2 illustrates the computation of the variance and standard deviation for the return in 2003 for the small cap mutual funds with high risk.10) on page 82. Using Equation (3.53) 2 9 −1 = 891.16 8 = 111.82 Because the variance is in squared units (in squared minutes for these data).5 − 51.77 minutes around the mean of 39.6 minutes (i. the sample standard deviation S is n S2 = S = ∑ ( X i − X )2 i =1 n −1 = 45. Compute the variance and standard deviation of the 2003 return for the small cap mutual funds with high risk. and standard deviation will all equal zero.2 62. or dispersed.. interquartile range. and standard deviation.55 The standard deviation of 10. standard deviation.2667 −14.8 37. the data are.2333 −7.3 44. The coefficient of variation.0844 60.4678 152.3211 25.5333 Return 2003 Step 1: (Xi − X ) Step 2: (Xi − X )2 44.0011 Step 3: Sum: Step 4: Divide by (n − 1): 891.1378 202.10) on page 82. and standard deviation.53 (i.6 53. the sample standard deviation S is n S = S2 = ∑ ( X i − X )2 i =1 n −1 = 111.9667 49.1: Measures of Central Tendency. interquartile range. and Shape TABLE 3.1111 118.16 111. clustering between X − 1S = 40.3. In fact. 
TABLE 3.2 Computing the Variance of the 2003 Return for the Small Cap Mutual Funds with High Risk ($\bar{X} = 51.53$)

2003 Return    Step 1: (X_i − X̄)    Step 2: (X_i − X̄)²
   44.5             −7.0333               49.4678
   39.2            −12.3333              152.1111
   62.4             10.8667              118.0844
   59.3              7.7667               60.3211
   56.6              5.0667               25.6711
   53.8              2.2667                5.1378
   37.3            −14.2333              202.5878
   44.2             −7.3333               53.7778
   66.5             14.9667              224.0011
Sum of squared differences:              891.16
Divided by (n − 1) = 8:                  111.395

Using Equation (3.10), the sample standard deviation is

$$S = \sqrt{S^2} = \sqrt{111.395} = 10.55$$

The standard deviation of 10.55 indicates that the 2003 returns for the small cap mutual funds with high risk are clustering within 10.55 around the mean of 51.53 (i.e., clustering between $\bar{X} - 1S = 40.98$ and $\bar{X} + 1S = 62.08$). In fact, 55.6% (5 out of 9) of the 2003 returns lie within this interval.

The following summarizes the characteristics of the range, interquartile range, variance, and standard deviation.

• The more spread out, or dispersed, the data are, the larger the range, interquartile range, variance, and standard deviation.
• The more concentrated, or homogeneous, the data are, the smaller the range, interquartile range, variance, and standard deviation.
• If the values are all the same (so that there is no variation in the data), the range, interquartile range, variance, and standard deviation will all equal zero.
• None of the measures of variation (the range, interquartile range, standard deviation, and variance) can ever be negative.

The Coefficient of Variation

Unlike the previous measures of variation presented, the coefficient of variation is a relative measure of variation that is always expressed as a percentage rather than in terms of the units of the particular data. The coefficient of variation, denoted by the symbol CV, measures the scatter in the data relative to the mean.

COEFFICIENT OF VARIATION
The coefficient of variation is equal to the standard deviation divided by the mean, multiplied by 100%.

$$CV = \left(\frac{S}{\bar{X}}\right) 100\% \qquad (3.11)$$

where
S = sample standard deviation
$\bar{X}$ = sample mean

For the sample of 10 get-ready times, since $\bar{X} = 39.6$ and $S = 6.77$, the coefficient of variation is

$$CV = \left(\frac{S}{\bar{X}}\right) 100\% = \left(\frac{6.77}{39.6}\right) 100\% = 17.10\%$$

For the get-ready times, the standard deviation is 17.1% of the size of the mean.

You will find the coefficient of variation very useful when comparing two or more sets of data that are measured in different units, as Example 3.10 illustrates.

EXAMPLE 3.10 COMPARING TWO COEFFICIENTS OF VARIATION WHEN TWO VARIABLES HAVE DIFFERENT UNITS OF MEASUREMENT

The operations manager of a package delivery service is deciding on whether to purchase a new fleet of trucks. When packages are stored in the trucks in preparation for delivery, you need to consider two major constraints: the weight (in pounds) and the volume (in cubic feet) for each item. The operations manager samples 200 packages, and finds that the mean weight is 26.0 pounds, with a standard deviation of 3.9 pounds, and the mean volume is 8.8 cubic feet, with a standard deviation of 2.2 cubic feet. How can the operations manager compare the variation of the weight and the volume?

SOLUTION Because the measurement units differ for the weight and volume constraints, the operations manager should compare the relative variability in the two types of measurements. For weight, the coefficient of variation is

$$CV_W = \left(\frac{3.9}{26.0}\right) 100\% = 15.0\%$$

For volume, the coefficient of variation is

$$CV_V = \left(\frac{2.2}{8.8}\right) 100\% = 25.0\%$$

Thus, relative to the mean, the package volume is much more variable than the package weight.
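The coefficient-of-variation calculations above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the function name `coefficient_of_variation` and the variable names are illustrative assumptions, not part of the text. Because the code works with the unrounded standard deviation, the get-ready result may differ in the second decimal place from the 17.10% obtained with S rounded to 6.77.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample CV as a percentage: (S / X-bar) * 100."""
    return stdev(values) / mean(values) * 100


# The 10 get-ready times (in minutes) used throughout the chapter.
get_ready_times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
cv_times = coefficient_of_variation(get_ready_times)

# Example 3.10 reports only summary statistics, so the ratio is formed directly.
cv_weight = 3.9 / 26.0 * 100  # weight, measured in pounds
cv_volume = 2.2 / 8.8 * 100   # volume, measured in cubic feet

print(f"CV of get-ready times: {cv_times:.2f}%")
print(f"CV of weight: {cv_weight:.1f}%   CV of volume: {cv_volume:.1f}%")
```

Since both results are percentages, the weight and volume figures can be compared directly even though the underlying units differ.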
Z Scores

An extreme value or outlier is a value located far away from the mean. Z scores are useful in identifying outliers. The larger the absolute value of the Z score, the farther the distance from the value to the mean. The Z score is the difference between the value and the mean, divided by the standard deviation.

Z SCORES

$$Z = \frac{X - \bar{X}}{S} \qquad (3.12)$$

For the time to get ready in the morning data, the mean is 39.6 minutes and the standard deviation is 6.77 minutes. The time to get ready on the first day is 39.0 minutes. You compute the Z score for day 1 from

$$Z = \frac{X - \bar{X}}{S} = \frac{39.0 - 39.6}{6.77} = -0.09$$

Table 3.3 shows the Z scores for all 10 days. The largest Z score is 1.83 for day 4, on which the time to get ready was 52 minutes. The lowest Z score is −1.57 for day 2, on which the time to get ready was 29 minutes. As a general rule, a Z score is considered an outlier if it is less than −3.0 or greater than +3.0. None of the times met that criterion to be considered outliers.

TABLE 3.3 Z Scores for the 10 Get-Ready Times

Time (X)              Z Score
   39                  −0.09
   29                  −1.57
   43                   0.50
   52                   1.83
   39                  −0.09
   44                   0.65
   40                   0.06
   31                  −1.27
   44                   0.65
   35                  −0.68
Mean                   39.6
Standard deviation      6.77

EXAMPLE 3.11 COMPUTING THE Z SCORES OF THE 2003 RETURN OF SMALL CAP HIGH-RISK MUTUAL FUNDS

The 121 mutual funds that are part of the “Using Statistics” scenario are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the Z scores of the 2003 return for the small cap mutual funds with high risk (data file MUTUALFUNDS2004).

SOLUTION Table 3.4 illustrates the Z scores of the 2003 return for the small cap mutual funds with high risk. The largest Z score is 1.42 for a percentage return of 66.5. The lowest Z score is −1.35 for a percentage return of 37.3. As a general rule, a Z score is considered an outlier if it is less than −3.0 or greater than +3.0. None of the percentage returns met that criterion to be considered outliers.

TABLE 3.4 Z Scores of the 2003 Return for the Small Cap Mutual Funds with High Risk

Return 2003           Z Score
   44.5                −0.67
   39.2                −1.17
   62.4                 1.03
   59.3                 0.74
   56.6                 0.48
   53.8                 0.21
   37.3                −1.35
   44.2                −0.69
   66.5                 1.42
Mean                   51.53
Standard deviation     10.55
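The Z-score computation in Equation (3.12) and the ±3.0 outlier guideline can be sketched in Python as follows. The helper names `z_scores` and `outliers` are hypothetical, chosen only for this illustration, and the sample standard deviation (dividing by n − 1) is assumed, as in the text.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (X - X-bar) / S for every value, using the sample S."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]


def outliers(values, cutoff=3.0):
    """Return values whose Z score is below -cutoff or above +cutoff."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > cutoff]


# The 10 get-ready times (in minutes).
get_ready_times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

for x, z in zip(get_ready_times, z_scores(get_ready_times)):
    print(f"{x:>3}  {z:6.2f}")
print("Outliers:", outliers(get_ready_times))
```

Applying the same two functions to the nine small cap high-risk returns should give the kind of listing shown for Example 3.11.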
Shape

A third important property that describes a set of numerical data is shape. Shape is the pattern of the distribution of data values throughout the entire range of all the values. A distribution will either be symmetrical, when low and high values balance each other out, or skewed, not symmetrical and showing an imbalance of low values or high values.

Shape influences the relationship of the mean to the median in the following ways:

• Mean < median: negative or left-skewed
• Mean = median: symmetric or zero skewness
• Mean > median: positive or right-skewed

Figure 3.1 depicts three data sets, each with a different shape.

FIGURE 3.1 A Comparison of Three Data Sets Differing in Shape (Panel A: negative, or left-skewed; Panel B: symmetrical; Panel C: positive, or right-skewed)

The data in panel A are negative, or left-skewed. In this panel, most of the values are in the upper portion of the distribution. There is a long tail and distortion to the left that is caused by some extremely small values. These extremely small values pull the mean downward so that the mean is less than the median.

The data in panel B are symmetrical. Each half of the curve is a mirror image of the other half of the curve. The low and high values on the scale balance, and the mean equals the median.

The data in panel C are positive, or right-skewed. In this panel, most of the values are in the lower portion of the distribution. There is a long tail on the right of the distribution and a distortion to the right that is caused by some extremely large values. These extremely large values pull the mean upward so that the mean is greater than the median.

Visual Explorations: Exploring Descriptive Statistics

Use the Visual Explorations Descriptive Statistics procedure to see the effect of changing data values on measures of central tendency, variation, and shape. Open the Visual Explorations.xla macro workbook and select VisualExplorations Descriptive Statistics from the Microsoft Excel menu bar. Read the instructions in the popup box and click OK to examine a dot scale diagram for the sample of 10 get-ready times used throughout this chapter.

Experiment by entering an extreme value such as 10 minutes into one of the tinted cells of column A. Which measures are affected by this change? Which ones are not? You can flip between the “before” and “after” diagrams by repeatedly pressing Ctrl-Z (undo) followed by Ctrl-Y (redo) to help see the changes the extreme value caused in the diagram.
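The extreme-value experiment described above can also be tried outside Excel. The following is a rough Python sketch, not the Visual Explorations procedure itself: the `summarize` helper is hypothetical, and replacing the first day's time with 10 minutes is an arbitrary choice made only for illustration.

```python
from statistics import mean, median, stdev


def summarize(values):
    """Collect a few measures of central tendency and variation."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "stdev": stdev(values),
    }


before = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]  # get-ready times (minutes)
after = before.copy()
after[0] = 10  # assumed edit: day 1 becomes an extreme value of 10 minutes

stats_before = summarize(before)
stats_after = summarize(after)
for name in stats_before:
    print(f"{name:>6}: before = {stats_before[name]:6.2f}   after = {stats_after[name]:6.2f}")
```

Comparing the two columns shows which measures react to a single extreme value and which do not. A mean that falls below the median after the change is consistent with the left-skewed pattern described for panel A.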
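Microsoft Excel Descriptive Statistics Output

The Microsoft Excel Data Analysis ToolPak generates the mean, median, mode, standard deviation, variance, range, minimum, maximum, and count (sample size) on a single worksheet, all of which have been discussed in this section. In addition, Excel computes the standard error, along with statistics for kurtosis and skewness. The standard error is the standard deviation divided by the square root of the sample size and will be discussed in Chapter 7. Skewness measures the lack of symmetry in the data and is based on a statistic that is a function of the cubed differences around the mean. A skewness value of zero indicates a symmetric distribution. Kurtosis measures the relative concentration of values in the center of the distribution as compared with the tails and is based on the differences around the mean raised to the fourth power. This measure is not discussed in this text (see reference 2).

From Figure 3.2, the Excel descriptive statistics output for the 2003 return of the funds based on risk level, there appear to be slight differences in the 2003 percentage return for the three risk levels. High-risk funds had a slightly higher mean and median than did low-risk and average-risk funds. There was very little difference in the standard deviations of the three groups.

FIGURE 3.2 Microsoft Excel Descriptive Statistics of the 2003 Returns Based on Risk Level

Minitab Descriptive Statistics Output

For descriptive statistics, Minitab computes the sample size (labeled as N), the mean, median, standard deviation (labeled StDev), coefficient of variation (labeled CoefVar), minimum, maximum, range, first and third quartiles, and interquartile range (labeled IQR), all of which have been discussed in this section. From Figure 3.3, the Minitab descriptive statistics output for the 2003 return of the funds based on risk level, there appear to be slight differences in the 2003 percentage return for the three risk levels. High-risk funds had a slightly higher mean, median, and quartiles than did low-risk and average-risk funds. There was very little difference in the standard deviations or interquartile ranges of the three groups.

FIGURE 3.3 Minitab Descriptive Statistics of the 2003 Returns Based on Risk Level

PROBLEMS FOR SECTION 3.1

Learning the Basics

3.1 The following is a set of data from a sample of n = 5:
7 4 9 8 2
a. Compute the mean, median, and mode.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Compute the Z scores. Are there any outliers?
d. Describe the shape of the data set.

3.2 The following is a set of data from a sample of n = 6:
7 4 9 7 3 12
a. Compute the mean, median, and mode.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Compute the Z scores. Are there any outliers?
d. Describe the shape of the data set.

3.3 The following set of data is from a sample of n = 7:
12 7 4 9 0 7 3
a. Compute the mean, median, and mode.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Compute the Z scores. Are there any outliers?
d. Describe the shape of the data set.

3.4 The following is a set of data from a sample of n = 5:
7 −5 −8 7 9
a. Compute the mean, median, and mode.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Compute the Z scores. Are there any outliers?
d. Describe the shape of the data set.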
High-risk funds had a slightly higher mean and median than did low-risk and average-risk funds. 3. variance. coefficient of variation (labeled CoefVar). Minitab computes the sample size (labeled as N). Compute the Z scores. first and third quartiles. minimum. standard deviation. median. b.1 Learning the Basics PH Grade ASSIST 3.2 The following is a set of data from a sample of n = 6: 7 4 9 7 3 12 a. median. and coefficient of variation.3. Compute the range. Are there any outliers? d. standard deviation. There was very little difference in the standard deviations or interquartile ranges of the three groups. and mode. variance. c. High-risk funds had a slightly higher mean. and mode. Describe the shape of the data set. variance. and quartiles than did low-risk and average-risk funds.1 The following is a set of data from a sample of n = 5: 7 4 9 8 2 a. Compute the range. range. interquartile range. Compute the mean. standard deviation. Compute the Z scores. median. variance. interquartile range. From Figure 3. FIGURE 3. and coefficient of variation. and mode. standard deviation (labeled StDev). b. median. PH Grade ASSIST 3. and coefficient of variation. b. and interquartile range (labeled IQR). Describe the shape of the data set. the Minitab descriptive statistics output for the 2003 return of the funds based on risk level. b. there appears to be slight differences in the 2003 percentage return for the three risk levels. Minitab Descriptive Statistics Output For descriptive statistics. Describe the shape of the data set. median. many public universities in the United States raised tuition and fees due to a decrease in state subsidies (Mary Beth Marklein.589 593 1. Are the data skewed? If so.425 922 308 a.3. and Z scores. Variation. 3. Adapted with permission from Consumer Reports. and the most popular meal plan between the 2001–2002 academic year and the 2002–2003 academic year for a sample of 10 public universities. “Public Universities Raise Tuition. 46. 
are as follows: PH Grade ASSIST Grade X Grade Y 568 570 575 578 584 573 574 575 577 578 a.S. COLLEGECOST University Change in Cost ($) University of California. interquartile range. Are the data skewed? If so. Orono University of Mississippi.720 708 1. Columbus University of South Carolina. standard deviation.. each of which is expected to be 575 millimeters.10 The following data COFFEEDRINK represent the calories and fat (in grams) of 16-ounce iced coffee drinks at Dunkin’ Donuts and Starbucks. and the results representing the inner diameters of the tires. range. (Note: A rate of return of 10% is recorded as 0.” Copyright © 2004 by Consumers Union of U. and coefficient of variation. Based on the results of (a) through (c).0 Source: Extracted from “Coffee as Candy at Dunkin’ Donuts and Starbucks. and standard deviation. Inc. What would be the effect on your answers in (a) and (b) if the last value for grade Y were 588 instead of 578? Explain. median. range.7% from December 2002. For the burgers and chicken items separately: a. . Describe the shape of the distribution of the price of homes sold.7 The following data represent the total fat for burgers and chicken items from a sample of fastfood chains. A sample of five tires of each grade was selected. 9. standard deviation. Oxford University of New Hampshire. Compute the mean. Durham Ohio State University. June 2004.0 530 19. Compute the mean. Which grade of tire is providing better quality? Explain. what conclusions can you reach concerning the change in costs between the 2001–2002 and 2002–2003 academic years? 3. Columbia Utah State University. FASTFOOD Burgers 19 31 34 35 39 39 43 Chicken 7 9 15 16 16 18 22 25 27 33 39 Source: Extracted from “Quick Bites. coefficient of variation. c.20 can be solved manually or by using Microsoft Excel. The following represents the change in the cost of tuition. Manhattan University of Maine.0 260 350 3.5 22. Hagerty. an increase of 6. Compute the variance.S. 
1A–2A).0 350 20.6 The operations manager of a plant that manufactures tires wants to compare the actual inner diameter of two grades of tires. Urbana–Champaign Kansas State University. first quartile. Berkeley University of Georgia. Logan 1. b. b.0 420 16. Minitab. NY 10703–1057. Yonkers. January 27.” The Wall Street Journal. SELF Test 3. and Shape 3. Based on the results of (a) through (c). “Housing Prices Continue to Rise. Yonkers..30. a.9 In the 2002–2003 academic year. 2004. median. 2002. Athens University of Illinois. For the full year. Inc. Fees—and Ire.5 Suppose that the rate of return for a particular stock during the past two years was 10% and 30%..” Copyright © 2001 by Consumers Union of U. c. D1).223 869 423 1. how? d.. Product Dunkin’ Donuts Iced Mocha Swirl latte (whole milk) Starbucks Coffee Frappuccino blended coffee Dunkin’ Donuts Coffee Coolatta (cream) Starbucks Iced Coffee Mocha Expresso (whole milk and whipped cream) Starbucks Mocha Frappuccino blended coffee (whipped cream) Starbucks Chocolate Brownie Frappuccino blended coffee (whipped cream) Starbucks Chocolate Frappuccino Blended Crème (whipped cream) Calories Fat 240 8. Compute the variance. For each of the two grades of tires. a shared dormitory room. March 2001.1 million homes (James R. median.” USA Today. c.) PH Grade ASSIST Applying the Concepts Problems 3. b. first quartile. sales hit a record 6.0 510 22. what conclusions can you reach concerning the differences in total fat of burgers and chicken items? 3.1: Measures of Central Tendency.8 The median price of a home in December 2003 rose to $173. and third quartile. NY 10703–1057. or SPSS. compute the mean. and third quartile. Why do you think the article reports the median home price and not the mean home price? 91 3.200.6–3. interquartile range. Compute the geometric mean rate of return. b. August 8.10 and a rate of return of 30% is recorded as 0. Adapted with permission from Consumer Reports. how? d. ranked from smallest to largest. 
Are the data skewed? If so. A. HOTEL-CAR City San Francisco Los Angeles Seattle Phoenix Denver Dallas Houston Minneapolis Chicago St. what conclusions can you reach concerning the calories and fat in iced coffee drinks at Dunkin’ Donuts and Starbucks? 3.0 45. median. and Z scores. For each variable (hotel cost and rental car cost).5 15. 3. interquartile range. how? d. coefficient of variation. Looking at the distribution of times to failure.0 47. 30(Fall 1999). Based on the results of (a) through (c). Instead of starting from scratch when writing and developing new custom software systems. Comment on the difference in the results. Are the data skewed? If so. median. median. c. coefficient of variation.000.” Decision Sciences. first quartile.342 instead of 342. b. and standard deviation. D. what conclusions can you reach concerning the price of 3-megapixel digital cameras at a camera specialty store during 2003? 3. The following data are given as a percentage of the total code written for a software system that is part of the reuse database.92 CHAPTER THREE Numerical Descriptive Measures For each variable (calories and fat). The waiting time in minutes (defined as the time the customer enters the line to . “A Performance Measure for Software Reuse Projects. Rothenberger. standard deviation. and Z scores. variance.5 75. Compute the mean. a. Hotel Cars 205 179 185 210 128 145 177 117 221 159 205 128 165 180 198 158 132 283 269 204 47 41 49 38 32 48 49 41 56 41 50 32 34 46 41 40 39 67 69 40 Source: Extracted from The Wall Street Journal. how? d.13 A software development and consulting firm located in the Phoenix metropolitan area develops software for supply chain management systems using systematic software reuse. c.C. c. variance. October 10. The numbers of hours they were used until failure were: BATTERIES 342 426 317 545 264 451 1. J. Compute the variance. Louis New Orleans Detroit Cleveland Atlanta Orlando Miami Pittsburgh Boston New York Washington. 
cities during a week in October 2003. and Z scores. standard deviation.0 25. S. Are there any outliers? Explain. what conclusions can you reach concerning the daily cost of a hotel and rental car? 3.) d. Calculate the range. Compute the variance. b. and third quartile. standard deviation. Based on the results of (a) through (c). 3. median. What would you advise if the manufacturer wanted to be able to say in advertisements that these batteries “should last 400 hours”? (Note: There is no right answer to this question.0 Source: M. Compute the mean. Interpret the summary measures calculated in (a) and (b). Compute the mean. Eight analysts at the firm were asked to estimate the reuse rate when developing a new software system. REUSE 50 62. the firm uses a database of reusable components totaling more than 2. Are the data skewed? If so. Compute the variance. lunch period. and third quartile. coefficient of variation. a. a. 2003. Suppose that the first value was 1. first quartile. W4. Compute the mean. first quartile.049 631 512 266 492 562 298 a.11 The following data represent the daily hotel cost and rental car cost for 20 U.12 The cost of 14 models of 3-megapixel digital cameras at a camera specialty store during 2003 was as follows. and mode.5 37. how? d. which measures of location do you think are most appropriate and which least appropriate to use for these data? Why? b. range. b. and standard deviation. c. and mode. Compute the range. and K. the point is to consider how to make such a statement precise. Dooley. range. interquartile range.15 A bank branch located in a commercial district of a city has developed an improved process for serving customers during the noon to 1:00 P. median. Based on the results of (a) through (c). Repeat (a) through (c). using this value. Are there any outliers? Explain. b. CAMERA 340 450 450 280 220 340 290 370 400 310 340 430 270 380 a. c.M. 
range.14 A manufacturer of flashlight batteries took a sample of 13 batteries from a day’s production and used them continuously until they were drained.000 lines of code collected from 10 years of continuous reuse effort. 1131–1153. and third quartile. interquartile range. Are there any outliers? Explain. Compute the mean. b. she asks the branch manager how long she can expect to wait. A random sample of 15 customers is selected.8 24.55 3.5 −3. and the money market deposit.19 The time period from 2000 to 2003 saw a great deal of volatility in the value of investments. the Russell 2000 Index. 2004.19 (b) and 3.01 −5.19 (b).02 1.) SELF Test 3. c. the 30-month certificate of deposit.44 −6. 3.” On the basis of the results of (a) and (b).08 6. b. and the Wilshire 5000 Index from 2000 to 2003.47 a. and third quartile. Compute the variance.54 3. Compare the results of (b) to those of problems 3. a.73 3.66 5. The data in the following table STOCKRETURN represent the total rate of return of the Dow Jones Industrial Index.97 −10. The data in the following table METALRETURN represent the total rate of return for platinum. interquartile range.46 6. As a customer walks into the branch office during the lunch hour. range.20 (b).58 −1.60 5.03 −3. range.02 29. standard deviation.64 0.49 6.64 4.” The Wall Street Journal. gold.17 9. interquartile range. Variation. The branch manager replies. and the Wilshire 5000 Index. a.10 45. February 2. the 30-month certificate of deposit.13 4.17 China is the fastest-growing market for passenger car sales and fourth biggest after the United States.50 6.3 19. 2004. the Standard & Poor’s 500. The waiting time in minutes (defined as the time the customer enters the line to the time he or she reaches the teller window) of all customers during these hours is recorded over a period of one week.97 5. As a customer walks into the branch office during the lunch hour. Calculate the geometric rate of return for the Dow Jones Industrial Index. 
“A Fear Amid China’s Car Boom.20 26. January 2.2 24. Are the data skewed? If so.18 (b) and 3. first quartile.5 1. 3.76 2. b.35 10. The data in the following table BANKRETURN represent the total rate of return of the one-year certificate of deposit.10 0.68 5. What conclusions can you reach concerning the geometric rates of return of the three metals? c. the Russell 2000 Index.79 8. Compute the mean.19 3.” On the basis of the results of (a) and (b).61 1. Calculate the geometric rate of return for platinum. gold. January 2. evaluate the accuracy of this statement. .90 −9. 3.40 −21. Year Platinum Gold Silver 2003 2002 2001 2000 34.0 5. is also concerned with the noon to 1 P.89 Source: Extracted from The Wall Street Journal. and Germany. and Z scores. Year DJIA SP500 Russell2000 Wilshire5000 2003 2002 2001 2000 25. Year One Year 30 Month Money Market 2003 2002 2001 2000 1. Compute the mean. and the results are as follows: BANK1 4. b.40 −22.77 2. Are there any outliers? Explain. how? d. median. he asks the branch manager how long he can expect to wait. Calculate the geometric rate of return for the one-year certificate of deposit. 2004.3 −23. Compute the variance.1: Measures of Central Tendency.20 The time period from 2000 to 2003 saw a great deal of volatility in the value of metals. 3.73 2.93 3. lunch hour.0 −5.02 5.18 (b) and 3. and the results are as follows: BANK2 9. Passenger car sales increased 61% in 2002 and 55% in 2003 (Peter Wonacott. Compare the results of (b) to those of problems 3.02 5.5 24.9 Source: Extracted from The Wall Street Journal.20 4. how? d.98 3. and Shape when he or she reaches the teller window) of all customers during this hour is recorded over a period of one week.10 −11.09 Source: Extracted from The Wall Street Journal.M.90 −10.20 1. and third quartile. A17). evaluate the accuracy of this statement. The branch manager replies. (Hint: Denote an increase of 61% as R1 = 0. January 2.2 1. 
Compute the geometric mean rate of increase.21 5.79 a.46 1.40 −20.5 −21. A random sample of 15 customers is selected.82 8. 2004. “Almost certainly less than five minutes. the Standard & Poor’s 500.38 5. What conclusions can you reach concerning the geometric rates of return of the four stock indexes? c.01 8. Japan.34 3. a.20 (b). and the money market deposit from 2000 to 2003. and coefficient of variation.18 The time period from 2000 to 2003 saw a great deal of volatility in the value of stocks. Are the data skewed? If so. coefficient of variation.12 6.61. and silver.74 3. median. standard deviation.30 −15. first quartile. c.16 Suppose that another branch. What conclusions can you reach concerning the geometric rates of return of the three deposits? c. located in a residential area.91 5.90 8. Are there any outliers? Explain. “Almost certainly less than five minutes. Compare the results of (b) to those of problems 3. and silver from 2000 to 2003. b. 1 presented various statistics that described the properties of central tendency. first review Table 3. N µ = ∑ Xi i =1 N = 3. The Population Mean The population mean is represented by the symbol µ.9 37. C2.3 12. .13). LARGEST BONDS.94 CHAPTER THREE Numerical Descriptive Measures 3.13) µ = population mean Xi = ith value of the variable X N ∑ X i = summation of all Xi values in the population i =1 To compute the mean return for the population of bond funds given in Table 3. summary measures for a population.9 Source: Extracted from The Wall Street Journal.8 6. N µ = where ∑ Xi i =1 N (3.5 that contains the five biggest bond funds (in terms of total assets) as of March 1. and shape for a sample. use Equation (3. you will learn about three descriptive population parameters. If your data set represents numerical measurements for an entire population. To help illustrate these parameters.5 + 7. and population standard deviation.3 + 12.5 7.0 + 7. 2004. 2004. 
3.2 NUMERICAL DESCRIPTIVE MEASURES FOR A POPULATION

Section 3.1 presented various statistics that described the properties of central tendency, variation, and shape for a sample. If your data set represents numerical measurements for an entire population, you need to calculate and interpret parameters, summary measures for a population. In this section, you will learn about three descriptive population parameters: the population mean, population variance, and population standard deviation.

To help illustrate these parameters, first review Table 3.5, which contains the five biggest bond funds (in terms of total assets) as of March 1, 2004. The 52-week return for each of these funds is also listed.

TABLE 3.5 2003 Return for the Population Consisting of the Five Largest Bond Funds (data file LARGEST BONDS)

Bond Fund                       52-Week Return (in %)
Vanguard GNMA                          3.8
Vanguard Total Bond Index              6.5
Pimco Total Return Admin               7.0
Pimco Total Return Instl               7.3
America Bond Fund                     12.9

Source: Extracted from The Wall Street Journal, March 25, 2004, C2.

The Population Mean

The population mean is represented by the symbol µ, the Greek lowercase letter mu. Equation (3.13) defines the population mean.

POPULATION MEAN
The population mean is the sum of the values in the population divided by the population size N.

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (3.13)$$

where
µ = population mean
$X_i$ = ith value of the variable X
$\sum_{i=1}^{N} X_i$ = summation of all $X_i$ values in the population

To compute the mean return for the population of bond funds given in Table 3.5, use Equation (3.13).

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{3.8 + 6.5 + 7.0 + 7.3 + 12.9}{5} = \frac{37.5}{5} = 7.5$$

Thus, the mean 2003 return for these bond funds is 7.5%.

The Population Variance and Standard Deviation

The population variance and the population standard deviation measure variation in a population. Like the related sample statistics, the population standard deviation is the square root of the population variance. The symbol σ², the Greek lowercase letter sigma squared, represents the population variance, and the symbol σ, the Greek lowercase letter sigma, represents the population standard deviation. Equations (3.14) and (3.15) define these parameters. The denominators for the right-side terms in these equations use N and not the (n − 1) term that is used in the equations for the sample variance and standard deviation [see Equations (3.9) and (3.10)].

POPULATION VARIANCE
The population variance is the sum of the squared differences around the population mean divided by the population size N.

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \qquad (3.14)$$

where
µ = population mean
$X_i$ = ith value of the variable X
$\sum_{i=1}^{N} (X_i - \mu)^2$ = summation of all the squared differences between the $X_i$ values and µ

POPULATION STANDARD DEVIATION

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \qquad (3.15)$$

To compute the population variance for the data of Table 3.5, you use Equation (3.14).

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{(3.8 - 7.5)^2 + (6.5 - 7.5)^2 + (7.0 - 7.5)^2 + (7.3 - 7.5)^2 + (12.9 - 7.5)^2}{5}$$

$$= \frac{13.69 + 1.00 + 0.25 + 0.04 + 29.16}{5} = \frac{44.14}{5} = 8.828$$

Thus, the variance of the returns is 8.828 squared percentage return. The squared units make the variance hard to interpret. You should use the standard deviation, which uses the original units of the data (percentage return). From Equation (3.15),

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} = \sqrt{8.828} = 2.97$$

Therefore, the typical 2003 return differs from the mean of 7.5 by approximately 2.97. This large amount of variation suggests that these large bond funds produce results that differ greatly.
The Empirical Rule

In most data sets, a large portion of the values tend to cluster somewhat near the median. In right-skewed data sets, this clustering occurs to the left of the mean, that is, at a value less than the mean. In left-skewed data sets, the values tend to cluster to the right of the mean, that is, at a value greater than the mean. In symmetrical data sets, where the median and mean are the same, the values often tend to cluster around the median and mean, producing a bell-shaped distribution. You can use the empirical rule to examine the variability in bell-shaped distributions:

• Approximately 68% of the values are within a distance of ±1 standard deviation from the mean.
• Approximately 95% of the values are within a distance of ±2 standard deviations from the mean.
• Approximately 99.7% are within a distance of ±3 standard deviations from the mean.

The empirical rule helps you measure how the values distribute above and below the mean. This can help you to identify outliers when analyzing a set of numerical data. The empirical rule implies that for bell-shaped distributions only about one out of 20 values will be beyond two standard deviations from the mean in either direction. As a general rule, you can consider values not found in the interval µ ± 2σ as potential outliers. The rule also implies that only about three in 1,000 will be beyond three standard deviations from the mean. Therefore, values not found in the interval µ ± 3σ are almost always considered outliers. For heavily skewed data sets, or those not appearing bell-shaped for any other reason, the Chebyshev rule, discussed later in this section, should be applied instead of the empirical rule.

EXAMPLE 3.12 USING THE EMPIRICAL RULE

A population of 12-ounce cans of cola is known to have a mean fill-weight of 12.06 ounces and a standard deviation of 0.02. The population is also known to be bell-shaped. Describe the distribution of fill-weights. Is it very likely that a can will contain less than 12 ounces of cola?

SOLUTION

$$\mu \pm \sigma = 12.06 \pm 0.02 = (12.04,\ 12.08)$$
$$\mu \pm 2\sigma = 12.06 \pm 2(0.02) = (12.02,\ 12.10)$$
$$\mu \pm 3\sigma = 12.06 \pm 3(0.02) = (12.00,\ 12.12)$$

Using the empirical rule, approximately 68% of the cans will contain between 12.04 and 12.08 ounces, approximately 95% will contain between 12.02 and 12.10 ounces, and approximately 99.7% will contain between 12.00 and 12.12 ounces. Therefore, it is highly unlikely that a can will contain less than 12 ounces.
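The intervals in Example 3.12 follow mechanically from µ and σ, as the short Python sketch below illustrates. The function `sigma_intervals` and the `EMPIRICAL_COVERAGE` table are names assumed for this illustration, and the coverage percentages apply only when the population is bell-shaped.

```python
def sigma_intervals(mu, sigma, ks=(1, 2, 3)):
    """Return {k: (mu - k*sigma, mu + k*sigma)} for each k."""
    return {k: (mu - k * sigma, mu + k * sigma) for k in ks}


# Approximate coverage under the empirical rule (bell-shaped data only).
EMPIRICAL_COVERAGE = {1: 68.0, 2: 95.0, 3: 99.7}

# Fill-weights of cola cans: mean 12.06 ounces, standard deviation 0.02.
intervals = sigma_intervals(mu=12.06, sigma=0.02)
for k, (low, high) in intervals.items():
    print(f"mu ± {k} sigma: ({low:.2f}, {high:.2f})  about {EMPIRICAL_COVERAGE[k]}% of cans")
```

Because 12 ounces sits at the lower end of the µ ± 3σ interval, the listing makes it easy to see why a fill below 12 ounces is considered highly unlikely for a bell-shaped population.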
The Chebyshev Rule

The Chebyshev rule (reference 1) states that for any data set, regardless of shape, the percentage of values that are found within distances of k standard deviations from the mean must be at least

$$\left(1 - \frac{1}{k^2}\right) \times 100\%$$

You can use this rule for any value of k greater than 1. Consider k = 2. The Chebyshev rule states that at least [1 − (1/2)²] × 100% = 75% of the values must be found within ±2 standard deviations of the mean.

The Chebyshev rule is very general and applies to any type of distribution. The rule indicates at least what percentage of the values fall within a given distance from the mean. However, if the data set is approximately bell-shaped, the empirical rule will more accurately reflect the greater concentration of data close to the mean. Table 3.6 compares the Chebyshev and empirical rules.

TABLE 3.6 How Data Vary Around the Mean

                          % of Values Found in Intervals Around the Mean
Interval              Chebyshev (any distribution)    Empirical Rule (bell-shaped distribution)
(µ − σ, µ + σ)        At least 0%                     Approximately 68%
(µ − 2σ, µ + 2σ)      At least 75%                    Approximately 95%
(µ − 3σ, µ + 3σ)      At least 88.89%                 Approximately 99.7%

EXAMPLE 3.13 USING THE CHEBYSHEV RULE

As in Example 3.12, a population of 12-ounce cans of cola is known to have a mean fill-weight of 12.06 ounces and a standard deviation of 0.02. However, the shape of the population is unknown and you cannot assume that it is bell-shaped. Describe the distribution of fill-weights. Is it very likely that a can will contain less than 12 ounces of cola?

SOLUTION

$$\mu \pm \sigma = 12.06 \pm 0.02 = (12.04,\ 12.08)$$
$$\mu \pm 2\sigma = 12.06 \pm 2(0.02) = (12.02,\ 12.10)$$
$$\mu \pm 3\sigma = 12.06 \pm 3(0.02) = (12.00,\ 12.12)$$

Because the distribution may be skewed, you cannot use the empirical rule. Using the Chebyshev rule, you cannot say anything about the percentage of cans containing between 12.04 and 12.08 ounces. You can state that at least 75% of the cans will contain between 12.02 and 12.10 ounces, and at least 88.89% will contain between 12.00 and 12.12 ounces. Therefore, between 0 and 11.11% of the cans contain less than 12 ounces.

You can use these two rules for understanding how data are distributed around the mean when you have sample data. In each case, use the value you calculated for $\bar{X}$ in place of µ and the value you calculated for S in place of σ. The results you compute using the sample statistics are approximations since you used sample statistics ($\bar{X}$, S) and not population parameters (µ, σ).
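The Chebyshev bound is simple enough to compute directly. The sketch below, with the hypothetical helper name `chebyshev_min_percent`, reproduces the Chebyshev column of Table 3.6 and the 11.11% figure from Example 3.13; k = 1 is allowed only so that the trivial "at least 0%" entry of the table can be shown.

```python
def chebyshev_min_percent(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the bound is only meaningful for k >= 1")
    return (1 - 1 / k**2) * 100


mu, sigma = 12.06, 0.02  # cola fill-weights from Example 3.13
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    print(f"({low:.2f}, {high:.2f}): at least {chebyshev_min_percent(k):.2f}%")

# At most this percentage lies outside mu ± 3 sigma, so at most this
# percentage of cans can contain less than 12.00 ounces.
max_outside_3_sigma = 100 - chebyshev_min_percent(3)
print(f"At most {max_outside_3_sigma:.2f}% of cans fall outside the 3-sigma interval")
```

Unlike the empirical-rule percentages, these are guaranteed lower bounds for any distribution, which is why they are much less sharp.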
PROBLEMS FOR SECTION 3.2

Learning the Basics

3.21 The following is a set of data for a population with N = 10:
7 5 11 8 3 6 2 1 9 8
a. Compute the population mean.
b. Compute the population standard deviation.

3.22 The following is a set of data for a population with N = 10:
7 5 6 6 6 4 8 6 9 3
a. Compute the population mean.
b. Compute the population standard deviation.

Applying the Concepts

3.23 The data in the file TAX represent the quarterly sales tax receipts (in thousands of dollars) submitted to the comptroller of the Village of Fair Lake for the period ending March 2004 by all 50 business establishments in that locale.
a. Compute the mean, variance, and standard deviation for this population.
b. What proportion of these businesses have quarterly sales tax receipts within ±1, ±2, or ±3 standard deviations of the mean?
c. Compare and contrast your findings with what would be expected on the basis of the empirical rule. Are you surprised at the results in (b)?

3.24 Consider a population of 1,024 mutual funds that primarily invested in large companies. You determined that µ, the mean one-year total percentage return achieved by all the funds, is 8.20 and that σ, the standard deviation, is 2.75. In addition, suppose you determined that the range in the one-year total returns is from −2.0 to 17.1 and that the quartiles are, respectively, 5.5 (Q1) and 10.5 (Q3). According to the empirical rule, what percentage of these funds are expected to be
a. within ±1 standard deviation of the mean?
b. within ±2 standard deviations of the mean?
c. According to the Chebyshev rule, what percentage of these funds are expected to be within ±1, ±2, or ±3 standard deviations of the mean?
d. According to the Chebyshev rule, at least 93.75% of these funds are expected to have one-year total returns between what two amounts?

3.25 The table in the file ASSETS represents the assets in billions of dollars of the five largest bond funds: Vanguard GNMA, Vanguard Total Bond Mkt. Index, Bond Fund of America A, Franklin Calif. Tax-Free Inc. A, and Vanguard Short-Term Corp.
a. Compute the mean for this population of the five largest bond funds. Interpret this parameter.
b. Compute the variance and standard deviation for this population. Interpret these parameters.
c. Is there a lot of variability in the assets of the bond funds?

3.26 The data in the file ENERGY contain the per capita energy consumption in kilowatt hours for each of the 50 states and the District of Columbia during 1999.
a. Compute the mean, variance, and standard deviation for the population.
b. What proportion of these states has average per capita energy consumption within ±1 standard deviation of the mean, within ±2 standard deviations of the mean, and within ±3 standard deviations of the mean?
c. Compare and contrast your findings versus what would be expected based on the empirical rule. Are you surprised at the results in (b)?
d. Do (a) through (c) with the District of Columbia removed. How have the results changed?

3.27 The data in the file DOWRETURN give the 10-year annualized return (1994–2003) for the 30 companies in the Dow Jones Industrials.
a. Compute the mean for this population. Interpret this number.
b. Compute the variance and standard deviation for this population. Interpret the standard deviation.
c. Use the empirical rule or the Chebyshev rule, whichever is appropriate, to further explain the variation in this data set.
d. Using the results in (c), are there any outliers? Explain.
3.3 COMPUTING NUMERICAL DESCRIPTIVE MEASURES FROM A FREQUENCY DISTRIBUTION

Sometimes you have only a frequency distribution, not the raw data. When this occurs, you can compute approximations to the mean and the standard deviation.

When you have data from a sample that has been summarized into a frequency distribution, you can compute an approximation of the mean by assuming that all values within each class interval are located at the midpoint of the class.

APPROXIMATING THE MEAN FROM A FREQUENCY DISTRIBUTION

\[ \bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} \tag{3.16} \]

where
X̄ = sample mean
n = number of values or sample size
c = number of classes in the frequency distribution
m_j = midpoint of the jth class
f_j = number of values in the jth class

To calculate the standard deviation from a frequency distribution, you assume that all values within each class interval are located at the midpoint of the class.

APPROXIMATING THE STANDARD DEVIATION FROM A FREQUENCY DISTRIBUTION

\[ S = \sqrt{\frac{\sum_{j=1}^{c} (m_j - \bar{X})^{2} f_j}{n - 1}} \tag{3.17} \]

Example 3.14 illustrates the computation of the mean and the standard deviation from a frequency distribution.

EXAMPLE 3.14 APPROXIMATING THE MEAN AND STANDARD DEVIATION FROM A FREQUENCY DISTRIBUTION

Consider the frequency distribution of the 2003 return of growth funds (Table 3.7). Compute the mean and standard deviation.

TABLE 3.7 Frequency Distribution of the 2003 Return for Growth Mutual Funds

Annual Percentage 2003 Return    Frequency
10 but less than 20                  2
20 but less than 30                  9
30 but less than 40                 13
40 but less than 50                 15
50 but less than 60                  5
60 but less than 70                  5
Total                               49
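SOLUTION The computations that you need to calculate the approximations of the mean and standard deviation of the 2003 return for growth mutual funds are summarized in Table 3.8.

TABLE 3.8 Computations Needed to Calculate the Approximations of the Mean and Standard Deviation of the 2003 Return for Growth Mutual Funds

Percentage Return      Number of     Midpoint   m_j f_j   (m_j − X̄)   (m_j − X̄)²   (m_j − X̄)² f_j
                       Funds (f_j)   (m_j)
10 but less than 20         2           15         30      −25.51      650.7601     1,301.5202
20 but less than 30         9           25        225      −15.51      240.5601     2,165.0409
30 but less than 40        13           35        455       −5.51       30.3601       394.6813
40 but less than 50        15           45        675        4.49       20.1601       302.4015
50 but less than 60         5           55        275       14.49      209.9601     1,049.8005
60 but less than 70         5           65        325       24.49      599.7601     2,998.8005
Total                      49                   1,985                               8,212.2449

Using Equations (3.16) and (3.17) on page 99,

\[ \bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} = \frac{1{,}985}{49} = 40.51 \]

and

\[ S = \sqrt{\frac{\sum_{j=1}^{c} (m_j - \bar{X})^{2} f_j}{n - 1}} = \sqrt{\frac{8{,}212.2449}{49 - 1}} = \sqrt{171.08843} = 13.08 \]

PROBLEMS FOR SECTION 3.3

Learning the Basics

3.28 Given the following frequency distribution for n = 100:

Class Intervals    Frequency
0 to under 10         10
10 to under 20        20
20 to under 30        40
30 to under 40        20
40 to under 50        10
Total                100

Approximate
a. the mean.
b. the standard deviation.

3.29 Given the following frequency distribution for n = 100:

Class Intervals    Frequency
0 to under 10         40
10 to under 20        25
20 to under 30        15
30 to under 40        15
40 to under 50         5
Total                100

Approximate
a. the mean.
b. the standard deviation.

Applying the Concepts

3.30 A wholesale appliance distributing firm wished to study its accounts receivable for two successive months. Two independent samples of 50 accounts were selected for each of the two months. The results are summarized in the following table:

Frequency Distributions for Accounts Receivable

Amount                        March Frequency   April Frequency
$0 to under $2,000                   6                10
$2,000 to under $4,000              13                14
$4,000 to under $6,000              17                13
$6,000 to under $8,000              10                10
$8,000 to under $10,000              4                 0
$10,000 to under $12,000             0                 3
Total                               50                50

For each month, approximate the
a. mean.
b. standard deviation.
c. On the basis of the results of (a) and (b), do you think the mean and the standard deviation of the accounts receivable have changed substantially from March to April? Explain.

3.31 The following table contains the cumulative frequency distributions and cumulative percentage distributions of braking distance (in feet) at 80 miles per hour for a sample of 25 U.S.-manufactured automobile models and for a sample of 72 foreign-made automobile models in a recent year:

                      U.S.-Made Automobile Models     Foreign-Made Automobile Models
                      "Less Than" Indicated Values    "Less Than" Indicated Values
Braking Distance
(in Ft)               Number      Percentage          Number      Percentage
210                      0            0.0                0            0.0
220                      1            4.0                1            1.4
230                      2            8.0                4            5.6
240                      3           12.0               19           26.4
250                      4           16.0               32           44.4
260                      8           32.0               54           75.0
270                     11           44.0               61           84.7
280                     17           68.0               68           94.4
290                     21           84.0               68           94.4
300                     23           92.0               70           97.2
310                     25          100.0               71           98.6
320                     25          100.0               72          100.0

For U.S.- and foreign-made automobiles
a. Construct a frequency distribution for each group.
b. On the basis of the results of (a), approximate the mean of the braking distance.
c. On the basis of the results of (a), approximate the standard deviation of the braking distance.
d. On the basis of the results of (b) and (c), do U.S.- and foreign-made automobiles seem to differ in their braking distance? Explain.

3.32 The following data represent the distribution of the ages of employees within two different divisions of a publishing company.

Age of Employees (Years)    A Frequency    B Frequency
20 to under 30                   8             15
30 to under 40                  17             32
40 to under 50                  11             20
50 to under 60                   8              4
60 to under 70                   2              0

For each of the two divisions (A and B), approximate the
a. mean.
b. standard deviation.
c. On the basis of (a) and (b), do you think there are differences in the age distribution between the two divisions? Explain.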
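The approximations in Equations (3.16) and (3.17) can be sketched in Python as follows, using the class midpoints and frequencies from Table 3.7. The function name `approx_mean_std` is an illustrative choice, not something defined in the text.

```python
from math import sqrt


def approx_mean_std(midpoints, frequencies):
    """Approximate the sample mean and standard deviation from a
    frequency distribution, treating every value in a class as if it
    were located at the class midpoint (Equations 3.16 and 3.17)."""
    n = sum(frequencies)
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    sum_sq = sum((m - mean) ** 2 * f for m, f in zip(midpoints, frequencies))
    return mean, sqrt(sum_sq / (n - 1))


# Table 3.7: 2003 return for growth mutual funds.
midpoints = [15, 25, 35, 45, 55, 65]
frequencies = [2, 9, 13, 15, 5, 5]

mean, std = approx_mean_std(midpoints, frequencies)
print(f"approximate mean = {mean:.2f}, approximate S = {std:.2f}")
```

With the Table 3.7 frequencies this agrees with the values in Example 3.14 (a mean of about 40.51 and a standard deviation of about 13.08). The same function can be applied to the frequency distributions in the problems above.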
3.4 EXPLORATORY DATA ANALYSIS

Section 3.1 discussed sample statistics for numerical data that are measures of central tendency, variation, and shape. Another way of describing numerical data is through exploratory data analysis that includes the five-number summary and the box-and-whisker plot (references 5 and 6).

The Five-Number Summary

A five-number summary that consists of

Xsmallest    Q1    Median    Q3    Xlargest

provides a way to determine the shape of the distribution. Table 3.9 explains how the relationships among the "five numbers" allow you to recognize the shape of a data set.

TABLE 3.9 Relationships among the Five-Number Summary and the Type of Distribution

Comparison: Distance from Xsmallest to the median versus the distance from the median to Xlargest.
  Left-Skewed: The distance from Xsmallest to the median is greater than the distance from the median to Xlargest.
  Symmetric: Both distances are the same.
  Right-Skewed: The distance from Xsmallest to the median is less than the distance from the median to Xlargest.

Comparison: Distance from Xsmallest to Q1 versus the distance from Q3 to Xlargest.
  Left-Skewed: The distance from Xsmallest to Q1 is greater than the distance from Q3 to Xlargest.
  Symmetric: Both distances are the same.
  Right-Skewed: The distance from Xsmallest to Q1 is less than the distance from Q3 to Xlargest.

Comparison: Distance from Q1 to the median versus the distance from the median to Q3.
  Left-Skewed: The distance from Q1 to the median is greater than the distance from the median to Q3.
  Symmetric: Both distances are the same.
  Right-Skewed: The distance from Q1 to the median is less than the distance from the median to Q3.

For the sample of 10 get-ready times, the smallest value is 29 minutes and the largest value is 52 minutes (see pages 75 and 77). Calculations done previously in section 3.1 show that the median = 39.5, the first quartile = 35, and the third quartile = 44. Therefore, the five-number summary is

29    35    39.5    44    52

The distance from Xsmallest to the median (39.5 − 29 = 10.5) is slightly less than the distance from the median to Xlargest (52 − 39.5 = 12.5). The distance from Xsmallest to Q1 (35 − 29 = 6) is slightly less than the distance from Q3 to Xlargest (52 − 44 = 8). Therefore, the get-ready times are slightly right-skewed.
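The three comparisons of Table 3.9 can be applied mechanically to any five-number summary. The sketch below is one possible way to do that in Python; the function name `compare_distances` and the returned labels are illustrative assumptions, and the result is a list of three separate indications rather than a single verdict, because the comparisons need not agree.

```python
from math import isclose


def compare_distances(x_smallest, q1, median, q3, x_largest):
    """Apply the three distance comparisons of Table 3.9 and report
    what each one suggests about the shape of the distribution."""
    comparisons = [
        (median - x_smallest, x_largest - median),  # Xsmallest-to-median vs. median-to-Xlargest
        (q1 - x_smallest, x_largest - q3),          # Xsmallest-to-Q1 vs. Q3-to-Xlargest
        (median - q1, q3 - median),                 # Q1-to-median vs. median-to-Q3
    ]
    indications = []
    for left, right in comparisons:
        if isclose(left, right):
            indications.append("symmetric")
        elif left > right:
            indications.append("left-skewed")
        else:
            indications.append("right-skewed")
    return indications


# Five-number summary of the 10 get-ready times.
print(compare_distances(29, 35, 39.5, 44, 52))
```

For the get-ready times, the first two comparisons point toward right-skewness, while the distances from Q1 to the median and from the median to Q3 are both 4.5.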
EXAMPLE 3.15 COMPUTING THE FIVE-NUMBER SUMMARY OF THE 2003 PERCENTAGE RETURN OF SMALL CAP HIGH-RISK MUTUAL FUNDS

The 121 mutual funds that are part of the "Using Statistics" scenario (see page 72) are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Compute the five-number summary of the 2003 return for the small cap mutual funds with high risk. MUTUALFUNDS2004

SOLUTION From previous computations for the 2003 return for the small cap mutual funds with high risk (see pages 76 and 78), the median = 53.8, the first quartile = 41.7, and the third quartile = 60.85. In addition, the smallest value in the data set is 37.3 and the largest value is 66.5. Therefore, the five-number summary is

37.3    41.7    53.8    60.85    66.5

The distance from the median to Xlargest (66.5 − 53.8 = 12.7) is less than the distance from Xsmallest to the median (53.8 − 37.3 = 16.5). This indicates left skewness. The distance from Xsmallest to Q1 (41.7 − 37.3 = 4.4) is slightly less than the distance from Q3 to Xlargest (66.5 − 60.85 = 5.65). This indicates slight right-skewness. Therefore, the results are inconsistent.

The Box-and-Whisker Plot

A box-and-whisker plot provides a graphical representation of the data based on the five-number summary. Figure 3.4 illustrates the box-and-whisker plot for the get-ready times.

FIGURE 3.4 Box-and-Whisker Plot of the Time to Get Ready (horizontal axis: Time (minutes), 20 to 55; marked positions: Xsmallest, Q1, Median, Q3, Xlargest)

The vertical line drawn within the box represents the median. The vertical line at the left side of the box represents the location of Q1 and the vertical line at the right side of the box represents the location of Q3. Thus, the box contains the middle 50% of the values in the distribution. The lower 25% of the data are represented by a line (i.e., a whisker) connecting the left side of the box to the location of the smallest value, Xsmallest. Similarly, the upper 25% of the data are represented by a whisker connecting the right side of the box to Xlargest.

The box-and-whisker plot of the get-ready times in Figure 3.4 indicates very slight right-skewness since the distance between the median and the highest value is slightly more than the distance between the lowest value and the median. In addition, the right whisker is slightly longer than the left whisker.
EXAMPLE 3.16 THE BOX-AND-WHISKER PLOT OF THE 2003 PERCENTAGE RETURN OF LOW-RISK, AVERAGE-RISK, AND HIGH-RISK MUTUAL FUNDS

The 121 mutual funds that are part of the "Using Statistics" scenario (see page 72) are classified according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid cap, and large cap). Construct the box-and-whisker plot of the 2003 return for low-risk, average-risk, and high-risk mutual funds. MUTUALFUNDS2004

SOLUTION Figure 3.5 is the Minitab box-and-whisker plot of the 2003 return for low-risk, average-risk, and high-risk mutual funds. Minitab displays the box-and-whisker plot vertically from bottom (low) to top (high). The asterisk (*) for the average-risk fund represents the presence of outlier values.2 The median percentage return and the quartiles are higher for the high-risk funds than for the low-risk and average-risk funds. The low-risk funds appear to be slightly right-skewed since the upper whisker is longer than the lower whisker. The average-risk funds are right-skewed due to the extremely large return of one fund (78). The high-risk funds appear left-skewed because of the long lower whisker, but the median return is closer to the first quartile than to the third quartile.

FIGURE 3.5 Minitab Box-and-Whisker Plot of the 2003 Return for Low-Risk, Average-Risk, and High-Risk Mutual Funds

2 If there are outliers, the whiskers in the Minitab box-and-whisker plot extend to 1.5 times the interquartile range beyond the quartiles or to the highest value.
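The 1.5 times the interquartile range convention mentioned in the footnote can be sketched as follows. This is an illustration under stated assumptions, not a description of Minitab's exact algorithm: values farther than 1.5 × IQR beyond a quartile are treated as outliers, and each whisker is drawn to the most extreme value that is not an outlier. The function name `whiskers_and_outliers` is hypothetical. The data are the nine 2003 returns of the small cap high-risk funds, with the quartiles from Example 3.15.

```python
def whiskers_and_outliers(values, q1, q3, multiplier=1.5):
    """Return (lower whisker end, upper whisker end, outliers), where a
    value is an outlier if it lies more than multiplier * IQR beyond
    the first or third quartile."""
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return min(inside), max(inside), outliers


# 2003 returns of the small cap high-risk funds; Q1 and Q3 from Example 3.15.
returns = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]
low, high, outliers = whiskers_and_outliers(returns, q1=41.7, q3=60.85)
print(low, high, outliers)
```

For these nine funds no return lies beyond the fences, so under this convention the whiskers would simply run to the smallest and largest values and no asterisk would be drawn.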
Panel C of Figure 3. The few small values distort the mean toward the left tail. Construct the box-and-whisker plot and describe the shape. Discuss. b. List the five-number summary. 3. Durham Ohio State University. b. b. the right side). the mean and median are equal. c. PROBLEMS FOR SECTION 3. .38 In the 2002–2003 academic year. List the five-number summary.720 708 1. Construct the box-and-whisker plot and describe the shape.” USA Today. Columbia Utah State University.37 A manufacturer of flashlight batteries took a sample of 13 batteries from a day’s production and used them continuously until they were drained. b. The following represents the change in the cost of tuition. c. Construct the box-and-whisker plot and describe the shape. and the median line divides the box in half. Discuss. Columbus University of South Carolina. Construct the box-and-whisker plot and describe the shape. Construct the box-and-whisker plot and describe the shape.42 can be solved manually or by using Microsoft Excel. Construct the box-and-whisker plot and describe the shape. COLLEGECOST. and the most popular meal plan from the 2001–2002 academic year to the 2002–2003 academic year for a sample of 10 public universities.4: Exploratory Data Analysis 105 Panels A and D of Figure 3.223 869 423 1. Applying the Concepts Problems 3. the long left whisker contains the smallest 25% of the values. “Public Universities Raise Tuition. or SPSS.3. c.. Minitab. PH Grade ASSIST 3. Discuss. 2002. Athens University of Illinois.4 Learning the Basics PH Grade ASSIST 3. demonstrating the distortion from symmetry in this data set. Oxford University of New Hampshire.6 is left-skewed. For this left-skewed distribution. List the five-number summary. Compare your answer in (b) with that from problem 3. List the five-number summary. August 8. In addition. 75% of all values are found between the left edge of the box (Q1) and the end of the right whisker (Xlargest). 
many public universities in the United States raised tuition and fees due to a decrease in state subsidies (Mary Beth Marklein. the length of the left whisker is equal to the length of the right whisker.36 The following is a set of data from a sample of n = 5: 7 −5 −8 7 9 a.4(d) on page 90. Manhattan University of Maine. b. The number of hours until failure are in the file.049 631 512 266 492 562 298 a.e. BATTERIES PH Grade ASSIST SELF Test 342 426 317 545 264 451 1. Discuss..1(d) on page 90. 3. 3. Compare your answer in (b) with that from problem 3. c. and the remaining 25% of the values are dispersed along the long right whisker at the upper end of the scale. the skewness indicates that there is a heavy clustering of values at the high end of the scale (i. 75% of all data values are found between the beginning of the left whisker (Xsmallest) and the right edge of the box (Q3). b.589 593 1. In these distributions. Fees—and Ire. Berkeley University of Georgia. Rothenberger.73 3. June 2000. Eight analysts at the firm were asked to estimate the reuse rate when developing a new software system. Construct the box-and-whisker plot and describe the shape of the distribution of the two bank branches.” Copyright © 2000 by Consumers Union of U.08 6. the covariance and the coefficient of correlation that measure the strength of the relationship between two numerical variables are discussed. “A Performance Measure for Software Reuse Projects. lunch period. March 2001. The waiting time in minutes (defined as the time the customer enters the line until he or she reaches the teller window) of all customers during these hours is recorded over a period of one week. FASTFOOD 3.S.. A random sample of 15 customers is selected.13 4. NY 10703–1057. a. b.34 3. is also concerned with the noon to 1 P.50 6. . Inc.35 10.20 4. 
3.39 A software development and consulting firm located in the Phoenix metropolitan area develops software for supply chain management systems using systematic software reuse. Instead of starting from scratch when writing and developing new custom software systems, the firm uses a database of reusable components totaling more than 2,000,000 lines of code collected from 10 years of continuous reuse effort. Eight analysts at the firm were asked to estimate the reuse rate when developing a new software system. The following data are given as a percentage of the total code written for a software system that is part of the reuse database. REUSE
50 62.5 37.5 75.0 45.0 47.5 15.0 25.0
Source: M. A. Rothenberger and K. J. Dooley, "A Performance Measure for Software Reuse Projects," Decision Sciences, 30 (Fall 1999), 1131–1153.
a. List the five-number summary.
b. Form the box-and-whisker plot and describe the shape of the data.

3.40 The following data represent the bounced check fee (in dollars) for a sample of 23 banks for direct-deposit customers who maintain a $100 balance and the monthly service fee (in dollars) for direct-deposit customers if their accounts fall below the minimum required balance of $1500 for a sample of 26 banks.
BANKCOST1
26 28 20 20 21 22 25 25 18 25 15 20 18 20 25 25 22 30 30 30 15 20 29
BANKCOST2
12 8 5 5 6 6 10 10 9 7 10 7 7 5 0 10 6 9 12 0 5 10 8 5 5 9
Source: Extracted from "The New Face of Banking," Copyright © 2000 by Consumers Union of U.S., Inc., Yonkers, NY 10703–1057. Adapted with permission from Consumer Reports, June 2000.
a. List the five-number summary of the bounced check fee and of the monthly service fee.
b. Construct the box-and-whisker plot of the bounced check fee and the monthly service fee.
c. What similarities and differences are there in the distributions for the bounced check fee and the monthly service fee?

3.41 The following data represent the total fat for burgers and chicken items from a sample of fast-food chains. FASTFOOD
Burgers
19 31 34 35 39 39 43
Chicken
7 9 15 16 16 18 22 25 27 33 39
Source: Extracted from "Quick Bites," Copyright © 2001 by Consumers Union of U.S., Inc., Yonkers, NY 10703–1057. Adapted with permission from Consumer Reports, March 2001, 46.
a. List the five-number summary for the burgers and for the chicken items.
b. Construct the box-and-whisker plot for the burgers and the chicken items, and describe the shape of the distribution for the burgers and chicken items.
c. What similarities and differences are there in the distributions for the burgers and the chicken items?

3.42 A bank branch located in a commercial district of a city has developed an improved process for serving customers during the noon to 1:00 P.M. lunch period. The waiting time in minutes (defined as the time the customer enters the line until he or she reaches the teller window) of all customers during these hours is recorded over a period of one week. A random sample of 15 customers is selected, and the results are as follows: BANK1
4.21 5.55 3.02 5.13 4.77 2.34 3.54 3.20 4.50 6.10 0.38 5.12 6.46 6.19 3.79
Another branch, located in a residential area, is also concerned with the noon to 1 P.M. lunch hour. The waiting time in minutes (operationally defined as the time the customer enters the line to the time he or she reaches the teller window) of all customers during this hour is recorded over a period of one week. A random sample of 15 customers is selected, and the results are as follows: BANK2
9.66 5.90 8.02 5.79 8.73 3.82 8.01 8.35 10.49 6.68 5.64 4.08 6.17 9.91 5.47
a. List the five-number summary of the waiting time at the two bank branches.
b. Construct the box-and-whisker plot and describe the shape of the distribution of the two bank branches.
c. What similarities and differences are there in the distribution of the waiting time at the two bank branches?
d. Should you compare the two bank branches? Explain.

3.5 THE COVARIANCE AND THE COEFFICIENT OF CORRELATION

In section 2.5, you used scatter diagrams to visually examine the relationship between two numerical variables. In this section, the covariance and the coefficient of correlation that measure the strength of the relationship between two numerical variables are discussed.

The Covariance

The covariance measures the strength of the linear relationship between two numerical variables (X and Y). Equation (3.18) defines the sample covariance and Example 3.17 illustrates its use.
The Calculations area of Figure 3.579 9 −1 = 1.19738.7 Microsoft Excel Worksheet for the Covariance between Expense Ratio and 2003 Return for the Small Cap High-Risk Funds Expense Ratio 1.25 0.18) n −1 COMPUTING THE SAMPLE COVARIANCE Consider the expense ratio and the 2003 return for the small cap high-risk funds. Y ) = 9.2 44.10 Expense Ratio and 2003 Return for the Small Cap High-Risk Funds FIGURE 3.72 1.5 .8 56.33 1.5 53. SOLUTION Table 3.40 1.61 1.18) into a set of smaller calculations.6 59. From cell C17.20 2003 Return 37.19738 TABLE 3.3 39.10 presents the expense ratio and 2003 return for the small cap high-risk funds and Figure 3.3.57 1.3 62.4 66. Compute the sample covariance.42 1. cov( X .7 breaks down Equation (3.2 44.5: The Covariance and the Coefficient of Correlation 107 THE SAMPLE COVARIANCE n cov( X .18) directly.68 1.7 contains a Microsoft Excel worksheet that calculates the covariance for these data. Y ) = EXAMPLE 3. or by using Equation (3.17 ∑ ( X i − X )(Yi − Y ) i =1 (3. the covariance is 1. 0. Figure 3. Perfect means that if the points were plotted in a scatter diagram. The values of the coefficient of correlation range from −1 for a perfect negative correlation to +1 for a perfect positive correlation. and there is only a slight tendency for the small values of X to be paired with the larger values of Y. The Coefficient of Correlation The coefficient of correlation measures the relative strength of a linear relationship between two numerical variables. You can see that for small values of X there is a very strong tendency for Y to be large.9 on page 109 presents scatter diagrams along with their respective sample coefficients of correlation r for six data sets. the Greek letter ρ is used as the symbol for the coefficient of correlation. Panels D through F depict data sets that have positive coefficients of correlation because small values of X tend to be paired with small values of Y.6. 
The covariance has a major flaw as a measure of the linear relationship between two numerical variables. Since the covariance can have any value, you are unable to determine the relative strength of the relationship. To better determine the relative strength of the relationship, you need to compute the coefficient of correlation.

The Coefficient of Correlation

The coefficient of correlation measures the relative strength of a linear relationship between two numerical variables. The values of the coefficient of correlation range from −1 for a perfect negative correlation to +1 for a perfect positive correlation. Perfect means that if the points were plotted in a scatter diagram, all the points could be connected with a straight line. When dealing with population data for two numerical variables, the Greek letter ρ is used as the symbol for the coefficient of correlation. Figure 3.8 illustrates three different types of association between two variables.

FIGURE 3.8 Types of Association between Variables (Panel A: Perfect negative correlation, r = −1; Panel B: No correlation, r = 0; Panel C: Perfect positive correlation, r = +1)

In panel A of Figure 3.8 there is a perfect negative linear relationship between X and Y. Thus, the coefficient of correlation ρ equals −1, and when X increases, Y decreases in a perfectly predictable manner. Panel B shows a situation in which there is no relationship between X and Y. In this case, the coefficient of correlation ρ equals 0, and as X increases, there is no tendency for Y to increase or decrease. Panel C illustrates a perfect positive relationship where ρ equals +1. In this case, Y increases in a perfectly predictable manner when X increases.

When you have sample data, the sample coefficient of correlation r is calculated. When using sample data, you are unlikely to have a sample coefficient of exactly +1, 0, or −1. Figure 3.9 on page 109 presents scatter diagrams along with their respective sample coefficients of correlation r for six data sets, each of which contains 100 values of X and Y.

In panel A, the coefficient of correlation r is −0.9. You can see that for small values of X there is a very strong tendency for Y to be large. Likewise, the large values of X tend to be paired with small values of Y. The data do not all fall on a straight line, so the association between X and Y cannot be described as perfect. The data in panel B have a coefficient of correlation equal to −0.6, and the small values of X tend to be paired with large values of Y. The linear relationship between X and Y in panel B is not as strong as in panel A. Thus, the coefficient of correlation in panel B is not as negative as in panel A. In panel C the linear relationship between X and Y is very weak, r = −0.3, and there is only a slight tendency for the small values of X to be paired with the larger values of Y. Panels D through F depict data sets that have positive coefficients of correlation because small values of X tend to be paired with small values of Y, and the large values of X tend to be associated with large values of Y.

FIGURE 3.9 Six Scatter Diagrams Created from Minitab and Their Sample Coefficients of Correlation r (Panels A through F)

In the discussion of Figure 3.9, the relationships were deliberately described as tendencies and not as cause-and-effect. This wording was used on purpose. Correlation alone cannot prove that there is a causation effect, that is, that the change in the value of one variable caused the change in the other variable. A strong correlation can be produced simply by chance, by the effect of a third variable not considered in the calculation of the correlation, or by a cause-and-effect relationship. You would need to perform additional analysis to determine which of these three situations actually produced the correlation. Therefore, you can say that causation implies correlation, but correlation alone does not imply causation.

Equation (3.19) defines the sample coefficient of correlation r and Example 3.18 illustrates its use.
THE SAMPLE COEFFICIENT OF CORRELATION

\[ r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y} \tag{3.19} \]

where

\[ \operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \]

\[ S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^{2}}{n - 1}} \]

\[ S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^{2}}{n - 1}} \]

Example 3.18 illustrates the computation of the sample coefficient of correlation using Equation (3.19).

EXAMPLE 3.18 COMPUTING THE SAMPLE COEFFICIENT OF CORRELATION

Consider the expense ratio and the 2003 return for the small cap high-risk funds. From Figure 3.10 and Equation (3.19), compute the sample coefficient of correlation.

SOLUTION

\[ r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y} = \frac{1.19738}{(0.287663)(10.554383)} = 0.3943786 \]

FIGURE 3.10 Microsoft Excel Worksheet for the Sample Coefficient of Correlation r between the Expense Ratio and the 2003 Return for Small Cap High-Risk Funds
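As a cross-check on the worksheet result, Equation (3.19) can be computed in a short Python sketch. The helper names `sample_std` and `sample_correlation` are illustrative, and the data are the expense ratios and 2003 returns of the nine small cap high-risk funds.

```python
from math import sqrt


def sample_std(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_correlation(x, y):
    """Sample coefficient of correlation r, as in Equation (3.19)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (sample_std(x) * sample_std(y))


# Expense ratio and 2003 return for the small cap high-risk funds.
expense_ratio = [1.25, 0.72, 1.57, 1.40, 1.33, 1.61, 1.68, 1.42, 1.20]
return_2003 = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]

print(f"S_X = {sample_std(expense_ratio):.4f}")
print(f"S_Y = {sample_std(return_2003):.4f}")
print(f"r   = {sample_correlation(expense_ratio, return_2003):.4f}")
```

The computed standard deviations and r agree with the values 0.287663, 10.554383, and 0.3943786 used in Example 3.18, up to rounding.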
The expense ratio and the 2003 return for the small cap high-risk funds are positively correlated. Those mutual funds with the lowest expense ratios tend to be associated with the lowest 2003 returns. Those mutual funds with the highest expense ratios tend to be associated with the highest 2003 returns. This relationship is fairly weak, as indicated by a coefficient of correlation, r = 0.394. You cannot assume that having a low expense ratio caused the low 2003 return. You can only say that this is what tended to happen in the sample. As with all investments, past performance does not guarantee future performance.

In summary, the coefficient of correlation indicates the linear relationship, or association, between two numerical variables. When the coefficient of correlation gets closer to +1 or −1, the linear relationship between the two variables is stronger. When the coefficient of correlation is near 0, little or no linear relationship exists. The sign of the coefficient of correlation indicates whether the data are positively correlated (i.e., the larger values of X are typically paired with the larger values of Y) or negatively correlated (i.e., the larger values of X are typically paired with the smaller values of Y). The existence of a strong correlation does not imply a causation effect. It only indicates the tendencies present in the data.

PROBLEMS FOR SECTION 3.5

Learning the Basics

3.43 The following is a set of data from a sample of n = 11 items:
X   7   5   8   3   6  10  12   4   9  15  18
Y  21  15  24   9  18  30  36  12  27  45  54
a. Compute the covariance.
b. Compute the coefficient of correlation.
c. How strong is the relationship between X and Y? Explain.

Applying the Concepts

Problems 3.44–3.49 can be solved manually or by using Microsoft Excel, Minitab, or SPSS.

3.44 A recent article (J. Clements, "Why Investors Should Put up to 30% of Their Stock Portfolio in Foreign Funds," The Wall Street Journal, November 26, 2003, D1) that discussed investment in foreign stocks stated the coefficient of correlation between the return on investment of U.S. stocks and each of five other types of investments: International Large Cap stocks, Emerging market stocks, International Small Cap stocks, Emerging market debt, and International Bonds. All five reported coefficients are positive.
a. What conclusions about the strength of the relationship between the return on investment of U.S. stocks and these five other types of investments can you make?
b. Compare the results of (a) to those of problem 3.45 (a).

3.45 A recent article (J. Clements, "Why Investors Should Put up to 30% of Their Stock Portfolio in Foreign Funds," The Wall Street Journal, November 26, 2003, D1) that discussed investment in foreign bonds stated the coefficient of correlation between the return on investment of U.S. bonds and each of five other types of investments: International Large Cap stocks, Emerging market stocks, International Small Cap stocks, Emerging market debt, and International Bonds. The reported coefficients are negative for the three types of stocks and positive for Emerging market debt and International Bonds.
a. What conclusions about the strength of the relationship between the return on investment of U.S. bonds and these five other types of investments can you make?
b. Compare the results of (a) to those of problem 3.44 (a).

3.46 The following data COFFEEDRINK represent the calories and fat (in grams) of 16-ounce iced coffee drinks at Dunkin' Donuts and Starbucks:

Product                                                                        Calories   Fat
Dunkin' Donuts Iced Mocha Swirl latte (whole milk)                                240       8.0
Starbucks Coffee Frappuccino blended coffee                                       260       3.5
Dunkin' Donuts Coffee Coolatta (cream)                                            350      22.0
Starbucks Iced Coffee Mocha Expresso (whole milk and whipped cream)               350      20.0
Starbucks Mocha Frappuccino blended coffee (whipped cream)                        420      16.0
Starbucks Chocolate Brownie Frappuccino blended coffee (whipped cream)            510      22.0
Starbucks Chocolate Frappuccino Blended Crème (whipped cream)                     530      19.0

Source: Extracted from "Coffee as Candy at Dunkin' Donuts and Starbucks," Copyright © 2004 by Consumers Union of U.S., Inc., Yonkers, NY 10703–1057. Adapted with permission from Consumer Reports, June 2004, 9.
a. Compute the covariance.
b. Compute the coefficient of correlation.
c. Which do you think is more valuable in expressing the relationship between calories and fat, the covariance or the coefficient of correlation? Explain.
d. What conclusions can you reach about the relationship between calories and fat?

3.47 The following data represent the value of exports and imports in 2001 for various countries: EXPIMP

Country           Exports    Imports
European Union     874.1      912.8
United States      730.8    1,180.2
Japan              403.5      349.1
China              266.2      243.6
Canada             259.9      227.2
Hong Kong          191.1      202.0
Mexico             158.5      176.2
South Korea        150.4      141.1
Taiwan             122.5      107.3
Singapore          121.8      116.0

Source: Extracted from N. King and S. Miller, "Post-Iraq Influence of U.S. Faces Test at New Trade Talks," The Wall Street Journal, September 9, 2003, A1.
a. Compute the covariance.
b. Compute the coefficient of correlation.
c. Which do you think is more valuable in expressing the relationship between exports and imports, the covariance or the coefficient of correlation? Explain.
d. What conclusions can you reach about the relationship between exports and imports?

3.48 The data in the file SECURITY represent the turnover rate of pre-boarding screeners at airports in 1998–1999 and the security violations detected per million passengers for 19 airports: St. Louis, Atlanta, Houston, Boston, Chicago, Denver, Dallas, Baltimore, Seattle/Tacoma, San Francisco, Orlando, Washington–Dulles, Los Angeles, Detroit, San Juan, Miami, New York–JFK, Washington–Reagan, and Honolulu.
Source: Extracted from Alan B. Krueger, "A Small Dose of Common Sense Would Help Congress Break the Gridlock over Airport Security," The New York Times, November 15, 2001, C2.
a. Compute the covariance.
b. Compute the coefficient of correlation.
c. What conclusions can you reach about the relationship between the turnover rate of pre-boarding screeners and the security violations detected?

3.49 The data in the file CELLPHONE represent the digital-mode talk time in hours and the battery capacity in milliampere-hours of cellphones.
Source: Extracted from "Service Shortcomings," Copyright 2002 by Consumers Union of U.S., Inc., Yonkers, NY 10703–1057. Adapted with permission from Consumer Reports, February 2002, 25.
a. Compute the covariance.
b. Compute the coefficient of correlation.
c. What conclusions can you reach about the relationship between the battery capacity and the digital-mode talk time?
d. You would expect cellphones with higher battery capacity to have a higher talk time. Is this borne out by the data?

3.6 PITFALLS IN NUMERICAL DESCRIPTIVE MEASURES AND ETHICAL ISSUES

In this chapter you studied how a set of numerical data can be characterized by various statistics that measure the properties of central tendency, variation, and shape. Your next step is analysis and interpretation of the calculated statistics. Your analysis is objective; your interpre-
Your interpretation is subjective. You must avoid errors that may arise either in the objectivity of your analysis or in the subjectivity of your interpretation.

The analysis of the mutual funds based on risk level is objective and reveals several impartial findings. Objectivity in data analysis means reporting the most appropriate descriptive summary measures for a given data set. Now that you have read the chapter and have become familiar with various descriptive summary measures and their strengths and weaknesses, how should you proceed with the objective analysis? Because the data distribute in a slightly asymmetrical manner, shouldn’t you report the median in addition to the mean? Doesn’t the standard deviation provide more information about the property of variation than the range? Should you describe the data set as right-skewed?

On the other hand, data interpretation is subjective. Different people form different conclusions when interpreting the analytical findings. Everyone sees the world from different perspectives. Thus, because data interpretation is subjective, you must do it in a fair, neutral, and clear manner.

Ethical Issues

Ethical issues are vitally important to all data analysis. As a daily consumer of information, you need to question what you read in newspapers and magazines, what you hear on the radio or television, and what you see on the World Wide Web. Over time, much skepticism has been expressed about the purpose, the focus, and the objectivity of published studies. Perhaps no comment on this topic is more telling than a quip often attributed to the famous nineteenth-century British statesman Benjamin Disraeli: “There are three kinds of lies: lies, damned lies, and statistics.”

Ethical considerations arise when you are deciding what results to include in a report. You should document both good and bad results. In addition, when making oral presentations and presenting written reports, you need to give results in a fair, objective, and neutral manner. Unethical behavior occurs when you willfully choose an inappropriate summary measure (e.g., the mean for a very skewed set of data) to distort the facts in order to support a particular position. In addition, unethical behavior occurs when you selectively fail to report pertinent findings because it would be detrimental to the support of a particular position.

SUMMARY

This chapter was about numerical descriptive measures. In this and the previous chapter, you studied descriptive statistics: how data are presented in tables and charts, and then summarized, described, analyzed, and interpreted. When dealing with the mutual fund data, you were able to present useful information through the use of pie charts, histograms, and other graphical methods. You explored characteristics of past performance such as central tendency, variability, and shape using numerical descriptive measures such as the mean, median, quartiles, range, standard deviation, and coefficient of correlation. Table 3.11 provides a list of the numerical descriptive measures covered in this chapter. In the next chapter, the basic principles of probability are presented in order to bridge the gap between the subject of descriptive statistics and the subject of inferential statistics.

TABLE 3.11 Summary of Numerical Descriptive Measures

Type of Analysis | Numerical Data
Describing central tendency, variation, and shape of a numerical variable | Mean, median, mode, quartiles, geometric mean, range, interquartile range, standard deviation, variance, coefficient of variation, Z scores, box-and-whisker plot (sections 3.1–3.4)
Describing the relationship between two numerical variables | Covariance, coefficient of correlation (section 3.5)
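The measures listed in Table 3.11 can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own software procedure (the text uses Microsoft Excel, Minitab, or SPSS). The function names are hypothetical, and the quartile rounding convention (whole-number position picks that ranked value, a fractional half averages the two neighbors, anything else rounds to the nearest rank) is an assumption about how the ranked-value positions are resolved. The X and Y values are the ones that appear in the extracted table for problem 3.43.

```python
import math


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def ranked_value(sorted_values, position):
    """Value at a 1-based ranked position in already-sorted data.

    Assumed convention: a whole-number position picks that ranked value,
    a fractional half (e.g. 2.5) averages the two neighboring values,
    and any other position is rounded to the nearest rank.
    """
    n = len(sorted_values)
    lower = math.floor(position)
    if position - lower == 0.5:
        lo_rank = max(lower, 1)
        hi_rank = min(lower + 1, n)
        return (sorted_values[lo_rank - 1] + sorted_values[hi_rank - 1]) / 2
    rank = min(max(math.floor(position + 0.5), 1), n)
    return sorted_values[rank - 1]


def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) as the k(n + 1)/4 ranked value."""
    ordered = sorted(values)
    return ranked_value(ordered, k * (len(ordered) + 1) / 4)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return sample_std(values) / mean(values) * 100


def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    m, s = mean(values), sample_std(values)
    return [(v - m) / s for v in values]


def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Sample covariance divided by the product of the standard deviations."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Values as read from the extracted table for problem 3.43 (n = 11).
x = [7, 5, 8, 3, 6, 10, 12, 4, 9, 15, 18]
y = [21, 15, 24, 9, 18, 30, 36, 12, 27, 45, 54]

q1, median, q3 = (quartile(x, k) for k in (1, 2, 3))
data_range = max(x) - min(x)
iqr = q3 - q1
variance = sample_variance(x)
cov_xy = sample_covariance(x, y)
r = correlation(x, y)

print(f"mean={mean(x):.4f} median={median} Q1={q1} Q3={q3}")
print(f"range={data_range} IQR={iqr} S^2={variance:.4f} S={sample_std(x):.4f}")
print(f"CV={coefficient_of_variation(x):.2f}% cov={cov_xy:.4f} r={r:.4f}")
```

For this sample the quartile positions are whole numbers (3, 6, and 9), so Q1 = 5, the median is 8, and Q3 = 12. Because every Y value is exactly three times its X value, the coefficient of correlation is 1, a perfect positive linear relationship.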
KEY FORMULAS

Sample Mean
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{3.1}$$

Median
$$\text{Median} = \frac{n+1}{2}\ \text{ranked value} \tag{3.2}$$

First Quartile Q1
$$Q_1 = \frac{n+1}{4}\ \text{ranked value} \tag{3.3}$$

Third Quartile Q3
$$Q_3 = \frac{3(n+1)}{4}\ \text{ranked value} \tag{3.4}$$

Geometric Mean
$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n} \tag{3.5}$$

Geometric Mean Rate of Return
$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1 \tag{3.6}$$

Range
$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}} \tag{3.7}$$

Interquartile Range
$$\text{Interquartile range} = Q_3 - Q_1 \tag{3.8}$$

Sample Variance
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} \tag{3.9}$$

Sample Standard Deviation
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} \tag{3.10}$$

Coefficient of Variation
$$CV = \left(\frac{S}{\bar{X}}\right) 100\% \tag{3.11}$$

Z Scores
$$Z = \frac{X - \bar{X}}{S} \tag{3.12}$$

Population Mean
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \tag{3.13}$$

Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \tag{3.14}$$

Population Standard Deviation
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \tag{3.15}$$

Approximating the Mean from a Frequency Distribution
$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} \tag{3.16}$$

Approximating the Standard Deviation from a Frequency Distribution
$$S = \sqrt{\frac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n-1}} \tag{3.17}$$

Sample Covariance
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} \tag{3.18}$$

Sample Coefficient of Correlation
$$r = \frac{\text{cov}(X, Y)}{S_X S_Y} \tag{3.19}$$

KEY TERMS

arithmetic mean 73
box-and-whisker plot 103
central tendency 72
Chebyshev rule 97
coefficient of correlation 108
coefficient of variation 85
covariance 106
dispersion 72
empirical rule 96
extreme value 86
five-number summary 102
geometric mean 79
interquartile range 81
left-skewed 88
mean 73
median 75
midspread 81
mode 76
outlier 86
population mean 94
population standard deviation 95
population variance 97
Q1: first quartile 77
Q2: second quartile 77

67 manually or by using Microsoft Excel. a medical information bureau check.

3.59 What is meant by the property of shape?
3.61 A quality characteristic of interest for a tea-bag-filling process is the weight of the tea in the individual bags.57 How does the empirical rule help explain the ways in which the values in a set of numerical data cluster and distribute? 3.58 How do the empirical rule and the Chebychev rule differ? 3.61 5.56 What are the differences among the various measures of variation such as the range.68–3.40 5.77 5. a random sample of 27 approved policies was selected and the following total processing time in days was recorded: INSURANCE 73 19 16 64 28 28 31 90 60 56 31 56 22 18 45 48 17 17 17 91 92 63 50 51 69 16 17 .56 5. If the bags are underfilled. variance.58 5.42 5.44 5. Why should the company producing the tea bags be concerned about the central tendency and variation? d. standard deviation. and what are the advantages and disadvantages of each? 3. TEABAGS 5.32 5. and coefficient of variation. two problems arise. how? e. what changes. Interpret the measures of central tendency and variation within the context of this problem.53 5. median.86 using Microsoft Excel.47 5.52 5. Minitab. median.55 What does the Z score measure? 3. Is the company meeting the requirement set forth on the label that. Compute the mean. 3. if any.58 5.58 5. there are 5. standard deviation. and third quartile? 3. the label weight on the package indicates that.36 a. First.55 5.44 5.29 5. which includes a review of the application. and what are the advantages and disadvantages of each? 3.52 What are the differences among the mean.50 5. If the average amount of tea in a bag exceeds the label weight. interquartile range. For this product.53 5.55 5. median.67 5.57 5. The approval process consists of underwriting. first quartile.45 5.57 5.40 5.57 5. the company may be in violation of the truth-in-labeling laws.41 5. differences in the density of the tea.5 grams of tea in a bag. interquartile range.51 5. During a period of one month. Construct a box-and-whisker plot.50 5. Compute the range.67 5. 
and mode. variance.25 5. would you try to make concerning the distribution of weights in the individual bags? 3.61–3.44 5.53 How do you interpret the first quartile.Chapter Review Problems Q3: third quartile 77 quartiles 77 range 80 resistant measures 81 right-skewed 88 sample coefficient of correlation 109 CHAPTER sample covariance 106 sample mean 73 sample standard deviation sample variance 82 shape 72 skewed 88 spread 72 REVIEW Checking Your Understanding 3.54 5.34 5.51 What is meant by the property of central tendency? 3.5 grams of tea in a bag? If you were in charge of this process.56 5. on average. c. Second. b. savings banks are permitted to sell a form of life insurance called Savings Bank Life Insurance (SBLI). Minitab. possible requests for additional medical information and medical exams. the company is giving away product. and coefficient of variation. We recommend that you solve problems 3.54 5. or SPSS.65 5.54 What is meant by the property of variation? 3. and the extremely fast filling operation of the machine (approximately 170 bags a minute). and a policy compilation stage during which the policy pages are generated and sent to the bank for delivery.45 5. Getting an exact amount of tea in a bag is problematic 115 standard deviation 82 sum of squares 82 symmetrical 88 variance 82 variation 76 Z scores 86 82 PROBLEMS because of variation in the temperature and humidity inside the factory. on average.61 5.47 5. there are 5. The ability to deliver approved policies to customers in a timely manner is critical to the profitability of this service to the bank.62 In New York State. customers may not be able to brew the tea to be as strong as they wish.53 5. and third quartile.42 5.50 What are the properties of a set of numerical data? 3.53 5. The following table provides the weight in grams of a sample of 50 tea bags produced in one hour by a single machine.60 How do the covariance and the coefficient of correlation differ? 
Applying the Concepts You can solve problems 3.63 5.62 5.32 5.49 5. or SPSS.40 5.46 5. Are the data skewed? If so.50 5. Interpret the measures of central tendency and variability in (a).744 1.31 and 8.64 A manufacturing company produces steel housings for electrical equipment.462 8. Construct a box-and-whisker plot.444 8.385 8. how? d. and standard deviation for the width.784 1. are there any differences between the two central offices? Explain.662 1.405 8. TROUGH 8.756 1.60 4. Calculate the mean. The following data represent samples of 20 problems reported to two different offices of a telephone company and the time to clear these problems (in minutes) from the customers’ lines: PHONE Central Office I Time to Clear Problems (minutes) 1.550 1. FURNITURE 54 5 35 137 31 27 152 2 123 81 74 27 11 19 126 110 110 29 61 35 94 31 26 5 12 4 165 32 29 28 29 26 25 1 14 13 13 10 5 27 4 52 30 22 36 26 20 23 33 68 a.764 1. and standard deviation for the force variable.447 8.410 8. The following are the widths of the troughs in inches for a sample of n = 49. Compute the range. What can you conclude about the strength of the insulators if the company requires a force measurement of at least 1. Are the data skewed? If so. if you had to tell the president of the company how long a customer should expect to wait to have a complaint resolved.409 a. The following data represent the number of days between the receipt of the complaint and the resolution of the complaint.92 0.680 1.436 8. Construct a box-and-whisker plot. median. d.652 1.734 1.45 0.31 inches and 8. and third quartile. the flooring department had expanded from 2 installation crews to an installation supervisor.58 4.762 1. c. Compute the range.419 8. first quartile.592 1. standard deviation.460 8.866 1.02 3.414 8. median.415 8.53 4. 3. c.61 inches wide? 3. and third quartile.55 3.634 1.52 3. c. The main component part of the housing is a steel trough that is made out of a 14-gauge steel coil.820 1. 
d.489 8.481 8.48 1.53 0.498 8.498 8.479 8.734 1.75 0.80 1. Compute the range. On the basis of the results of (a) through (c).02 0.405 8. a short-circuit is likely to occur.413 8.420 8. how? d. including carpet.810 1.66 Problems with a telephone line that prevent a customer from receiving or making calls are disconcerting to both the customer and the telephone company.60 1.420 8.10 0. The company requires that the width of the trough be between 8. interquartile range. b. Interpret these measures of central tendency and variability.343 8.116 CHAPTER THREE Numerical Descriptive Measures a.422 8. Calculate the mean.870 1.465 8.78 2. If the insulators break when in use.93 1.60 0.383 8. median. interquartile range. In particular.65 The manufacturing company in problem 3. range.410 8.10 1.656 1.97 1. To test the strength of the insulators.72 For each of the two central office locations: a. The distance from one side of the form to the other is critical because of weatherproofing in outdoor applications. Compute the mean. how? d. Force is measured by observing how many pounds must be applied to the insulator before it breaks.317 8.484 8. b.65 1. had undergone a major expansion in the past several years.63 One of the major measures of the quality of service provided by any organization is the speed with which it responds to customer complaints. and 15 installation crews. Compute the mean.48 1. Construct a box-and-whisker plot and describe the shape. Compute the mean. What would you tell a customer who enters the bank to purchase this type of insurance policy and asks how long the approval process takes? 3. On the basis of the results of (a) through (c).32 3.15 3.696 1. Construct a box-and-whisker plot and describe the shape. A sample of 50 complaints concerning carpet installation was selected during a recent year. what would you say? Explain. b. median. variance. b.460 8.411 8.351 8. The data from 30 insulators from this experiment are as follows: FORCE 1. 
and coefficient of variation. Are the data skewed? If so. median. standard deviation.427 8. A large family-held department store selling furniture and flooring.447 8.458 8.429 8.728 1.85 0. .396 8. and coefficient of variation.439 8. standard deviation.93 5.500 pounds? 3. variance.373 8.736 a.688 1. variance. and coefficient of variation. Are the data skewed? If so. and third quartile.65 0.522 1.52 1. interquartile range.75 0.75 0.30 2. range.752 1.10 1.382 8.05 6.61 inches.312 8.323 8.866 1.403 8.23 0.412 8.420 8.810 1.60 0.48 3. destructive testing is carried out to determine how much force is required to break the insulators.429 8. c.774 1.788 1.610 1.348 8.414 8. It is produced using a 250-ton progressive punch press with a wipe-down operation putting two 90-degree forms in the flat steel to make the trough. c.481 8.64 also produces electric insulators. Construct a side-by-side box-and-whisker plot. first quartile. a measurer. List the five-number summary.97 Central Office II Time to Clear Problems (minutes) 7.10 0. b.662 1.476 8. first quartile. What can you conclude about the number of troughs that will meet the company’s requirements of troughs being between 8.08 1. tipped on end sheets. calories. gathered.62 8. N = no promotion was held a. b. and third quartile.71 10. and coefficient of variation. protein in grams. b. c. Compute the mean.25 9. median.62 12. In a book manufacturing plant the WIP represents the time it takes for sheets from a press to be folded. and fat in grams for 97 varieties of dry and canned dog and cat food. canned dog food. Are the data for any of the types of food skewed? If so.42 11. On the basis of the results of (a) through (c). how? d.96 4. for the variables of cost per serving. What conclusions can you reach concerning the cost per ounce in cents. median. 3.92 11. d.67 In many manufacturing processes the term “work-inprocess” (often abbreviated WIP) is used.75 15. Boyd and T. first quartile.70 Do marketing promotions. 
standard deviation. NY 10703–1057. how? . Discuss the results of (a) through (c) and comment on the effectiveness of promotions at Royals’ games during the 2002 season. The data in the file TUITION include the difference in tuition between 2002–2003 and 2003–2004 for in-state students and outof-state students. fiber in grams. Inc. and canned cat food). Construct a side-by-side box-and-whisker plot.21 6.” Sport Marketing Quarterly. 3. 173–183). and third quartile. 33–34.29 13. Compute the range.54 8. Compute the mean.58 5. c. 12(2003). Compute the range. “Promotion Timing in Major League Baseball and the Stacking Effects of Factors that Increase Game Attractiveness. median. calories. Krehbiel. Are the data skewed? If so. b. Construct a graphical display containing two boxand-whisker plots. For each variable: a.62 25. increase attendance at Major League Baseball games? An article in Sport Marketing Quarterly reported on the effectiveness of marketing promotions (T. Compute the range. Calculate the mean and standard deviation of attendance for the 43 games where promotions were held and for the 37 games without promotions.25 5. standard deviation.00 2..41 11. The data file ROYALS includes the following variables for the Kansas City Royals during the 2002 baseball season: GAME = Home games in the order they were played ATTENDANCE = Paid attendance for the game PROMOTION—Y = a promotion was held.46 16. WIP Plant A 5.50 7. interquartile range.S. one for the 43 games where promotions were held and one for the 37 games without promotions. February 1998. protein in grams. a. What conclusions can you reach concerning the difference in tuition between 2002–2003 and 2003–2004 for in-state students and out-of-state students? 3. first quartile.42 10.29 7. Adapted with permission from Consumer Reports. how? d. and coefficient of variation.54 11. sewn. Construct a five-number summary for the 43 games where promotions were held and for the 37 games without promotions. 
and third quartile.46 21. Compute the mean. For the four types of food (dry dog food.75 12. Source: Extracted from Copyright 1998 by Consumers Union of U. and sugar in grams for 33 breakfast cereals. c.45 8. dry cat food and canned cat food). such as bobble-head giveaways.Chapter Review Problems 3. c. Compute the range. interquartile range. Inc. cups per can. how? d. standard deviation. October 1999.33 14. Construct a box-and-whisker plot. variance. and bound.58 9.69 State budget cuts forced a rise in tuition at public universities during the 2003–2004 academic year. canned dog food.37 6. fiber in grams.. 18–19..62 7. Construct a side-by-side box-and-whisker plot for the four types (dry dog food.50 7. interquartile range.71 For each of the two plants: a. standard deviation. variance.. Are the data skewed? If so. The following data represent samples of 20 books at each of two production plants and the processing time (operationally defined as the time in days from when the books came off the press to when they were packed in cartons) for these jobs. c.S.25 10. Construct a box-and-whisker plot of the difference in tuition between 2002–2003 and 2003–2004 for in-state students and out-of-state students.29 7. C. b.62 5.04 5.17 13.29 16.71 The data contained in the file PETFOOD2 consist of the cost per serving. first quartile. dry cat food.13 13.41 14. Adapted with permission from Consumer Reports. variance. b. first quartile. Compute the mean.92 Plant B 9. Yonkers. variance. and coefficient of variation for the difference in 117 tuition between 2002–2003 and 2003–2004 for in-state students and out-of-state students.46 9. and fat in grams: a. Are the data skewed? If so. and the sugar in grams for the 33 breakfast cereals? 3. median. Source: Extracted from Copyright 1999 by Consumers Union of U. Yonkers. NY 10703–1057. and third quartile for the difference in tuition between 2002–2003 and 2003–2004 for in-state students and out-of-state students. 
and coefficient of variation. are there any differences between the two plants? Explain.68 The data contained in the file CEREALS consists of the cost in dollars per ounce. C. interquartile range. accelerated-life testing is conducted at the manufacturing plant. percentage of homes with eight or more rooms. radio. national and other local expenses.S. and percentage of mortgage-paying homeowners whose housing costs exceed 30% of income: a. April 2002. a shingle should experience no more than 0. local television. and color photo cost. color photo time. c. how? d. c. a shingle is repeatedly scraped with a brush for a short period of time and the amount of shingle granules that are removed by the brushing is weighed (in grams). and third quartile. Based on the results of (a). first quartile. What conclusions about the relationship of energy cost and filter cost to the price of the air cleaners can you make? Source: Extracted from “Portable Room Air Cleaners. variance. interquartile range. median. the file BB2001 contains team-by-team statistics on ticket prices. Yonkers. NY 10703–1057. Compute the coefficient of correlation between price and filter cost. List the five-number summary for the Boston shingles and for the Vermont shingles.76 The data in the file PRINTERS represent the price. text cost. What conclusions can you reach concerning the regular season gate receipts.” Copyright © 2002 by Consumers Union of U.. c. text cost. Are the data skewed? If so. and coefficient of variation. all other operating revenue. c. and color photo cost of computer printers. Yonkers.. median household income. median.. interquartile range. Inc. and luggage capacity. canned dog food. What conclusions can you reach concerning any differences among the four types (dry dog food.. Construct a box-and-whisker plot. Shingles that experience low amounts of granule loss are expected to last longer in normal use than shingles that experience high amounts of granule loss. 
How strong is the relationship between these two variables? e. For each of the variables of average travel-to-work time in minutes.. In this test.73 The data in the file STATES represent the results of the American Community Survey.75 The data in the file AIRCLEANERS represent the price. regular season gate receipts.. What conclusions can you reach concerning the average travel-to-work time in minutes. Adapted with permission from Consumer Reports. a. 3. and fans complaining about how expensive it is to attend a game and watch games on cable television.000 households taken in each state during the 2000 U. and 140 measurements made on Vermont shingles. a. width.118 CHAPTER THREE Numerical Descriptive Measures d. standard deviation. . Compute the range.72 The manufacturer of Boston and Vermont asphalt shingles provide their customers with a 20-year warranty on most of their products. Construct side-by-side box-and-whisker plots for the two brands of shingles and describe the shapes of the distributions. dry cat food. weight. median household income. Yonkers. Compute the mean. Construct a box-and-whisker plot.77 You want to study characteristics of the model year 2002 automobiles in terms of the following variables: miles per gallon. b. a. Source: Extracted from “Printers.S. b. do you think that any of the other variables might be useful in predicting printer price? Explain. yearly filter energy cost. Census. Compute the mean. Compute the coefficient of correlation between price and energy cost. 3.74 The economics of baseball has caused a great deal of controversy with owners arguing that they are losing money. local television. The data file GRANULE contains a sample of 170 measurements made on the company’s Boston shingles. the fan cost index. Are the data skewed? If so. players arguing that owners are making money. all other operating revenue. and percentage of mortgage-paying homeowners whose housing costs exceed 30% of income? 3. b. 
Compute the correlation between the number of wins and player compensation and benefits. color photo time.S. Adapted with permission from Consumer Reports. AUTO2002 Source: Extracted from “The 2002 Cars. and canned cat food)? 3.8 grams or less. radio. b. 47. Accelerated-life testing exposes the shingle to the stresses it would be subject to in a lifetime of normal use in a laboratory setting via an experiment that takes only a few minutes to conduct. first quartile. For each of these variables.” Copyright © 2002 by Consumers Union of U. To determine whether a shingle will last as long as the warranty period. and coefficient of variation. how? d. NY 10703–1057. 51.S. and income from baseball operations. a sampling of 700. Adapted with permission from Consumer Reports. February 2002. and income from baseball operations? 3. length. text speed. Inc. player compensation and benefits. 3. Compute the range. national and other local expenses. percentage of homes with eight or more rooms. March 2002.8 grams of granule loss if it is expected to last the length of the warranty period. Compute the coefficient of correlation between price and each of the following: text speed. In addition to data related to team statistics for the 2001 season. NY 10703–1057. variance. In this situation. and cable receipts. standard deviation. and yearly filter cost of room air cleaners. turning circle requirement. Inc.” Copyright © 2002 by Consumers Union of U. Comment on the shingles’ ability to achieve a granule loss of 0. b. a. and cable receipts. and third quartile. player compensation and benefits. and more complex patients. interquartile range. c. b. standard deviation. Medicaid.000 0 N/A Coronary bypass 119 Simple birth Hip replacement El Camino costs are the average of high and low charges for a simple birth with a two-day stay and a hip replacement with a nine-day stay. Sequoia and El Camino Hospitals are Stanford Medical Center’s main local competition. first quartile. 
For each of these variables:
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a box-and-whisker plot. Are the data skewed? If so, how?
d. What conclusions can you draw concerning the 2002 automobiles?

3.78 Refer to the data of problem 3.77. You want to compare sports utility vehicles (SUVs) with non-SUV vehicles in terms of miles per gallon, length, width, turning circle requirement, weight, and luggage capacity. For SUVs and non-SUV vehicles, for each of these variables:
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a side-by-side box-and-whisker plot. Are the data for any of the variables skewed? If so, how?
d. What conclusions can you reach concerning differences between SUVs and non-SUVs?

3.79 Zagat’s publishes restaurant ratings for various locations in the United States. The data file RESTRATE contains the Zagat rating for food, decor, service, and the price per person for a sample of 50 restaurants located in New York City and 50 restaurants located on Long Island.
Source: Extracted from Zagat Survey 2002 New York City Restaurants and Zagat Survey 2002 Long Island Restaurants.
For New York City and Long Island restaurants, for the variables of food rating, decor rating, service rating, and price per person:
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a side-by-side box-and-whisker plot for the New York City and Long Island restaurants. Are the data skewed? If so, how?
d. What conclusions can you reach concerning differences between New York City and Long Island restaurants?

3.80 As an illustration of the misuse of statistics, an article by Glenn Kramon (“Coaxing the Stanford Elephant to Dance,” The New York Times Sunday Business Section, November 11, 1990) implied that costs at Stanford Medical Center had been driven up higher than at competing institutions because the former was more likely than other organizations to treat indigent, Medicare, and sicker patients. The chart below was provided to compare the average 1989 to 1990 hospital charges for three medical procedures (coronary bypass, simple birth, and hip replacement) at three competing institutions (El Camino, Sequoia, and Stanford).

[Figure: bar chart titled “What Health Care Costs: A comparison of average 1989–90 hospital charges in California for various operations.” Bars for El Camino, Sequoia, and Stanford; vertical axis in dollars, with ticks from 10,000 to 50,000. Chart notes: Stanford data are the average cost of all operations. Sequoia costs are averages of the middle 50% of all charges for each operation. Source: Stanford Medical Center, Sequoia Hospital, and El Camino Hospital.]

Suppose you were working in a medical center. Your CEO knows you are currently taking a course in statistics and calls you in to discuss this. She tells you that the article was presented in a discussion group setting as part of a meeting of regional area medical center CEOs last night and that one of them mentioned that this chart was totally meaningless and asked her opinion. She now requests that you prepare her response. You smile, take a deep breath, and reply . . .

3.81 You are planning to study for your statistics examination with a group of classmates, one of whom you particularly want to impress. This individual has volunteered to use Microsoft Excel, Minitab, or SPSS to get the needed summary information, tables, and charts for a data set containing several numerical and categorical variables assigned by the instructor for study purposes. This person comes over to you with the printout and exclaims, “I’ve got it all (the means, the medians, the standard deviations, the box-and-whisker plots, the pie charts) for all our variables. The problem is, some of the output looks weird, like the box-and-whisker plots for gender and for major and the pie charts for grade point index and for height. Also, I can’t understand why Professor Krehbiel said we can’t get the descriptive stats for some of the variables. I got it for everything! See, the mean for height is 68.23, the mean for grade point index is 2.76, the mean for gender is 1.50, the mean for major is 4.33.” What is your reply?

Report Writing Exercises

3.82 The data found in the data file BEER represent the price of a six-pack of 12-ounce bottles, the calories per 12 fluid ounces, the percent of alcohol content per 12 fluid ounces, the type of beer (craft lagers, craft ales, imported lagers, regular and ice beers, and light and nonalcoholic beers), and the country of origin (U.S. versus imported) for each of the 69 beers that were sampled. Your task is to write a report based on a complete descriptive evaluation of each of the numerical variables (price, calories, and alcoholic content) regardless of type of product or origin. Then perform a similar evaluation comparing each of these numerical variables based on type of product: craft lagers, craft ales, imported lagers, regular and ice beers, and light or nonalcoholic beers. In addition, perform a similar evaluation comparing and contrasting each of these numerical variables based on the origins of the beers: those brewed in the United States versus those that were imported. Appended to your report should be all appropriate tables, charts, and numerical descriptive measures.
Source: Extracted from “Beers,” Copyright © 1996 by Consumers Union of U.S., Inc., Yonkers, NY 10703–1057. Adapted with permission from Consumer Reports, June 1996.

TEAM PROJECTS

The data file MUTUALFUNDS2004 contains information regarding 12 variables from a sample of 121 mutual funds. The variables are:
Fund: The name of the mutual fund
Category: Type of stocks comprising the mutual fund (small cap, mid cap, large cap)
Objective: Objective of stocks comprising the mutual fund (growth or value)
Assets: In millions of dollars
Fees: Sales charges (no or yes)
Expense ratio: Ratio of expenses to net assets in percentage
2003 Return: Twelve-month return in 2003
Three-year return: Annualized return 2001–2003
Five-year return: Annualized return 1999–2003
Risk: Risk-of-loss factor of the mutual fund classified as low, average, or high
Best quarter: Best quarterly performance 1999–2003
Worst quarter: Worst quarterly performance 1999–2003

3.83 For expense ratio in percentage, 2003 Return, three-year return, and five-year return:
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a box-and-whisker plot. Are the data skewed? If so, how?
d. What conclusions can you reach concerning these variables?

3.84 You wish to compare mutual funds that have fees to those that do not have fees. For each of these two groups, for the variables expense ratio in percentage, 2003 Return, three-year return, and five-year return:
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a box-and-whisker plot. Are the data skewed? If so, how?
d. What conclusions can you reach about differences between mutual funds that have fees and those that do not have fees?

3.85 You wish to compare mutual funds that have a growth objective to those that have a value objective. For each of these two groups, for the variables expense ratio in percentage, 2003 Return, three-year return, and five-year return:
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a box-and-whisker plot. Are the data skewed? If so, how?
d. What conclusions can you reach about differences between mutual funds that have a growth objective and those that have a value objective?

3.86 You wish to compare small cap, mid cap, and large cap mutual funds. For each of these three groups, for the variables expense ratio in percentage, 2003 Return, three-year return, and five-year return:
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a box-and-whisker plot. Are the data skewed? If so, how?
d. What conclusions can you reach about differences between small cap, mid cap, and large cap mutual funds?

RUNNING CASE
MANAGING THE SPRINGVILLE HERALD

For what variable in the Chapter 2 Managing the Springville Herald case (see page 62) are numerical descriptive measures needed? For the variable you identify:
1. Compute the appropriate numerical descriptive measures, and generate a box-and-whisker plot.
2. Identify another graphical display that might be useful and construct it. What conclusions can you form from that plot that cannot be made from the box-and-whisker plot?
3. Summarize your findings in a report that can be included with the task force’s study.

WEB CASE

Apply your knowledge about the proper use of numerical descriptive measures in this continuing Web Case from Chapter 2.
Visit the StockTout Investing Service Web site www.prenhall.com/Springville/StockToutHome.htm a second time and reexamine their supporting data and then answer the following:
1. Reexamine the data you inspected when working on the Web Case for Chapter 2. Can descriptive measures be computed for any variables? How would such summary statistics support StockTout’s claims? How would those summary statistics affect your perception of StockTout’s record?
2. Evaluate the methods StockTout used to summarize the results of its customer survey www.prenhall.com/Springville/ST_Survey.htm. Is there anything you would do differently to summarize these results?
3. Note that the last question of the survey has fewer responses. What factors may have limited the number of responses to that question?

Appendix 3 Using Software for Descriptive Statistics

A3.1 MICROSOFT EXCEL

For Descriptive Statistics
Use the Data Analysis ToolPak. Open to the worksheet containing the data you want to summarize. Select Tools → Data Analysis. From the list that appears in the Data Analysis dialog box, select Descriptive Statistics and click OK. In the Descriptive Statistics dialog box (see Figure A3.1), enter the cell range of the data in the Input Range box. Choose the Columns option and Labels in First Row if you are using data that are arranged like the data in the Excel files on the CD-ROM packaged with this text. Finish by selecting New Worksheet Ply, Summary statistics, Kth Largest, and Kth Smallest, and clicking OK. Results appear on a separate worksheet.

FIGURE A3.1 Data Analysis Descriptive Statistics Dialog Box

Or you can use any of these sample statistics worksheet functions in your own formulas, including AVERAGE (for mean), MEDIAN, MODE, QUARTILE, STDEV, VAR, MIN, MAX, SUM, COUNT, LARGE, or SMALL. To enter one of these functions into a worksheet, select an empty cell and then select Insert → Function. In the Function dialog box, select Statistical from the drop-down list and then scroll to and select the function you want to use. Click OK. In the Function Arguments dialog box, enter the cell range of the data to be summarized and click OK. (For LARGE and SMALL, enter 1 as the K value, and for QUARTILE, enter either 1 or 3 as the Quart value, for either first or third quartile.) In versions of Microsoft Excel earlier than Excel 2003, you may encounter errors in results when using the QUARTILE function.

For Box-and-Whisker Plot
See section G.5 (Box-and-Whisker Plot) if you want PHStat2 to produce a box-and-whisker plot as a Microsoft Excel chart. (There are no Microsoft Excel commands that directly produce box-and-whisker plots.)

For Covariance
Open the Covariance.xls Excel file, shown in Figure 3.7 on page 107. Follow the onscreen instructions for modifying the table area if you want to use this worksheet with other pairs of variables. Note in Figure 3.7 that cell C15 contains a formula that uses the COUNT function. This allows Excel to automatically update the value of n when the size of the table area is changed and ensures that the n − 1 term is always correct.

For Coefficient of Correlation
Open the Correlation.xls Excel file, shown in Figure 3.10 on page 110. The worksheet uses the CORREL function to calculate the coefficient of correlation. As shown in Figure 3.10, since the covariance, SX, and SY already appear in the worksheet, the formula =E17/(E18 * E19) could also be used in this particular worksheet to calculate the statistic. Follow the onscreen instructions for modifying the table area if you want to use this worksheet with other pairs of variables. Note in Figure 3.10 that cell E16 contains a formula that uses the COUNT function. This allows Excel to automatically update the value of n when the size of the table area is changed, and ensures that the n − 1 term is always correct.

A3.2 MINITAB

Computing Descriptive Statistics
To produce descriptive statistics for the 2003 return for different risk levels shown in Figure 3.3 on page 90, open the MUTUALFUNDS2004.MTW worksheet. Select Stat → Basic Statistics → Display Descriptive Statistics.
Step 1: In the Display Descriptive Statistics dialog box (see Figure A3.2), enter C7 or ‘Return 2003’ in the Variables: edit box. Enter C10 or Risk in the By variables (optional): edit box.

FIGURE A3.2 Minitab Display Descriptive Statistics Dialog Box

Step 2: Select the Statistics button. In the Display Descriptive Statistics - Statistics dialog box (see Figure A3.3), select the Mean, Standard deviation, Coefficient of variation, First quartile, Median, Third quartile, Interquartile range, Minimum, Maximum, Range, and N total (the sample size) check boxes. Click the OK button to return to the Display Descriptive Statistics dialog box. Click the OK button again to compute the descriptive statistics.

FIGURE A3.3 Minitab Display Descriptive Statistics - Statistics Dialog Box

Using Minitab to Create a Box-and-Whisker Plot
To create a box-and-whisker plot for the 2003 return for different risk levels shown in Figure 3.5 on page 104, open the MUTUALFUNDS2004.MTW worksheet. Select Graph → Boxplot.
Step 1: In the Boxplots dialog box (see Figure A3.4), select the One Y With Groups choice. (If you want to create a box-and-whisker plot for one group, select the One Y Simple choice.) Click the OK button.

FIGURE A3.4 Minitab Boxplots Dialog Box

Step 2: In the Boxplot - One Y, With Groups dialog box (see Figure A3.5), enter C7 or ‘Return 2003’ in the Graph variables: edit box. Enter C10 or Risk in the Categorical variables edit box. Click the OK button.

FIGURE A3.5 Minitab Boxplots - One Y, With Groups Dialog Box

The output will be similar to Figure 3.5 on page 104.

Calculating a Coefficient of Correlation
To compute the coefficient of correlation for the expense ratio and the 2003 return for all the mutual funds, open the MUTUALFUNDS2004.MTW worksheet. Select Stat → Basic Statistics → Correlation. In the Correlation dialog box (see Figure A3.6), enter C6 or ‘Expense ratio’ and C7 or ‘Return 2003’. Click the OK button.

FIGURE A3.6 Minitab Correlation Dialog Box
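The summary measures that the review problems request, and that the Excel worksheet functions AVERAGE, MEDIAN, QUARTILE, VAR, and STDEV compute, can also be sketched outside a spreadsheet. The following is a minimal Python illustration, not part of the text's Excel or Minitab instructions. The function name describe and the sample values are hypothetical, and the quartile rule used (positions at (n + 1) times p in the ordered data, with interpolation) is an assumption that may differ slightly from the rounding rules used elsewhere in the chapter or by a given software package.

```python
import statistics


def describe(values):
    """Return the measures asked for in parts (a) and (b) of the review problems."""
    data = sorted(values)
    mean = statistics.mean(data)
    # method="exclusive" locates quartiles at positions (n + 1) * p in the ordered data
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    variance = statistics.variance(data)  # sample variance (divides by n - 1)
    std_dev = statistics.stdev(data)      # sample standard deviation
    return {
        "mean": mean,
        "median": median,
        "q1": q1,
        "q3": q3,
        "range": data[-1] - data[0],
        "iqr": q3 - q1,
        "variance": variance,
        "std_dev": std_dev,
        "cv_percent": std_dev / mean * 100,  # coefficient of variation, in percent
    }


# Hypothetical returns (in percent) for seven funds; NOT taken from MUTUALFUNDS2004.
sample = [12.0, 15.0, 9.0, 21.0, 18.0, 30.0, 14.0]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For a skewness check like the one in part (c), comparing the mean with the median in this output gives a first indication: a mean noticeably above the median suggests right-skewed data.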

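The appendix notes that the coefficient of correlation can be obtained by dividing the covariance by the product of the two standard deviations, with n − 1 in each denominator. A minimal Python sketch of that arithmetic follows. The function names and the data values are hypothetical, chosen only for illustration, and the code is not the worksheet's own implementation.

```python
import math


def sample_covariance(x, y):
    """Sample covariance: sum of cross-products of deviations divided by n - 1."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 values")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def sample_std_dev(values):
    """Sample standard deviation: square root of the covariance of a variable with itself."""
    return math.sqrt(sample_covariance(values, values))


def correlation(x, y):
    """Coefficient of correlation: covariance / (S_X * S_Y)."""
    return sample_covariance(x, y) / (sample_std_dev(x) * sample_std_dev(y))


# Hypothetical expense ratios (percent) and 2003 returns (percent) for five funds.
expense_ratio = [0.8, 1.0, 1.2, 1.5, 2.0]
return_2003 = [30.0, 34.0, 33.0, 38.0, 45.0]

print(f"covariance = {sample_covariance(expense_ratio, return_2003):.4f}")
print(f"correlation = {correlation(expense_ratio, return_2003):.4f}")
```

Computing the count inside each function plays the same role as the COUNT cell in the worksheets: the n − 1 divisor stays correct when the number of paired observations changes.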