INES-RUHENGERI
Faculty of Fundamental Applied Sciences
Department of Applied Statistics

LECTURE NOTES: DESCRIPTIVE STATISTICS I
LEVEL I APPLIED STATISTICS, 2010
BY Ir. DANCILLE NYIRARUGERO, Tutorial Assistant

COURSE OBJECTIVE
At the end of this course, students must know the vocabulary, concepts, and procedures used in statistical studies. Because statistical procedures are basic to research, students may be called on to conduct research in their own fields. To do so, they must be able to collect, organize, analyze, summarize, and present data, and to communicate the results of a study in their own words. Students must also be able to determine measures of central tendency, dispersion, and position.

COURSE CONTENTS
• Chapter 1: Introduction, definitions and statistics vocabulary;
• Chapter 2: Frequency distributions and graphs: organizing data, histograms, frequency polygons and ogives, other types of graphs;
• Chapter 3: Data description: measures of central tendency, measures of dispersion (variation), measures of position;
• Chapter 4: Exploratory data analysis: box plot, moments, skewness and kurtosis, contingency table, presentation and charts.

CHAPTER 1: INTRODUCTION, DEFINITIONS AND STATISTICS VOCABULARY

1.1 Introduction
Statistics refers to the collection, organization, presentation, analysis, and interpretation of numerical data in order to make inferences and reach decisions in all branches of economics, business, medicine, and other social and physical sciences.

A. Definition: the meaning of statistics
The word statistics has two meanings:
1. In the plural sense, statistics are a numerical description of the quantitative aspects of things. The word stands for numerical facts pertaining to a collection of objects.
2. In the singular sense, statistics is the science of collecting, organizing, presenting, analyzing, and interpreting numerical data to assist in making more effective decisions.
The term statistics is therefore used to mean either statistical data or statistical method. When it is used in the sense of statistical data, it refers to the quantitative aspects of things and is a numerical description. Not all data are statistics: data must fulfil certain essential characteristics to be called statistics.

B. Branches of statistics
Statistics is subdivided into two branches: descriptive statistics and inductive (or inferential) statistics.
(i) Descriptive statistics consists of the collection, organization, summarization, and presentation of data in various forms such as tables, graphs and diagrams, or as a numerical summary. Its purpose is to display and pass on information from which conclusions can be drawn and decisions made. Businesses, for example, use descriptive statistics when presenting their annual accounts and reports.
(ii) Inferential or inductive statistics consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making decisions.

Statistics
  Describing data: numerical summaries, visual display
  Making inferences from samples: estimating parameters, testing hypotheses

1.2 Characteristics of statistics
The characteristics of statistics are the following:
A. Statistics are an aggregate of facts. Facts can be analyzed only when there is more than one of them; a single fact cannot be analyzed. Example: the weights of 60 students of a class can be statistically analyzed, but the weight of one student cannot be called statistics. Hence, only a collection of many facts can be called statistics.
B. Statistics are affected to a marked extent by a multiplicity of causes. The facts are the result of the action and interaction of a number of factors.
C. Statistics are numerically expressed. Only numerical facts can be statistically analyzed. Therefore, a qualitative statement such as "prices decrease with increasing production" cannot be called statistics.
D. Statistics are enumerated or estimated according to reasonable standards of accuracy. The facts should be enumerated or estimated with the required degree of accuracy, which differs from purpose to purpose.
E. Statistics are collected in a systematic manner. The facts should be collected according to planned and scientific methods. Otherwise, they are likely to be wrong and misleading.
F. Statistics are collected for a predetermined purpose. There must be a definite purpose for collecting facts. Otherwise, the facts become useless and cannot be called statistics.
G. Statistics are placed in relation to each other. The facts must be arranged in such a way that a comparative and analytical study becomes possible. Thus, only related facts arranged in a logical order can be statistics.

1.3 Functions of statistics
The following are the six important functions of the science of statistics:
i) To present facts in a precise and definite form (i.e., it helps proper comprehension and avoids ambiguity). Statistics adds precision to thinking.
ii) To simplify a mass of figures (i.e., to condense the mass of data).
iii) To facilitate comparison (by furnishing suitable devices). Statistics helps in comparing different sets of figures. For example, the imports and exports of a country may be compared with each other, or with those of another country.
iv) To help in the formulation and testing of hypotheses (by appropriate statistical tools). Statistics helps in studying the relationship between different factors. For example, statistical methods may be used to study the relation between the production and the price of commodities.
v) To help in framing suitable policies and plans (i.e., in making predictions). Statistics indicates trends and tendencies, and knowledge of trends and tendencies helps future planning.
vi) To help in the formulation of policies (i.e., to provide the basic material). Planning and policy making by the government are based on statistics of production, demand, etc.

Limitations of statistics
Statistics deals only with those subjects of inquiry that can be quantitatively measured and numerically expressed. This is an essential condition for the application of statistical methods.

1.4 Origin of statistics
The term statistics is linked to the notion of the State: it comes from the Latin status, which gave the Latin word statisticum. Statisticum was the activity of collecting data that helped a government know the income and possessions of the state.
The history of statistics shows that the first census was carried out in the Sumerian Kingdom (Babylon) around 3000 BC. Around 2500 BC, data were collected in Egypt for tax purposes. In 2238 BC, an agricultural survey was carried out in China by King Yao.
Statistics originated from two quite dissimilar fields: games of chance and political states. These two fields gave rise to two disciplines:
1°. the first, primarily analytical;
2°. the second, essentially descriptive.
Some of the pioneers of statistics are Pascal (1623-1662) and Bernoulli (1654-1705).
As regards the descriptive side, statistics is as old as statecraft. Since time immemorial, people must have been compiling information about wealth and manpower for purposes of peace and war. This activity expanded considerably at each upsurge of social and political development and received added impetus in periods of war.
The development of statistics can be divided into three stages: the empirical stage (up to 1600), the comparative stage (1600-1800), and the modern stage (1800 to the present). Statistics has now become a useful tool, and statistical methods of analysis are increasingly used in biology, psychology, education, economics and business.

1.5 Statistics vocabulary
• Subject or individual: an item for study;
• Population or universe: all the subjects (the totality of all observations) that are being studied;
• Statistical units: the individual subjects or objects on which the data are collected;
• Raw data: collected data that have not yet been organized numerically;
• Array: an arrangement of raw numerical data in ascending or descending order of magnitude;
• Frequency: the number of values in a specific class of the distribution;
• Variable: a characteristic of the subject or individual that varies from unit to unit. Examples: height, weight, age, etc.

1.6 Types of variables
There are two main types of variables: qualitative and quantitative.

A. Qualitative variable
A qualitative variable is one that, generally, cannot be expressed in numbers. It is an attribute and is descriptive in nature. Examples: sex (male or female), state of birth, cause of death, religion (Catholic, Protestant, etc.).
When the data are qualitative, we are usually interested in how many observations, or what proportion, fall in each category. Qualitative data are often summarized in charts and bar graphs.

B. Quantitative variable
A quantitative variable is numerical and can be ordered or ranked. Examples: level of hemoglobin in the blood, age, height, weight, body temperature, the number of children in a family.
A quantitative variable can be discrete or continuous.
Discrete variables assume values that can be counted and represented by an integer such as 1, 2, 3, etc. Examples: the number of children in a family, the number of rooms in a house, the number of patients in a hospital.
Continuous variables can assume all values between any two specific values (within an interval). They are obtained by measuring. Examples: height, weight, age, level of protein in the blood.

Figure 1.1: Summary of the types of variables
Types of variables
  Qualitative: gender, color, marital status
  Quantitative
    Discrete: children in a family, cows on a farm, patients in a hospital
    Continuous: age, weight, height, time

1.7 Levels of Measurement
Data can be classified according to levels of measurement.
The level of measurement of the data often dictates the calculations that can be done to summarize and present the data. There are four levels of measurement: nominal, ordinal, interval and ratio.

a) Nominal-level data or nominal measurement
The word nominal comes from the Latin nomen, meaning name. Nominal data are the same as qualitative, attribute, categorical, or classification data. At the nominal level, the data are classified into categories and cannot be arranged in any particular order. Examples: gender, eye color, religious affiliation, marital status.
Nominal-level variables must be mutually exclusive and exhaustive.
- Mutually exclusive means that an individual or object is included in only one category.
- Exhaustive means that each individual or object must appear in a category.
To summarize, nominal-level data have the following properties:
- Data categories are mutually exclusive and exhaustive.
- Data categories have no logical order.
Example: list of jobs in Rwanda, consumption in Rwanda...
We usually code nominal data numerically. However, the codes are arbitrary placeholders with no numerical meaning, so it is improper to perform mathematical analysis on them. Example: "yes" coded as 1 and "no" coded as 2.

b) Ordinal-level data or ordinal measurement
Ordinal-level data are arranged in some order, but the differences between data values cannot be determined.
Example: when grading a student dissertation, we can use the ratings superior, good, average, poor, inferior.
- The data classifications are mutually exclusive and exhaustive.
- Data classifications are ranked or ordered according to the particular trait they possess.

c) Interval-level data or interval measurement
This kind of data is acquired through a process of measurement in which equal measuring units are employed. The change in magnitude from one measure to the one immediately above or below it is identical throughout the population under consideration.
Interval data have all the characteristics of nominal and ordinal data; the only difference is that the scale of measurement moves uniformly in equal intervals, and the values can be real numbers with several decimal places. Examples: temperature, shoe size.
- Data classifications are mutually exclusive and exhaustive.
- Data classifications are ordered according to the amount of the characteristic they possess.
- Equal differences in the characteristic are represented by equal differences in the measurements.

d) Ratio-level data or ratio measurement
Practically all quantitative data are at the ratio level of measurement. The ratio level is the "highest" level of measurement. It has all the characteristics of the interval level, and in addition the zero point is meaningful and the ratio between two numbers is meaningful. Examples: wages, weight, etc.
- Data classifications are mutually exclusive and exhaustive.
- Data classifications are ordered according to the amount of the characteristic they possess.
- Equal differences in the characteristic are represented by equal differences in the numbers assigned to the classifications.
- The zero point is the absence of the characteristic.

Figure 1.2: Summary of the characteristics of the levels of measurement

Level      Characteristic                                    Example
Nominal    Data may only be classified                       Type of residence (rural, urban)
Ordinal    Data are ranked                                   Rank in class
Interval   Meaningful difference between values              Temperature
Ratio      Meaningful zero point and ratio between values    Number of patients

1.8 Statistical method
The following procedure may be adopted with advantage:
• Collect the data: gather the relevant information;
• Organize the data obtained;
• Present this information by means of diagrams or other visual aids;
• Analyze the data to determine the average and the extent of the disparities that exist;
• Interpret the facts in order to understand the phenomenon.
All this leads to a policy decision for the improvement of the existing situation.

1.9 Collection of data
Statistics is concerned with the analysis of numerical data, so the first stage in the statistical method must be the collection of the data to be analyzed. Data can be collected in two ways: as primary data or as secondary data.

a) Primary data
Primary data are data collected by the investigator himself with a specific objective. This means that primary data are original in character. Sources of primary data are either censuses or samples.

Census
A census is a survey that examines every item of the population. Three important official censuses are the population census, the census of distribution and the census of production. A census has the advantages of being complete and of being accepted as representative, but it must be paid for in terms of manpower, time and resources.

Sample
A sample is a relatively small subset of a population. Its advantage over a census is that it requires much less money, time and resources. A sample is used when it is impossible or impractical to observe the entire group or population. Its main disadvantage is that laymen may not readily accept its results.

b) Secondary data
Secondary data are data that have already been collected by some other investigator or agency and are used by an investigator for his own purpose. As far as the investigator is concerned, the data come from a secondary source; that is, he did not collect them. The prime example of secondary data is the official statistics published by the Government: financial statistics, economic trends, etc.
The advantages of using secondary data are savings in time, manpower and resources in sampling and data collection.

The dangers of secondary data
If we have to use secondary data, there are dangers to be aware of:
(i) The data available may not be very up to date.
(ii) We do not necessarily know how the data were collected and analyzed, or for what reason.
They may be biased because of poor collection techniques or simply because they were collected for a different purpose.
(iii) We may not be able to find a complete set of data for our purposes in one place. We would then have to collate data from several sources, with the chance of making errors while doing so. Obtaining the data from more than one source may also compound the chances of bias discussed in (ii).
(iv) There is a distinct possibility of transcription or printing errors in published data.
If you are using secondary data to support arguments in reports, articles or essays, it is advisable to find out more about how the data were collected and analyzed, and why they were collected. This means that, before using secondary data, it is necessary to scrutinize them in the light of the following points:
(i) The type and purpose of the institution that publishes the statistics as a routine;
(ii) The purpose for which the data are issued and the consumers to whom they are addressed;
(iii) The nature of the data themselves. Are the data biased? Are the data from samples only or from a complete enumeration?
(iv) In what types of units are the data expressed? Are the units the same at different times, at different places, and for all cases at the same time or place?
(v) Are the data accurate?
(vi) Do the data refer to homogeneous conditions?
(vii) Are the data germane to the problem under study?

1.10 Misuse of statistics
The figures themselves cannot mislead, but the statisticians who present the figures certainly can. Data can be misused in the following ways:
(i) They can be used for the wrong purpose, that is, one that is different from the purpose for which they were collected.
(ii) They can be collected incorrectly, so that they are biased.
(iii) They can be analyzed carelessly, so that the results obtained from them are misleading.

1.11 Data classification
The data collected from a sample are generally referred to as raw data, because they are not arranged or organized in any format. Raw data convey very little information to the investigator or to anyone interested in the investigation. Therefore, the mass of numbers must be classified.
Classification is the process of arranging the available facts into groups or classes according to their resemblances, affinities and other relationships.
The main objectives of classifying data are:
1. To condense the mass of data into a concise format;
2. To bring out the relevant points of similarity and dissimilarity, and thus facilitate comparison;
3. To make the statistical treatment of the data easy.

Types of classification
Generally, classification of data may be of the following types: spatial or geographical, temporal or chronological, qualitative, and quantitative.
• Spatial or geographical classification: this classification is based on space, that is, on geographical location. For example, data on human population may be classified by continent, by country, by the states of a country, by the districts of a state, or by the towns and villages of a district.
• Temporal or chronological classification: data are arranged on the basis of time (years, months, days, hours, minutes and seconds).
• Qualitative classification: data are classified on the basis of a quality or attribute such as sex, colour, behaviour, religion, marital status, literacy, etc.
• Quantitative classification: data are classified according to some variable (characteristic) that can be measured, such as height or weight. In this type of classification there are two elements: the variable and the frequency.
Classification of units on the basis of one variable is called simple or one-way classification. Simultaneous classification of units on the basis of two variables is called two-way classification.
A table that presents a two-way classification is called a contingency table.

CHAPTER 2: FREQUENCY DISTRIBUTIONS AND CHARTS

2.1 FREQUENCY DISTRIBUTIONS

2.1.1 Introduction
After collecting the data, the researcher must organize and present them so that they can be understood by those who will benefit from reading the study. The most convenient method of organizing data is to construct a frequency distribution. The most useful method of presenting the data is to construct charts and graphs.
This chapter describes how to organize data by constructing frequency distributions and how to present the data by constructing charts and graphs. The charts and graphs illustrated here are histograms, frequency polygons, ogives, and pie graphs.

2.1.2 Organizing Data
Before the data obtained from a statistical survey or investigation have been worked on, they are called raw data. Little information can be obtained from looking at raw data. The following table gives an example of a set of raw data.

Table 2.1 Marks in Statistics obtained by 20 students of Level I STEA in 2004 (data as originally collected)
15 18  7 12 17  9 13 14 12 14
16 11 10  8  9 16 13 14 10  8

To make the data easily understandable, the first task of the researcher is to prepare an "array". The array is prepared by arranging the values of the variable in ascending or descending order. A data array gives a general idea of the distribution.
Example: the raw data of Table 2.1 have been arrayed and are shown in Table 2.2.

Table 2.2 Raw data of Table 2.1 put into an array
 7  8  8  9  9 10 10 11 12 12
13 13 14 14 14 15 16 16 17 18

From this table, the highest and lowest marks are seen immediately, and the marks that occur most frequently are readily identified.
After arranging the data, their bulk must be condensed, reduced, and simplified so that the mind comprehends them easily.
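The passage from Table 2.1 to Table 2.2 can be sketched in a few lines of Python. This is only an illustration: the variable names (`raw_marks`, `array`, `most_frequent`) are not part of the course notes, and sorting in ascending order is one of the two orders the definition allows.

```python
# Marks from Table 2.1 (raw data, in the order they were collected).
raw_marks = [15, 18, 7, 12, 17, 9, 13, 14, 12, 14,
             16, 11, 10, 8, 9, 16, 13, 14, 10, 8]

# An array is the raw data arranged in ascending (or descending) order.
array = sorted(raw_marks)

# Once the data are arrayed, the extremes sit at the two ends.
lowest, highest = array[0], array[-1]

# The most frequent mark is the value with the largest count in the array.
most_frequent = max(set(array), key=array.count)

print(array)
print("lowest:", lowest, "highest:", highest, "most frequent:", most_frequent)
```

The printed array should agree with Table 2.2, which is a quick way to check an array prepared by hand.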
A first step in such a condensation is to represent the repetitions of a particular value by tallies instead of rewriting the value itself. The number of tallies corresponding to a given value is the frequency of that value, usually represented by the letter f. Frequency thus means the number of times a certain value of the variable is repeated in the given data. A table formed in this way is known as a frequency distribution. In other words, a frequency distribution is the organization of raw data in table form, using classes and frequencies.

Statistical table
A statistical table presents numerical data in columns and rows. The main object of a statistical table is to arrange the physical presentation of numerical facts so that the attention of the reader is automatically directed to the information.
Some of the advantages of statistical tables are:
• Tabulated data are more easily understood than facts stated in the form of descriptions;
• They facilitate quick comparison;
• They leave a lasting impression;
• They make the summation of items and the detection of errors and omissions easier;
• A tabular arrangement makes it unnecessary to repeat explanations, phrases and headings;
• All unnecessary details and repetitions are avoided.

2.1.3 Types of frequency distributions
There are two types of frequency distributions: the simple frequency distribution (or one-way table) and the grouped frequency distribution.

A. Simple frequency distributions
A simple frequency distribution consists of a list of data values, each showing the number of items having that value.

a) Quantitative variable

Variable X    Frequency
x_1           n_1
...           ...
x_i           n_i
...           ...
x_p           n_p
Total         N

Example: the following data are the marks of a class in a probability exam in 2005.
16, 14, 5, 8, 15, 15, 9, 12, 10, 9, 11, 11, 10, 17, 12, 10, 14, 5

Table 2.3 Frequency distribution of the marks obtained by 18 students

Marks x_i    Tally marks    Frequency n_i
5            II             2
8            I              1
9            II             2
10           III            3
11           II             2
12           II             2
14           II             2
15           II             2
16           I              1
17           I              1
Total N                     18

A tally chart is used to record the occurrence of repeated values systematically.

b) Qualitative variable
Example: the experiment consists of finding the number of students in Level I Statistics in 2010 according to their sex. There are 45 students in Level I STA; gender is coded as G for girl and B for boy.

Table 2.4 Distribution of 45 students in Level I STA according to their sex in 2009

Sex      Tally marks    Frequency n_i
B
G
Total

To convert a frequency distribution into a relative frequency distribution, each frequency is divided by the total number of observations. When a relative frequency is multiplied by one hundred, it gives a percentage; the result is a percentage distribution.
A cumulative frequency distribution is used when we require information on the number of observations whose characteristic is less than a given value. It is obtained by adding the numbers of observations cumulatively, value by value. Cumulative distributions may also be constructed for relative frequencies and percentages, by adding either the relative frequencies or the percentages cumulatively, as is done for absolute frequencies.

Relative frequency: $f_i = \dfrac{n_i}{N}$, and the total of the $f_i$ equals 1: $\sum_{i=1}^{p} f_i = 1$

Percentage: $f_i\% = \dfrac{n_i}{N} \times 100$, and the total of the $f_i\%$ equals 100: $\sum_{i=1}^{p} f_i\% = 100$

Cumulative frequency: $\text{cum } n_i = \sum_{j=1}^{i} n_j$ or $\text{cum } f_i = \sum_{j=1}^{i} f_j$

Example 2.3:

Marks x_i    Frequency n_i    Relative frequency f_i    Cumulative frequency
5            2                0.111                     2
8            1                0.056                     3
9            2                0.111                     5
10           3                0.167                     8
11           2                0.111                     10
12           2                0.111                     12
14           2                0.111                     14
15           2                0.111                     16
16           1                0.056                     17
17           1                0.056                     18
Total N      18               1

B. Grouped frequency distribution
When the number of distinct data values in a set of raw data is large (more than 20), a simple frequency distribution is not appropriate, since it contains too much information to be easily assimilated.
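Before turning to grouped data, the relative and cumulative frequency formulas of a simple frequency distribution can be sketched in Python. The sketch below reuses the 20 marks of Table 2.1; the variable names are illustrative assumptions, and `collections.Counter` and `itertools.accumulate` are standard-library helpers chosen here for convenience, not something the course notes prescribe.

```python
from collections import Counter
from itertools import accumulate

# Marks of Table 2.1 (20 students).
marks = [15, 18, 7, 12, 17, 9, 13, 14, 12, 14,
         16, 11, 10, 8, 9, 16, 13, 14, 10, 8]

counts = Counter(marks)             # value -> number of occurrences
values = sorted(counts)             # distinct values x_i in ascending order
n = [counts[x] for x in values]     # absolute frequencies n_i
N = sum(n)                          # total number of observations

rel = [ni / N for ni in n]          # relative frequencies f_i = n_i / N
pct = [100 * fi for fi in rel]      # percentages f_i % = f_i * 100
cum = list(accumulate(n))           # cumulative frequencies: n_1 + ... + n_i

print("  x_i  n_i    f_i   f_i%  cum")
for x, ni, fi, pi, ci in zip(values, n, rel, pct, cum):
    print(f"{x:>5} {ni:>4} {fi:>6.3f} {pi:>6.1f} {ci:>4}")
print(f"Total {N:>4} {sum(rel):>6.3f} {sum(pct):>6.1f}")
```

The last cumulative frequency always equals N, and the relative frequencies add up to 1 (up to rounding), which gives two simple checks on a table built by hand.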
When the data contain many distinct values, a grouped frequency distribution is used. A grouped frequency distribution organizes data items into groups or classes of values, each showing how many items have values included within the group; this number is known as the class frequency. The number of classes is usually between 5 and 15.

Definitions associated with frequency distribution classes
a) Class limits: the lower and upper values of the classes;
b) The lower class limit is the smallest data value that can be included in the class;
c) The upper class limit is the largest data value that can be included in the class;
d) Class boundaries: the lower and upper values of a class that mark the common points between classes. Class boundaries are used when the class intervals are closed (inclusive);
e) Class width (or length): the difference between the upper and lower class boundaries. If all class intervals of a frequency distribution have equal widths, this common width is denoted by C (written h in the procedure below); in that case C is equal to the difference between two successive lower class limits or two successive upper class limits.
   Class width = upper boundary - lower boundary;
f) Class mark or class midpoint: the class midpoint $X_m$ is obtained by adding the lower and upper class limits and dividing by 2, or by adding the lower and upper boundaries and dividing by 2:

$X_m = \dfrac{\text{lower boundary} + \text{upper boundary}}{2}$ or $X_m = \dfrac{\text{lower limit} + \text{upper limit}}{2}$

Formulation of grouped frequency distributions
A grouped frequency distribution is a tabulation of N data values into K classes, called bins, based on the values of the data. The bin limits are the cutoff points that define each bin. Bins must have equal widths and their limits cannot overlap.
1) Calculate the range W: highest value minus lowest value.
2) Find the number of classes K using one of the following rules:
   - K is the smallest number of classes such that $2^K \ge N$ (rule of Sturges (1));
   - Yule's rule: $K = 2.5\,\sqrt[4]{N}$.
3) Calculate the class interval or class width (length): $h = \dfrac{W}{K-1}$ or $h = \dfrac{W}{K}$.
4) The lower boundary of the first class of the frequency distribution equals the lowest value of the series minus $\dfrac{h}{2}$. The upper boundary of the last class equals the lower boundary of the first class plus $h \times K$.

(1) Herbert Sturges proposed $K = 1 + \log_2 N$.

The completed frequency distribution has the form:

Class limits    Frequency    Cumulative frequency    Relative frequency    Percentage
...
Total

EXCLUSIVE AND INCLUSIVE CLASS INTERVALS
Class intervals of the type $\{x : a \le x < b\} = [a, b)$ are called exclusive (open), since they exclude the upper limit of the class. The following data are classified on this basis.

Income            50-100    100-150    150-200    200-250    250-300
No. of persons    88        70         52         30         23

In this method, the upper limit of one class is the lower limit of the next class.
Class intervals of the type $\{x : a \le x \le b\} = [a, b]$ are called inclusive, since they include the upper limit of the class. The following data are classified on this basis.

Income            50-99    100-149    150-199    200-249    250-299
No. of persons    60       38         22         16         7

However, to ensure continuity and to get correct class limits, the exclusive method of classification should be adopted. To convert inclusive class intervals into exclusive ones, we have to make an adjustment.
Adjustment: find the difference between the lower limit of the second class and the upper limit of the first class. Divide it by 2, subtract the value so obtained from all the lower limits, and add it to all the upper limits.
In the above example, the adjustment factor is $\dfrac{100 - 99}{2} = 0.5$. The adjusted classes are then as follows:

Income            49.5-99.5    99.5-149.5    149.5-199.5    199.5-249.5    249.5-299.5
No. of persons    60           38            22             16             7

Example: the following data show the height in millimeters of 106 maize plants after 2 weeks.
129 148 139 141 150 148 138 141 140 146 153 141 148 138 145
141 141 142 141 141 143 140 138 138 145 141 142 131 142 141
140 143 144 135 134 139 148 137 146 121 148 136 141 140 147
146 144 142 136 137 140 143 148 140 136 146 143 143 145 142
138 148 143 144 139 141 143 137 144 133 146 143 158 149 136
148 134 138 145 144 139 138 143 141 145 141 139 140 140 142
133 139 149 139 142 145 132 146 140 140 140 132 145 145 142
149

Construct a grouped frequency distribution for the data.

Solution
The procedure for constructing a grouped frequency distribution for numerical data is as follows:
1. Determine the class intervals:
• Find the highest value and the lowest value: H = 158 and L = 121.
• Find the range: R = highest value - lowest value = H - L, so R = 158 - 121 = 37.
• Find the class width by dividing the range by the number of classes (10 classes are used here):
  Width = $\dfrac{R}{\text{number of classes}} = \dfrac{37}{10} = 3.7 \approx 4$
• Find the lower limit of the first class: lowest value of the series - $\dfrac{\text{width}}{2}$ = $121 - \dfrac{4}{2} = 119$.
• The upper limit of the first class = lower limit + width = 119 + 4 = 123.
• Find the upper limit of the last class: lower limit of the first class + width × number of classes = 119 + 4 × 10 = 159.

The completed frequency distribution is shown below. Each class includes its lower limit and excludes its upper limit.

Class limits    Class mark (midpoint)    Frequency    Cumulative frequency    Relative frequency    Percentage
119 - 123       121                      1            1                       0.009                 0.9
123 - 127       125                      0            1                       0.000                 0.0
127 - 131       129                      1            2                       0.009                 0.9
131 - 135       133                      7            9                       0.066                 6.6
135 - 139       137                      15           24                      0.142                 14.2
139 - 143       141                      39           63                      0.368                 36.8
143 - 147       145                      28           91                      0.264                 26.4
147 - 151       149                      13           104                     0.123                 12.3
151 - 155       153                      1            105                     0.009                 0.9
155 - 159       157                      1            106                     0.009                 0.9
Total                                    106                                  1.000                 100

Exercise: construct a grouped frequency distribution of the students of Applied Statistics Level I at INES in 2010 according to height, weight, and age.

2.1.4 Two-way Frequency Distribution (Bivariate Frequency Distribution)
A two-way frequency distribution is used when two variables are involved.
A two-way frequency table has the class intervals of one variable as columns and those of the other variable as rows. The boxes formed at the intersection of rows and columns thus represent joint classes. The row and the column that contain the totals are called the marginal distributions. The other rows and columns are called conditional distributions. The frequency of a joint class is the number of items that have the value of the first variable in the class given by the column heading and the value of the second variable in the class given by the row heading.

The method of constructing a two-way table consists of the following steps:
• Determine the class intervals for each of the variables;
• Place one of the variables at the top of the table and the other on the left-hand side;
• Place each item in the appropriate box;
• Total the tallies in each box and in each row and column. The grand total of rows and columns should agree with the total number of items.

Example 1: the following table shows the performance of students in two subjects: Statistics and Accountancy.

Roll number   Marks in Statistics   Marks in Accountancy      Roll number   Marks in Statistics   Marks in Accountancy
1             15                    13                        13            14                    11
2             1                     1                         14            9                     3
3             1                     2                         15            8                     5
4             3                     7                         16            13                    4
5             16                    8                         17            10                    10
6             2                     9                         18            13                    11
7             18                    12                        19            11                    14
8             5                     9                         20            11                    17
9             4                     17                        21            12                    18
10            17                    16                        22            18                    15
11            6                     6                         23            9                     15
12            19                    18                        24            7                     3

Construct a two-way frequency table for the data, taking the class intervals of the two variables (Statistics and Accountancy) as 1-5, 6-10, etc. Use 4 classes of width 5 for each variable.

The two-way frequency table for marks in Statistics (columns) and Accountancy (rows), obtained by counting the 24 pairs above, is:

Accountancy \ Statistics   1-5   6-10   11-15   16-20   Total
1-5                        2     3      1       0       6
6-10                       3     2      0       1       6
11-15                      0     1      4       2       7
16-20                      1     0      2       2       5
Total                      6     6      7       5       24

Example 2: The ages of 20 husbands and wives are given below. Form a two-way frequency table showing the relationship between the ages of husbands and wives with the class-intervals 20-24, 25-29, etc.

S. No.   Age of husband   Age of wife      S. No.   Age of husband   Age of wife
1        28               23               11       27               24
2        37               30               12       39               34
3        42               40               13       23               20
4        25               26               14       33               31
5        29               25               15       36               29
6        47               31               16       32               35
7        37               35               17       22               23
8        35               25               18       29               27
9        23               21               19       38               34
10       41               38               20       48               47

Solution

Frequency distribution of the ages of husbands (rows) and wives (columns), obtained by tallying the 20 pairs above:

Age of husband \ Age of wife   20-24   25-29   30-34   35-39   40-44   45-49   Total
20-24                          3       0       0       0       0       0       3
25-29                          2       3       0       0       0       0       5
30-34                          0       0       1       1       0       0       2
35-39                          0       2       3       1       0       0       6
40-44                          0       0       0       1       1       0       2
45-49                          0       0       1       0       0       1       2
Total                          5       5       5       3       1       1       20

Exercises

1. Prepare a two-way frequency table and marginal frequency tables for the 25 values of the two variables x and y given below. Take the class intervals of x as 10-20, 20-30, etc., and those of y as 100-200, 200-300, etc.

x    y        x    y
12   140      51   250
24   256      27   550
33   360      42   360
22   470      43   570
44   470      52   290
37   380      57   416
26   280      44   380
36   315      48   452
55   420      48   370
48   390      52   312
27   440      41   330
57   390      69   590
21   590

2. Prepare a bivariate frequency distribution for the following data:

Marks in Law   Marks in Statistics      Marks in Law   Marks in Statistics
10             20                       13             24
11             21                       12             23
10             22                       11             22
11             21                       12             23
11             23                       10             22
14             23                       14             22
12             22                       12             20
12             21                       13             24
13             24                       10             23
10             23                       14             24

2.2. GRAPHIC REPRESENTATION OF A FREQUENCY DISTRIBUTION

After the data have been organized into a frequency distribution, they can be presented in graphical form. It is easier to comprehend the meaning of data presented graphically than of data presented numerically in tables or frequency distributions.

The three most commonly used graphs in research are:
1. The histogram;
2. The frequency polygon;
3. The cumulative frequency graph or ogive.

1. Histogram

A histogram is a graphic presentation of a frequency distribution in which the classes are marked on the horizontal axis and the class frequencies on the vertical axis. The class frequencies are represented by the heights of the rectangles. Each rectangle represents just one class; the rectangle width corresponds to the class width and the rectangles are drawn adjacent to each other.
Note: for a histogram whose rectangle heights are the class frequencies, the class intervals must be equal and exclusive.

Example: For the following frequency distribution of the heights of students, draw the histogram.

Height            140-145   145-150   150-155   155-160   160-165   165-170   170-175
No. of students   4         10        18        20        19        6         3

Solution

[Figure: Histogram of distribution of height of students; horizontal axis: height (class); vertical axis: number of students (frequencies)]

If the class intervals of the frequency distribution are of unequal width, we first have to calculate the frequency density on a convenient scale:

$d_i = \dfrac{n_i}{a_i}$

where $n_i$ (also written $f_i$) is the frequency of class i and $a_i$ is its width. Sometimes the densities are multiplied by the smallest class interval of the distribution; otherwise they are multiplied by a predetermined interval:

$d_i = \dfrac{n_i}{a_i} \times a_0$

with $a_0$ the smallest interval.

Example: Average monthly earning of 1035 employees in the construction industry

Monthly earning   Number of workers   Width   $d_i = \dfrac{n_i}{a_i} \times a_0$
60-70             25                  10      25
70-80             100                 10      100
80-90             150                 10      150
90-100            200                 10      200
100-120           240                 20      120
120-140           160                 20      80
140-150           50                  10      50
150-180           90                  30      30
180 and more      20                  -       -

Draw the histogram.

[Figure: Histogram of average monthly earning of 1035 employees; horizontal axis: average monthly earning (classes 60-70 to 150-180); vertical axis: number of workers (frequencies)]

If the frequency distribution has inclusive class intervals, they should first be converted into the exclusive type, and only then should the histogram be drawn.

Example: Draw a histogram to present the following data.

Income    No. of employees      Income    No. of employees
100-149   21                    300-349   62
150-199   32                    350-399   43
200-249   52                    400-449   18
250-299   105                   450-499   9

Solution: here the grouped frequency distribution is not continuous because the class intervals are inclusive. We first convert it into a continuous distribution as follows:

Adjustment factor: $\dfrac{150 - 149}{2} = 0.5$.
Subtract it from each lower limit and add it to each upper limit so as to have exclusive class intervals. Thus:

Income        No. of employees      Income        No. of employees
99.5-149.5    21                    299.5-349.5   62
149.5-199.5   32                    349.5-399.5   43
199.5-249.5   52                    399.5-449.5   18
249.5-299.5   105                   449.5-499.5   9

[Figure: Histogram, frequency distribution of employees by earned income; horizontal axis: income (classes 99.5-149.5 to 449.5-499.5); vertical axis: number of employees]

2. A Frequency Polygon

A frequency polygon is a graph of class marks. Class marks are the values of the middle points of the class intervals. The polygon is drawn by placing the class marks on the horizontal axis and the frequencies of the observations on the vertical axis.

If the class intervals are of equal width, the class frequencies are plotted against the class mid-values. If the class intervals are of unequal width, the graph is obtained by plotting frequency density against class mid-values.

Description of a frequency polygon:
1) Each class is represented by a single point. The height of the point represents the class frequency; the position of the point must be directly above the corresponding class mid-point;
2) The points are joined by straight lines;
3) The extremities of the graph are joined to the mid-values of the class preceding the first class and of the class following the last class at zero frequency, i.e. on the x-axis.

A curve of relative frequencies can also be drawn, and so can a curve of percentages. These are called frequency curves.

Example: For the following frequency distribution, draw a frequency polygon.

Income    300-400   400-500   500-600   600-700   700-800   800-900   900-1000
Workers   18        32        35        30        21        12        4

Solution

Midpoint   350   450   550   650   750   850   950
Workers    18    32    35    30    21    12    4

[Figure: Frequency polygon of distribution of income]

3.
Cumulative Frequency Curve or the Ogive

A cumulative frequency graph (traditionally called an ogive) is a graph that represents the cumulative frequencies for the classes in a frequency distribution. It is used to show visually how many values lie below a certain upper class boundary.

There are two types of ogives:

A) Less than ogive: plot the points with the upper limits of the classes as abscissae and the corresponding less than cumulative frequencies as ordinates. For less than distributions, the cumulation proceeds from the least to the greatest size, and the series so obtained is called the less than cumulative frequency distribution.

B) More than ogive: plot the points with the lower limits of the classes as abscissae and the corresponding more than cumulative frequencies as ordinates. For more than distributions, the cumulation proceeds from the greatest to the least, and the series so obtained is called the more than cumulative frequency distribution.

To form the cumulative frequency graphs, the points are joined with straight lines.

Example

Draw the two ogives for the following distribution showing the marks of 59 students.

Marks   No. of students      Marks   No. of students
0-10    4                    40-50   12
10-20   8                    50-60   6
20-30   11                   60-70   3
30-40   15

Solution

Construction of the two ogives

Marks   No. of students (f)   Less than cumulative f   More than cumulative f
0-10    4                     4                        59
10-20   8                     12                       55
20-30   11                    23                       47
30-40   15                    38                       36
40-50   12                    50                       21
50-60   6                     56                       9
60-70   3                     59                       3

Plotting the points (10, 4), (20, 12), (30, 23), (40, 38), (50, 50), (60, 56), (70, 59) and joining them free-hand, the smooth rising curve so obtained is the less than ogive.

Plotting the points (0, 59), (10, 55), (20, 47), (30, 36), (40, 21), (50, 9), (60, 3) and joining them free-hand, the smooth falling curve so obtained is the more than ogive.

[Figure: Less than and more than cumulative frequency curves of the marks distribution]

EXERCISES

1. This table represents sex, age, height, weight of 24 students of Level I AST at INES in 2010.
Order   Sex   Age   Height (in cm)   Weight (in kg)
1       F     22    160              58
2       F     19    170              60
3       M     23    161              50
4       M     26    180              61
5       M     22    159              49
6       M     27    172              70
7       M     23    150              45
8       M     22    150              48
9       F     23    170              65
10      M     23    160              58
11      F     25    155              59
12      F     23    162              60
13      F     24    171              80
14      F     24    170              62
15      F     24    165              64
16      F     23    173              61
17      F     22    160              57
18      F     18    163              52
19      F     19    143              48
20      F     25    167              67
21      F     23    168              59
22      F     22    172              63
23      F     24    162              55
24      F     22    174              63

Draft a form of tabulation to show: sex and age, weight and height, age and weight, age and height. Present the absolute, relative, percentage and cumulative frequency distributions. Draw the histogram, frequency polygon and ogive for age, height and weight.

2. Draw a histogram for the following frequency distribution of the heights of students. From the histogram, obtain the frequency polygon.

Height            140-150   150-160   160-165   165-170   170-180   180-190
No. of students   5         15        15        20        10        2

3. The daily wages of the workers of a factory have the following distribution. Draw the less than cumulative frequency graph for the wages.

Wages           30-39   40-49   50-59   60-69   70-79   80-89   90-99   100-109   Total
No. of workers  9       25      34      25      19      13      7       2         134

4. Draw a histogram and a frequency polygon for the following data:

Marks   No. of students      Marks    No. of students
0-10    5                    50-60    4
10-20   13                   60-70    1
20-30   12                   70-80    3
30-40   11                   80-90    1
40-50   8                    90-100   2

5. The following are the weights of 30 students. Draw up a frequency distribution with:
a) Class intervals 40-44, 45-49, 50-54, ... kg;
b) Class intervals of width 6 kg each.

Weights (kg): 51, 47, 50, 54, 62, 52, 42, 49, 52, 49, 44, 50, 53, 58, 46, 50, 51, 53, 48, 50, 55, 52, 55, 58, 63, 54, 52, 49, 50, 58.

2.3. DIAGRAMMATIC AND CHART REPRESENTATION

In section 2.2, graphs such as the histogram, frequency polygon and ogive showed how data can be represented when the variable displayed on the horizontal axis is quantitative, such as heights and weights.
On the other hand, when the variable displayed on the horizontal axis is qualitative or categorical, several types of charts are used, such as pictograms, statistical maps or cartograms, spider charts, Gantt charts, bar charts, Pareto charts, time series graphs, pie graphs and so on.

This section is concerned with the presentation of non-numeric or qualitative frequency distribution data. The types of diagram described in this section include various types of bar charts, pie charts, Pareto charts and time series graphs.

1) Bar charts

a) Simple bar charts

A simple bar chart consists of a set of non-adjoining bars. A separate bar for each class is drawn to a height proportional to the frequency.

[Figure: simple bar charts of percentages for the categories tuberculoid, lepromatous, indeterminate and borderline; labelled values 60.64, 27.31, 7.23 and 4.82 %]

The following bar chart is used for a discrete variable.

Example: This table shows the details of the monthly expenditure of two families. Draw a bar diagram for the data.

Items of expenditure   Family A   Family B
Food                   140        240
Clothing               80         160
House Rent             100        120
Education              30         80
Fuel and Lighting      40         40
Miscellaneous          40         80
Saving                 70         80
Total                  500        800

Solution

[Figure: bar chart, details of monthly expenditure of family A; horizontal axis: expenditure (Food, Clothing, House Rent, Education, Fuel and Lighting, Miscellaneous, Saving); vertical axis: revenue]

[Figure: bar chart, details of monthly expenditure of family B; horizontal axis: expenditure; vertical axis: revenue]

b) Multiple bar charts

These charts are used as an extension of simple bar charts, where another dimension of the data is given.

[Figure: multiple bar charts of percentages for the categories tuberculoid, lepromatous, indeterminate and borderline, with separate bars for male and female]

Example: draw bar charts to show the details of the monthly expenditure of the two families.

Solution

[Figure: multiple bar chart, details of monthly expenditure of the two families A and B; horizontal axis: expenditure; vertical axis: revenue]

2) Pie charts

A pie chart shows the totality of the data being represented using a single circle. The circle is split into sectors, the size of each one being drawn in proportion to the class frequency. Each sector can be shaded or colored differently if desired.

The procedure for drawing a pie graph is:

Step 1: Calculate the proportion of the total that each frequency represents, using the formula $\dfrac{f}{n}$, where f = frequency of the class and n = total number of values.

Step 2: Find the number of degrees for each class, using the formula Degrees $= \dfrac{f}{n} \cdot 360^{\circ}$.

Step 3: Find the percentage of values in each class by using the formula $\% = \dfrac{f}{n} \cdot 100$.

Step 4: Using a protractor and compass, graph each section and write its name and the corresponding degrees or percentage.

Advantages and disadvantages of pie charts

Advantages: easy to construct; easy to understand; a sense of continuity is given by a line diagram which is not present in a bar chart.

Disadvantages: they might be confusing if too many diagrams with closely associated values are compared together. Where several diagrams are displayed, there is no provision for total figures.
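Steps 1 to 3 of the pie chart procedure can be checked numerically. The following Python sketch applies them to the Family A expenditure figures from the bar chart example; the function name pie_sectors is an illustrative choice and is not part of the course material.

```python
def pie_sectors(frequencies):
    """Return (proportion, degrees, percentage) for each class."""
    n = sum(frequencies.values())          # total number of values
    sectors = {}
    for label, f in frequencies.items():
        proportion = f / n                 # Step 1: f / n
        degrees = proportion * 360         # Step 2: (f / n) * 360
        percentage = proportion * 100      # Step 3: (f / n) * 100
        sectors[label] = (proportion, degrees, percentage)
    return sectors


# Monthly expenditure of Family A, as given in the bar chart example.
family_a = {
    "Food": 140, "Clothing": 80, "House Rent": 100, "Education": 30,
    "Fuel and Lighting": 40, "Miscellaneous": 40, "Saving": 70,
}

sectors = pie_sectors(family_a)
for label, (proportion, degrees, percentage) in sectors.items():
    print(f"{label:18s} {proportion:.2f} {degrees:6.1f} degrees {percentage:5.1f} %")
```

The degrees of all sectors should add up to 360 and the percentages to 100, which is a quick check before drawing with the protractor in Step 4.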
Example 1: construct a pie chart for the following data.

Monthly expenditure of family A

Family A            Rs    Percentage   Cumulative percentage
Food                140   28           28
Clothing            80    16           44
House Rent          100   20           64
Education           30    6            70
Fuel and Lighting   40    8            78
Miscellaneous       40    8            86
Saving              70    14           100

[Figure: pie chart, monthly expenditure of Family A; sectors: Food 28%, Clothing 16%, House Rent 20%, Education 6%, Fuel and Lighting 8%, Miscellaneous 8%, Saving 14%]

This graph shows that food is the largest item of expenditure of family A.

Example 2: a survey of the students in the school of education of a large university obtained the following data for students enrolled in specific fields. Construct a pie graph for the data and analyze the results.

Major        Number   $\% = \dfrac{f}{n} \cdot 100$
Preschool    893      31
Elementary   605      21
Middle       245      9
Secondary    1096     39
Total        2839     100

[Figure: pie chart, students enrolled in specific fields; sectors: Preschool 31%, Elementary 21%, Middle 9%, Secondary 39%]

This graph shows that there are more students in the secondary field than in any other field.

Exercises

1. In a study of 100 women, the numbers shown here indicate the major reason why each woman surveyed worked outside the home. Construct a pie graph for the data and analyze the results.

Reason                            Number
To support self/family            62
For extra money                   18
For something different to do     12
Other                             8

2. A questionnaire about how people get news resulted in the following information from 25 respondents. Construct a frequency distribution and a pie graph for the data (N = newspaper, T = television, R = radio, M = magazine).

N N R T T
R N T M R
M M N R M
T R M N M
T R R N N

3. A questionnaire on housing arrangements showed this information obtained from 25 respondents. Construct a frequency distribution and a pie graph for the data (H = house, A = apartment, M = mobile home, C = condominium).
H C H M H
A C A M C
M C A M A
C C M C C
H A H H M

4) Pareto chart

A Pareto chart is used to represent a frequency distribution for a categorical or qualitative variable. The frequencies are displayed by the heights of vertical bars, which are arranged in order from highest to lowest.

Procedure for drawing a Pareto chart
1. Arrange the data from the largest to the smallest according to frequency;
2. Draw and label the x and y axes;
3. Draw the bars corresponding to the frequencies.

Example: The following data are based on a survey from the American Travel Survey on why people travel. Construct a Pareto chart for the data and comment.

Purpose                      Number
Personal business            146
Visit friends or relatives   330
Work-related                 225
Leisure                      299

Source: USA TODAY

[Figure: Pareto chart of travel purposes; bars in decreasing order: Visit friends or relatives, Leisure, Work-related, Personal business]

This chart shows that the most common reason for Americans to travel is visiting friends or relatives and the least common is personal business.

5) Time series

When data are collected over a period of time, they can be represented by a time series graph. A time series graph represents data that occur over a specific period of time.

Procedure for drawing a time series graph
Step 1: Draw and label the x and y axes;
Step 2: Label the x axis for years and the y axis for the quantity being counted;
Step 3: Plot each point according to the table;
Step 4: Draw line segments connecting adjacent points.

Example 1: the number of bank failures in the United States during the years 1989-2000 is shown. Draw a time series graph to represent the data and comment on the results.

Year             1989   1990   1991   1992   1993   1994   1995   1996   1997   1998   1999   2000
No. of failures  207    169    127    122    41     13     6      5      1      3      8      7

[Figure: time series graph of bank failures; horizontal axis: year; vertical axis: number of failures]

The graph shows the bank failures from 1989 through 2000. Most of the bank failures occurred between 1989 and 1992.
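The comment on the bank failure graph can be backed by a short calculation. This Python sketch uses only the figures from the table in Example 1; the variable names are illustrative.

```python
# Bank failures in the United States, 1989-2000 (table from Example 1).
years = list(range(1989, 2001))
failures = [207, 169, 127, 122, 41, 13, 6, 5, 1, 3, 8, 7]
series = dict(zip(years, failures))

total = sum(series.values())
early = sum(series[year] for year in range(1989, 1993))  # 1989 to 1992 inclusive
share = early / total

print(f"{early} of {total} failures occurred in 1989-1992 ({share:.1%})")
```

The same dictionary can be passed to any plotting tool to draw the line segments of Step 4.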
Example 2: The following table shows meat production for lamb for the years 1960-2000 (data are in millions of pounds). Construct a time series graph for the data.

Year   1960   1970   1980   1990   2000
Lamb   769    551    318    358    234

[Figure: time series graph; horizontal axis: year; vertical axis: meat production for lamb]

The graph shows a decline in the quantity of meat production for lamb from 1960 through 2000.

Chapter 3: DATA DESCRIPTION: MEASURES OF CENTRAL TENDENCY, MEASURES OF DISPERSION, MEASURES OF POSITION

3.1 INTRODUCTION

This chapter explains the basic ways to summarize data. These include measures of central tendency, measures of variation or dispersion, and measures of position.

Central tendency refers to the location of a distribution. A measure of central tendency is any of a number of ways of specifying this "central value". Several types of averages can be defined, the most important being the mean, the median, the mode and the midrange. The mean can be the arithmetic, geometric or harmonic mean.

The three most commonly used measures of variation are the range, the variance and the standard deviation.

The most common measures of position are percentiles, quartiles and deciles.

3.2 MEASURES OF CENTRAL TENDENCY

A. The arithmetic mean

1. Definition of the arithmetic mean

The arithmetic mean of a set of values is the simple arithmetic average of the observations. It is defined as "the sum of the values of all the observations divided by the number of observations". The arithmetic mean is normally abbreviated to just the "mean" or "average".

The mean, or average, of a population is represented by the Greek letter $\mu$ (mu), and that of a sample by the Roman letter $\bar{X}$ (read "X bar"). That is:

arithmetic mean $= \dfrac{\text{the sum of all the values of observations in the sample}}{\text{the number of values in the sample}}$

2. The arithmetic mean for ungrouped data

The formula for calculating the arithmetic mean is:

$\mu = \dfrac{x_1 + x_2 + x_3 + \dots + x_N}{N} = \dfrac{\sum_{j=1}^{N} x_j}{N} = \dfrac{\sum x}{N}$ for the population

$\bar{X} = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \dfrac{\sum_{j=1}^{n} x_j}{n} = \dfrac{\sum x}{n}$ for the sample

Where:
• The symbol $\sum$ (Greek capital letter "sigma") stands for summation: it means "the total of";
• x represents any particular value of an observation;
• $\sum x$ is the sum of all values in the sample or population;
• N represents the total number of observations in the population;
• n refers to the number of observations in the sample.

Assume that the data are obtained from samples unless otherwise specified.

Example 1: Find the arithmetic mean (the average) of the numbers 8, 3, 5, 12 and 10.

Solution: in this data set, $x_1 = 8,\ x_2 = 3,\ x_3 = 5,\ x_4 = 12,\ x_5 = 10$ and n = 5. Then

$\bar{x} = \dfrac{8 + 3 + 5 + 12 + 10}{5} = \dfrac{38}{5} = 7.6$

3. The mean of a simple (discrete) frequency distribution

The mean for a simple frequency distribution is calculated using the following formula:

$\bar{x} = \dfrac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_k x_k}{f_1 + f_2 + f_3 + \dots + f_k} = \dfrac{\sum_{j=1}^{k} f_j x_j}{\sum_{j=1}^{k} f_j} = \dfrac{\sum f x}{n}$

Where:
• x represents the values;
• f represents the frequencies;
• $\sum f$ is the total frequency or the total number of observations (n);
• $\sum f x$ refers to the sum of each value x times its frequency f.

Example: calculate the arithmetic mean of the marks of 46 students given in the following table.

Table 3.1 Frequency of marks of 46 students

Marks (x)   Frequency (f)   fx
9           1               9
10          2               20
11          3               33
12          6               72
13          10              130
14          11              154
15          7               105
16          3               48
17          2               34
18          1               18
Total       46              623

The total of all these values: $\sum f x = 623$.
Total number of observations: n = 46.
Therefore, the arithmetic mean of the marks of the 46 students is

$\bar{x} = \dfrac{\sum f x}{n} = \dfrac{623}{46} = 13.54$

4. The mean of a grouped frequency distribution

For grouped data, $\mu$ and $\bar{x}$ are calculated by

$\mu = \dfrac{\sum f x}{N}$ and $\bar{x} = \dfrac{\sum f x}{n}$

where f is the frequency, x the mid-point of the class interval and n the total number of observations.

Procedure for finding the mean of a grouped frequency distribution

1. Make a table as shown.
Class interval    Frequency (f)    Midpoint (x) of class interval    f.x

2. Find the midpoint of each class.
3. Multiply the frequency by the midpoint for each class.
4. Find the sum of the products of the frequency f of each class and the class midpoint x.
5. Divide the sum obtained by the sum of the frequencies.

Example 1: Calculate the arithmetic mean of the following data.

Table 3.2 Profit per shop

Profit in €   No. of shops (f)   Mid-point of class interval (x)   f.x
0-10          12                 5                                 60
10-20         18                 15                                270
20-30         27                 25                                675
30-40         20                 35                                700
40-50         17                 45                                765
50-60         6                  55                                330
Total         100                                                  2800

The mean profit is: $\dfrac{\sum f x}{n} = \dfrac{2800}{100} = 28$

Example 2: The following data relate to the number of successful sales made by the salesmen in a particular quarter.

Number of sales      0-4   5-9   10-14   15-19   20-24   25-29
Number of salesmen   1     14    23      21      15      6

Calculate the mean number of sales.

Answer:

Number of sales (class interval)   Number of salesmen (f)   Class midpoint (x)   fx
0 to 4                             1                        2                    2
5 to 9                             14                       7                    98
10 to 14                           23                       12                   276
15 to 19                           21                       17                   357
20 to 24                           15                       22                   330
25 to 29                           6                        27                   162
Totals                             80                                            1225

$\sum f = 80$, $\sum f x = 1225$, so $\bar{x} = \dfrac{\sum f x}{\sum f} = \dfrac{1225}{80} = 15.3$

The advantages of the mean

The mean is the most commonly used measure of central tendency.
• Every set of interval-level or ratio-level data has a mean;
• It is easily understood;
• All the values are included in computing the mean;
• A set of data has only one mean. The mean is unique;
• It is used in performing many other statistical procedures and tests;
• It is not necessary to know the value of each individual observation in order to calculate the arithmetic mean. Only the total of the observations and the number of observations are required.
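The grouped mean procedure used in Examples 1 and 2 above can be sketched in Python as follows. The function name grouped_mean and the (lower limit, upper limit, frequency) layout are illustrative assumptions; the data are the two tables from the examples.

```python
def grouped_mean(classes):
    """Mean of a grouped distribution: sum(f * x) / sum(f), x = class midpoint.

    classes is a list of (lower limit, upper limit, frequency) tuples.
    """
    total_f = 0
    total_fx = 0.0
    for lower, upper, f in classes:
        midpoint = (lower + upper) / 2   # step 2: midpoint of the class
        total_fx += f * midpoint         # steps 3 and 4: accumulate f * x
        total_f += f
    return total_fx / total_f            # step 5: divide by the sum of f


# Table 3.2: profit per shop.
profit = [(0, 10, 12), (10, 20, 18), (20, 30, 27),
          (30, 40, 20), (40, 50, 17), (50, 60, 6)]

# Example 2: successful sales per salesman.
sales = [(0, 4, 1), (5, 9, 14), (10, 14, 23),
         (15, 19, 21), (20, 24, 15), (25, 29, 6)]

print(grouped_mean(profit))           # 28.0
print(round(grouped_mean(sales), 1))  # 15.3
```

Because every observation in a class is replaced by the class midpoint, the result is an estimate of the mean of the raw data rather than its exact value.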
The disadvantages of the mean are:
• The mean is affected by extremely high or low values, called outliers, and may not be the appropriate average to use in these situations;
• It is time-consuming to compute for a large body of ungrouped data;
• It cannot be calculated when the last class of grouped data is open-ended (i.e., it includes the lower limit of the last class "and over");
• The sum of the deviations of each value from the mean is always zero. Expressed symbolically: $\sum (X - \bar{X}) = 0$. As an example, the mean of 3, 8 and 4 is 5. Then: $\sum (X - \bar{X}) = (3 - 5) + (8 - 5) + (4 - 5) = -2 + 3 - 1 = 0$.

B. THE MEDIAN

The median is generally considered as an alternative average to the mean. The value of the variable which divides the distribution so that exactly half of the distribution has the same or larger values and exactly half has the same or lower values is called the median.

1. The median for ungrouped data

The median of a set of data is the middle value that separates the higher half from the lower half of the data set after the values have been ordered from the smallest to the largest, or from the largest to the smallest.

Procedure for obtaining the median of a set of data: order the given data from the smallest to the largest or from the largest to the smallest, then select the middle point.

Example: Find the median of the following five observations: $x_1 = 10,\ x_2 = 15,\ x_3 = 6,\ x_4 = 12$ and $x_5 = 11$.

Solution: We must:
1. Order the given numbers from the smallest to the largest: 6, 10, 11, 12, 15;
2. Select the middle point: the middle value is 11. Therefore the median (MD) = 11.

Note 1. When a set of data contains an even number of items, there is no unique middle or central value. The convention in this situation is to use the mean of the middle two items to give a median.
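The ordering procedure and the convention in Note 1 can be sketched in a few lines of Python. The function name median is illustrative; the first call uses the five observations of the example above, and the second adds a sixth observation, 17, to show the even case.

```python
def median(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    ordered = sorted(values)          # step 1: order the data
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd n: a unique middle value exists
        return ordered[middle]
    # even n: mean of the two middle items (Note 1)
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([10, 15, 6, 12, 11]))       # 11
print(median([10, 15, 6, 12, 11, 17]))   # 11.5
```

Sorting is the costly part of the calculation, which is why the median is described later as time-consuming for a large body of ungrouped data.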
Example 1: Find the median of the following six observations: $x_1 = 10,\ x_2 = 15,\ x_3 = 6,\ x_4 = 12,\ x_5 = 11$ and $x_6 = 17$.

Solution: As before, arrange all the values of the observations in numerical order: 6, 10, 11, 12, 15, 17.

Evidently there is no single middle value. However, two numbers lie in the middle: 11 and 12. The two must be added together and divided by 2, thus obtaining their average:

$MD = \dfrac{11 + 12}{2} = 11.5$

Example 2: calculation of the median for the data given in table 3.1

Solution: Arranging all the 24 values in ascending order of magnitude, we get the following data:

2.90   3.57   3.73
2.98   3.61   3.75
3.30   3.62   3.76
3.43   3.66   3.76
3.43   3.68   3.77
3.45   3.71   3.84
3.55   3.72   3.88

The 12th value is 3.66 and the 13th is 3.68; the median is the average of these two.

Median $= \dfrac{3.66 + 3.68}{2} = 3.67$ %

Note 2. For a set with an odd number (n) of items, the median can be precisely identified as the value of the $\dfrac{n + 1}{2}$-th item. Thus, in a size-ordered set of 15 items, the median would be the $\dfrac{15 + 1}{2}$-th = 8th item along.

2. Median for a simple frequency distribution

Where there is a large number of discrete items in a data set but the range of values is limited, a simple frequency distribution will probably have been compiled.

The median for a simple frequency distribution is the value of the item whose position is given by the following formula:

MD = value of the $\dfrac{\sum f + 1}{2}$-th item

where $\sum f$ is the total frequency (the last value of the cumulative frequency F), also written N.

Procedure for calculating the median

To calculate the median for a simple (discrete) frequency distribution, the following procedure should be followed:
1. Calculate the value of $\dfrac{\sum f + 1}{2}$;
2. Form an F (cumulative frequency) column;
3. Find the F value which first equals or exceeds $\dfrac{\sum f + 1}{2}$;
4. The median is the x-value corresponding to the F value identified in 3.

Example: calculate the median for the following distribution of delivery times of orders sent out from a firm.
Delivery time (days)   0   1   2    3    4    5    6    7   8   9   10   11
Number of orders       4   8   11   12   21   15   10   4   2   2   1    1

Answer

STEP 1 The median is the $\dfrac{N + 1}{2}$-th $= \dfrac{91 + 1}{2}$-th = 46th item.

STEP 2 The F column is shown in the following table:

Delivery time in days (x)   Number of orders (f)   Cumulative number of orders (F)
0                           4                      4
1                           8                      12
2                           11                     23
3                           12                     35
4                           21                     56
5                           15                     71
6                           10                     81
7                           4                      85
8                           2                      87
9                           2                      89
10                          1                      90
11                          1                      91

STEP 3 The first F value to exceed 46 is F = 56.

STEP 4 The median is thus 4 (days).

3. Median for a grouped frequency distribution

There are two methods commonly employed for estimating the median of a grouped frequency distribution:
a) using an interpolation formula;
b) by graphical interpolation.

a) Estimating the median by formula

Given a grouped frequency distribution, the best that can be done is to identify the class or group that contains the median item. From there, the median is estimated using the cumulative frequencies and the fact that the median must lie exactly one half of the way along the distribution.

The formula for calculating the median for a grouped distribution is:

Median $= L + \left( \dfrac{\frac{N}{2} - F}{f} \right) \cdot c$

Where:
L = lower bound (limit) of the median class (the class that contains the middle item of the distribution)
F = sum of the frequencies of all classes lower than the median class
c = median class width (interval of the median class)
f = frequency of the median class
N = total number of observations

Example 1: calculation of the median for the data of table 3.2

Protein intake per consumption unit (g/day) (class interval)   No. of families (f)   Cumulative frequency
15-25                                                          30                     30
25-35                                                          40                     70
35-45                                                          100                    170
45-55                                                          110                    280
55-65                                                          80                     360
65-75                                                          30                     390
75-85                                                          10                     400
Total                                                          400

The median class is 45-55 and N = 400.

Median $= L + \left( \dfrac{\frac{N}{2} - F}{f} \right) \cdot c = 45 + \dfrac{(200 - 170) \cdot 10}{110} = 47.73$ g

Procedure for estimating the median by formula

The procedure for estimating the median (by formula) for a grouped frequency distribution is:
1. Form a cumulative frequency (F) column;
2. Find the value of $\dfrac{N}{2}$ (where $N = \sum f$);
3.
Find the F value that first exceeds $\dfrac{N}{2}$, which identifies the median class M;
4. Calculate the median using the following interpolation formula: $L + \left( \dfrac{\frac{N}{2} - F}{f} \right) \cdot c$

Example: Estimate the median for the following data, which represent the ages of a set of 130 representatives who took part in a statistical survey.

Age in years                20 and     25 and     30 and     35 and     40 and     45 and
                            under 25   under 30   under 35   under 40   under 45   under 50
Number of representatives   2          14         29         43         33         9

Answer

1.
Age (years)       Number of representatives (f)   (F)
20 and under 25   2                               2
25 and under 30   14                              16
30 and under 35   29                              45
35 and under 40   43                              88
40 and under 45   33                              121
45 and under 50   9                               130

2. $\dfrac{N}{2} = \dfrac{130}{2} = 65$

3. The median class is the class that has the first F greater than 65. Here, it is 35 to 40.

4. The median can now be estimated using the interpolation formula, with L = 35, F = 45, f = 43 and c = 5. Thus:

Median $= L + \left( \dfrac{\frac{N}{2} - F}{f} \right) \cdot c = 35 + \left( \dfrac{65 - 45}{43} \right) \times 5 = 37.33$

Median = 37.33 years

b) Estimating the median graphically

A percentage cumulative frequency curve (or ogive) is drawn, and the value of the variable that corresponds to the 50% point is read off and gives the median estimate.

Procedure for estimating the median graphically
1. Form a cumulative (percentage) frequency distribution;
2. Draw the cumulative frequency curve by plotting class upper bounds against cumulative percentage frequency and join the points with a smooth curve;
3. Read off the 50% point to give the median.

Properties of the median
1. The median is particularly useful where: a) a set or distribution has extreme values present, and b) values at the end of a set or distribution are not known. This means that the median can be used for open-ended distributions.
2. The median can be determined for all levels of data except nominal.
3.
The median is unique; there is only one median for a set of data.

The advantages of the median

The advantages of the median are:
• it is not affected by extremely large or small values;
• it is easily understood (i.e. half the data are smaller than the median and half are greater);
• it can be calculated even when the last class is open-ended and when the data are qualitative rather than quantitative.

The disadvantages of the median
• It does not use much of the information available;
• It requires that the observations be arranged into an array, which is time-consuming for a large body of ungrouped data.

C. THE MODE

1. Definition

The mode is the value of the observation that appears most frequently, or equivalently the value that has the largest frequency. The mode is used especially in describing nominal and ordinal levels of measurement.

It is possible for data not to have any mode at all, as in the case where all observations occur with equal frequency.

Example:
• The mode of the set 2, 1, 3, 3, 1, 1, 2, 4 is 1, since this value occurs most often.
• The mode for the data in table 3.1 is 3.76; this observation is the most commonly occurring.
• The mode of the following simple discrete frequency distribution:

x   4   5   6    7    8   9   10
f   2   5   21   18   9   2   1

is 6, since this value has the largest frequency.

2. The mode for grouped data

For a grouped frequency distribution, the mode cannot be determined exactly and so must be estimated. The technique used is one of interpolation. There are two methods that can be used to estimate the mode:
• using an interpolation formula;
• graphically, using a histogram.

Mode of a grouped frequency distribution by formula

An estimate of the mode for a grouped frequency distribution can be obtained using the following procedure:
1. Determine the modal class (the class which has the largest frequency).
2. Calculate $D_1$ = the difference between the largest frequency and the frequency immediately preceding it.
3.
Calculate D 2 = difference between the largest frequency and the frequency immediately following it. 4. Use the following interpolation formula: Interpolation formula for the mode 1 1 2 D Mode = L+ . D C D ¸ _ + ¸ , Where: L = lower bound of modal class C = modal class width And: D 1, D 2 are as described above in 2 and 3 Example 1: Estimate the mode of the following distribution of ages. Age (years) 20-25 25-30 30-35 35-40 40-45 45-50 Number of employees 2 14 29 43 33 9 Answer: Age (years) number of employees 20 and under 25 2 25 and under 30 14 30 and under 35 29 35 and under 40 43 40 and under 45 33 45 and under 50 9 62 D 1 = 43 – 29 = 14 D 2 = 43-33 = 10 The lower class bound of the modal class, L = 35 The class width of the modal class, C = 5 (from 35 to 40 ) 1 1 2 Thus: mode= .C 14 = 35+ .5 14+10 mode = 37.92 years D L D D ¸ _ + + ¸ , ¸ _ ¸ , Graphical estimation of the mode The graphical equivalent of the above interpolation formula is to construct three histogram bars, representing the class with the highest frequency and the ones on either side of it, and to draw two lines. The mode estimate is the x value corresponding to the intersection of the lines. Example 2: Estimation of the mode of a frequency distribution using the graphical formula. Using the data of ex 1: Age (years) number of employees 30 and under 35 29 35 and under 40 43 40 and under 45 33 Draw the graph 63 The advantages of the mode • The mode has the advantage of not being affected by extremely high or low values; • It is easily understood ( half the data are smaller than the median and half are greater), not difficult to calculate and can be used when the last class of a distribution is open –ended; • The mode is used for al levels of data: nominal, ordinal, interval, and ratio. The disadvantages of the mode The disadvantages of the mode are: • The mode does not use much of the information available; • For many sets of data, there is no mode because no value appears more than once. 
  For example, there is no mode for this set of price data: RWF 250, RWF 400, RWF 650 and RWF 1250;
• The mode is not always unique. Example: suppose the ages of the individuals in a scout club are 14, 16, 17, 18, 18, 20, 20, 22, 24, 24 and 25. The ages 18, 20 and 24 each occur twice, so all three are modes.

In general, the mean is the most frequently used measure of central tendency and the mode is the least used.

D. THE MIDRANGE
The midrange is defined as the sum of the lowest and highest values in the data set, divided by 2. The symbol MR is used for the midrange.

$$MR = \frac{\text{lowest value} + \text{highest value}}{2}$$

Example: Find the midrange of these numbers: 2, 3, 6, 8, 4 and 1.

$$MR = \frac{1 + 8}{2} = \frac{9}{2} = 4.5$$

Then, the midrange is 4.5.

The Relationship between the Arithmetic Mean, the Median and the Mode
• In a symmetrical (single-peaked) frequency distribution the mode, median and mean are located at the centre and are equal. Fig. (a) illustrates this for a normal distribution. In this case any one of these measures may be used.

  [Fig. (a): symmetrical distribution, Mean = Median = Mode]

• If the distribution of the variable is not symmetrical, we have a skewed distribution, and the arithmetic mean is not so typical of the distribution. In a positively skewed distribution, the mean is not at the centre: it is dragged to the right of centre by a few extremely high values of the variable. The median is generally the next largest measure in a positively skewed frequency distribution, and the mode is the smallest of the three measures. If the distribution is highly skewed, the mean would not be a good measure to use; the median and mode would be more representative.

  [Figure: positively skewed distribution, mode < median < mean]

• In a negatively skewed distribution the mean is reduced by a few extremely low values of the variable and hence will be left of centre. The median is greater than the arithmetic mean, and the modal value is the largest of the three measures. Again, if the distribution is highly skewed, the mean should not be used to represent the data.
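The ordering of the three averages can be checked numerically on the age distribution used in the grouped median and mode examples. The following is a minimal sketch, assuming equal class widths; the function names are illustrative and not part of the notes, and a missing neighbouring class is treated as having frequency 0 when the mode is estimated.

```python
# Grouped mean, median and mode for the age distribution of the worked examples.
# All names below are illustrative assumptions, not notation from the notes.

def grouped_mean(lower_bounds, width, freqs):
    """Arithmetic mean estimated from the class midpoints."""
    midpoints = [lb + width / 2 for lb in lower_bounds]
    return sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)


def grouped_median(lower_bounds, width, freqs):
    """Median = L + ((N/2 - F) / f) * c, interpolated inside the median class."""
    half = sum(freqs) / 2
    cumulative = 0  # F: cumulative frequency of the classes already passed
    for lb, f in zip(lower_bounds, freqs):
        if cumulative + f > half:  # first cumulative frequency exceeding N/2
            return lb + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("empty frequency distribution")


def grouped_mode(lower_bounds, width, freqs):
    """Mode = L + (D1 / (D1 + D2)) * C, assuming a single modal class."""
    k = freqs.index(max(freqs))  # modal class (the first one if there are ties)
    before = freqs[k - 1] if k > 0 else 0              # assumption: 0 if no class before
    after = freqs[k + 1] if k < len(freqs) - 1 else 0  # assumption: 0 if no class after
    d1 = freqs[k] - before
    d2 = freqs[k] - after
    return lower_bounds[k] + d1 / (d1 + d2) * width


lower_bounds = [20, 25, 30, 35, 40, 45]  # "20 and under 25", "25 and under 30", ...
freqs = [2, 14, 29, 43, 33, 9]

mean = grouped_mean(lower_bounds, 5, freqs)
median = grouped_median(lower_bounds, 5, freqs)
mode = grouped_mode(lower_bounds, 5, freqs)
print(round(mean, 2), round(median, 2), round(mode, 2))  # 37.04 37.33 37.92
```

Here mean < median < mode, which is the ordering described above for a negatively skewed distribution.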
In a moderately skewed distribution the following relationships hold approximately:
1. Mean − Mode = 3 (Mean − Median);
2. Median − Mode = 2 (Mean − Median);
3. Median = (2 Mean + Mode) / 3;
4. Mode = 3 Median − 2 Mean;
5. Mean = (3 Median − Mode) / 2.

THE GEOMETRIC MEAN G
The geometric mean is useful in finding the average of percentages, ratios, indexes or growth rates. It has a wide application in business and economics because we are often interested in finding the percentage changes in sales, salaries or economic figures, such as the Gross Domestic Product, which compound or build on each other.
The geometric mean G of a set of n positive numbers $x_1, x_2, x_3, \ldots, x_n$ is calculated using the formula:

$$G = \sqrt[n]{x_1 \, x_2 \, x_3 \cdots x_n}$$

Where n is the number of observations made of the variable x and $x_1, x_2, x_3, \ldots, x_n$ are the values of these observations.

Example: the geometric mean of the numbers 3, 25 and 45 is:

$$G = \sqrt[3]{3 \times 25 \times 45} = \sqrt[3]{3375} = 15$$

THE HARMONIC MEAN H
The harmonic mean is another specialized measure of location used only in particular circumstances, namely when the data consist of a set of rates, such as prices, speeds or productivity.
The harmonic mean H of a set of n numbers $x_1, x_2, x_3, \ldots, x_n$ is the reciprocal of the arithmetic mean of the reciprocals of the numbers:

$$H = \frac{1}{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

Where n is the number of observations.

Example: the harmonic mean of the numbers 2, 4 and 8 is:

$$H = \frac{3}{\frac{1}{2} + \frac{1}{4} + \frac{1}{8}} = \frac{3}{\frac{7}{8}} = 3.43$$

The relation between the arithmetic, geometric and harmonic means
The geometric mean of a set of positive numbers $x_1, x_2, x_3, \ldots, x_n$ is less than or equal to their arithmetic mean but is greater than or equal to their harmonic mean. In symbols:

$$H \le G \le \bar{X}$$

The equality signs hold only if all the numbers $x_1, x_2, x_3, \ldots, x_n$ are identical.

Example: The set 2, 4, 8 has arithmetic mean 4.67, geometric mean 4 and harmonic mean 3.43.
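The two formulas and the inequality H ≤ G ≤ X̄ can be checked on the examples above. This is a minimal sketch with illustrative function names; the geometric mean is computed through logarithms, which is an implementation choice (it avoids very large products) rather than something prescribed by the notes.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product of n positive numbers, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)


print(round(geometric_mean([3, 25, 45]), 2))  # 15.0
print(round(harmonic_mean([2, 4, 8]), 2))     # 3.43

data = [2, 4, 8]
h, g, a = harmonic_mean(data), geometric_mean(data), arithmetic_mean(data)
print(round(h, 2), round(g, 2), round(a, 2))  # 3.43 4.0 4.67
```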
3.2 MEASURES OF DISPERSION

Dispersion refers to the variability or spread in the data. A small value for a measure of dispersion indicates that the data are clustered closely, say, around the arithmetic mean. A large value indicates that the data are widely spread, so that the mean is a less reliable representative of the data. The most important measures of dispersion are:

1. Range is the difference between the largest and the smallest values in a data set. The range is the simplest of these measures. The symbol R is used for the range.

R = Largest value − Smallest value

Example: Find the range of the following distribution: 35, 45, 30, 35, 40, 25.
R = 45 − 25 = 20

2. Mean Deviation (MD) is the arithmetic mean of the deviations of the observations from the arithmetic mean, ignoring the sign of these deviations.

a) The formula for the mean deviation for ungrouped data is

$$MD = \frac{\sum |X - \mu|}{N} \text{ for populations} \qquad MD = \frac{\sum |X - \bar{X}|}{n} \text{ for samples}$$

Where:
X is the value of each observation;
$\bar{X}$ is the arithmetic mean of the values in the sample;
$\mu$ is the arithmetic mean of the population;
n is the number of observations in the sample;
N is the number of observations in the population;
| | indicates the absolute value.

Example: calculate the mean deviation of 43, 75, 48, 39, 51, 47, 50, 47.

Solution
First determine the mean as $\frac{400}{8} = 50$, and then:

$$MD = \frac{\sum |x - \bar{x}|}{n} = \frac{|43-50| + |75-50| + |48-50| + |39-50| + |51-50| + |47-50| + |50-50| + |47-50|}{8}$$

$$= \frac{7 + 25 + 2 + 11 + 1 + 3 + 0 + 3}{8} = \frac{52}{8} = 6.5$$

b) Mean deviation for grouped data:

$$MD = \frac{\sum f\,|X - \mu|}{N} \text{ for populations} \qquad MD = \frac{\sum f\,|x - \bar{x}|}{n} \text{ for samples}$$

Where f refers to the frequency of each class and X to the class midpoints.
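The range and the ungrouped mean deviation can be reproduced in a few lines. This is a minimal sketch with illustrative function names, applied to the two small data sets from the examples above.

```python
def data_range(values):
    """R = largest value - smallest value."""
    return max(values) - min(values)


def mean_deviation(values):
    """Mean of the absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


print(data_range([35, 45, 30, 35, 40, 25]))              # 20
print(mean_deviation([43, 75, 48, 39, 51, 47, 50, 47]))  # 6.5
```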
Example: calculate the mean and the mean deviation of the number of sales (see ex 4.2).

Table 1 Number of sales made by salesmen

Number of sales       0-4    5-9    10-14    15-19    20-24    25-29
Number of salesmen      1     14       23       21       15        6

Table 2 Layout of calculations

Number of sales    Number of salesmen (f)    Mid-point (x)    fx      |x − x̄|    f|x − x̄|
0 to 4                1                         2                2      13.3        13.3
5 to 9               14                         7               98       8.3       116.2
10 to 14             23                        12              276       3.3        75.9
15 to 19             21                        17              357       1.7        35.7
20 to 24             15                        22              330       6.7       100.5
25 to 29              6                        27              162      11.7        70.2
Totals               80                                       1225                 411.8

Mean number of sales, $\bar{x} = \frac{1225}{80} = 15.3$

Thus, mean deviation, $MD = \frac{\sum f\,|x - \bar{x}|}{\sum f} = \frac{411.8}{80} = 5.1$ sales

Characteristics of the mean deviation
a. The mean deviation can be regarded as a good representative measure of dispersion that is not difficult to understand. It is useful for comparing the variability between distributions of like nature.
b. Its practical disadvantage is that it can be complicated to calculate if the mean is anything other than a whole number.
c. Because of the modulus sign, the mean deviation is virtually impossible to handle theoretically and thus is not used in more advanced analysis.

3. Variance is the arithmetic mean of the squared deviations from the mean. The variance is nonnegative and is zero only if all observations are the same.
The population variance $\sigma^2$ (the Greek letter sigma, squared) and the sample variance $s^2$ for ungrouped data are given by:

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

Where: N is the number of observations in the population;
n is the number of observations in the sample (the sample variance uses the divisor n − 1).

For grouped data they are given by:

$$\sigma^2 = \frac{\sum f (X - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum f (X - \bar{X})^2}{n - 1}$$

4. Standard deviation.
The population standard deviation $\sigma$ and the sample standard deviation s are the positive square roots of their respective variances.
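The grouped mean deviation and the variance formulas above can be sketched as follows. The function names are illustrative assumptions. The first call reproduces the sales example; the second applies the sample formulas (divisor n − 1) to the 26 haemoglobin values of Table 3.3. The sketch works with the unrounded mean, so intermediate figures can differ slightly from the hand calculation before rounding.

```python
import math


def grouped_mean_deviation(midpoints, freqs):
    """Return (mean, mean deviation) of a grouped distribution."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    md = sum(f * abs(x - mean) for f, x in zip(freqs, midpoints)) / n
    return mean, md


def variance(values, sample=True):
    """Sum of squared deviations divided by n - 1 (sample) or N (population)."""
    n = len(values)
    mean = sum(values) / n
    squared = sum((x - mean) ** 2 for x in values)
    return squared / (n - 1 if sample else n)


# Sales example: class midpoints and numbers of salesmen
mean_sales, md_sales = grouped_mean_deviation(
    [2, 7, 12, 17, 22, 27], [1, 14, 23, 21, 15, 6]
)
print(round(mean_sales, 1), round(md_sales, 1))  # 15.3 5.1

# Haemoglobin values (g%) of 26 normal children, Table 3.3
haemoglobin = [
    11.8, 12.9, 12.4, 13.3, 13.8, 11.4, 12.3, 11.7, 12.9, 12.2, 10.4, 10.8, 12.7,
    13.2, 11.6, 12.0, 12.2, 14.2, 10.8, 10.5, 11.6, 13.5, 12.2, 11.2, 12.6, 13.0,
]
s2 = variance(haemoglobin)  # sample variance, divisor n - 1 = 25
s = math.sqrt(s2)           # sample standard deviation
print(round(s2, 3), round(s, 2))  # 1.016 1.01
```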
a) For ungrouped data:

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} \quad \text{and} \quad s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

Example 1 (for ungrouped data): calculate the variance and standard deviation of the data in the following table.

Table 3.3 Haemoglobin values (g%) of 26 normal children

11.8    12.9    12.4    13.3    13.8
11.4    12.3    11.7    12.9    12.2
10.4    10.8    12.7    13.2
11.6    12.0    12.2    14.2
10.8    10.5    11.6    13.5
12.2    11.2    12.6    13.0

Table 3.4 Calculation of standard deviation and variance for the data of Table 3.3

Serial No    Haemoglobin value    Deviation from arithmetic mean (12.2)    Square of deviation
 1             11.8                 -0.4                                     0.16
 2             11.4                 -0.8                                     0.64
 3             10.4                 -1.8                                     3.24
 4             11.6                 -0.6                                     0.36
 5             10.8                 -1.4                                     1.96
 6             12.2                  0.0                                     0.00
 7             12.9                  0.7                                     0.49
 8             12.3                  0.1                                     0.01
 9             10.8                 -1.4                                     1.96
10             12.0                 -0.2                                     0.04
11             10.5                 -1.7                                     2.89
12             11.2                 -1.0                                     1.00
13             12.4                  0.2                                     0.04
14             11.7                 -0.5                                     0.25
15             12.7                  0.5                                     0.25
16             12.2                  0.0                                     0.00
17             11.6                 -0.6                                     0.36
18             12.6                  0.4                                     0.16
19             13.3                  1.1                                     1.21
20             12.9                  0.7                                     0.49
21             13.2                  1.0                                     1.00
22             14.2                  2.0                                     4.00
23             13.5                  1.3                                     1.69
24             13.0                  0.8                                     0.64
25             13.8                  1.6                                     2.56
26             12.2                  0.0                                     0.00
Total                                0.0                                    25.40

Arithmetic mean is 12.2

$$\text{Standard deviation} = s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{25.40}{25}} = \sqrt{1.016} = 1.01 \text{ g\%}$$

$$\text{Variance} = s^2 = 1.016$$

b) For grouped data:

$$\sigma = \sqrt{\frac{\sum f (X - \mu)^2}{N}} \quad \text{and} \quad s = \sqrt{\frac{\sum f (X - \bar{X})^2}{n - 1}}$$

Example 1: calculation of variance and standard deviation for the data of Table 3.2

Protein intake       No. of families    Mid-point of class    Deviation of mid-point from     Squared deviation    Frequency × squared deviation
(class interval)     (f)                interval (x)          arithmetic mean (x − x̄)          (x − x̄)²             f(x − x̄)²
15-25                  30                 20                    -27.5                            756.25              22687.5
25-35                  40                 30                    -17.5                            306.25              12250.0
35-45                 100                 40                     -7.5                             56.25               5625.0
45-55                 110                 50                      2.5                              6.25                687.5
55-65                  80                 60                     12.5                            156.25              12500.0
65-75                  30                 70                     22.5                            506.25              15187.5
75-85                  10                 80                     32.5                           1056.25              10562.5
Total                 400                                                                                            79500.0

Arithmetic mean = 47.5

From this table, we get $\sum f (x - \bar{x})^2 = 79500$ and $\sum f = 400$. Therefore, dividing by the total frequency $\sum f = 400$:

$$\text{Standard deviation} = s = \sqrt{\frac{79500.0}{400}} = \sqrt{198.75} = 14.10$$

$$\text{Variance} = s^2 = 198.75$$

The major characteristics of the standard deviation are:
• It is in the same units as the original data;
• It is the square root of the average squared distance from the mean;
• It cannot be negative;
• It is the most widely reported measure of dispersion.

5. The coefficient of variation
In the majority of cases where distributions need to be compared with respect to variability, the following measure, known as the coefficient of variation, is much more appropriate and is considered as the standard measure of relative variation.
The coefficient of variation is the standard deviation divided by the mean. The result is expressed as a percentage.

$$\text{Coefficient of variation (C.V.)} = \frac{\text{standard deviation}}{\text{mean}} \times 100$$

For the example given in Table 3.3, the standard deviation is s = 1.01 and the arithmetic mean is $\bar{x} = 12.2$, so the coefficient of variation is

$$\frac{1.01 \times 100}{12.2} = 8.28\%$$

For the example given in Table 3.2, the standard deviation is s = 14.10 and the arithmetic mean is $\bar{x} = 47.5$; the coefficient of variation, therefore, is

$$\frac{14.10 \times 100}{47.5} = 29.68\%$$

3.3 MEASURES OF POSITION

In addition to measures of central tendency and measures of variation (dispersion), there are measures of position or location. These measures include standard scores, percentiles, deciles and quartiles. They are used to locate the relative position of a data value in a data set. For example, if a value is located at the 80th percentile, it means that 80% of the values fall below it in the distribution and 20% of the values fall above it.

A. Standard scores
The standard score represents the number of standard deviations that a data value falls above or below the mean. The symbol for a standard score is z. The formula is:

$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$

For samples, the formula is:

$$z = \frac{X - \bar{X}}{s}$$

For populations, the formula is:

$$z = \frac{X - \mu}{\sigma}$$

Example: A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she scored 30 on a history test with a mean of 25 and a standard deviation of 5.
Compare her relative positions on the two tests.

Solution
First, find the z scores. For calculus the z score is:

$$z = \frac{X - \bar{X}}{s} = \frac{65 - 50}{10} = 1.5$$

For history the z score is:

$$z = \frac{30 - 25}{5} = 1.0$$

Since the z score for calculus is larger, her relative position in the calculus class is higher than her relative position in the history class.
Note that if the z score is positive, the score is above the mean. If the z score is 0, the score is the same as the mean. And if the z score is negative, the score is below the mean.

B. Quartiles
Quartiles divide the distribution into four equal parts (quarters). The value of the variable for which the cumulative frequency is $\frac{N}{4}$ is called the first quartile or lower quartile and is denoted by $Q_1$. Similarly, the value of the variable for which the cumulative frequency is $\frac{3N}{4}$ is called the third quartile or upper quartile and is denoted by $Q_3$. Clearly the median is the second quartile, and it can be denoted by $Q_2$.

In the case of ungrouped data with n items $x_1, x_2, \ldots, x_n$ arranged in ascending order, $Q_1$ is calculated as follows. Let

$$i = \left[\tfrac{1}{4}(n + 1)\right] = \text{integral part of } \tfrac{1}{4}(n + 1)$$

and let

$$q = \tfrac{1}{4}(n + 1) - \left[\tfrac{1}{4}(n + 1)\right]$$

Hence q is the fractional part. Then

$$Q_1 = x_i + q\,(x_{i+1} - x_i)$$

Similarly,

$$Q_3 = x_i + q\,(x_{i+1} - x_i)$$

where

$$i = \left[\tfrac{3}{4}(n + 1)\right] \quad \text{and} \quad q = \tfrac{3}{4}(n + 1) - \left[\tfrac{3}{4}(n + 1)\right]$$

In the case of a grouped frequency distribution the quartiles are calculated by using the formulas:

$$Q_1 = L + \left(\frac{\frac{N}{4} - F}{f}\right) C \quad \text{is called the lower quartile}$$

$$Q_2 = L + \left(\frac{\frac{N}{2} - F}{f}\right) C \quad \text{is the median}$$

$$Q_3 = L + \left(\frac{\frac{3N}{4} - F}{f}\right) C \quad \text{is called the upper quartile}$$

Where L is the lower limit of the class in which the particular quartile lies, f is the frequency of this class, C is the width of the class and F is the cumulative frequency of the preceding class.

C. Deciles
Similarly, deciles are the values of the variable which divide the total frequency into 10 equal parts. Consider a frequency distribution with total frequency N.
The values of the variable for which the cumulative frequencies are $\frac{iN}{10}$ $(i = 1, 2, \ldots, 9)$ are called deciles. The ith decile is denoted by $D_i$. Clearly the median is the fifth decile; hence the median can also be denoted by $D_5$.

In the case of ungrouped data with n items, for k = 1, 2, 3, …, 9:

$$D_k = x_i + q\,(x_{i+1} - x_i)$$

Where

$$i = \left[\frac{k(n + 1)}{10}\right] \quad \text{and} \quad q = \frac{k(n + 1)}{10} - \left[\frac{k(n + 1)}{10}\right]$$

For a grouped frequency distribution, we have

$$D_i = L + \left(\frac{\frac{iN}{10} - F}{f}\right) C; \quad (i = 1, 2, \ldots, 9)$$

D. Percentiles
Percentiles are the values of the variable which divide the total frequency into 100 equal parts. They are denoted by $P_1, P_2, \ldots, P_{99}$, and the ith percentile is denoted by $P_i$. Clearly the median is the 50th percentile, and hence the median can also be denoted by $P_{50}$.

In the case of ungrouped data with n items, for k = 1, 2, 3, …, 99:

$$P_k = x_i + q\,(x_{i+1} - x_i)$$

Where

$$i = \left[\frac{k(n + 1)}{100}\right] \quad \text{and} \quad q = \frac{k(n + 1)}{100} - \left[\frac{k(n + 1)}{100}\right]$$

In the case of a grouped frequency distribution, percentiles are obtained from the following formula:

$$P_i = L + \left(\frac{\frac{iN}{100} - F}{f}\right) C; \quad i = 1, 2, \ldots, 99$$

ILLUSTRATIVE EXAMPLES

1. Find the median and quartiles of the heights in cm of eleven students given by 66, 65, 64, 70, 61, 60, 56, 63, 60, 67, 62.

Solution: Arranging the given data in ascending order of magnitude we get 56, 60, 60, 61, 62, 63, 64, 65, 66, 67, 70.
Here n = 11. Since n is odd, the median is the sixth item, which is equal to 63.

$$Q_1 = \text{size of the } \tfrac{1}{4}(n + 1)\text{th item} = \tfrac{1}{4}(11 + 1) = 3\text{rd item} = 60$$

$$Q_3 = \text{size of the } \tfrac{3}{4}(n + 1)\text{th item} = 9\text{th item} = 66$$

2. Find the median and quartile marks of 10 students in a statistics test whose marks are given as 40, 90, 61, 68, 72, 43, 50, 84, 75, 33.

Solution: Arranging in ascending order of magnitude we get 33, 40, 43, 50, 61, 68, 72, 75, 84, 90.
Here n = 10. Since n is even, the median is the average of the two middle items, 61 and 68.

$$\text{Median} = \tfrac{1}{2}(61 + 68) = 64.5 \text{ marks}$$

First quartile
Here

$$i = \left[\tfrac{1}{4}(n + 1)\right] = 2 \quad \text{and} \quad q = \tfrac{1}{4}(n + 1) - \left[\tfrac{1}{4}(n + 1)\right] = 0.75$$

$$Q_1 = x_2 + 0.75\,(x_3 - x_2) = 40 + 0.75\,(43 - 40) = 42.25$$

Third quartile

$$i = \left[\tfrac{3}{4}(n + 1)\right] = 8 \quad \text{and} \quad q = \tfrac{3}{4}(n + 1) - \left[\tfrac{3}{4}(n + 1)\right] = 0.25$$

$$Q_3 = x_8 + 0.25\,(x_9 - x_8) = 75 + 0.25\,(84 - 75) = 77.25$$

3. Find the lower quartile, median, upper quartile, 4th decile and 60th percentile of the following data.

Marks             0-4    4-8    8-12    12-14    14-18    18-20    20-25    25 & above
No. of students    10     12      18        7        5        8        4         6

Solution

Marks          No. of students    Cumulative frequency
0-4              10                 10
4-8              12                 22
8-12             18                 40
12-14             7                 47
14-18             5                 52
18-20             8                 60
20-25             4                 64
25 & above        6                 70
               N = Σf = 70

i) Median $= L + \frac{C}{f}\left(\frac{N}{2} - F\right)$

Here $\frac{N}{2} = \frac{70}{2} = 35$, so the median class is 8-12, with L = 8, C = 12 − 8 = 4, F = 22, f = 18.

$$\text{Median} = 8 + \frac{4}{18}(35 - 22) = 10.89$$

ii) Lower quartile: $Q_1 = L + \frac{C}{f}\left(\frac{N}{4} - F\right)$

Here $\frac{N}{4} = \frac{70}{4} = 17.5$, so the quartile class is 4-8, with L = 4, C = 4, f = 12, F = 10.

$$Q_1 = 4 + \frac{4}{12}(17.5 - 10) = 6.5$$

iii) Upper quartile: $Q_3 = L + \frac{C}{f}\left(\frac{3N}{4} - F\right)$

Here $\frac{3N}{4} = \frac{3 \times 70}{4} = 52.5$, so the quartile class is 18-20, with L = 18, C = 20 − 18 = 2, f = 8, F = 52.

$$Q_3 = 18 + \frac{2}{8}(52.5 - 52) = 18.125$$

iv) The 4th decile is $D_4 = L + \frac{C}{f}\left(\frac{4N}{10} - F\right)$

Here $\frac{4N}{10} = \frac{280}{10} = 28$, so the decile class is 8-12, with L = 8, C = 4, f = 18, F = 22.

$$D_4 = 8 + \frac{4}{18}(28 - 22) = 9.33$$

v) The 60th percentile is $P_{60}$, which is given by $P_{60} = L + \frac{C}{f}\left(\frac{60N}{100} - F\right)$

Here $\frac{60N}{100} = \frac{60 \times 70}{100} = 42$, so the percentile class is 12-14, with L = 12, C = 14 − 12 = 2, f = 7, F = 40.

$$P_{60} = 12 + \frac{2}{7}(42 - 40) = 12.57$$
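The position rule for ungrouped data and the interpolation formula for grouped data can be sketched as below and checked against the three illustrative examples. The function names and the representation of classes as (lower, upper) pairs are assumptions made for this sketch. Returning the smallest or largest item when the position falls outside the data is also an assumption, since the notes do not cover that case.

```python
import math


def ungrouped_quantile(data, k, parts):
    """k-th of `parts` quantiles using the position k(n + 1)/parts."""
    xs = sorted(data)
    n = len(xs)
    position = k * (n + 1) / parts
    i = math.floor(position)  # integral part
    q = position - i          # fractional part
    if i < 1:                 # assumption: position before the first item
        return xs[0]
    if i >= n:                # assumption: position at or beyond the last item
        return xs[-1]
    return xs[i - 1] + q * (xs[i] - xs[i - 1])  # x_i + q (x_{i+1} - x_i)


def grouped_quantile(classes, freqs, k, parts):
    """L + ((kN/parts - F) / f) * C for classes given as (lower, upper) pairs."""
    target = k * sum(freqs) / parts
    cumulative = 0  # F: cumulative frequency of the preceding classes
    for (lower, upper), f in zip(classes, freqs):
        if cumulative + f > target:
            if upper is None:
                raise ValueError("quantile falls in an open-ended class")
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("k must be smaller than parts")


heights = [66, 65, 64, 70, 61, 60, 56, 63, 60, 67, 62]
marks = [40, 90, 61, 68, 72, 43, 50, 84, 75, 33]
print(ungrouped_quantile(heights, 1, 4), ungrouped_quantile(heights, 3, 4))  # 60.0 66.0
print(ungrouped_quantile(marks, 1, 4), ungrouped_quantile(marks, 3, 4))      # 42.25 77.25

classes = [(0, 4), (4, 8), (8, 12), (12, 14), (14, 18), (18, 20), (20, 25), (25, None)]
freqs = [10, 12, 18, 7, 5, 8, 4, 6]
print(round(grouped_quantile(classes, freqs, 1, 2), 2))     # median, 10.89
print(grouped_quantile(classes, freqs, 1, 4))               # Q1, 6.5
print(grouped_quantile(classes, freqs, 3, 4))               # Q3, 18.125
print(round(grouped_quantile(classes, freqs, 4, 10), 2))    # D4, 9.33
print(round(grouped_quantile(classes, freqs, 60, 100), 2))  # P60, 12.57
```

Software packages use several different conventions for the quantile position, so their results may differ from the k(n + 1)/parts rule used in these notes.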