Educational Data Mining in relation to Educational Statistics of Nepal

TRIBHUVAN UNIVERSITY INSTITUTE OF ENGINEERING PULCHOWK CAMPUS

Major Project Report on Educational Data Mining in relation to Educational Statistics of Nepal

By:
Roshan Bhandari (16226)
Sijan Bhandari (16236)
Subit Raj Pokharel (16237)
Sujit Maharjan (16239)

A PROJECT WAS SUBMITTED TO THE DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING IN PARTIAL FULFILMENT OF THE REQUIREMENT FOR THE BACHELOR’S DEGREE IN COMPUTER ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING LALITPUR, NEPAL

August, 2013

TRIBHUVAN UNIVERSITY INSTITUTE OF ENGINEERING PULCHOWK CAMPUS
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

The undersigned certify that they have read, and recommended to the Institute of Engineering for acceptance, a project report entitled EDUCATIONAL DATA MINING IN RELATION TO EDUCATIONAL STATISTICS OF NEPAL submitted by Roshan Bhandari, Sijan Bhandari, Subit Raj Pokharel and Sujit Maharjan in partial fulfilment of the requirements for the Bachelor’s degree in Computer Engineering.

_________________________________________________ Supervisor, Er. Bibha Sthapit Lecturer, Department of Electronics and Computer Engineering

_________________________________________________ Co-Supervisor, Er. Anjesh Tuladhar COO, Young Innovations Pvt. Ltd.

_________________________________________________ Internal Examiner, Dr. Nanda Bikram Adhikari Assistant Professor, Department of Electronics and Computer Engineering

_________________________________________________ External Examiner, Mr. Anup Poudyal Sr. Software Engineer, Verisk Information Technologies

Date of Approval: August 26, 2013


COPYRIGHT

The author has agreed that the Library, Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering may make this report freely available for inspection. Moreover, the author has agreed that permission for extensive copying of this project report for scholarly purposes may be granted by the supervisors who supervised the project work recorded herein or, in their absence, by the Head of the Department wherein the project report was done. It is understood that recognition will be given to the author of this report and to the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this project report. Copying, publication or other use of this report for financial gain without the approval of the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering and the author’s written permission is prohibited. Requests for permission to copy or to make any other use of the material in this report, in whole or in part, should be addressed to:

Head Department of Electronics and Computer Engineering Pulchowk Campus, Institute of Engineering Lalitpur, Kathmandu Nepal


ACKNOWLEDGEMENT

We owe our gratitude to Dr. Arun Timalsina and Dr. Aman Shakya of the Department of Electronics and Computer Engineering, Pulchowk Campus, for their kind effort and continuous guidance during the project. We would also like to thank the Department of Education (DOE) for its co-operation and interest in our project. The kind help and suggestions of Shankar Thapa and Er. Arjun Aryal of the DOE are worth millions of thanks. We would also like to thank Er. Anup Neupane of the Office of the Controller of Examinations for providing the school-level average marks. Our special thanks go to Anjesh Tuladhar and Rinu Maharjan of Young Innovations Private Limited for their constant co-operation and helpful guidelines throughout the project. We also owe gratitude to Er. Manoj Ghimire for his valuable suggestions and continuous feedback. Our final thanks go to all our friends, seniors, teachers and everyone else who helped us directly or indirectly in completing this project.

Roshan Bhandari (16226)
Sijan Bhandari (16236)
Subit Raj Pokharel (16237)
Sujit Maharjan (16239)


ABSTRACT

Educational Data Mining is an emerging discipline among researchers and students. It is concerned with developing methods for exploring the unique types of data that come from educational institutes and organizations, and with using those methods to better understand the education system. Through our project “Educational Data Mining in relation to the Educational Statistics of Nepal” we have extracted useful information from various data collected by the Department of Education and other organizations working in the field of education. Our project analyzes various educational statistics and presents the results in a more readable and understandable form so that they can support better decision making. We have also created an API over the available data fields, which helps developers and students easily obtain formatted data for their applications and projects. Our major outcome is the visualization of these data on a map using the API, presented in a more interactive way for better understanding.

Using the data, we have calculated the Educational Development Index (EDI) to analyze the educational level of various districts. The analysis is done with different educational indicators and parameters, which are shown in statistical charts. The EDI is also simulated under changes in parameters to help determine how the EDI changes across districts. The EDI is one of the major outcomes of this project and mathematically determines the level of education of the various districts. We also have school-level, subject-wise average marks data from the Office of the Controller of Examinations, which we have clustered to analyze the performance of schools and districts in the SLC. We have also used correlation analysis to calculate relationships between various data parameters, and regression analysis to find the models of education parameters that best fit the given data sets. The clusters, relationships and predictions can be viewed along with other statistical charts.

Keywords: Educational Development Index, Cluster Analysis on SLC Marks, Classification, Educational Data Analysis of Nepal, Regression Analysis, Correlation Analysis


TABLE OF CONTENTS

ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1. INTRODUCTION
  1.1. Background
  1.2. Overview
  1.3. Motivation
  1.4. Aims and Objectives
    1.4.1. Aims
    1.4.2. Objectives
  1.5. Scope of the project
2. LITERATURE REVIEW
  2.1. Existing Tools for Data Analysis in DOE
  2.2. Spreadsheet
  2.3. SPSS as an Analysis Tool
3. CURRENT SCENARIO OF EDUCATION IN NEPAL
  3.1. School Levels
  3.2. School Numbers
  3.3. Teacher Statistics
  3.4. Enrollment Statistics
    3.4.1. Female Enrollment
    3.4.2. Male Enrollment
  3.5. Promotion Statistics
    3.5.1. Promotion Rate in Primary School Level
    3.5.2. Promotion Rate in Lower Secondary School Level
    3.5.3. Promotion Rate in Secondary School Level
    3.5.4. Dropout Statistics
  3.6. Repetition Statistics
  3.7. SLC Results Statistics
4. RESEARCH
  4.1. School Education Nepal
  4.2. Data Source
  4.3. Data Collection
    4.3.1. Collection of Education Data
    4.3.2. Collection of SLC Results Data
    4.3.3. Collection of GIS Data
  4.4. Data Extraction
  4.5. Data Storage
    4.5.1. Average Student School Ratio
    4.5.2. Ratio of Primary to Upper Primary School
    4.5.3. Percentage of Female Teachers
    4.5.4. Percentage of Unqualified Teachers
    4.5.5. Participation of Janajati Students
    4.5.6. Participation of Dalit Students
    4.5.7. Repetition Rate
    4.5.8. Dropout Rate
    4.5.9. Gross Enrollment Rate
    4.5.10. Percentage of Enrolled Student Passed
    4.5.11. Gender Parity Index
    4.5.12. Student Teacher Ratio
5. THEORETICAL BACKGROUND
  5.1. Educational Development Index
  5.2. Defining Indicators
  5.3. Cluster Analysis
  5.4. Regression Analysis
    5.4.1. Predicting the Present
    5.4.2. Shaping the Future
  5.5. Multiple Regressions
    5.5.1. Least Square Estimates
    5.5.2. Ordinary Least Squares
    5.5.3. Coefficient of Determination (R²)
    5.5.4. Adjusted R²
    5.5.5. F-STAT
    5.5.6. Standard Error of the Estimate
  5.6. Correlation Analysis
  5.7. Schools Classification
6. TECHNICAL BACKGROUND
  6.1. Educational Development Index
    6.1.1. Loading Data
    6.1.2. Normalizing Data
    6.1.3. Principal Components Analysis
    6.1.4. Weight Calculation
    6.1.5. EDI Calculation
  6.2. Clustering Algorithm and Techniques
    6.2.1. K-means Clustering
    6.2.2. Hartigan and Wong Implementation of K-means Clustering
    6.2.3. Cluster Analysis of SLC Results
  6.3. Regression Algorithm and Techniques
    6.3.1. Implementing Regression Method
    6.3.2. Implementation of Multiple Regression
  6.4. ID3 Algorithm for Schools Classification
    6.4.1. Algorithm
    6.4.2. Data Description
    6.4.3. Sufficient Examples
    6.4.4. Attribute Selection
7. SYSTEM DESCRIPTION
  7.1. Requirement Specification
    7.1.1. High Level Requirement
    7.1.2. Functional Requirement
  7.2. System Block Diagram
  7.3. Description of System
    7.3.1. Data Extraction and Storage
    7.3.2. Database Design
    7.3.3. API Interface and Services
    7.3.4. Visualization
8. RESULTS, VISUALIZATION AND ANALYSIS OF THE DATA
  8.1. Output and Analysis of the Parameter of EDI
    8.1.1. Average Student School Ratio
    8.1.2. Percentage of Female Teachers
    8.1.3. Percentage of Unqualified (Untrained) Teachers
    8.1.4. Repetition Rate
    8.1.5. Drop Out Rate
    8.1.6. Gross Enrollment Rate
    8.1.7. Student Teacher Ratio
  8.2. Calculation, Visualization and Findings of Educational Development Index
    8.2.1. EDI Values of year 2007 to 2011
    8.2.2. Visualization of Matrix for EDI Calculation
    8.2.3. EDI Visualization
  8.3. Results and Findings of Cluster Analysis
    8.3.1. Cluster Analysis on Average Mathematics Marks
    8.3.2. Cluster Analysis of Science Results
    8.3.3. Cluster Analysis on English Marks
    8.3.4. Cluster Analysis on Average Marks of Health
    8.3.5. Cluster Analysis on Pass Percentage
  8.4. Correlation Analysis Output
  8.5. Regression Analysis
    8.5.1. Gross Enrollment Model
    8.5.2. Girl's Enrollment Model
    8.5.3. Dropout Rate Model
    8.5.4. Decision Tree for Schools Classification
  8.6. Visualization in Map
    8.6.1. Pupil Teacher Ratio in 2008
    8.6.2. Pupil Teacher Ratio in 2011
    8.6.3. Educational Development Index in 2011
9. TOOLS, PLATFORMS AND TECHNOLOGIES USED
  9.1. Language
  9.2. Framework
  9.3. Project Management Tools
  9.4. Database
  9.5. Data Extraction Tools
  9.6. Visualization Tools
    9.6.1. Visualization in Map
    9.6.2. UI Design using Twitter Bootstrap
  9.7. Technology used for making API
    9.7.1. JSON (JavaScript Object Notation)
    9.7.2. Comma-separated values (CSV)
  9.8. Tools for Data Analysis
    9.8.1. R Statistical Programming Language
    9.8.2. Rpy2
    9.8.3. Numpy
    9.8.4. Scipy
10. CONCLUSION
11. FUTURE ENHANCEMENTS
BIBLIOGRAPHY/REFERENCES


LIST OF FIGURES

Figure 3.1: Number of Schools in Nepal
Figure 3.2: Number of Teachers at Lower Secondary Schools
Figure 3.3: Number of Teachers in Lower Secondary Schools
Figure 3.4: Number of Teachers in Primary
Figure 3.5: Number of Female students enrollment at different level
Figure 3.6: Number of male Enrollments at different school levels
Figure 3.7: Promotion Rate in Primary School Level of Nepal
Figure 3.8: Promotion Rate in LS
Figure 3.9: Dropout Rate LSS
Figure 3.10: Dropout Rate in Primary School
Figure 3.11: Dropout Rate in Secondary School
Figure 3.12: Number of Schools appeared in SLC
Figure 3.13: Number of Students Appeared and passed in SLC
Figure 5.1: EDI with all components and indicators
Figure 7.1: System Block Diagram
Figure 7.2: ER Diagram for the System
Figure 7.3: Use case Diagram for the System
Figure 7.4: Sequence Diagram for the API call
Figure 7.5: Sequence Diagram for Visualization
Figure 8.1: Visualization of top 10 districts with maximum ASSR
Figure 8.2: Top 10 districts with percentage of female teachers in 2011
Figure 8.3: Top 10 districts with high percentage of untrained teachers in 2011
Figure 8.4: Visualization of Top 10 Repetition Rate in 2011
Figure 8.5: Visualization of districts with top 10 Dropout Rates
Figure 8.6: Visualization of Top 10 GER in 2011
Figure 8.7: Visualization of Districts with Top STR values
Figure 8.8: Educational Development Index
Figure 8.9: 3-D Scatter Plot of Marks of Maths in 3 consecutive years
Figure 8.10: Plot of Maths Cluster
Figure 8.11: 3-D Scatter Plot of Marks of Science in 3 consecutive years
Figure 8.12: Plot of Science Cluster
Figure 8.13: Performance in Science Marks
Figure 8.14: 3-D Scatter Plot of Marks of English in 3 consecutive years
Figure 8.15: Plot of English Cluster
Figure 8.16: Government School Below Average Mark in English
Figure 8.17: Plot of Health Cluster
Figure 8.18: 3-D Scatter Plot of Pass Percentage in 3 consecutive years
Figure 8.19: Comparison of Cluster 1 and Cluster 4
Figure 8.20: Multiple Regression of Teachers and Girls Enrollment
Figure 8.21: Multiple Regression of Dropout, Student Passed and Gender Index
Figure 8.22: Decision Tree for Schools classification
Figure 8.23: Pupil Teacher Ratio of 2008 in Map
Figure 8.24: Pupil Teacher Ratio of 2011 in Map
Figure 8.25: Educational Development Index of 2011 in map


LIST OF TABLES

Table 3.1: Number of Schools in Nepal
Table 3.2: Teachers in Lower Secondary School Level
Table 3.3: Number of Teachers in LSS
Table 3.4: Number of teachers by gender in different years
Table 3.5: Number of Female Enrollment rate
Table 3.6: Number of Male Enrollment at different levels
Table 3.7: Promotion Rate in PSL
Table 3.8: Promotion Rate in Lower Secondary Level
Figure 3.9: Promotion Rate in Secondary Level
Table 3.10: Promotion Rate in Secondary Level
Table 3.11: Dropout Rate at different Levels of Nepal
Table 3.12: Repetition Rates at different levels of Nepal
Table 3.13: SLC Result Statistics
Table 7.1: Call Parameters and Variables
Table 7.2: Parameter values for each district
Table 8.1: EDI values for all districts
Table 8.2: Center values for Math clusters
Table 8.3: Center values for Science clusters
Table 8.4: Number of Schools with Science Marks below 32
Table 8.5: Center values for English clusters
Table 8.6: Number of Schools with English Marks below 32
Table 8.7: Center values for Health clusters
Table 8.8: Number of Schools with Health Marks below 32
Table 8.9: Center values for Pass Percentage clusters
Table 8.10: Plot of Pass Percentage Cluster
Table 8.11: Comparison of Cluster 1 and Cluster 4
Table 8.12: District level Correlation Matrix
Table 8.13: School level Correlation Matrix
Figure 8.14: Multiple Regression for Student Enrollment, Teachers and Classroom
Table 8.15: Summary of Regression Result for GER
Table 8.16: Model Statistics for GER
Table 8.17: Summary of Results for Girls Enrollment
Table 8.18: Model Statistics of Girls Enrollment
Table 8.19: Summary of Regression Result for Dropout
Table 8.20: Model Statistics for Dropout Rate

xv

LIST OF ABBREVIATIONS

API      Application Programming Interface
APTR     Average Pupil Teacher Ratio
ASSR     Average Student School Ratio
CBS      Central Bureau of Statistics
CSV      Comma Separated Values
DB       Database
DOE      Department of Education
DR       Dropout Rate
ECD      Early Childhood Development
EDI      Educational Development Index
EDM      Educational Data Mining
EER      Enhanced Entity Relationship
EFA      Education For All
GAN      Global Action Nepal
GER      Gross Enrollment Rate
GON      Government of Nepal
GPI      Gender Parity Index
INGO     International Non-Government Organization
JSON     JavaScript Object Notation
LSS      Lower Secondary School
PCA      Principal Component Analysis
PEP      Percentage of Enrolled Passed
PPC      Pre-Primary Classes
PRD      Promotion Repetition Dropout
PSL      Primary School Level
PUT      Percentage of Unqualified Teachers
RR       Repetition Rate
SLC      School Leaving Certificate
SQL      Structured Query Language
SSL      Secondary School Level
STR      Student Teacher Ratio
TBE      Total Boys Enrolled
TCN      Total Classroom Number
TDJ      Total Dalit Janajati
TFT      Total Female Teacher
TGE      Total Girls Enrolled
TMT      Total Male Teacher
TTN      Total Toilet Number
UNESCO   United Nations Educational Scientific and Cultural Organization
VDC      Village Development Committee


1. INTRODUCTION

1.1. Background

The world is progressing rapidly with technology, and data has been one of the key factors in this development. Every organization generates large volumes of data from its day-to-day operations, whether customer data, hospital records or research data. Managing these data to maximize the benefit to an organization has always been a challenge and a topic of research. Data mining, statistics and bioinformatics are among the fields that deal with managing and analyzing such data.

In Nepal, organizations such as Nepal Telecom, the Central Bureau of Statistics (CBS), banks and ministries such as the Ministry of Education are key collectors of data. They collect and process large amounts of raw data periodically every year.

Educational Data Mining is a relatively new field of data mining. Data related to various aspects of education are collected by various organizations. These data may relate to the marks and performance of students, or to the factors that determine the educational development of a nation. Educational Data Mining covers processing, analyzing and visualizing the data collected, and it can reveal previously unknown patterns in the education field.

1.2. Overview

Our project, “Educational Data Mining in relation to the Educational Statistics of Nepal”, aims to find useful information in the data collected by the Department of Education and other organizations working in the field of education. The project analyzes the collected educational statistics and presents the results in a useful form. Our software is a tool that analyzes these data and presents the results in a more readable and understandable form, so that they can support better decision making. We apply various data mining techniques for this purpose.

We have various datasets related to the educational statistics of districts and schools, such as enrollment ratio, dropout rate and student teacher ratio. We used these data to find the Educational Development Index (EDI) of each school and district, and then computed a ranking based on the EDI. Wherever possible, the project also relates data from other sources, such as the census and budget data, to improve the ranking. Classification techniques and clustering algorithms are applied to find the educationally backward regions, schools, districts and communities based on the EDI. The conclusions of the algorithms are then presented as visualizations.

1.3. Motivation

Government and non-government organizations collect huge amounts of data every year. These data reflect the conditions of society. Extracting useful information from them and visualizing it meaningfully for the maximum benefit of society is a key issue. We came up with the idea of making a tool that analyzes these data, calculates the EDI from many indicators, and ranks schools and districts, so that decision and policy makers can focus on key areas that have so far escaped their attention. Analysis of data can provide useful insight into current conditions and support predictions about the future. Interest from organizations such as the Department of Education and INGOs such as Global Action Nepal encouraged us to choose this project.

1.4. Aims and Objectives

1.4.1. Aims

The aim of this project is to develop a tool to analyze the educational statistics published by the Department of Education and relate them to other datasets such as SLC performance.

1.4.2. Objectives

The objectives of our project are to:

1. Compute and compare the Educational Development Index (EDI) of the 75 districts of Nepal.
2. Predict the trends in pass rate for every district.
3. Visualize the patterns and correlations among the districts based on clustering of educational data.

1.5. Scope of the project

Data mining and data analysis tools are always in high demand in the market. Although the term Educational Data Mining is new, it is an emerging concept with a wide range of applications, from assessing student performance, providing feedback to students, predicting performance and constructing courseware to decision making, planning and policy making. This project is targeted at education analysts and policy makers. End users such as students, parents, teachers and the general public will be able to track recent trends in education. The tool developed in this project is intended to be useful to organizations working in the field of education in Nepal.


2. LITERATURE REVIEW

Education is one of the most powerful means of reducing poverty and inequality, and it lays a foundation for sustained economic growth. It is most influential during the early stages of a person's development: a child's developing brain absorbs far more during the school years than in adulthood. Our project therefore focuses on analyzing school-level educational data in the context of Nepal.

Educational data has long been a subject of interest for researchers and developers, and data mining has modernized the way these data are analyzed. Educational Data Mining (EDM) is an emerging discipline concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students and the settings in which they learn. Many national and international organizations have long been working to collect, extract, process and analyze educational data from different sectors and in different ways.

The International Educational Data Mining Society is one of the main organizations working on Educational Data Mining. Its aim is to support collaboration and scientific development in this new discipline through the organization of the EDM conference series, the Journal of Educational Data Mining and mailing lists, as well as the development of community resources to support the sharing of data and techniques. The society organizes a conference every year to discuss the usage, tools and techniques of EDM. The 6th International Conference on Educational Data Mining (EDM 2013), held from July 6 to July 9 in Memphis, Tennessee, USA, invited papers that study how to apply data mining to analyze data generated by various information systems supporting learning or education (in schools, colleges, universities, and other academic or professional learning institutions providing traditional and modern forms and means of teaching, as well as informal learning). [1] EDM may require adapting existing approaches or developing new ones that build upon techniques from a combination of areas, including but not limited to statistics, psychometrics, machine learning, information retrieval, recommender systems and scientific computing.


In the context of Nepal, the Department of Education is the leading government organization that collects and processes education data from all over the country. Other government and non-government organizations have also been active in this sector, but the results have fallen short of expectations because of limited scope, objectives, resources, tools and technologies. One major obstacle is the lack of proper and appropriate data; our project provides these organizations with a tool to help implement their programs. Another major problem is the lack of a tool to measure the development and progress of education in a region or in the entire country. The education level in our country is most often measured by a district's pass percentage in the SLC. This is not an accurate or reliable measure, because it depends only on the marks obtained by individual students: it measures students, not the education system of a region or country.

UNESCO uses the Education for All (EFA) Development Index, a composite index using four of the six EFA goals, selected on the basis of data availability. [2] The goals are:

1. Universal primary education (UPE)
2. Adult literacy
3. Quality of education
4. Gender

Similarly, the UN annually publishes the Human Development Index, which consists of the Life Expectancy Index, the Education Index and the Income Index. The Education Index is calculated from the Mean Years of Schooling Index and the Expected Years of Schooling Index. However, both of these indices have a broad scope and cover all age groups and education levels. Our project deals with school-level educational data, so these indices do not fit our system. We have instead used the Educational Development Index (EDI), which analyzes the school-level indicators and parameters needed for good-quality education in order to measure the development and level of education of a region or country.


The EDI was initially developed by Dr. Arun C. Mehta and Mr. Shamshad A. Siddiqui of the Department of EMIS, NUEPA, New Delhi, India. It was used by the Department of Educational Management Information System, National University of Educational Planning and Administration, as the education measure for the different states of India. We chose the EDI because it can be computed at different levels of education, such as primary, upper primary, elementary and other levels. In addition, the weights used in computing the EDI are determined mathematically, using the factor loadings and eigenvalues from Principal Component Analysis (PCA).

During our project we studied various reports and theses published by researchers working on educational data mining and the EDI. We also attended the Open Data Seminar and presented our project on Open Data Day. We consulted people working on OpenStreetMap to obtain the OpenStreetMap data for Nepal. We studied standard open data formats such as JSON and CSV and the conversion of data to these formats, as well as the KML format for mapping layers. We browsed the Google APIs to develop an API call interface, studied SQL database structure to develop our database, and visited Outliner Nepal for information about the use of infographics in our project.

Our main purpose is to provide an analytical tool that helps visualize the education statistics of Nepal and measures the education level of different districts, so that the DOE and other organizations can easily obtain data for research, analysis and planning.

2.1. Existing Tools for Data Analysis in DOE

2.2. Spreadsheet

A spreadsheet is an attractive choice for performing data calculations because it is easy to use. However, a typical spreadsheet restricts the number of records it can handle, so for large jobs a tool other than a spreadsheet may be more suitable.


2.3. SPSS as an Analysis Tool

The Department of Education (DOE), the largest data collector and the authorized body in Nepal, has been using SPSS as its data management and analysis tool. IBM SPSS Statistics has been continuously developed and tested since 1968. Over that period, many forms of statistical analysis have been embedded in the software, and the algorithms that execute the equations have been tested. Users can run a very broad range of statistical analyses without doing any programming. IBM SPSS Statistics is optimized for statistical work at every point, from data entry through to the creation of reports for decision makers, in a way that a spreadsheet is not.


3. CURRENT SCENARIO OF EDUCATION IN NEPAL

3.1. School Levels

The school-level education system in Nepal consists mainly of three levels:

1. Primary School Level (class 1-5)
2. Lower Secondary Level (class 6-8)
3. Secondary School Level (class 9-10)

3.2. School Numbers

According to the Department of Education, there are around 33,000 schools in Nepal. The following figure plots the number of schools offering primary, lower secondary and secondary education.

Figure 3.1: Number of Schools in Nepal


Year   Primary   Lower Secondary   Secondary
2007   29220     9739              5894
2008   30924     10636             6516
2009   31655     11341             6928
2010   32684     11939             7266
2011   33881     13791             7938

Table 3.1: Number of Schools in Nepal

The number of schools offering primary education was 29220 in 2007 and rose to 33881 in 2011. Similarly, the numbers of schools offering lower secondary and secondary education were 13791 and 7938 respectively in 2011.

3.3. Teacher Statistics

There are around 160,000 teachers across the country. Most teachers, around 173,000 or about 65% of the total, teach at the primary or lower secondary level. Female teachers account for about 36% of all teachers. The following figures show the teacher statistics at the various levels.

Figure 3.2: Number of Teachers at Lower Secondary Schools


Gender/Year   2007    2008    2009    2010    2011
Male          75371   88139   92710   96571   100331
Female        41475   55435   60826   70645   73383

Table 3.2: Teachers in Lower Secondary School Level

According to Table 3.2, there were 75371 male teachers and 41475 female teachers in 2007. These figures rose to 100331 and 73383 respectively in 2011.

Figure 3.3: Number of Teachers in Lower Secondary Schools

Gender/Year   2007    2008    2009    2010    2011
Male          22721   27926   30321   34124   35612
Female        5182    9142    9938    11908   13236

Table 3.3: Number of Teachers in LSS

At the lower secondary school level there were 22721 male teachers and 5182 female teachers in 2007. These figures rose to 35612 and 13236 respectively in 2011.


Figure 3.4: Number of Teachers in Primary

Gender/Year   2007    2008    2009    2010    2011
Male          75371   88139   92710   96571   100331
Female        41475   55435   60826   70645   73383

Table 3.4: Number of teachers by gender in different years

At the primary school level there were 75371 male teachers and 41475 female teachers in 2007. These numbers rose significantly, to 100331 and 73383 respectively, in 2011.

3.4. Enrollment Statistics

In 2007, there were altogether 6562143 students in Nepal, of whom 3177406 were male and 3384737 were female. The figures rose to 3738024 female students and 3309804 male students in 2011. The tables below give the enrollment of male and female students at the primary, lower secondary and secondary levels from 2007 to 2011.


3.4.1. Female Enrollment

Female enrollment is increasing, and the increase is encouraging.

Level             2007      2008      2009      2010      2011
Primary           2159763   2365379   2453935   2494472   2411849
Lower Secondary   680072    706494    786359    847607    914909
Secondary         337571    379826    379826    395945    421856

Table 3.5: Number of Female Enrollment rate

Figure 3.5: Number of Female students enrollment at different level


3.4.2. Male Enrollment

Level/Year        2007      2008      2009      2010      2011
Primary           2258950   2416934   2446728   2457484   2371036
Lower Secondary   763443    760368    818063    852320    897771
Secondary         362344    377807    410522    415965    426713

Table 3.6: Number of Male Enrollment at different levels

Figure 3.6: Number of male Enrollments at different school levels
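As a quick arithmetic check, the level-wise figures in Tables 3.5 and 3.6 can be summed to obtain total enrollment by gender for each year. The sketch below does this; the dictionary layout and the function name total_enrollment are illustrative assumptions and not part of the project's software.

```python
# Enrollment by level for 2007-2011, copied from Table 3.5 (female)
# and Table 3.6 (male). The data layout is an illustrative assumption.
YEARS = [2007, 2008, 2009, 2010, 2011]

FEMALE = {
    "Primary": [2159763, 2365379, 2453935, 2494472, 2411849],
    "Lower Secondary": [680072, 706494, 786359, 847607, 914909],
    "Secondary": [337571, 379826, 379826, 395945, 421856],
}
MALE = {
    "Primary": [2258950, 2416934, 2446728, 2457484, 2371036],
    "Lower Secondary": [763443, 760368, 818063, 852320, 897771],
    "Secondary": [362344, 377807, 410522, 415965, 426713],
}


def total_enrollment(by_level, year):
    """Sum enrollment over all school levels for one year."""
    idx = YEARS.index(year)
    return sum(values[idx] for values in by_level.values())


for year in YEARS:
    female = total_enrollment(FEMALE, year)
    male = total_enrollment(MALE, year)
    print(year, "female:", female, "male:", male, "total:", female + male)
```

For 2007 the combined total comes to 6562143, which is the overall student count quoted in Section 3.4.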


3.5. Promotion Statistics

The promotion rate is the percentage of students who pass a class and move on to the next one. A high promotion rate is always desirable. The tables below give the promotion rates of male and female students at the different school levels.

3.5.1. Promotion Rate in Primary School Level

Figure 3.7: Promotion Rate in Primary School Level of Nepal

Gender/Year   2007   2008   2009   2010   2011
Female        70.7   76.6   79.2   82.1   83.4
Male          70.2   76     79     81.8   82.8

Table 3.7: Promotion Rate in PSL

On average, female students have a higher promotion rate than male students at the primary level, which is an interesting finding.


3.5.2. Promotion Rate in Lower Secondary School Level

Figure 3.8: Promotion Rate in LS

Gender/Year   2007   2008   2009   2010   2011
Female        83.1   79.9   85.5   87.2   88
Male          84.4   82.9   86.4   87.3   88.1

Table 3.8: Promotion Rate in Lower Secondary Level

3.5.3. Promotion Rate in Secondary School Level

Figure 3.9: Promotion Rate in Secondary Level

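Gender/Year   2007   2008   2009   2010   2011
Female        84.8   80.8   84.9   87.7   89
Male          85.6   81     84.1   87.3   89.8

Table 3.10: Promotion Rate in Secondary Level

At the secondary level, the promotion rates of female and male students are nearly equal: female students were ahead in 2009 and 2010, and male students in the other years.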

3.5.4. Dropout Statistics

The dropout rate is the percentage of students who leave school. Dropout may be due to various factors, such as the socio-economic conditions of the student and migration.

Level                    Gender   2007   2008   2009   2010   2011
Primary School           Female   12     7.6    6.3    5.9    5.2
Primary School           Male     12.8   8.3    6.7    6.1    5.7
Lower Secondary School   Female   7.8    11.4   7.3    6.2    6.3
Lower Secondary School   Male     7.1    9      6.9    6.6    6.6
Secondary School         Female   6.2    10.6   8.1    8.1    6.9
Secondary School         Male     6.4    11.5   9.4    9.2    6.9

Table 3.11: Dropout Rate at different Levels of Nepal

Figure 3.9: Dropout Rate LSS


Figure 3.10: Dropout Rate in Primary School

Figure 3.11: Dropout Rate in Secondary School

3.6. Repetition Statistics

The repetition rate is the share of students who repeat the same class at the end of the year, expressed as a percentage. Overall, female students in Nepal have a slightly higher repetition rate than male students, most visibly at the lower secondary and secondary levels. The following table shows the repetition rates at the various school levels of Nepal.


Level             Gender   2007   2008   2009   2010   2011
Primary           Female   17.3   15.8   14.5   12     11.4
Primary           Male     17     15.7   14.3   12.2   11.5
Lower Secondary   Female   9      8.7    7.2    6.6    5.7
Lower Secondary   Male     8.5    8.1    6.7    6.1    5.3
Secondary         Female   9      8.6    7      4.2    4.1
Secondary         Male     7.9    7.4    6.5    3.5    3.4

Table 3.12: Repetition Rates at different levels of Nepal
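One simple way to compare the two genders is to average the Table 3.12 values over the five years for each level. The following is a minimal sketch of that comparison; the data structure and the helper name mean_gap are assumptions made for illustration.

```python
from statistics import mean

# Repetition rates (%) for 2007-2011, copied from Table 3.12.
REPETITION = {
    "Primary": {
        "Female": [17.3, 15.8, 14.5, 12.0, 11.4],
        "Male": [17.0, 15.7, 14.3, 12.2, 11.5],
    },
    "Lower Secondary": {
        "Female": [9.0, 8.7, 7.2, 6.6, 5.7],
        "Male": [8.5, 8.1, 6.7, 6.1, 5.3],
    },
    "Secondary": {
        "Female": [9.0, 8.6, 7.0, 4.2, 4.1],
        "Male": [7.9, 7.4, 6.5, 3.5, 3.4],
    },
}


def mean_gap(level):
    """Mean female rate minus mean male rate, in percentage points."""
    rates = REPETITION[level]
    return mean(rates["Female"]) - mean(rates["Male"])


for level in REPETITION:
    print(f"{level}: female - male = {mean_gap(level):+.2f} points")
```

A positive gap means that the five-year average for female students is higher than that for male students at that level.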

3.7. SLC Results Statistics

Every year hundreds of thousands of students sit the SLC examination, but a large share of them do not pass. A lot of government investment goes into the education sector, yet we have not been able to achieve much, and the SLC results reflect this. The following table shows the number of schools, the number of students who appeared and the number of students who passed in the SLC examination in three consecutive years.

Year   Number of Schools   Number of Students Appeared   Number of Students Passed
2066   6994                385221                        250220
2067   7449                397833                        222568
2068   8405                419121                        199714

Table 3.13: SLC Result Statistics
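The pass percentage for each year follows directly from Table 3.13 as the number of students passed divided by the number who appeared. A minimal sketch of this arithmetic, with an illustrative function name that is not taken from the project's code:

```python
# (year in BS, schools, students appeared, students passed), from Table 3.13.
SLC_RESULTS = [
    (2066, 6994, 385221, 250220),
    (2067, 7449, 397833, 222568),
    (2068, 8405, 419121, 199714),
]


def pass_percentage(appeared, passed):
    """Students passed as a percentage of students who appeared."""
    return 100.0 * passed / appeared


for year, schools, appeared, passed in SLC_RESULTS:
    rate = pass_percentage(appeared, passed)
    print(f"{year} BS: {schools} schools, {rate:.1f}% passed")
```

On these figures the pass percentage falls each year, from roughly 65% in 2066 BS to roughly 48% in 2068 BS, even as more schools and students take part.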


Figure 3.12: Number of Schools appeared in SLC

Figure 3.13: Number of Students Appeared and passed in SLC


4. RESEARCH

Most information exists in raw form. If data are characterized as recorded facts, then information is the set of patterns underlying those facts. A huge amount of potentially important information is locked up in databases. Our goal is to discover the important information and patterns in educational data that may suggest rules for improving the current system. Educational data mining helps to extract implicit, previously unknown and potentially useful information from data.

Many government and non-government organizations are working on the educational system of Nepal. Most of them are involved in collecting performance metrics such as the numbers of students and teachers, physical facilities, enrollment of girls, and the learning needs of indigenous peoples and linguistic minorities. The 1990 World Conference on 'Education for All' (EFA) made a global commitment to make quality basic education relevant and universal. By the end of the decade, however, it was recognized that progress had been insufficient, and that data collection and manipulation strategies were not well managed or goal oriented. Six EFA goals were then set to address "the learning needs of all children, youth and adults by 2015":

1. Early Childhood
2. Primary Education
3. Lifelong Education
4. Adult Literacy
5. Gender Parity
6. Quality Education


The Government of Nepal (GON) also established the Flash reporting system to monitor progress towards these goals. To ensure relevance and coherence, the system collects and reports school education data biannually through the following chain:

School ----> Resource center ----> District ----> Region ----> Central level

Specifically, the Flash reports monitor the progress of EFA implementation based on the reports from each district. In addition, the Department of Education (DOE) has established a system for reporting educational development indicators, based on the time series data of the Educational Management Information System (EMIS), through the Consolidated Report. Its major objectives are:

1. To evaluate and analyze the broad trends of school education data and information.
2. To compare and assess the progress of school education parameters.
3. To appraise and evaluate the overall functioning of the school education system.
4. To facilitate the use of data for future planning, monitoring and evaluation.

The methodology used for accessing and manipulating the education data is as follows:

1. The Consolidated Report is based on time series school-level educational information.
2. Development regions, ecological belts and districts are the main units of analysis.
3. Population data are generated from the population projection jointly published by the Central Bureau of Statistics (CBS).
4. The data are manipulated to produce indicators, charts/figures and maps.


4.1. School Education Nepal

School education in Nepal consists of primary, lower secondary, secondary and higher secondary education. Starting from Grade one, primary schools offer five years of education and lower secondary schools provide a further three years. Secondary schools offer two more years of education, which conclude with the School Leaving Certificate (SLC) Examination, while higher secondary schools offer two more years after the SLC. In addition, Early Childhood Development (ECD) and Pre-primary Classes (PPCs) are offered as preparation for Grade one. The prescribed age groups for these levels are 3-4 years for ECD/PPC, 5-9 years for primary, 10-12 years for lower secondary, 13-14 years for secondary and 15-16 years for higher secondary education.

The majority of schools in the country also run the lower levels; that is, lower secondary schools also offer classes at the primary level, and secondary schools offer both lower secondary and primary classes. Very few schools offer only Grades 6-8 or only Grades 9-10. Broadly, schools are categorized into two types: community schools (supported by the government) and institutional schools (supported by parents and trustees).

4.2. Data Source

We are working on school-level educational data, so our main data source is the Department of Education (DOE). Our raw data contains information about the schools, their infrastructure, student statistics and more.

4.3. Data Collection

4.3.1. Collection of Education Data

Our major data source is the data collected by the Department of Education. We collected the data in Excel format. The available data contained information such as the number of schools, year-wise enrollment, the number of boys and girls, the number of teachers, school infrastructure and other school information. The DOE provided us with yearly data for the years 2067, 2068 and 2069.

4.3.2. Collection of SLC Results Data

The Office of the Controller of Examinations provided us with the average marks in 6 core subjects for more than 7000 schools. The dataset covered the consecutive years 2066, 2067 and 2068 BS. It was well formatted and had no missing values. The dataset had the following columns:

1. School Code
2. School Name
3. District
4. Avg. Marks in Mathematics
5. Avg. Marks in Science
6. Avg. Marks in Social
7. Avg. Marks in Nepali
8. Avg. Marks in EPH
9. Avg. Marks in English
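The column list above maps naturally onto one record per school. The sketch below shows one possible way to load such records and confirm that no values are missing, assuming the data has been exported to CSV. The header names, the function name load_slc_rows and the sample row are all hypothetical; the report lists the columns but not their exact labels.

```python
import csv
import io

# Hypothetical header names for the nine columns listed in the report.
COLUMNS = ["school_code", "school_name", "district", "math", "science",
           "social", "nepali", "eph", "english"]
MARK_COLUMNS = COLUMNS[3:]


def load_slc_rows(text):
    """Parse CSV text into dicts, converting average marks to float.

    Raises ValueError if any expected column is empty or absent.
    """
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        missing = [c for c in COLUMNS if not (raw.get(c) or "").strip()]
        if missing:
            raise ValueError(f"missing values in columns: {missing}")
        row = {c: raw[c].strip() for c in COLUMNS[:3]}
        row.update({c: float(raw[c]) for c in MARK_COLUMNS})
        rows.append(row)
    return rows


# Made-up sample row, for illustration only.
SAMPLE = (
    "school_code,school_name,district,math,science,social,nepali,eph,english\n"
    "0001,Example School,Example District,40.5,38.0,52.1,47.3,55.0,35.2\n"
)
rows = load_slc_rows(SAMPLE)
print(rows[0]["district"], rows[0]["math"])
```

The school code is kept as text rather than converted to a number so that any leading zeros survive.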

4.3.3. Collection of GIS Data

The World Bank provided us with the district map layer in KML format. This map layer was converted to GeoJSON. Similarly, we obtained the VDC map layer in shapefile format (.shp), which was also converted to GeoJSON.
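The report does not say which tool performed the conversion. As an illustration only, the sketch below converts simple KML polygon placemarks to a GeoJSON FeatureCollection using the Python standard library. It assumes the KML 2.2 namespace and one outer boundary per placemark, and ignores inner boundaries and multi-geometries; the function name and sample coordinates are made up.

```python
import json
import xml.etree.ElementTree as ET

# Assumption: the file uses the standard KML 2.2 namespace.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def kml_to_geojson(kml_text):
    """Convert simple KML polygon Placemarks to a GeoJSON FeatureCollection."""
    root = ET.fromstring(kml_text)
    features = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        coords = placemark.findtext(
            ".//kml:outerBoundaryIs/kml:LinearRing/kml:coordinates",
            default="", namespaces=KML_NS)
        # KML stores "lon,lat[,alt]" tuples separated by whitespace;
        # GeoJSON uses the same lon, lat order.
        ring = [[float(v) for v in tup.split(",")[:2]] for tup in coords.split()]
        if not ring:
            continue  # skip placemarks without a polygon boundary
        features.append({
            "type": "Feature",
            "properties": {"name": name},
            "geometry": {"type": "Polygon", "coordinates": [ring]},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up single-placemark document, for illustration only.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2"><Document>
<Placemark><name>Example District</name><Polygon><outerBoundaryIs><LinearRing>
<coordinates>85.0,27.0,0 85.1,27.0,0 85.1,27.1,0 85.0,27.0,0</coordinates>
</LinearRing></outerBoundaryIs></Polygon></Placemark>
</Document></kml>"""

geojson = kml_to_geojson(SAMPLE_KML)
print(json.dumps(geojson)[:80])
```

A dedicated GIS library would be the more robust choice for real district and VDC layers; the sketch only shows the shape of the mapping between the two formats.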


4.4. Data Extraction

Once we have collected the required data from the source, we extract the actual information from it. We obtain the data in .xls format and extract from it the information that is to be saved to the database, using a data extraction tool written in Python.

4.5. Data Storage

The extracted data is saved in a database. We use MySQL as our primary database, and other database engines as required. The database is designed so that the required information can be retrieved easily.

Metrics Definition and Parameters Construction for the calculation of Educational Development Index

4.5.1. Average Student School Ratio

This is a negative indicator: the lower the student school ratio, the better the Educational Development Index.

Construction of the parameter

The number of schools in each district was available for every level, and the number of students was available for every class. We summed the schools over all levels to find the total number of schools, summed the students over all classes to find the total number of students, and then applied the following formulas:

Total Number of Students = Number of Students at Class 1 + ... + Number of Students at Class 10

Average Student School Ratio = (Total Number of Students) / (Total Number of Schools)
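The Average Student School Ratio can be computed directly in SQL once the counts are stored. The sketch below uses an in-memory SQLite database purely so that it runs without a server; the project itself names MySQL. The table names, column names and sample counts are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL here; the schema is a hypothetical illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE school_counts (district TEXT, level TEXT, schools INTEGER);
    CREATE TABLE student_counts (district TEXT, grade INTEGER, students INTEGER);
""")

# Made-up counts for one district: schools per level, students per class.
conn.executemany(
    "INSERT INTO school_counts VALUES (?, ?, ?)",
    [("District A", "primary", 100),
     ("District A", "lower_secondary", 40),
     ("District A", "secondary", 20)],
)
conn.executemany(
    "INSERT INTO student_counts VALUES (?, ?, ?)",
    [("District A", grade, 1600) for grade in range(1, 11)],
)

# Total students over classes 1-10 divided by total schools over all levels.
ASSR_QUERY = """
    SELECT sc.district, 1.0 * st.total_students / sc.total_schools AS assr
    FROM (SELECT district, SUM(schools) AS total_schools
          FROM school_counts GROUP BY district) AS sc
    JOIN (SELECT district, SUM(students) AS total_students
          FROM student_counts GROUP BY district) AS st
      ON st.district = sc.district
"""
assr = dict(conn.execute(ASSR_QUERY).fetchall())
print(assr)
```

The two sums are aggregated separately before the join so that the three school rows and ten student rows of a district do not multiply each other.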


4.5.2. Ratio of Primary to Upper Primary School

This is a negative indicator. It compares the number of schools at the primary level with the number of schools at the levels above it. A higher value indicates lower performance for the district: when there are many primary schools for each lower secondary or secondary school, more and more students are forced to study at the same upper-level school. The number of students per school at the upper primary level rises, ultimately causing a higher Student Teacher Ratio, a higher Student Classroom Ratio and a higher Student School Ratio, and therefore a lower quality of education.

Construction of the parameter

We had the number of schools at the primary level, the lower secondary level and the secondary level. We obtained the ratio from the formula below:

Ratio of Primary to Upper Primary School = (Number of Schools in the Primary School Level) / (Number of Schools in the Lower Secondary School Level + Number of Schools in the Secondary School Level)

4.5.3. Percentage of Female Teachers

This is a positive indicator in the construction of the Educational Development Index. A higher percentage of female teachers in a district indicates higher educational development in that region, because more female teachers in a district and its schools encourage more female students to join school.

Construction of the parameter

We had a dataset with the numbers of female and male teachers in each district. We applied the following formula to calculate the percentage of female teachers:

Percentage of Female Teachers in a district = (Total Number of Female Teachers) / (Total Number of Female Teachers + Total Number of Male Teachers) * 100
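The two formulas of Sections 4.5.2 and 4.5.3 can be sketched as small functions. The function names are illustrative assumptions. The sample inputs are the national 2011 totals from Tables 3.1 and 3.4, which stand in here for one district's counts because district-level figures are not reproduced in this chapter.

```python
def primary_to_upper_primary_ratio(primary, lower_secondary, secondary):
    """Primary schools per lower secondary or secondary school."""
    return primary / (lower_secondary + secondary)


def percentage_female_teachers(female, male):
    """Female teachers as a percentage of all teachers."""
    return 100.0 * female / (female + male)


# National 2011 totals from Tables 3.1 and 3.4, used only as sample input.
ratio = primary_to_upper_primary_ratio(33881, 13791, 7938)
female_share = percentage_female_teachers(73383, 100331)
print(f"ratio = {ratio:.2f}, female teachers = {female_share:.1f}%")
```

The ratio is a negative indicator and the female teacher percentage a positive one, so they would enter the index with opposite orientation.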


4.5.4. Percentage of Unqualified Teachers

This is a negative indicator: the higher the percentage of unqualified teachers, the lower the educational quality.

Construction of the parameter

We had the numbers of trained and untrained teachers in each district. We calculated the percentage of untrained teachers as follows:

Percentage of Untrained Teachers = (Total Number of Untrained Teachers) / (Total Number of Trained Teachers + Total Number of Untrained Teachers) * 100

4.5.5. Participation of Janajati Students

This is a positive indicator. The Government of Nepal has targeted various schemes at the development of the Janajati communities of Nepal. Higher participation of Janajati students at the various levels means greater achievement of this government goal and has a positive effect on the quality of education in the district.

Construction of the parameter

We had the number of Janajati students studying in each class in a district. We calculated the total participation of Janajati students as follows:

Total Participation of Janajati Students = Total Number of Janajati Students at Class 1 + ... + Total Number of Janajati Students at Class 10
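A minimal sketch of the two parameters of Sections 4.5.4 and 4.5.5 follows, with the untrained teacher share expressed as a percentage. The function names and the sample counts are made up for illustration.

```python
def percentage_untrained_teachers(trained, untrained):
    """Untrained teachers as a percentage of all teachers."""
    return 100.0 * untrained / (trained + untrained)


def total_participation(students_by_class):
    """Sum a group's student counts over classes 1 to 10."""
    if len(students_by_class) != 10:
        raise ValueError("expected one count for each of classes 1 to 10")
    return sum(students_by_class)


# Made-up counts for one hypothetical district.
janajati_by_class = [500, 480, 450, 430, 400, 350, 320, 300, 250, 220]
print(percentage_untrained_teachers(trained=900, untrained=100))
print(total_participation(janajati_by_class))
```

The same total_participation helper applies to any group whose counts are recorded per class.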


4.5.6. Participation of Dalit Students

This is a positive indicator. High participation of Dalit students in a district supports the development of backward regions.

Construction of the parameter

We had the numbers of Dalit students studying in each class. We summed them to find the total participation of Dalit students in the district:

Participation of Dalit Students = Number of Dalit Students in Class 1 + Number of Dalit Students in Class 2 + ... + Number of Dalit Students in Class 10

4.5.7. Repetition Rate

This is a negative indicator: the higher the repetition rate, the lower the quality of education. The repetition rate is the share of students who repeat a class.

Construction of the parameter

We had the repetition rates of the various classes in each district. We calculated the average of the repetition rates over all classes:

Repetition Rate of a district = (Repetition Rate of Class 1 + ... + Repetition Rate of Class 10) / 10

4.5.8. Dropout Rate

This is also a negative indicator. A higher dropout rate means that more students leave education at the school level, which lowers the educational performance of a district.

Construction of the parameter

We had datasets with the dropout rates of the various classes. We summed them and took the average as the dropout rate of the district for school-level education:

Dropout Rate of a district = (Dropout Rate of Class 1 + ... + Dropout Rate of Class 10) / 10
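The repetition and dropout parameters are both plain averages of ten class-wise rates, so one helper can serve both. The sketch below assumes that reading of the construction; the function name and the sample rates are invented for illustration.

```python
from statistics import mean


def average_rate(rates_by_class):
    """Mean of the class-wise rates (%) for classes 1 to 10."""
    if len(rates_by_class) != 10:
        raise ValueError("expected one rate for each of classes 1 to 10")
    return mean(rates_by_class)


# Made-up class-wise rates (%) for one hypothetical district.
repetition = [20.0, 12.0, 10.0, 9.0, 9.0, 8.0, 7.0, 7.0, 5.0, 3.0]
dropout = [10.0, 6.0, 5.0, 5.0, 4.0, 8.0, 6.0, 6.0, 5.0, 5.0]

print("repetition rate:", average_rate(repetition))
print("dropout rate:", average_rate(dropout))
```

Note that an unweighted mean treats every class equally, regardless of how many students are enrolled in it.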


4.5.9. Gross Enrollment Rate

This is a positive indicator. A higher enrollment rate in a district shows that children are attending school-level education. The Gross Enrollment Rate (GER) is the total enrollment in a specific level of education, regardless of age, expressed as a percentage of the eligible official population corresponding to the same level of education in a given school year. This indicator is widely used to show the general level of participation in a given level of education.

Construction of the parameter

We had the enrollment rates at the various levels. We took the average over all levels to facilitate the calculation of the Educational Development Index:

GER of a district = (GER of Primary School Level + GER of Lower Secondary School Level + GER of Secondary School Level) / 3

4.5.10. Percentage of Enrolled Student Passed

This is a positive indicator. It indicates the chance that a student enrolled in a class passes that class. The higher the percentage of enrolled students passed, the better the Educational Development Index of the district.

Construction of the parameter

We had a dataset of promotion rates from class 1 to 2, class 2 to 3 and so on. We calculated the promotion rate of the district as the average over these nine transitions:

Percentage of Enrolled Students Passed (Promotion Rate) = (Percentage of Enrolled Students passed from class 1 to class 2 + ... + Percentage of Enrolled Students passed from class 9 to class 10) / 9

4.5.11. Gender Parity Index

The Gender Parity Index (GPI) in GER indicates the participation of girls relative to boys in the Gross Enrollment Rate. The closer the GPI in GER is to 1, the better the Educational Development Index.

28

Construction of the GPI We had the data of GPI in primary school level, lower secondary school level and Secondary school level. We then averaged the GPI of all level to find the GPI o the district as: GPI in GER = (GPI in Primary School Level + GPI in Lower Secondary School Level + GPI in Secondary School Level) / 3 4.5.12. Student Teacher Ratio This is a negative Indicator. More the value of Student Teacher Ratio means lesser the value of Educational Development Index. In normal condition we should not have more value of Student teacher Index. Construction of the parameter We had data sets consisting of number of teachers and Students in primary School level, Lower Secondary School level and Secondary School level. We then got the STR for the district as: STR = (Total Number of Teacher in primary School level + Total Number of Teacher in Lower Secondary School level + Total Number of teacher in Lower Secondary School level) / (Total Number of Students in Primary School level + Total Number of Students in Lower Secondary School level + Total Number of Students at Secondary School level)
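As an illustration of how such constructed parameters can be computed, the following Python 3 sketch averages per-class rates and derives a student teacher ratio. The function names and all figures are hypothetical and are not taken from the project data; the ratio is taken here as total students divided by total teachers.

```python
def average_rate(rates_by_class):
    """Mean of per-class rates, e.g. repetition or dropout rate for classes 1-10."""
    return sum(rates_by_class) / len(rates_by_class)


def student_teacher_ratio(students_by_level, teachers_by_level):
    """Total students over all school levels divided by total teachers."""
    return sum(students_by_level) / sum(teachers_by_level)


# Hypothetical figures for one district (illustration only).
repetition_by_class = [12.0, 8.0, 7.0, 6.0, 5.0, 5.0, 4.0, 4.0, 5.0, 4.0]
students = [40000, 15000, 8000]  # primary, lower secondary, secondary
teachers = [1000, 400, 200]      # primary, lower secondary, secondary

repetition_rate = average_rate(repetition_by_class)
str_value = student_teacher_ratio(students, teachers)
print(repetition_rate, str_value)
```

The same `average_rate` helper would apply to the dropout rate, GER and GPI, which are all simple averages over classes or levels.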


5. THEORETICAL BACKGROUND
5.1. EDUCATIONAL DEVELOPMENT INDEX
The education level of a country can be measured in different ways. One such measure is the Educational Development Index (EDI). EDI combines multiple components and parameters to compare education levels. Essentially, computing EDI reveals the position of a political region in terms of its education level. In this project, we have calculated EDI to compare districts from the perspective of education. EDI can be calculated separately for levels such as primary or secondary; however, because the available data is at district level rather than school level, this project computes EDI for districts, treating school education as a single level. EDI calculation was used by the Department of Educational Management Information System, National University of Educational Planning and Administration, India, to compare the level of education in various states of India. This project uses a similar approach to compare districts based on educational level.

5.2. Defining Indicators
This project evaluates EDI based on 12 suggestive indicators, which are broadly categorized into four groups:
• Access
• Infrastructure
• Teacher
• Outcome

The 12 indicators are suggestive in nature and cover most of the factors contributing to the measurement of education. However, they are not the only indicators that contribute to education; depending on the objective and the availability of data, other indicators can also be considered when calculating the index.


These 12 indicators differ in nature and perspective, and each can influence education either positively or negatively. Based on this influence, indicators are classified as positive indicators or negative indicators. The overall structure of EDI with all components and indicators is shown in the diagram below:

Figure 5.1: EDI with all components and indicators

5.3. Cluster Analysis
Clustering techniques are widely used. As the volume of data keeps growing, so does the need for processing power and for newer techniques of data analysis. Cluster analysis is one such technique and is used extensively in fields such as artificial intelligence. Clustering can be considered an important unsupervised learning problem, which tries to find similar structures within an unlabeled data collection. These similar structures are data groups, better known as clusters. The data inside each cluster is similar (or close) to the other elements within that cluster, and dissimilar (or farther) from elements that belong to other clusters. The main purpose of clustering techniques is to partition data sets into groups based on similarity. Each group is consistent in terms of the similarity of its members, so every group can have a member that represents it. The motivation to use such clustering techniques is the fact that, besides reducing the cost of the algorithm, the use of representatives makes the process easier to understand. Many decisions have to be made in order to use the strategy of representative-based clustering.

5.4. Regression Analysis
The Department of Education (DoE) and many education-related organizations have been collecting data for decades, building massive data warehouses. DoE makes this data available as separate hard-copy publications. Even though the data is available, very few schools have been able to go through these hard copies and realize the actual value stored in them. The Department of Education has been trying to extract meaningful relationships and a model of performance measurement for each district. Different education organizations are trying to identify the major factors of education development, the factors that need close attention in a particular district, and improvement strategies. This project presents regression analysis for use in education development prediction and factor analysis. Predictive analytics can be studied under two categories: a. predicting the present b. shaping the future [3]
5.4.1. Predicting the Present
Predictive analytics depends solely on existing data and patterns. Existing data in the relevant field is collected and fed to an already available data mining algorithm, which finds the underlying pattern in the data. For instance, if a district has a limited number of qualified teachers, its performance metric will be low.


5.4.2. Shaping the Future
DoE and other education data collectors use different analytics tools for predicting present values. They have been somewhat successful in discovering the factors that improve education, which ultimately improves the effectiveness of operations. Working with present data and making rules from existing or obvious patterns is not the only focus of DoE. It also aims at creating and implementing new strategies and monitoring progress in different areas of Nepal. Predictive analysis with future-shaping mechanisms helps management groups to take decisions and make rules. Discovering and linking the leading and lagging indicators is a major challenge for analysts. Lagging indicators help to discover the past performance of a particular district. Likewise, leading indicators predict future performance.

5.5. Multiple Regressions
Multiple regression modeling describes the relationship between a single target variable and two or more predictor variables. [4] With two predictors, the estimated regression equation is
y = b0 + b1 * x1 + b2 * x2
For a multiple regression with m variables, the estimated regression equation takes the form
y = b0 + b1 * x1 + b2 * x2 + ... + bm * xm
A multiple regression model uses a linear surface such as a plane or hyperplane to approximate the relationship between a continuous response (target) variable and a set of predictor variables.
5.5.1. Least Square Estimates
The regression line is written in the form y^ = b0 + b1*x, called the regression equation or estimated regression equation (ERE), where:
y^ is the estimated value of the response variable.
b0 is the y-intercept of the regression line.
b1 is the slope of the regression line.
b0 and b1, together, are called the regression coefficients.

The error term E is needed to account for the indeterminacy in the model, since two samples may have the same x value but different y values. The residuals (yi - y^i) are estimates of the error terms Ei, i = 1...n.
y = b0 + b1*x + E    (1)
Equation (1) is called the regression equation or true population regression equation. The least-squares line is the line that minimizes the sum of squared errors, SSE. Similarly, SST, known as the total sum of squares, is a measure of the total variability in the values of the response variable alone, without reference to the predictor. Also, SSR, the sum of squares regression, is a measure of the overall improvement in prediction accuracy when using the regression as opposed to ignoring the predictor information. The three quantities are related by
SST = SSR + SSE
5.5.2. Ordinary Least Squares
OLS stands for Ordinary Least Squares, the standard linear regression procedure. The parameters are estimated from data by applying the linear model [5]
Y = b0 + b1*X1 + b2*X2 + e
where the b's are the OLS estimates. OLS minimizes the sum of the squared residuals, SUM e^2. The residual, e, is the difference between the actual Y and the predicted Y and has a zero mean. In other words, OLS calculates the coefficients so that the sum of squared differences between the predicted Y and the actual Y is minimized.


5.5.3. Coefficient of Determination (R2)
A least-squares regression line could be found to approximate the relationship between any two continuous variables, but this does not guarantee that the regression will be useful. The coefficient of determination, r2, measures the goodness of fit of the regression, i.e. how well the linear approximation produced by the least-squares regression line actually fits the observed data.
r2 = SSR / SST    (2)
r2 may be interpreted as the proportion of the variability in the y-variable that is explained by the regression. The maximum value of r2, which is 1, occurs when the regression is a perfect fit to the data set. In this optimal situation, there would be no estimation errors from using the regression, meaning that each of the residuals would be zero.
5.5.4. Adjusted R2
Adding a variable to a multiple regression equation virtually guarantees that R2 will increase (even if the variable is not very meaningful). The adjusted R2 statistic is the same as R2 except that it takes into account the number of independent variables (k) and the number of observations (n).
Adjusted R2 = 1 - (1 - R2) * [(n - 1) / (n - k - 1)]
The adjusted R2 is most useful when comparing regression models with different numbers of independent variables.
5.5.5. F-STAT
The F statistic is the ratio of the explained to the unexplained portion of the total sum of squares, each adjusted for its degrees of freedom: the number of independent variables (k) for the explained portion and n - k - 1 for the unexplained portion.
F = [ESS / k] / [RSS / (n - k - 1)]
Here ESS is the explained sum of squares (SSR above) and RSS = SUM e^2 is the residual sum of squares (SSE above). The F statistic allows the researcher to determine whether the model as a whole is statistically significant, i.e. whether its coefficients are jointly different from zero.


5.5.6. Standard Error of the Estimate
The s statistic, known as the standard error of the estimate, is a measure of the accuracy of the estimates produced by the regression. It is one of the most important statistics to consider when performing a regression analysis. To find the value of s, we first find the mean square error:
MSE = SSE / (n - m - 1)
where m indicates the number of predictor variables, which is 1 for the simple linear regression case and greater than 1 for the multiple regression case. Then the standard error of the estimate is given by
s = sqrt(MSE)
The standard error of the estimate s represents the precision of the predictions generated by the estimated regression equation. Smaller values of s are better.
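The statistics of Sections 5.5.1 to 5.5.6 can be tied together with a small worked example. The Python 3 sketch below fits a simple linear regression (one predictor, so m = k = 1) and computes SSE, SSR, SST, r2, adjusted R2, F and s. The function name and the five data points are hypothetical and chosen only so that the arithmetic is easy to follow by hand.

```python
from math import sqrt


def simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x plus the usual fit statistics."""
    n = len(x)
    m = 1  # number of predictor variables
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    predicted = [b0 + b1 * xi for xi in x]
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # unexplained
    ssr = sum((pi - mean_y) ** 2 for pi in predicted)          # explained
    sst = sum((yi - mean_y) ** 2 for yi in y)                  # total
    r2 = ssr / sst
    return {
        "b0": b0, "b1": b1, "sse": sse, "ssr": ssr, "sst": sst, "r2": r2,
        "adjusted_r2": 1 - (1 - r2) * (n - 1) / (n - m - 1),
        "f": (ssr / m) / (sse / (n - m - 1)),
        "s": sqrt(sse / (n - m - 1)),
    }


# Hypothetical data points (illustration only).
stats = simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(stats)
```

For these points the fitted line is y^ = 2.2 + 0.6*x, with SSE = 2.4, SSR = 3.6 and SST = 6, so r2 = 0.6 and the identity SST = SSR + SSE can be checked directly.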

5.6. Correlation Analysis
Correlation analysis is mainly carried out to find related variables among a number of variables. A correlation coefficient is calculated between any two variables to measure their relationship; it is a measure of the linear association between the two variables. [6] Values of the correlation coefficient are always between -1 and +1. A correlation coefficient of +1 indicates that two variables are perfectly related in a positive linear sense, a correlation coefficient of -1 indicates that two variables are perfectly related in a negative linear sense, and a correlation coefficient of 0 indicates that there is no linear relationship between the two variables. The magnitude of the correlation coefficient can be interpreted as:
1. (0.0 -> 0.3) Weak Relationship. Two variables are hardly related.
2. (0.3 -> 0.6) Moderate Relationship. Two variables are moderately related.
3. (0.6 -> 0.9) Strong Relationship. Two variables are connected in the same way.
4. (0.9 -> 1) Very Strong Relationship. Two variables are almost measuring the same thing.
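A minimal Python 3 sketch of this interpretation is given below. It computes the Pearson correlation coefficient and maps its magnitude to the four bands listed above. The function names and the sample data are hypothetical, and treating each band as closed on the left is an assumption, since the text does not say which band a boundary value such as 0.3 belongs to.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def strength(r):
    """Map the magnitude of r to the bands used in the text."""
    magnitude = abs(r)
    if magnitude < 0.3:
        return "weak"
    if magnitude < 0.6:
        return "moderate"
    if magnitude < 0.9:
        return "strong"
    return "very strong"


# Hypothetical data (illustration only).
r_positive = pearson([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r_negative = pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
print(r_positive, strength(r_positive))
print(r_negative, strength(r_negative))
```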

5.7. Schools Classification
A decision tree based on the ID3 algorithm has been developed to classify schools. The average marks scored by each school in six subjects were used to predict the class of the school. The subjects used for the prediction are:
• English
• Mathematics
• Nepali
• Science
• Environment, Population and Health
• Social


6. TECHNICAL BACKGROUND
6.1. Educational Development Index
6.1.1. Loading Data
After the indicators are defined, data for each indicator is loaded to calculate EDI. An API is provided for each parameter and year, which returns the data of that parameter for each district. The data is in JSON format, so we have used the json module of Python to extract it from the web.
Code:
    import json
    import urllib2

    # API endpoint for one indicator and one year
    url = 'http://localhost:8000/r-pr-upr-school/2007/'
    # fetch the response and parse the JSON body into a dictionary
    urlopen = urllib2.urlopen(url)
    jsonpage = urlopen.read()
    jsondata = json.loads(jsonpage)
Here, "r-pr-upr-school" stands for "ratio of primary to upper primary school". The above code loads the JSON data of the ratio of primary to upper primary schools for the year 2007. The data is in dictionary form. The value for each district is then loaded by reading the dictionary. Data values of the other indicators are loaded in the same way.
6.1.2. Normalizing data
Normalizing data is one of the important parts of calculating EDI. Normalization is done differently for positive and negative indicators. First the best and worst values of an indicator are identified. The best and the worst values depend upon the nature of the particular indicator. For a positive indicator, the highest value is treated as the best value and the lowest as the worst value. Similarly, if the indicator is negative in nature, the lowest value is considered the best value and the highest the worst. For example, for Ratio of Primary to Upper Primary Schools, which is a negative indicator, the lowest value is considered best and the highest worst, while for Percentage of Female Teachers, which is a positive indicator, the highest value is considered best and the lowest worst. Each value is then normalized using the formula:
Normalized value = (Observed value - Worst value) / (Best value - Worst value)
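A minimal Python 3 sketch of this best/worst normalization is given below, assuming the normalized value is (value - worst) / (best - worst); this assumption reproduces the normalized figures reported for the 2007 Ratio of Primary to Upper Primary Schools. The function name `normalize` and the `positive` flag are illustrative names, not part of the project code, and only four of the reported district values are used.

```python
def normalize(values, positive=True):
    """Scale values to [0, 1] so that the best value maps to 1 and the worst to 0.

    For a positive indicator the highest value is best; for a negative
    indicator the lowest value is best. Assumes best != worst.
    """
    best = max(values) if positive else min(values)
    worst = min(values) if positive else max(values)
    return [(v - worst) / (best - worst) for v in values]


# Four of the 2007 ratios quoted in the text (a negative indicator).
ratios = [2.375, 0.778921865536039, 3.45161290322581, 1.36820083682008]
normalized = normalize(ratios, positive=False)
print(normalized)
```

The lowest ratio maps to 1 and the highest to 0, so after normalization a larger value always means a better district, whatever the nature of the indicator.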

The data for each parameter of a particular year is normalized so that each value lies between 0 and 1. For example, the data for Ratio of Primary to Upper Primary Schools for 2007 before normalization is:
[2.375, 2.63503649635036, ....., 0.778921865536039, 1.88235294117647, ....., 3.45161290322581, 2.36206896551724, ....., 1.44387755102041, 1.36820083682008]
Since Ratio of Primary to Upper Primary Schools is a negative indicator, 0.778921865536039 is considered the best value and 3.45161290322581 the worst. Using these best and worst values, the remaining values are normalized; the normalized values are shown below:
[0.4028198127069771, 0.3055259270002585, ...., 0.42178035657870094, 0.561992781848579, 0.4412637217299573, ...., 0.7512036834384166, 0.7795184841891026]
6.1.3. Principal Components Analysis
After the normalization is completed for all indicators across districts, the data is ready for EDI evaluation. The next step is to assign weights to each group of indicators. We have used Principal Component Analysis (PCA) to assign weights to the indicator groups.

The objective of Principal Component Analysis is to reduce the dimensionality (number of indicators) of the data set while retaining most of the original variability in the data. The entire EDI calculation is done using matrix operations. We have used the numpy module of Python to perform the matrix manipulations and the principal component analysis. First, a matrix is loaded for each indicator group, with the indicators as columns and the district values as rows. If the indicator group has only one indicator, then the weight of the indicator group is the same as the indicator matrix. Principal components are calculated for matrices with more than one indicator. Then the eigenvalues and the projection of the data onto the principal component space are calculated. The number of principal components is determined by checking the eigenvalues: the number of principal components to be extracted is equal to the number of eigenvalues greater than 1. If two eigenvalues are greater than 1, then two principal components are extracted. The value of a principal component is equal to the correlation coefficient between the new projected data and the mean-subtracted original data (i.e. the original data with the average of each dimension subtracted). This gives the component matrix of each indicator group. Although the initial or unrotated factor matrix indicates the relationship between the factors and individual variables, it seldom results in factors that can be interpreted, because the factors are correlated with many variables. Therefore, through rotation the factor matrix is transformed into a simpler one that is easier to interpret. In rotating the factors, we would like each factor to have nonzero, or significant, loadings or coefficients for only some of the variables. Likewise, we would like each variable to have nonzero or significant loadings with only a few factors, if possible with only one.

The most commonly used method for rotation is the varimax procedure. This is an orthogonal method of rotation that minimizes the number of variables with high loadings on a factor, thereby enhancing the interpretability of the factors. Orthogonal rotation results in factors that are uncorrelated.

6.1.4. Weight Calculation
The rotated component matrix is used to calculate the weights for the indicator group. The first eigenvalue is multiplied with the first extracted component and the second eigenvalue with the second extracted component. The absolute values across each row (i.e. for each indicator of the group) are summed up, which gives the weight for that particular indicator. The weight of every indicator is calculated in this way, and the weights are summed to get the grand total of weights. Then, the index for each district is calculated using the weights of the indicators and the normalized values of the indicators for that district: the normalized value of each indicator is multiplied by the weight of that indicator, the products are summed, and the sum is divided by the grand total of weights to obtain the index of that district.
6.1.5. EDI Calculation
We have defined 12 indicators that contribute to the calculation of EDI, which are divided into four sub-groups. The indicators of each indicator group are loaded and normalized, and the index of each indicator group is calculated for each district. The indices of the indicator groups then form the matrix used to calculate EDI, with the indicator groups as columns and the district-wise values as rows. The matrix is normalized and the principal components are calculated. These principal components are used to obtain the weights, and the weights are used to calculate the index of each district. This index is the Educational Development Index of each district for the given year. EDI can be used to rank districts: the district with the highest EDI is considered best and the one with the lowest is considered worst.
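The weight calculation of Section 6.1.4 and the ranking of Section 6.1.5 can be sketched in Python 3 as follows. The rotated loadings, eigenvalues, district names and normalized values below are all hypothetical; in the project they would come from the PCA and normalization steps. The function names are illustrative only.

```python
def indicator_weights(rotated_loadings, eigenvalues):
    """One weight per indicator (row): sum of |loading * eigenvalue| over components."""
    return [
        sum(abs(loading * ev) for loading, ev in zip(row, eigenvalues))
        for row in rotated_loadings
    ]


def district_index(normalized_values, weights):
    """Weighted sum of normalized indicator values divided by the grand total of weights."""
    weighted = sum(v * w for v, w in zip(normalized_values, weights))
    return weighted / sum(weights)


# Hypothetical rotated component matrix: 3 indicators x 2 extracted components.
loadings = [[0.9, 0.1], [0.8, -0.2], [0.1, 0.95]]
eigenvalues = [1.6, 1.1]  # both greater than 1, so two components are kept
weights = indicator_weights(loadings, eigenvalues)

# Hypothetical normalized indicator values per district.
districts = {"District A": [0.8, 0.6, 0.4], "District B": [0.2, 0.3, 0.9]}
indices = {name: district_index(vals, weights) for name, vals in districts.items()}
ranked = sorted(indices, key=indices.get, reverse=True)  # best district first
print(weights, indices, ranked)
```

Because every normalized value lies between 0 and 1 and the weights are non-negative, the resulting index also lies between 0 and 1.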


6.2. Clustering Algorithm and Techniques
6.2.1. K-means Clustering
K-Means is a simple learning algorithm for cluster analysis. It aims to divide n entities into k groups in the best possible way. It is a center-based algorithm: at the end, the total distance between each group's members and its centroid, the representative of the group, is minimized. Formally, the goal is to partition the n entities into k sets Si, i = 1, 2, ..., k in order to minimize the within-cluster sum of squares (WCSS), defined as:

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

where the term $\lVert x_j - \mu_i \rVert^2$ provides the distance between an entity point $x_j$ and the cluster's centroid $\mu_i$. [7]
The most common algorithm uses an iterative refinement approach, following these steps:
1. Define the initial groups' centroids. This step can be done using different strategies. A very common one is to assign random values for the centroids of all groups. Another approach is to use the values of K different entities as the centroids.
2. Assign each entity to the cluster that has the closest centroid. In order to find the cluster with the most similar centroid, the algorithm must calculate the distance between all the entities and each centroid.
3. Recalculate the values of the centroids. The values of the centroid's fields are updated, taken as the average of the values of the attributes of the entities that are part of the cluster.
4. Repeat steps 2 and 3 iteratively until entities can no longer change groups.
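The iterative refinement steps above can be sketched in plain Python 3 as follows. This is an illustrative implementation only, not the Hartigan and Wong variant used in the project; the function names, the `seed` parameter and the six sample points (average marks of schools in two subjects) are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    # Step 1: use k distinct entities as the initial centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each entity to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once no entity changes group.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, assignment


def wcss(points, centroids, assignment):
    """Within-cluster sum of squares for a given assignment."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))


# Hypothetical average marks of six schools in two subjects.
marks = [[30, 35], [32, 33], [28, 36], [70, 75], [72, 78], [68, 74]]
centroids, labels = k_means(marks, k=2)
total_wcss = wcss(marks, centroids, labels)
print(labels, centroids, total_wcss)
```

K-means only finds a local optimum that depends on the initial centroids, which is why running it several times and keeping the run with the lowest WCSS is a common practice.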


K-Means is a greedy, computationally efficient technique and the most popular representative-based clustering algorithm. The pseudo code of the K-Means algorithm is shown below.
6.2.2. Hartigan and Wong Implementation of K-means Clustering
Given n objects with p variables measured on each object, x(i, j) for i = 1, 2, ..., n; j = 1, 2, ..., p, K-means allocates each object to one of K groups or clusters to minimize the within-cluster sum of squares:

$$\sum_{k=1}^{K} \sum_{i \in k} \sum_{j=1}^{p} \left( x(i,j) - \bar{x}(k,j) \right)^2$$

where $\bar{x}(k,j)$ is the mean of variable j over all elements in group k.

In addition to the data matrix, a K x p matrix giving the initial cluster centers for the K clusters is required. The objects are initially allocated to the cluster with the nearest cluster mean. Given the initial allocation, the procedure iteratively searches for the K-partition with locally optimal within-cluster sum of squares by moving points from one cluster to another.
6.2.3. Cluster Analysis of SLC Results
A large number of students fail the SLC Examination, so we decided to perform cluster analysis on the SLC dataset. The government spends millions of rupees on education. Every year hundreds of teachers are hired by the government: some full time, some part time, and some under the rahat quota. Still, performance remains poor. We have therefore tried to address the question of how the government should hire teachers and distribute the hiring schemes, based on cluster analysis of SLC marks data. We have performed cluster analysis for each subject based on the average marks in that subject over 3 years of data.

Mathematics, Science, English, Health, Social Studies and Nepali are the core subjects for SLC. We have applied the K-means algorithm to the 3-year dataset and tried to find the schools that have very low performance in terms of marks in various subjects.


Methodology
1. Prepare the dataset for analysis.
2. Apply the K-means algorithm.
3. View and analyze the clusters.
4. Find the best clustering from many runs.

6.3. Regression Algorithm and Techniques
In statistics, regression analysis focuses on discovering the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed, and to explore the forms of these relationships. Given a data set {yi, xi1, ..., xip} [i = 1 to n] of n statistical units, a linear regression model assumes that the relationship between the dependent variable yi and the p-vector of regressors xi is linear. This relationship is modeled through a disturbance term or error variable Ei, an unobserved random variable that adds noise to the linear relationship between the dependent variable and the regressors. Thus the model takes the form
yi = β1 xi1 + ... + βp xip + Ei = xi^T β + Ei, i = 1, ..., n
where T denotes the transpose. Often these n equations are stacked together and written in vector form as:
y = Xβ + E
Some remarks on terminology
1. yi is called the regressand, response variable, measured variable, or dependent variable. The dependent variable is decided based on a presumption that the value of one of the variables is caused by, or directly influenced by, the other variables.
2. xi are called regressors, explanatory variables, covariates, input variables, predictor variables or independent variables.
3. β is a p-dimensional parameter vector. Its elements are also called effects, or regression coefficients.
4. Ei is called the error term, disturbance term or noise. This variable captures all factors other than the regressors xi that influence the dependent variable.
In linear regression, the model specification is that the dependent variable is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling data points there is one independent variable, xi, and two parameters, β0 and β1:
Straight line: yi = β0 + β1 xi + Ei, i = 1, ..., n
In multiple linear regression, there are several independent variables or functions of independent variables. For example, adding a term in xi^2 to the straight line gives:
Parabola: yi = β0 + β1 xi + β2 xi^2 + Ei, i = 1, ..., n
Regression Model as forecasting Tool
Quantitative forecasting models are used to forecast future data as a function of past data; they are appropriate when past data are available. These methods are usually applied to short or intermediate-range decisions.
Regression Model as prediction Tool
In statistics, prediction is a part of statistical inference. Statistical inference is the process of drawing conclusions from data that is subject to random variation. The outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy.


6.3.1. Implementing Regression Method
Regression analysis proceeds by formulating an equation (model) whose line closely fits the existing, known data points. This equation can then be extended to forecast future values. This project uses a regression model for forecasting different educational parameters. Initially the different factors of education are divided into broad performance categories as follows:
    performanceAreas['ENROLLMENT'] = ['gross-enrollment-rate']
    performanceAreas['PARTICIPATION'] = ['participation-dalit-student', 'participation-janjati-student']
    performanceAreas['PERFORMANCE'] = ['repetition-rate', 'dropout-rate', 'p-enrolled-student-passed']
    performanceAreas['FACILITY'] = ['r-pr-upr-school', 'a-scr', 'a-pupil-teacher-r']
    performanceAreas['TEACHER'] = ['p-female-teacher', 'p-unqualified-teacher']
    performanceAreas['GPI'] = ['gender-parity-index']
Using the data analysis tool rpy2, each individual factor and its time-series data of 6 years are fed to the linear model. Since the data is observed to fluctuate over the past years, polynomial fitting is selected as the best approach for predicting the future value of each model. For each parameter, the regression model is determined and its slope for future time is calculated. From the slope, the following conclusions are drawn:
1. A regression line with negative slope indicates negative performance on that parameter.
2. A regression line with zero slope indicates constant performance.
3. A regression line with positive slope indicates better performance on that parameter.
Based on the slope, the system assigns each parameter one of three performance decisions, namely Good, Medium and Bad. These decisions are finally fed to the broad categories, which decide the stage of performance as successful, fail or warning for each performance area.
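The slope-based decision described in this section can be sketched as below. This is a simplified Python 3 illustration that fits a straight line by least squares rather than the polynomial fit through rpy2 used in the project. The function names, the `tolerance` used to treat a slope as zero and the three 6-year series are hypothetical.

```python
def least_squares_slope(years, values):
    """Slope of the least-squares straight line through (year, value) points."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx


def performance_decision(slope, tolerance=1e-6):
    """Positive slope -> Good, (near) zero slope -> Medium, negative slope -> Bad."""
    if slope > tolerance:
        return "Good"
    if slope < -tolerance:
        return "Bad"
    return "Medium"


# Hypothetical 6-year series for three parameters (illustration only).
years = [2007, 2008, 2009, 2010, 2011, 2012]
series = {
    "gross-enrollment-rate": [90.0, 92.0, 95.0, 97.0, 99.0, 103.0],
    "p-female-teacher": [30.0, 30.0, 30.0, 30.0, 30.0, 30.0],
    "gender-parity-index": [0.99, 0.98, 0.97, 0.96, 0.95, 0.94],
}
decisions = {
    name: performance_decision(least_squares_slope(years, values))
    for name, values in series.items()
}
print(decisions)
```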


6.3.2. Implementation of Multiple Regression
Total Student Enrollment Model
The enrollment of girls and boys in schools is determined by geographical factors, infrastructure, the presence of teachers and the availability of toilets within the school. The total student enrollment model is described by:
Enrollment = b0 + b1 * total-teacher-number + b2 * total-classroom-number + b3 * total-toilet-number
Girl's Enrollment Model
The overall study of the dataset shows that girls' enrollment in educational institutes is lower than that of boys, even though the female population is high. The girl's enrollment model is described by:
Girl's enrollment = b0 + b1 * female-teacher-number + b2 * male-teacher-number + b3 * total-toilet-number + b4 * student-classroom-ratio
Dropout Rate Model
Dropout rate describes the number of students that leave the school for some reason. Many of these students may drop out of one school and join another. The school dropout rate is mainly affected by the percentage of enrolled students passed, the gender parity index and the percentage of qualified teachers. The dropout rate model is described by:
Dropout = b0 + b1 * percentage-enrolled-student-passed + b2 * gender-parity-index + b3 * percentage-of-qualified-teacher
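To show how the coefficients of such a model can be estimated, the Python 3 sketch below fits the total student enrollment model by ordinary least squares, solving the normal equations (X^T X) b = X^T y with Gaussian elimination. The function names and the six schools are hypothetical; the enrollment figures were generated without noise from b0 = 20, b1 = 15, b2 = 10, b3 = 5, so the fit recovers these values, which real data would not do exactly.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(predictors, target):
    """Return [b0, b1, ..., bm] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(p) for p in predictors]  # leading 1 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefficients, predictor):
    return coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], predictor))


# Hypothetical schools: (teachers, classrooms, toilets) and total enrollment.
schools = [(5, 4, 2), (8, 6, 2), (10, 9, 3), (12, 8, 5), (15, 12, 3), (7, 7, 1)]
enrollment = [145, 210, 275, 305, 380, 200]

coefficients = fit_ols(schools, enrollment)
estimate = predict(coefficients, (9, 7, 2))
print(coefficients, estimate)
```

The girl's enrollment and dropout rate models have the same form and would only differ in the predictor columns passed to `fit_ols`.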

Correlation Implementation
Correlation analysis is conducted separately for district-level parameters and school-level parameters.


The district-level parameters are:
1. Gross enrollment rate
2. Gender parity index
3. Repetition rate
4. Dropout rate
5. Percentage of enrolled students passed
6. Average student school ratio
7. Average pupil teacher ratio
8. Percentage of unqualified teachers
Similarly, the school-level parameters are:
1. Total male teachers
2. Total female teachers
3. Total boys enrolled
4. Total girls enrolled
5. Total toilet number
6. Total classroom number

6.4. ID3 Algorithm for Schools Classification
ID3 (Iterative Dichotomiser 3) was first introduced by J. R. Quinlan in the late 1970's. It is a greedy algorithm that selects the next attribute based on the information gain associated with the attributes. [8] The information gain is measured using entropy. ID3 is based on the Concept Learning System (CLS). The basic CLS algorithm over a set of training instances C is: [9]


6.4.1. Algorithm
Step 1: If all instances in C are positive, then create a YES node and halt. If all instances in C are negative, create a NO node and halt. Otherwise select a feature F with values v1, ..., vn and create a decision node.
Step 2: Partition the training instances in C into subsets C1, C2, ..., Cn according to the values of F.
Step 3: Apply the algorithm recursively to each of the sets Ci.
6.4.2. Data Description
The sample data used by ID3 has certain requirements, which are:
• Attribute-value description: the same attributes must describe each example and have a fixed number of values.
• Predefined classes: an example's attributes must already be defined, that is, they are not learned by ID3.
• Discrete classes: classes must be sharply delineated. Continuous classes broken up into vague categories such as a metal being "hard, quite hard, flexible, soft, quite soft" are suspect.

6.4.3. Sufficient examples
Since inductive generalization is used (i.e. not provable), there must be enough test cases to distinguish valid patterns from chance occurrences.
6.4.4. Attribute Selection
ID3 selects an attribute through a statistical property termed information gain. Gain measures how well a given attribute separates training examples into the targeted classes. The attribute with the highest information gain (information being the most useful for classification) is selected. In order to define gain, we first borrow an idea from information theory called entropy. Entropy measures the amount of information in an attribute. Given a collection S of c outcomes,

Entropy(S) = Σ -p(I) log2 p(I)
where p(I) is the proportion of S belonging to class I, Σ is over the c classes, and log2 is the logarithm to base 2.

Gain(S, A), the information gain of example set S on attribute A, is defined as
Gain(S, A) = Entropy(S) - Σ ((|Sv| / |S|) * Entropy(Sv))
where:
Σ is over each value v of all possible values of attribute A
Sv = subset of S for which attribute A has value v
|Sv| = number of elements in Sv
|S| = number of elements in S
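The entropy and gain definitions above translate directly into code. The Python 3 sketch below computes both and picks the attribute with the highest gain, which is the selection step ID3 repeats at every node. The function names, the discretized marks ("high"/"low") and the class labels are hypothetical examples, not the project's data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) = sum over classes of -p(I) * log2 p(I)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical schools with discretized average marks and a known class.
rows = [
    {"Mathematics": "high", "English": "high"},
    {"Mathematics": "high", "English": "low"},
    {"Mathematics": "low", "English": "high"},
    {"Mathematics": "low", "English": "low"},
]
labels = ["good", "good", "poor", "poor"]

gain_maths = information_gain(rows, labels, "Mathematics")
gain_english = information_gain(rows, labels, "English")
chosen = best_attribute(rows, labels, ["Mathematics", "English"])
print(gain_maths, gain_english, chosen)
```

In this toy data Mathematics separates the two classes perfectly, so its gain equals the full entropy of the set, while English carries no information about the class.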


7. SYSTEM DESCRIPTION
7.1. Requirement Specification
The requirement specification provides a detailed picture of the functional and non-functional requirements. The Software Requirement Specification describes the purpose and environment of the software under development. Requirements will be obtained by visiting various organizations concerned with the development of the educational sector.
7.1.1. High Level Requirement
The system shall be capable of generating rules. It shall also be capable of forecasting various indices based on previous trends. The system shall also cluster schools based on their educational performance and indices. The system shall support visualization of statistical analysis. The system will have two types of users: Admin user and Analyst user. The Admin is capable of user management, while the Analyst can access the reports and analyze the operations.
7.1.2. Functional Requirement
The functional requirements of our project are listed below:

Relationship Mining: The system shall identify relationships between various educational indices and find out which variables are most strongly associated with a single variable of particular interest. The system shall establish association rules between various educational indicators.

Clustering: The system shall cluster schools on the basis of various educational indicators. The analyst shall specify the number and data range for segmentation.

Forecasting: The system shall be able to forecast the educational indicators and indices.

Visualization: The system shall provide attractive visualization of the current educational performance of various districts using maps, charts and graphs. The system shall also use effective visualization techniques to show the outputs of clustering and analysis.

Educational Index Analyzer: The system shall provide an effective tool to calculate the Educational Development Index of every district. The analyzer shall rank every district based on the Educational Development Index.

7.2. System Block Diagram

Figure 7.1 : System Block Diagram


7.3. Description of System
7.3.1. Data Extraction and Storage
DOE provided us with the data in .xls format. There were multiple files with multiple sheets in each file. The data was unmanaged, unformatted and haphazardly distributed, and a lot of it was missing. We scanned each of the files and sheets to check and correct the data. A lot of effort was required for data cleaning, and we also needed to construct some of the missing data. In addition, the data was not available as a whole, so the extraction task became more tedious than expected. Once we were assured that the data could be extracted, we used the xlrd library available in Python to read the data from the .xls files.
We used a MySQL database to store our data. We constructed two databases: one for the API call interface and the other for analysis, visualization and mapping. The database for the API was constructed according to the format of the available data. This database was used to convert the data from .xls form into json, csv and xml formats. We have completed the conversion to json and xml, and the conversion to csv is in progress. The database for mapping and visualization is mostly based on ecological and regional divisions such as VDC/municipality, district and zone. This is a crude database for visualizing the data on the map. The complete database for full analysis and visualization is being developed.
The GIS data in geojson format was large (about 7 MB) and was difficult to store in the database. So we stored it by breaking it into VDC-wise pieces. This made the data easy to store in and retrieve from the database.
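The conversion of tabular records into json, csv and xml described above can be sketched as follows. This is a minimal sketch using only the Python standard library, assuming the sheet rows have already been read (for example with xlrd) into a list of dictionaries. The column names, sample values and XML tag names are hypothetical, not the schema of the DOE files.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def rows_to_json(rows):
    """Serialize a list of row dictionaries as a JSON array."""
    return json.dumps(rows, ensure_ascii=False)


def rows_to_csv(rows):
    """Serialize rows as CSV text, with the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_xml(rows, root_tag="records", row_tag="record"):
    """Serialize rows as XML, one child element per column (tag names assumed)."""
    root = ET.Element(root_tag)
    for row in rows:
        node = ET.SubElement(root, row_tag)
        for key, value in row.items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical rows; real rows would come from the cleaned .xls sheets.
SAMPLE_ROWS = [
    {"district": "Kaski", "year": 2011, "schools": 10},
    {"district": "Gorkha", "year": 2011, "schools": 12},
]
```

Keeping the three serializers behind the same list-of-dictionaries input means a new output format only needs one more function, not another pass over the source files.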

7.3.2. Database Design
ER Diagram

Figure 7.2: ER Diagram for the System
EER Diagram
The EER diagram is basically the diagram of the database for API development. This database was based on the data available from DOE. It consists of the tables described below:

School: consists of information about the school

Address: consists of the attributes of the school address

Vdc: consists of district-wise VDC information of Nepal

District: consists of zone-wise district information of Nepal

Anchal: consists of information about the zones of Nepal

Teacher: consists of information about the teachers in the school

Infrastructure: consists of information about the physical infrastructure available in the school

Outcome: consists of information about the attributes of school performance

Grade: consists of information about the enrollment and dropouts of students in the school

UML Diagram for System
Use case Diagram for the System

Figure 7.3: Use case Diagram for the System

Sequence Diagram for the API Call

Figure 7.4: Sequence Diagram for the API call


Sequence Diagram for Visualization

Figure 7.5: Sequence Diagram for Visualization
7.3.3. API Interface and Services
To make the data accessible to the various users of the application in their desired format, we have built an API (Application Programming Interface). Users can get the desired data via an API call. The API provides the data in three formats: xml, json and csv. The following documentation gives more detail about the API call parameters and format.

API Call format:
http://hamroschool.org/parameterName/year_value/district_name/factor_name/

Call Parameters and Variables:

Name | Variables | Type | Description
parameterName | SchoolEd | String | This parameter can be used to get various information related to the number of schools in the country or in a district.
parameterName | Enrollment | String | This parameter can be used to get various information related to the enrollment statistics of the country or a district.
parameterName | Physical | String | This parameter can be used to find the state of physical infrastructure, such as the number of pakki and kachhi classrooms, number of toilets and drinking water facilities of schools in the country or a district.
parameterName | Teacher | String | This parameter can be used to get information related to teachers, such as the number of male teachers, female teachers, permanent teachers, allocated teachers, teachers in primary level etc., for Nepal or a district.
Year | yy format | integer | year.
District | Nepal | String | To get the dataset of all the districts of Nepal.
District | Kaski, Gorkha, Kathmandu, ... | String | To get the dataset of all the VDCs of the desired district.
factor_name | totalSchoolByLevel (Parameter: SchoolEd) | String | To get the number of schools by level, i.e. the number of primary, lower secondary or secondary schools in Nepal or any district of Nepal.
factor_name | totalSchoolByGrade (Parameter: SchoolEd) | String | To get the number of schools running various classes in various districts of Nepal.
factor_name | totalEnrollmentFigure (Parameter: Enrollment) | String | This parameter can be used to get enrollment figures such as male enrollment, female enrollment etc. in various districts and VDCs of Nepal.
factor_name | totalEnrollmentFigureClassWise (Parameter: Enrollment) | String | This parameter can be used to get the class-wise enrollment figures of various districts and VDCs.
factor_name | totalPhysicalFacility (Parameter: Physical) | String | This parameter can be used to get the dataset containing various information about the physical facilities of schools in the districts and VDCs of Nepal.
factor_name | totalTeacherInfoLevelWise (Parameter: Teacher) | String | This parameter can be used to get the dataset with various information related to the teachers teaching in various schools in various districts and VDCs of Nepal.
factor_name | totalTeacherInfoTypeWise (Parameter: Teacher) | String | This parameter can be used to get the dataset with various types of information related to the teachers teaching in various schools in the districts and VDCs of Nepal.

Table 7.1: Call Parameters and Variables
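As an illustration of the call format documented above, the following sketch assembles a request URL from its path segments. The helper `build_api_url` is hypothetical and not part of the service. The sketch assumes the segments are simply joined in the documented order with a trailing slash and that the year is passed exactly as the caller supplies it.

```python
from urllib.parse import quote

BASE_URL = "http://hamroschool.org"


def build_api_url(parameter_name, year=None, district=None, factor=None):
    """Join the documented path segments in order.

    Trailing segments may be omitted; the first missing segment ends the path.
    """
    parts = [parameter_name]
    for value in (year, district, factor):
        if value is None:
            break
        parts.append(str(value))
    # quote() escapes characters that are not safe inside a URL path segment.
    return BASE_URL + "/" + "/".join(quote(part) for part in parts) + "/"


EXAMPLE_URL = build_api_url("Enrollment", 2011, "Kaski", "totalEnrollmentFigure")
```

Here `EXAMPLE_URL` is `http://hamroschool.org/Enrollment/2011/Kaski/totalEnrollmentFigure/`.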

Other API Formats
http://hamroschool.org/parameterName/districtName
This API call lists the parameter values for each district in json format.

SN   parameterName                  Remarks
1    Edi                            Educational Development Index
2    a-ssr                          Average Student School Ratio
3    r-pr-upr-school                Ratio of Primary to Upper Primary School
4    p-female-teacher               Percentage of Female Teacher
5    a-pupil-teacher-r              Pupil Teacher Ratio
6    p-unqualified-teacher          Percentage of Unqualified or Untrained Teacher
7    gross-enrollment-rate          Gross Enrollment Rate
8    participation-dalit-student    Participation of Dalit Students in number
9    participation-janjati-student  Participation of Janjati Students in number
10   gender-parity-index            Gender Parity Index
11   repetition-rate                Repetition Rate
12   dropout-rate                   Dropout Rate
13   p-enrolled-student-passed      Percentage of Enrollment Students passed
14   Numteacher                     Number of teachers
15   NumStudents                    Number of Students
16   NumSchool                      Number of Schools

Table 7.2: Parameter values for each district

DistrictName:
kaski, Gorkha, Kathmandu... for an individual district
' = for whole Nepal

Year: 2007, 2008, 2009, 2010, 2011

http://hamroschool.org/parameterName/year
This API call lists all parameter values for all districts in json format.

http://hamroschool.org/parameterName/
This API call lists all parameter values for all districts from 2007 to 2011 in json format.

7.3.4. Visualization
We used the API call interface to visualize the data on the map. After the district and VDC layers were stored in the database, we used the Leaflet framework with OpenStreetMap to obtain the map of Nepal. Leaflet is a modern open-source JavaScript library for mobile-friendly interactive maps. It has all the features most developers ever need for online maps. Leaflet is designed with simplicity, performance and usability in mind. It works efficiently across all major desktop and mobile platforms out of the box on modern browsers while still being accessible on older ones. It can be extended with many plugins, and has a beautiful, easy-to-use and well-documented API and simple, readable source code.
Visualization of the available data is ongoing. We can visualize some portions of the available data with the help of the API call interface. The API provides the data from the database in json or xml form. The json/xml response is parsed to obtain the required data, which is then passed to the map for visualization. The data for the map is obtained using JavaScript. Some analysis of the data is also done using statistical charts such as bar charts and pie charts.
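The parse-then-map step described above can be sketched as follows. The project does this in JavaScript with Leaflet; the sketch below only illustrates the logic in Python. The response shape (a JSON array of objects with `district` and `value` keys) and the equal-interval binning are assumptions, since the actual response layout and shading scheme are not documented here.

```python
import json


def parse_district_values(payload):
    """Turn a JSON response into a {district: value} lookup (assumed layout)."""
    records = json.loads(payload)
    return {record["district"]: float(record["value"]) for record in records}


def equal_interval_classes(values, n_classes=5):
    """Assign each district a shading class from 0 (lowest) to n_classes - 1."""
    lowest, highest = min(values.values()), max(values.values())
    # Fall back to a width of 1.0 when every district has the same value.
    width = (highest - lowest) / n_classes or 1.0
    return {
        district: min(int((value - lowest) / width), n_classes - 1)
        for district, value in values.items()
    }


# Sample payload built from three 2011 EDI values listed in Table 8.1.
SAMPLE_PAYLOAD = json.dumps([
    {"district": "Kathmandu", "value": 0.830128},
    {"district": "Gorkha", "value": 0.532231},
    {"district": "Sarlahi", "value": 0.300319},
])

SHADING = equal_interval_classes(parse_district_values(SAMPLE_PAYLOAD))
```

Each class index would then be mapped to a fill color when the district polygons are drawn on the map.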

8. RESULTS, VISUALIZATION AND ANALYSIS OF THE DATA
8.1. Output and Analysis of the Parameter of EDI
8.1.1. Average Student School Ratio

Figure 8.1: Visualization of top 10 districts with maximum ASSR

In 2011, the highest Student School Ratio was found in Bara district, with a value of 257.04. Bara, Mahottari, Rautahat, Dhanusa, Siraha, Mahottari, Parsa, Sarlahi, Kailali and Dang were the top 10 districts by Student School Ratio. These are the districts on which infrastructure development needs to focus. The Government therefore needs to make a concrete plan to improve the Infrastructure Index of these districts, and any investment in building construction or physical facilities should be focused on them.

8.1.2. Percentage of Female Teachers

Figure 8.2 : Top 10 districts with percentage of female teachers in 2011

In 2011, Kathmandu was the district with the highest percentage of female teachers. It was followed by Lalitpur, Bhaktapur, Kaski, Banke, Makwanpur, Chitwan, Rupandehi, Tanahun and Dhading, the districts with the greatest participation of women in the teaching profession. So any scheme that motivates female teachers should be made in these districts.

8.1.3. Percentage of Unqualified (Untrained) Teachers

Figure 8.3: Top 10 districts with high percentage of untrained teachers in 2011

In 2011, Kavrepalanchowk stood as the district with the highest percentage of untrained teachers. It was followed by Lalitpur, Jhapa, Rolpa, Dang, Surkhet, Makwanpur, Bajhang, Dailekh and Achham. The Government should make a concrete plan to train the teachers in these districts. Any teacher training scheme in these districts will help increase their educational development.

8.1.4. Repetition Rate

Figure 8.4 : Visualization of Top 10 Repetition Rate in 2011

In 2011, Saptari, Gorkha, Pyuthan, Humla, Sankhuwasabha, Doti, Makwanpur, Rolpa, Gulmi and Mustang had the highest repetition rates. The repetition rate may be high for various reasons, such as the quality of teaching the students receive or unqualified teachers. The Government should make appropriate plans to lower the repetition rate in these districts, where any such scheme can be effective.

8.1.5. Drop Out Rate

Figure 8.5 : Visualization of districts with top 10 Dropout Rates

In 2011, Jumla had the highest dropout rate. Following Jumla, Manang, Humla, Mustang, Mugu, Bajhang, Dolpa, Kalikot, Doti and Bajura had the highest dropout rates. Any plans from the Department of Education to reduce dropout will be effective in these districts.

8.1.6. Gross Enrollment Rate

Figure 8.6 : Visualization of Top 10 GER in 2011

In 2011, Taplejung, Jajarkot, Kalikot, Mugu, Dailekh, Surkhet, Mugu, Bhaktapur, Bajura and Kathmandu had the highest Gross Enrollment Rates. The districts with the lowest GER were Manang, Dhanusa, Parsa, Bara, Morang, Sarlahi, Siraha, Saptari, Mustang and Doti. Government plans to increase enrollment would therefore be more effective in these districts.

8.1.7. Student Teacher Ratio

Figure 8.7 : Visualization of Districts with Top STR values

In 2011, Rautahat, Sarlahi, Mahottari, Rukum, Dhanusa, Siraha, Jajarkot, Bara, Saptari and Salyan had the highest values of the Student Teacher Ratio. If the government wants to increase the quota of teachers, these are the districts where an increase in the number of teachers would be most suitable.

8.2. Calculation, Visualization and Findings of Educational Development Index
8.2.1. EDI Values of year 2007 to 2011
The calculated values of EDI for all the districts are shown in the table below:

District Name    2007      2008      2009      2010      2011
Taplejung        0.510022  0.505344  0.501033  0.504016  0.529912
Sankhuwasabha    0.489331  0.434429  0.444954  0.476466  0.486744
Solukhumbu       0.449399  0.418472  0.379689  0.393557  0.518423
Panchthar        0.501691  0.455663  0.471149  0.560319  0.578512
Ilam             0.504689  0.459849  0.48472   0.504955  0.528464
Dhankuta         0.550178  0.531049  0.548915  0.536961  0.560651
Terhathum        0.543359  0.506872  0.516062  0.502502  0.532627
Bhojpur          0.479549  0.436943  0.484501  0.479491  0.482968
Okhaldhunga      0.473838  0.487754  0.56357   0.557459  0.535982
Khotang          0.496396  0.459438  0.468161  0.444605  0.529903
Udayapur         0.527587  0.559673  0.577328  0.591825  0.631237
Jhapa            0.535207  0.585297  0.582072  0.555252  0.612003
Morang           0.489856  0.522145  0.516412  0.498156  0.621558
Sunsari          0.492937  0.529714  0.527406  0.564253  0.616382
Saptari          0.291991  0.380417  0.3216    0.418485  0.445062
Siraha           0.314124  0.348041  0.282708  0.270376  0.404918
Dolakha          0.478199  0.459615  0.45501   0.463143  0.563399
Sindhupalchok    0.485295  0.489491  0.486404  0.528474  0.566172
Rasuwa           0.465752  0.405801  0.44297   0.391926  0.507034
Sindhuli         0.433112  0.485085  0.537732  0.549007  0.568098
Ramechhap        0.493489  0.474179  0.527788  0.52481   0.558888
Kavrepalanchok   0.617776  0.620782  0.616055  0.590622  0.584586
Nuwakot          0.509215  0.489     0.514113  0.494994  0.531667
Dhading          0.5091    0.509861  0.557907  0.527128  0.565626
Makwanpur        0.469659  0.519321  0.508214  0.505093  0.540955
Dhanusha         0.404049  0.482274  0.361594  0.379744  0.416882
Mahottari        0.312805  0.352666  0.429878  0.40867   0.411769
Sarlahi          0.319024  0.327438  0.324437  0.351403  0.300319
Rautahat         0.268926  0.33536   0.392227  0.34352   0.359809
Bara             0.376371  0.381186  0.319971  0.355245  0.379605
Parsa            0.332598  0.392303  0.290986  0.344405  0.379136
Chitwan          0.594115  0.630642  0.634336  0.558692  0.649516
Lalitpur         0.68565   0.68735   0.651234  0.658138  0.705886
Bhaktapur        0.730688  0.769306  0.74852   0.667492  0.755624
Kathmandu        0.718626  0.763216  0.764065  0.753188  0.830128
Manang           0.648795  0.587982  0.551478  0.534556  0.562919
Mustang          0.541646  0.454923  0.481694  0.391759  0.438607
Gorkha           0.510482  0.50536   0.531099  0.495264  0.532231
Lamjung          0.553471  0.526164  0.534345  0.52734   0.538223
Tanahu           0.591171  0.593483  0.625765  0.603267  0.622978
Syangja          0.578074  0.56951   0.589563  0.55974   0.615096
Kaski            0.699507  0.707837  0.712513  0.673364  0.734965
Myagdi           0.572428  0.554181  0.571668  0.570881  0.602543
Parbat           0.571667  0.546714  0.561318  0.581716  0.554107
Baglung          0.531408  0.539158  0.540272  0.572291  0.583108
Gulmi            0.527836  0.516624  0.558637  0.548293  0.553852
Palpa            0.574043  0.573407  0.584787  0.564977  0.59086
Arghakhanchi     0.518893  0.456738  0.480415  0.491476  0.521344
Nawalparasi      0.496886  0.529974  0.577013  0.547881  0.637527
Rupandehi        0.484699  0.531635  0.623962  0.591351  0.677621
Kapilbastu       0.377482  0.383356  0.344564  0.361133  0.489254
Dolpa            0.378743  0.337448  0.343677  0.344881  0.408089
Jumla            0.446327  0.482977  0.493076  0.482641  0.458565
Kalikot          0.354101  0.36629   0.460446  0.455394  0.459585
Mugu             0.375443  0.414397  0.434692  0.350919  0.445786
Humla            0.337398  0.28213   0.306631  0.325897  0.401017
Pyuthan          0.376761  0.38755   0.456398  0.414459  0.441184
Rolpa            0.355789  0.401782  0.453336  0.469585  0.431929
Rukum            0.436329  0.440555  0.460194  0.532471  0.51385
Salyan           0.320778  0.362964  0.423735  0.383416  0.492708
Surkhet          0.437456  0.434431  0.465486  0.454461  0.473623
Dailekh          0.411143  0.414601  0.474459  0.498689  0.490116
Jajarkot         0.391392  0.406216  0.40183   0.361946  0.484888
Dang             0.483384  0.532162  0.569546  0.561655  0.549541
Banke            0.429408  0.500305  0.518365  0.553985  0.594517
Bardiya          0.40001   0.492355  0.491037  0.475079  0.524312
Bajura           0.427165  0.368158  0.390498  0.416687  0.486408
Bajhang          0.418249  0.376461  0.402894  0.444958  0.41965
Darchula         0.471031  0.440219  0.450026  0.483532  0.509306
Achham           0.388484  0.391054  0.419309  0.510593  0.445877
Doti             0.443259  0.431021  0.478905  0.46407   0.464726
Dadeldhura       0.516752  0.522008  0.538699  0.543962  0.533387
Baitadi          0.469083  0.45549   0.537472  0.570463  0.570988
Kailali          0.486875  0.543437  0.602203  0.58379   0.634962
Kanchanpur       0.484579  0.550319  0.564845  0.547926  0.656025

Table 8.1: EDI values for all districts
8.2.2. Visualization of Matrix for EDI calculation
One can also visualize the progress of a district over time in the various component indices. The following is a sample visualization of the access, infrastructure, outcome and teacher indices for Gorkha district from 2007 to 2011.

8.2.3. EDI Visualization
We have made a tool to visualize the Educational Development Index of each district. It shows the time series of the district's progress in education. Below is a sample plot of the Educational Development Index for Gorkha district from 2007 to 2011.

Figure 8.8: Educational Development Index

8.3. Results and Findings of Cluster Analysis
8.3.1. Cluster Analysis on Average Mathematics Marks
Mathematics is one of the core subjects, and everyone needs a good background in it. Many students fail in mathematics every year. The following 3-D scatter plot shows the distribution of marks in mathematics for the 3 consecutive years 2066, 2067 and 2068.

Figure 8.9: 3-D Scatter Plot of Marks of Maths in 3 consecutive year
From the plot, the marks appear to be concentrated mainly between 20 and 40. The marks are average marks in mathematics: each point is a 3-dimensional representation of a school's average marks in the 3 consecutive years, for the schools appearing in the SLC. A more detailed analysis therefore requires the K-means clustering technique. We applied clustering to find 5 clusters: we initialized the centers with random values and ran 15 iterations to adjust the centers. At the end of the 15th run we got the following results:

Cluster   Year2068   Year2067   Year2066   Size   Withinss
1         28.09462   26.84918   20.56792   1655   253538.3
2         78.50699   81.01044   49.93815   1274   245727.3
3         64.79527   63.31504   39.08761   1057   286181.1
4         46.81571   28.32839   21.55837   1254   196084.5
5         45.44732   46.61653   26.14133   1621   287192.9

Table 8.2 : Center values for Math clusters
The Year2068, Year2067 and Year2066 columns are the center (average) values of the clusters. The between-cluster sum of squares, the total within-cluster sum of squares and the total sum of squares for the cluster analysis are 5820902, 1268724 and 7089626 respectively.
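The procedure described above (randomly initialized centers, 15 adjustment passes, then per-cluster sizes and sums of squares) can be sketched in plain Python. This is an illustrative sketch, not the code used for the reported results: the sample marks are hypothetical, and the centers are initialized from randomly chosen data points, which is one common reading of "random values".

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, iterations=15, seed=0):
    """Run a fixed number of K-means passes; return (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random data points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = mean_point(members)
    return centers, labels


def sum_of_squares(points, centers, labels):
    """Return (total SS, within SS per cluster, between SS, cluster sizes)."""
    grand_mean = mean_point(points)
    total = sum(squared_distance(p, grand_mean) for p in points)
    within = [0.0] * len(centers)
    sizes = [0] * len(centers)
    for p, label in zip(points, labels):
        within[label] += squared_distance(p, centers[label])
        sizes[label] += 1
    between = sum(n * squared_distance(c, grand_mean)
                  for c, n in zip(centers, sizes))
    return total, within, between, sizes


# Hypothetical average marks of six schools in three consecutive years.
MARKS = [(20.0, 22.0, 21.0), (25.0, 24.0, 23.0), (60.0, 62.0, 61.0),
         (65.0, 64.0, 66.0), (40.0, 41.0, 39.0), (42.0, 43.0, 44.0)]
```

Because every center is the mean of its members after the final update, the total sum of squares equals the between-cluster plus the total within-cluster sum of squares. The figures reported above satisfy the same identity: 5820902 + 1268724 = 7089626.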

Figure 8.10: Plot of Maths Cluster
The figure and statistics above give a clear picture of the clusters. Cluster 1 (highlighted in red in the table and in black in the figure) is the worst cluster in terms of its center, and it contains 1655 schools. This cluster needs more attention because its center is below average in every year, which means that the schools in it have mostly failed in all 3 consecutive years. So any teacher hiring process or teacher training by the government should focus on these districts. The worst districts lying in this cluster are listed in the annex.
8.3.2. Cluster Analysis of Science Results
We then analyzed the clusters in science and found the following results:

Cluster   Year2068   Year2067   Year2066   Size   Withinss
1         29.51913   30.48612   30.36418   1636   129496.3
2         19.02291   24.03402   21.49514   1639   142136.1
3         52.46213   55.00529   53.78933   1059   134229.2
4         40.05114   44.50979   43.31607   1226   175515.0
5         18.53111   37.83042   28.74314   1006   110522.6

Table 8.3: Center values for Science clusters

In this analysis all of the clusters have centers below 60%, so we would say that most schools and students in Nepal are weak in Science. Moreover, 2 clusters have centers below 32 in all three years, and another cluster has a center with 2 of its 3 values below 32. So this is the prime subject in which most students fail. The total sum of squares for this run of the algorithm is 3387778.

The following 3-D scatter plot shows how the science marks are concentrated.

Figure 8.11: 3-D Scatter Plot of Marks of Science in 3 consecutive year

Figure 8.12: Plot of Science Cluster

This situation is very serious, since the 2 clusters with centers below 32 together contain more than 3200 schools. So we decided to count, for each district, the number of schools with below-average marks. We found the following statistics.

District         Number   District         Number   District         Number
Achham           56       Ilam             41       Panchthar        33
Arghakhanchi     52       Jajarkot         16       Parbat           19
Baglung          25       Jhapa            68       Parsa            46
Baitadi          50       Jumla            12       Pyuthan          30
Bajhang          41       Kailali          92       Ramechhap        27
Bajura           12       Kalikot          1        Rasuwa           9
Banke            23       Kanchanpur       75       Rautahat         4
Bara             18       Kapilvastu       40       Rolpa            27
Bardia           45       Kaski            42       Rukum            20
Bhaktapur        25       Kathmandu        34       Rupandehi        72
Bhojpur          37       Kavrepalanchok   48       Salyan           32
Chitwan          27       Khotang          31       Sankhuwasabha    29
Dailekh          33       Lalitpur         13       Saptari          48
Dandeldhura      37       Lamjung          22       Sarlahi          5
Dang             50       Mahottari        7        Sindhuli         11
Darchula         18       Makwanpur        41       Sindhupalchok    34
Dhading          35       Manang           0        Siraha           30
Dhankuta         37       Morang           83       Solukhumbu       9
Dhanusha         29       Mugu             13       Sunsari          46
Dolakha          17       Mustang          0        Surkhet          54
Dolpa            7        Myagdi           17       Syangja          58
Doti             44       Nawalparasi      61       Tanahun          49
Gorkha           10       Nuwakot          47       Taplejung        14
Gulmi            78       Okhaldhunga      32       Tehrathum        18
Humla            5        Palpa            57       Udayapur         30

Table 8.4: Number of Schools with Science Marks below 32
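A per-district count like the one tabulated above can be sketched as follows. The record layout (district name plus the three yearly average marks) and the sample records are hypothetical. The rule of counting a school only when all three yearly averages are below 32 is an assumption borrowed from the way the English analysis is described, not a documented detail of the science count.

```python
from collections import Counter

PASS_MARK = 32  # threshold used throughout the cluster analysis


def count_low_schools(records, threshold=PASS_MARK):
    """Count, per district, the schools whose marks are below the threshold
    in every one of the supplied years."""
    counts = Counter()
    for district, marks in records:
        if all(mark < threshold for mark in marks):
            counts[district] += 1
    return counts


# Hypothetical (district, (2066, 2067, 2068) average marks) records.
SAMPLE_RECORDS = [
    ("Kailali", (25.0, 28.5, 30.0)),
    ("Kailali", (31.0, 29.0, 27.5)),
    ("Kailali", (45.0, 30.0, 29.0)),
    ("Manang", (50.0, 55.0, 52.0)),
]

LOW_COUNTS = count_low_schools(SAMPLE_RECORDS)
```

In this sample, two of the three Kailali schools are counted, while Manang has none because its only school is above the threshold in every year.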

Figure 8.13: Performance in Science Marks

Of the schools that fell in the below-32 group in our science analysis, 97% were government schools. The remaining 3% were from the private sector.
8.3.3. Cluster Analysis on English Marks
English is another core subject in school-level education. As it is a foreign language, most students in government schools find it very difficult. We took the 3 years of data and tried to find meaning in it through cluster analysis.

Figure 8.14: 3-D Scatter Plot of Marks of English in 3 consecutive year
We first plotted the marks in a 3-D scatter plot, which gave an interesting insight. Since the data seemed to converge around 2 points, we focused on forming 2 clusters here. The cluster analysis gave the following results.

Cluster   Year2068   Year2067   Year2066   Withinss   Size
1         24.08481   26.19623   25.74609   696506     4634
2         53.66839   50.65694   52.70283   352196.5   2227

Table 8.5: Center values for English clusters
It is striking that, of the 6861 schools taken for analysis, 4634 fall in the cluster whose average marks in English are below 32 in all 3 consecutive years. So our students really do have a problem with English. But the need for the English language cannot be ruled out in the age of globalization, so a lot of effort must be put in here.

Figure 8.15: Plot of English Cluster
The following table shows the number of schools in each district that have marks below 32 in English for 3 consecutive years.

District         Number   District         Number   District         Number
Kailali          89       Makwanpur        49       Darchula         26
Jhapa            85       Baglung          46       Rolpa            26
Morang           83       Bhojpur          46       Sankhuwasabha    26
Gulmi            81       Dhankuta         45       Jajarkot         24
Palpa            81       Udayapur         43       Bajura           22
Nawalparasi      76       Dailekh          42       Solukhumbu       22
Syangja          76       Kapilvastu       42       Lalitpur         20
Kanchanpur       74       Bajhang          41       Ramechhap        19
Kaski            65       Pyuthan          41       Myagdi           17
Dhading          61       Panchthar        39       Sindhuli         17
Rupandehi        61       Salyan           38       Bhaktapur        16
Nuwakot          60       Gorkha           37       Mahottari        16
Surkhet          60       Parsa            37       Bara             14
Chitwan          59       Doti             36       Taplejung        13
Ilam             56       Lamjung          36       Tehrathum        10
Achham           55       Dhanusha         35       Humla            9
Dang             55       Dandeldhura      34       Jumla            9
Saptari          55       Kathmandu        34       Rasuwa           9
Arghakhanchi     54       Siraha           34       Mugu             8
Baitadi          53       Dolakha          32       Sarlahi          8
Kavrepalanchok   53       Okhaldhunga      31       Kalikot          6
Bardia           50       Parbat           31       Rautahat         6
Sunsari          50       Rukum            29       Dolpa            4
Tanahun          50       Banke            28       Manang           0
Khotang          49       Sindhupalchok    27       Mustang          0

Table 8.6: Number Of Schools with English Marks below 32

Figure 8.16: Government School Below Average Mark in English
The above pie chart shows that government schools account for 97% of the 2872 schools with marks below 32. So students of government schools are very weak in English.
8.3.4. Cluster Analysis on Average Marks of Health
Cluster analysis on health marks resulted in the following output:

Cluster   Year2068   Year2067   Year2066   Size   Betweenss
1         28.58901   35.09035   33.18179   1274   89018.43
2         39.92441   35.86845   38.02594   1442   77641.89
3         34.22507   45.57025   39.03304   1368   80696.54
4         53.67909   54.38885   55.35554   1444   106017.7
5         45.92903   45.9916    47.19722   1333   98924.53

Table 8.7: Center values for Health clusters

The clusters here show good performance in health marks. Only a single year in cluster 1 had an average value below 32. Another important feature of these clusters is that most of the cluster centers are shifting towards a higher range every year. So the performance in Health marks is good.

Figure 8.17: Plot of Health Cluster

District         Number   District         Number   District         Number
Achham           2        Ilam             0        Panchthar        1
Arghakhanchi     5        Jajarkot         0        Parbat           0
Baglung          0        Jhapa            1        Parsa            5
Baitadi          0        Jumla            0        Pyuthan          2
Bajhang          0        Kailali          8        Ramechhap        0
Bajura           0        Kalikot          0        Rasuwa           0
Banke            5        Kanchanpur       0        Rautahat         0
Bara             1        Kapilvastu       0        Rolpa            0
Bardia           1        Kaski            2        Rukum            0
Bhaktapur        3        Kathmandu        3        Rupandehi        5
Bhojpur          0        Kavrepalanchok   1        Salyan           6
Chitwan          0        Khotang          0        Sankhuwasabha    1
Dailekh          3        Lalitpur         0        Saptari          3
Dandeldhura      1        Lamjung          0        Sarlahi          1
Dang             4        Mahottari        1        Sindhuli         1
Darchula         1        Makwanpur        1        Sindhupalchok    0
Dhading          2        Manang           0        Siraha           2
Dhankuta         1        Morang           6        Solukhumbu       0
Dhanusha         5        Mugu             1        Sunsari          0
Dolakha          1        Mustang          0        Surkhet          4
Dolpa            0        Myagdi           0        Syangja          0
Doti             0        Nawalparasi      6        Tanahun          2
Gorkha           0        Nuwakot          1        Taplejung        0
Gulmi            2        Okhaldhunga      0        Tehrathum        0
Humla            0        Palpa            1        Udayapur         1

Table 8.8: Number Of Schools with Health Marks below 32
8.3.5. Cluster Analysis on Pass Percentage
We tried to find the regions of the nation where the pass percentage is very low and very high. We performed the analysis on pass percentage data for 3 consecutive years for the 6861 schools that appeared in the SLC. Our analysis showed the following results.

Cluster   Pass 2068   Pass 2067   Pass 2066   Size   Withinss
1         26.30836    28.48226    28.09875    1601   1261942
2         21.57652    75.77876    32.52365    1150   901671
3         48.98323    69.59363    70.50342    1603   1749166
4         92.6857     95.0554     94.79935    2507   641995.1

Table 8.9: Center values for Pass Percentage clusters
From the cluster analysis we found 2 significant clusters: cluster 1 has a very low pass percentage, below 30% in all 3 consecutive years, and cluster 4 has a pass rate above 90% in all 3 consecutive years.

Figure 8.18: 3-D Scatter Plot of Pass Percentage in 3 consecutive year
Above is the 3-D scatter plot of the pass percentage. The pass percentage varies widely, ranging from 100% down to nearly 0.

Table 8.10: Plot of Pass Percentage Cluster
Clusters 1 and 4 had distinctive properties, so they were selected for further analysis. In cluster 1, which had centers below 30, we found that government schools made up the majority:

            Government School   Private School
Cluster 1   1566                42
Cluster 4   549                 1959

Table 8.11: Comparison of Cluster1 Cluster 4

[Pie charts: Government Schools in Cluster 1; Government School in Cluster 4]

Figure 8.19: Comparison of Cluster 1 and Cluster 4
The pie charts above show the share of government schools in cluster 1 and cluster 4. Cluster 1 contains the schools that either passed with a low percentage or failed, and government schools make up the major part of it. Cluster 4 contains the schools that passed with a high percentage, and private schools make up the major portion of that chart. This suggests that the performance of government schools is very poor despite the billions invested in them.

8.4. Correlation Analysis Output
Correlation analysis was conducted separately for district-level parameters and school-level parameters. The district-level parameters are:
1. Gross enrollment rate (GER)
2. Gender parity index (GPI)
3. Repetition rate (RR)
4. Dropout rate (DR)
5. Percentage of enrolled student passed (PEP)
6. Average student school ratio (ASSR)
7. Average pupil teacher ratio (APTR)
8. Percentage of unqualified teacher (PUT)
Similarly, the school-level parameters are:
1. Total male teacher (TMT)
2. Total female teacher (TFT)
3. Total boys enrolled (TBE)
4. Total girls enrolled (TGE)
5. Total toilet number (TTN)
6. Total classroom number (TCN)
The correlation matrix for each level is as follows:

        TMT      TFT      TBE      TGE      TTN      TCN      TDJ
TMT     1        0.679    0.597    0.617    0.0028   0.0051   0.247
TFT     0.679    1        0.475    0.481    0.0009   0.0046   0.223
TBE     0.597    0.475    1        0.934    -0.0237  -0.0234  0.597
TGE     0.617    0.482    0.934    1        -0.0241  -0.023   0.605
TTN     0.0028   0.00099  -0.024   -0.024   1        0.671    -0.018
TCN     0.0051   0.0046   -0.023   -0.023   0.671    1        -0.018
TDJ     0.247    0.223    0.597    0.605    -0.018   -0.018   1

Table 8.12: School level Correlation Matrix

Major findings: 

Girls' enrollment in schools is moderately correlated with the total number of female teachers. This suggests that girls' enrollment may increase as the number of female teachers increases.

A strong correlation (0.617) was found between girls' enrollment and the number of male teachers. The school with the highest number of male teachers is ranked highest for girls' enrollment.

A weak correlation (0.0028 and 0.00099) was found between the total number of toilets in a school and the total number of teachers. This indicates that teacher numbers may be affected by factors other than school facilities.

The weak relation between girls' enrollment and school infrastructure and toilet facilities suggests that girls' enrollment is not affected by the physical facilities within the school.

        GER      GPI      RR       DR       PEP      ASSR     APTR     PUT
GER     1        -0.054   0.050    -0.187   0.067    -0.243   -0.156   -0.139
GPI     -0.054   1        0.149    -0.463   0.199    -0.094   -0.315   0.045
RR      0.050    0.149    1        0.243    -0.733   -0.337   -0.253   -0.292
DR      -0.187   -0.463   0.243    1        -0.812   -0.258   0.098    -0.278
PEP     0.067    0.199    -0.733   -0.812   1        0.393    0.114    0.365
ASSR    -0.243   -0.095   -0.337   -0.259   0.393    1        0.803    0.241
APTR    -0.157   -0.315   -0.253   0.098    0.114    0.803    1        0.047
PUT     -0.139   0.045    -0.292   -0.278   0.365    0.241    0.047    1

Table 8.13: District-level Correlation Matrix

Major findings:

A negative correlation (-0.253) was found between average student-teacher ratio and dropout rate, meaning that the dropout rate tends to decrease as the student-teacher ratio increases. Since a lower dropout rate is considered better, districts with a higher student-teacher ratio rank better on this measure.




A weak correlation (0.199) was found between gender parity index and percentage of enrolled students passed. Although a higher gender parity index is considered better, the pass rate appears to be driven mainly by boys' participation.



A weak correlation (0.067) was found between gross enrollment rate and percentage of enrolled students passed. Although a higher gross enrollment rate is considered better for a given area, results may suffer from the larger number of students. There may be a feedback loop here, since an increase in the number of students also increases the number of teachers required.



A weak correlation (0.114) was found between average pupil-teacher ratio and percentage of enrolled students passed. This suggests that school-level and district-level results do not depend solely on the number of teachers.



A weak correlation (0.243) was found between repetition rate and dropout rate, which suggests that students who repeat the same class more than once are more likely to leave school.



A weak correlation (0.149) was found between gender parity index and repetition rate. This indicates that a higher proportion of girls is weakly associated with a higher number of repeaters in the class.



The negative correlation (-0.243) between average student-school ratio and gross enrollment rate suggests that parents prefer to enroll their children in schools with a smaller student population.



The negative correlation (-0.253) between average student-teacher ratio and repetition rate indicates that the repetition rate tends to be lower where the number of students per teacher is higher.
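The correlation matrices in this section consist of pairwise Pearson correlation coefficients. The following is a minimal sketch of how such a matrix can be computed in plain Python. The function names and the toy values are illustrative assumptions and are not the project's actual code or data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy values (NOT the report's data): three indicators, five districts.
toy = {
    "RR": [10.0, 12.0, 8.0, 15.0, 9.0],
    "DR": [5.0, 6.5, 4.0, 7.0, 5.5],
    "PEP": [80.0, 75.0, 88.0, 70.0, 82.0],
}
matrix = correlation_matrix(toy)
```

The result is symmetric with ones on the diagonal, which is the structure seen in Tables 8.12 and 8.13. With NumPy, `numpy.corrcoef` gives the same coefficients.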


8.5. Regression Analysis

8.5.1. Gross Enrollment Model

The gross enrollment rate model is described by:

Gross-enrollment-rate = b0 + b1 * total-teacher-number + b2 * total-classroom-number + b3 * total-toilet-number

The fitted regression equation is:

Gross enrollment rate = 100.9 + 0.02110 * total teacher number + 0.0083 * total classroom number - 0.0337 * total toilet number

Figure 8.14: Multiple Regression for Student Enrollment, Teachers and Classroom


Summary of Regression Results

Dependent Variable: gross-enrollment-rate
Model: OLS
Method: Least Squares

                  Coefficient   Std. error   t-statistic   Probability
Constant          100.9         3.771        26.7527       0.0000
total-teacher     0.02110       0.02489      0.8479        0.3994
total-classroom   0.008322      0.003911     2.1280        0.0368
total-toilet      -0.03370      0.01440      -2.3409       0.0220

Table 8.15: Summary of Regression Result for GER

Model statistics

R-squared:            0.08048
Adjusted R-squared:   0.04163
F-statistic:          2.071
Prob (F-statistic):   0.1117

Table 8.16: Model Statistics for GER

Here, we have b0 = 100.9, b1 = 0.02110, b2 = 0.0083, b3 = -0.0337.

That is, the estimated gross enrollment rate has a negative relationship with the total number of toilets, whereas the coefficients for total teachers and total classrooms are both positive, indicating a positive relationship. The negative value b3 = -0.0337 can be interpreted as “the estimated decrease in gross enrollment rate for a unit increase in total toilet number is 0.0337 points when total teachers and total classrooms are held constant”. The errors in prediction are measured by the residual y - ŷ. In multiple regression, the residual is represented by the vertical distance between the data point and the regression plane or hyperplane.
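The interpretation of the coefficients and of the residual can be checked numerically. The sketch below plugs the fitted coefficients from the regression equation of Section 8.5.1 into a small prediction function. The input values for the example district are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the fitted regression equation in Section 8.5.1.
B0, B1, B2, B3 = 100.9, 0.02110, 0.0083, -0.0337


def predict_ger(total_teachers, total_classrooms, total_toilets):
    """Estimated gross enrollment rate from the fitted plane."""
    return B0 + B1 * total_teachers + B2 * total_classrooms + B3 * total_toilets


def residual(observed_ger, total_teachers, total_classrooms, total_toilets):
    """Prediction error: observed value minus estimated value."""
    return observed_ger - predict_ger(total_teachers, total_classrooms, total_toilets)


# Hypothetical district (illustrative inputs, not taken from the data set).
base = predict_ger(1000, 2000, 500)
one_more_toilet = predict_ger(1000, 2000, 501)

# Holding teachers and classrooms constant, one extra toilet shifts the
# estimate by exactly b3.
effect = one_more_toilet - base
error = residual(125.0, 1000, 2000, 500)
```

Here `effect` equals b3, which is the "unit increase with the other predictors held constant" reading given above, and `error` is the vertical distance between an assumed observation of 125.0 and the regression plane.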


From the table, the value of R^2 is 8.048%, which means that 8.048% of the variability in gross enrollment rate is accounted for by the linear relationship (the plane) with total teachers, total classrooms and total toilet number. The typical error in estimation is given by the standard error of the estimate, s.

8.5.2. Girl's Enrollment Model

The regression equation is:

Girl’s enrollment = -1.268 + 8.113 * total_male_teacher + 3.873 * total_female_teacher - 0.9608 * total_toilet_number + 0.1883 * total_student_classroom_ratio

where x1 = total_male_teacher, x2 = total_female_teacher, x3 = total_toilet_number, x4 = total_student_classroom_ratio, and y = Girl’s enrollment.

Figure 8.20: Multiple Regression of Teachers and Girls Enrollment


Summary of Regression Results

Dependent Variable: y
Model: OLS
Method: Least Squares

         Coefficient   Std. error   t-statistic   Prob.
const    -1.268        1.130        -1.1215       0.2621
x1       8.113         0.09168      88.4935       0.0000
x2       3.873         0.2035       19.0368       0.0000
x3       -0.9608       0.1706       -5.6322       0.0000
x4       0.1883        0.01402      13.4328       0.0000

Table 8.17: Summary of Results for Girls Enrollment
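Each t-statistic in an OLS summary is the coefficient divided by its standard error. The sketch below recomputes the t-statistics from the coefficients and standard errors of Table 8.17. It assumes that the x3 coefficient is -0.9608, as written in the regression equation and as implied by the negative t-statistic. The dictionary layout is illustrative only.

```python
# (coefficient, standard error) pairs from Table 8.17.
# Assumption: x3 is taken as -0.9608, matching the regression equation.
rows = {
    "const": (-1.268, 1.130),
    "x1": (8.113, 0.09168),
    "x2": (3.873, 0.2035),
    "x3": (-0.9608, 0.1706),
    "x4": (0.1883, 0.01402),
}


def t_statistic(coefficient, std_error):
    """t-statistic for the null hypothesis that the coefficient is zero."""
    return coefficient / std_error


t_values = {name: t_statistic(c, se) for name, (c, se) in rows.items()}
```

The recomputed values agree with the tabulated t-statistics up to the rounding of the printed coefficients and standard errors.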

Model statistics

R-squared:            0.3902
Adjusted R-squared:   0.3901
F-statistic:          4914
Prob (F-statistic):   0.000

Table 8.18: Model Statistics of Girls Enrollment

The value of R^2 is 39.02%. This shows that the data points do not lie precisely on the estimated regression surface: the model explains about 39% of the variability in girls' enrollment.

8.5.3. Dropout Rate Model

The regression equation is:

Dropout rate = 53.65 - 0.4644 * percentage_enrolled_student_passed - 5.455 * gender_parity_index - 0.084 * percentage_of_unqualified_teacher

where x1 = percentage_enrolled_student_passed, x2 = gender_parity_index, x3 = percentage_of_unqualified_teacher, and y = dropout rate.

Figure 8.21: Multiple Regression of Dropout, Student Passed and Gender Index

Summary of Regression Results

Dependent Variable: y
Model: OLS
Method: Least Squares

         Coefficient   Std. error   t-statistic   Prob.
const    53.65         2.873        18.6713       0.0000
x1       -0.4644       0.03404      -13.6428      0.0000
x2       -5.455        1.173        -4.6521       0.0000
x3       -0.08376      0.03100      -2.7016       0.0086

Table 8.19: Summary of Regression Result for Dropout


Model statistics

R-squared:            0.8073
Adjusted R-squared:   0.7992
F-statistic:          99.15
Prob (F-statistic):   2.513e-25

Table 8.20: Model Statistics for Dropout Rate

The value of R^2 is 80.73%. This means that the model explains about 81% of the variability in dropout rate, so the data points lie close to the estimated regression surface.
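The adjusted R-squared and the F-statistic follow from R-squared, the number of observations n and the number of predictors p by standard formulas. The sketch below applies them to the district-level models. The value n = 75 is an assumption: it is not stated in the tables, but it corresponds to the 75 districts of Nepal at the time and reproduces the reported statistics.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2, n, p):
    """Overall F-statistic of the regression from R-squared."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


N_DISTRICTS = 75  # assumption: one observation per district
N_PREDICTORS = 3  # both district-level models use three predictors

# Gross enrollment rate model (R-squared from Table 8.16).
ger_adj = adjusted_r2(0.08048, N_DISTRICTS, N_PREDICTORS)
ger_f = f_statistic(0.08048, N_DISTRICTS, N_PREDICTORS)

# Dropout rate model (R-squared from Table 8.20).
dropout_adj = adjusted_r2(0.8073, N_DISTRICTS, N_PREDICTORS)
dropout_f = f_statistic(0.8073, N_DISTRICTS, N_PREDICTORS)
```

Under this assumption the computed values match the adjusted R-squared and F-statistic entries of Tables 8.16 and 8.20 to the printed precision.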


8.5.4. Decision Tree for Schools Classification

Figure 8.22: Decision Tree for Schools Classification


8.6. Visualization in Map

Our system visualizes various educational indicators on a map for the years 2007 to 2011. This lets developers, researchers and analysts perform geographical analysis of the state of education in Nepal.

8.6.1. Pupil Teacher Ratio in 2008

Figure 8.23: Pupil Teacher Ratio of 2008 in Map


8.6.2. Pupil Teacher Ratio in 2011

Figure 8.24: Pupil Teacher Ratio of 2011 in Map

8.6.3. Educational Development Index in 2011

Figure 8.25: Educational Development Index of 2011 in Map


9. TOOLS, PLATFORMS AND TECHNOLOGIES USED

9.1. Language

Python is used as the server-side scripting language. For client-side scripting, JavaScript and jQuery are used.

9.2. Framework

The Django framework for Python is used to build our web application.

9.3. Project Management Tools

Bitbucket is used as the online repository.

9.4. Database

MySQL is used for data storage.

9.5. Data Extraction Tools

The Python package xlrd is used to load Excel sheets, and the necessary logic is implemented to store the data.
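As an illustration of this extraction step, the following minimal sketch reads one worksheet with xlrd and turns its rows into records keyed by the header row. The function names, the assumption that the first row holds column headers, and the sample rows are all hypothetical. The project's actual loading logic is not reproduced here.

```python
def rows_to_records(rows):
    """Turn a header row plus data rows into a list of dictionaries."""
    header = [str(cell).strip() for cell in rows[0]]
    return [dict(zip(header, row)) for row in rows[1:]]


def load_sheet_rows(path, sheet_index=0):
    """Read every row of one worksheet using xlrd (third-party package)."""
    import xlrd  # imported here so rows_to_records works without xlrd installed

    book = xlrd.open_workbook(path)
    sheet = book.sheet_by_index(sheet_index)
    return [sheet.row_values(i) for i in range(sheet.nrows)]


# Hypothetical rows standing in for the contents of a worksheet.
sample_rows = [
    ["district", "enrollment"],
    ["Lalitpur", 1200.0],
    ["Kaski", 950.0],
]
records = rows_to_records(sample_rows)
```

In practice `load_sheet_rows("some_file.xls")` would supply the rows, and each resulting record could then be written to the database.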

9.6. Visualization Tools

9.6.1. Visualization in Map

OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Two major driving forces behind the establishment and growth of OSM have been restrictions on the use or availability of map information across much of the world and the advent of inexpensive portable satellite navigation devices.


Leaflet.js

Leaflet is a modern open-source JavaScript library for mobile-friendly interactive maps. It is developed by Vladimir Agafonkin with a team of dedicated contributors. Weighing just about 28 KB of JS code, it has all the features most developers ever need for online maps.

D3.js (or just D3, for Data-Driven Documents)

D3 is a JavaScript library for displaying digital data in dynamic graphical forms. It is a tool for data visualization in W3C-compliant computing, making use of the widely implemented SVG, JavaScript, and CSS standards. It is the successor to the earlier Protovis framework.

9.6.2. UI Design using Twitter Bootstrap

Twitter Bootstrap is a front-end toolkit from Twitter for user interface design. It provides ready-made CSS and JavaScript components for layout, forms and navigation. We used Twitter Bootstrap to design the UI.

9.7. Technology used for making API

9.7.1. JSON (JavaScript Object Notation)

JSON is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

JSON is built on two structures:

A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.

An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
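The two JSON structures map directly onto Python's dictionary and list types. The sketch below encodes and decodes one record with the standard `json` module. The field names and values are hypothetical and do not describe the project's actual API responses.

```python
import json

# Hypothetical record: a collection of name/value pairs (a dict) that
# contains an ordered list of values (a list of yearly readings).
record = {
    "district": "Lalitpur",
    "indicator": "pupil_teacher_ratio",
    "values": [
        {"year": 2008, "value": 30.5},
        {"year": 2011, "value": 28.0},
    ],
}

# Serialize to JSON text, as an API endpoint would send it.
text = json.dumps(record, sort_keys=True)

# Parse the text back into Python objects, as a client would.
decoded = json.loads(text)
```

The decoded object compares equal to the original record, and the order of the entries in the list is preserved.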


9.7.2. Comma-separated values (CSV)

A CSV file stores tabular data (numbers and text) in plain-text form. Plain text means that the file is a sequence of characters, with no data that has to be interpreted as binary numbers instead. A CSV file consists of any number of records, separated by line breaks of some kind. Each record consists of fields, separated by some other character or string, most commonly a literal comma or tab. Usually, all records have an identical sequence of fields.
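The record and field structure described above can be seen with Python's standard `csv` module. In this minimal sketch the column names and values are hypothetical, and an in-memory string stands in for a file.

```python
import csv
import io

# Hypothetical CSV text: one header record followed by two data records.
csv_text = (
    "district,year,pupil_teacher_ratio\n"
    "Lalitpur,2008,30.5\n"
    "Kaski,2008,28.0\n"
)

# Read each record into a dict keyed by the header fields.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Write the same records back out with the same field order.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
round_trip = buffer.getvalue()
```

Every field is read as text, so numeric columns such as the year need an explicit conversion (for example `int(row["year"])`) before analysis.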

9.8. Tools for Data Analysis

9.8.1. R Statistical Programming Language

R is a popular programming language for statistical computing. It is free and is widely used by statisticians for data analysis and computation. It is popular because of its processing power and its rich, powerful data structures.

9.8.2. Rpy2

rpy2 is a popular interface that makes the features of the R statistical programming language available from Python. Documentation and papers on it are widely available on the Internet.

9.8.3. Numpy

NumPy is a scientific computing package for Python. It provides a powerful N-dimensional array object along with sophisticated tools and functions.

9.8.4. Scipy

SciPy is an open-source library of algorithms and mathematical tools for the Python programming language. It offers many features and algorithms for mathematical processing.


10. CONCLUSION

Data mining is the field of discovering patterns in data. Because of the potential gains in educational performance, educational data mining has attracted a great deal of interest. As a result, a number of organizations are involved in the collection and analysis of education data in Nepal. The project has been very helpful in polishing our skills in the field of regression analysis. The objective of the project is to categorize the districts of Nepal according to the Education Development Index (EDI), comparing their performance and predicting corrections based on past results. The system has been optimised using a large data set acquired from the Department of Education (DoE), Nepal. Data fetching and storage have been automated for further use.


11. FUTURE ENHANCEMENTS

This application currently provides visualizations and statistical figures for districts. It can be extended into a complete application that gives information at the VDC level and for every school. It is a basic right of all citizens of this country to get the necessary information about the schools in their community and to find out how they perform. To make the data public, this application therefore aims to become an open data portal. All the information that we have processed so far, and any that we obtain through future collaboration, shall be available online via http://hamroschool.org

Mining educational data to find the loopholes in the current educational system is necessary. The Educational Development Index that we calculated currently uses 11 indicators to compute a weighted parameter representing the development index of a district. It can be extended into a more precise index by including parameters such as drinking water facility, toilet facility, textbook availability and the number of schools with too few teachers. Mining can also be performed in greater depth to study the limitations of the current educational system.

Hence there are many tasks and future enhancements that can be carried out. We hope for co-operation and collaboration from the Department of Education to continue our analysis and to extend the research and this project.


