RESEARCH PROPOSAL

BHAVIK NITIN MER
17304936

Contents

1 Proposed Supervisor
2 Research Area
3 Research Background
4 Research Aims
5 Research Question
6 Proposed Methodology/Implementation Approach and Literature Research Strategy
7 Ethical Issues
8 Sources of literature
9 Your own expertise and how well you are positioned to carry out the work
10 Proposed Table of contents for your dissertation

1 Proposed Supervisor

Professor Myra O'Regan

2 Research Area

My research interest can be broadly described as exploring the ensemble method named XGBoost; the subject of this study is Data Analytics.

3 Research Background

eXtreme Gradient Boosting (XGBoost) is an improvement over Gradient Boosting. It is a machine learning method widely used by data scientists to achieve state-of-the-art results, and tree boosting gives excellent results on many standard classification benchmarks. XGBoost provides parallel tree boosting (also known as GBM) that solves many data science problems quickly and accurately. The most important factor behind the success of XGBoost is its scalability, which rests on several vital systems and algorithmic optimisations; these include a novel sparsity-aware tree-learning algorithm for handling missing data, and parallel and distributed computing that makes the learning process faster and enables quicker model exploration [2].

XGBoost belongs to a broader collection of tools under the umbrella of the Distributed Machine Learning Community, who are also the creators of the well-known mxnet deep learning library [1]. Tianqi Chen and Carlos Guestrin address directions such as out-of-core computation and cache-aware and sparsity-aware learning, which had not been explored in existing work on parallel tree boosting [2].

4 Research Aims

The aims of my research are sequentially defined as follows:

1. To understand the idea behind XGBoost - understanding the scalability claim made for XGBoost by Tianqi Chen, the advantages and shortcomings of XGBoost, and the mathematics behind this ensemble.
2. To implement XGBoost - implementing XGBoost with different frameworks such as R, Python and Spark, and on different datasets, such as imbalanced datasets and datasets with many missing values.
3. To explain results - since ensembles are mostly treated as black boxes, interpreting the results is all the more important; for XGBoost there is an R package named 'xgboostExplainer' which explains individual predictions.
4. To compare Gradient Boosting, XGBoost and other ensembles - the major focus will be comparing gradient boosting with XGBoost, understanding the shortcomings of gradient boosting and the benefits of using XGBoost, and finally carrying out a performance comparison among different popular ensembles.

5 Research Question

Considering the background and the current scenario, there are four concrete questions to answer:

Q1. What is the idea behind the development of XGBoost, and what is the mathematics underlying it?
Q2. How can we implement an XGBoost model on different datasets using different frameworks?
Q3. What results does XGBoost generate, and how can we explain why XGBoost made a particular decision?
Q4. How does Gradient Boosting compare and contrast with XGBoost, and how does XGBoost compare with other ensembles?

6 Proposed Methodology/Implementation Approach and Literature Research Strategy

There will be more than one dataset; for each dataset we will first procure the data from the listed sources and then analyse it to obtain insights.

1. Descriptive analysis - explaining the results (per-prediction explanation is illustrated in a sketch following this list).
2. Data cleansing - after the analysis, the next step is to clean the data; here we will understand how XGBoost deals with missing data (illustrated in a sketch following this list).
3. Implementing XGBoost - implementing XGBoost on the cleaned datasets using different frameworks.
4. Comparing results - comparing the results generated across different frameworks, or comparing against different ensembles.
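To make steps 2 to 4 concrete, the following is a minimal Python sketch, not the final pipeline of this dissertation. The dataset, the feature names f0 to f4 and the hyper-parameters are synthetic placeholders chosen only for illustration. The sketch shows that XGBoost accepts missing values (NaN) directly through its sparsity-aware split finding, while classical gradient boosting in scikit-learn needs an imputation step, and that the two models can then be compared on a held-out set.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Synthetic placeholder data: 1000 rows, 5 features, with ~10% of the
# entries blanked out afterwards to mimic a dataset with missing values.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(1000, 5)), columns=[f"f{i}" for i in range(5)])
y = (X["f0"] + X["f1"] > 0).astype(int)
X = X.mask(rng.random(X.shape) < 0.10)  # inject NaN at random positions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# XGBoost: sparsity-aware split finding routes NaN to a learned default
# direction, so the raw frame can be used without any imputation.
xgb_clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
xgb_clf.fit(X_train, y_train)
xgb_auc = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:, 1])

# Classical gradient boosting in scikit-learn does not accept NaN, so the
# sketch imputes with the training-set median before fitting.
median = X_train.median()
gb_clf = GradientBoostingClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
gb_clf.fit(X_train.fillna(median), y_train)
gb_auc = roc_auc_score(y_test, gb_clf.predict_proba(X_test.fillna(median))[:, 1])

print(f"XGBoost AUC:           {xgb_auc:.3f}")
print(f"Gradient boosting AUC: {gb_auc:.3f}")

The same pattern would be repeated on the real datasets and frameworks named above; only the missing-value behaviour and the evaluation setup carry over from this sketch.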
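For the explanation step (aim 3 and methodology step 1) the proposal names the R package 'xgboostExplainer'. Purely as an illustration of the same idea, the sketch below uses XGBoost's built-in per-prediction contribution output in Python; it continues from the previous sketch (xgb_clf and X_test are assumed to exist) and is a stand-in, not the xgboostExplainer package itself.

import xgboost as xgb

booster = xgb_clf.get_booster()
dtest = xgb.DMatrix(X_test)  # NaN entries are accepted here as well

# One row per test case: a log-odds contribution for every feature plus a
# final bias term; summing a row reproduces the raw model output for that case.
contribs = booster.predict(dtest, pred_contribs=True)

first_case = dict(zip(list(X_test.columns) + ["bias"], contribs[0]))
for name, value in sorted(first_case.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: {value:+.3f}")

This mirrors what xgboostExplainer produces in R: a per-feature breakdown of why the model scored an individual case the way it did.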
7 Ethical Issues

During this research the researcher must act ethically, and no malpractice may take place. All data collected from Trinity College Dublin will be kept confidential, used only for the purposes of this research, and not disclosed to any other party.

8 Sources of literature

The sources of literature can be broadly classified as:

1. XGBoost: A Scalable Tree Boosting System [2]
2. Improvements over gradient boosting

9 Your own expertise and how well you are positioned to carry out the work

This research is based on XGBoost using R and statistical analysis using Excel. I have strong analytical and programming skills which, I believe, will help me carry out the research successfully. I am also familiar with RStudio and Microsoft Excel as statistical analysis tools, and I have the required fundamental statistical knowledge. In short, my strong interest and my preparatory initiatives, combined with my technical knowledge, make me well suited to undertake this research.

10 Proposed Table of contents for your dissertation

1. ACKNOWLEDGMENTS
2. ABSTRACT
3. TABLE OF CONTENTS
4. TABLE OF FIGURES
5. INTRODUCTION
   (a) RESEARCH BACKGROUND
   (b) RESEARCH OBJECTIVE
   (c) RESEARCH QUESTION
   (d) STRUCTURE OF THESIS
6. STATE OF THE ART
7. METHODOLOGY
8. RESULTS
9. CONCLUSION
10. DISCUSSION
    (a) STRENGTHS, LIMITATIONS AND RELIABILITY OF THE RESEARCH RESULTS
    (b) SCOPE OF FUTURE WORK
11. BIBLIOGRAPHY
12. APPENDICES

References

[1] A gentle introduction to XGBoost for applied machine learning. https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/. Accessed: 2017-12-31.

[2] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM.