Big Data Analytics using R with Hadoop


September 2015 | White Paper | BFS
Ameenudeen M
[email protected]

Confidentiality Statement

Confidentiality and Non-Disclosure Notice
The information contained in this document is confidential and proprietary to TATA Consultancy Services. This information may not be disclosed, duplicated, or used for any other purposes. The information contained in this document may not be released in whole or in part outside TCS for any purpose without the express written permission of TATA Consultancy Services.

Tata Code of Conduct
We, in our dealings, are self-regulated by a Code of Conduct as enshrined in the Tata Code of Conduct. We request your support in helping us adhere to the Code in letter and spirit. We request that any violation or potential violation of the Code by any person be promptly brought to the notice of the Local Ethics Counsellor or the Principal Ethics Counsellor or the CEO of TCS. All communication received in this regard will be treated and kept as confidential.

Table of Contents

Abstract
About the Author
1. Introduction of R
2. Understanding Features of R Language
3. Integrating R with Hadoop
   3.1 RHIPE
   3.2 RHadoop
   3.3 Hadoop Streaming with R
4. Understanding Data Analytics Life Cycle
5. Conclusion
6. Acknowledgements
7. References

Abstract

The amount of data that enterprises acquire every day is increasing exponentially, and the size of enterprise databases has been growing at a similar rate. It is now possible to store these vast amounts of information on low-cost platforms such as Hadoop. The challenge these organizations now face is how to process and analyse these large volumes of data to glean key insights from them. This is where R comes into the picture: R is a tool that makes it easy to run advanced statistical models on data, translate the derived models into colorful graphs and visualizations, and perform many other functions related to data science. This white paper describes briefly how R, combined with Hadoop, can help extract insights from data.

About the Author

Ameenudeen M has more than 23 years of industrial experience in Solution Architecture, Program and Delivery Management across the Banking, Insurance, Hi-Tech, Telecom, Gaming, and Engineering domains.
1. Introduction of R

R is an open source software package for performing statistical analysis on data. It is a programming language used by data scientists, statisticians, and others who need to make statistical analyses of data and glean key insights from it using mechanisms such as regression, classification, clustering, and text analysis. R is registered under the GNU General Public License. It provides a wide variety of statistical techniques (linear and nonlinear modelling, classic statistical tests, time-series analysis, classification, and clustering) and graphical techniques, and is highly extensible.

R has various built-in as well as extended functions for statistical, machine learning, and visualization tasks, such as:
• Data extraction
• Data cleaning
• Data loading
• Data transformation
• Statistical analysis
• Predictive modelling
• Data visualization

It is one of the most popular open source statistical analysis packages available on the market today. With its growing list of packages, R can now connect with other data stores, such as MySQL, SQLite, and MongoDB, and with Hadoop for data storage activities.

2. Understanding Features of R Language

There are over 3,000 R packages, and the list grows day by day. A comprehensive list of these packages can be found at the Comprehensive R Archive Network (CRAN), http://cran.r-project.org/. R packages are self-contained units of R functionality that can be invoked as functions. R enables a wide range of operations: data loading, cleaning, exploration, analysis, and visualization. (Figure adapted from http://www.r-project.org.)

Universal data processing operations are as follows:
• Data cleaning: cleaning massive datasets.
• Data exploration: exploring all the possible values of a dataset.
• Data analysis: performing descriptive and predictive analytics and data visualization on data, including statistical operations (such as mean, min, max, probability, and distribution) and machine learning operations (such as linear regression, logistic regression, classification, and clustering).

If we think about a combined RHadoop system, R will take care of the data analysis operations with its preliminary functions, such as data loading, exploration, analysis, and visualization, while Hadoop will take care of parallel data storage as well as computation power over the distributed data.
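The preliminary operations listed above can be sketched in a short base-R session. The sample values below are invented purely for illustration:

```r
# Sample data: invented monthly sales figures (NA marks a missing record)
sales <- c(120, 135, NA, 150, 142, 160)

# Data cleaning: drop the missing value
clean_sales <- sales[!is.na(sales)]

# Statistical analysis: basic descriptive statistics
mean(clean_sales)
min(clean_sales)
max(clean_sales)

# Predictive modelling: a simple linear regression against time
t <- seq_along(clean_sales)
model <- lm(clean_sales ~ t)
summary(model)

# Data visualization: plot the series with the fitted trend line
plot(t, clean_sales, type = "b", main = "Monthly sales")
abline(model, col = "red")
```

Every function here is part of base R, so the snippet runs in a plain R console without any extra packages.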
3. Integrating R with Hadoop

The integration of these two data-driven tools and technologies can build a powerful, scalable system that has the features of both R and Hadoop. There are three ways to link R and Hadoop:
• RHIPE
• RHadoop
• Hadoop streaming
Each of them offers different Hadoop features.

3.1 RHIPE

RHIPE stands for R and Hadoop Integrated Programming Environment. The RHIPE package uses the Divide and Recombine technique to perform data analytics over Big Data. Since RHIPE is a Java package, it acts as a Java bridge between R and Hadoop.

A number of Hadoop components are involved in data analytics operations with R and Hadoop. The components of RHIPE are as follows:
• RClient: an R application that calls the JobTracker to execute a job, indicating the MapReduce job resources such as the Mapper, the Reducer, the input format, output format, input file, output file, and the other parameters needed to handle the MapReduce job.
• JobTracker: the master node of the Hadoop MapReduce operations, which initializes and monitors MapReduce jobs over the Hadoop cluster.
• TaskTracker: a slave node in the Hadoop cluster. It executes MapReduce jobs as ordered by the JobTracker, retrieves the input data chunks, and runs R-specific Mapper and Reducer code over them; finally, the output is written to the HDFS directory.
• HDFS: a file system distributed over Hadoop clusters with several data nodes, providing data services for various data operations.
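A minimal RHIPE job follows the pattern sketched below. This is an illustrative sketch, not production code: it assumes a configured Hadoop cluster and an installed Rhipe package, the HDFS paths are placeholders, and the function names follow the RHIPE API (rhinit, rhwrite, rhwatch, rhread), whose exact arguments may vary between versions.

```r
library(Rhipe)
rhinit()  # initialize the R-to-Hadoop bridge

# Write a small dataset of key-value pairs to HDFS (placeholder path)
rhwrite(lapply(1:100, function(i) list(i, rnorm(10))), "/tmp/rhipe/input")

# Map expression: compute the mean of each vector of values.
# In RHIPE, map.keys and map.values are provided inside the map expression,
# and rhcollect() emits a key-value pair.
job <- rhwatch(
  map = expression({
    for (i in seq_along(map.values)) {
      rhcollect(map.keys[[i]], mean(map.values[[i]]))
    }
  }),
  input  = "/tmp/rhipe/input",
  output = "/tmp/rhipe/output"
)

# Read the results back from HDFS into the R session
results <- rhread("/tmp/rhipe/output")
```

The Divide and Recombine idea is visible here: the dataset is divided into key-value chunks on HDFS, each chunk is processed independently by the map expression, and rhread recombines the per-chunk results in R.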
3.2 RHadoop

RHadoop is an open source software framework for performing data analytics with the Hadoop platform via R functions. It is a collection of three main R packages, each providing large-scale data operations within an R environment:
• rhdfs is an R interface providing HDFS usability from the R console. As Hadoop MapReduce programs write their output to HDFS, it is very easy to access it by calling the rhdfs methods. The rhdfs package calls the HDFS API in the backend to operate on data sources stored on HDFS, so the R programmer can easily perform read and write operations on distributed data files.
• rmr is an R interface providing the Hadoop MapReduce facility inside the R environment. The R programmer needs only to divide the application logic into the map and reduce phases and submit it with the rmr methods. After that, rmr calls the Hadoop streaming MapReduce API with several job parameters, such as the input directory, output directory, mapper, reducer, and so on, to perform the R MapReduce job over the Hadoop cluster.
• rhbase is an R interface for operating on the Hadoop HBase data source stored on the distributed network, via a Thrift server. The rhbase package provides several methods for initialization, read/write, and table manipulation operations.

Hadoop MapReduce and HDFS are used within the R console with the help of the rhdfs and rmr packages: basically, rhdfs provides HDFS data operations while rmr provides MapReduce execution operations, and these two packages are enough to run Hadoop MapReduce from R. RHadoop also includes another package, called quickcheck, which is designed for debugging MapReduce jobs developed with the rmr package. (Figure: the architecture of RHadoop.)
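Using rhdfs and rmr together, a trivial MapReduce job can be expressed entirely from the R console. The sketch below follows the widely published rmr2 introductory example; it assumes a working RHadoop installation with the environment configured to find the Hadoop binaries:

```r
library(rhdfs)
library(rmr2)
hdfs.init()  # connect the R session to HDFS

# Put a small integer dataset onto HDFS as key-value pairs
small.ints <- to.dfs(1:1000)

# Map phase only: for each value v, emit the pair (v, v^2).
# rmr submits this as a Hadoop streaming job behind the scenes.
out <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)
)

# Pull the results back from HDFS into the R session
squares <- from.dfs(out)
```

The division of labour described above is visible in the code: rhdfs (hdfs.init) handles the storage side, while rmr (to.dfs, mapreduce, from.dfs, keyval) handles MapReduce execution.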
3.3 Hadoop Streaming with R

Hadoop streaming is a Hadoop utility for running MapReduce jobs with executable scripts, such as a Mapper and a Reducer, written in languages other than Java. To run such an application, the text input file is printed on the stream (stdin), which is provided as input to the Mapper; the output (stdout) of the Mapper is provided as input to the Reducer; finally, the Reducer writes the output to the HDFS directory. This is similar to the pipe operation in Linux. (Figure: the components of the Hadoop streaming MapReduce job.)

The main advantage of the Hadoop streaming utility is that it allows Java as well as non-Java MapReduce jobs to be executed over Hadoop clusters. Hadoop streaming supports the Perl, Python, PHP, R, and C++ programming languages, among others. The developer just needs to translate the application logic into the Mapper and Reducer sections with key and value output elements. Hadoop streaming also takes care of tracking the progress of running MapReduce jobs.
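Concretely, a streaming job with R scripts can first be dry-run locally with Linux pipes, exactly as the stdin/stdout description above suggests, and then submitted to the cluster. The script names, input paths, and the streaming JAR path below are placeholders for illustration:

```shell
# Local dry run: emulate the streaming pipeline with pipes
# (sort stands in for Hadoop's shuffle-and-sort between map and reduce)
cat input.txt | Rscript mapper.R | sort | Rscript reducer.R

# Submit the same scripts as a Hadoop streaming job
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input   /user/data/input \
  -output  /user/data/output \
  -mapper  "Rscript mapper.R" \
  -reducer "Rscript reducer.R" \
  -file mapper.R \
  -file reducer.R
```

The -file options ship the R scripts to every node in the cluster, so each TaskTracker can execute them against its local chunk of the input.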
4. Understanding Data Analytics Life Cycle

The data analytics processes of a project life cycle should be followed in sequence to effectively achieve the goal using the input datasets. The data analytics process may include identifying the problem, designing the data requirement, preprocessing data, performing analytics over data, and visualizing data. (The stages were shown as a diagram in the original document.)

1. Identifying the problem: Today, business analytics trends change by performing data analytics over web datasets for growing business. Since data sizes are increasing gradually day by day, analytical applications need to be scalable for collecting insights from these datasets. With the help of web analytics, we can solve many business analytics problems. Let's assume that we have a large e-commerce website and we want to know how to increase the business. We can identify the important pages of our website by categorizing them as per popularity into high, medium, and low. Based on these popular pages, their types, their traffic sources, and their content, we will be able to decide a roadmap to improve business by improving web traffic as well as content.

2. Designing data requirement: To perform data analytics for a specific problem, we need datasets from related domains. Based on the domain and problem specification, the data source can be decided, and based on the problem definition, the data attributes of these datasets can be defined. For example, if we are going to perform social media analytics (the problem specification), we use Facebook or Twitter as the data source; for identifying user characteristics, we need user profile information, likes, and posts as data attributes.

3. Preprocessing data: In data analytics, we do not use the same data sources, data attributes, data tools, and algorithms all the time, as they do not all consume data in the same format. This leads to data operations such as data cleansing, data aggregation, data augmentation, data sorting, and data formatting, to provide data in a supported format to all the data tools and algorithms that will be used in the analytics. In simple terms, preprocessing translates data into a fixed data format before providing it to algorithms or tools. The data analytics process is then initiated with this formatted data as the input. In the case of Big Data, the datasets need to be formatted and uploaded to the Hadoop Distributed File System (HDFS) for further use by the Mappers and Reducers running on the various nodes of the Hadoop cluster.
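For example, a preprocessing step in R might clean and reformat a raw dataset before it is uploaded to HDFS. The file names and column names below are invented purely for illustration:

```r
# Read a raw dataset (hypothetical file and columns)
raw <- read.csv("weblog_raw.csv", stringsAsFactors = FALSE)

# Data cleansing: drop incomplete records
clean <- na.omit(raw)

# Data formatting: normalize column names and parse timestamps
names(clean) <- tolower(names(clean))
clean$timestamp <- as.POSIXct(clean$timestamp)

# Data aggregation and sorting: total visits per page, most popular first
visits <- aggregate(visits ~ page, data = clean, FUN = sum)
visits <- visits[order(-visits$visits), ]

# Write out a fixed-format file, ready for upload to HDFS,
# e.g. with: hadoop fs -put weblog_clean.csv /user/data/
write.csv(visits, "weblog_clean.csv", row.names = FALSE)
```

The output of this step is the kind of uniformly formatted input that the MapReduce-based analytics stage described next can consume.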
4. Performing analytics over data: After the data is available in the format required by the data analytics algorithms, the analytics operations are performed. These operations discover meaningful information from the data, using data mining concepts, so that better business decisions can be taken. The analytics may be either descriptive or predictive, for business intelligence, and can be performed with various machine learning as well as custom algorithmic concepts, such as regression, classification, clustering, and model-based recommendation. For Big Data, the same algorithms can be translated into MapReduce algorithms and run on Hadoop clusters by translating their data analytics logic into MapReduce jobs. The resulting models need to be further evaluated and improved through the various evaluation stages of machine learning; improved or optimized algorithms can provide better insights.

5. Visualizing data: Data visualization is used for displaying the output of the data analytics, and is an interactive way to represent the data insights. This can be done with various data visualization software packages as well as R packages, which include, for example:
(a) ggplot2, for plots with facet scales
(b) rCharts, for dashboard charts

5. Conclusion

The important thing we achieve using Hadoop is that we bring the computation to the data, rather than moving the data to the computation platform. Big Data analysis never involves simply throwing a lot of data into a computer and waiting for the output to pop out. The core R engine can process and work on only a limited amount of data, but R can be made scalable by using a platform such as Hadoop, which is very popular for Big Data processing. Using R with Hadoop provides an elastic data analytics platform that scales depending on the size of the dataset to be analyzed. Experienced programmers can write Map/Reduce modules in R and run them using Hadoop's parallel-processing MapReduce mechanism to identify patterns in the dataset.

6. Acknowledgements

I would like to thank all my friends and colleagues who have given their support, encouragement, and review comments to make this white paper complete.

7. References
1. Big Data Analytics with R and Hadoop, Vignesh Prajapati, Packt Publishing, 1st edition, 2013.
2. Machine Learning with R, Brett Lantz, Packt Publishing, 1st edition, October 2013.
3. Hadoop Beginner's Guide, Garry Turkington, Packt Publishing, 1st edition, 2013.
4. Hadoop For Dummies, Dirk deRoos, Paul C. Zikopoulos, Roman B. Melnyk, Bruce Brown, and Rafael Coss, John Wiley & Sons, Inc., 1st edition, 2014.
5. http://www.r-project.org/
6. https://hadoop.apache.org/
7. https://hortonworks.com/
8. http://www.blazegraph.com/whitepapers/bigdata_architecture_whitepaper.pdf
9. http://pingax.com/author/vignesh/

Thank You

Contact
For more information, contact gsl.cdsfiodg@tcs.com (Email Id of ISU).

About Tata Consultancy Services (TCS)
Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model™, recognized as the benchmark of excellence in software development. A part of the Tata Group, India's largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange and Bombay Stock Exchange in India. For more information, visit us at www.tcs.com.

IT Services | Business Solutions | Consulting

All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content / information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced, republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS. Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws, and could result in criminal or civil penalties.

Copyright © 2011 Tata Consultancy Services Limited