Talend Open Studio and R Integration




21-June-2015
Target Corp LIM Upgrade
TARGET Relationship
Akshayaa.V

Confidentiality and Non-Disclosure Notice

Confidentiality Statement
The information contained in this document is confidential and proprietary to TCS. This information may not be disclosed, duplicated or used for any other purposes. The information contained in this document may not be released in whole or in part outside TCS for any purpose without the express written permission of TATA Consultancy Services.

Code of Conduct

Tata Code of Conduct
We, in our dealings, are self-regulated by a Code of Conduct as enshrined in the Tata Code of Conduct. We request your support in helping us adhere to the Code in letter and spirit. We request that any violation or potential violation of the Code by any person be promptly brought to the notice of the Local Ethics Counsellor or the Principal Ethics Counsellor or the CEO of TCS. All communication received in this regard will be treated and kept as confidential.

Table of Content

Confidentiality and Non-Disclosure Notice
Code of Conduct
Table of Content
Overview
Objective
1. Why to integrate R and Talend?
2. R and Talend Interface
2.1. Talend component tExecuteRScript
2.2. How to Use the Component
3. Sample Scenario (Classification of Dataset)
3.1. Using tExecuteRScript Component
3.2. How to get back outcomes on the Talend

Overview

This document covers a method to integrate Talend Open Studio and the R language. It shows how to use R to build a simple predictive model with data coming from Talend, and how to get the results back to Talend.

Objective

1. To illustrate the purpose of integrating R and Talend
2. tExecuteRScript component overview
3. Example to explain R and Talend interfacing

1. Why to integrate R and Talend?

The R language is a very expressive and extensible data language, with a huge number of external packages for practically any kind of analysis one could imagine. R is an absolute standard for statisticians. R is basically a data language plus a command-line executor, but even simple data operations must be hand-coded. This is historically common for statistical software (just think of SAS), so it is not a flaw on its own. But in a real-life Business Intelligence life cycle you probably have a corporate standard, a service bus, a protocol for data transfer and so on, and one would perhaps prefer to spend time reasoning on the predictive model rather than writing code to get the data out of the database. A better interface with R is really advisable. This is particularly true in data exploitation scenarios, but also in rapid prototyping and, generally speaking, in the whole business world, which by their very nature involve massive data I/O.

There are plenty of scenarios in which one would benefit from a cross-over between Talend Open Studio and R. The first is perfect for even complex ETL tasks, manipulation, federation and governance, but it completely lacks any kind of serious statistical tool.

2. R and Talend Interface

This is possible using a custom optional component made by me for Talend. This is a low-level interface to an existing R installation and should be able to execute almost any arbitrary R code, but the I/O interface is quite crude: there is just a uni-directional data link, which goes from R to Talend and supports only a subset of R types because of the limitations imposed by JRI.

2.1. Talend component tExecuteRScript

This Talend Open Studio component provides a complete environment for executing code written for the popular statistical platform R and retrieving the results. It is built around the JRI interface of the rJava package. Although this package offers 100% compatibility in the execution of code, it is quite rudimentary in its I/O features and is limited to retrieving only String, Int and Double arrays. This limitation is imposed by the inner architecture of R and does not depend on the component itself.

Some features of the component:
i. Low level connection to an existing R installation
ii. Support for external .R file load and inline R code
iii. Two logging possibilities (Verbose/Silent)
iv. Results mapping to convert R symbols to standard 'row' Talend connection
v. Autocast of output, if possible
vi. Log redirection to tLogCatcher elements, if available
The component is written in true OOP using the robust Talend Bridge framework.

To establish the interface between R and Talend, this component needs to be installed.

2.2. How to Use the Component

To use it, just write in the box the code you want R to execute. Please note that this code must be correctly quoted and escaped. Alternatively, you can source the code from an external .R file. This is by far the fastest route, especially if you have a lot of code, since it does not need to be quoted or escaped. On the other hand, inline code can be manipulated (parametrized) at runtime directly on the Talend side, while external .R scripts cannot. So the final choice depends on your need. Irrespective of the choice, it is always possible to pass command-line parameters to R using the proper String (which must be quoted and escaped too).

On the advanced parameters tab, you can choose between two output redirection strategies. On Standard clients, output coming from R is redirected to the Talend logging facilities, if allowed, while on Silent clients the R console is actually muted. In either case, R autoprinting is disabled at the moment, so you must explicitly put a "print" statement in your R code to output something to the console. At the very end of the tab, a parameter lets you choose whether log messages from the component are notified to tLogCatcher instances; if disabled, they are printed to stdout/stderr only. These parameters apply to both execution scenarios just described.

Although sometimes you just need to execute R code, more often you need to get the computation results back on the Talend side. This is done from the advanced parameters tab. For each column of the output schema, you can map an R expression returning an array of a fixed kind (only String, Double and Integer are supported). If the expression does not return an array, or the array cannot be cast to String, Double or Integer on the R side, you will probably get a NullPointerException somewhere. As the problem comes from R, not Talend, you will probably need to look at your R code and expressions to fix it (for example, Factors must be converted to Characters/String, Booleans to Integers and so on). Of course, the conversion to Talend schema columns is automatic. For example, a Double array coming from R can be stored in a Float column of the output schema. This is perfectly allowed and does not cause any error.

3. Sample Scenario (Classification of Dataset)

3.1. Using tExecuteRScript Component

A classification model is fitted on a dataset (the iris dataset), with the iris species as the outcome (predicted) variable. A Random forest model is built without a test phase. Find below the job layout.

[Figure: job layout]

In the above scenario, there are two sub jobs. In the first one, the iris dataset is loaded from the external file. Then, that dataset is split into two subsamples. The first (120 rows) will be our training set, while the latter (30 rows) will be our prediction set. To simulate a prediction, the outcome variable is dropped from the prediction dataset. Finally, the subsamples are saved in two CSV files with headers.

In the second sub job, R is called. The component basically sources an external .R file. Find below the sample R script sent to the component. Only the first two statements are complete in this copy; of the remaining four assignments, only the assigned names survive.

    setwd("C:/Users/Akshayaa/Desktop")
    library(randomForest)
    predict
    train
    r
    outcome

After the working directory is set and the needed library is loaded, we simply import the freshly created CSV files. Then, we fit a trivial model and we feed an outcome array with the predicted species.

3.2. How to get back outcomes on the Talend

Find below the advanced parameters pane for tExecuteRScript.

[Figure: advanced parameters pane]

This pane allows us to define an output schema and to write a set of R expressions on how to feed that schema. For the specific example, what we are actually saying is: (R) convert the outcome array elements to characters and then (Talend) put them as Strings in the column species. Finally, it is just time to take these results into consideration in our ETL job, using a standard outgoing Talend data connection.
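The conversion rules for the output mapping (only String, Double and Integer arrays can cross from R to Talend, and nothing is printed unless asked for) can be tried out in plain R before wiring any expression into the component. The snippet below is a minimal sketch under stated assumptions: the object names (predicted, flagged, score and the *_col results) are hypothetical and exist only for illustration, and the component's own mapping pane is not reproduced here.

```r
# Hypothetical R-side results, named only for illustration.
predicted <- factor(c("setosa", "virginica", "setosa"))  # a factor
flagged   <- c(TRUE, FALSE, TRUE)                        # a logical vector
score     <- c(0.91, 0.47, 0.88)                         # already a double vector

# Expressions of this shape return arrays of a supported kind.
species_col <- as.character(predicted)  # factor  -> character (String column)
flag_col    <- as.integer(flagged)      # logical -> integer   (Integer column)
score_col   <- as.numeric(score)        # double stays double  (Double or Float column)

# Sanity checks on the R side, where a type problem is easiest to diagnose.
stopifnot(is.character(species_col), is.integer(flag_col), is.double(score_col))

# Autoprinting is disabled in the component, so output must be requested explicitly.
print(species_col)
```

Checking the types with stopifnot on the R side is a design choice of this sketch: the document notes that a bad mapping tends to surface as a NullPointerException on the Talend side, which says little about the R expression that caused it.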
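Because the script listing in section 3.1 is incomplete, the following is a sketch of a script of the same general shape, not the author's code. Everything beyond library(randomForest) is an assumption: the CSV file names, the use of tempdir() instead of the working directory set in the source, the every-fifth-row hold-out that stands in for the first sub job, and the object names to_score, fit and species_out. It uses the iris data set that ships with R.

```r
library(randomForest)  # same package the source script loads

# Stand-in for the first sub job: write a 120-row training file and a
# 30-row prediction file without the outcome column (assumed file names).
work_dir <- tempdir()
idx <- seq(1, 150, by = 5)  # assumption: hold out every 5th row (30 rows)
write.csv(iris[-idx, ],  file.path(work_dir, "iris_train.csv"),   row.names = FALSE)
write.csv(iris[idx, -5], file.path(work_dir, "iris_predict.csv"), row.names = FALSE)

# What a sourced script could then do: import the freshly created CSV files.
train    <- read.csv(file.path(work_dir, "iris_train.csv"),
                     header = TRUE, stringsAsFactors = TRUE)
to_score <- read.csv(file.path(work_dir, "iris_predict.csv"), header = TRUE)

# Fit a trivial random forest (no test phase) and predict the species.
fit     <- randomForest(Species ~ ., data = train)
outcome <- predict(fit, newdata = to_score)  # factor of predicted species

# Expression of the kind mapped to the String column 'species' in section 3.2.
species_out <- as.character(outcome)
print(table(species_out))  # explicit print, since autoprinting is disabled
```

The sketch avoids naming a data frame predict, as the surviving listing appears to do. R would still find the predict function in that case, but a separate name keeps the call easier to read.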