DataStage NOTES

April 3, 2018 | Author: Hemanth Kumar | Category: Data Warehouse, File Format, Text File, Information Technology, Information Retrieval






DataStage :: FUNDAMENTAL CONCEPTS ::

DAY 1 - Introduction to the Phases of DataStage

There are four phases in DataStage:

Phase I: Data Profiling. It analyses the source system. The analyses are: 1. Column analysis, 2. Primary key analysis, 3. Foreign key analysis, 4. Baseline analysis, and 5. Cross-domain analysis. Through these analyses we find out whether the data is "dirty" or not.

Phase II: Data Quality (also called cleansing). The steps in this phase are interdependent, i.e. they run one after another: Parsing -> Correcting -> Standardizing -> Matching -> Consolidating.

Phase III: Data Transformation. The ETL process is done here; the data is transformed as it moves from one stage to another. ETL stands for E - Extract, T - Transform, L - Load.

Phase IV: Meta Data Management. "Meta data" means data about the data.

DAY 2 - How an ETL programming tool works

Pictorial view: the sources (database, flat files, MS Excel) feed the ETL process, which loads the data warehouse (DWH); the business interface (BI) then builds data marts (DM) and reports on top of it.
Figure: ETL programming process.

DAY 3 - Continued

- Extract window: the data is extracted from .txt files (ASCII code) at the source.
- Staging (permanent data): the extracted data is converted into a format DataStage understands, the Native Format.
- Load window: after the transformation the data is loaded into the DWH database (or resides in the local repository), and is finally written back out as .txt (ASCII code).

ETL is a process that is performed in stages:

  OLTP (S -> T) -> staging area (S -> T) -> staging area (S -> T) -> DWH

Here S means source, T means target, and "sa" is a staging area.

Home Work (HW): one record for each kindle (multiple records for multiple addresses and dummy records for joint accounts).
- Kindle means the information of a customer.
- Multiple records means multiple customer records; multiple addresses means one customer (one account) maintaining several "addresses" such as savings / credit card / current account / loan.
- HLD = high level document, LLD = low level document; the developer takes inputs from these.

DAY 4 - ETL Developer Requirements

Q: One record for each kindle (multiple records for multiple addresses and dummy records for joint accounts).
- A kindle groups the customer's accounts with the bank: loan, credit card, savings and so on.
- A customer maintained as one record while handling different addresses is called a 'single view of customer' or 'single version of truth'.
- HW explanation: read the query very carefully and understand the terminology of the words from the business perspective.

The ETL Developer Requirements are:
1. Understanding: the HLD and the LLD.
2. Prepare Questions: after reading the given document, ask friends / forums / team leads / project leads.
3. Logical design: the paper work.
4. Physical model: using the tool.
5. UNIT Test.
6. Job Sequencing.
7. Performance Tuning.
8. Design Turn Over Document (DTD) / Detailed Design Document (DDD) / Technical Design Document (TDD).
9. Peer Reviews: nothing but releasing versions (version control *.**, where * means a range of 1-9).
10. Backups: importing and exporting the data as required.
DAY 5 - How a DWH project is undertaken

Process: Requirements -> HLD -> Warehouse (WH) -> TD -> Developer -> TEST -> Production -> Migration.
- The tick marks where the developer is involved in the flow (and implements all TEN requirements shown above); the x marks where the developer is not involved.
- As a system engineer, the share of jobs is roughly: Development 70% - 80%, Production 10%, Migration 30%.
- Production-based companies are, for example, IBM and so on; support (migration) based companies are TCS, Cognizant, Satyam, Mahindra and so on. In migration, the developer works on both server and parallel jobs.
- Migration means converting server jobs to parallel jobs. Up to 2002 only the server environment was used; after 2002 and up to now, IBM's X-Migrator converts server jobs to parallel jobs - about 70% automatically and 30% manually.

1. Project Process: a project is divided into categories with respect to its period (the time the project takes, in months and years):
   Simple        6 months
   Medium        6 months - 1 year
   Complex       1 - 1.5 years
   Too complex   1.5 - 5 years and so on (it may take many years depending on the project)

2. HLD Requirements (high level documents): SRS and BRD (prepared by the business analyst / subject matter expert).
   HLD Warehouse: Architecture, Schema (structure), Dimensions and tables (target tables), Facts.

3. LLD (low level documents): TD, Mapping Documents (specifications), Test Specs, Naming Docs.

Mapping Document: for example, if the query requirements are 1. experience of the employee, 2. first name, middle name, last name, and 3. dname, the mapping is drawn from the common fields of the source tables (Emp: Ename, Eno, Dno, Hire date; Dept: Dno, Dname) to the target Exp_tbl / Exp_emp (Eno, FName, MName, LName, DName).

Format of the Mapping Document (columns): S.no, Load order, Target Entity, Target Attributes, Source Tables, Source Fields, Transformation (e.g. experience = Current Date - Hire date, CD-HD), Constant, Pk / Fk / Sk, Error Handling.

- 'C' is combining: the data is taken from multiple tables (S1, S2) and combined into the target, either by horizontal combining or by vertical combining (funneling). As per this example, horizontal combining is used.
- HC means Horizontal Combining; it is used to combine primary rows with secondary rows.
- "Look up" means cross verification against the primary table.
- As a developer you will get a maximum of about 100 source fields and about 30 target fields.
- The sources S1 (.txt files - fwf, vl, sc, cv, h & t, s & t) and S2 (types of dB) are combined with HC into the target (TRG); a small SQL sketch of this mapping follows.
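As a rough illustration only, here is a minimal SQL sketch of the mapping document above, assuming Oracle-style emp and dept source tables and an Exp_tbl target; the table and column names are assumptions for illustration, not from a real project.

  -- Hypothetical query behind the mapping: horizontal combine of emp and dept
  -- plus the CD-HD (current date minus hire date) transformation.
  SELECT e.eno,
         e.ename,                                              -- split into FName/MName/LName in the job
         d.dname,
         MONTHS_BETWEEN(SYSDATE, e.hiredate) / 12 AS exp_years -- CD-HD rule
  FROM   emp  e
  JOIN   dept d ON d.dno = e.dno;                              -- Pk/Fk combine on dno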
DAY 6 - Architecture of DWH

Example: the Reliance Group has several branches - Reliance Communications, Reliance Power, Reliance Fresh - and every branch has its own manager and its own database. Above all of these managers there is one Top Level Manager (TLM). The TLM needs the details of sales, customers, employees, periods and orders from every manager in order to analyse the business.

How the ETL process is done for this example: each branch database (e.g. Reliance Fresh, through its ERP) goes through an ETL process into a mini warehouse / data mart, and the ETL process also loads the central data warehouse (DWH). The warehouse is the top level and the data marts are the bottom level, as shown in the figure; a data mart is also called a 'mini WH'.

- Dependent Data Mart: the ETL process takes all the managers' information (databases) and keeps it in the warehouse first; the data marts are then fed from the warehouse, so the warehouse and the data marts depend on each other. Because the data mart depends upon the WH, it is called a dependent data mart.
- Independent Data Mart: the data of an individual manager (RF, RC, RP and so on) goes through the ETL process directly into its own data mart, without any help from the warehouse. That is why it is called an independent data mart.

6.1. Two level approaches (both apply a two-layer architecture):

1. Top - Bottom level approach: the level starts from the top; as per the example, the Reliance Group data goes through the ETL process into the Data Warehouse (top level, layer I) and from there into the separate data marts (bottom level, layer II). This approach was invented by W. H. Inmon.

2. Bottom - Top level approach: the ETL process loads the data marts directly (layer I, bottom level), and the data is then put into the warehouse for reference or storage (layer II, top level). Here one data mart (DM) contains information such as customers, employees, products, location and so on. This approach was invented by R. Kimball.

These two approaches come under the two-layer architecture.

ETL Tools: they are GUI (graphical user interface) tools used to "extract the data from heterogeneous sources". ETL programming tools are Teradata / Oracle / DB2 and so on (programming / coding based).

6.2. Four layers of DWH Architecture:

- Layer I: the data is sent directly, in the first case from the source to the Data Warehouse (DWH), and in the second case from the source to a group of Data Marts (DM).
- Layer II: the data flows from source - data warehouse - data marts (the "top - bottom approach"), or in the other case from source - data marts - data warehouse (the "bottom - top approach"). The Layer II architecture is explained by the Reliance Group example shown above.
- Layer III: the data flows from source - ODS (Operational Data Store) - DWH - Data Marts. The new concept added here is the ODS: it stores the operational data for a period such as 6 months or one year, and that data is used to solve instant/temporary problems. The ETL developer is not involved at the ODS; the team that solves the instant problems is called the interface team. After the period, the ODS data is stored in the DWH (so that the technical problems are not repeated, and for future reference), and from there it goes to the data marts, where the ETL developers are involved to solve technical problems. Almost all projects (99.99%) use layer 3 or layer 4.

The clear explanation of the layer 3 architecture is in the example below; it is the best example for a clear explanation.

Example #1:
An aeroplane (the source) is waiting to land at the airport terminal, but it is not able to land because of some technical problem at the airport base station. To solve this type of operational problem a special team - the interface team - is involved; simply say, the problem information is captured. The problem information is stored in a database, the Operational Data Store (ODS), for about one year (layer II). Because the ODS stores the data for one year only, years of such data are stored in the data warehouse, so that the same technical problems are not repeated and for future reference (layer III); from the DWH the data goes to the data marts (DM), where the ETL developers are involved. This example, with the interface team, is also called the layer 3 architecture of a data warehouse.

DAY 7 - Continued

7.1. Project Architecture (Layer IV):

Layer 4 is also called the "Project Architecture".

Figure: Project Architecture / Layer IV - Source 1 and Source 2 (layer I) -> Interface Files (flat files, layer II) -> the ETL reads the flat files through DataStage, with format-mismatch and condition-mismatch records dropped -> ODS -> DW (layer III) -> BI DM / SVC / Reporting (layer IV). A lookup from the BI DM is kept for data backup of the DWH & SVC, and the reference-data / reject-data links show the dropped records.

Here: DW - Data Warehouse, DM - Data Mart, BI - Business Intelligence, ODS - Operational Data Store, SVC - Single View of Customer.

About the project architecture: there are 4 layers.
- In the first layer, the sources write to the interface files (flat files).
- In the second layer, the ETL reads the flat files (.txt, .csv, .xml and so on) through DataStage (DS) and sends them to the ODS. When the ETL sends the flat files to the ODS, any mismatched data is dropped.
- In the third layer, the ETL transfers the data to the warehouse.
- In the last layer, the data warehouse checks whether the record is a single customer or not, and the data is loaded (transferred) between the DWH and the DM (business intelligence).

Note (information about dropped data): when the transmission is done between the ETL and the ODS, there are two types of mismatched data: 1. Condition mismatch, 2. Format mismatch.
- Condition mismatch (CM): it verifies whether the data from the flat files satisfies the conditions; if a record is mismatched, it is dropped automatically. The reference link is used to see the dropped data, and it shows which records were condition-mismatched.
- Format mismatch (FM): this is also like the condition mismatch, but it checks whether the format of the data or records being sent is correct or mismatched. Here also the reference link is used to see the dropped data.

Example for condition mismatch: an employee table contains some data.

  SQL> select * from emp;

  EID   ENAME    DNO
  08    Naveen   10
  19    Munna    20
  99    Suman    30
  15    Sravan   10

The table contains dno 10, 20 and 30, but the target requires only dno = 10, so the records with dno 20 and 30 are dropped from emp and the reference link captures them.
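A minimal SQL sketch of that condition check, assuming the emp table above; in the job itself the condition is applied by a stage and the rejects flow down the reference link, so the SQL is only for illustration.

  -- Records that satisfy the condition (dno = 10) go to the target ...
  SELECT eid, ename, dno FROM emp WHERE dno = 10;

  -- ... and the condition-mismatched records (dno 20 and 30) are the drops
  -- that the reference link captures.
  SELECT eid, ename, dno FROM emp WHERE dno <> 10;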
Example for format mismatch: the source table is tab/space separated.

  EID   EName    Place
  111   naveen   mncl
  222   knl      munna

The second record has its values in the wrong columns, so it is format-mismatched and the record is simply rejected.

7.2. Single View of Customer (SVC):

It is also called the "single version of truth". For example, how to make a unique customer out of the same records: Phase II can identify duplicates field by field; Phase III cannot identify them. Multiple records of the customers (one per account - savings, deposit, insurance, loan, credit) are transformed into a single record per customer with the accounts as columns; DataStage people are involved in this process (SVC / single version of truth). This type of transformation is also called Reverse Pivoting.

NOTE: the business intelligence data mart (BI DM) is for the data backup of the DWH and the SVC (single version of truth).

DAY 8 - Dimensional Modeling

Modeling: it represents the physical or logical design from our source system to the target system.
- Logical design: the client's perspective.
- Physical design: the database perspective.
- Forward Engineering (FE): starting the design from scratch.
- Reverse Engineering (RE): creating the design from an existing system - simply say, "altering the existing process".
- Meta Data: every entity has a structure, which is called meta data (simply say, 'data about data'). In a table there are attributes and domains; there are two types of domain: 1. Alphabetical and 2. Number.
- Data modelers use DM tools such as ERWIN (Entity Relation for Windows) and ER-Studio (Entity Relation Studio); in these two, the logical and physical design is done. Designing can also be done manually (paper work).

For example - Q: a client requires the experience of an employee.
- The source EMP_table has a Hire Date column; the experience of the employee is the implicit requirement.
- From the developer's point of view the explicit requirement is to find out everything the client may want to see in the target: ENo, EName, and the hire detail broken down into Years, Months, Days, Hours, Minutes, Seconds, Nano_Seconds - the lowest level of detailed information.

8.1. Dimensional Table: everything the client requirement wants to see, or the "lowest level of detailed information" in the target tables, is called a dimensional table.

There are three types of keys:
- Primary key: a constraint; it is a combination of unique and not null.
- Foreign key: a constraint, used as a reference to another table.
- Surrogate key.

Q: How are the tables interconnected? By taking some tables and linking them with the related tables; the link is created using the foreign key and the primary key. For example, in the product tables:

  Product table:       Product_ID (Pk), PRD_Desc, PRD_TYPE_ID (Fk)
  Product type table:  PRD_TYPE_ID (Pk), PRD_SP_ID (Fk), PRD_Category
  Supplier table:      PRD_SP_ID (Pk), SName, ADD1

The link is established using the Fk and the Pk.

8.2. Normalized Data: if a table has repetitive information or records, that is called redundancy. The technique of minimizing that repeated information is called normalization.

For example:

  ENO  EName   Designation      DNo  Higher Quali.  Add1  Add2
  111  naveen  ETL Developer    10   M.TECH         JNTU  HYD
  222  munna   System analysis  20   M.SC           SVU   HYD
  333  Sravan  JAVA Developer   10   M.TECH         JNTU  HYD
  444  Raju    Call Center      30   M.SC           SVU   HYD
  555  Rajesh  JAVA             10   M.TECH         JNTU  HYD

The qualification and address columns repeat, so reducing the redundancy is done by dividing the table in two:

  EMP:  ENO, EName, Designation, DNo (Fk)
  DEPT: DNo (Pk), Higher Quali., Add1, Add2

- De-normalization means combining the multiple tables back into one table, and the combining is done by a horizontal combine (HC).
- The target table must always be in de-normalized format - but not in all cases.
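A small SQL sketch of the normalized design above, with the Pk/Fk link; the exact column names and data types are only assumptions for illustration.

  -- DEPT holds the repeating qualification/address data once per department (Pk).
  CREATE TABLE dept (
    dno    NUMBER PRIMARY KEY,
    quali  VARCHAR2(20),
    add1   VARCHAR2(20),
    add2   VARCHAR2(20)
  );

  -- EMP keeps only the employee-level data and points at DEPT through the Fk.
  CREATE TABLE emp (
    eno          NUMBER PRIMARY KEY,
    ename        VARCHAR2(30),
    designation  VARCHAR2(30),
    dno          NUMBER REFERENCES dept (dno)   -- foreign key link
  );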
DAY 9 - E-R Model

An Entity-Relationship Model: in the logical design there are two options when designing a job - 1. Mandatory and 2. Optional - and the mandatory relationships are a must.

The given two tables are EMP and DEPT:

  EMP:  ENO, EName, Designation, DNo (Fk)
  DEPT: DNo (Pk), Higher Quali., Add1, Add2

- Here, from the above two tables, the primary table (also known as the master table) is the DEPT table, because it does not depend on any other table; the EMP table is the secondary table (also known as the child table) because it depends on the DEPT table.
- But when we take it in real time and join the two tables using horizontal combining, the EMP table is taken as the primary table and the DEPT table as the secondary table.

9.1. Horizontal Combine:

HC means combining primary rows with secondary rows based on the primary key column values. To perform horizontal combining we must follow these cases:
- There should be a dependency between the tables.
- It must have multiple sources.
- 1 primary table and n secondary tables.
- There are three types of keys: primary key, foreign key, and surrogate key.

Horizontal combining is also called a JOIN. For example, combining the EMP and DEPT tables (Fk DNo in EMP, Pk DNo in DEPT) with HC gives one row per employee that also carries the department's columns:

  ENO  EName   Designation      DNo  Higher Quali.  Add1  Add2
  111  naveen  ETL Developer    10   M.TECH         JNTU  HYD
  222  munna   System analysis  20   M.SC           SVU   HYD
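As a sketch in SQL, the same horizontal combine (join) of the assumed emp and dept tables looks like this; in DataStage the equivalent would normally be a join keyed on dno rather than hand-written SQL.

  -- Horizontal combine = join primary rows (emp) with secondary rows (dept)
  -- on the key column dno; the result is the de-normalized row shown above.
  SELECT e.eno,
         e.ename,
         e.designation,
         e.dno,
         d.quali,
         d.add1,
         d.add2
  FROM   emp  e
  JOIN   dept d ON d.dno = e.dno;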
9.2. Different types of Schemas:

There are four types of schemas: STAR Schema, Snow Flake Schema, Multi STAR Schema, and Galaxy Schema.

1. STAR Schema: in the star schema you must know about two things -
   - Dimensional table: the 'lowest level of detailed information' of a table.
   - Fact table: a collection of foreign keys from the n dimensional tables.

Definition of STAR Schema: "A fact table - a collection of foreign keys - surrounded by multiple dimensional tables, where each dimensional table is a collection of de-normalized data" is called a STAR schema.

The data transmission is done in two different methods: in the pictorial way, Source -> transformation -> DIM tables -> transformation -> FACT table; in the practical way, it goes directly from the source to the dimensional table and the fact table.

Example for a STAR schema - Q: display what suman bought (a Lux product, in Ameerpet, in the first week of January)?
- As per the above question, the information needed comes from PRD_dim_tbl, Cust_dim_tbl, Date_dim_tbl and Loc_dim_tbl; the fact table in the middle holds a foreign key (Fk) to the primary key (Pk) of each dimensional table.
- The fact table is surrounded by the dimensional tables; it is also called the bridge or intermediate table.
- In the fact table, the link is created to the measurements, i.e. to the fact table through the foreign key and the primary key; measurements mean the information taken as per the client or user requirement.
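A compact SQL sketch of the star schema in this example; the dimension tables (cust_dim, prd_dim, loc_dim, date_dim) and all column names are assumed to exist only for illustration.

  -- Fact table: a collection of foreign keys to the four dimensions (plus a measure).
  CREATE TABLE sales_fact (
    cust_id  NUMBER REFERENCES cust_dim (cust_id),
    prd_id   NUMBER REFERENCES prd_dim  (prd_id),
    loc_id   NUMBER REFERENCES loc_dim  (loc_id),
    date_id  NUMBER REFERENCES date_dim (date_id),
    qty      NUMBER
  );

  -- "What did suman buy (Lux, Ameerpet, first week of January)?"
  SELECT c.cname, p.prd_desc, l.area, d.cal_date, f.qty
  FROM   sales_fact f
  JOIN   cust_dim c ON c.cust_id = f.cust_id
  JOIN   prd_dim  p ON p.prd_id  = f.prd_id
  JOIN   loc_dim  l ON l.loc_id  = f.loc_id
  JOIN   date_dim d ON d.date_id = f.date_id
  WHERE  c.cname = 'suman' AND p.prd_desc = 'Lux'
  AND    l.area  = 'Ameerpet' AND d.month_no = 1 AND d.week_of_month = 1;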
2. Snow Flake Schema: sometimes it is not possible to fetch the answer quickly when there is huge data in one dimensional table. To minimize the huge data held at once in a single dimensional table, we divide the dimensional table into some further tables, known as "look-up tables". For example, an EMP_tbl dimension can have a Dept_tbl look-up, which in turn has a Locations look-up; if we require information from the location table, it is fetched from there and displayed as the client requires.

Definition of Snow Flake Schema: "The fact table surrounded by dimensional tables, where each dimensional table has look-up tables" is called a Snow Flake schema.

- The STAR schema works effectively on the de-normalization side (Source -> DWH -> Reports with Cognos/BO).
- The Snow Flake schema works effectively on the normalization side (MIG/H1).
- But in the current market the STAR schema and the Snow Flake schema are used rarely on their own.

NOTE: the selection of the schema at run time depends on the report generation.

:: DataStage CONCEPTS ::

DAY 10 - DataStage (DS) Concepts

Topics: history of DS, features of DS, architecture of 7.5.x2 and 8.0.1, differences between the 7.5.x2 and 8.0.1 versions, and the enhancements and new features of version 8.0.1.

Q: What is DataStage?
ANS: DataStage is a comprehensive ETL tool, which provides an End-to-End Enterprise Resource Planning (ERP) solution (here, comprehensive means good in all areas).

History begins:
- As of 2006 there were about 600 ETL tools in the market, some of them being DataStage Parallel Extender, SAS (ETL Studio), BODI, ODI (OWB), Ab Initio and so on. DataStage is the most famous and widely used in the market, and it is expensive too.
- In 1997 the first version of DataStage was released by the VMARK company (a US based company). Mr. LEE SCHEFFLER is the father of DataStage. In those days DataStage was called "Data Integrator", and only 5 members were involved in releasing the software into the market. There are about 90% changes from 1997 to 2010 comparing the released versions.
- Data Integrator was then acquired by a company named TORRENT, and a couple of years later INFORMIX acquired Data Integrator from the TORRENT company.
- In 2000 ASCENTIAL acquired both the database business and Data Integrator, and in that year Ascential DataStage Server Edition was released. By this company DataStage was popularized in the market from that year.
- In 2002 ADSS + ORCHESTRATE: the ASCENTIAL company integrated with the ORCHESTRATE company for the parallel capabilities, because ORCHESTRATE (PX, UNIX) had parallel-extender capabilities in the UNIX environment. The combined product was named ADSSPX, version 6.0; from that version the parallel operations (parallel capabilities) start. From 6.0 up to 7.5.1 the versions supported only UNIX flavour environments, because the server was configured only on the UNIX platform.
- In 2004 version 7.5.x2 was released, which supports the server configuration for the Windows platform as well. For this, ADSSPX was integrated with the MKS_TOOLKIT. The MKS_TOOLKIT is a virtual UNIX machine that brings the capabilities to Windows to support the server configuration, and all the UNIX commands work on the Windows platform. NOTE: it is installed as ADSSPX + MKS_TOOLKIT on Windows.
- Also in 2004, 7.5.x2 shipped the ASCENTIAL suite components. There are 12 types of ASCENTIAL suite components, among them: Profile stage, Quality stage, Audit stage, Meta stage, DataStage PX, DataStage Tx, DataStage MVS and so on - these are individual tools.
- In 2005 (February) IBM acquired all the ASCENTIAL suite components, and IBM released IBM DS EE, i.e. the Enterprise Edition.
- In 2006 IBM made some changes to IBM DS EE: the profiling stage and the audit stage were integrated into one, and with the combination of four stages they released "IBM WEBSPHERE DS & QS 8.0". This is also called an "Integrated Development Environment" (IDE).
- In 2009 IBM released another version, "IBM INFOSPHERE DS & QS 8.1".
- In the current market: 7.5.x2 is used by about 40 - 50%, 8.0.1 by 30 - 40%, and 8.1 by 10 - 20%.

NOTE: DataStage is a front end; nothing is stored in it.

DAY 11 - DataStage Features

There are 5 important features of DataStage: Any to Any, Platform Independent, Node Configuration, Partition Parallelism, and Pipeline Parallelism.

- Any to Any: DataStage is capable of going from any source to any target.
- Platform Independent: "a job can run on any processor". There are three types of processors: UNI, SMP (Symmetric Multi Processor) and MPP (Massively Parallel Processing).
- Node Configuration: a node is software that is created in the operating system. "A node is a logical CPU, i.e. an instance of a physical CPU"; the process of creating virtual CPUs using software is called node configuration. This concept works exclusively in DataStage, and it is the best feature compared with the other ETL tools.
  For example: an ETL job has to execute 1000 records. A UNI processor takes 10 minutes to execute the 1000 records, but an SMP processor takes only 2.5 minutes, because the 1000 records are shared across its four CPUs and the execution time is reduced.
- Using node configuration, the same 1000 records on a UNI processor can also be executed faster: multiple nodes (virtual CPUs - Node1, Node2, Node3, Node4) are created on the single physical CPU, and the 10 minutes reduces to about 2.5 minutes. So node configuration can create virtual CPUs to reduce the execution time even for a UNI processor.

- Partition Parallelism: partitioning is distributing the data across the nodes, based on partition techniques.
  Consider one example of why we use the partition techniques. Take some records in an EMP table and some in a DEPT table: EMP has 9 records (dno values 10, 20, 30) and DEPT has 3 records (dno 10, 20, 30). After partitioning, the joined output must still have 9 records, because the primary table has 9 records. If the rows are spread over nodes N1, N2, N3 without regard to the key, only some dno values land on the same node as their DEPT row - in the example only 4 records match in total and 5 records are missing. For this reason the partition techniques were introduced: with key-based partitioning on dno, the EMP rows with dno 10, 20 and 30 are collected onto the same nodes as the matching DEPT rows.
  There are two categories of partition parallelism, with 8 partition techniques in total:
  - Key based: Hash, Modulus, Range, DB/2.
  - Key less: Same, Random, Entire, Round robin.
  The key-based category gives the assurance that rows with the same key column value are collected into the same partition. The key-less techniques are used to append the data when joining the given tables.

DAY 12 - Features of DataStage (continued)

- Re-Partition: means re-distributing the already distributed data. For example, the first partition is done key-based on dno; if we then take a separate column, location (AP, TN, KN), the distributed data has to be re-distributed for that column - this is known as re-partition.
- Reverse Partitioning: it is also called collecting. It happens in one case or situation only: "when the data moves from a parallel stage to a sequential stage, the collecting happens". The parallel streams from nodes N1..Nn are collected into a single sequential file.
  There are four categories of collecting techniques: Order, Round robin, Sort merge, and Auto. For example, with partitions N1 = (a, x), N2 = (b, y), N3 = (c, z): Round robin gives a, b, c, x, y, z; Sort merge gives the key-sorted sequence; Order preserves the partition order; Auto picks a technique itself.
- Designing a job in "stages": the link (also called a pipe) is the channel that moves the data from one stage to another stage.
- Pipeline Parallelism: "all the pipes carry the data in parallel and the processing is done simultaneously."
  In the server environment the execution process is called traditional batch processing: Extract (10 min) lands to disk, Transform (10 min) lands to disk, Load (10 min) - the execution takes 30 minutes to complete. In the parallel environment the same job has all the pipes carrying the data in parallel and processing the job simultaneously, so the execution takes only about 10 minutes. By using pipeline parallelism we reduce the processing time.

DAY 13 - Differences between 7.5.x2 & 8.0.1

  7.5.x2                                     8.0.1
  - Architecture components:                 - Architecture components:
    * Server components                        * Common User Interface
    * Client components                        * Common Repository
                                               * Common Engine
                                               * Common Connectivity
                                               * Common Shared Services
  - 2-tier architecture                      - N-tier architecture
  - 4 client components:                     - 5 client components:
    * DS Designer                              * DS Designer
    * DS Manager                               * DS Director
    * DS Director                              * DS Administrator
    * DS Administrator                         * Web Console
                                               * Information Analyzer
  - OS dependent w.r.t. users                - OS independent w.r.t. users (one-time dependent only)
  - No web based administration              - Web based administration through the web console
                                               (simply say, work from home)
  - File based repository                    - Database based repository
  - Capable of phases III & IV               - Capable of all phases

13.1. Client components of 7.5.x2:
- DS Designer: it is to create jobs, compile, run, and do multiple-job compile. 4 types of jobs can be handled by DS Designer: mainframes jobs, server jobs, parallel jobs, and job sequence jobs.
- DS Director: it can handle schedule, run jobs, monitor, views (job, status, logs), message handling, unlock, and batch jobs.
- DS Manager: it can handle the import and export of repository components and the node configuration.
- DS Administrator: it can handle create project, delete project, and organize project.
Client components of 8.0.1:
- DS Designer: it is to create jobs, compile, run, and do multiple-job compile. 5 types of jobs can be handled by DS Designer: mainframes jobs, server jobs, parallel jobs, job sequence jobs, and data quality jobs.
- DS Director: same as shown above for 7.5.x2.
- DS Administrator: same as shown above for 7.5.x2.
- Web Console: the administration component through which the following are performed (it checks security issues): security services, scheduling services, logging services, reporting services, session management, and domain management.
- Information Analyzer: it is also called the console for IBM INFO SERVER. It performs all the phase-I activities: column analysis, primary key analysis, foreign key analysis, baseline analysis, and cross-domain analysis.

As an ETL developer you mainly come across DS Designer and DS Director, but some information should be known about the Web Console and the Information Analyzer.

DAY 14 - Description of the 7.5.x2 & 8.0.1 Architectures

14.1. Architecture of 7.5.x2:

* Server components: divided into 3 categories - a. Repository, b. Engine, c. Package Installer.

- Repository: it is also called the project or work area. The repository organizes the different components in one area, i.e. it is a collection of components; some of the components are jobs, table definitions, shared containers, routines and so on. The repository is also an Integrated Development Environment (IDE): the IDE is where jobs are designed, compiled, run and saved. The repository is for developing the application as well as storing the application.

- Engine: it executes the DataStage jobs and it automatically selects the partition technique.
  Never leave any stage on auto: if we leave it on auto, it selects the auto partition technique, which causes an effect on the performance.

- Package Installer: this component contains two types of installers, one is plug-ins and the other is packs.
  Example for a plug-in: a computer needs the 1100 driver installed before it can use the 1100 printer; the driver/interface acts as the plug-in between the computer and the printer.
  Packs: ERP SW DS Packs - the packs are used to configure DataStage for the ERP solution. The best comparison is that a normal Windows XP machine acquires Service Pack 2 for more capabilities.
* Client components: divided into 4 categories - a. DS Designer, b. DS Manager, c. DS Director, d. DS Administrator. These categories handle what was shown above (i.e. on page 39).

14.2. Architecture of 8.0.1:

1. Common User Interface: it is also called the unified user interface - Web Console, Information Analyzer, DS Designer, DS Director, DS Administrator.
2. Common Repository: it is divided into two types -
   a. Global repository: the DataStage job files are stored here.
   b. Local repository: the individual files are stored here (for performance reasons).
   The common repository is also called the Meta Data SERVER, and it holds three types of metadata: project level MD, design level MD, and operation level MD.
3. Common Engine: it is responsible for the data profiling analysis, the data quality analysis, and the data transformation analysis.
4. Common Connectivity: it provides the connections to the common repository.

Table representation of the "8.0.1 Architecture": the common user interface (WC, IA, DE, DI, DA) sits on top of the common repository / MD server (project, design and operation level MD) and the common shared services; below them are the common engine (DP, DQ, DT, DA) and the common connectivity.

DAY 15 - Enhancements & New Features of Version 8

In version 8.0.1 there are 8 categories of stages.

Processing stage:
- New stages: 1. SCD (Slowly Changing Dimension), 2. FTP (File Transfer Protocol), 3. WTX (WebSphere TX).
- Enhanced stages:
  1. Lookup stage - previously lookup had i. Normal lookup and ii. Sparse lookup; newly added are iii. Range lookup and iv. Caseless lookup.
  2. Surrogate key stage: a newly introduced concept.
- The other stages are the same as in version 7.5.x2; no changes in this version.

Database stage:
- New stages: IWAY, Classic Federation, ODBC Connector, NETEZZA.
- Enhanced stages: all stage techniques are used with respect to the SQL Builder.

Palette of version 8.0.1: General, Database, File, Processing, Real Time, Restructure, Data Quality (new), Development & Debug.
- The Database and Processing stages have the changes shown above; the others have no changes.
- Data Quality is an exclusively new concept of 8.0.1.
- The palette holds the shortcuts of the stages, which we drag and drop onto the canvas to design the job.

:: Stages Process & Lab Work ::

DAY 16 - Starting Steps for the DataStage Tool

To start DataStage on the system we must follow these steps to do a job:
- The DB2 repository must be started and the DataStage server must be started (if not, start them manually).
- After they are started: select DS Designer, enter the user id and password (e.g. admin / ****), and attach the appropriate project (e.g. Project\navs).
- Open the palette (from the tool bar): General, Data Quality, Database, File, Development & Debug, Processing, Real Time, Restructure. The designer canvas (editor) is where the job is designed, e.g. Seq to Seq.

The five steps of the job development process (for designing a job):
1. Select the appropriate stages in the palette and drag them onto the canvas.
2. Link them (give the connectivity).
3. Set the properties - this is important.
4. Save, compile and run the job.
5. Run the Director (to see the views), i.e. to view the status of your job.
DAY 17 - My First Job Creating Process

Process:
- When the 8th version of DataStage is installed, five client component shortcuts are visible on the desktop: Web Console, Information Analyzer, DS Administrator, DS Designer, DS Director.
- On the computer desktop, check whether the server for DataStage has started or not. Among the currently running processes, shown at the left corner, there is a round symbol with green colour; start it manually when it does not start automatically.
- Web Console: when you click it, "the login page appears". If the server is not started, it displays a "the page cannot be opened" error; if that error occurs, the server must be restarted before doing or creating jobs.
- DS Administrator: it is for creating / deleting / organizing the projects.
- DS Designer: when you click on the designer icon, it asks you to attach the project for creating a new job, as shown below.

  Attach the project:
    Domain     localhost:8080
    User Name  admin
    Password   ****
    Project    Teleco
    [OK] [Cancel]

  If authentication fails at login, an error appears (a repository interface error). After authentication it displays the designer canvas and asks which type of job you want to do: Mainframes, Parallel, Job Sequence, or Server jobs.
- DS Director: it is for viewing the executed jobs - status, logs, warnings.

After clicking on Parallel jobs, go to the tool bar - View - Palette. In the palette the 8 categories of stages are displayed for designing a job: General, Data Quality, Data Base, File, Development & Debug, Processing, Real Time, Re-Structure.

17.1. File Stage:

Q: How can data be read from files?
- The file stage can read only flat files, and the formats of flat files are .txt, .csv, .xml.
  - In .txt there are different formats like fwf, sc, csv and so on.
  - .csv means comma separated values.
  - .xml means extensible markup language.
- Stages are categorized into two groups: 1 - Active stage (whatever stage does the transformation is called an active stage); 2 - Passive stage (whatever stage extracts or loads is called a passive stage).
- In the File stage there are sub-stages like sequential file, data set, file set and so on.

Example of how a job can execute: one sequential file (SF) to another SF (source -> target).
- The source file requires target/output properties, and the target file requires input/source properties.

General properties of a sequential file - in the source file, how do we read a file? On double clicking the source file:
1. Select a file name (browse it), e.g. File: \ c:\data\se_source_file.txt (the File: \? option is for multiple purposes).
2. Format selection: as per the input file taken, the data must be in the given format - tab / space / comma - and one of them must be selected.
3. Column structure defining (Load): to get the structure of the file - Import -> Sequential file -> browse the file and import -> select the imported file -> define the structure.

These three are the general properties when we design a simple job.
DAY 18 - Sequential File Stage

The sequential file stage at the source is set through its "output properties", and at the target through its "input properties". The sequential file stage is used for a single structure format only.

About the Sequential File stage and how it works:
- Step 1: the sequential file is a file stage; it reads flat files with different extensions (.txt, .csv, .xml).
- Step 2: the SF reads/writes sequentially by default, when it reads/writes from a single file; and it also reads/writes in parallel when it reads/writes to or from multiple files.
- Step 3: the sequential stage supports one input (or) one output, and one reject link.

Link: a link is also a stage; it transfers data from one stage to another stage. Links are divided into categories: stream link, reject link, and reference link.

Link markers - they show how the link behaves in the transmission from source to target:
1. Ready BOX: it indicates that "a stage is ready with its meta data"; data transfers from a sequential stage to a sequential stage.
2. FAN OUT: it indicates "a data transfer from a sequential stage to a parallel stage"; it is also called auto partition.
3. FAN IN: it indicates "a data transfer from a parallel stage to a sequential stage"; it is done when collecting happens.
4. BOX: it indicates "a data transfer from a parallel stage to a parallel stage"; it is also known as partitioning.
5. BOW-TIE: it indicates "a data transfer from a parallel stage to a parallel stage"; it is also known as re-partitioning.

NOTE: "A stage is an operator; an operator is a pre-built component." The stage imports an import operator for the purpose of creating the data in Native Format. Native Format is DataStage's understandable format.

Link colours - the link colour indicates the process during the execution of a job:
- RED: case 1 - a stage is not connected properly; case 2 - the job aborted.
- BLACK: the stage is ready.
- BLUE: the job execution is in process.
- GREEN: the execution of the job finished.

Compile: a compiler is a translator from source code to target code (HLL -> ALL -> binary code; e.g. a C function compiles to an .OBJ and an .EXE). The compiling process in DataStage: the GUI design is translated into OSH code & C++ and then into machine code (MC). OSH is the Orchestrate Shell Script; OSH is generated for all stages except one, the Transformer stage, which is done by C++. While compiling, it checks for:
- Link requirements (checks the links)
- Mandatory stage properties
- Syntax rules

DAY 19 - Sequential File Stage Properties

Properties:
- Read Method: two options -
  - Specific File: the user or client gives each file name specifically.
  - File Pattern: we can use wild card characters and search by pattern, i.e. * and ?. For example: C:\eid*.txt, C:\eid??.txt.
- Reject Mode: to handle "format / data type / condition" mismatched records. Three options -
  - Continue: drops the mismatches and continues with the other records.
  - Fail: the job is aborted.
  - Output: captures the dropped data through the reject link into another sequential file.
- Missing File Mode: this option is used if any file name is missing. Two options -
  - OK: drops the file name when it is missing.
  - Error: if a file name is missing, it aborts the job.
- First Line is Column Names: true/false - if true, the first row is treated as the column names and is not read as data; if false, the first row is also read as an ordinary record.
- Row Number Column: "the source record number at the target" - it directly adds a new column to the existing table and displays in that column which source record number each target row came from.
- File Name Column: "the source information at the target" - it gives the information about which record came from which address (file) on the local server. It also directly adds a new column to the existing table and displays that information in it.
- Read First Rows: "will get you the top first n rows" - this option asks for an n value and displays the first n records.
- Filter: "blocking unwanted data based on UNIX filter commands", e.g. grep, egrep and so on:
    grep "moon"      \\ case sensitive; displays only the records containing moon
    grep -i "moon"   \\ ignores the case; displays all moon records
    grep -w "moon"   \\ displays the exact word-match records
- Read From Multiple Nodes: with this we can read the data in parallel using the sequential stage - reading in parallel is possible, but loading in parallel is not possible.
- LIMITATIONS of the SF:
  - It does sequential processing (processes the data sequentially).
  - Memory limit of 2 GB (.txt format).
  - The problem with the sequential file is the conversions, like ASCII - NF - ASCII - NF.
  - It lands (or resides) the data "outside the boundary" of DataStage.

DAY 20 - General Settings of DataStage and About the Data Set

Default setting to start up with a parallel job: Tools -> Options -> select a default -> "create new" (it asks which type of job you want - mainframes / parallel / sequence / server). After setting this, when we restart DS Designer it goes directly to the designer canvas.

General Stage: in this stage group some of the stages are used for commenting on a stage - what it behaves like or what it performs.
- According to the naming standards, every stage has to be named. Naming a stage is simple: just right click on a stage, the rename option is visible, and name the stage as per the naming standards.
- Annotation: it is for a stage comment - simply giving comments for a stage.
- Description Annotation: it is used for the job title (only one title can be kept).

Data Set (DS): "It is a file stage, and it is used for staging the data when we design dependent jobs."
- The Data Set overcomes the limitations of the sequential file stage, for better performance.

Q: In which format does the data move between the source file and the target file?
A: If we send a .txt file from the source it is in ASCII format, because a .txt file supports only ASCII while DataStage supports only the Native Format. The source needs to import an operator that converts the ASCII code into the Native Format (understandable to DataStage); at the target an operator converts the Native Format back into ASCII, so the data is visible to the user/client in .txt format.

- In the Data Set the data lands in the "Native Format". The "Native Format" is also called a Virtual Dataset.
- By default the Data Set sends the data in parallel.

Q: How does the Data Set overcome the sequential file limitations?
- By default the data is processed in parallel.
- It can hold more than 2 GB.
- No need of conversion, because the Dataset data directly resides in the Native Format (DataStage reads only the orchestrate format). The data lands in the DataStage repository.

Example: src_f.txt (ASCII) is read, converted and landed as trg_f.ds, with its structure saved as "st_trg". We can reuse the saved "trg_f.ds" file name, and the saved structure st_trg of trg_f.ds, in another job - that is how the conversion becomes easy with the Data Set.
- A Data Set can read only a Native Format file.
- The Data Set extension is *.ds.

DAY 21 - Types of Data Set (DS)

There are two types of Data Set:
- Virtual (temporary): a Data Set where the data moves in the link from one stage to another stage, i.e. the link holds the data temporarily.
- Persistency (permanent): the data sent from the link lands directly into the repository; that data is permanent.

Alias of the Data Set: ORCHESTRATE FILE, OS FILE.

Q: How many files are created internally when we create a data set?
A: A Data Set is not a single file; it creates multiple files internally:
- Descriptor file: it contains the schema details and the address of the data.
- Data file: it consists of the data in the Native Format and resides in the DataStage repository.
- Control file and Header file: they reside in the operating system and both act as the interface between the descriptor file and the data file.
A physical file means it is stored on the local drive / local server; it is permanently stored under the install program files, e.g. c:\ibm\...\server\datasets ("pools").

Q: How can we organize a Data Set (view / copy / delete it) in real time?
A: Case 1: we can't directly delete the Data Set. Case 2: we can't directly see or view it. The Data Set is organized using utilities:
- Using the GUI, i.e. the Data Set Management utility under Tools: Tools -> Data Set Management -> select the file_name.ds (e.g. dataset.ds) -> then we see the general information of the dataset: the schema window, data window, copy window and delete window.
- Using the command line, we have to start with $orchadmin:
    $orchadmin rm dataset.ds   \\ this is the correct process; this command removes a dataset
    $rm dataset.ds             \\ this is the wrong process; it cannot be written like this
    $ds records                \\ to view the files in a folder

Q: What is the operator associated with the Dataset?
A: The Dataset doesn't have any operator of its own, but it uses the copy operator as its operator.

Dataset version:
- Datasets have version control; a dataset has a version for the different DataStage versions. The default in version 8 is that it saves in version 4.1, i.e. v41.
Q: How do we perform version control at run time?
A: We have to set an environment variable for this. Navigation: Job properties -> Parameters -> Add environment variable -> Compile -> Dataset version ($APT_WRITE_DS_VERSION) -> click on that. After doing this, when we want to save the job, it will ask which version you want.
DAY 22 - File Set & Sequential File (SF) Input Properties

File Set (FS): "It is also for staging the data." The Data Set and the File Set are the same, but they have minor differences, shown below:

  Data Set                                      File Set
  - Has parallel extendable capabilities        - Has parallel extendable capabilities
  - More than the 2 GB limit                    - More than the 2 GB limit
  - NO reject link with the Dataset             - Reject link within the File Set
  - DS is exclusively for internal use in       - External applications can create an FS;
    the DataStage environment                     we can use it with any other application
  - Copy (file name) operator                   - Import / Export operator
  - Native format                               - Binary format
  - Saves files with the .ds extension          - .fs extension

- The Data Set has more performance than the File Set.
- The file stage is used the same way when designing dependent jobs.

Sequential File Stage: input properties.
Setting the input properties at the target file - at the target there are four properties:
1. File Update Mode: three options - append / create (error if exists) / overwrite.
   - Append: when multiple files or a single file are sent to the sequential target, it appends one file after another into a single file.
   - Create (error if exists): it just creates the file, and errors if it already exists or is given wrongly.
   - Overwrite: it overwrites one file with another file.
   Setting a passing value at run time (for the file update mode): Job properties -> Parameters -> Add environment variables -> Parallel -> Automatically overwrite ($APT_CLOBBER_OUTPUT).
2. Cleanup On Failure: two options - true / false.
   - True: partially written records are cleaned up if the job fails.
   - False: it simply keeps whatever was appended or overwritten, including partial records.
3. First Line is Column Names: two options - true / false.
   - True: it enables the first row or record as the column (field) names.
   - False: it simply reads every row, including the first row, as a record.
4. Reject Mode: the reject mode here is the same as in the output properties we discussed already. There are three options - continue / fail / output:
   - Continue: it just drops the record when the format / condition / data type mismatches, and continues processing the remaining records.
   - Fail: it just aborts the job when a format / condition / data type mismatch is found.
   - Output: it captures the dropped record data.

DAY 23 - Development & Debug Stage

In development & debug we have 6 types of stages, divided into three categories:
1. Stages that generate data: a. Row Generator, b. Column Generator.
2. Stages that are used to pick sample data: a. Head, b. Tail, c. Sample.
3. The stage that helps in debugging: Peek.

23.1. Stages that generate data.

Row Generator: "It has only one output." The row generator is for generating sample data. It is used in some cases:
- When the client is unable to give the data.
- For doing testing purposes.
- To make the job design simple (it serves as a stub while building jobs).

Row Generator design: Row Generator -> DS_TRG.
Navigation for the Row Generator - open the RG properties:
- Properties: Number of records = XXX (a user defined value).
- Column: load the structure (meta data) if it exists, or we can type it there.

The Row Generator can generate the junk data automatically by considering the data type. For example, for n = 30, data is generated for the 30 records and the junk data is generated considering the data type. Alternatively we can manually set some related, understandable data by giving user defined values.

Q: How do we generate user defined values instead of junk data?
A: First go to the RG properties -> Column -> double click the serial number or press Ctrl+E -> Generator -> Type = cycle / random (for the integer data type).
- Under the cycle type there are three options for generating data: increment, initial value, and limit.
  - Increment = 45: it generates a cycle of values starting from 45, adding 45 to every number after that.
  - Initial value = 30: it starts from 30 only.
  - Limit = 20: it generates values up to the limit number, in a cycle form.
- Under the random type there are three options: limit, seed, and signed.
  - Limit = 20: it generates random values up to the limit of 20, and continues if there are more than 20 rows.
  - Seed = XX: it generates the junk data for the random values.
  - Signed: it generates signed values for the field (values between -limit and +limit); otherwise it generates values between 0 and +limit.

Column Generator: "it has one input and one output."
- The main purpose of the column generator is to add extra columns to a table; in the output, the junk data is generated for the added columns.
- Design: Sequential File -> Column Generator -> DataSet.
- Coming to the column generator properties (open them by double clicking):
  - Stage -> Options -> Column to generate = ? (and so on, up to what is required).
  - Output -> Mapping: after adding the extra column it is visible here. The mapping should be done in the column generator properties; simply drag the existing table to the right-side table, and drop the created column into the existing table.
  - Column: we can change the data type as you require.
- The junk data is generated automatically for the extra added columns. For manual control we can generate some meaningful data for the extra columns.
  Navigation for manual: Column -> Ctrl+E -> Generator -> Algorithm = two options, "cycle / alphabet".
  - Cycle: it has only one option, value - the same as shown above in the row generator.
  - Alphabet: it also has only one option, string. Q: when we select alphabet with string = naveen? A: it generates a different row for each character of the given string, alphabet-wise.

DAY 24 - Pick Sample Data & Peek

24.1. Pick sample data: "it is a debug stage; there are three types of pick sample data" - Head, Tail, and Sample.

- Head: "it reads the top 'n' records of every partition." It has one input and one output, and in the head stage the mapping must and should be done.
  Design: SF_SRC -> HEAD -> DS_TRG.
  Properties of Head:
  - Rows -> All Rows (after skip) = false: it is to copy all rows to the output, following any requested skip positioning.
  - Rows -> Number of rows (per partition) = XX.
  - Partitions -> All partitions = true: it copies the number of rows from input to output per partition. True copies rows from all partitions; false copies from specific partition numbers, which must be specified.

- Tail: "it is a debug stage that can read the bottom 'n' rows from every partition." The tail stage has one input and one output.
  Design: SF_SRC -> TAIL_F -> DS_TRG.
  Properties of Tail: the properties of head and tail are similar, as shown above. The mapping is done in the tail output properties; in this stage the mapping must and should be done.
o Mainly we must give the value for “number of rows to display”  Sample: “it is also a debug stage consists of period and percentage” o o Period: means when it’s operating is supports one input and one output. o In this stage mapping must and should do.  Tail: “it is debug stage.  Percentage: it reads from one input to multiple outputs. Target1 Target2 SF_SRC SAMPLE Navs notes Page 72 2010 . target = 0 Percentage = 15 . o Coming to the properties  Options - Percentage = 25 and we must set target =1 Percentage = 50 . o Link Order: it specifies to which output the specific data has to be send.DataStage SF_SRC SAMPLE DS_TRG  Period: if I have some records in source table and when we give ‘n’ number of period value it displays or retrieves the every nth record from the source table.  Skip: it also displays or retrieves the every nth record from given source table. o Mapping: it should be done for multiple outputs. target = 2 o Here we setting target number that is called link order. we must assign o Number of row = value? o Peek record output mode = job log and so on.2. PEEK: “it is a debug stage and it helps in debugging stage” SF_SRC It is used in three types they are PEEK 1. 24.DataStage Target3 NOTE: sum of percentage of all outputs must be less than are equal to ‘<=’ to ‘n’ records of input records. And it can use as stub stage. Send the data into logs. It considers 90% as 100% and it distributes as we specify. 2. When sample receives the 90% of data from source. o In the percentage it distributes the data in percentage form. Q: How to send the data into logs?  Opening properties of peek stage. It can use as copying the data from Source to multiple outputs. as per options Navs notes Page 73 2010 . 3. Oracle Enterprise o Properties of Oracle Enterprise(OE): Data Set Navs notes Page 74 . because in some situations a client requires only dropped data.We see here ‘n’ values of records and fields Q: When the peek act as copy stage? A: It is done when the sequence file it doesn’t send the data to multiple outputs.DataStage o If we put column name = false. it reads tables from the oracle data base from source to the target” o Oracle enterprise reads multiple tables from. Oracle Enterprise: “Oracle enterprise is a data base stage. but it loads in the one output.1. it doesn’t shows the column in the log. and dynamic RDBMS and so on. ODBC enterprise. Tara data with ODBC. In that time the peek act as copy stage. 2010 o In DS Director  From Peek – log – peek . In that time the stub stage acts as a place holder which holds the output data as temporary. Q: What is Stub Stage? A: Stub Stage is a place holder. DAY 25 Database Stages In this stage we have use generally oracle enterprise. 25. and its sends the rejected data to the another file.  For seeing the log records that we stored. If we select table option Table = “<table name>” Connection Password = ***** User = Scott Remote server = oracle o Navigations for how the data load to the column  This is for already data present in plug-in. Table \\ giving table name here User Defined \\ here we are giving user defined SQL query. • • • Select load option in column Then we go to import Import “meta data definition” o Select related plug-in   Oracle User id: Scott Navs notes Page 75 . If table not in the not their in plug-in. 
• • • •  Select load option in column Going to the table definitions Than to plug-in Loading EMP table from their.DataStage  Read Method have four options • • • •  •  • • • Auto Generated \\ it generated auto query 2010 SQL Builder \\ its new concept apart comparing from v7 to v8. Data connection: its main purpose is reusing the saved properties. Q: What we can do when we don’t know how to write a select command? A: Selecting in read method = SQL Builder  After selecting SQL Builder option from read method o Oracle 10g o From their dragging which table you want o And select column or double clicking in the dragged table   There we can select what condition we need to get. Q: How to reuse the saved properties? A: navigation for how to save and reuse the properties  Opening the OE properties o Select stage  Data connection • There load saved dc Navs notes Page 76 . in define we must change hired date data type as “Time Stamp”.  But by the first read method option.x2 we don’t have saving and reusing the properties. Q: A table containing 300 records in that.5. we can auto generate the query by that we can use by coping the query statement in user-defined SQL. I need only 100 fields from that? A: In read method we use user-defined SQL query to solve this problem by writing a query for reading 100 records. NOTE: in version 7. It is totally automated. 2010 After importing into column.DataStage   • Password: tiger After loading select specific table and import. 2010 DAY 26 ODBC Enterprise ODBC Enterprise is a data base stage About ODBC Enterprise:  Oracle needs some plug-ins to connect the DataStage. Oracle Enterpris e Navs notes ODBC Enterpris e ORACLE DB Page 77 OS .DataStage o Naveen_dbc \\ it is a saved dc o Save in table definition. When DataStage version7 released that time the oracle 9i provides some drivers to use. But ODBC needs OS drivers to hit oracle or to connect oracle data base.  When coming to connection oracle enterprise connects directly to oracle data base. Q: How database connect using ODBC? ODBCE First step: opening the properties of ODBCE  Read method = table o Table = EMP  Connection Data Set o Data Source = WHR \\ WHR means name of ODBC driver Navs notes Page 78 2010 .DataStage Directly hitting Use OS drivers to hit the oracle db  Difference between Oracle Enterprise (OE) and ODBC Enterprise OE  Version dependent  Good performance  Specific to oracle  Uses plug-ins  No rejects at source ODBCE  Version independent  Poor performance  For multiple db  Uses OS drivers  Reject at SRC &TRG.  Best Feature by using ODBC Connector is “Schema reconciliation”. That automatically handles data type miss match between the source data types and DataStage data types.  Differences between ODBCE and ODBC Connector.DataStage o Password = ****** o User = Scott  Creating of WHR ODBC driver at OS level.  In this we can test the connection by test button.  Using ODBC Connector is quick process as we compare with ODBCE.  ODBCE read sequentially and load ODBC  It provides the list have in ODBC DSN. o Administration tools  ODBC • Add o MS ODBC for Oracle    Giving name as WHR Providing user name= Scott And server= tiger.  In the ODBCE “no testing the connection”.  It read parallel and loads parallel (good performance). to over this ODBC connector were introduced. ODBCE Connector  It cannot make the list of Data Source Name (DSN). 2010  ODBCE driver at OS level having lengthy process to connect. Navs notes Page 79 . 1. 
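Coming back to the User-Defined SQL read method described above: the query typed there can restrict both the columns and the rows that DataStage pulls from Oracle. A small sketch (assuming the standard SCOTT.EMP sample table; the column list and the 100-row cap are only illustrative values, not part of these notes):

    -- user-defined SQL for the Oracle Enterprise read method:
    -- read only the needed columns and cap the row count at the source
    SELECT EMPNO, ENAME, SAL, DEPTNO
    FROM   EMP
    WHERE  ROWNUM <= 100;

Doing the restriction inside the SELECT means the unwanted rows never cross the connection into the job.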
It’s having ‘n’ number of sheets in that.  Connections o DSN = EXE o Password = ***** o User = xxxxx  Column o Load  Import ODBC table definitions • • Navs notes DSN \\ here select work book User id & password Page 80 . MS Excel with ODBCE:  First step is to create MS Excel that is called “work book”.  For example CUST work book is created Q: How to read Excel work book with ODBCE? A: opening the properties of ODBCE  Read method = table o Table = “empl$” \\ when we reading from excel name must be in double codes end with $ symbol.DataStage  Properties of ODBC Connector: o Selecting Data Source Name DSN = WHR 2010 o User name = Scott o Password = ***** o SQL query 26. 1  After these things we must open the properties of ODBCE o Read method = table  Table = financial.customer Navs notes Page 81 .csv 26.0.DataStage o Filter \\ enable by click on include system tables o And select which you need & ok 2010  In Operating System o Add in ODBC  MS EXCEL drivers • Name = EXE \\ it is DSN Q: How do you read Excel format in Sequential File? A: By changing the CUST excel format into CUST. Q: How to read Tara Data with ODBC A: we must start the Tara Data connection (by clicking shortcut). Tara Data with ODBCE:  Tara Data is like an oracle cooperation data base. which use as a data base.2.0. o And in OS also we must start  Start ->control panel ->Administrator tools -> services -> • Tara Data db initiator \\ must start here o Add DSN in ODBC drivers   Select Tara data in add list We must provide details as shown below • • • User id = tduser Password = tduser Server : 127. 0.0. which we have load in source. DAY 27 Dynamic RDBMS and PROCESSING STAGE 27.1.DataStage o Connections     Column o Load  Import • • • • Table definitions\plug-in\taradata Server: 127. it is also called as DRS”  It supports multiple inputs and multiple outputs Navs notes Page 82 . Dynamic RDBMS: “It is data base stage.1 Uid = tduser Pwd = tduser DSN = tduser 2010 Uid = tduser Pwd = tduser  After all this navigation at last we view the data. Navs notes Page 83 2010 .DataStage Ln_EMP_Data Data Set DRS Ln_DEPT_Data Data Set  It all most common properties of oracle enterprise.  Coming to DRS properties o Select db type i.  We can solve this problem with DRS that we can read multiple files and load in to multiple files.  In oracle enterprise we can read multiple files. but we can’t load into multiple files.. oracle o Oracle   o Scott Tiger \\ for authentication At output   Ln_EMP_Data \\ set emp table here And Ln_DEPT_Data \\ set dept table here o Column  Load • Meta data for table EMP & DEPT.e. Surrogate key 27. Transformer Stage: The symbol of Transformer Stage is Navs notes Page 84 2010 • IWay can use in source only to set in output properties.DataStage Some of data base stages: • Netezza can use in target only to set in input properties. Look UP 3. They are. but we use 10 stages generally. Slowly changing dimension 8. 1. Sort 10. Processing Stage: In this 28 processing stages are there.2. . Transformer 2. Remove duplicates 7. 27. And the 10 stages are very important. Join 4.3. Modify 9. Copy 5. Funnel 6. 2010 Q: calculate the salary and commission of an employee from EMP table. IN. For example. o For this we can functions in derivation  IN.COMM \\ we can write by write clicking their  It visible in input column\function\ and so on. Navs notes Page 85 .  After that when we execute the null values records it drops and remaining records it sends to the target.e. we can write derivation here. 
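As a quick aside before the NETSAL column is built below: the SAL + COMM derivation runs into the usual null problem (COMM is null for most employees), which is exactly what the NullToZero function used a little later is for. In plain Oracle SQL the same idea is NVL (an analogy only, assuming the SCOTT.EMP table):

    -- a NULL commission is treated as 0, so the addition never returns NULL
    SELECT EMPNO, SAL + NVL(COMM, 0) AS NETSAL
    FROM   EMP;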
That column we name as NETSAL By double clicking on the NETSAL.  Properties of Transformer Stage: o For above question we must create a column to write description     In the down at output properties clicking in empty position.  Transformer Stage is “all in one stage”.SAL + IN. setting the connection and load Meta data in to column here.COMM) o By this derivation we can null values records as target.SAL + NullToZero (IN.DataStage A simple query that we solving by using transformer i. source field and structure available mapping should be do. Oracle Enterprise Transformer Data Set Here. NS Variables to adding column 1 NS 0 integer 4 0 After adding NS column  To NS column including the derivation.COMM)) – 200 Else (IN. how to include this logic in derivation? A: adding THome column in output properties.COMM))> 2000   Then (IN. IN. Stage Variable: “it is a temporary variable which will holds the value until the process completes and which doesn’t sent to the result to output”  Stage variable is shown in the tool bar of transformer properties. DAY 28 Transformer Functions-I Examples on Transformer Functions: Navs notes Page 86 . o NETSAL = NS o THome = if (NS > 2000) then (NS -200) else (NS + 200).SAL + NullToZero (IN.SAL + NullToZero (IN.COMM).COMM) ) + 200 o By this logic it takes more time in huge records.SAL + NullToZero (IN.SAL + NullToZero (IN.DataStage Q: NETSAL= SAL + COMM +200.  After clicking that it visible in the input properties In stage variable we must add a column for example.  In THome derivation part we include this logic o 2010 Logic: if NETSAL > 2000 then TakeHome = NETSAL – 200 else TakeHome = NETSAL If (IN. so the best way to over this problem is Stage Variable.  Adding these derivations to the input properties to created columns. but it Parallel Transformer effects on compile time. extended filter) 3.3) 2010 3. Constraints Function (Filter) For example. lookup) Constraints: “In transformer constraints used as filter. means constraints is also called as filter” Q: how a constraint used in Transformer? A: in transformer properties.R(L(7). Navs notes  Basic Tx can call the Routines which is in basic and shell Page 87  It supports wide range of language or multiple .  Right Function using the above for question . Source level 2. Right Function 4. we will see a constraints row in output link. Substring Function Filter: DataStage in 3 different ways 1. a word MINDQUEST. Concatenate Function 5. Stages (filter. There we can write the derivation by double clicking. Field Function 6.DataStage 1.  Basic Tx can only execute up to SMP.3)  Left Function – L(R(5). Left Function 2.  Can execute in any platform. Differences between Basic transformer and parallel transformer:  Its effects on performance. Basic Transformer  Don’t effects on performance. from that word we need only QUE.3)  Substring – SST(5. switch. Constraints (transformer. 28 HINVC43205CID67632120080405EUO TPID5630 8 1657.DataStage NOTE: Tx is very sensitive with respect to Data Types.txt HINVC23409CID45432120080203DOL TPID5650 5 8261.96 HINVC12304CID46762120080304EUO TPID5640 3 5234.69 TPID5657 7 6218. right.64 Design: IN1 IN2 Navs notes Page 88 2010 . substring functions and date display like DD-MM-YYYY? A: File.13 TPID5637 1 2343. if an source and target be cannot different data types.00 TPID5645 2 7855. Q: How the below file can read and perform operation like filtering. separating by using left.57 TPID5635 6 9564.99 TPID5655 4 2861.67 TPID5657 9 7452. 1) Left (Right (IN2. Step 2: IN1 Tx. 
8] INVCNO Page 89 2010 . here creating four column and separating the data as per created columns. DS IN1 REC IN1 CONSTRAINT Left (IN1.DATA [20.DATA. 1) IN1. Here. in the properties of sequential file loading the whole data into one record.1)=”H” IN2 Derivation Column Left (IN1.DataStage SF Tx1 IN3 Tx2 OUT Tx3 Total five steps to need to solve the given question: Step 1: Loading file.Properties.REC.REC.txt into sequential file. IN2 TYPE DATA Navs notes IN3 Left (IN1. 9) CID IN2. Means here creating one column called REC and no need of loading of Meta data for this. 21).REC.REC DATA TYPE Step 3: IN2 Tx properties. we are creating two columns TYPE and DATA. in this step we are filtering the “H” staring records from the given file. Stage Variable IN3 INVCNO CID BILL_DA TE CURR Derivation Column Right (IN3. setting the output file name for displaying the BILL_DATE.CID CID D:’-‘: M:’-‘: Y Step 5: here.INVCNO INVCNO IN3. 2) Right (Left (IN3.BILL_DATE. DAY 29 Transformer Functions-II Examples on Transformer Functions II: Navs notes Page 90 2010 . 4) D M Y OUT Derivation Column IN3. here BILL_DATE column going to change into DD-MM-YYYY format using Stage Variable.BILL_DATE. 6). 2) Left (IN3.BILL_DATE.DataStage Derivation Column Step 4: IN3 Tx properties. Compact White Spaces: “it removes before. @333. spaces”. Q: A file. after.STATE 111. comma delimiters and spaces (before. Field Function: “it separates the fields using delimiter support”.MH Design: IN1 SF Tx IN2 Tx Navs notes Page 91 . MUnNA.ENAME. Trim B: “it removes all after spaces”.txt EID. anvesh. NaVeen. Sra van. Trim F: “it removes all before spaces”. 7. 3. KN 555. 5. @ San DeeP. and in between). middle one. after.DataStage 1. Strip White Spaces: “it removes all spaces”. Trim T & L: “it removes all after and before spaces”. Trim: “it removes all special characters”. 6.txt consisting of special character. 4. How to solve by above functions and at last it to be one record? File. AP TN 222@. 2010 2. KN@ 444. txt using above functions: Step 1: Here.”@”.e. IN2 IN3 Derivation Column EID ENAME STATE Navs notes Trim(IN2. Step 2:IN1.’. extracting the file.DataStage IN3 2010 OUT Tx Total Five steps to solve the File. IN1 REC Derivation DS IN2 Column Field(IN2.REC.””) EID Upcase(Trim(SWS(IN2. using field functions. that REC to divide into fields by comma delimiter i. Tx properties  Here.’.REC. spaces. to remove special characters.3) EID ENAME STATE Step 3: IN2.’. Strip Whitespaces (SWS). no need of load meta data to this.REC.”@”.2) Field(IN2.txt and setting into all data into one record to the new column created that REC.’.ENAME. Tx properties  In link IN1 having the REC.’.  Point to remember keep that first line is column name = true.””)) ENAME Page 92 .’.EID..1) Field(IN2. Up case functions. lower cases into upper cases by using the trim. ENAME: IN3. here assigning a target file. spaces were removed after doing are implementing the transformer functions to the above file.DataStage Step 4: IN3. all rows that divided into fields are concatenating means adding all records into one REC. Final output: Trg_file. Tx properties  Here. And at last the answer will display in one record but all special characters.EID: IN3.txt.ds REC 111NAVEEN AP 222 MUNNATN 333SRAVAN KN 444SAN DEEPKN 555 ANVESHMH 29. IN3 OUT Derivation Column EID ENAME STATE IN3.1.STATE REC Step 5:  For the output. Column Import Column Export: Navs notes Page 93 2010 . Re-Structure Stage: 1. Column Export 2. 
o Input     o Output   Column Import:  “it is used to explore from single column into multiple columns” and it is also like field separator in the transformer function.  Properties: o Input   o Output     Import column type = “varchar” Import output column= EID Import output column= ENAME Import output column= STATE DAY 30 JOB Parameters (Dynamic Binding) Column method= Column To Import = REC Export column type = “varchar” Export output column = REC Column method = explicit Column To Export = EID Column To Export = ENAME Column To Export = STATE 2010  Properties: Navs notes Page 94 .DataStage  “it is used to combine the multiple of columns into single column” and it is also like concatenate in the transformer function. But coming to version8 we can reuse them by technique called parameter set”. To give runtime values for user ID. Under parallel compiler. this is up to version7. password. and remote server? Navs notes Page 95 . Here table name must be static bind. reporting will available. because of some security reasons. we must provide the table and load its meta data. o Existing: comes with in DataStage. in this two types one general and another one parallel. NOTE: “The local parameters that created one job they cannot be reused in other job.  Global Variables: “it is also called as environment variables”. For this we can use job parameters that can provide values at runtime to authenticate. job only”. They are. Job parameters: “job parameters is a technique that passing values at the runtime. it is divided into two types. o User Defining: it is created in the DataStage administrator only. it can use with in the 2010 dynamic binding”. But in version7 we can also reuse parameters by User Define values by DataStage Administrator. operator specific. it is also called dynamic binding”.  Job parameters are divided into two types.DataStage Dynamic Binding: “After compiling the job and passing the values during the runtime is known as  Assuming one scenario that when we taking a oracle enterprise. Q: How to give Runtime values using parameters for the following list? a.  But there is no need for giving the authentication to oracle are to be static bind. they are o Local variables o Global Variable  Local variables (params): “it is created by the DS Designer only. c.  Job parameters o Parameters Name  a b c DNAME USER Password SERVER DEPT BONUS DRIVE FOLDER TARGET Type string Encrypted String List Integer String String String Default value SCOTT ****** ORACLE 10 1000 C:\ Repository\ dataset. a. Department number (DNO) to keep as constraint and runtime to select list of any number to display it? d.DataStage b. Navs notes Page 96 . Providing target file name at runtime? e. Add BONUS to SAL + COMM at runtime? ORACLE Step1: Tx Data Set “Creating job parameters for given question in local variable”. Re-using the global and parameter set? Design: 2010 c. b.ds UID PWD RS DNO BONUS IP FOLDER TRG FILE      d   Here. d are represents a solution for the given question. Step 2:“Creating global job parameters and parameter set”.  For Re-use. and TEST”. PRD. 
we must o Add environment variables  User defined • • • UID $UID PWD $PWD RS $RS Step 3: “Creating parameter set for multiple values & providing UID and PWD other values for DEV.DataStage  DS Administrator o Select a project • 2010  Properties General o Environment variables  User defined (there we can write parameters) Default value SCOTT ****** ORACLE Name UID PWD RS DNAME USER Password SERVER Type string Encrypted String  Here. global parameters are preceded by $ symbol.  In local variables job parameters o Select multiple of values by clicking on  And create parameter set • Providing name to the set o SUN_ORA  Saving in Table definition • In table definition Navs notes Page 97 . UID SUN_ORA. Properties:  Read method = table o Table = EMP  Connection o Password = #PWD# o User = #UID# o Remote Server = #RS# Column:  Load o Meta data for EMP table Parameters Insert job parameters $UID $PWD variables $RS SUN_ORA.RS UID PWD Local variables global environment Navs notes Page 98 .DataStage o Edit SUN_ORA values to add Name DEV PRD TEST UID SYSTEM PRD TEST PWD ****** ****** ****** SERVER SUN ORACLE 2010 MOON  For re-using this to another job. o Add parameters set (in job parameters)  Table definitions • Navs o SUN_ORA(select here to use) NOTE: “Parameter set use in the jobs with in the project only”. Step 4: “In oracle enterprise properties selecting the table name and later assign created job parameter as shown below”.PWD parameter set SUN_ORA. DataStage Step 5: 2010 “In Tx properties dept no using as a constraint and assign bonus to bonus column”. Stage Variable IN EID ENAME STATE SAL COMM DEPTNO Derivation Column IN.SAL + NullToZero(IN.COMM) NS OUT Constraint: IN.DEPTNO = DNO Derivation Column IN.EID IN.ENAME NS NS+BONUS EID ENAME NETSAL BONUS Here, DNO and BONUS are the job parameters we have created above to use here. For that simply right click->job parameters->DNO/BONUS (choose what you want) Step 6: “Target file set at runtime, means following below steps to follow to keep at runtime”.  Data set properties o Target file= #IP##FOLDER##TRGFILE# Here, when run the job it asks in what drive, and in which folder. At last it asks what target file name you want. Navs notes Page 99 DataStage DAY 31 Sort Stage (Processing Stage) Q: What is sorting? “Here sorting means higher than we know actually”. Q: Why to sort the data? “To provide sorted data to some sort stages like join/ aggregator/ merge/ remove duplicates for the good performance”. Two types of sorting: 1. Traditional sorting: “simple sort arranging the data in ascending order or descending 2010 order”. 2. Complex sorting: “it is only for sort stages and to create group id, blocking unwanted sorting, and group wise sorting”. In DataStage we can perform sorting in three levels:  Source level: “it can only possible in data base”.  Link level: “it can use in traditional sort”.  Stage level: “it can use in traditional sorting as well as complex sorting”. Q: What is best level to sort when we consider the performance? “At Link level sort is the best we can perform”. Source level sort: o It can be done in only data base, like oracle enterprise and so on. o How it will be done in Oracle Enterprise (OE)? Navs notes Page 100 DataStage  Go to OE properties • Link level sort: o Select user define SQL 2010 o Query: select * from EMP order by DEPTNO. Here sorting will be done in the link stage that is shown how in pictorial way. o And it will use in traditional sorting only. o Link sort is best sort in case of performance. 
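Before the link-sort design that follows, here is what the source-level alternative looks like when both inputs of a join are ordered in the database itself (a sketch against the SCOTT sample schema; the column lists are illustrative):

    -- primary input, sorted on the join key inside Oracle
    SELECT EMPNO, ENAME, SAL, DEPTNO FROM EMP ORDER BY DEPTNO;

    -- reference input, sorted on the same key
    SELECT DEPTNO, DNAME, LOC FROM DEPT ORDER BY DEPTNO;

Because both SELECTs order on DEPTNO, the Join stage downstream already receives its data grouped on the key, and no extra sort is needed in the job.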
OE JOIN DS Q: How to perform a Link Sort? “Here as per above design, open the JOIN properties”.  And go to partitions o Select partition technique (here default is ‘auto’)  Mark “perform sort” • • When we select unique (it removes duplicates) When we select stable (it displays the stable data) Q: Get all unique records to target1 and remaining to another target2? “For this we must create group id, it indicates the group identification”. Navs notes Page 101 False = disables the group id.DataStage  It is done in a stage called sort stage. blocking unwanted sorting. True = enables group id. and group wise sorting in some sort stage like join. in the properties of the sort stage and in the options by keeping create key change column (CKCC) = “true”. Sort Stage  Complex sort means to create group id. Sort Properties:  Input properties o Sorting key = EID (select the column from source table) o Key mode = sort (sort/ don’t sort (previously sorted)/ don’t sort (previously grouped)) o Options   Create cluster key change column = false (true/ false) Create key change column = (true/ false) • •  Output properties o Mapping should be done here.  Traditional sort means sorting in ascending order or descending order. merge. aggregate. and remove duplicates. Navs notes Page 102 . that it can sort the data in traditional sort or in complex sort”. 2010  Here we must select to which column group id you want. default is false. Sort Stage: “It is a processing stage. ENAME.txt EID. loans 111. munna. File. loans 222. current 111. munna. credit 111. kumar. naveen.DataStage DAY 32 A Transformer & Sort stage job Q: Sort the given file and extract the all addresses to one column of a unique record and count of the addresses to new column. kumar. savings Design: SF Sort1 DS Navs notes Page 103 2010 . kumar. naveen. current 222.munna. munna. loans 222. ACCTYPE 111. insurance 333. savings 333. keychange = 1) then IN2. current. ACCTYPE 111. savings.ENAME func1 ACCTYPE ENAME  For this logic output will displays like below: EID. munna.ACCTYPE func1 else func1 :’.loans. kumar.  Transformer (TX): here logic to implement operation for target. current. insurance 111. current 111. munna. loans COUNT 1 2 3 4 1 2 3 1 2 Navs notes Page 104 2010 Tx Sort2 . munna. savings 111.’: IN2. o Properties of TX: Stage Variable IN2 EID ENAME ACCTYP E KeyChan Derivation Column if (IN2.keychange=1) then 1 else c+1 OUT Derivation Column IN2. credit . credit 222. kumar. insurance.  Sort1: here sorting key = EID  And enables the CKCC for group id.ACCTYPE if(IN2.DataStage  Sequential File (SF): here reads the file.munna.EID EID IN3. savings 333.loans 222. kumar. current 333. savings. current. naveen. credit . naveen. loans 222.txt for the process. ENAME. savings. case sensitive Mapping should be doing here.DataStage  Sort2: o Here. partitioning Options= ascending.  Data Set (DS): o Input:  partition type: hash o Sorting: Navs notes Page 105 2010 . in the properties we must set as below.  Stage • Key=ACCTYPE o o Sort key mode = sort Sort order = Descending order  Input • • Partition type: hash Sorting o Perform sort   Stable (uncheck) Unique (check this) o Selected     Output • Key= count Usage= sorting. sav 333. ACCTYPE. loans 222. naveen. o Source File: here we have option called filter there we can write filter commands like “grep “moon”/ grep –I “moon”/ grep –w “moon” ”. o Data Base: by write filter quires like “select * from EMP where DEPTNO = 10”.  It can only have 128 cases. 
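As an aside on the sort-and-transformer job of DAY 32 above: the "one row per customer, addresses strung together, plus a count" result can also be written in Oracle SQL with LISTAGG and COUNT (only an analogy for checking the expected output; it assumes the File.txt data has been loaded into a table called CUST_ACCT with columns EID, ENAME, ACCTYPE):

    -- one row per EID with the account types concatenated and counted
    SELECT   EID, ENAME,
             LISTAGG(ACCTYPE, ',') WITHIN GROUP (ORDER BY ACCTYPE) AS ACCTYPES,
             COUNT(*) AS CNT
    FROM     CUST_ACCT
    GROUP BY EID, ENAME;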
 Stage Filter: o “Stage filters use in three stages. Navs notes  SWITCH can only one Page 106 condition can perform. partition Ascending 111. o Difference between if and switch:  Poor performance.  It have ‘n’ number of cases. munna. curr. and they are 1.DataStage  Perform sort Stable (check this) EID. Switch and 3. sav. insu. they are 1. External filter”.loans. COUNT 4 3 2 DAY 33 FILTER STAGE 2010   Unique (check this) Final output: o Selected    Key= EID Usage= sorting. loans Filter means “blocking the unwanted data”. current. Filter. Constraints  Source Level Filter: “it can be done in data base and as well as in file at source level”. kumar. Source level 2. 2. credit .  Better SWITCH performance than IF. ENAME. IF  IF can write ‘n’ number of column in condition. Stage level 3. . In DataStage Filter stage can perform in three level. DataStage o Here filter is like an IF. switch as switch. o 1 – input 128 – outputs 1 . o 1 – input n – outputs 1 – reject SWITCH  Condition on single column. o Differences between three filter stages.default EXTERNAL  It is using by the GREP commands. n outputs. and one reject link”. FILTER FILTER  Condition on multiple columns.  It have.  It have.  The symbol of filter is Filter Q: How the filter stage to send the data from source to target? Design: DS Navs notes T 1 Page 107 2010 .  It have. o 1 – input 1 – output no rejects Filter stage: “it having one input. DataStage OE Reject T 2 DS DS Step1:  Connecting to the oracle for extracting the EMP table from it. Step2: Filter properties  Predicates o Where clauses = DEPT NO =10  Output link =1 o Where clauses = SAL > 1000 and SAL < 3000  Output link = 2 o Output rejects = true // it is for output reject data.  Link ordering o Order of the following output links  Output: o Mapping should be done for links of the targets we have.  Step3:  “Assigning a target files names in the target”. Here, Mapping for T1 and T2 should be done separately for both. Navs notes Page 108 2010 Filter DataStage It have no reject link, we must convert a link as reject link. Because it has ‘n’ number of outputs. 2010 DAY 34 Jobs on Filter and properties of Switch stage Assignment Job 1: a. Only DEPTNO 10 to target1? b. Condition SAL>1000 and SAL<3000 satisfied records to target2? c. Only DEPTNO 20 where clause = SAL<1000 and SAL>3000 to target3? d. Reject data to target4? Design to the JOB1: T Filter EMP_TBL T T Filter Navs notes Page 109 DataStage T 2010 Step1: “For target1: In filter where clause for target1 is DEPTNO=10 and link order=0”. Step2: “For target2: where clause = SAL>1000 and SAL<3000 and link order=1”. Step3: “For target3: where clause= DEPTNO=20 and link order=0”. Step4: “For target4: convert link into reject link and output reject link=true”. Job 2: a. All records from source to target1? b. Only DEPTNO=30 to target2? c. Where clause = SAL<1000 and SAL>3000 to target3? d. Reject data to target4? Design to the JOB2: T Copy EMP_TBL T T Filter T Navs notes Page 110 All duplicates records of DEPTNO to target2? c. Condition SAL>1000 & SAL<3000. Job 3: a. Step2: “For target2 where clause = DEPTNO=30 and link order =0”. Only DEPTNO 10 records to target4? e.DataStage Step1: “For target1 mapping should be done output links for this”. Step3: “For target3 where clause = SAL<1000 and SAL>3000 and link order=1”. Step4: “For target4 convert link into reject link and output reject link=true”. but no DEPTNO=10 to target5? Design to the JOB3: K= T Filter EMP_TBL K= T TT T Navs notes Page 111 2010 . 
All records to target3? d. All unique records of DEPTNO to target1? b. 128 – outputs and 1. Step3: “For target3: mapping should be done output links for this”. SWITCH Stage: “Condition on single column and it has only 1 – input. Step2: “For target2: where clause = keychange=0 and link order=1”. Picture of switch stage: Properties of Switch stage:  Input o Selector column = DEPTNO  Cases o values Case = 10 = 0 link order o Case = 20 = 1  Options Navs notes Page 112 2010 . Step4: “For target4: where clause= DEPTNO=10”. Step5: “For target5: in filter properties put output rows only once= true for where clause SAL>1000 & SAL<3000”.DataStage Filter T Step1: “For target1: where clause = keychange=1 and link order=0”.default”.  Example filter command: grep “newyork”. Combining: “in DataStage combining can done in three types”. 1-output. 2010 Fail= if any records drops job aborts.  To perform a text file. first it must read in single record in the input. Sequential File  External Filter properties: External Filter Data Set o Filter command = grep “newyork” o Grep –v “newyork” \\ other than new it filters. and 1-reject link. Output= to view reject data through the link.  It having 1-input.  They are Navs notes Page 113 . which can perform filter by UNIX commands”.DataStage o If no found = options (Drop/ fail/ output)    Drop= drops the data and continue the process. DAY 35 External Filter and Combining External Filter: “It is processes stage. and MERGE. o Treatment of unmatched records. LOOKUP. ENO EName DNo 111 10 222 DNo LOC 10 20 40 naveen munna DName IT SE SA HYD SEC DNO DNAME LOC ENO ENAME H C Here we can combine Navs notes Page 114 . o This stage that perform by JOIN. and o Memory usage.  Selection of primary table is situation based.DataStage o Horizontal combining o Vertical combining 2010 o Funneling combining Horizontal combining: combining primary rows with secondary rows based on primary key. DAY 36 Horizontal Combining (HC) and Description of HC stages Horizontal Combining (HC): “combining the primary rows with secondary rows based on primary key”. o Inputs requirements.  These three stages differs with each other with respect to.  T1 T2 2010 Right outer join. o o Treatment of unmatched records. o Input names.  T1 (T1  T2) Right Outer Join: “Matched primary & secondary and unmatched secondary records”. o Key column names. o De – duplication (removing duplicates). Memory usage.  T2 (T1  T2) Full Outer Join: “Matched primary & secondary and unmatched primary & unmatched secondary records”. and Left Outer Join: “Matched primary & secondary and unmatched primary records”. o Input output rejects. 20. o Input requirements with respect to sorting. 40} Inner Join: “Matched primary and secondary records”. Left outer join. 30} and T2= {10. Navs notes Page 115 . o Join types.DataStage Inner join. They are.  T1  T2 Description of HC stages: “The description of horizontal combining is divided into nine parts”. and o Types of inner join. full outer join If T1= {10. 20. and full outer join. Join Types: Inner join. 2010 N – Inputs (normal) 2 – inputs (sparse) 1 – output. target (keep) Drop Reject Page 116 (unmatched secondary Navs notes Secondary: Drop (inner) . and last SRC is right table. The first table is master table and remaining tables are updates tables. Target (continue). and 1 – reject N – inputs 1 – output (n – 1) rejects. And all middle SRC’s are intermediate tables. LOJ. 
Inner Join Left outer join Inner join Left outer join :: Input Requirements with respect to sorting:: Primary: mandatory Optional Optional Mandatory Mandatory Secondary: ::De – Duplication (removing the duplicates):: Primary: OK (nothing happens) OK Warnings OK Warnings Secondary: OK :: Treatment of Unmatched Records:: Primary: Drop (inner) Target (Left) Drop. ROJ) 2 – inputs (FOJ) 1 – output. and 1 – LOOKUP The first link from source is primary/ input and remaining links are lookup/ references links. lookup. Input output rejects: N – inputs (inner.DataStage  The differences between join. JOIN MERGE Input names: When we work on HC with JOIN the first SRC is left table. and merge with respect to above nine points are shown below. reject (unmatched primary records) Drop Drop. left outer join. right outer join. ENAME.  “Look up stage is for cross verification of primary records with secondary records”. DNAME. LOC Navs notes Page 117 2010 . DNO  Reference table as DEPT with column consisting of DNO.DataStage :: MEMORY USAGE:: Light memory :: Key Column Names:: Must be SAME :: Type of Inner Join :: ALL ALL ANY Optional Same in case of lookup file set Must be SAME Heavy memory Light memory DAY 37 LOOKUP stage (Processer Stage) Lookup stage:  In real time projects. 95% of horizontal combining is used by this stage.  DataStage version8 supports four types of LOOKUP. they are o Normal LOOKUP o Sparse LOOKUP o Range LOOKUP o Case less LOOKUP For example in simple job with EMP and DEPT tables:  Primary table as EMP with column consisting of EID. Navs notes Page 118 2010 .DataStage DEPT table (reference/ lookup) EMP table (Primary/ input) LOOKUP Data Set (target) LOOKUP properties for two tables: Primary Table ENO ENAM E DNO Target ENO ENAM E DNAM Reference Table DNO DNAM E LOC Key column for both tables  It can set by just drag from primary table to reference table to DNO column.  To set sparse lookup we must adjust key type as sparse in reference table only.  But we have a option to remove the case sensitive i.  Fail: its aborts job.  Sparse lookup: “is cross verification of primary records with secondary at source level itself”. its supports only two inputs.e. DAY 38 Sparse and Range LOOKUP Sparse LOOKUP:  If the source is database. in that we have to select  Continue: this option for Left Outer Join. But in ONE Case sparse LOOKUP stage can supports ‘n’ references.  Reject: it’s captured the primary unmatched records. 2010  Drop: it is to Inner Join. if a primary unmatched records are their. By taking lookup file set Navs notes Page 119 .  Normal lookup: “is cross verification of primary records with secondary at memory”. Note: sparse lookup not support another reference when it is database. By default Normal LOOKUP is done in lookup stage.DataStage In tool bar of LOOKUP stage consists of constraints button. o Key type = case less. Case less LOOKUP: In execution by default it acts as a case sensitive.. DataStage Job1: a sequential file extracting a text file to load into lookup file set (lfs). LFS …………………… LFS SF LOOKUP DS  In lookup file set. o Address of the target must save to use in another job. we must paste the address of the above lfs.lfs extension. Job2: in this job we are using lookup file set as sparse lookup. Navs notes Page 120 .  Lookup file supports ‘n’ references means indirectly sparse supports ‘n’ references. 2010 Sequential file  Here in lookup file set properties: Lookup file set o Column names should same as in sequential file. o Target file stored in . 
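To tie the lookup options above back to familiar SQL (an analogy only, on the usual EMP/DEPT sample tables; it says nothing about how the stage works internally): the Continue option behaves like a left outer join, the Drop option like an inner join, and the Reject link carries the unmatched primary rows that the left outer join would pad with nulls:

    -- Continue: unmatched EMP rows survive with NULL department columns
    SELECT e.EMPNO, e.ENAME, d.DNAME, d.LOC
    FROM   EMP e LEFT OUTER JOIN DEPT d ON e.DEPTNO = d.DEPTNO;

    -- Drop: unmatched EMP rows are discarded
    SELECT e.EMPNO, e.ENAME, d.DNAME, d.LOC
    FROM   EMP e JOIN DEPT d ON e.DEPTNO = d.DEPTNO;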
Condition for LOOKUP stage:  How to write a condition in the lookup stage? o Go to tool bar constraint. Data type should be same Funnel stage it is process to append the records one table after the one. o In condition.primary= “AP” o For multiple links we can write multiple conditions for ‘n’ references. there we will see condition box. Columns should be same 2. but above four conditions has to be meet. Copy and Modify stages Funnel Stage: “It is a processing stage which performs combining of multiple sources to a target”.  How to set the range lookup: In LOOKUP properties:  Select the check box for column you need to condition. for example: in.DataStage  “Range lookup is keeping condition in between the tables”. To perform the funnel stage some conditions must to follow: 1. DAY 39 Funnel. Columns names should be case sensitive 4. Columns names also should be same 3. Navs notes Page 121 2010 Range LOOKUP: . 2010 Simple example for funnel stage: ENO EN GEN 111 HYD 222 naveen M munna Loc T X ENO GEN Copy /Modi fy EN ADD EMPID EName Loc Company GEN 444 555 IT SA Country 1 0 In this column names has change as primary table. Drop the columns. Copy Stage: “It is processing stage which can be used from”. 3. Stub stage. 1. Charge the column names. DEL INDIA IBM NY USA IBM Funnel operation three modes:  Continues funnel: it’s random. NOTE: best for change column names and drop columns. 4. Navs notes Page 122 . Copying source data to multiple targets.DataStage In this stage the column GEN M has to exchange into 1 and F=0.  Sequence: collection of records is based on link order.  Sort funnel: it’s based on key column values. 2. Navs notes Page 123 .  At runtime: Data Set Management (view the operation process)  Specification: <new column name> DOJ=HIREDATE<old column> o Here to change column name. Keep the columns. Drop the columns. 2010 “It is processing stage which can perform”. In modify properties:  Specification: drop SAL. 3. DEPTNO o Here drops the above columns. Oracle Enterprise Modify Data Set From OE using modify stage send data into data set with respect to above five points. MGR.DataStage Modify Stage: 1. Change the column names.  Specification: keep SAL. MGR. remaining columns were drops. 2. 5. 4. Modify the data types. Alter the data. DEPTNO o Here accept the columns. and intermediate tables.  Input requirements with respect to sorting: it is mandatory in primary and secondary tables. 1. no reject. treatment of unmatched records.output. right table.  Join stage input names are left table. 2 – inputs (FOJ).  Join stage having n – inputs (inner. and full outer join. left outer join. Navs notes Page 124 . ROJ). right outer join.  Types of Join stage are inner.DataStage  Specification: <new column name>DOJ=DATE_FROM_TIMESTAMP(HIREDATE) <old column> 2010 o Here changing the column name with data type. LOJ. and memory usage. DAY 40 JOIN Stage (processing stage) Join stage it used in horizontal combining with respect to input requirements. DataStage  Input requirements with respect to de – duplication: nothing happens means it’s OK when de – duplication.  All types of inner join will supports. in this no scope from third table that’s why FOJ have two inputs. o Right Outer JOIN comes in right table. Left Outer JOIN comes in left table. drops and when it is LOJ will keep all records in target. 2010  Treatment of unmatched records: in primary table when the option Inner its simple  Key column names should be SAME in this stage. 
And in secondary table in Inner option it’s drops and it ROJ will keep all records in target. that job can executes but its effect on the performance (simply say WARNINGS will occurs) Navs notes Page 125 .  In join stage when we sort with different key column names. o Full Outer JOIN comes both tables. A simple job for JOIN Stage: JOIN properties:  Need a key column o Inner JOIN.  Memory usage: light memory in join stage. treatment of unmatched records. 1 – output. and left outer join. DN. 2010 DAY 41 MERGE Stage (processing stage) Merge stage is a processing stage it perform horizontal combining with respect to input requirements. and memory usage.DataStage  We can change the column name by two types Copy stage and with query statement. and (n – 1) rejects for merge stage. Example of SQL query: select DEPTNO1 as DEPTNO. and Loc from DEPT.  Input requirements with respect to sorting is mandatory to sort before perform merge stage.  Merge stage input names are master and updates. Navs notes Page 126 .  Join types of this stage are inner join.  N – inputs. unmatched records of the unmatched primary table records.  The key column names must be the SAME.DataStage  Input requirements with respect to de – duplication in the primary table it will get warnings when we don’t remove the duplicates in primary table. And in secondary table drops and reject it captures the unmatched secondary table records.  Merge operates with only two options o Keep (left outer join) o Drop (inner Join) Simple job for MERGE stage: PID PRD_DESC PRD_MANF 11 indica tata 22 swift maruthi 33 civic PID PRD_SUPP PRD_CAT 11 abc XXX 33 xyz XXX 55 pqr XXX 77 mno XXX PID PRD_AGE PRD_PRICE 11 4 1000 22 9 1200 66 3 1500 88 9 1020 Master Table Master table Update (U1) Update (U2) Navs notes Page 127 .  All changes information stores in the update tables. And in secondary  Treatment of unmatched records in primary table Drop (drops).  In type of inner join it compares in ANY update tables. Target (keep) the 2010 table nothing will happens its OK when we don’t remove the duplicates.  In the merge stage the memory usage is LIGHT memory. NOTE:  Static information stores in the master table.  Here COPY stage is acting as STUB Stage means holding the data with out sending the data into the target. DAY 42 Remove Duplicates & Aggregator Stages Remove Duplicates: “It is a processing stage which removes the duplicates from a column and retains the first or last duplicate rows”.DataStage TRG 2010 U1 U2 or Reject (U1) In MERGE properties:  Merge have inbuilt sort = (Ascending Order/Descending Order) Reject (U2)  Must to follow link order. Sequential File Remove Duplicates Data Set Navs notes Page 128 .  NOTE: there has to be same number of reject links as update links or zero reject links.  Merge supports (n-1) reject links. SF Properties of Aggregator:  Grouping keys: o Group= Deptno  Aggregator Aggregator DS o Aggregator type = count rows (count rows/ calculation/ re – calculation) o Count output column= count <column name> 1Q: Count the number of all records and deptno wise in a EMP table? 1 Design: OE_EMP Copy of EMP Counting rows of deptno TRG1 Navs notes Page 129 .e.DataStage Properties of Remove Duplicates:  Two options in this stage. 2010 o Key column= <column name> o Dup to retain=(first/last) Remove Duplicates stage supports 1 – input and 1 – output. NOTE: for every n – input and n – output stages should must done mapping. 
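The retain-first/retain-last choice above can also be pictured in Oracle SQL with an analytic ROW_NUMBER (an analogy only, on the SCOTT.EMP table, keying on DEPTNO; the ordering column is just an example):

    -- keep the first row of each DEPTNO group; ORDER BY EMPNO DESC would keep the last
    SELECT *
    FROM  (SELECT e.*,
                  ROW_NUMBER() OVER (PARTITION BY DEPTNO ORDER BY EMPNO) AS rn
           FROM   EMP e)
    WHERE  rn = 1;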
Aggregator: “It is a processing stage that performs count of rows and different calculation between columns i. group by same operation in oracle”. and in target two company wise maximum? 2 Design: OE_emp copy of emp max.Column for calculation = SAL <column name> Operations are  Maximum value output column = max <new column name>  Minimum value output column = min <new column name>  Sum of column = sum <new column name> and so on. doing calculation on SAL based on DEPTNO.Aggregation type = calculation . 2Q In Target one dept no wise to find maximum.DataStage Generating a column counting rows of created column TRG2 For doing some group calculation between columns: Example: Select group key Group= DEPTNO . minimum. sum of deptno trg1 Company: IBM max of IBM trg2 3Q: To find max salary from emp table of a company and find all the details of that? Navs notes Page 130 2010 . Here. min. and sum of rows.  The subsequent is alter is called incremental load i. min.. coming from OLTP also source is after data. Incremental load  Initial load: complete dump in dimensions or data warehouse i.e. Navs notes Page 131 2010 .DataStage & 4Q: To find max. sum of salary of a deptno wise in a emp table? dummy dno=10 3 & 4 Design: compare emp max(deptno) UNION ALL diving dno=20 compare copy min(deptno) dummy dno=30 company: IBM compare maximum SAL with his details max (IBM) DAY 43 Slowly Changing Dimensions (SCD) Stage Before SCD we must understand: types of loading 1. Initial load 2.e. target also before data is called Initial load.. DataStage Example: #1 Before data (already data in a table) CID 11 CNAME A ADD HYD GEN M BALANCE Phone No 30000 988531068 8 After data (update n insert at source level data) CID 11 CNAME A ADD SEC GEN M BALANCE Phone No 60000 988586542 2 Column fields that have changes types: Address – slowly change Balance – rapid change Phone No – often change Age – frequently AGE 25 AGE 24 Example: #2 Before Data: CID 11 22 33 CNAME A B C ADD HYD SEC DEL After Data: (update ‘n’ insert option loading a table) CID 11 22 CNAME A B ADD HYD CUL Navs notes Page 132 2010 . active flag.e. . not having primary key that need system generated primary key.  In SCD – II.  And when SCD – II performs we get a practical problem is to identify old and current record. surrogate key.  Record version: it is concept that when the ESDATE and EEDATE where not able to use is some conditions. and effect end date. and no historical data were organized”.. Here surrogate key acting as a primary key.. they are  SCD – I  SCD – II  SCD – III  SCD – IV or V  SCD – VI Explanation: SCD – I: execution. With some special “it only maintains current update. effect start date.DataStage 33 D PUN We have SIX Types of SCD’s are there. it updates the before data with after data and no history present after the operation columns they are. effect start date (ESDATE) and effect end date (EEDATE).e.  Unique key: the unique key is done by comparing. As per SCD – I. i. SCD – II: “it maintains both current update data and historical data”. new concepts are introduced here i.  In SCD – II. surrogate key. That we can solve by active flag: “Y” or “N”. SCD – III: SCD – I (+) SCD – II “maintain the history but no duplicates”. Navs notes Page 133 2010  Extracting after and before data from DW (or) database to compare and upsert. DataStage SCD – IV or V: SCD – II + record version 2010 “When we not maintain date version then the record version useful”. 20. 
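All of the types above come down to comparing the after data with the before data and then upserting into the dimension. For the simplest case, SCD Type 1, that compare-and-upsert is roughly a single Oracle MERGE (a sketch only, using the SRC and DIM tables created later in these notes; dim_skid_seq is an assumed sequence standing in for the surrogate key generator):

    -- Type 1: overwrite the current value, keep no history
    MERGE INTO dim d
    USING src s
    ON    (d.sno = s.sno)
    WHEN MATCHED THEN
         UPDATE SET d.sname = s.sname
    WHEN NOT MATCHED THEN
         INSERT (skid, sno, sname)
         VALUES (dim_skid_seq.NEXTVAL, s.sno, s.sname);

The SCD stage and the upsert write method in the load job do this work for you; the MERGE is only meant to make the intent concrete.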
Example table of SCD data: SID 1 2 3 4 5 6 7 8 CID 11 22 33 22 44 11 22 55 CNAME A B C B D A B E ADD HYD SEC DEL DEL MCI GDK RAJ CUL AF N N Y N Y Y Y Y ESDATE 03-06-06 03-06-06 03-06-06 08-09-07 08-09-07 30-11-10 30-11-10 30-11-10 EEDATE 29-11-10 07-09-07 9999-12-31 29-11-10 9999-12-31 9999-12-31 9999-12-31 9999-12-31 RV 1 1 1 2 1 2 3 1 UID 1 2 3 2 5 1 2 8 Table: this table is describing the SCD six types and the description is shown above. SCD – VI: SCD – I + unique identification. 40 DS_TRG_DIM -update and insert OE_SRC Navs notes Page 134 .20. 20. 20.20. 40 10. 40 After dim OE_UPSERT 10. DAY 44 SCD I & SCD II (Design and Properties) SCD – I: Type1 (Design and Properties): Transfer job 10. 40 Load job DS_TRG_DIM 10.30 OE_DIM before fact DS_FACT 10. o Insert into src values(222. SNO number. Step 2: “SCD1 properties” Fast path 1 of 5: Fast path 2 of 5: select output link as: fact navigating the key column value between before and after tables AFTER SNO SNAME KEY EXPR COLUMN N PURPOSE SKID surrogate key AFTER. o Insert into src values(333. Table1: o Insert into src values(111. ‘kumar’). SNAME varchar2(25)). SNAME varchar2(25)). ‘naveen’). Processes of transform job SCD1: Step 1: Load plug-in Meta data from oracle of before and after data as shown in the above links that coming from different sources. ‘munna’). 2010 BEFORE  Create table SRC(SNO number.txt Page 135 Source type: Flat file Navs notes . o No records to display. source name: D:\study\navs\empty.DataStage In oracle we have to create table1 and table2.SNO SNO business key Fast path 3 of 5: selecting source type and source name. Table2:  Create table DIM(SKID number. e. in load job if we change or edit in the source table and when you are loading into oracle we must change the write method = upsert in that we have two options they are..DataStage NOTE: for every time of running the program we should empty the source name i. Fast path 4 of 5: select output in DIM.SNO SNO BEFORE SKID SNO SNAME Step 3: In the Next job. 2010 empty.SKID SKID AFTER.e. Navs notes Page 136 . else surrogate key will continue with last stored value.SNO SNO business key For path 5 of 5: setting the output paths to FACT data set.txt. -update n insert \\ if key column value is already. AFTER DIM SNO SNAME Derivation COLUMN N PURPOSE next sk() SKID surrogate key AFTER. i. AFTER FACT SNO SNAME Derivation COLUMN N BEFORE. Before table CID CNAME SKID 10 abc 1 20 xyz 2 30 pqr 3 Target Dimensional table of SCD I CID 10 20 40 CNAME SKID abc 1 nav 2 pqr 3 After table CID CNAME 10 abc 20 nav 40 pqr SCD – II: (Design and Properties): Transfer job 10. 30. 20.DataStage -insert n update \\ if key column value is new. 20. 20. 30. 20. 20. 20. 20. 40 Load job DS_TRG_DIM 10.30 before OE_DIM fact DS_FACT 10. 30. 40 OE_UPSERT -update and insert OE_SRC DS_TRG_DIM Step 1: in transformer stage: Navs notes Page 137 2010 Here SCD I result is for the below input .20. 40 After dim 10. 40 10. SNAME SNAME COLUMN SNO In SCD II properties: Fast path 1 of 5: select output link as: fact Fast path 2 of 5: navigating the key column value between before and after tables BEFORE AFTER KEY EXPR SNO SNAME COLUMN N PURPOSE SKID surrogate key AFTER.SNO SNO business key SNAME Type2 ESDATE experi date Page 138 Navs notes 2010 .SNO BEFORE.SKID SKID BEFORE.DataStage Adding some columns to the to before table – to covert EEDATE and ESDATE columns into time stamp transformer stage to perform SCD II In TX properties: BEFORE BEFORE_TX SKID SNO SNAME ESDATE EEDATE ACF Derivation NAM BEFORE. txt.e. 
AFTER SNO SNAME FACT Derivation COLUMN NAME BEFORE.SNO SNO business key AFTER.ESD ESDATE BEFORE SKID SNO SNAME ESDATE EEDATE ACF Navs notes Page 139 2010 Source type: Flat file . empty.SNO SNO AFTER.SNAME SNAME BEFORE.SNAME SNAME Type2 curr date() ESDATE experi date - Date from Julian (Julian day from day (current date ()) – 1) For path 5 of 5: setting the output paths to FACT data set..SKID SKID AFTER. else surrogate key will continue with last stored value. DIM SNO SNAME Derivation COLUMN N PURPOSE Expires next sk() SKID surrogate key AFTER.txt NOTE: for every time of running the program we should empty the source name i. source name: D:\study\navs\empty. Fast path 4 of 5: AFTER select output in DIM.DataStage Fast path 3 of 5: selecting source type and source name. in load job if we change or edit in the source table and when you they are. \\ if key column value is new.e. Simple example of change capture: Navs notes Page 140 . i.DataStage Step 3: In the Next job. that it capture whether a record from table is copy or edited or insert or to delete by keeping the code column name”. -update n insert -insert n update \\ if key column value is already. 2010 are loading into oracle we must change the write method = upsert in that we have two options Here SCD II result is for the below input Before table CID CNAME SKID ESDATE EEDATE ACF 10 abc 1 01-10-08 99-1231 Y 20 xyz 20 01-10-08 Target Dimensional table of SCD II CID CNAME SKID ESDATE EEDATE ACF 10 abc 1 01-10-08 99-1231 Y 20 xyz 2 01-10-08 09-12-10 N 20 xyz 4 10-12-10 After table CID CNAME 10 abc 20 nav 40 DAY 45 Change Capture. Change Apply & Surrogate Key stages Change Capture Stage: “It is processing stage. DataStage Change_capture Properties of Change Capture:  Change keys o Key = EID (key column name)   Change valves o Values =? \\ ENAME o Values =? \\ ADD  Options o Change mode = (explicit keys & values / explicit keys. that it applies the changes of records of a table”. values) o Drop output for copy = (false/ true) “false – default ” o Drop output for delete = (false/ true) “false – default” o Drop output for edit = (false/ true) “false – default” o Drop output for insert = (false/ true) “false – default”      Sort order = ascending order Copy code = 0 Delete code = 2 Edit code = 3 Insert code = 1 Code column name = <column name> o Log statistics = (false/ true) “false – default” Change Apply Stage: “It is processing stage. Navs notes Page 141 2010 . DataStage Change Apply Properties of Change Apply:  Change keys o Key = EID   Options o Change mode = explicit key & values o Check value columns on delete = (false/ true) “true .x2 Design of that ESDATE=current date () EEDATE= “9999-12-31” Key=EID ACF= “Y” -option: e k & v Before.default” o Log statistics = false o Code column name = <column name> \\ change capture and this has to be SAME for apply operations Sort order = ascending order SCD II in version 7.txt key= EID -option: e k & v Navs notes Page 142 2010 .txt c=3 c=all after.5. In version 8. a surrogate key stage used for generates the system key column values that are like primary key values.5.  But by taking tail stage with that we tracing the last value and storing into the peek stage that is in buffer.txt ESDATE.if c=3 then DFJD(JDFD(CD())-1) else EEDATE = “9999-12-31” ACF.x2: “identifying last value which generated for the first time compiling and running the job in surrogate key stage.5.DataStage before.x2. But it generate at first compile only. 
And that job in version 7.txt) file and storing last value information in that file.5.if(c=3) then “N” else “Y” SURROGATE KEY Stage: In version 7.  With that buffer value we can generate the sequence values that are surrogate key in version 7. for that reason in version 7 we have to do a another job to store a last generated value”.0: “The above problem with version7 is over comes by version 8. and by using that it generates the sequence values” Navs notes Page 143 2010 .x2: design SF Sk copy ds Tail peek  In this job.0 surrogate key by taking an empty text(empty.current date () EEDATE. txt Source type = flat file Option 2: database type= oracle (DB2/ oracle) Source name = sq9 (in oracle – create sequence sq9)\\ it is like empty.txt Password= tiger User id= scott Server name= oracle Source type = database sequence DAY 46 DataStage Manager Export: “Export is used to save the group of jobs for the export purpose that where we want”.DataStage Before. Navigation .txt SK Data Set Properties of SK version8: Option 1: generated output column name = skid Source name = g:\data\empty.“how to export”? DataStage toolbar  Change selection: AD D or o Job components to export REMOV E or SELECT ALL Here there are three options are Export job designs with executables(where applicable) Navs notes Page 144 2010 . Options of import are o DataStage components… o DataStage components (xml)… o External function definitions o Web services function definitions o Table definitions o IMS definitions  In IMS two options are..DataStage o Export to file  Source name\. o Import from file Give the source name to import …. • • Database description (DBD) Program Specification Block (PSB / PCB)  In DataStage components. Export job designs without executables Export job executables without designs 2010 Where we want locate the export file. Navs notes Page 145 .dsx or ... o Type of export  dsx By two options we can export file - dsx 7 – bit encoded xml Import: “It is used to import the ..xml extensions to a particular project and also to import some definitions as shown below”.. that it generates a report to a job instantly”. Fast name – server name or system name. For that. Resource – memory associated with node.DataStage Import all Import selected Generate Report: overwrite without query 2010 perform impact analysis “It is for to generate report to a job or a specific. 4. Pools – logical area where stages are executed. go to  File o Generate report  Report name • Options Use default style sheet Use custom style sheet After finishing the settings:  It’s generates in default position “/reportingsendfile/ send file/ tempDir. Navs notes Page 146 .tmp” Node Configuration: Q: To see nodes in a project: o Go to run director  Check in logs • Double click on main program: APT config file Q: What are Node Components? 1. 2. 3. Node name – logical CPU name. . o Name of configuration file is C:\ibm\..apt will have the single node information. o We can create new node by option NEW NEW o  Save the things after creating new nodes Navs notes Page 147 .  “c:\ibm\information server\scratch” Q: What node that handles to run each and every job and name of the configuration file? o Every job runs on APT node as on below name that is default for every job..apt Q: How to run a job on specific configuration file? o Job properties  Parameters • Add environment variables o Parallel  Compiler • Config file (Add $APT_CONFIG_FILE) Q: How to create a new Node configuration File? 
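(A short aside on surrogate key Option 2 mentioned above, before the node configuration answer below: the database sequence it points at is created and read in Oracle roughly like this; sq9 is the name used in these notes, the rest is a sketch.)

    CREATE SEQUENCE sq9 START WITH 1 INCREMENT BY 1;

    -- each call hands out the next surrogate key value
    SELECT sq9.NEXTVAL FROM dual;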
Q: How to create a new node configuration file?
  - Tools -> Configurations: there we see default.apt.
  - We can create a new node with the NEW option, then save the changes (Save / Save configuration as) after creating the new nodes.
  - NOTE: it is best to create at most 8 or 16 nodes in a project.

Q: If a uniprocessing system with 1 CPU needs a minimum of 1 node to run a job, how many minimum nodes does an SMP with 4 CPUs (2^0, 2^1, ... CPUs and so on) need?
  - Still only 1 node.

Q: How to run a job from within a job?
  - Job properties -> Job control -> select a job.
  - Job properties -> Dependencies -> select the job (first compile that job before the main job).

Advanced Find: "it is a new feature of version 8." It finds objects related to a job, such as
  1. Where used,
  2. Dependencies, and
  3. Compare reports.

Q: Repository Advanced Find (the Advanced Find palette)?
  - Name to find: Nav*
  - Folder to search: D:\datastage\
  - Type, Creation, Last modification, Where used
  - Find objects that use any of the following objects (options: Add, Remove, Remove all)
  - Dependencies of job
  From the results we can: Cross project compare…, Compare against, Export, Multiple job compile, Add to palette, Create copy, Locate in tree, Find dependencies.

Q: How to find the dependencies of a job?
  - Go to the toolbar -> Repository -> Find dependency: it lists all types of objects the job depends on.

DAY 47

DataStage Director

DS Director maintains:
  - Schedule
  - Monitor
  - Views (Job view, Status view, Log view)
  - Message handling
  - Batch jobs
  - Unlocking

Schedule: "schedule means a job can be made to run at specific timings."
  - Right-click the job in the DS Director -> "Add to schedule…" -> set the timings. By right-clicking we can also filter.
  - In real time the job sequence is scheduled with external tools (this happens in production only): Control-M, cron tab, Autosys. These schedulers usually start jobs through the dsjob command line; a sketch follows at the end of this section.

Purge: "purge means cleaning or deleting the already created job logs."
  - In the job log we can narrow things down with the FILTER option.
  - Navigation to set the purge: toolbar -> Job -> Clear log -> choose Immediate purge or Auto purge.

Monitor: "it shows the status of a job: the number of rows executed, started at (time), elapsed time, rows/sec and the percentage of CPU used."
  - Right-click the job -> Monitor: "it shows the performance of a job." For a simple two-link job it looks like:

      Status     rows   started at   elapsed time   rows/sec   %CPU
      Finished   6      (sys time)   00:00:03       2          9
      Finished   6      (sys time)   00:00:03       2          7

  - NOTE: based on this we can check the performance tuning of a stage in a particular job.

Reasons for warnings:
  - Default warnings in the Sequential File stage are
      1. "Import warning at record 0, at offset: 0"
      2. "Field '<column name>' has import error and no default value; data: {eid}"
      3. "Import unsuccessful at record 0" / missing record delimiter "\r\n", saw EOF instead (format mismatch).
    These three warnings can be solved by a single Sequential File option: First line is column names = true (the default is false).
  - A length mismatch between source and target, e.g. source length (10) and target length (20).
  - When we are working on a lookup.
  - When the secondary input of a Merge has duplicates we will get a warning.
  - When the inputs of a Join are sorted on a different key column.
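The production schedulers mentioned above (Control-M, cron tab, Autosys) normally start DataStage jobs through the dsjob command line rather than the Director GUI. The lines below are a hedged sketch: the project name, job name and parameter are placeholders, and the exact flags (in particular any authentication options) vary by release, so check dsjob's usage output on your installation.

```
# run a job, passing one parameter, and wait for it to finish
dsjob -run -mode NORMAL -param SRC_FILE=D:\study\navs\emp.txt -wait my_project scd_job

# then check its status and summarise the log to review warnings
dsjob -jobinfo my_project scd_job
dsjob -logsum my_project scd_job
```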
Abort a job:

Q: How can we abort a job conditionally?
  - When we run a job, in the Run options we can keep a limit on warnings: No limit / Abort job after: <n> warnings.
  - In a Transformer stage, on the constraint grid, the Otherwise/Log row has "Abort after rows: 5" (if 5 records do not meet the constraint, the job simply aborts).
  - We can keep the same kind of constraint in a Range Lookup as well.

Message Handling: "if the warnings cannot be fixed, then we come to message handling."
  - Navigation for adding a rule to handle a warning: in the job log, right-click the warning -> Add rule to message handler -> choose one of two options, Suppress from log or Demote to information, and add the rule.

Batch jobs: "executing a set of jobs in an order."

Q: How to create a batch?
  - DS Director -> Tools -> Batch -> New (give the name of the batch) -> add jobs to the created batch -> just compile after adding them.

Allow multiple instances: "the same job can be opened by multiple clients and run at the same time."
  - If we do not enable the option, the job opens read-only for the second user, who cannot edit it. (A job can still be executed by multiple users at the same time in the Director.)
  - Navigation to enable it: DS Designer toolbar -> Job properties -> check the box "Allow multiple instances".

Unlocking jobs: "we can unlock jobs for multiple instances by releasing all the permissions."
  - Navigation: toolbar -> Job -> Cleanup resources -> Processes (Show by job / Show all) -> Release all.
  - To see the PIDs of jobs globally: DS Administrator -> General -> Environment variables -> Parallel -> Reporting -> add APT_PM_SHOW_PIDS and set it to true/false.

DAY 48

Web Console Administrator

Components of the administrator:
  - Administration:
      Users & groups -> Users: user names and passwords are created here, and permissions are assigned.
      Session management -> Active sessions.
      Domain management -> License: update the license here (upload, review).
  - Reports: DS -> INDIA (server/system name) -> we can create the reports and view a report.
  - Scheduling management ("to know what each user is doing"): Scheduling views -> New -> schedule | run | creation task | last update.

DAY 49

Job Sequencing

Stages of job sequencing: "job sequencing executes jobs in a sequence that we can schedule", or "it controls the order of execution of jobs."
  - A simple flow is Extract -> Transform -> Load jobs, with master jobs controlling the order of execution (a plain-Python sketch of this routing follows the Job Activity properties below).
  - Important stages in job sequencing are:
      1. Job activity
      2. Sequencer
      3. Terminator activity
      4. Exception handler
      5. Notification activity
      6. Wait for file activity

Job Activity: "it holds a job, and it has 1 input and n outputs."

How is the Job Activity dragged onto the design canvas? In two methods:
  1. Toolbar -> View -> Repository -> Jobs -> just drag the job to the canvas.
  2. Toolbar -> View -> Palette -> Job activity -> just drag the icon to the canvas.

Simple example: a Student job activity with OK, WAR and FAIL links; OK and WAR feed a Sequencer that runs the student rank job, and FAIL goes to a Terminator activity.

Properties of Job Activity:
  - Load the job you want into the activity: Job name = D:\DS\scd_job.
  - Execution action = Run (other choices: Reset if required, then run / Validate only / Reset only).
  - "Do not checkpoint run" option.
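Here is a minimal Python sketch of the routing in the simple Student example above. It only illustrates the idea of conditional links; it is not how sequences are actually coded, and the job names, status values and run_job helper are assumptions.

```python
def run_job(name):
    # Stand-in for a Job Activity: start the job, wait for it, and report how it
    # ended; in a real sequence this status comes from the activity's triggers.
    print(f"running {name} ...")
    return "OK"                      # one of "OK", "WAR", "FAIL" for illustration

def student_sequence():
    status = run_job("Student")
    if status in ("OK", "WAR"):      # OK and warning links both continue via the Sequencer
        run_job("student_rank")
    else:                            # the FAIL link goes to the Terminator activity
        print("terminating: stop requests sent, sequence aborted")

student_sequence()
```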
Check Point: "with checkpoints a sequence re-starts from where it aborted."
  - It is a special option that we must enable manually: go to the sequence's Job properties in DS Designer and enable checkpoints.

Parameter mapping: "if a job already has some parameters, we can map them to another job if we need to."

Triggers: "a trigger holds the link's expression type, i.e. how the link fires."

  Name of output link   Expression type          Expression
  OK                    OK (conditional)         "executed OK"
  WAR                   Warnings (conditional)   "execution finished with warnings"
  FAIL                  Failed (conditional)     "execution failed"

  Some more options in "Expression type": Unconditional ("N/A", the default), Otherwise ("N/A"), User status (= "<user defined message>"), Custom (conditional).

Terminator Activity: "it is the stage that handles the error if a job fails."

Properties: it consists of two options for the case where subordinate jobs are still running:
  - Send STOP requests to all running jobs and wait for all jobs to finish - used on a job failure.
  - Abort without sending STOP requests (wait for all jobs to finish first) - used when the server goes down in the middle of a running process.

Sequencer: "it holds multiple inputs and multiple outputs."
  It has two options or modes:
  - ALL - for OK & WAR links.
  - ANY - for FAIL links ('n' number of links).

Exception Handler: "it handles server interrupts."
  - We do not connect any stage into it; it sits separately in the job.
  - Its properties hold only general information.
  - A simple job for the exception handler: Exception handler -> Notification activity -> Terminator activity.

Notification Activity: "it sends an acknowledgement (e-mail) in between the process."
  Options to fill in the properties: SMTP mail server name, sender's email address, recipients' email addresses, email subject, attachments, email body.

Wait For File Activity: "it places the job sequence in pause until a file condition is met."
  - File name: D:\DS\SCD_LOAD (browse file).
  - Two options: Wait for file to appear / Wait for file to disappear.
  - Timeout length (hh:mm:ss), or Do not timeout (no time limit for the above options).

DAY 50

Performance tuning w.r.t. partition techniques & stages

Partition techniques fall into two categories.

  Key based:        Key less:
  1. Hash           1. Same
  2. Modulus        2. Round Robin
  3. DB2            3. Entire
  4. Range          4. Random

In the key based partition techniques:
  - DB2 is used when the target is a DB2 database.
  - The DB2 and Range techniques are used rarely.
  - Hash partition technique: it is selected when there are several key columns with heterogeneous data types (different data types). In any other situation we can select the Modulus technique.
  - Modulus partition technique: it distributes the data based on the mod value, and the formula is MOD(key value, number of nodes).
  - NOTE: Modulus has higher performance than Hash because of the way it groups the data, but Modulus can only be selected if there is only one key column and its data type is integer.
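A quick sketch of the MOD rule above, in plain Python; the key column name and the 4-node layout are assumed for illustration only.

```python
def modulus_partition(rows, key, nodes):
    """Send each row to partition MOD(key value, number of nodes)."""
    partitions = {n: [] for n in range(nodes)}
    for row in rows:
        partitions[row[key] % nodes].append(row)
    return partitions

rows = [{"eid": e} for e in (101, 102, 103, 104, 105, 106, 107, 108)]
for node, part in modulus_partition(rows, "eid", 4).items():
    print("node", node, [r["eid"] for r in part])
# rows with the same key value always land on the same node, which is what a
# key based technique has to guarantee for joins and aggregations
```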
In the key less partition techniques:
  - Same: it never re-distributes the data; it carries on with whatever partitioning the previous stage produced.
  - Entire: it distributes the same complete group of records to all nodes; the purpose is to avoid mismatched records in between the operation.
  - Round Robin: it is associated with generated stages such as the Column Generator; it is the best key less technique compared to Random.
  - Random: the default technique used by the other key less partitioned stages.

Performance tuning w.r.t. stages:
  - If sorting is already performed, we can use the JOIN stage; otherwise the LOOKUP stage is the best choice.
  - LOOKUP FILE SET: an option used to remove duplicates in the Lookup stage.
  - SORT stage: for a complex sort go to the Sort stage; otherwise go to a link sort.
  - Remove Duplicates: if the data is already sorted, use the Remove Duplicates stage; if sorting and de-duplication are both needed, go to a link sort with the Unique option.
  - Constraints: when an operation and constraints are both needed, go to the Transformer stage; for constraints only, simply go to the FILTER stage.
  - Conversions: Modify stage or Transformer stage (the Transformer takes more compile time).

DAY 51

Compress, Expand, Generic, Pivot, XML Input & Output Stages

Compress Stage: "it is a processing stage that compresses the records into a single file format, i.e. it zips the records."
  - It supports 1 input and 1 output.
  - Properties: Stage -> Options -> Command = (compress / gzip); Input -> nothing to set; Output -> load the metadata of the source file.

Expand Stage: "it is a processing stage that extracts the compressed data, i.e. it unzips the data."
  - It supports 1 input and 1 output.
  - Properties: Stage -> Options -> Command = (uncompress / gunzip); Input -> nothing to set; Output -> load the metadata of the source file for further processing.

Encode Stage: "it is a processing stage that encodes the records into a single format with the support of a command line."
  - It supports 1 input and 1 output.
  - Properties: Stage -> Options -> Command line = (compress / gzip); Input -> nothing to set; Output -> load the metadata of the source file.

Decode Stage: "it is a processing stage that decodes the encoded data."
  - It supports 1 input and 1 output.
  - Properties: Stage -> Options -> Command line = (uncompress / gunzip); Output -> load the metadata of the source file.

Generic Stage: "it is a processing stage that can call any operator, but its properties must be filled in fully."
  - It supports n inputs and n outputs, but no rejects.
  - When the job is compiled, the job's OSH code is generated, and the Generic stage can call ANY DataStage operator.
  - Its purpose includes migrating server jobs to parallel jobs (IBM's x-migrator converts about 70% automatically).
  - Properties: Stage -> Options -> Operator = copy (we can write any stage operator here); Input -> nothing to set; Output -> load the metadata of the source file.

Pivot Stage: "it is a processing stage that pivots a table, mapping several input columns onto one output column so that columns become rows."
  - It supports 1 input and 1 output.
  - Properties: Stage -> nothing to set; Input -> nothing to set; Output -> define the output column, e.g. Column name = REC, SQL type = varchar, Length = 25, with the Derivation holding the input column names, comma separated.

XML Stages: "real time stages that handle the data in XML format." The XML stages are divided into two types:
  1. XML Output: it composes the incoming rows into XML, either one XML chunk per record or aggregated into a single document.
  2. XML Input: it parses an XML document (or XML chunks held in a single column) back into relational rows and columns.
  A plain-Python sketch of this round trip is shown below.
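The sketch below shows the same round trip in plain Python: rows aggregated into one XML document, then parsed back into rows. It only illustrates the idea; the element names are made up and nothing here is DataStage syntax.

```python
import xml.etree.ElementTree as ET

rows = [{"eid": "101", "ename": "abc"}, {"eid": "102", "ename": "xyz"}]

# "XML Output" direction: aggregate all rows into a single XML document
root = ET.Element("employees")
for row in rows:
    rec = ET.SubElement(root, "employee")
    for col, value in row.items():
        ET.SubElement(rec, col).text = value
document = ET.tostring(root, encoding="unicode")
print(document)

# "XML Input" direction: parse the document back into relational rows
parsed = [{child.tag: child.text for child in rec}
          for rec in ET.fromstring(document)]
print(parsed)
```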