Accenture DataStage Best Practice

March 27, 2018 | Author: Mahendra Sampath



DataStage Best Practices

Date: 21 June 2005
Version: 0.9
File Location: $/Ranch/Design/Plans/DataStage Best Practise.doc

52538333.doc © Accenture 2004. All Rights Reserved. Highly Confidential. Version 0.9

CONTENTS

1. INTRODUCTION
   1.1 Objective
   1.2 References
   1.3 Audience
   1.4 Document Usage
2. DATASTAGE OVERVIEW
3. DATASTAGE DEVELOPMENT WORKFLOW
   3.1 Building and Testing Jobs
       3.1.1 Ranch_Dev Project
   3.2 Other DataStage Projects
4. DATASTAGE JOB DESIGN CONSIDERATIONS
   4.1 Job Types
       4.1.1 Import Jobs
       4.1.2 Transform Jobs
       4.1.3 Unload Jobs
5. USE OF STAGES
   5.1 Combining Data
       5.1.1 Join, Lookup and Merge Stages
       5.1.2 Aggregate Stage
       5.1.3 The Funnel Stage
   5.2 Sorting
   5.3 Data Manipulation
       5.3.1 Transformer Usage Guidelines
       5.3.2 Modify Stage
   5.4 Transitioning Data
       5.4.1 External Data
       5.4.2 Parallel Dataset
   5.5 Unit Test
       5.5.1 Copy Stage
       5.5.2 Peek Stage
       5.5.3 Row Generator
       5.5.4 Column Generator
       5.5.5 Manual XLS Generation
6. GUI STANDARDS
7. DATASTAGE NAMING STANDARDS
8. RUNTIME COLUMN PROPAGATION (RCP)
9. STANDARDISED REJECT HANDLING
   9.1 Reject Components
   9.2 Customised Reject Messages
   9.3 Reject Limit
   9.4 Before Routine
   9.5 Notifications
       9.5.1 In-line Notification of Rejects
       9.5.2 Cross Functional Notification of Rejects
10. ENVIRONMENT
   10.1 Default Environment Variables Standards
   10.2 Job Parameter File Standards
   10.3 Directory Path Parameters
   10.4 Default Directory Path Parameters
   10.5 Directory & Dataset Naming Standards
       10.5.1 Functional Area Input Files
       10.5.2 Functional Area Output Tables
       10.5.3 Functional Area Staging Tables
       10.5.4 Internal Module Tables
       10.5.5 Datasets Produced from Import Processing
11. METADATA MANAGEMENT
   11.1 Source and Target Metadata
   11.2 Internal Metadata
12. STANDARD COMMON COMPONENTS
   12.1 Job Templates
       12.1.1 Import Jobs
       12.1.2 Transform Jobs
       12.1.3 Unload Jobs
   12.2 Containers
13. DEBUGGING A JOB
14. COMMON ISSUES AND TIPS
   14.1 1-Way / N-Way
   14.2 Duplicate Keys
   14.3 Resource Usage vs Performance
   14.4 General Tips
15. REPOSITORY STRUCTURE
   15.1 Job Categories
   15.2 Table Definition Categories
   15.3 Routines
   15.4 Shared Containers
16. COMMON COMPONENTS USED IN RANCH
   16.1 jbt_sc_join
   16.2 jbt_sc_srt_cd_lkp
   16.3 jbt_env_var
   16.4 jbt_annotation
   16.5 Job Log Snapshot
   16.6 Reconciliation Report
   16.7 Script Template
   16.8 Split File
   16.9 Make File
   16.10 jbt_import
   16.11 jst_import
   16.12 jbt_unload
   16.13 jst_unload
   16.14 jbt_abort_threshold

1. INTRODUCTION

1.1 Objective

This document will serve as the source of standards for use of the DataStage software as employed by the Ranch Transformation project. It is intended to channel the general knowledge of DataStage developers towards the specific things they need to know about the Ranch project and the specific way jobs will be developed. The standards described below will be followed by all developers.

It is understood that, in setting the standards, it may not be possible to cover every development scenario. In such cases the developer must contact the appropriate authority to seek clarification and ensure that such missing items are subsequently added to this document. It will therefore be an evolving document, updated continually to reflect the changing needs and thoughts of the development team, and hence continue to represent best practice as the project progresses.

The Offshore Build Manager will maintain the document (in collaboration with the development team, through weekly developer meetings) and will be responsible for distributing the document to developers, and explaining its content, initially and after updates have been applied, ensuring that the standards it describes are communicated and understood. Such communication will highlight the areas of change.

1.2 References

• DataStage Guide
• Ascential best practice documents
• Ranch Transform Batch Blueprint document
• Ranch Transform Technical Architecture Design document
• Best practice documents and experience from previous projects, from Ascential as DataStage vendor and from LogicaCMG and Accenture as system integrators.

These best practices have been built on previous experience and tailored to the requirements of the Ranch Transform project.

1.3 Audience

• Ranch Transform team.

1.4 Document Usage

This document describes the DataStage best practices to be applied to the Ranch Transformation project. It will be referenced by developers initially for familiarisation and as required during the course of the project. Use of the document will therefore reduce over time as developers become familiar with the practices described. The best practices will also form the basis for QA and peer testing within the development environment; an initial review and sign-off process will therefore be followed within this context.
2. DATASTAGE OVERVIEW

DataStage is a powerful Extraction, Transformation and Loading (ETL) tool. It has the following features to aid design and processing:

• Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements.
• Extracts data from any number or type of database. You can modify the SQL SELECT statements used to extract data.
• Handles all the metadata definitions required to define your data warehouse or migration. You can view and modify the table definitions at any point during the design of your application.
• Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data, and you can easily extend the functionality by defining your own transforms.
• Aggregates data.

Within DataStage, a project is the entity in which all material related to a development is stored and organised.

3. DATASTAGE DEVELOPMENT WORKFLOW

3.1 Building and Testing Jobs

This section provides an overview of the DataStage job development process for the Ranch Transformation project. As detailed in the diagram below, there will be three environments, i.e. development, test and production. Development will have three projects through which code will move, i.e. Ranch_Dev, Version and Ranch_Promo. Developers will build code in the Ranch_Dev project and, after unit testing, promote it to the Version project, where version control will be managed. After base-lining the code, the DataStage administrator will collate all code in the Ranch_Promo project, from where the DMCoE will move it for unit and end-to-end testing on the Test server. Finally, the code will be moved by the DMCoE to production. Please refer to the Ranch Transform Code Migration Strategy document for further details.
[Diagram: development, test and production servers, showing the Ranch_Dev (and Ranch_Dev\FDyy), Version, Ranch_Promo, Ranch_Test and Ranch_Prod DataStage projects, the build/unit test and deploy/promote processes, and the developer, Build Manager (review / defect fix / QA / sign-off), administrator and onshore DMCoE roles.]

Each DataStage project is defined below:

3.1.1 Ranch_Dev Project

The Ranch_Dev project will be used by developers for building DataStage jobs and will also be used for unit testing. It will be mapped to a working directory on the UNIX DataStage server. Changes and defect fixes will be documented and applied before a job is promoted to Ranch_Promo for integration testing.

3.2 Other DataStage Projects

Several further DataStage projects will be employed across the Development, Test and Production environments. Please refer to the Ranch Transform Code Migration Strategy document for further details.

4. DATASTAGE JOB DESIGN CONSIDERATIONS

4.1 Job Types

As per the diagram below, there will be three types of job within Transform, i.e. Import, Transform and Unload jobs. Source data with complex file layouts will be processed by these jobs in sequence to give a target file in the format required by the Load team.
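The Import, Transform and Unload job types described above form a simple pipeline. The sketch below is illustrative only; the function names, record fields and fixed-width layout are invented for this example and are not part of the Ranch design:

```python
# Illustrative three-step pipeline: import (parse/validate), transform
# (lookup enrichment), unload (render target layout). All names hypothetical.

def import_job(raw_lines):
    """Parse raw records; keep bad records aside for an exception log."""
    good, rejects = [], []
    for line in raw_lines:
        parts = line.split("|")
        if len(parts) == 2 and parts[1].isdigit():
            good.append({"acct": parts[0], "amount": int(parts[1])})
        else:
            rejects.append(line)
    return good, rejects

def transform_job(records, branch_lookup):
    """Enrich each record via a lookup, as a functional spec might require."""
    return [dict(r, branch=branch_lookup.get(r["acct"], "UNKNOWN"))
            for r in records]

def unload_job(records):
    """Render records in a fixed target layout for the load step."""
    return ["%-8s%06d%s" % (r["acct"], r["amount"], r["branch"])
            for r in records]

good, rejects = import_job(["A1|100", "A2|abc", "A3|250"])
out = unload_job(transform_job(good, {"A1": "LDN", "A3": "NYC"}))
```

Each step reads only the output of the previous one, mirroring the dataset hand-off between the three job types.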
4.1.1 Import Jobs

Import jobs will be the starting point for transformation. The source file will be read as per the source record layout. Sanity checks on the file and validation of external properties, e.g. size, will be done here. If there are any unwanted or bad records, the job will fail and the file must be corrected before the job is restarted. Source data will then be filtered to select the records to be processed, and unprocessed data will be maintained in a dataset for future reference. Finally, one or more datasets will be created which will be the input to the actual transform process. All data errors will be captured in an exception log for future reference. See section 9 for further details of the action to be taken on failure or reject.

[Diagram: an Import job takes Hogan and non-Hogan extract data, checks for a zero-byte file, validates header and trailer details, reads the file in the specified format and creates output datasets of records to be processed; on error it writes the details to a stats file and stops processing.]

4.1.2 Transform Jobs

Datasets created by import jobs will be processed by transform jobs. A transform job will join two or more datasets, look up data as per the functional design specification, and sort and de-dup records. Finally, the records will be split as per the destination file design and a destination dataset will be created.

[Diagram: transform jobs have a number of input and output datasets; records to be processed are combined with lookup data, and data held for future jobs is kept in temporary datasets.]

4.1.3 Unload Jobs

Unload jobs will take transform datasets as a source and create the final files required by the Load team in the given format. Target data is provided as flat files.

[Diagram: an Unload job unloads data to an output file as per the layout for the load step; data held for future jobs remains in temporary datasets.]
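The file-level sanity checks performed by an import job (zero-byte check, header and trailer validation) can be sketched outside DataStage. This is a hedged illustration, not the Ranch implementation; the header and trailer formats assumed here ('H|...' and 'T|<count>') are invented for the example:

```python
import os
import tempfile

def sanity_check(path):
    """Fail fast on an empty file, then verify the trailer record count
    matches the number of detail records (assumed trailer: 'T|<count>')."""
    if os.path.getsize(path) == 0:
        raise ValueError("zero byte file: %s" % path)
    with open(path) as f:
        lines = [l.rstrip("\n") for l in f]
    header, trailer, details = lines[0], lines[-1], lines[1:-1]
    if not header.startswith("H|"):
        raise ValueError("missing header record")
    declared = int(trailer.split("|")[1])
    if declared != len(details):
        raise ValueError("trailer count %d != %d detail records"
                         % (declared, len(details)))
    return details

# Example run against a well-formed file
tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".dat")
tmp.write("H|20050621\nA1|100\nA3|250\nT|2\n")
tmp.close()
details = sanity_check(tmp.name)
```

As in the import job description, the check stops processing on the first structural error rather than attempting to repair the file.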
5. USE OF STAGES

5.1 Combining Data

5.1.1 Join, Lookup and Merge Stages

The Join, Lookup and Merge stages combine two or more input links according to the values of key columns. They differ mainly in memory usage, treatment of rows with unmatched key values, and input requirements (i.e. whether inputs must be sorted and de-duped). A brief description of when to use these stages is provided in the following table:

                                  Join       Lookup                          Merge
Type                              SQL-like   In-RAM lookup table             Master / update
Memory usage                      Light      Heavy                           Light
Number of inputs                  1 left, 1 right   1 source, n lookup tables   1 master, n updates
Sort on input                     All        None                            All
Duplicates on primary input       OK         OK                              Warning
Duplicates on secondary input(s)  OK         Warning                         OK (when n=1)
Options on unmatched primary      None       Fail, Continue, Drop or Reject  Keep or Drop
Options on unmatched secondary    None       None                            Capture as Reject
Number of output links            1          1 out, 1 reject                 1 out, n rejects
Captured on reject                Nothing    Unmatched primary rows          Unmatched secondary rows

The Lookup stage is most appropriate when the reference data for all lookup stages in a job is small enough to fit into available physical memory; each lookup reference requires a contiguous block of physical memory. If the datasets are larger than the available resources, the Join or Merge stage should be used.

5.1.2 Aggregate Stage

The purpose of the Aggregator stage is to perform data aggregations. In order to do this, it is necessary to understand the key columns that define the aggregation groups, the columns to be aggregated and the kind of aggregation. Common aggregation functions include:

• Count
• Sum
• Mean
• Min / Max.

Several others are available to process business logic; however, it is most likely that aggregations will be used as part of a calculation to determine the number of rows in an output table, for inclusion in header and footer records for unload files.

5.1.3 The Funnel Stage

The Funnel stage requires all input links to have identical schemas (column names, types, and attributes including nullability). The single output link matches the input schema.

5.2 Sorting

There are two options for sorting data within a job: either on the input properties page of many stages (a simple sort) or using the explicit Sort stage. The explicit Sort stage has additional properties, such as the ability to generate a key change column and to specify the memory usage of the stage.

5.3 Data Manipulation

5.3.1 Transformer Usage Guidelines

5.3.1.1 Choosing Appropriate Stages

The parallel Transformer stage always generates "C" code which is then compiled to a parallel component. For this reason, it is important to minimize the number of Transformers, and to use other stages (Copy, Filter, Switch, Modify etc.) when derivations are not needed. Optimize the overall job flow design to combine derivations from multiple Transformers into a single Transformer stage when possible.

5.3.1.2 Transformer NULL Handling and Reject Link

When evaluating expressions for output derivations or link constraints, the Transformer will reject (through the reject link, indicated by a dashed line) any row that has a NULL value used in the expression. The Transformer rejects NULL derivation results because the rules for arithmetic and string handling of NULL values are by definition undefined. For this reason, always test for null values before using a column in an expression, for example:

If ISNULL(link.col) Then … Else …

To create a Transformer reject link in DataStage Designer, right-click on an output link and choose "Convert to Reject".

5.3.1.3 Transformer Derivation Evaluation

Output derivations are evaluated BEFORE any type conversions on the assignment. For example, the PadString function uses the length of the source type, not the target. Therefore, it is important to make sure the type conversion is done before a row reaches the Transformer. Similarly, TrimLeadingTrailing(string) works only if string is a VarChar field; thus, the incoming column must be of type VarChar before it is evaluated in the Transformer.

5.3.1.4 Optimizing Transformer Expressions and Stage Variables

In order to write efficient Transformer stage derivations, it is useful to understand what items are evaluated and when. The evaluation sequence is as follows:

• Evaluate each stage variable initial value
• For each input row to process:
  o Evaluate each stage variable derivation value, unless the derivation is empty
  o For each output link:
    1. Evaluate each column derivation value
    2. Write the output record
  o Next output link
• Next input row

The stage variables and the columns within a link are evaluated in the order in which they are displayed in the Transformer editor. Similarly, the output links are also evaluated in the order in which they are displayed. From this sequence, it can be seen that there are certain constructs that will be inefficient to include in output column derivations, as they will be evaluated once for every output column that uses them. Such constructs are:

• Where the same part of an expression is used in multiple column derivations. For example, suppose multiple columns in output links want to use the same substring of an input column;
then the following test may appear in a number of output column derivations:

IF (DSLINK1.col[1,3] = "001") THEN ...

In this case, the substring DSLINK1.col[1,3] is evaluated for each column that uses it. This can be made more efficient by moving the substring calculation into a stage variable. By doing this, the substring is evaluated just once for every input row. The stage variable definition will be:

DSLINK1.col[1,3]

and each column derivation will start with:

IF (StageVar1 = "001") THEN ...

This example could be improved further by also moving the string comparison into the stage variable. The stage variable will be:

IF (DSLink1.col[1,3] = "001") THEN 1 ELSE 0

and each column derivation will start with:

IF (StageVar1) THEN ...

This reduces both the number of substring functions evaluated and the number of string comparisons made in the Transformer.

• Where an expression includes calculated constant values. For example, a column definition may include a function call that returns a constant value, such as:

Str(" ", 20)

This returns a string of 20 spaces. In this case, the function will be evaluated every time the column derivation is evaluated. It will be more efficient to calculate the constant value just once for the whole Transformer. This can be achieved using stage variables. This function could be moved into a stage variable derivation; however, in that case the function would still be evaluated once for every input row. The solution here is to move the function evaluation into the initial value of a stage variable. A stage variable can be assigned an initial value from the Stage Properties dialog / Variables tab in the Transformer stage editor. In this case, the variable will have its initial value set to:

Str(" ", 20)
You will then leave the derivation of the stage variable on the main Transformer page empty. Any expression that previously used this function will be changed to use the stage variable instead. The initial value of the stage variable is evaluated just once, before any input rows are processed. Because the derivation expression of the stage variable is empty, it is not re-evaluated for each input row; its value for the whole of the Transformer processing is unchanged from the initial value.

In addition to a function call returning a constant value, another example would be part of an expression such as:

"abc" : "def"

As with the function call example, this constant part of the expression could again be moved into a stage variable, using the initial value setting to perform the concatenation just once.

• Where an expression requiring a type conversion is used as a constant, or is used in multiple places. For example, an expression may include something like this:

DSLink1.col1 + "1"

In this case, the "1" is a string constant, and so, in order to be able to add it to DSLink1.col1, it must be converted from a string to an integer each time the expression is evaluated. The solution in this case is just to change the constant from a string to an integer:

DSLink1.col1 + 1

Similarly, if DSLINK1.col1 were a string field, then a conversion would be required every time the expression is evaluated. If this just appeared once in one output column expression, this would be fine. However, if an input column is used in more than one expression, where it requires the same type conversion in each expression, it will be more efficient to use a stage variable to perform the conversion once. In this case, you will create, for example, an integer stage variable, specify its derivation to be DSLINK1.col1, and then use the stage variable in place of DSLink1.col1 wherever that conversion would have been required.

It should be noted that when using stage variables to evaluate parts of expressions, the data type of the stage variable should be set correctly for that context; otherwise needless conversions are required wherever that variable is used.

5.3.2 Modify Stage

The Modify stage is the most efficient stage available. Transformations that touch a single field, such as keep/drop, type conversions, some string manipulations, and null handling, are the primary operations which should be implemented using Modify instead of the Transformer.
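The stage-variable advice in 5.3.1.4 is the general "hoist invariant work out of the per-row, per-column evaluation" pattern. The Python sketch below (an illustration only, not DataStage code) counts evaluations to show the saving:

```python
# Count how often a common subexpression runs with and without hoisting.
calls = {"n": 0}

def expensive_prefix(row):
    """Stands in for the substring DSLINK1.col[1,3] in the example above."""
    calls["n"] += 1
    return row["col"][:3]

rows = [{"col": "001AAA"}, {"col": "002BBB"}]

# Naive: the subexpression is re-evaluated in each of 3 column derivations.
calls["n"] = 0
for row in rows:
    out = [expensive_prefix(row) == "001" for _ in range(3)]
naive = calls["n"]

# Hoisted: evaluate once per row into a "stage variable" and reuse it.
calls["n"] = 0
for row in rows:
    stage_var = expensive_prefix(row) == "001"   # once per row
    out = [stage_var for _ in range(3)]
hoisted = calls["n"]
```

With 2 rows and 3 derivations the naive form performs 6 evaluations against 2 for the hoisted form, exactly the reduction the stage-variable technique delivers per input row.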
5.4 Transitioning Data

5.4.1 External Data

The External Source stage is a file stage which is used to read data that is output from one or more source programs. The stage calls the program and passes appropriate arguments. The stage can have a single output link and a single rejects link, and can be configured to execute in parallel or sequential mode. This stage will typically be used in the Import jobs to import external data into parallel datasets to be processed by further Transform jobs.

5.4.2 Parallel Dataset

The Data Set stage is used to read data from or write data to a data set. The stage can have a single input link or a single output link, and can be configured to execute in parallel or sequential mode. DataStage parallel extender jobs use data sets to manage data within a job. The Data Set stage can store the data being operated on in a persistent form, which can then be used by other DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. These parallel datasets will be created from the external data by the Import job, and will also be created whenever intermediate datasets are needed for further single or multiple jobs to process. Due to the parallel nature of processing, the danger of bottlenecks during dataset creation is eliminated.

5.5 Unit Test

5.5.1 Copy Stage

The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or columns can be dropped or changed (to copy with more modification, for example changing column data types). This stage is commonly used for debugging and testing, where a copy of the data flowing from a particular stage can be isolated from the flow and analysed.

5.5.2 Peek Stage

The Peek stage can print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets. This stage is used when a specific column's data is to be analysed during unit testing, to validate whether the preceding transformation logic is working as desired.

5.5.3 Row Generator

The Row Generator stage is a Development/Debug stage that has no input links and a single output link. It produces a set of mock data fitting the specified metadata. This is useful where you want to test your job but have no real data available, whether that is a source file or a dataset produced by some other job whose development is also underway.

5.5.4 Column Generator

The Column Generator stage is a Development/Debug stage that can have a single input link and a single output link. It adds columns to incoming data and generates mock data for these columns for each data row processed. This is used where not all of the columns' real data is available for testing; those columns need to be populated with mock data fitting the specified metadata. More details can be specified about each data type, if required, to shape the data being generated.
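Outside DataStage, the same kind of metadata-driven mock data that the Row Generator and Column Generator stages produce can be generated for unit tests. A minimal sketch, with column names and the metadata format invented for this example:

```python
import itertools
import random

def generate_rows(metadata, n, seed=42):
    """Produce n mock rows fitting a simple column metadata description.
    'cycle' columns step by an increment; 'random' columns draw from a range."""
    rng = random.Random(seed)  # seeded so unit tests are repeatable
    counters = {name: itertools.count(spec.get("start", 0),
                                      spec.get("increment", 1))
                for name, spec in metadata.items()
                if spec["type"] == "cycle"}
    rows = []
    for _ in range(n):
        row = {}
        for name, spec in metadata.items():
            if spec["type"] == "cycle":
                row[name] = next(counters[name])
            else:  # "random"
                row[name] = rng.randint(spec["low"], spec["high"])
        rows.append(row)
    return rows

meta = {"key": {"type": "cycle", "start": 1, "increment": 1},
        "amount": {"type": "random", "low": 0, "high": 999}}
rows = generate_rows(meta, 5)
```

The cycle/random split mirrors the generator-type options described for the Column Generator stage below.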
Type as ‘Cycle’ specifying what ‘Increment’ value is required Type as ‘Random’ specifying what percent of invalid/zero data is required. 5.4.4 Transitioning Data 5. which by convention has the suffix .g. For e. It can be configured to execute in parallel or sequential mode. The stage calls the program and passes appropriate arguments. DataStage parallel extender jobs use data sets to manage data within a job. Using data sets wisely can be key to good performance in a set of linked jobs. The stage can have a single input link or a single output link. These Parallel Datasets will be created from the external data by the ‘Import’ job and will be created whenever intermediate datasets are needed to be created for further single/multiple jobs to process. 5. 5.doc © Accenture 2004.1 External Data The External Source stage is a file stage which is used to read data that is output from one or more source programs. date and a brief reference to the design document including the version number the job has been coded up to. Two types of annotation. a blue job description (description annotation) and a yellow operator specific description (standard annotation) are used. promo and production. plus the main job annotation and any modifications to the job. 7. When using DataStage Version Control. Where the job has not yet entered Version Control. Annotations are also used to further describe the functionality of jobs and stages.02…13 indicating FD name. Naming conventions must be enforced on links.5 Manual XLS Generation In addition to the ‘Row Generator’ and ‘Column Generator’ methods DataStage provides.1.5. It does not stop developers from using Full Description as a method of maintaining the relevant documentation. but information maintained by the developer will get appended to by the Version Control tool. developer name. The detailed description is also updated automatically in by DataStage Version Control process following the first initialization into Version Control. 
<im> indicates Import Job <tr> indicates Transform Job <ul> indicates Unload Job js_<fdXX>_<im/tr/ul>_<file/detail> Where XX is 01. GUI STANDARDS Job Description Fields – the description annotation is mandatory for each job.output. the initial version should be referred to as 0. The full description should include the job version number. This is packaged and maintained with the job and will be visible when the jobs are deployed to test. 6. Note that the description annotation updates the job short description.02…13 indicating FD name. These methods of data generation will be used extensively during Unit testing. Entries put in the detailed description by Version Control must not be modified manually.9 . Those columns need to be inserted with mock data fitting the specified metadata. mock data can also be created manually in an XLS file and then saved as a CSV file to be given as input to the DataStage job where this test data is required. transforms and source and target files. <im> indicates Import Job Sequence <tr> indicates Transform Job Sequence <ul> indicates Unload Job Sequence source Page 15 of 44 Highly Confidential version 0. This is used where not all the columns’ real data is available for testing. the Full Description field in job properties is also used by DS Version control to append revision history. All Rights Reserved Syntax Import/transform/unload jb_fdXX_<im/tr/ul>_<JobName> Where XX is 01.doc © Accenture 2004. Standard description annotations should be used on every non-trivial stage. 5. DATASTAGE NAMING STANDARDS Object Type Category Job Job Sequence Source Definition Category 52538333. 
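The 'Cycle' and 'Random' generator options described in 5.5.4 can be pictured with a small Python sketch. This is illustrative only; the real stages are configured in the DataStage GUI, and the function names here are invented:

```python
import random

def cycle_values(start, increment, count):
    """Mimics a 'Cycle' type: a start value plus a fixed increment per row."""
    return [start + i * increment for i in range(count)]

def random_with_invalid(count, percent_invalid, seed=42):
    """Mimics a 'Random' type where a given percentage of rows carry invalid/zero data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        if rng.random() < percent_invalid / 100.0:
            rows.append(0)                    # invalid/zero filler value
        else:
            rows.append(rng.randint(1, 999))  # plausible mock value
    return rows

mock_keys = cycle_values(start=1000, increment=10, count=5)
mock_amounts = random_with_invalid(count=100, percent_invalid=20)
```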
Object Type: Target Definition Category
Syntax: target

Object Type: Link*
Syntax: lnk_<StageName>_<rej/njn/jn> or lnk_<StageName>
<StageName> is the name of the stage from which the link is coming out. <rej/njn/jn> indicates the type of link: rej = reject, njn = non-join, jn = join. If not applicable then this suffix is dropped.

Parallel Job FILE Stages
Data Set                 ds_<Dataset Name>
Sequential File          sq_<Sequential file name>
File Set                 fs_<File Set name>
Lookup File Set          lfs_<Lookup file set name>
External Source          esrc_<External Source name>
External Target          etrg_<External Target name>
Complex Flat File        cff_<Complex Flat File name>

Parallel Job PROCESSING Stages
Transformer              tr_<Purpose>
BASIC Transformer        btr_<Purpose>
Aggregator               agg_<Purpose>
Join                     jn_<Purpose>
Merge                    mrg_<Purpose>
Lookup                   lkp_<Purpose>
Sort                     srt_<Purpose>
Funnel                   fnl_<Purpose>
Remove Duplicates        rdup_<Purpose>
Compress                 cps_<Purpose>
Expand                   exp_<Purpose>
Copy                     cp_<Purpose>
Modify                   md_<Purpose>
Filter                   flt_<Purpose>
External Filter          sflt_<Purpose>
Change Capture           ccap_<Purpose>
Change Apply             capp_<Purpose>
Difference               diff_<Purpose>
Compare                  cmp_<Purpose>
Encode                   enc_<Purpose>
Decode                   dec_<Purpose>
Switch                   cwt_<Purpose>
Generic                  gen_<Purpose>
Surrogate Key            sur_<Target Column Name>

Parallel Job RESTRUCTURE Stages
Column Import            ci_<Purpose>
Column Export            ce_<Purpose>
Make Subrecord           msub_<Purpose>
Split Subrecord          ssub_<Purpose>
Combine Records          crec_<Purpose>
Promote Subrecord        prec_<Purpose>
Make Vector              mkv_<Purpose>
Split Vector             splv_<Purpose>
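A naming convention like jb_fdXX_<im/tr/ul>_<JobName> is easy to enforce mechanically. A hypothetical sketch (the regular expression is ours, not part of the standard):

```python
import re

# Pattern for job names per the standard: jb_fdXX_<im|tr|ul>_<JobName>,
# where XX is a two-digit FD number from 01 to 13.
JOB_NAME_RE = re.compile(r"^jb_fd(0[1-9]|1[0-3])_(im|tr|ul)_[A-Za-z0-9]+$")

def is_valid_job_name(name: str) -> bool:
    """Return True if a job name follows the project naming standard."""
    return JOB_NAME_RE.match(name) is not None

print(is_valid_job_name("jb_fd03_tr_CustomerJoin"))  # conforming name
print(is_valid_job_name("job_fd03_CustomerJoin"))    # wrong prefix, missing job type
```

A check like this could run as part of an export review, so that non-conforming names are caught before promotion rather than by eye.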
Containers
Local Container          lc_<functionality>
Shared Container         sc_<functionality>

Others
Stage Variable           s_<StageVariableName>
Sequence Generator       seq_<Target Column Name>

Job Sequence Stages
Job Activity             ja_<job name without jb and fd#>
Execute Command          ex_<Script function>_<file/detail>
Sequencer                sq_<Purpose>

8. RUNTIME COLUMN PROPAGATION (RCP)
One of the aims/benefits of RCP is to enable jobs whose metadata is variable and determined only at run time. An example would be a generic job that reads a flat file and stores the data into a dataset, where the file name itself is a job parameter; in this case it is not possible to determine the column definitions during build. RCP should be enabled within the Project Properties (providing the flexibility to use RCP at job level) and, in the event that RCP is required, it can then be turned on at job / stage level. An annotation should make this clear on the job.

However, one of the features that sometimes confuses developers is that in jobs where RCP is not desired but the feature is switched on, additional columns can appear in the output dataset that the developer may have thought were dropped. For these reasons developers must turn off RCP within each job unless the feature is explicitly required in the job by the developer, as in the example above.

9. STANDARDISED REJECT HANDLING

9.1 Reject Components
There is a requirement to set up a standard approach to reject handling. Reject processing is not provided as standard within DataStage Enterprise (Parallel) across the majority of stages: there is a reject link on the Lookup stage, but a standard approach must be introduced for the remaining stages and adopted across all stages. This will be achieved by the introduction of a bespoke element (in the form of example stages within template jobs) and through the use of a standardised reject component made available to all developers via a DataStage wrapper.

The standardisation of reject capture allows operational support to easily:
• locate the rejection message and understand the format of the message
• locate and diagnose the reason for rejections
• set tolerances on the number of rejects permitted
• allow for the re-processing of rejected rows

These components are shown in the following diagram:
All stages where a row might be rejected must include a reject link. Data flowing down the reject link from a Lookup or Join stage might result from an inability to match keys; from a Transform stage it might result from the validation of data items, for instance where an unexpected value or null is encountered. In each of these cases the rejected row is passed down a reject link to a bespoke component that:
1. Passes the row to a dataset in order to facilitate the re-processing of the rejected rows.
2. Identifies the key of the rejected row and passes this down the relevant link (depending on the key type) to the standardised reject handling component. Where there is no key, zeros are passed down all links intended for key information.
3. Compiles and passes a standard message (see table below) describing the rejection to the standardised reject handling component.

In the diagram the Lookup stage is shown with a reject link, though this is just as applicable to Join, Transform and other stages. The standardised reject component takes two inputs (over a possible five input links), creates a surrogate key uniquely defining each reject, and writes the message along with the two keys to a dataset. The reject dataset therefore holds the key from the rejected row (which can be used to cross-reference to the dataset of rejected rows) and a message that will help identify the reason for the rejection. This approach assumes that a key uniquely identifying each failing row is present on the driving flows.

The reject component will be used with every stage which can fail due to data discrepancies (e.g. Join and Lookup). The error link from the Lookup stage can be linked directly to the custom error component, whilst the Join stage requires further processing in order to facilitate reject handling; this processing requirement is shown in the following diagram. The component will be made available to all developers for use in reject handling as a job template.

The paths to which reject datasets are automatically written are date stamped within a common reject and log directory. Reject datasets are uniquely named and created each time the module runs (see below).

9.2 Customised Reject Messages
Developers are limited to using the messages specified below, thus preventing the creation of random error messages. The creation of a list of standard error conditions limits the number of exceptions an operator will see, allowing errors to be quickly identified and resolved. A description of rejects and messages should be made available to operational support to help diagnose problems encountered when running the batch. The following reject messages / conditions will be used:

Lookup / Join Failure – For all referential integrity checking and any other critical Lookups / Joins: keys have not been matched between input links on a Join or Lookup stage. The job and stage name must be included in the message.

Row Count Mismatch – The number of records processed does not match the number of records described in the footer record. This message is particular to import jobs where the input file is validated against the footer record. The job and stage name must be included in the message.

Empty File – The input file is empty. This message is particular to import jobs where the input file is validated against the footer record. The job and stage name must be included in the message.

Invalid Field – A field has been identified as containing invalid data. The job, field and stage name must be included in the message.

Null Field – A Not Null field has been identified as containing null values. The job, field and stage name must be included in the message.

Developers must intercept rejects in the code they generate and produce a standard reject message that contains accurate data and relevant information from the record.

9.3 Reject Limit
A Reject Limit parameter is included in all jobs. This is used by the standardised reject processing wrapper to test against the total number of errors for a module, and allows central control of the level of rejects permitted across all modules and jobs used in the Ranch batch. The reject limit will be variable between 0 and 99: a reject limit of 0 (zero) will ABORT ON FIRST REJECT, whilst a reject limit of 99 will NEVER ABORT (on reject). On meeting the reject limit, the job, and hence the processing for any given module, is terminated.

9.4 Before Routine
The before routine for the first job in a sequence of jobs that implement a module (or for a single job where there is only one job in a module) will be used to interrogate and increment the number that uniquely identifies the datasets created from the processing of rejects for a particular module.

9.5 Notifications
A notification is the method by which:
• operations are informed of a reject, i.e. in-line notifications
• rejects are communicated between functional streams and / or retained to support the rerunning of modules, i.e. cross-functional notifications
These are described in the following sections.

9.5.1 In-line Notification of Rejects
In-line notifications are those resulting from rejects within a functional processing stream. The last activity within a module will be to email notification of rejects within that module to operations. This will be achieved by using the Notification stage. A template job will be provided that includes the Notification stage and job parameters that can be tailored such that the names and paths of the reject datasets can be interrogated and the relevant notifications made.
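The standard reject record and the 0-99 reject-limit behaviour described above can be sketched in Python. All names here are illustrative; the real component is a DataStage job template, not code:

```python
# Illustrative sketch of the standardised reject handling: a fixed set
# of messages, job/stage context, a zero key where none exists, and a
# reject limit where 0 aborts on the first reject and 99 never aborts.

STANDARD_MESSAGES = {
    "LOOKUP_JOIN_FAILURE": "Lookup / Join Failure",
    "ROW_COUNT_MISMATCH": "Row Count Mismatch",
    "EMPTY_FILE": "Empty File",
    "INVALID_FIELD": "Invalid Field",
    "NULL_FIELD": "Null Field",
}

def build_reject(job, stage, condition, key=None, field=None):
    """Compile a standard reject record; a missing key becomes zero."""
    if condition not in STANDARD_MESSAGES:
        raise ValueError("developers are limited to the standard messages")
    rec = {"job": job, "stage": stage,
           "message": STANDARD_MESSAGES[condition],
           "key": key if key is not None else 0}
    if field is not None:  # Invalid/Null Field messages also carry the field name
        rec["field"] = field
    return rec

def should_abort(reject_count, reject_limit):
    """Reject limit 0 aborts on the first reject; 99 never aborts."""
    if reject_limit == 99:
        return False
    return reject_count > reject_limit

r = build_reject("jb_fd03_tr_Accounts", "lkp_sortcode",
                 "LOOKUP_JOIN_FAILURE", key=12345)
```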
9.5.2 Cross Functional Notification of Rejects
This type of notification is the means by which rejects are communicated between functional streams. This 'communication' is built around the feedback from the load process (prompting a rerun), and between migration steps (i.e. T14 to T), and an understanding of the dependencies between functional areas, i.e. transactions being dependent on accounts, etc. In this instance, rejected accounts will be incorporated into the transaction processing process, therefore limiting the transactions processed to those where an account has also been successfully processed.

10. ENVIRONMENT

10.1 Default Environment Variables Standards
DataStage Enterprise Edition allows project / job tuning by means of environment variables. These include the setting of the default node configuration file. (Note that DataStage environment variables are different to standard parameters.) The following DataStage environment variables must exist in all jobs; the template job, held under Users/Template, has these parameters defined:
• $APT_CONFIGX_FILE = /DataStage/Product/Ascential/DataStage/Configx1.apt (default value used in every job) or /DataStage/Product/Ascential/DataStage/Configx4node.apt (value overwritten for testing on extra nodes in individual jobs)
• $APT_DUMP_SCORE = false

10.2 Job Parameter File Standards
A generic parameter file which stores all the default job parameter values, including user names and login details, will be used in conjunction with the before-job routine "SetDSParamsFromFile". This allows project-wide settings to be changed once, and avoids unnecessary parameter duplication. The path to this parameter file will be /DataStage/Parameters/<project name> and its name will be parameters.lst.

10.3 Directory Path Parameters
The following parameters must exist in all jobs:
• pDSPATH = /XX/XX/Ranch (DataStage Datasets top level development directory; there will be equivalents for Testing and Live), where XX is the base Ranch directory as set by DMCoE
• pITERATION = n (where n is the migration iteration, i.e. from 1 to 9)
• pRUNNUMBER = n (where n is the run number within the iteration, starting from 1)

10.4 Default Directory Path Parameters
The following parameters must exist in all jobs (the template job has these parameters defined):
• pDSPATH = /DataStage/Datasets/RanchDev (DataStage Datasets top level directory)
• pITERATION = 1
• pRUNNUMBER = 1

10.5 Directory & Dataset Naming Standards
UNIX directory paths are set using the following convention, based on the parameters defined above. Note that the final subdirectories (i.e. "Deliver" and "Internal") are hard coded in the jobs. This is acceptable because, if the developer mistypes the value, the job will fail immediately as the mistyped directory will not exist.

10.5.1 Functional Area Input Files
Source files will be pushed by the Extract system to the ETL server into a holding area 'Hold' via connect direct software.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Hold/<source_file_name>

10.5.2 Functional Area Output Tables
Datasets that are defined in the Detailed Design as output tables for a functional area are stored in a "Product" directory. This is the directory that downstream functional areas (including the Unload process) will go to in order to find input tables from previous areas.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Product/<datasetname>.ds

10.5.3 Functional Area Staging Tables
Datasets that are defined in the Detailed Design as staging tables within an area are stored in a "Staging" directory. This is the directory that other modules within the same functional area will go to in order to find staging tables from previous modules.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Staging/<datasetname>.ds

10.5.4 Internal Module Tables
Datasets produced within a module and used only internally within that module are stored in an "Internal" directory. Datasets in this directory are only used within jobs.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Internal/<datasetname>.ds

10.5.5 Datasets Produced from Import Processing
Datasets that are produced by import pre-processing are stored in a "Source" directory. This is the directory that functional areas will go to in order to find input tables from the source.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Source/<datasetname>.ds

Reference datasets that are produced by import processing are stored in a "Reference" directory. Reference data is not split into iterations.
#pDSPATH#/Reference/<datasetname>.ds

11. METADATA MANAGEMENT
Metadata consists of record formats for all external files (flat files) and internal files (datasets) processed by DataStage, and is stored in the DataStage Repository (a metadata repository). Metadata is either created manually within stages (i.e. Flat File, Complex Flat File and Dataset) or imported from sources such as COBOL copybooks. There are two types of metadata, described below:
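The directory convention above is plain parameter substitution. A hypothetical sketch of how the #pDSPATH#/#pITERATION#/#pRUNNUMBER# pattern expands (the function is illustrative, not a DataStage facility):

```python
def dataset_path(pDSPATH, pITERATION, pRUNNUMBER, area, name):
    """Expand the #pDSPATH#/#pITERATION#/#pRUNNUMBER#/<area>/<name> convention.

    Reference data is not split into iterations, so the 'Reference'
    area sits directly under the top-level path.
    """
    if area == "Reference":
        return f"{pDSPATH}/Reference/{name}"
    return f"{pDSPATH}/{pITERATION}/{pRUNNUMBER}/{area}/{name}"

print(dataset_path("/DataStage/Datasets/RanchDev", 1, 1, "Staging", "accounts.ds"))
# e.g. /DataStage/Datasets/RanchDev/1/1/Staging/accounts.ds
```

Because the final subdirectory names are hard coded, a typo produces a non-existent path and the job fails immediately, which is the behaviour the standard relies on.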
11.1 Source and Target Metadata
Record formats will have been pre-defined within the DataStage Repository describing the record formats of files that form inputs to import jobs and outputs from unload jobs. This metadata will therefore only be used by import and unload jobs. These record formats are for the convenience of developers (they are described in the FDs and are therefore fixed) and help maintain consistency in terms of the way data is interpreted across all jobs (define once, use many times). This metadata must not be changed by developers. Should a change be required, it should first be impact-assessed to establish the potential effect on jobs that use the metadata, and then processed through standard change control.

11.2 Internal Metadata
Developers will also create metadata describing the datasets that:
• pass data between jobs within a functional area
• pass data between jobs in different functional areas
This metadata will define the outputs of import jobs, be used by all transform jobs, and define the inputs to unload jobs. It must be stored in the repository with a name that matches the name of the dataset it describes. Should it be necessary or more efficient to process data in a different way from the way it is presented within the pre-defined metadata, developers may create a job-specific version of the metadata, which must be clearly identified as a variant on the original and saved within the repository.

12. STANDARD COMMON COMPONENTS
The use of standard components in developing DataStage jobs will:
• Increase the quality of the code, since the most optimal method will be used for a function which is to be achieved in multiple jobs
• Reduce the complexity of common tasks, therefore having a positive impact in terms of quality
• Promote reuse: productivity is increased and developers can spend more time on tasks which are specific to individual jobs

12.1 Job Templates
DataStage provides intelligent assistance which guides developers through basic DataStage tasks. The Intelligent Assistants are listed below:
• Create a template for a server or parallel job. This can subsequently be used to create new jobs; new jobs will be copies of the original job.
• Create a new job from a previously created template.
• Create a simple parallel data migration job. This extracts data from a source and writes it to a target.
Not only will the use of templates help in standardisation, but it will also form reusable components which need not be coded yet again.
Certain elements will also be common to many jobs, and these can be implemented by the use of templates. Ranch project templates will be jobs with stages following the naming standards, namely: parameters, annotations and reject handling. These jobs, acting as templates, will assist developers in developing new jobs to the stated standards.

12.1.1 Import Jobs
Each source file will be read into persistent datasets by separate jobs called import jobs. These jobs will perform sanity checks on the received file, e.g. that the data file is not empty and that header and trailer details are consistent with the file properties. In the Ranch project the files are used repeatedly in different functionality; rather than reading a file many times, we will read each file only once and create DataStage datasets, which will then be used by the respective functionalities. Since the associated logic for importing and validating files will be the same, we will build and test one such job and use this architecture for the rest. The source files used in multiple instances, each across several of the functionalities FD01 to FD13, are:
• Import Account Selection File
• Import Customer Selection File
• Import ETL Customer Data File
• Import ETL Address Data File
• Import ETL Customer Pointer File
• Import ETL DDA Account Data
• Import ETL TDA Account Data
• Import TAX Certification File
• Import ETL Re-directions table load file

12.1.2 Transform Jobs
Ranch transform jobs repeatedly perform joins of similar driver files with other data files. Since this functionality is common, these processes will be developed once and copied into the respective occurrences. Common processes, each used across several of the functionalities FD01 to FD13, include:
• Sort Code Lookup & Split data based on processing centre
• ETL Re-directions Table Load file performs the same join with many different files, i.e. join based on S/C & Acc Num
• ETL Customer Data File performs the same join with many different files, i.e. join based on Customer Num
• ETL Customer Pointer File performs the same join with many different files, i.e. join based on Customer Num (to get details of the associated Account Numbers for each customer)
Further common join processes include:
• Customer Selection File performs the same join with many different files, i.e. join based on Customer Num
• Account Selection File performs the same join with many different files, i.e. join based on S/C & Acc Num
• ETL Re-directions Table Load File join with ETL DDA Account Data
• ETL Re-directions Table Load File join with ETL TDA Account Data

12.1.3 Unload Jobs
Ranch unload jobs are tasked with creating output files in the format required by the load team. These files will be mainly in mainframe format. Apart from creating files from persistent datasets, these jobs will create header and trailer details within the file.

12.2 Containers
A container is a group of stages and links. Containers are the means by which standard DataStage processes are captured and made available to many users, and they simplify and modularise job designs by replacing complex areas of the diagram with a single container stage. They are used just as a developer would use a standard stage. Some work needs to be done to identify opportunities for reuse within the overall design; once identified, reusable components will be delivered into the DataStage repository as shared components. DataStage provides two types of container:
• Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job's Diagram window. Local containers can be used in server jobs or parallel jobs.
• Shared containers. These are created separately and are stored in the Repository in the same way as other jobs. There are two types of shared container:
  o Server shared containers are used in server jobs. They can also be used in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job), though this can cause bottlenecks in processing as they are serial only, and should be avoided if possible.
  o Parallel shared containers are used in parallel jobs.

The containers identified in the Ranch transform project are described below:
• Reject Handling – will act on the joins, lookups and active transformations to check records eliminated in the process and log them in a separate file. The functionality needed is discussed in the reject handling section above.
• Statistics Report logger – will log messages to a specified file, taking as input the filename and the message to be written.

13. DEBUGGING A JOB
Debugging essentially involves viewing the data in order to isolate the fault. The following techniques will assist when debugging a job:
• Adding a Peek stage, which will output certain rows to the job log
• Adding a Filter to the start of the job to filter out all rows except the ones with the attributes whose behaviour the developer wishes to test or debug
• Adding an additional output to a Transformer with the relevant constraints and storing the data in a sequential file to be used as part of the investigation (the use of a Copy stage would also be an option)
• A variant of the above: adding a parameter pDEBUG with a value of 1 or 0 to be used as part of the constraint, so that the resulting debug sequential file only contains data when pDEBUG=1

All changes to code made for debugging (including peeks, extra stages and extra parameters) must be removed prior to final unit test. Final unit testing must occur on the exact version of code that is to be promoted to Integration Test. Removing and re-inserting peeks can often become quite a tedious task. In processing hotspots (parts of a job which could potentially be an area of concern) it is advisable that peeks be replaced by Copy stages before promoting the jobs to Integration Test, instead of removing the stage completely. The COPY stage is a no-op (non-operator) stage, so there is no processing cost to having a Copy stage in a job design: while the job may appear overly complex, this will not impact the processing times of the job and should not affect or change the function of the code.

14. COMMON ISSUES AND TIPS
Common issues faced in the project during development and testing are described in this section, together with tips to assist developers while coding.

14.1 1-way / n-way
Scaling from 1-way to n-way processing is the method employed within DataStage to take advantage of parallelism. Jobs will run n-way when live in order to achieve the benefits of parallel processing provided by DataStage Enterprise. In order to ensure trouble-free scaling, jobs are built 1-way and unit tested both 1-way and n-way. This ensures that there has been no functional impact in making the switch to parallel processing. Problems to do with scaling usually become evident when comparing record counts between 1-way and n-way runs: clearly these counts, and the physical records involved, should be the same. If there is a difference, the reasons must be examined and corrected. There are many possible reasons for variations in record counts, described below.
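A first scaling check is simply comparing record counts between the 1-way and n-way runs. A hypothetical Python sketch of that comparison (names invented; in practice the counts come from the job log):

```python
def compare_counts(counts_1way, counts_nway):
    """Compare per-dataset record counts from a 1-way and an n-way run.

    The n-way run reports one count per partition; totals that differ
    from the 1-way run indicate a scaling problem, e.g. join keys
    left unmatched across partitions.
    """
    problems = {}
    for name, c1 in counts_1way.items():
        cn = sum(counts_nway.get(name, []))
        if c1 != cn:
            problems[name] = (c1, cn)
    return problems

# Example: a join output lost rows when run 4-way.
one_way = {"jn_accounts": 1000}
four_way = {"jn_accounts": [240, 250, 250, 240]}   # totals 980
print(compare_counts(one_way, four_way))           # {'jn_accounts': (1000, 980)}
```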
but more effective in terms of retaining control over your jobs and the quality of the output data flows. all data flows through a single partition (where processing rules apply to all the data). with 1way being the default. care must be taken at the unit test stage and it is always a good idea to have a general understanding of the anticipated throughput of a job before starting the build. flat file or internal dataset will contain duplicate keys. is loaded into a target database table. i. If not. therefore minimising the need for repartitioning. Essentially this is because when running with a single node.• • • One of the most common reasons is when the Join. The key to solving problems related to duplicates. the output from another job or module). otherwise the lookup may fail simply because the dataset was partitioned incorrectly for the lookup an incoming dataset may have been created by another job or module which may also have been written by another developer. is to repartition at the start of a job. Another sign that there may be duplicates in the data is when the output of a job or stage (within a job) has more rows in the output stream than would have been thought possible from the inputs. 14. particularly if the target table is uniquely keyed. In these situations care must be taken to ensure that incoming data streams are not only sorted but partitioned the same way. This might be less efficient. In this case it might contain the required data.e. hence causing a variation between the actual rows processed and the anticipated number it should be ensured that a dataset used as input on the lookup link to a Lookup stage must be partitioned as Entire to ensure that the entire dataset is available for lookup across all partitions within the main input link to these stages.9 . is to understand how duplicates are be generated. Lookup and Merge stages (and others) are used. 
Note that incorrect partitioning can also have the opposite effect: records may be unnecessarily rejected (either down a reject link or omitted altogether) and will therefore not flow down the main output link to subsequent stages or into an output dataset. Scaling from 1-way to n-way processing will often cause problems unless you can be absolutely sure that the datasets you are using are partitioned correctly for your needs.

Duplicates can be generated in several ways; here are some examples:

• an incoming data stream (a data source or internal dataset, for instance the source system itself, or the output from another job or module) may already contain duplicate keys. If the problem lies with the source system, then this may need to be raised as a data quality issue and corrected at source; if the input was produced by another job or module, the problem may be inherited, and a more extensive search may be required in order to find it
• a 1-way/n-way issue. For instance, if a job is generating a unique key column, running 1-way will usually give correct results; however, running n-way the same logic is applied in every partition, which in many cases can lead to incorrect results. A sign that this is the case is if the final record count is a multiple of the number of nodes compared with the single-node record count
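The second example above can be checked numerically: if each partition instead starts its keys at its own partition number and steps by the partition count, no two partitions can ever produce the same value. A shell sketch of that progression (the loop variables stand in for @PARTITIONNUM, @NUMPARTITIONS and @INROWNUM; this is an illustration, not the routine itself):

```shell
#!/bin/sh
# Generate 3 keys in each of 4 partitions using key = part + nparts*(row-1) + 1,
# then check that no key is produced twice across partitions.
NPARTS=4
for PART in 0 1 2 3; do
  for ROW in 1 2 3; do
    echo $(( PART + NPARTS * (ROW - 1) + 1 ))
  done
done | sort -n | uniq -d > dup_keys.txt   # duplicates across all partitions

wc -l < dup_keys.txt                       # 0: every key is unique
```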
Where a job generates a unique key column n-way, the same key may be generated across all partitions and therefore duplicated when the data is collected for output. To avoid this kind of issue, particularly when defining keys, the partition number can be built into the algorithm for generating the key, therefore ensuring uniqueness across partitions. Occasionally a row-multiplying effect may be desirable, as with a deliberate Cartesian join, but in general it indicates a problem with duplicate keys.

14.3 Resource usage vs Performance

This section concentrates on issues found not only during development but also during the wider Integration, E2E and Performance test stages, particularly the balance that must be achieved between the resources available on the server where the DataStage jobs run and the performance of those jobs; generally, the more resource (processors and memory) the better. Since DataStage Enterprise starts one Unix process per node, per stage (nodes are defined in the configuration file and can be thought of as logical processors), the optimum use of parallel (partitioned and piped) data streams is clearly essential, as is the appropriate use of stages within jobs and the elimination of unnecessary repartitioning and sorting. Where parallelism is not needed, a stage can be forced to run sequentially (though this may become a bottleneck).

Partitioning and sorting will take considerable amounts of time during job execution, having a detrimental effect on performance, so where possible these activities should be minimised: incoming data streams should be partitioned and sorted as far upstream as possible and maintained for as long as possible, therefore reducing the overhead. The sort order of the data within a partition in a data stream will be maintained throughout a job, even when it is an input link to sort-dependent stages such as Dedupe and Join. It is always tempting to sort on the input links of these stages; however, this is completely unnecessary (providing the data is in the correct order already) and time consuming. Likewise, it is tempting to repartition on the input links of stages when specifying Same will suffice (again, providing the data is correctly partitioned already).

The jury is out as far as the use of the Transform stage is concerned, with arguments for and against. The Transform stage was inherited from the DataStage Server product and is less efficient than other native Parallel stages; for users of DataStage Server it will be familiar and easy to use, though the Parallel version also differs slightly. As a general rule of thumb, too many Transforms will slow your jobs down, and using several Transforms in sequence is particularly undesirable: quite often they will 'look' good, but they could be combined. In this situation the native Modify stage should be considered for simple type conversions; it is a good alternative, although it is not consistent with the user interface implemented for other stages. Common sense is the key.
In practice the effective use of the available processors, and to an extent the total memory usage, is determined by the operating system rather than by DataStage. With one process per node per stage, over-parallelising can lead to an explosion of processes running, and eventually the operating system spends more time managing than executing code. The key is to run a number of performance tests to determine the optimum number of nodes; a starting point will usually be around 50% of the actual CPUs, though this needs to be considered in the context of the total memory available and what else will be running at the time. Be prepared to add further processors to facilitate scaling. Total memory usage will be hard to estimate and is best left until the runtime batch has been designed and run; be prepared to increase memory, and to split jobs, if the usage is too great.

Finally, the Lookup stage: this stage differs from Merge and Join in that it requires the whole of the lookup dataset to be held in memory. The upper limit is large, but again this needs to be considered in the context of the total memory available.

14.4 General Tips

General tips used while developing code are listed below.

• Common information like the home directory, username, system date and password should be initialised in a global variable, and the variable then referred to everywhere.
• When you need to get a substring (e.g. the first 2 characters from the left) of a character field, use <Field Name>[1,2]. Similarly, for a decimal field use Trim(<Field Name>)[1,2]: when using string functions on a decimal, always apply the Trim function first, as string functions interpret the extra space used for the sign in a decimal.
• Nulls are a curse when it comes to using functions/routines or normal equality-type expressions: NULL = NULL doesn't work, and neither does concatenation when one of the fields is null. Changing the nulls to 0 or "" before performing operations is recommended to avoid erroneous outcomes.
• When mapping a decimal field to a char field or vice versa, it is always better to convert the value using the 'Type Conversion' functions DecimalToString or StringToDecimal as applicable while mapping.
• Always use Hash partitioning in Join and Aggregator stages to improve runtimes; the hash key should be the same as the key used to join/aggregate.
• If Join/Aggregator stages do not produce the desired results, try running in sequential mode (verify the results; if they are still incorrect, the problem is with the data or logic) and then run in parallel using Hash partitioning.
• Use the Column Generator stage to create sequence numbers or to add columns with hard-coded values.
• Ensure that a job does not look complex: if there are many stages in a job (more than 10), divide it into two or more jobs on a functional basis, and use Containers where stages in a job can be grouped together.
• Use Annotations for describing the steps done at each stage, and use a Description Annotation as the job title; the Description Annotation also appears in Job Properties > Short Job Description and in the Job Report when generated.
• Stage Variables allow you to hold data from a previous record when the next record arrives, allowing you to compare previous and current records. They also allow you to return multiple errors for a record of information: by being able to evaluate all the data in a record, and not just error on the first exception found, the clean-up of data is more efficient and requires less iteration.
• In Job Sequences, always use the "Reset if required, then run" option in Job Activity stages. (Note: this is not the default option.)
• The "Clean-up on failure" property of Sequential File stages must be enabled (it is enabled by default).
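Two of the habits above, trimming before applying string functions and defaulting nulls before arithmetic or concatenation, can be sketched with awk on a delimited file. The record layout is an illustrative assumption; in a job this is done with Trim and null-handling functions on the link:

```shell
#!/bin/sh
# name|balance records: the second balance is null (empty) and the third
# carries a leading space, as a signed decimal would.
cat > recs.txt <<'EOF'
alpha|10
beta|
gamma| 25
EOF

awk -F'|' '{
  bal = $2
  gsub(/^ +| +$/, "", bal)        # "Trim": remove the sign/padding spaces
  if (bal == "") bal = 0          # default nulls to 0 before arithmetic
  total += bal
  print substr($1, 1, 2)          # first 2 characters, like field[1,2]
}
END { print "total=" total }' recs.txt > out.txt

cat out.txt
```

Without the trim and the null default, the empty and space-padded balances would corrupt the total.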
15. REPOSITORY STRUCTURE

The DataStage repository is the resource available to developers that helps organise the components they are developing or using within their development. This consists of metadata (i.e. table definitions), the jobs themselves, and specific routines and shared containers. The anticipated repository structure is described in the following sections; however, the structure may change during development, usually evolving to the form that is most usable.

15.1 Job Categories

The jobs can be categorised by developer and by FD. The following jobs will be created:

• Import Jobs: Import jobs will be the starting point for transformation. The source file will be read into memory datasets as per the source record layout; sanity checks on the file and validation of external properties (e.g. size) will be done here, and an exception log will be created with records that do not follow the file layout. Source data will then be filtered to process records, and unprocessed data will be maintained in a dataset for future reference. Finally, one or more datasets will be created which will be input to the actual transform process.
• Transform Jobs: Datasets created by import jobs will be processed by the actual transform job. A transform will join two or more datasets and look up data as per the given functionality. All data errors will be captured in an exception log for future reference. Finally, the records will be split as per the destination file and a destination dataset will be created.
• Unload Jobs: Unload jobs will take transform datasets as a source and create the final files required by the load team in the given format.

15.2 Table Definition Categories

The files are categorised into:

• Source/Target Flat-files: The source and target files will be included in this category. These files will be converted into datasets by DataStage jobs; after the transformation process is complete, they will be converted back to target flat files.
• Datasets: Datasets are used as intermediate storage for the various processes. A Dataset can store data being operated on in a persistent form, which can then be used by other DataStage jobs. Datasets can be either Sequential or Parallel. These datasets will be created from the external data by the 'Import' job, and whenever intermediate datasets are needed for further jobs to process.

15.3 Routines

Before and after routines (should they be needed) will be described here. It is anticipated that there will be a small number of these, and therefore no further categorisation is anticipated.

15.4 Shared Containers

Shared containers (as described above) will be described here.

16. COMMON COMPONENTS USED IN RANCH

16.1 jbt_sc_join

jbt_sc_join is a common component built to meet a specific requirement in the Ranch project: to capture 3 types of records from a Join stage, whereas DataStage just offers 2 outputs from a Join stage. For example, take file A (master) and file B (child).
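For two such files, the three-way split can be previewed on small sorted test extracts with the Unix join command, which makes a convenient cross-check of the component's results (the file contents are illustrative):

```shell
#!/bin/sh
# Master (A) and child (B), both sorted on the key in column 1.
printf '1 apple\n2 ball\n3 cat\n'  > A.txt
printf '2 blue\n3 corn\n4 door\n'  > B.txt

join     A.txt B.txt > a_b_jn.txt    # A + B      (join records)
join -v1 A.txt B.txt > a_b_rej.txt   # A not in B (reject records)
join -v2 A.txt B.txt > a_b_njn.txt   # B not in A (non-join records)

cat a_b_jn.txt a_b_rej.txt a_b_njn.txt
```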
The Join stage of DataStage will give 2 outputs in this case:

• A + B (Join records)
• A not in B (Reject records)

The common component jbt_sc_join will give 3 outputs in this case:

• A + B (Join records)
• A not in B (Reject records)
• B not in A (Non-join records)

[Flow diagram: File 'A' (Master) and File 'B' (Child) feed jbt_sc_join; outputs are lnk_A_B_jn (A + B), lnk_A_B_rej (A not in B) and lnk_A_B_njn (B not in A).]

16.2 jbt_sc_srt_cd_lkp

Sort Code lookup is a piece of functionality required in many places (in various FDs in Ranch), so a common component with this functionality has been built. It takes a file as input and divides it into 2 files, for north and south separately.

[Flow diagram: File 'A' feeds jbt_sc_srt_cd_lkp via link lnk_A; outputs are lnk_A_north ('A' - North file) and lnk_A_south ('A' - South file).]

16.3 jbt_env_var

This is a template job with commonly used environment variables imported. It can be used for all the jobs being developed with this set of common environment variables, rather than importing them again and again. The values of these parameters will be set as per the development environment. The environment variables are as shown below:

$ADTFILEDIR: This will contain the audit file and reconciliation reports.
$BASEDIR: This folder is the base directory.
$DSEESCHEMADIR: DSEE schemas that are used by EE jobs using RCP/schema files.
$ITERATION: Current iteration number.
$JOBLOGDIR: This will contain all the error log files generated in DataStage jobs.
$PARMFILEDIR: This folder will contain parameter files that will be looked up by jobs/routines triggered from a common parameter file.
$REJFILEDIR: This will contain all the reject files generated in DataStage jobs.
$SCRIPTDIR: This will contain routine UNIX scripts used for processing files, copying, taking file backups, etc.
$SRCDATASET: All the input files will be partitioned and imported into DataStage datasets; this folder will store all the input datasets.
$SRCFILEDIR: This folder will contain all the input files from the Extract team. All files will be manually copied into this folder.
$SRCFORMATDIR: This folder will contain the copybook formats for the input source files. These copybook formats are as per the functional specifications.
$TMPDATASET: This folder will be used to store all the intermediate files created during transform jobs.
$TRGDATASET: This folder will be used for storing output DataStage dataset files.
$TRGFILEDIR: This folder will contain all the transformed output files, which can be loaded to Bank B's mainframe.
$TRGFORMATDIR: This folder will contain the copybook formats for the output files.

16.4 jbt_annotation

This is a template job where Annotations are used for describing the steps done at each stage. A Description Annotation is used as the job title, as the Description Annotation also appears in Job Properties > Short Job Description and in the Job Report when generated.

16.5 Job Log Snapshot

JobLogSnapShot.ksh is a script which will create the log file (as seen in DataStage Director) of a job's latest run. The following parameters need to be hard-coded in the script as per the environment:

DSHOME=/wload/dqad/app/Ascential/DataStage/DSEngine
PROJDIR=/wload/dqad/app/Ascential/DataStage/Projects/ranch_dev
LOGDIR=/wload/dqad/app/data/ranch_dev/itr01/errfile
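Because these values are hard-coded per environment, a guard at the top of such a script catches a wrong setting before any logging is attempted. A sketch; the paths here are demo values, not the real ones above, and the guard itself is an illustrative addition:

```shell
#!/bin/sh
# Demo values; a real script would use the hard-coded environment paths.
DSHOME=/tmp/ranch_demo/DSEngine
LOGDIR=/tmp/ranch_demo/errfile
mkdir -p "$DSHOME" "$LOGDIR"    # create the demo directories so the check passes

for d in "$DSHOME" "$LOGDIR"; do
  if [ ! -d "$d" ]; then
    echo "ERROR: required directory $d is missing" >&2
    exit 1
  fi
done
echo "environment OK" > status.txt
cat status.txt
```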
The script will be called from the after-job subroutine of a job:

ksh /wload/dqad/app/data/ranch_dev/com/script/JobLogSnapShot.ksh $1

$1 is the input parameter: the name of the job whose latest job log is required. DSHOME is the DataStage home path, PROJDIR is the project directory in which the job exists, and LOGDIR is a common directory where the log file will be created. The job log file will be created in:

/wload/dqad/app/data/ranch_dev/itr01/errfile/<Job name>_log_<time stamp>.txt

[Sample job log not reproduced here.]

16.6 Reconciliation Report

Reconcilation.ksh is a script which will create the reconciliation report of the respective functional area (FD). The script will be called from an Execute Command stage of a Job Sequence:

ksh /wload/dqad/app/data/ranch_dev/com/script/Reconcilation.ksh $1 $2

$1 is the 1st input parameter: FD##
$2 is the 2nd input parameter: the .ini file name (not the path)

Specifications of the .ini file (path: /wload/dqad/app/data/ranch_dev/com/parmfile): the .ini file will contain the following fields, separated by the | sign:

• the type of the file, i.e. Input, Output, Reject or Non-join (INP, OUT, REJ or NJN)
• the name of the file whose report is to be prepared
• the description of the file whose report is to be prepared
• the record length of the file (this is needed only for the output EBCDIC files)

Note that the input files will be datasets, the output files will be EBCDIC files, and the reject and non-join files will be in ASCII format. Entries should be in sorted order.

Sample .ini file:

INP|fd01_customer_pointer_file|Customer Pointer dataset created from source file
INP|fd01_customer_data_file|Customer Data dataset created from source file
OUT|fd01_redirection_file|Output redirection file|117
REJ|fd01_duplicates_file|Reject file containing duplicated account numbers
NJN|fd01_account_nonjoin|Nonjoin files from the join stage in job1

The reconciliation report will be created in:

/wload/dqad/app/data/ranch_dev/itr01/adtfile/<FD##>_recon_<time stamp>.txt

[Sample reconciliation report not reproduced here.]

16.7 Script Template

All scripts are made according to a common template script.
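The pipe-delimited layout above is straightforward to turn into report lines; a sketch of how such a reconciliation script might read its control file (the record counting itself is out of scope here):

```shell
#!/bin/sh
# Control file in the documented layout: TYPE|name|description[|record length].
cat > fd01.ini <<'EOF'
INP|fd01_customer_pointer_file|Customer Pointer dataset created from source file
OUT|fd01_redirection_file|Output redirection file|117
REJ|fd01_duplicates_file|Reject file containing duplicated account numbers
EOF

# One report line per entry; the 4th field only exists for output EBCDIC files.
awk -F'|' '{
  line = $1 "  " $2 "  " $3
  if (NF == 4) line = line "  (record length " $4 ")"
  print line
}' fd01.ini > recon_report.txt

cat recon_report.txt
```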
The template has a header describing the script and a section for maintaining the modification history of the script. The template is /wload/dqad/app/data/ranch_dev/com/script/ScriptTemplate.ksh.

16.8 Split File

SplitFile.ksh is a script which will split an input file into header, detail and trailer files. The script will be called from an Execute Command stage of a Job Sequence (the Import sequence):

ksh /wload/dqad/app/data/ranch_dev/com/script/SplitFile.ksh $1 $2

$1 is the 1st input parameter: <input file name without extension>
$2 is the 2nd input parameter: <record length>

This requires the input file name to have a .dat extension; the input file will be /wload/dqad/app/data/ranch_dev/itr01/opfile/$1.dat. The header, detail and trailer files created will be $1_hdr.dat, $1_dtl.dat and $1_trl.dat respectively, and will be output in /wload/dqad/app/data/ranch_dev/itr01/opfile/.
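The core of such a split is three line-addressed copies of the input file; a shell sketch, in which the sample data and the omission of the record-length checks are simplifications:

```shell
#!/bin/sh
# First record = header, last record = trailer, the rest = detail.
cat > fd01_acct.dat <<'EOF'
HDR-TDAACCT 20050621
1001 100
1002 250
TRL-TDAACCT 20050621 4
EOF

total=$(($(wc -l < fd01_acct.dat)))
head -1 fd01_acct.dat                    > fd01_acct_hdr.dat
sed -n "2,$((total - 1))p" fd01_acct.dat > fd01_acct_dtl.dat
tail -1 fd01_acct.dat                    > fd01_acct_trl.dat

wc -l < fd01_acct_dtl.dat
```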
16.9 Make File

Make_File.ksh is a script which will merge the header, detail and trailer records to create the target file. The script will be called from an Execute Command stage of a Job Sequence (the Unload sequence):

ksh /wload/dqad/app/data/ranch_dev/com/script/Make_File.ksh $1

$1 is the 1st input parameter: <target file name without extension>

This requires the header, detail and trailer files to be named $1_hdr.dat, $1_dtl.dat and $1_trl.dat respectively; all of these files will have to be present in /wload/dqad/app/data/ranch_dev/itr01/opfile/. The output file will be /wload/dqad/app/data/ranch_dev/itr01/opfile/$1.dat.

16.10 jbt_import

This template job processes the header, detail and trailer files created by SplitFile.ksh as described in 16.8. The header and trailer data is validated. The validations done on the header are:

• The file header identifier must contain the value 'HDR-TDAACCT'
• The file header date must equal the T-14 migration date
• The file trailer identifier must contain the value 'TRL-TDAACCT'
• The file trailer creation date must equal the file header creation date
• The file trailer record count must equal the total number of records on the input file, including the header and trailer records

The validations done on the trailer are:

• The file trailer record amount must equal the sum of the Closing Balance field from every record on the input file, excluding the header and trailer records. The accumulation of the Closing Balance field must be performed using an integer data format, allowing for overflow.

If any of the above checks fail, processing is immediately aborted with a relevant fatal error message; this is implemented using the subroutine AbortOnCall. The detail records are written to a dataset to be processed in the transform job. Note: these header/trailer validations are for FD01; they will vary (slightly) for other FDs, but the common approach shown in the template can be taken.

16.11 jst_import

This template job sequence calls the following components:

• SplitFile.ksh as described in 16.8
• jbt_import as described in 16.10

This sequence template will split the source file into 3 different files (header, detail and trailer) and call the import job, which will do the necessary validation and create a detail dataset.

16.12 jbt_unload

This template job illustrates the creation of header and trailer records. The trailer consists of a record count and a hash count. This template mainly implements the following logic:

• Total number of records on the file (excluding header and trailer)
• Hash of the account numbers from all detail records on the file
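Both the trailer checks above and the trailer creation reduce to simple aggregates over the records; a shell cross-check of a header/detail/trailer file, in which the field positions and data are illustrative:

```shell
#!/bin/sh
# Detail records: account number then closing balance; the trailer carries
# the full record count (4, including header and trailer) in field 3.
cat > input.dat <<'EOF'
HDR-TDAACCT 20050601
1001 100
1002 250
TRL-TDAACCT 20050601 4
EOF

status=OK
head -1 input.dat | grep -q '^HDR-TDAACCT' || status=BAD_HEADER

trailer_count=$(tail -1 input.dat | awk '{print $3}')
actual_count=$(($(wc -l < input.dat)))
[ "$trailer_count" -eq "$actual_count" ] || status=BAD_COUNT
echo "$status" > status.txt

# Aggregates a jbt_unload-style trailer would carry for the detail records
# (rows 2-3 here): record count and hash (sum) of the account-number column.
sed -n '2,3p' input.dat | awk '{ n++; hash += $1 } END { print n, hash }' > agg.txt

cat status.txt agg.txt
```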
16.13 jst_unload

This template job sequence calls the following components:

• jbt_unload as described in 16.12
• MakeFile.ksh as described in 16.9
• Reconciliation report as described in 16.6

This sequence template will create 3 different files (header, detail and trailer) and call the script which will combine these 3 files to create the target file. The reconciliation report is also created.

16.14 jbt_abort_threshold

The Abort Threshold template will abort a job based on a threshold value passed as a job parameter. It is used in places where a job needs to be aborted on a particular number of reject records. It uses a common routine called AbortOnThreshold, which has to be called from a BASIC Transformer:

AbortOnThreshold(@INROWNUM, <Threshold Value>, DSJ.ME)

Here <Threshold Value> is the job parameter. For example, if you give a Threshold Value of 5, the job will abort after 4 records pass through the BASIC Transformer.
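The AbortOnThreshold behaviour, stopping as soon as the running row count reaches the threshold so that a threshold of 5 lets 4 records through, can be sketched in shell. The reject data is illustrative; in the job it is the BASIC routine, driven by @INROWNUM, that does this:

```shell
#!/bin/sh
THRESHOLD=5
printf 'r1\nr2\nr3\nr4\nr5\nr6\nr7\n' > rejects.dat

rownum=0
passed=0
while read -r line; do
  rownum=$((rownum + 1))
  if [ "$rownum" -ge "$THRESHOLD" ]; then
    echo "ABORT at row $rownum: $passed records passed" > abort.txt
    break
  fi
  passed=$((passed + 1))
done < rejects.dat

cat abort.txt
```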