DataStage Best Practices

55000783.doc Page 1 of 41

CONTENTS

1. INTRODUCTION
   1.1 OBJECTIVE
   1.2 DOCUMENT USAGE
2. DATASTAGE OVERVIEW
3. DATASTAGE DEVELOPMENT WORKFLOW
   3.1 BUILDING AND TESTING JOBS
      3.1.1 Dummy_Dev Project
   3.2 OTHER DATASTAGE PROJECTS
4. DATASTAGE JOB DESIGN CONSIDERATIONS
   4.1 JOB TYPES
      4.1.1 Import Jobs
      4.1.2 Transform Jobs
      4.1.3 Unload Jobs
5. USE OF STAGES
   5.1 COMBINING DATA
      5.1.1 Join, Lookup and Merge Stages
      5.1.2 Aggregate Stage
      5.1.3 The Funnel Stage
   5.2 SORTING
   5.3 DATA MANIPULATION
      5.3.1 Transformer Usage Guidelines
      5.3.2 Modify Stage
   5.4 TRANSITIONING DATA
      5.4.1 External Data
      5.4.2 Parallel Dataset
   5.5 UNIT TEST
      5.5.1 Copy Stage
      5.5.2 Peek Stage
      5.5.3 Row Generator
      5.5.4 Column Generator
      5.5.5 Manual XLS Generation
6. GUI STANDARDS
7. DATASTAGE NAMING STANDARDS
8. RUNTIME COLUMN PROPAGATION (RCP)
9. STANDARDISED REJECT HANDLING
   9.1 REJECT COMPONENTS
   9.2 CUSTOMISED REJECT MESSAGES
   9.3 REJECT LIMIT
   9.4 BEFORE ROUTINE
   9.5 NOTIFICATIONS
      9.5.1 In-line Notification of Rejects
      9.5.2 Cross Functional Notification of Rejects
10. ENVIRONMENT
   10.1 DEFAULT ENVIRONMENT VARIABLES STANDARDS
   10.2 JOB PARAMETER FILE STANDARDS
   10.3 DIRECTORY PATH PARAMETERS
   10.4 DEFAULT DIRECTORY PATH PARAMETERS
   10.5 DIRECTORY & DATASET NAMING STANDARDS
      10.5.1 Functional Area Input Files
      10.5.2 Functional Area Output Tables
      10.5.3 Functional Area Staging Tables
      10.5.4 Internal Module Tables
      10.5.5 Datasets Produced from Import Processing
11. METADATA MANAGEMENT
   11.1 SOURCE AND TARGET METADATA
   11.2 INTERNAL METADATA
12. STANDARD COMMON COMPONENTS
   12.1 JOB TEMPLATES
      12.1.1 Import Jobs
      12.1.2 Transform Jobs
      12.1.3 Unload Jobs
   12.2 CONTAINERS
13. DEBUGGING A JOB
14. COMMON ISSUES AND TIPS
   14.1 1-WAY / N-WAY
   14.2 DUPLICATE KEYS
   14.3 RESOURCE USAGE VS PERFORMANCE
   14.4 GENERAL TIPS
15. REPOSITORY STRUCTURE
   15.1 JOB CATEGORIES
   15.2 TABLE DEFINITION CATEGORIES
   15.3 ROUTINES
   15.4 SHARED CONTAINERS
16. COMMON COMPONENTS USED IN DUMMY
   16.1 JBT_SC_JOIN
   16.2 JBT_SC_SRT_CD_LKP
   16.3 JBT_ENV_VAR
   16.4 JBT_ANNOTATION
   16.5 JOB LOG SNAPSHOT
   16.6 RECONCILIATION REPORT
   16.7 SCRIPT TEMPLATE
   16.8 SPLIT FILE
   16.9 MAKE FILE
   16.10 JBT_IMPORT
   16.11 JST_IMPORT
   16.12 JBT_UNLOAD
   16.13 JST_UNLOAD
   16.14 JBT_ABORT_THRESHOLD

1. INTRODUCTION

1.1 Objective

This document serves as the source of standards for the use of DataStage software on the Dummy Transformation project. All developers must follow the standards described below. While this document sets the standards, it cannot cover every development scenario. Where a scenario is not covered, the developer must contact the appropriate authority for clarification and ensure that the missing item is subsequently added to this document. It is therefore an evolving document, updated continually to reflect the changing needs and thinking of the development team, so that it continues to represent best practice as the project progresses. The initial review and sign-off process will be followed within this context.

1.2 Document Usage

This document describes the DataStage best practices to be applied to the Dummy Transformation project. It is intended to channel the general knowledge of DataStage developers towards the specific things they need to know about the Dummy project and the specific way jobs will be developed. Developers will reference it initially for familiarisation and then as required during the course of the project. Use of the document will therefore reduce over time as developers become familiar with the practices described.
The Offshore Build Manager will maintain the document, in collaboration with the development team through weekly developer meetings. The Offshore Build Manager is also responsible for distributing the document to developers and explaining its content, both initially and after each update, so that the standards it describes are communicated and understood. Such communication will highlight the areas of change. The best practices will also form the basis for QA and peer testing within the development environment.

2. DATASTAGE OVERVIEW

DataStage is a powerful Extraction, Transformation and Loading tool. DataStage has the following features to aid design and processing:

• Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements.
• Extracts data from any number or type of database.
• Handles all the metadata definitions required to define your data warehouse or migration. You can view and modify the table definitions at any point during the design of your application.
• Aggregates data. You can modify SQL SELECT statements used to extract data.
• Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms.

3. DATASTAGE DEVELOPMENT WORKFLOW

3.1 Building and Testing Jobs

This section provides an overview of the DataStage job development process for the Dummy Transformation project. As shown in the diagram below, there will be three environments: development, test and production. Within DataStage, a project is the entity in which all material related to a development is stored and organised. The development environment will have three projects through which code moves: Dummy_Dev, Version and Dummy_Promo. Developers will develop code in the Dummy_Dev project and, after unit testing, promote it to the Version project, where version control is managed.
After the code is baselined, the DataStage administrator will collate all code in the Dummy_Promo project, from where the DMCoE will move it to the Test server for unit and end-to-end testing. Finally, the DMCoE will move the code to production. Please refer to the Dummy Transform Code Migration Strategy document for further details.

[Figure: development workflow across the Development Server, Test Server and Production Server. Labels: Developer Role; Build Manager Role (Review / Defect Fix / QA / Sign Off); Administrator Role; Build / Unit Test Process; Deploy / Promote Process; Onshore Activity (DMCoE); Onshore E2E Test Activity (DMCoE); DS projects Ranch_Dev, Ranch_Dev\FDyy, Version, Ranch_Promo, Ranch_Test and Ranch_Prod.]

Each DataStage project is defined below.

3.1.1 Dummy_Dev Project

The Dummy_Dev project will be used by developers for building and unit testing DataStage jobs. It will be mapped to a working directory on the UNIX DataStage server. Changes and defects will be documented and fixed in this project before the job is promoted to “Dummy_Promo” for integration testing.

3.2 Other DataStage Projects

Several further DataStage projects will be employed across the Development, Test and Production environments. Please refer to the Dummy Transform Code Migration Strategy document for further details.

4. DATASTAGE JOB DESIGN CONSIDERATIONS

4.1 Job Types

As shown in the diagram below, there will be three types of job within Transform: Import, Transform and Unload. Source data with a complex file layout will be processed by these jobs in sequence to produce a target file in the format required by the Load team.

[Figure: job types (Import, Transform, Unload). Surviving label: Hogan Extract Data.]

4.1.1 Import Jobs

Import jobs will be the starting point for transformation. Sanity checks on the file and validation of external properties (for example, size) will be done here. The source file will be read according to the source record layout.
If there are any unwanted or bad records, the job will fail and the file must be corrected before the job is restarted. Source data will then be filtered to select the records to be processed, and the unprocessed data will be kept in a dataset for future reference. Finally, one or more datasets will be created as input to the actual transform process. See section 9 for further details of the action to be taken on failure or reject.

[Figure: Import Job flow. Labels: Hogan Extract Data; NonHogan Extract Data; Import Job; Check zero byte file, validate header and trailer details; Read file in specific format; Create output datasets; Records to be processed; NonHogan Extract; Write error details in Stats File and stop processing.]

4.1.2 Transform Jobs

Datasets created by import jobs will be processed by transform jobs. Transform jobs will join two or more datasets and look up data as required by the functional design specification. Finally, the records will be split according to the destination file design and a destination dataset will be created. All data errors will be captured in an exception log for future reference.

4.1.3 Unload Jobs

Unload jobs will take transform datasets as their source and create the final files required by the Load team in the given format.

[Figure: Transform and Unload job flow. Labels: Records to be Processed; Data Held for Future Job; Data Held in temporary datasets; Unload Job; Unload data in output file as per layout; Load Data; Target data is provided as flat files. A partially clipped note reads: "Transform jobs have a number datasets. Thes… from completed… or may represe… event of the ba… failure."]

5. USE OF STAGES

5.1 Combining Data

5.1.1 Join, Lookup and Merge Stages

The Join, Lookup and Merge stages combine two or more input links according to the values of key columns. They differ mainly in memory usage, in their treatment of rows with unmatched key values, and in their input requirements (whether inputs must be sorted and de-duplicated).
A brief description of when to use these stages is provided in the following table:

                                  | Join            | Lookup                         | Merge
----------------------------------+-----------------+--------------------------------+-------------------------
Type                              | SQL-like        | In RAM Lookup Table            | Master / Update
Memory                            | Light           | Heavy                          | Light
Number of Inputs                  | 1 Left, 1 Right | 1 Source, n Lookup Tables      | 1 Master, n Updates
Sort on Input                     | All             | None                           | All
Duplicates on Primary Input       | OK              | OK                             | Warning
Duplicates on Secondary Input(s)  | OK              | Warning                        | OK (when n=1)
Options on Unmatched Primary      | None            | Fail, Continue, Drop or Reject | Keep or Drop
Options on Unmatched Secondary    | None            | None                           | Capture as Reject
Number of Output Links            | 1               | 1 Out, 1 Reject                | 1 Out, n Rejects
Captured on Reject                | Nothing         | Unmatched Primary Rows         | Unmatched Secondary Rows

The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. If the datasets are larger than the available resources, the Join or Merge stage should be used.

5.1.2 Aggregate Stage

The purpose of the Aggregator stage is to perform data aggregations. To do this, it is necessary to understand the key columns that define the aggregation groups, the columns to be aggregated and the kind of aggregation. Common aggregation functions include:

• Count
• Sum
• Mean
• Min / Max

Several others are available to process business logic. However, aggregations are most likely to be used as part of a calculation that determines the number of rows in an output table, for inclusion in the header and footer records of unload files.

5.1.3 The Funnel Stage

The Funnel stage requires all input links to have identical schemas (column names, types and attributes, including nullability). The single output link matches the input schema.

5.2 Sorting

There are two options for sorting data within a job: a simple sort configured on the input properties page of many stages, or the explicit Sort stage.
The explicit Sort stage has additional properties, such as the ability to generate a key change column and to specify the memory usage of the stage.

5.3 Data Manipulation

5.3.1 Transformer Usage Guidelines

5.3.1.1 Choosing Appropriate Stages

The parallel Transformer stage always generates "C" code which is then compiled to a parallel component. For this reason, it is important to minimize the number of Transformers and to use other stages (Copy, Filter, Switch, Modify etc.) when derivations are not needed. Optimize the overall job flow design to combine derivations from multiple Transformers into a single Transformer stage when possible.

5.3.1.2 Transformer NULL Handling and Reject Link

When evaluating expressions for output derivations or link constraints, the Transformer will reject (through the reject link, indicated by a dashed line) any row that has a NULL value used in the expression. To create a Transformer reject link in DataStage Designer, right-click on an output link and choose "Convert to Reject".

The Transformer rejects NULL derivation results because the rules for arithmetic and string handling of NULL values are by definition undefined. For this reason, always test for NULL values before using a column in an expression, for example:

   If ISNULL(link.col) Then… Else…

5.3.1.3 Transformer Derivation Evaluation

Output derivations are evaluated BEFORE any type conversions on the assignment. For example, the PadString function uses the length of the source type, not the target. It is therefore important to make sure that any type conversion is done before a row reaches the Transformer. For example, TrimLeadingTrailing(string) works only if string is a VarChar field, so the incoming column must be of type VarChar before it is evaluated in the Transformer.

5.3.1.4 Optimizing Transformer Expressions and Stage Variables

In order to write efficient Transformer stage derivations, it is useful to understand what items are evaluated and when.
The evaluation sequence is as follows:

• Evaluate each stage variable initial value
• For each input row to process:
   o Evaluate each stage variable derivation value, unless the derivation is empty
   o For each output link:
      1. Evaluate each column derivation value
      2. Write the output record
   o Next output link
• Next input row

The stage variables and the columns within a link are evaluated in the order in which they are displayed in the Transformer editor. Similarly, the output links are evaluated in the order in which they are displayed.

From this sequence, it can be seen that certain constructs are inefficient to include in output column derivations, because they are evaluated once for every output column that uses them. Such constructs are described below.

Where the same part of an expression is used in multiple column derivations

For example, suppose multiple columns in output links want to use the same substring of an input column. The following test may then appear in a number of output column derivations:

   IF (DSLink1.col1[1,3] = "001") THEN ...

In this case, the substring DSLink1.col1[1,3] is evaluated for each column that uses it. This can be made more efficient by moving the substring calculation into a stage variable, so that the substring is evaluated just once for every input row. The stage variable definition will be:

   DSLink1.col1[1,3]

and each column derivation will start with:

   IF (StageVar1 = "001") THEN ...

This example can be improved further by also moving the string comparison into the stage variable. The stage variable will be:

   IF (DSLink1.col1[1,3] = "001") THEN 1 ELSE 0

and each column derivation will start with:

   IF (StageVar1) THEN

This reduces both the number of substring functions evaluated and the number of string comparisons made in the Transformer.
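
The saving described above can be illustrated with a small model. The following is a minimal sketch in plain Python, not DataStage code: it only counts how often a shared substring test is evaluated when it is repeated in every column derivation, compared with when it is held in a stage variable. All function and variable names are hypothetical and chosen for illustration.

```python
# Illustrative model only: plain Python, not DataStage code.
# All names (substring, derive_without_stage_var, ...) are hypothetical.

calls = {"substring": 0}


def substring(value, start, length):
    """Stand-in for the col[start,length] substring operator (1-based start)."""
    calls["substring"] += 1
    return value[start - 1:start - 1 + length]


def derive_without_stage_var(row, n_columns):
    # Every output column derivation repeats the same substring test.
    return ["match" if substring(row, 1, 3) == "001" else "other"
            for _ in range(n_columns)]


def derive_with_stage_var(row, n_columns):
    # The test is evaluated once per input row, as a stage variable would be,
    # and each column derivation only reads the stored result.
    stage_var1 = 1 if substring(row, 1, 3) == "001" else 0
    return ["match" if stage_var1 else "other" for _ in range(n_columns)]


def count_calls(derive, rows, n_columns):
    """Run one derivation style over all rows and report substring calls."""
    calls["substring"] = 0
    outputs = [derive(row, n_columns) for row in rows]
    return outputs, calls["substring"]


rows = ["001ABC", "002XYZ", "001DEF"]
out_a, calls_a = count_calls(derive_without_stage_var, rows, n_columns=4)
out_b, calls_b = count_calls(derive_with_stage_var, rows, n_columns=4)
print(calls_a, calls_b)  # 12 substring evaluations versus 3
```

With three input rows and four output columns, the repeated form evaluates the substring once per row per column, while the stage variable form evaluates it once per row. Both forms produce the same output values.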
Where an expression includes calculated constant values

For example, a column definition may include a function call that returns a constant value, such as:

   Str(" ",20)

This returns a string of 20 spaces. In this case, the function is evaluated every time the column derivation is evaluated. It is more efficient to calculate the constant value just once for the whole Transformer, which can be achieved using stage variables. The function could be moved into a stage variable derivation, but it would then still be evaluated once for every input row. The solution is to move the function evaluation into the initial value of a stage variable.

A stage variable can be assigned an initial value from the Stage Properties dialog/Variables tab in the Transformer stage editor. In this case, the variable will have its initial value set to:

   Str(" ",20)

Leave the derivation of the stage variable on the main Transformer page empty. Any expression that previously used this function should be changed to use the stage variable instead.

The initial value of the stage variable is evaluated just once, before any input rows are processed. Because the derivation expression of the stage variable is empty, it is not re-evaluated for each input row. Its value therefore remains unchanged from the initial value for the whole of the Transformer processing.

In addition to a function call returning a constant value, another example is part of an expression such as:

   "abc" : "def"

As with the function call example, this concatenation is evaluated every time the column derivation is evaluated. Since this subpart of the expression is actually constant, it can again be moved into a stage variable, using the initial value setting to perform the concatenation just once.
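
The three placements of a constant discussed above (in each column derivation, in a stage variable derivation, and in a stage variable initial value) can be compared with a small model. This is a minimal sketch in plain Python, not DataStage code; pad() stands in for Str(" ",20), and every name in it is hypothetical.

```python
# Illustrative model only: plain Python, not DataStage code.
# pad() and the run_* functions are hypothetical names.

evaluations = 0


def pad():
    """Stand-in for Str(" ", 20): builds a string of 20 spaces."""
    global evaluations
    evaluations += 1
    return " " * 20


def run_in_column_derivations(rows, n_columns):
    # The constant is rebuilt in every column derivation of every row.
    return [[row + pad() for _ in range(n_columns)] for row in rows]


def run_in_stage_var_derivation(rows, n_columns):
    # The constant is rebuilt once per input row.
    output = []
    for row in rows:
        stage_var = pad()
        output.append([row + stage_var for _ in range(n_columns)])
    return output


def run_in_stage_var_initial_value(rows, n_columns):
    # The constant is built once, before any row is processed;
    # the per-row derivation is left empty, so nothing is re-evaluated.
    stage_var = pad()
    return [[row + stage_var for _ in range(n_columns)] for row in rows]


def measure(run, rows, n_columns):
    """Run one placement and report how many times pad() was evaluated."""
    global evaluations
    evaluations = 0
    result = run(rows, n_columns)
    return result, evaluations


rows = ["a", "b", "c", "d", "e"]
results, counts = {}, {}
for run in (run_in_column_derivations,
            run_in_stage_var_derivation,
            run_in_stage_var_initial_value):
    results[run.__name__], counts[run.__name__] = measure(run, rows, 3)

print(counts)
```

For five rows and three output columns, the model evaluates the constant 15 times, 5 times and once respectively, while all three placements produce identical output.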
Where an expression requiring a type conversion is used as a constant, or is used in multiple places

For example, an expression may include something like this:

   DSLink1.col1 + "1"

In this case, the "1" is a string constant, so in order to add it to DSLink1.col1 it must be converted from a string to an integer each time the expression is evaluated. The solution is simply to change the constant from a string to an integer:

   DSLink1.col1 + 1

In this example, if DSLink1.col1 were a string field, a conversion would be required every time the expression is evaluated. If this appears only once, in one output column expression, that is acceptable. However, if an input column is used in more than one expression and requires the same type conversion in each, it is more efficient to use a stage variable to perform the conversion once. In this case, you would create, for example, an integer stage variable, specify its derivation to be DSLink1.col1, and then use the stage variable in place of DSLink1.col1 wherever that conversion would have been required.

Note that when stage variables are used to evaluate parts of expressions, the data type of the stage variable should be set correctly for that context. Otherwise, needless conversions are required wherever the variable is used.

5.3.2 Modify Stage

The Modify stage is the most efficient stage available. Transformations that touch a single field, such as keep/drop, type conversions, some string manipulations and null handling, are the primary operations that should be implemented using Modify instead of a Transformer.

5.4 Transitioning Data

5.4.1 External Data

The External Source stage is a file stage used to read data that is output from one or more source programs. The stage calls the program and passes appropriate arguments. The stage can have a single output link and a single reject link.
It can be configured to execute in parallel or sequential mode. This stage will typically be used in the ‘Import’ jobs to import the external data into parallel datasets to be processed by subsequent ‘Transformation’ jobs.

5.4.2 Parallel Dataset

The Data Set stage is used to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode. DataStage Parallel Extender jobs use data sets to manage data within a job. The Data Set stage can store data being operated on in a persistent form, which can then be used by other DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs.

Parallel datasets will be created from the external data by the ‘Import’ job, and whenever intermediate datasets are needed for one or more subsequent jobs to process. Because of the parallel nature of the processing, the danger of bottlenecks during dataset creation is eliminated.

5.5 Unit Test

5.5.1 Copy Stage

The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or columns can be dropped or their order changed. For further modification, such as changing column data types, use the Modify stage (see section 5.3.2). The Copy stage is commonly used for debugging and testing, where a copy of the data flowing from a particular stage can be isolated from the flow and analysed.

5.5.2 Peek Stage

The Peek stage can print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets.
This stage is used during unit testing when only the data in specific columns needs to be analysed, to validate that the preceding transformation logic is working as desired.

5.5.3 Row Generator

The Row Generator stage is a Development/Debug stage that has no input links and a single output link. It produces a set of mock data fitting the specified metadata. This is useful when you want to test a job but the real data is not yet available, whether that data is a source file or a dataset produced by another job that is still under development. More detail can be specified for each data type, if required, to shape the data being generated. For example:

• Type ‘Cycle’, specifying the ‘Increment’ value required
• Type ‘Random’, specifying the percentage of invalid/zero data required

5.5.4 Column Generator

The Column Generator stage is a Development/Debug stage that can have a single input link and a single output link. It adds columns to incoming data and generates mock data for these columns for each data row processed. The new data set is then output. This is used where real data is not available for all columns during testing; those columns are populated with mock data fitting the specified metadata.

5.5.5 Manual XLS Generation

In addition to the ‘Row Generator’ and ‘Column Generator’ methods that DataStage provides, mock data can also be created manually in an XLS file and saved as a CSV file, which is then given as input to the DataStage job that requires the test data. These methods of data generation will be used extensively during unit testing.

6. GUI STANDARDS

Job Description Fields: the description annotation is mandatory for each job. Note that the description annotation updates the job short description.
The full description should include the job version number, developer name, date and a brief reference to the design document (including the version number the job has been coded up to), plus the main job annotation and any modifications to the job. Where the job has not yet entered Version Control, the initial version should be referred to as 0.1.
When using DataStage Version Control, the Full Description field in job properties is also used by DS Version Control to append revision history. This is packaged and maintained with the job and will be visible when the jobs are deployed to test, promo and production. It does not stop developers from using Full Description as a method of maintaining the relevant documentation, but information maintained by the developer will be appended to by the Version Control tool.
Naming conventions must be enforced on links, transforms and source and target files.
Annotations are also used to further describe the functionality of jobs and stages. Two types of annotation are used: a blue job description (description annotation) and a yellow operator-specific description (standard annotation). The detailed description is also updated automatically by the DataStage Version Control process following the first initialization into Version Control. Entries put in the detailed description by Version Control must not be modified manually. Standard annotations should be used on every non-trivial stage.

7. DATASTAGE NAMING STANDARDS

Object Type                  Syntax
Category                     Import/transform/unload
Job                          jb_fdXX_<im/tr/ul>_<JobName>
                             Where XX is 01,02…13 indicating FD name.
                             <im> indicates Import Job
                             <tr> indicates Transform Job
                             <ul> indicates Unload Job
Job Sequence                 js_<fdXX>_<im/tr/ul>_<file/detail>
                             Where XX is 01,02…13 indicating FD name.
                             <im> indicates Import Job Sequence
                             <tr> indicates Transform Job Sequence
                             <ul> indicates Unload Job Sequence
Source Definition Category   source
Target Definition Category   target
Link*                        lnk_<StageName>_<rej/njn/jn>
                             lnk_<StageName>
                             <StageName> is the name of the stage from which the link is coming out.
                             <rej/njn/jn> indicates the type of link: rej=reject, njn=non join, jn=join.
                             If not applicable then this will be dropped.

Parallel Job FILE Stages
Data Set                     ds_<Dataset Name>
Sequential File              sq_<Sequential file name>
File Set                     fs_<File Set name>
Lookup File Set              lfs_<Lookup file set name>
External Source              esrc_<External Source name>
External Target              etrg_<External Target name>
Complex Flat File            cff_<Complex Flat File name>

Parallel Job Processing Stages
Transformer                  tr_<Purpose>
BASIC Transformer            btr_<Purpose>
Aggregator                   agg_<Purpose>
Join                         jn_<Purpose>
Merge                        mrg_<Purpose>
Lookup                       lkp_<Purpose>
Sort                         srt_<Purpose>
Funnel                       fnl_<Purpose>
Remove Duplicates            rdup_<Purpose>
Compress                     cps_<Purpose>
Expand                       exp_<Purpose>
Copy                         cp_<Purpose>
Modify                       md_<Purpose>
Filter                       flt_<Purpose>
External Filter              sflt_<Purpose>
Change Capture               ccap_<Purpose>
Change Apply                 capp_<Purpose>
Difference                   diff_<Purpose>
Compare                      cmp_<Purpose>
Encode                       enc_<Purpose>
Decode                       dec_<Purpose>
Switch                       cwt_<Purpose>
Generic                      gen_<Purpose>
Surrogate Key                sur_<Target Column Name>

Parallel Job RESTRUCTURE Stages
Column Import                ci_<Purpose>
Column Export                ce_<Purpose>
Make Subrecord               msub_<Purpose>
Split Subrecord              ssub_<Purpose>
Combine Records              crec_<Purpose>
Promote Subrecord            prec_<Purpose>
Make Vector                  mkv_<Purpose>
Split Vector                 splv_<Purpose>

Containers
Local Container              lc_<functionality>
Shared Container             sc_<functionality>

Others
Stage Variable               s_<StageVariableName>
Sequence Generator           seq_<Target Column Name>

Job Sequences Stages
Job Activity                 ja_<job name without jb and fd#>
Execute Command              ex_<Script function>_<file/detail>
Sequencer                    sq_<Purpose>

8.
RUNTIME COLUMN PROPAGATION (RCP)
One of the aims and benefits of RCP is to enable jobs whose metadata is variable and determined at run time. An example would be a generic job that reads a flat file and stores the data in a dataset, where the file name itself is a job parameter. In this case it is not possible to determine the column definitions at build time.
Conversely, RCP sometimes confuses developers: in jobs where RCP is not wanted but the feature is switched on, additional columns that the developer thought had been dropped can appear in the output dataset.
For these reasons developers must turn off RCP within each job unless the feature is explicitly required, as in the example above.
In any event, RCP should be enabled within the Project Properties (providing the flexibility to use RCP at job level). Where RCP is required, it can then be turned on at job or stage level. An annotation on the job should make this clear.

9. STANDARDISED REJECT HANDLING

9.1 Reject Components
There is a requirement to set up a standard approach to reject handling. The standardisation of reject capture allows operational support to easily:
• locate the rejection message and understand the format of the message
• locate and diagnose the reason for rejections
• set tolerances on the number of rejects permitted
• re-process rejected rows.
Reject processing is not provided as standard across the majority of stages within DataStage Enterprise (Parallel). There is a reject link on the Lookup stage. However, a standard approach must be introduced for the remaining stages and adopted across all stages. This will be achieved by the introduction of a bespoke element (in the form of example stages within template jobs) and through the use of a standardised reject component made available to all developers via a DataStage wrapper.
These components are shown in the following diagram:

[Diagram not reproduced; surviving label: Records to be Processed]

All stages (where a row might be rejected) must include a reject link. Three such stages are shown in the diagram (i.e. Join, Lookup and Transform). In the example above, the Lookup stage is shown with a reject link, though this is just as applicable to Join, Transform and other stages. For instance, data flowing down the reject link from a Lookup or Join stage might result from an inability to match keys, and from a Transform stage from the validation of data items, where an unexpected value or null might be encountered.
In each of these cases, the rejected row is passed down a reject link to a bespoke component that:
1. Passes the row to a dataset in order to facilitate the re-processing of the rejected rows
2. Identifies the key of the rejected row and passes this down the relevant link (depending on the key type) to the standardised reject handling component. Where there is no key, i.e. a file is empty or there is a mismatch between the number of rows read and the information provided on the footer record, zeros are passed down all links intended for key information
3. Compiles and passes a standard message (see table below) describing the rejection to the standardised reject handling component.
This approach assumes that a key uniquely identifying each failing row is present on driving flows.
The standardised reject component takes two inputs (over a possible five input links), creates a surrogate key uniquely identifying each reject, and writes the message along with the two keys to a dataset. This reject dataset therefore holds the key from the rejected row (which can be used to cross-reference the dataset of rejected rows) and a message that will help identify the reason for the rejection.
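The record written by the standardised reject component can be sketched as follows. This is a minimal illustration in Python, not the DataStage template itself: the names RejectRecord and RejectCollector, the field names and the example job and stage names are all hypothetical, and only the behaviour (a surrogate key per reject, the rejected row's key or zero, and a standard message) is taken from the description above.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class RejectRecord:
    """One row of the reject dataset (hypothetical field names)."""
    reject_id: int   # surrogate key, unique for each reject
    row_key: int     # key of the rejected row, 0 where no key exists
    message: str     # standard reject message with job and stage name


class RejectCollector:
    """Stand-in for the standardised reject component."""

    def __init__(self, start=1):
        self._ids = count(start)   # surrogate key generator
        self.records = []

    def add(self, row_key, message):
        # Zero is recorded when there is no key, e.g. for an empty file.
        key = row_key if row_key is not None else 0
        record = RejectRecord(next(self._ids), key, message)
        self.records.append(record)
        return record


collector = RejectCollector()
collector.add(1001, "Lookup / Join Failure: jb_fd01_tr_example / lkp_example")
collector.add(None, "Empty File: jb_fd01_im_example / sq_example")
```

The row_key column is what allows a reject message to be cross-referenced to the separate dataset of rejected rows held for re-processing.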
The paths to which reject datasets are automatically written are date stamped within a common reject and log directory. Reject datasets are uniquely named and created each time the module runs (see below).
The reject component will be used with every stage which can fail due to data discrepancies (e.g. Join and Lookup). The error link from the Lookup stage can be linked directly to the custom error component, whereas the Join stage requires further processing in order to facilitate reject handling. This processing requirement is shown in the following diagram:
This component will be made available to all developers for use in reject handling as a job template.

9.2 Customised Reject Messages
The creation of a list of standard error conditions limits the number of exceptions an operator will see, allowing errors to be quickly identified and resolved. Developers are limited to using the messages specified below, thus preventing the creation of random error messages.
The following reject messages / conditions will be used:

Category: Secondary Data Source
Reject Message          Description
Lookup / Join Failure   For all referential integrity checking and any other critical Lookups / Joins. Keys have not been matched between input links on a Join or Lookup stage. The job and stage name must be included in the message.
Row Count Mismatch      The number of records processed does not match the number of records described in the footer record. The job and stage name must be included in the message. This message is particular to import jobs where the input file is validated against the footer record.
Empty File              The input file is empty. This message is particular to import jobs where the input file is validated against the footer record.

Category: Column
Reject Message          Description
Invalid Field           A field has been identified as containing invalid data. The job, field and stage name must be included in the message.
Null Field              A Not Null field has been identified as containing null values. The job, field and stage name must be included in the message.

Developers must intercept rejects in the code they write and generate a standard reject message that contains accurate data and relevant information from the record. The job, field and stage names must be inserted into the message. A description of rejects and messages should be made available to operational support to help diagnose problems encountered when running the batch.

9.3 Reject Limit
A Reject Limit parameter is included in all jobs. This is used by the standardised reject processing wrapper to test against the total number of errors for a module. On meeting the reject limit, the job, and hence the processing for the module, is terminated.
The reject limit will be variable between 0 and 99. A reject limit of 0 (zero) will ABORT ON FIRST REJECT, whilst a reject limit of 99 will NEVER ABORT (on reject). This allows central control of the level of rejects allowed across all modules and jobs used in the Dummy batch.

9.4 Before Routine
The before routine for the first job in a sequence of jobs that implement a module (or for a single job where there is only one job in a module) will be used to interrogate and increment the number that uniquely identifies the datasets created from the processing of rejects for a particular module.

9.5 Notifications
A notification is the method by which:
• operations are informed of a reject, i.e. in-line notifications
• rejects are communicated between functional streams and / or retained to support the rerunning of modules, i.e. cross functional notifications.
These are described in the following sections.

9.5.1 In-line Notification of Rejects
In-line notifications are those resulting from rejects within a functional processing stream.
The last activity within a module will be to email a notification of the rejects within that module to operations. This will be achieved by using the Notification stage. A template job will be provided that includes the Notification stage and job parameters that can be tailored such that the names and paths of the reject datasets can be interrogated and the relevant notifications made.

9.5.2 Cross Functional Notification of Rejects
This type of notification is the means by which rejects are communicated between functional streams. This ‘communication’ is built around the feedback from the load process (prompting a rerun), the feedback between migration steps (i.e. T14 to T) and an understanding of the dependencies between functional areas, i.e. transactions being dependent on accounts etc. In this instance, rejected accounts will be incorporated into transaction processing, limiting the transactions processed to those for which an account has also been successfully processed.

10. ENVIRONMENT

10.1 Default Environment Variables Standards
DataStage Enterprise Edition allows project / job tuning by means of Environment variables. These include the settings of the default node configuration file, and error log activities.
The following DataStage Environment variables must exist in all jobs.
(Note that DataStage Environment Variables are different to Standard Parameters.) The template job already has these parameters defined:
• $APT_CONFIGX_FILE =
  /DataStage/Product/Ascential/DataStage/Configx1.apt (default value used in every job)
  /DataStage/Product/Ascential/DataStage/Configx4node.apt (value overwritten for testing on extra nodes in individual jobs)
• $APT_DUMP_SCORE = false

10.2 Job Parameter File Standards
A generic parameter file which stores all the default job parameter values, including user names and login details, will be used in conjunction with the before job routine “SetDSParamsFromFile”. This will allow project wide settings to be changed once, and avoid unnecessary parameter duplication.
The path to this parameter file will be /DataStage/Parameters/<project name> and its name will be parameters.lst

10.3 Directory Path Parameters
The following parameters must exist in all jobs. The template job, held under Users/Template in DataStage, will have these parameters defined.
• pDSPATH = /XX/XX/Dummy (DataStage Datasets top level development directory; there will be equivalents for testing and Live).
  o XX: Base Dummy directory as set by DMCoE.
• pITERATION = n (where n is the migration iteration i.e. from 1 to 9)
• pRUNNUMBER = n (where n is the run number within the iteration starting from 1)

10.4 Default Directory Path Parameters
The following parameters must exist in all jobs (the template job has these parameters defined):
• pDSPATH = /DataStage/Datasets/DummyDev (DataStage Datasets top level directory)
• pITERATION = 1
• pRUNNUMBER = 1

10.5 Directory & Dataset naming standards
UNIX directory paths are set using the following convention, based on the parameters defined above. Note the final subdirectories (i.e. “Deliver” and “Internal”) are hard coded in the jobs. This is acceptable because, if the developer mistypes the value, the job will fail immediately as the mistyped directory will not exist.
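As a minimal sketch of how a dataset path is assembled from the three parameters, the Python function below mirrors the pattern #pDSPATH#/#pITERATION#/#pRUNNUMBER#/<subdirectory>/<datasetname>.ds used in the sections that follow. The function name dataset_path and the dataset name accounts are hypothetical; in the jobs themselves the substitution is performed by DataStage job parameters, not by code like this.

```python
from pathlib import PurePosixPath


def dataset_path(ds_path, iteration, run_number, subdir, dataset_name):
    """Build <pDSPATH>/<pITERATION>/<pRUNNUMBER>/<subdir>/<dataset_name>.ds.

    subdir is the hard-coded final directory, e.g. "Internal".
    """
    path = (PurePosixPath(ds_path) / str(iteration) / str(run_number)
            / subdir / f"{dataset_name}.ds")
    return str(path)


# Default development values from section 10.4; "accounts" is a made-up name.
example = dataset_path("/DataStage/Datasets/DummyDev", 1, 1, "Internal", "accounts")
```

With the default values this yields /DataStage/Datasets/DummyDev/1/1/Internal/accounts.ds. A mistyped subdirectory simply produces a path that does not exist, which is why the job fails immediately.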
10.5.1 Functional Area Input Files
Source files will be pushed by the Extract system to the ETL server, into a holding area ‘Hold’, via Connect:Direct software.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Hold/<source_file_name>

10.5.2 Functional Area Output Tables
Datasets that are defined in the Detailed Design as output tables for a functional area are stored in a “Product” directory. This is the directory that downstream Functional Areas (including the Unload process) will go to find input tables from previous areas.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Product/<datasetname>.ds

10.5.3 Functional Area Staging Tables
Datasets that are defined in the Detailed Design as staging tables within an area are stored in a “Staging” directory. This is the directory that other modules within the same Functional Area will go to find staging tables from previous modules.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Staging/<datasetname>.ds

10.5.4 Internal Module Tables
Datasets produced within a module and used only internally within that module will be stored in an “Internal” directory. Datasets in this directory are only used within jobs.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Internal/<datasetname>.ds

10.5.5 Datasets Produced from Import Processing
Datasets that are produced by Pre-Processing are stored in a “Source” directory. This is the directory that Functional Areas will go to find input tables from the source.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Source/<datasetname>.ds
Reference Datasets that are produced by Import Processing are stored in a “Reference” directory. Reference data is not split into iterations.
#pDSPATH#/Reference/<datasetname>.ds

11. METADATA MANAGEMENT
Metadata consists of record formats for all external files (flat files) and internal files (datasets) processed by DataStage which are stored in the DataStage Repository (a Metadata repository). Metadata is either created manually within stages (i.e.
Flat File, Complex Flat File and Dataset) or imported from sources such as COBOL copybooks. There are two types of Metadata, described below:

11.1 Source and Target Metadata
Record formats will have been pre-defined within the DataStage Repository describing the record formats of files that form inputs to import jobs and outputs from unload jobs. This metadata will therefore only be used by import and unload jobs.
These record formats are for the convenience of developers (they are described in the FDs and are therefore fixed) and help maintain consistency in the way data is interpreted across all jobs (define once, use many times), which has a positive impact on quality.
This metadata must not be changed by developers. Should a change be required, its potential impact on the jobs that use the metadata must first be assessed and the change processed through standard change control.

11.2 Internal Metadata
Developers will also create metadata describing the datasets that:
• pass data between jobs within a functional area
• pass data between jobs in different functional areas.
This metadata will define the outputs of import jobs, be used by all transform jobs and define the inputs to unload jobs. It must be stored in the repository with a name that matches the name of the dataset it describes.
Should it be necessary or more efficient to process data in a different way from the way it is presented within the pre-defined metadata, developers may create a job specific version of the metadata, which must be clearly identified as a variant of the original and saved within the repository.

12.
STANDARD COMMON COMPONENTS
The use of Standard Components in developing DataStage jobs will:
• Increase the quality of the code, since the most efficient method will be used for a function that is needed in multiple jobs
• Promote reuse: productivity is increased and developers can spend more time on tasks which are specific to individual jobs
• Reduce the complexity of common tasks.

12.1 Job Templates
DataStage provides Intelligent Assistants which guide the developer through basic DataStage tasks. The Intelligent Assistants are listed below:
• Create a template for a server or parallel job. This can subsequently be used to create new jobs. New jobs will be copies of the original job
• Create a new job from a previously created template
• Create a simple parallel data migration job. This extracts data from a source and writes it to a target.
The use of templates not only helps standardization but also provides reusable components which need not be coded again. Certain elements will be common to many jobs, namely parameters, annotations, reject handling etc., and these can be implemented through templates.
The Dummy project will have templates, each of which is a job whose stages follow the naming standards. These template jobs will help developers build new jobs in line with the standards described here.

12.1.1 Import Jobs
Each source file will be read into persistent datasets by separate jobs called import jobs. These jobs will perform sanity checks on the received file, e.g. that the data file is not empty and that header and trailer details are consistent with the file properties.
In the Dummy project the same files are used repeatedly by different functionalities, so each file will be read only once to create a DataStage dataset. These datasets will then be used by the respective functionalities. Since the logic for importing and validating files will be the same in each case, one such job will be built and tested and its design reused for the rest.
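The sanity checks performed by an import job can be sketched as follows. This is an illustration in Python rather than a DataStage job design, and it rests on stated assumptions: the footer is taken to be the last line of the file in the hypothetical form T|<row count>, header handling is omitted, and the function name validate_import is invented. The two messages returned are the standard reject messages from section 9.2.

```python
def validate_import(lines, footer_prefix="T|"):
    """Return None if the file passes, otherwise a standard reject message.

    lines is the file content as a list of strings. The footer layout
    "T|<row count>" is an assumption made for this sketch only.
    """
    if not lines:
        return "Empty File"

    *data_rows, footer = lines
    if not footer.startswith(footer_prefix):
        raise ValueError("footer record not found")

    if not data_rows:
        # Only a footer was received: no data to import.
        return "Empty File"

    expected = int(footer[len(footer_prefix):])
    if expected != len(data_rows):
        return "Row Count Mismatch"
    return None


result = validate_import(["row one", "row two", "T|2"])
```

Because every import job applies the same checks, holding them in one place (here a single function, in DataStage a template job) is what allows the design to be built and tested once.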
The files that are used in multiple instances are described below:

Common Source Files                          Functionalities the file is used in
Import Account Selection File                FD01, FD09
Import Customer Selection File               FD01, FD05, FD06, FD12
Import ETL Customer Data File                FD01, FD03, FD05, FD09
Import ETL Address Data File                 FD01, FD05
Import ETL Customer Pointer File             FD01, FD03, FD05, FD09
Import ETL DDA Account Data                  FD01, FD02, FD03, FD09, FD11
Import ETL TDA Account Data                  FD01, FD02, FD11
Import TAX Certification File                FD02, FD05
Import ETL Re-directions table load file     FD01, FD02, FD03, FD04, FD07, FD08, FD09, FD10, FD11, FD13

12.1.2 Transform Jobs
Dummy Transform jobs repeatedly perform joins on similar driver files with other data files. Since this functionality is common, these processes will be developed once and copied wherever they occur. The table below identifies such occurrences:

Common Process: Sort Code Lookup & Split data based on processing centre
Functionality:  FD02, FD03, FD04, FD07, FD08, FD10, FD11, FD13

Common Process: ETL Redirections Table Load file performs the same join with many different files, i.e. Join based on S/C & Acc Num
Functionality:  FD01, FD02, FD03, FD04, FD07, FD08, FD09, FD10, FD11, FD13

Common Process: ETL Customer Data File performs the same join with many different files, i.e. Join based on Customer Num
Functionality:  FD01, FD03, FD05, FD09

Common Process: ETL Customer Pointer File performs the same join with many different files, i.e. Join based on Customer Num (to get details of associated Account Numbers for each customer)
Functionality:  FD01, FD03, FD05, FD09

Common Process: Customer Selection File performs the same join with many different files, i.e. Join based on Customer Num
Functionality:  FD01, FD06, FD12

Common Process: Account Selection File performs the same join with many different files, i.e. Join based on S/C & Acc Num
Functionality:  FD01, FD09

Common Process: ETL Re-directions Table Load File JOIN WITH ETL DDA Account Data
Functionality:  FD02, FD11

Common Process: ETL Re-directions Table Load File JOIN WITH ETL TDA Account Data
Functionality:  FD02, FD11

12.1.3 Unload Jobs
Dummy Unload jobs are tasked with creating output files in the format required by the load team. These files will mainly be in mainframe format. Apart from creating files from persistent datasets, these jobs will create header and trailer details within the file.

12.2 Containers
A container is a group of stages and links. Containers simplify and modularize server job designs by replacing complex areas of the diagram with a single container stage. DataStage provides two types of container:
• Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job’s Diagram window. Local containers can be used in server jobs or parallel jobs.
• Shared containers. These are created separately and are stored in the Repository in the same way as other jobs. There are two types of shared container:
  o Server shared containers are used in server jobs. They can also be used in parallel jobs, though this can cause bottlenecks in processing as they are serial only, and should be avoided if possible
  o Parallel shared containers are used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job).
Containers are the means by which standard DataStage processes are captured and made available to many users. They are used just as a developer would use a standard stage.
Some work needs to be done to identify opportunities for reuse within the overall design. Once identified, reusable components will be delivered into the DataStage repository as shared components.
Identified containers in the Dummy transform project are described in the table below:

Container Functionalities   Definition
Reject Handling             Will act on the joins, lookups and active transformations to check records eliminated in process and log them in a separate file. The functionality needed is discussed in section 9.
Statistics Report logger    This component will log messages in a specified file. It will take as input the filename and the message to be written.

13. DEBUGGING A JOB
The following techniques will assist when debugging a job.
Debugging essentially involves viewing the data in order to isolate the fault. There are a number of techniques including:
• Adding a Peek stage, which will output certain rows to the job log
• Adding a filter to the start of the job to filter out all rows except the ones with the attributes that the developer wishes to test or debug
• Adding an additional output to a transformer with the relevant constraints and storing the data in a sequential file to be used as part of the investigation. The use of the Copy stage would also be an option
• A variant of the above would be to add a parameter pDEBUG with a value of 1 or 0 that will be used as part of the constraint. The resulting debug sequential file would only contain data when pDEBUG=1.
All changes to code made for debugging (including peeks, extra stages and extra parameters) must be removed prior to final unit test. Final unit testing must occur on the exact version of code that is to be promoted to Integration Test.
In processing hotspots (parts of a job which could potentially be an area of concern) it is advisable that peeks be replaced by COPY stages before promoting the jobs to Integration Test (instead of complete removal of the stage). Removing and re-inserting peeks can often become quite a tedious task. The COPY stage is a no-op (non operator) stage.
This means that there isn’t a processing cost to having a Copy stage in a job design. While the job may appear overly complex, this will not impact the processing times of the job.

14. COMMON ISSUES AND TIPS
Common issues faced in the project during development and testing are described in this section. Finally there is a tips section to assist developers while coding.

14.1 1-way / n-way
Scaling from 1-way to n-way processing is the method employed within DataStage to take advantage of parallelism. This improves performance and should not affect or change the function of the code.
In order to ensure trouble free scaling, jobs are built 1-way and unit tested 1-way and n-way. This ensures that there has been no functional impact in making the switch to parallel processing. Jobs will run n-way when live in order to achieve the benefits of parallel processing provided by DataStage Enterprise.
Problems to do with scaling usually become evident when comparing record counts between 1-way and n-way runs. Clearly, these counts and the physical records involved should be the same. If there is a difference, the reasons for this must be examined and corrected. There are many possible reasons for variations in record counts, for instance:
• One of the most common reasons is the use of the Join, Lookup and Merge stages (and others). In these situations care must be taken to ensure that incoming data streams are not only sorted but also partitioned the same way. If not, join conditions may not be met because records with keys that would match 1-way end up in different partitions and therefore go unmatched.
In these situations, records may be unnecessarily rejected (either down a reject link or omitted altogether) and will therefore not flow down the main output link to subsequent stages or into an output dataset, hence causing a variation between the actual rows processed and the anticipated number
• A dataset used as input on the lookup link to a Lookup stage must be partitioned as Entire, so that the entire dataset is available for lookup across all partitions of the main input link; otherwise the lookup may fail simply because the dataset was partitioned incorrectly for the lookup
• An incoming dataset may have been created by another job or module, which may also have been written by another developer. In this case it might contain the required data, but may not be correctly partitioned for the needs of your job. Therefore good practice, unless you can be absolutely sure that the datasets you are using are partitioned correctly for your needs, is to repartition at the start of a job. This might be less efficient, but is more effective in terms of retaining control over your jobs and the quality of the output data flows.
Where possible, partitioning will be considered within the overall solution design, therefore minimising the need for repartitioning.
Configuration files are provided for 1-way and 4-way running on the Development server, with 1-way being the default. 4-way processing is specified at job level as an override. The developer must ensure that overrides are removed from their jobs prior to promotion to the Test server.

14.2 Duplicate Keys
Often an output table, flat file or internal dataset will contain duplicate keys. Duplicates will often be identified when the output data (from a DataStage job), perhaps in the form of a flat file, is loaded into a target database table.
This load process will most likely fail if there are duplicate keys in the data, particularly if the target table is uniquely keyed. Another sign that there may be duplicates in the data is when the output of a job or stage (within a job) has more rows in the output stream than would have been thought possible from the inputs. For these reasons, care must be taken at the unit test stage and it is always a good idea to have a general understanding of the anticipated throughput of a job before starting the build.
The key to solving problems related to duplicates is to understand how duplicates are generated. Here are some examples:
• An incoming data stream, i.e. a data source or internal dataset (for instance the source system itself, or the output from another job or module). In this case the problem is inherited and a more extensive search may be required in order to find it. If the problem lies with the source system, then this may need to be raised as a data quality issue and corrected at source
• A 1-way/n-way issue. Scaling from 1-way to n-way processing will often cause problems. Essentially this is because when running with a single node, all data flows through a single partition (where processing rules apply to all the data), usually giving correct results. Running with multiple nodes means that partitioning comes into play and therefore issues arise from applying processing rules across multiple partitions. This effect may be desirable; however, in many cases it can also lead to incorrect results. For instance, if a job is generating a unique key column, the same key may be generated across all partitions and therefore duplicated when the data is collected for output. A sign that this is the case is if the final record count is a multiple of the number of nodes compared to the single node record count.
To avoid this kind of issue, a stage can be forced to run sequentially (though this may become a bottleneck) or alternatively, particularly when defining keys, the partition number can be built into the algorithm for generating the key, therefore ensuring uniqueness across partitions
• A Cartesian join.

14.3 Resource usage Vs Performance
This section concentrates on issues found not only during development but also during wider Integration, E2E and Performance test stages, particularly discussing the balance that must be achieved between the resources available on the server where the DataStage jobs run and the performance of those jobs.
DataStage Enterprise (DataStage) starts one Unix process per node per stage (nodes are defined in the configuration file and can be thought of as logical processors). The effective use of available processors, and to an extent the total memory usage, is therefore determined by the operating system rather than DataStage, though generally the more resource (processors and memory) the better. Clearly, this can lead to an explosion of running processes, and eventually the operating system spends more time managing processes than executing code, with a detrimental effect on performance. The key is to run a number of performance tests to determine the optimum number of nodes. A starting point will usually be around 50% of actual CPUs.
Within DataStage, the optimum use of parallel (partitioned and piped) data streams is clearly essential, as is the appropriate use of stages within jobs and the elimination of unnecessary repartitioning and sorting.
As a general rule of thumb, incoming data streams should be partitioned and sorted as far upstream as possible and that partitioning and sort order maintained for as long as possible. Partitioning and sorting will take considerable amounts of time during job execution, so where possible these activities should be minimized.
The sort order of the data within a partition in a data stream will be maintained throughout a job, even when the stream is used as an input link to sort-dependent stages such as Dedupe and Join. It is always tempting to sort on the input links of these stages; however, this is completely unnecessary (provided the data is already in the correct order) and time consuming. Similarly, it is also tempting to repartition on the input links of stages when specifying Same will suffice (again, provided the data is already correctly partitioned).

Within DataStage, the Transform stage was inherited from the DataStage Server product and is less efficient than other native Parallel stages. The jury is out as far as the use of Transform is concerned, with arguments for and against. For users of DataStage Server it will be familiar and easy to use, read and maintain. The native Modify stage is a good alternative but is not consistent with the user interface implemented for other stages, though Transform also differs slightly. Common sense is the key: too many Transforms will slow your jobs down, and in this situation Modify should be considered for simple type conversions. Using several Transforms in sequence is also undesirable. Quite often they will ‘look’ good but could be combined, thereby reducing the overhead.

Finally, the Lookup stage. This stage differs from Merge and Join in that it requires the whole of the lookup dataset to be held in memory. The upper limit is large, though this needs to be considered in the context of the total memory available and what else will be running at the time. Total memory usage will be hard to estimate and is best left until the runtime batch has been designed and run. Be prepared to increase memory and split jobs if the usage is too great. Likewise, to improve runtimes, be prepared to add further processors to facilitate scaling.
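The memory difference between a lookup and a join on sorted inputs can be pictured with a small Python sketch. It is not DataStage code; the function names are hypothetical, and the join shown assumes both inputs are already sorted on the key and that keys on the reference side are unique.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[int, str]


def lookup_style(stream: Iterable[Row],
                 reference: Iterable[Row]) -> Iterator[Tuple[int, str, Optional[str]]]:
    """Lookup: the whole reference dataset is loaded into memory first."""
    table = dict(reference)  # memory grows with the size of the reference data
    for key, value in stream:
        yield key, value, table.get(key)  # None when the key is not found


def join_style(left: Iterable[Row],
               right: Iterable[Row]) -> Iterator[Tuple[int, str, str]]:
    """Inner join on sorted inputs: only the current reference row is held."""
    right_iter = iter(right)
    current = next(right_iter, None)
    for key, value in left:
        # Advance the reference side until it catches up with the stream key.
        while current is not None and current[0] < key:
            current = next(right_iter, None)
        if current is not None and current[0] == key:
            yield key, value, current[1]


stream = [(1, "a"), (2, "b"), (4, "d")]
reference = [(1, "X"), (3, "Z"), (4, "W")]
print(list(lookup_style(stream, reference)))
print(list(join_style(stream, reference)))
```

The lookup version works on unsorted input but pays for it in memory, while the join version depends on the sort order being established and preserved upstream, which is the trade-off described above.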
14.4 General Tips

General tips to follow while developing code are listed below:

• Common information such as the home directory, system date, username and password should be initialized in a global variable, and that variable should then be referenced everywhere.

• Stage variables allow you to hold data from a previous record while processing the next record, allowing you to compare previous and current records. Stage variables also allow you to return multiple errors for a record. By evaluating all data in a record, rather than erroring on the first exception found, the cleanup of data is more efficient and requires fewer iterations.

• Nulls are a curse when it comes to using functions/routines or normal equality type expressions. E.g. NULL = NULL doesn’t work; neither does concatenation when one of the fields is null. Changing the nulls to 0 or “” before performing operations is recommended to avoid erroneous outcomes.

• Ensure that the job does not look complex. If a job has many stages (more than 10), divide it into two or more jobs on a functional basis.

• Use containers where stages in the jobs can be grouped together.

• Use Annotations for describing steps done at stages. Use a Description Annotation as the job title, as the Description Annotation also appears in Job properties > Short Job Description and in the Job Report when generated.

• When using string functions on a decimal, always use the Trim function, because string functions see the extra space reserved for the sign in a decimal.

• When you need to get a substring (e.g. the first 2 characters from the left) of a character field, use <Field Name>[1,2]. Similarly, for a decimal field, use Trim(<Field Name>)[1,2].

• Always use Hash partitioning in Join and Aggregator stages. The hash key should be the same as the key used to join/aggregate.
• If Join/Aggregator stages do not produce the desired results, try running in sequential mode and verify the results (if they are still incorrect, the problem is with the data or logic), then run in parallel using Hash partitioning.

• Use the Column Generator stage to create sequence numbers or to add columns with hard-coded values.

• In Job sequences, always use the “Reset if required, then run” option in Job Activity stages. (Note: This is not the default option.)

• When mapping a decimal field to a char field or vice versa, it is always better to convert the value using the ‘Type Conversion’ functions “DecimalToString” or “StringToDecimal” as applicable.

• The “Clean-up on failure” property in sequential files must be enabled (it is enabled by default).

15. REPOSITORY STRUCTURE

The DataStage repository is the resource available to developers that helps organise the components they are developing or using within their development. It consists of metadata (i.e. table definitions), the jobs themselves, and specific routines and shared containers. The anticipated repository structure is described in the following sections. However, the structure may change during development, usually evolving into its most usable form.

15.1 Job Categories

The jobs can be categorised by developer and by FD. The following jobs will be created:

• Import Jobs: Import jobs will be the starting point for transformation. Sanity checks on the file and validation of external properties (e.g. size) will be done here. The source file will be read into memory datasets as per the source record layout. An exception log will be created with records that do not follow the file layout. Source data will then be filtered to select the records to process, and unprocessed data will be maintained in a dataset for future reference. Finally, one or more datasets will be created which will be the input to the actual transform process.
• Transform Jobs: Datasets created by import jobs will be processed by the actual transform job. The transform will join two or more datasets and look up data as per the given functionality. Finally, the records will be split as per destination file and a destination dataset will be created. All data errors will be captured in an exception log for future reference.

• Unload Jobs: Unload jobs will take transform datasets as a source and create the final files required by the load team in the given format.

15.2 Table Definition Categories

The files are categorised into:

• Source/Target Flat-files: The source and target files will be included in this category. These files will be converted into datasets by DataStage jobs and then, after the transformation process is complete, converted back to target flat files.

• Datasets: Datasets are used as intermediate storage for the various processes. A Dataset can store data being operated on in a persistent form, which can then be used by other DataStage jobs. Datasets can either be Sequential or Parallel. These Datasets will be created from the external data by the ‘Import’ job, and will also be created whenever intermediate datasets are needed for further processing by one or more jobs.

15.3 Routines

Before and after routines (should they be needed) will be described here.

15.4 Shared Containers

Shared containers (as described above) will be described here. It is anticipated that there will be a small number of these and therefore no further categorisation is anticipated.

16. COMMON COMPONENTS USED IN DUMMY

16.1 jbt_sc_join

jbt_sc_join is a common component built to meet a specific requirement in the Dummy project: capturing 3 types of records from a Join stage, whereas DataStage offers just 2 outputs from a Join stage. For example, take file A (master) and file B (child).
The Join stage of DataStage will give 2 outputs in this case:

• A + B (Join records)
• A not in B (Reject records)

The common component jbt_sc_join will give 3 outputs in this case:

• A + B (Join records)
• A not in B (Reject records)
• B not in A (Non Join records)

This functionality is illustrated in the flow diagram below:

[Flow diagram: File ‘A’ (Master) and File ‘B’ (Child) feed the join jn_A_B, which outputs A + B on link lnk_A_B_jn, A not in B on link lnk_A_B_rej, and B not in A on link lnk_A_B_njn.]

16.2 jbt_sc_srt_cd_lkp

Sort Code lookup is a functionality which is required in many places (in various FDs in Dummy), so a common component with this functionality has been built. It takes a file as input and divides it into 2 files, one for north and one for south.

[Flow diagram: File ‘A’ on link lnk_A feeds sc_srt_cd_lkp, which outputs the ‘A’ - North File on link lnk_A_north and the ‘A’ - South File on link lnk_A_south.]

16.3 jbt_env_var

This is a template job with commonly used environment variables imported. It can be used for all jobs being developed with this set of common environment variables, rather than importing them again and again. These environment variables are as shown below:

$ADTFILEDIR: This folder will contain the Audit file and reconciliation reports.
$BASEDIR: This folder is the base directory.
$DSEESCHEMADIR: DSEE schemas that are used by EE jobs using RCP/schema files.
$ITERATION: Current iteration number.
$JOBLOGDIR: This folder will contain all the error log files generated in DataStage jobs.
$PARMFILEDIR: This folder will contain parameter files that are looked up by jobs/routines triggered from a common parameter file. These parameter values will be set as per the development environment.
$REJFILEDIR: This folder will contain all the reject files generated in DataStage jobs.
$SCRIPTDIR: This folder will contain routine UNIX scripts used for processing files, copying, taking file backups etc.
$SRCDATASET: All the input files will be partitioned and imported into DataStage datasets. This folder will store all the input datasets.
$SRCFILEDIR: This folder will contain all the input files from the Extract team. All files will be manually copied into this folder.
$SRCFORMATDIR: This folder will contain the copybook formats for input source files. These copybook formats are as per functional specifications.
$TMPDATASET: This folder will be used to store all the intermediate files created during the transform job.
$TRGDATASET: This folder will be used for storing output DataStage dataset files.
$TRGFILEDIR: This folder will contain all the transformed output files which can be loaded to Bank B’s mainframe.
$TRGFORMATDIR: This folder will contain the copybook formats for output files.

16.4 jbt_annotation

This is a template job where annotations are used for describing steps done at stages. A Description Annotation is also used as the job title, as the Description Annotation also appears in Job properties > Short Job Description and in the Job Report when generated.

16.5 Job Log Snapshot

JobLogSnapShot.ksh is a script which will create the log file (as seen in DataStage Director) of a job's latest run. The following parameters need to be hard coded in the script as per environment:

DSHOME=/wload/dqad/app/Ascential/DataStage/DSEngine
PROJDIR=/wload/dqad/app/Ascential/DataStage/Projects/Dummy_dev
LOGDIR=/wload/dqad/app/data/Dummy_dev/itr01/errfile

DSHOME is the DataStage home path. PROJDIR is the project directory in which the job exists. LOGDIR is a common directory where the log file will be created.

The script will be called from the after-job subroutine of a job:

ksh /wload/dqad/app/data/Dummy_dev/com/script/JobLogSnapShot.ksh $1

$1 is the input parameter: the name of the job whose latest job log is required.

The job log file will be created in:

/wload/dqad/app/data/Dummy_dev/itr01/errfile/<Job name>_log_<time stamp>.txt

[Figure: sample job log]
16.6 Reconciliation Report

Reconcilation.ksh is a script which will create the Reconciliation Report of the respective functional area (FD). The script will be called from an Execute Command stage of a Job Sequence:

ksh /wload/dqad/app/data/Dummy_dev/com/script/Reconcilation.ksh $1 $2

$1 is the 1st input parameter: FD##
$2 is the 2nd input parameter: .ini file name (not path)

Specifications of the .ini file:

Path: /wload/dqad/app/data/Dummy_dev/com/parmfile

Each line of the .ini file will contain the following fields, separated by the | sign:

• The type of the file, i.e. Input, Output, Reject or Non-Join. Example: INP or OUT or REJ or NJN. Note: the entries should be in sorted order. Also, the input files will be datasets, the output files will be EBCDIC files, and the reject and non-join files will be in ASCII format.
• The name of the file whose report is to be prepared.
• The description of the file whose report is to be prepared.
• The record length of the file (this is needed only for the output EBCDIC file).

Sample .ini file:

INP|fd01_customer_pointer_file|Customer Pointer dataset created from source file
INP|fd01_customer_data_file|Customer Data dataset created from source file
OUT|fd01_redirection_file|Output redirection file|117
REJ|fd01_duplicates_file|Reject file containing duplicated account numbers
NJN|fd01_account_nonjoin|Nonjoin files from the join stage in job1

The Reconciliation report will be created in:

/wload/dqad/app/data/Dummy_dev/itr01/adtfile/<FD##>_recon_<time stamp>.txt

[Figure: sample Reconciliation report]

16.7 Script template

All scripts are made according to this template script. It has a script description and a section for maintaining the modification history of the script. The script name is /wload/dqad/app/data/Dummy_dev/com/script/ScriptTemplate.ksh

16.8 Split File

SplitFile.ksh is a script which will split the input file into header, detail and trailer files.
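As a rough illustration of a header/detail/trailer split, the sketch below uses Python rather than ksh. The script's actual logic is not shown in this document, so the sketch rests on stated assumptions: records are fixed length with no delimiters, the first record is the header and the last record is the trailer. The function name split_records is hypothetical.

```python
from typing import List, Tuple


def split_records(data: bytes, record_length: int) -> Tuple[bytes, List[bytes], bytes]:
    """Split fixed-length records into (header, detail records, trailer).

    Assumes the first record is the header and the last one is the trailer.
    """
    if record_length <= 0 or len(data) % record_length != 0:
        raise ValueError("data is not a whole number of records")
    records = [data[i:i + record_length]
               for i in range(0, len(data), record_length)]
    if len(records) < 2:
        raise ValueError("need at least a header and a trailer record")
    return records[0], records[1:-1], records[-1]


# Hypothetical 4-byte records: one header, two details, one trailer.
sample = b"HDR0" + b"AAAA" + b"BBBB" + b"TRL0"
header, details, trailer = split_records(sample, record_length=4)
print(header, details, trailer)
```

Rejecting input that is not a whole number of records up front mirrors the kind of sanity check expected of import processing, so a truncated file is caught before any output files are written.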
The script will be called from an Execute Command stage of a Job Sequence (Import sequence):

ksh /wload/dqad/app/data/Dummy_dev/com/script/SplitFile.ksh $1 $2

$1 is the 1st input parameter: <Input file name without extension>
$2 is the 2nd input parameter: <Record length>

This requires the file name to have a .dat extension. The header, detail and trailer files created will be $1_hdr.dat, $1_det.dat and $1_trl.dat respectively.

The input file will be /wload/dqad/app/data/Dummy_dev/itr01/opfile/$1.dat

All these files ($1_hdr.dat, $1_det.dat and $1_trl.dat) will be output in /wload/dqad/app/data/Dummy_dev/itr01/opfile/.

16.9 Make File

Make_File.ksh is a script which will merge the header, detail and trailer records to create the target file. The script will be called from an Execute Command stage of a Job Sequence (Unload sequence):

ksh /wload/dqad/app/data/Dummy_dev/com/script/Make_File.ksh $1

$1 is the 1st input parameter: <Target file name without extension>

This requires the header, detail and trailer files to be named $1_hdr.dat, $1_dtl.dat and $1_trl.dat respectively. All these files ($1_hdr.dat, $1_dtl.dat and $1_trl.dat) will have to be present in /wload/dqad/app/data/Dummy_dev/itr01/opfile/.

The output file will be /wload/dqad/app/data/Dummy_dev/itr01/opfile/$1.dat

16.10 jbt_import

This template job processes the Header, Detail and Trailer records created by the SplitFile.ksh script described in 16.8. The header and trailer data is validated.

The validations done on header are:

• The file header identifier must contain the value ‘HDR-TDAACCT’
• The file header date must equal the T-14 migration date

The validations done on trailer are:

• The file trailer file identifier must contain the value ‘TRL-TDAACCT’
• The file trailer creation date must equal the file header creation date
• The file trailer record count must equal the total number of records on the input file, including the header and trailer records.
• The file trailer record amount must equal the sum of the Closing Balance field from every record on the input file, excluding the header and trailer records. The accumulation of the Closing Balance field must be performed using an integer data format, allowing for overflow.

If any of the above checks fail, then processing should be immediately aborted with a relevant fatal error message. This is implemented using subroutine AbortOnCall.

Note: These header/trailer validations are for FD01. They will vary slightly for other FDs, but the common approach shown in the template can still be taken.

The detail records are written to a dataset to be processed in the transform job.

16.11 jst_import

This template job sequence calls the following components:

• SplitFile.ksh as described in 16.8
• jbt_import as described in 16.10

This sequence template will split the source file into 3 different files (Header, Detail and Trailer) and call the import job, which will do the necessary validation and create a detail dataset.

16.12 jbt_unload

This template job illustrates the creation of header and trailer records. The trailer consists of a record count and a hash count. This template is mainly for the following logic:

• Total number of records on file (excluding header & trailer)
• Hash of account numbers from all detail records on file

16.13 jst_unload

This template job sequence calls the following components:

• jbt_unload as described in 16.12
• Make_File.ksh as described in 16.9
• Reconciliation report as described in 16.6

This sequence template will create 3 different files (Header, Detail and Trailer) and call the script which will combine these 3 files to create the target file. The Reconciliation report is also created.

16.14 jbt_abort_threshold

The Abort Threshold template will abort a job based on a threshold value passed as a job parameter.
It uses a common routine called “AbortOnThreshold”. This routine has to be called from a BASIC Transformer:

AbortOnThreshold (@INROWNUM, <Threshold Value>, DSJ.ME)

Here <Threshold Value> is the job parameter. For example, if you give the Threshold Value as 5, the job will abort after 4 records pass through the BASIC Transformer. This is used in places where a job needs to be aborted on a particular number of reject records.
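The threshold behaviour in the example above (threshold 5, abort after 4 records have passed) can be mimicked with a short Python sketch. This is not the AbortOnThreshold routine itself, whose source is not shown here. The comparison used (abort once the 1-based row number reaches the threshold) is an assumption chosen to match that example, and the names JobAborted, abort_on_threshold and run are hypothetical.

```python
from typing import Iterable, List, Tuple


class JobAborted(Exception):
    """Stands in for the job abort raised by the routine."""


def abort_on_threshold(row_number: int, threshold: int, job_name: str) -> None:
    # Assumed rule: abort as soon as the row number reaches the threshold.
    if row_number >= threshold:
        raise JobAborted(f"{job_name}: threshold {threshold} reached at row {row_number}")


def run(rows: Iterable[int], threshold: int, job_name: str = "demo_job") -> Tuple[List[int], bool]:
    """Return the rows that passed and whether the run was aborted."""
    passed: List[int] = []
    try:
        # enumerate(..., start=1) mirrors a 1-based input row counter.
        for row_number, row in enumerate(rows, start=1):
            abort_on_threshold(row_number, threshold, job_name)
            passed.append(row)
    except JobAborted:
        return passed, True
    return passed, False


print(run(range(10), threshold=5))
print(run(range(3), threshold=5))
```

With a threshold of 5, the fifth row triggers the abort, so exactly four rows get through; a reject link carrying fewer rows than the threshold completes normally.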
