• Data resides at the server, since the Repository is at the server.

DataStage supports three types of job:
• Server jobs. These are both developed and compiled using DataStage client tools. Compilation of a server job creates an executable that is scheduled and run from the DataStage Director.
• Parallel jobs. These are compiled and run on the DataStage server in a similar way to server jobs, but support parallel processing on SMP, MPP, and cluster systems.
• Mainframe jobs. These are developed using the same DataStage client tools as for server jobs, but compilation and execution occur on a mainframe computer. The DataStage Designer generates a COBOL source file and supporting JCL script, then lets you upload them to the target mainframe computer. The job is compiled and run on the mainframe computer under the control of native mainframe software.

Job Sequences: A job sequence allows you to specify a sequence of DataStage jobs to be executed, and actions to take depending on the results.

The DataStage Director is the client component that validates, runs, schedules, and monitors jobs run by the DataStage Server.

What is a Job Batch?
A job batch is a group of jobs, or separate invocations of the same job (with different job parameters), that you want to run sequentially. DataStage treats a batch as though it were a single job: if any job fails to complete successfully, the batch run stops. You can follow the progress of jobs within a batch by examining the log files of each job or of the batch itself; these contain messages about the progress of the batch as well as of the individual jobs. You can create, schedule, edit, or delete job batches from the DataStage Director. (A job-control sketch of this behaviour appears at the end of this section.)

Server Engine: The server software that holds, cleanses, and transforms the data during a DataStage job run.

Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds.

Sequential Stage: The stage executes in parallel mode if reading multiple files, but executes sequentially if it is reading only one file.

• OSH. This is the scripting language used internally by the DataStage Enterprise Edition engine.
• Operators. These underlie the stages in a DataStage job. A single stage may correspond to a single operator, or to a number of operators, depending on the properties you have set and whether you have chosen to partition, collect, or sort data on the input link to a stage. At compilation, DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.
• Players. Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).

Design for Good Performance:
1. Use Transformer stages sparingly and wisely.
2. Remove unneeded columns – unused columns require additional buffer memory.
3. Increase sort performance where possible.
4. Avoid unnecessary type conversions.
5. Avoid reading from sequential files using the Same partitioning method.

• For type conversion, Modify or Transformer stages can be used, Modify being the preferred stage for conversion.

Combining data can be done using:
1. The Lookup stage
2. The Join stage

Lookup Stage: Preferred when one unsorted input is very large or sorting is not feasible. The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is small enough to fit into available physical memory. It requires all but the first input (the primary input) to fit into physical memory, and each lookup reference requires a contiguous block of physical memory. If performance issues arise while using Lookup, consider using the Join stage.

Join Stage: Preferred when all inputs are of manageable size or are pre-sorted. The Join stage must be used if the data sets are larger than available memory resources.
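The batch behaviour above (run jobs one after another, stop on the first failure) can also be reproduced with the DataStage BASIC job-control calls listed later in these notes. The following is a minimal sketch, assuming two existing jobs with the hypothetical names LoadCustomers and LoadOrders and a hypothetical RunDate parameter:

    * Attach and run the first job; abort the whole batch if it fails.
    hJob1 = DSAttachJob("LoadCustomers", DSJ.ERRFATAL)
    ErrCode = DSSetParam(hJob1, "RunDate", "2005-01-31")
    ErrCode = DSRunJob(hJob1, DSJ.RUNNORMAL)
    ErrCode = DSWaitForJob(hJob1)
    Status = DSGetJobInfo(hJob1, DSJ.JOBSTATUS)
    If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
       * Logging a fatal message aborts the controlling job, and with it the batch.
       Call DSLogFatal("LoadCustomers failed - stopping batch", "BatchControl")
    End
    * The first job finished cleanly, so run the second one the same way.
    hJob2 = DSAttachJob("LoadOrders", DSJ.ERRFATAL)
    ErrCode = DSRunJob(hJob2, DSJ.RUNNORMAL)
    ErrCode = DSWaitForJob(hJob2)

The same sequence could equally be driven from outside the engine with the dsjob command line interface described later.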
• When converting from variable-length to fixed-length strings using default conversions, parallel jobs pad the remaining length with NULL (ASCII zero) characters.

Link Buffering: DataStage automatically performs buffering on the links of certain stages. This is primarily intended to prevent deadlock situations arising (where one stage is unable to read its input because a previous stage in the job is blocked from writing to its output). Deadlock situations can occur where you have a fork-join in your job: this is where a stage has two output links whose data paths are joined together later in the job. The situation can arise where all the stages in the flow are waiting for each other to read or write, so none of them can proceed. No error or warning message is output for deadlock; your job will simply be in a state where it waits forever for an input.

User Defined Parallel Stages:
1. Custom: Enables you to include an Orchestrate operator in a DataStage stage, which you can then include in a DataStage job.
2. Build: Enables you to provide a custom operator that can be executed from a DataStage Parallel job stage.
3. Wrapped: Enables you to specify a UNIX command to be executed by a DataStage stage. You define a wrapper file that handles arguments for the UNIX command and its inputs and outputs. The DataStage Manager provides an interface that helps you define the wrapper.
In each case the stage is automatically added to the job palette and is available to all jobs in the project in which it was defined. You can make it available to other projects using the DataStage Manager Export/Import facilities.

Types of Environment Variable:
1. Buffering
2. Building Custom Stages
3. Compiler
4. DB2 Support
5. Debugging
6. Decimal Support
7. Disk I/O
8. General Job Administration
9. Job Monitoring
10. Network
11. NLS

• DataStage provides a range of methods that enable you to run DataStage server or parallel jobs directly on the server, without using the DataStage Director. The methods are:
1. C/C++ API (the DataStage development kit)
2. DataStage BASIC calls
3. Command line interface commands (CLI)
4. DataStage macros

• Starting a Job
You can start, stop, validate, and reset jobs using the -run option:

    dsjob -run [-mode [NORMAL | RESET | VALIDATE]] [-param name=value]
          [-warn n] [-rows n] [-wait] [-stop] [-jobstatus] [-userstatus]
          [-local] [-opmetadata [TRUE | FALSE]] [-disableprjhandler]
          [-disablejobhandler] [useid] project job|job_id

• Stopping a Job
You can stop a job using the -stop option:

    dsjob -stop [useid] project job|job_id
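As a concrete illustration of the syntax above, assuming a project named dstage1 and a job named LoadCustomers (both names hypothetical), the following runs the job with one parameter set; the -jobstatus option waits for completion and returns an exit code derived from the job's finishing status:

    dsjob -run -mode NORMAL -param RunDate=2005-01-31 -jobstatus dstage1 LoadCustomers

The same job can then be stopped mid-run with:

    dsjob -stop dstage1 LoadCustomers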
• Listing Projects
The following syntax displays a list of all known projects on the server:

    dsjob -lprojects

• Listing Jobs
The following syntax displays a list of all jobs in the specified project:

    dsjob -ljobs project

• Listing Stages
The following syntax displays a list of all stages in a job:

    dsjob -lstages [useid] project job|job_id

• Listing Links
The following syntax displays a list of all the links to or from a stage:

    dsjob -llinks [useid] project job|job_id stage

• Listing Parameters
The following syntax displays a list of all the parameters in a job and their values:

    dsjob -lparams [useid] project job|job_id

• Listing Invocations
The following syntax displays a list of the invocations of a job:

    dsjob -linvocations

• What does the Configuration file do?
DataStage learns about the shape and size of the system from the configuration file. The configuration file describes every processing node that DataStage will use to run your application. When you run a DataStage job, DataStage first reads the configuration file to determine the available system resources, then organizes the resources needed for the job according to what is defined there. You can define and edit the configuration file using the DataStage Manager; this is described in the DataStage Manager Guide, which also gives detailed information on how you might set up the file for different systems.

The configuration file describes available processing power in terms of processing nodes. These may, or may not, correspond to the actual number of processors in your system. You may, for example, want to always leave a couple of processors free to deal with other activities on your system. The number of nodes you define in the configuration file determines how many instances of a process will be produced when you compile a parallel job.

Every MPP, cluster, or SMP environment has characteristics that define the system overall as well as the individual processors. These characteristics include node names, disk storage locations, and other distinguishing attributes. For example, certain processors might have a direct connection to a mainframe for performing high-speed data transfers, while others have access to a tape drive, and still others are dedicated to running an RDBMS application.

You can use the configuration file to set up node pools and resource pools. A pool defines a group of related nodes or resources, and when you design a DataStage job you can specify that execution be confined to a particular pool.

When you modify your system by adding or removing processing nodes or by reconfiguring nodes, you do not need to alter or even recompile your DataStage job; just edit the configuration file. When your system changes, you change the file, not the jobs. The configuration file also gives you control over parallelization of your job during the development cycle: for example, by editing the configuration file you can first run your job on a single processing node, then on two nodes, then four, then eight, and so on. The configuration file thus lets you measure system performance and scalability without actually modifying your job. (A minimal sample file follows.)
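For illustration, here is a minimal sketch of a two-node configuration file in the resource-description syntax used by the parallel engine; the node names, the host name etlhost, and the disk paths are all hypothetical:

    {
        node "node1"
        {
            fastname "etlhost"
            pools ""
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/data/scratch" {pools ""}
        }
        node "node2"
        {
            fastname "etlhost"
            pools ""
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/data/scratch" {pools ""}
        }
    }

With two nodes defined, a compiled parallel job would normally run two players per operator (one per node), plus one section leader per node and a single conductor process; adding a third node to the file, with no change to the job itself, would scale this to three.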
• DataStage is also flexible about meta data. It can cope with the situation where meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP).

• DataStage Standard Edition was previously called DataStage and DataStage Server Edition.
• DataStage Enterprise Edition was originally called Orchestrate, then renamed to Parallel Extender when purchased by Ascential. Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on the UNIX platform. DataStage used Orchestrate with DataStage XE (the beta version of 6.0) to incorporate the parallel processing capabilities; Ascential then purchased Orchestrate, integrated it with DataStage XE, and released a new version, DataStage 6.0, i.e. Parallel Extender. The Enterprise edition offers parallel processing features for scalable high-volume solutions. Designed originally for UNIX, it now supports Windows, Linux, and UNIX System Services on mainframes.
• DataStage Enterprise: server jobs, parallel jobs, and sequence jobs.
• DataStage Enterprise MVS: server jobs, parallel jobs, sequence jobs, and MVS jobs. MVS jobs are designed using an alternative set of stages that are generated into COBOL/JCL code; jobs are developed on a UNIX or Windows server and transferred to the mainframe to be compiled and run.

The first two editions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Server jobs only accept server stages; MVS jobs only accept MVS stages; parallel jobs have parallel stages but also accept some server stages via a container. There are some stages that are common to all types (such as aggregation), but they tend to have different fields and options within the stage. Row Merger and Row Splitter are only present as parallel stages.

Questions to review: 1. How do you take backups in DataStage? 2. What is a data set? 3. How is a data set different from a sequential file?

Ab Initio Productivity Features vs. Informatica and DataStage:
1. Parallelism: Ab Initio supports three types of parallelism (data, pipeline, and component); Informatica supports only one type (pipeline). DataStage supports two types of parallelism through node configuration: pipeline and partitioning (some partitions in server).
2. Scheduling: Ab Initio doesn't have a scheduler; you need to schedule through a script or run jobs manually. In Informatica, jobs can be scheduled in the project properties, and to run a job it needs a user name and password.
3. Change Data Capture: Ab Initio has to rely on the database to provide CDC capabilities; as of now it doesn't have a way to sniff the DB logs (extract only the changed data through the DB logs). Informatica has built-in Change Data Capture capabilities.
4. Architecture: Ab Initio is a code-based ETL tool; it generates ksh code, which can be modified to achieve goals that cannot be taken care of through the tool itself. Informatica is an engine-based ETL tool; the power of the tool is in its transformation engine, and the code it generates after any development cannot be seen or modified.
5. Initial ramp-up time with Ab Initio is quick compared to Informatica.
6. Standardization and tuning: probably both Ab Initio and Informatica fall into the same bucket.
7. Ab Initio doesn't need a dedicated administrator; a UNIX or NT admin will suffice.
8. Delimiters: with Ab Initio you can read data with multiple delimiters in a given record; Informatica forces you to have all the fields delimited by one standard delimiter.
9. Ab Initio is more user friendly than Informatica.
10. Ab Initio supports different types of text files, meaning you can read the same file with different structures; this is not possible in Informatica.
11. When making a choice there are a lot of factors which drive the decision, not just budget: existing infrastructure, project time line, resources, complexity of transforms, data volumes, metadata management, integration with third-party tools, tool support, etc.
12. Robust transformation language: Ab Initio's is much more robust. Informatica is very basic as far as transformations go; it requires that you code custom transformations in C++ (or VB if you are on a Windows platform).
13. User-defined functions: Ab Initio allows for custom components.
14. Instant feedback: on execution, Ab Initio tells you how many records have been processed/rejected/etc., with detailed performance metrics for each component. Informatica has a debug mode, but it is slow and difficult to adapt to.
15. Consolidated interface: Ab Initio has one tool; it takes 3 tools to develop, test, and debug one Informatica 'mapping'.
16. Repository: Ab Initio has a supplemental repository; Informatica is repository-centric and XML based.
17. Error handling: in Ab Initio you can attach error and reject files to each transformation and capture and analyze the message and data separately. Informatica has one huge log – very inefficient when working on a large process, with numerous points of failure.
18. Informatica does support workflows and scheduling.
19. Top-down approach.
20. Scalability higher than DataStage.
21. Informatica PowerCenter 2.5x faster than DataStage 7.5.
22. Informatica was 6 times more productive (it took 6 DataStage mappings compared to 1 Informatica mapping).

I-Descriptors:
• Define virtual fields (read-only).
• An expression defines the virtual field:
  – Very similar to a BASIC expression
  – The expression is evaluated when the virtual field is accessed
  – Must be compiled
  Example: Bonus = min_lvl * 10
• Uses:
  – Boost performance
  – Create a standard interface
• Create from the DataStage command line.
• Use the REVISE utility to add a record specifying the field to the hashed file dictionary.
• Import the meta data as a UniVerse table.
• Access using the UV stage.

Why create distributed files?
• Physically partition data
• Overcome size limitations on single files
• Support faster lookups
• Bring disparate data together

• Use stage variables to store values and accumulations between reads (a sketch appears at the end of these notes).
• Merge sequential data using the Merge plug-in (Merge stage).
• When there are no line terminators, use DataStage BASIC.
• System variables: @Date, @Time, @True, @False, @Who, @Null.
• Use Iconv to convert a string date to an internal integer (day number); use Oconv to convert a date number to a string in a specified format (a worked example appears at the end of these notes).
• The Job Sequencer gives a facility to handle job exceptions. There can be only one exception handler per job sequence; the Exception activity has no input links, but can be linked to outputs such as a Notification Activity.
• A Notification Activity sends an email when invoked.
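To give the stage-variable bullet above a concrete shape, here is a minimal sketch of Transformer stage-variable derivations that keep a running total per key across consecutive rows; the link and column names (InLink.CustId, InLink.Amount) and the variable names are hypothetical:

    svIsNewKey:     If InLink.CustId <> svLastKey Then 1 Else 0
    svRunningTotal: If svIsNewKey Then InLink.Amount Else svRunningTotal + InLink.Amount
    svLastKey:      InLink.CustId

Stage variables are evaluated top to bottom for each row, which is why svLastKey is assigned last: svIsNewKey must compare the current key against the previous row's value before it is overwritten. svLastKey should be given an initial value that cannot occur in the data.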
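And a worked instance of the Iconv/Oconv pairing above, using standard date conversion codes (the literal date is arbitrary):

    DayNum = Iconv("2005-01-31", "D-YMD[4,2,2]")  ;* string date to internal day number
    DateStr = Oconv(DayNum, "D/MDY[2,2,4]")       ;* back to a string: "01/31/2005"

The internal day number is what makes date arithmetic simple: it is just an integer, so DayNum + 7 is the same date one week later.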