Datastage Interview Questions


What is the flow of loading data into fact and dimension tables?
Fact table - a table with a collection of foreign keys corresponding to the primary keys in the dimension tables; it consists of fields with numeric values.
Dimension table - a table with a unique primary key.
Load - data should first be loaded into the dimension tables. Based on the primary key values in the dimension tables, the data is then loaded into the fact table.

What is the default cache size? How do you change the cache size if needed?
The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab and specifying the cache size there.

What does a config file in Parallel Extender consist of?
A config file consists of the following:
a) The number of processes or nodes.
b) The actual disk storage locations.

What are Modulus and Splitting in a dynamic hashed file?
In a hashed file the size of the file keeps changing at run time. An increase in the size of the file is called "modulus"; a decrease in the size of the file is called "splitting".

What are Stage Variables, Derivations and Constraints?
Stage variable - an intermediate processing variable that retains its value during a read and does not pass the value on to a target column.
Derivation - an expression that specifies the value to be passed on to the target column.
Constraint - a condition that evaluates to true or false and controls the flow of data down a link.

Types of views in DataStage Director?
There are 3 types of views in DataStage Director:
a) Job view - dates of jobs compiled.
b) Log view - status of the job's last run.
c) Status view - warning messages, event messages, program-generated messages.

Types of parallel processing?
Parallel processing is broadly classified into 2 types:
a) SMP - Symmetric Multi-Processing.
b) MPP - Massively Parallel Processing.

Orchestrate vs DataStage Parallel Extender?
Orchestrate was itself an ETL tool with extensive parallel processing capabilities, running on UNIX platforms. DataStage used Orchestrate with DataStage XE (the beta version of 6.0) to incorporate parallel processing capabilities. Ascential then purchased Orchestrate, integrated it with DataStage XE and released a new version, DataStage 6.0, i.e. Parallel Extender.

Importance of the surrogate key in data warehousing?
A surrogate key is the primary key of a dimension table. Its main importance is that it is independent of the underlying database, i.e. the surrogate key is not affected by changes going on in the source database.

How do you run a shell script within the scope of a DataStage job?
By using the "ExecSH" subroutine in the Before/After job properties.

How do you execute a DataStage job from the command line prompt?
Using the "dsjob" command, as follows:
dsjob -run -jobstatus projectname jobname

Functionality of Link Partitioner and Link Collector?
Link Partitioner: splits data into several partitions (data flows) using one of various partitioning methods.
Link Collector: collects the data coming from the partitions, merges it into a single data flow and loads it to the target.

Types of dimensional modeling?
Dimensional modeling is subdivided into:
a) Star schema - simple and much faster; denormalized form.
b) Snowflake schema - complex, with more granularity; more normalized form.
c) Galaxy schema (complex or multi-star schema).
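To make the stage variable / derivation / constraint distinction above concrete, here is a hypothetical Transformer set-up written out as the expressions you would type into the grid; the link and column names are invented for illustration only:

    * Stage variable derivation (evaluated once per row, not written to the target)
    svLineAmount:   InLink.QTY * InLink.UNIT_PRICE

    * Output column derivation (value passed to the target column)
    OutLink.LINE_AMOUNT:   svLineAmount

    * Constraint on the output link (the row flows down the link only when true)
    InLink.REGION = "EU" And svLineAmount > 0

Because stage variables are evaluated before the column derivations, the same expression can be reused by several output columns without repeating it.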
It can be a collection of key values, called a composite primary key. A partition key is just a part of the primary key. There are several partitioning methods, such as Hash, DB2 and Random; when using Hash partitioning we specify the partition key.

Differentiate database data and data warehouse data?
Database (transactional) data is:
a) Detailed or transactional.
b) Both readable and writable.
c) Current.

Containers: usage and types?
A container is a collection of stages used for the purpose of reusability. There are 2 types of containers:
a) Local container: job specific.
b) Shared container: can be used in any job within a project.

Compare and contrast ODBC and Plug-In stages?
ODBC:
a) Poorer performance.
b) Can be used for a variety of databases.
c) Can handle stored procedures.
Plug-In:
a) Good performance.
b) Database specific (only one database).
c) Cannot handle stored procedures.

Dimension modeling types along with their significance
Data modeling is broadly classified into 2 types:
a) E-R diagrams (entity-relationship).
b) Dimensional modeling.

What are the Ascential DataStage products and connectivity options?
Ascential products: Ascential DataStage, Ascential DataStage EE, Ascential DataStage EE MVS, Ascential DataStage TX, Ascential QualityStage, Ascential MetaStage, Ascential RTI, Ascential ProfileStage, Ascential AuditStage, Ascential Commerce Manager and Industry Solutions.
Connectivity: files, RDBMS, real-time, PACKs, EDI and others.

Explain the DataStage architecture?
DataStage contains two groups of components: client components and server components.
Client components:
- DataStage Administrator
- DataStage Manager
- DataStage Designer
- DataStage Director
Server components:
- DataStage Engine
- Metadata Repository
- Package Installer

DataStage Administrator (roles and responsibilities):
Used to create projects; contains a set of project properties. We can set the buffer size (128 MB by default) and increase it, and we can set the environment variables. Under Tunables there are in-process and inter-process row buffering options: in-process reads the data sequentially within a single process, while inter-process passes the data between processes as it arrives. The Administrator is only an interface to the metadata.

DataStage Manager:
We can view and edit the metadata repository, import table definitions, export DataStage components in .xml or .dsx format, create routines and transforms, and compile multiple jobs.

DataStage Designer:
We can create, compile and run jobs, declare stage variables in Transformers, call routines, transforms, macros and functions, and write constraints.

DataStage Director:
We can run, schedule (daily, weekly, monthly or quarterly), monitor and release jobs.

What is the Metadata Repository?
Metadata is data about the data. The repository also contains:
- Query statistics
- ETL statistics
- Business subject areas
- Source information
- Target information
- Source-to-target mapping information

What is the DataStage Engine?
It is the engine process running in the background on the DataStage server that executes the jobs.

What is dimensional modeling?
Dimensional modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and allows high-performance access.

What is a star schema?
A star schema is a denormalized multi-dimensional model. It contains centralized fact tables surrounded by dimension tables.
Dimension table: contains a primary key and descriptive attributes referenced from the fact table.
Fact table: contains foreign keys to the dimension tables, plus measures and aggregates.

What is a surrogate key?
It is a 4-byte integer which replaces the transaction / business / OLTP key in the dimension table. We can store up to 2 billion records.

Why do we need a surrogate key?
It is used for integrating the data and may serve better than the business primary key: index maintenance, table size, key updates, disconnected inserts and partitioning are all simpler with a surrogate key.

What is a snowflake schema?
It is a partially normalized dimensional model in which at least one dimension is represented by two or more hierarchically related tables.

Explain the types of fact tables?
Factless fact: contains only foreign keys to the dimension tables.
Additive fact: measures can be added across any dimension.
Semi-additive fact: measures can be added across some dimensions, e.g. an average.
Non-additive fact: measures cannot be added across any dimension, e.g. percentages or discounts.
Conformed fact: the measures of the two fact tables are defined the same way, so the facts are measured across the dimensions with the same set of measures.

Explain the types of dimension tables?
Conformed dimension: a dimension table connected to more than one fact table, where the granularity defined in the dimension table is common across those fact tables.
Junk dimension: a dimension table which contains only flags.
Degenerate dimension: a line-item-oriented fact table design.
Monster dimension: a dimension that changes rapidly.

What are active and passive stages?
Active stage: models the flow of data and provides mechanisms for combining data streams, aggregating data and converting data from one data type to another. E.g. Transformer, Aggregator, Sort, IPC stage, Row Merger etc.
Passive stage: handles access to databases and files for the extraction or writing of data. E.g. the file stages, UniVerse, UniData, DRS stage etc.

What is ODS?
An Operational Data Store is a staging area where data can be rolled back.

What is a sequencer?
It sets the sequence of execution of server jobs.

What are stage variables?
Stage variables are declaratives in the Transformer stage used to store values. Stage variables are active at run time (their memory is allocated at run time).

What index is created on a data warehouse?
A bitmap index is created in the data warehouse.

What are macros?
They are built from DataStage functions and do not require arguments. A number of macros are provided in the JOBCONTROL.H file to facilitate getting information about the current job, and the links and stages belonging to the current job. These can be used in expressions (for example in Transformer stages), job control routines, filenames and table names, and before/after subroutines:
DSHostName, DSProjectName, DSJobName, DSJobController, DSJobStatus, DSJobStartDate, DSJobStartTime, DSJobStartTimestamp, DSJobWaveNo, DSJobInvocations, DSJobInvocationId, DSStageName, DSStageType, DSStageInRowNum, DSStageVarList, DSStageLastErr, DSLinkName, DSLinkLastErr, DSLinkRowCount
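As a small illustration of how these macros can be used, here is a hypothetical after-job subroutine fragment that writes some job context to the log; the "Audit" message category is an arbitrary label, not anything required by DataStage:

    * Log basic job context using the JOBCONTROL.H macros
    Msg = "Job " : DSJobName : " (invocation " : DSJobInvocationId : ") started " : DSJobStartTimestamp
    Call DSLogInfo(Msg, "Audit")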
What is KeyMgtGetNextValue?
It is a built-in transform that generates sequential numbers. Its input type is a literal string and its output type is string.

What is a container?
A container is a group of stages and links. Containers enable you to simplify and modularize your server job designs by replacing complex areas of the diagram with a single container stage. DataStage provides two types of container:
- Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job's Diagram window.
- Shared containers. These are created separately and are stored in the Repository in the same way that jobs are. There are two types of shared container. You can also use shared containers as a way of incorporating server job functionality into parallel jobs.

What is a function? (Job control - examples of transform functions)
Functions take arguments and return a value.
- BASIC functions: a function performs mathematical or string manipulations on the arguments supplied to it and returns a value. Some functions have 0 arguments; most have 1 or more. Arguments are always in parentheses, separated by commas, as shown in this general syntax: FunctionName (argument, argument)
- DataStage BASIC functions: these can be used in a job control routine, which is defined as part of a job's properties and allows other jobs to be run and controlled from the first job. Some of the functions can also be used for getting status information on the current job; these are useful in active stage expressions and before- and after-stage subroutines.

Use this function ... to do this:
DSAttachJob - specify the job you want to control.
DSSetParam - set parameters for the job you want to control.
DSSetJobLimit - set limits for the job you want to control.
DSRunJob - request that a job is run.
DSWaitForJob - wait for a called job to finish.
DSGetJobInfo - get information about the controlled job or current job.
DSGetJobMetaBag - get information about the meta bag properties associated with the named job.
DSGetProjectInfo - get information about the current project.
DSGetStageInfo - get information about a stage in the controlled job or current job.
DSGetStageLinks - get the names of the links attached to the specified stage.
DSGetStagesOfType - get a list of stages of a particular type in a job.
DSGetStageTypes - get information about the types of stage in a job.
DSGetLinkInfo - get information about a link in a controlled job or current job.
DSGetLinkMetaData - get the metadata details for the specified link.
DSGetIPCStageProps - get the buffer size and timeout value for an IPC or Web Service stage.
DSGetParamInfo - get information about a controlled job's parameters.
DSGetLogEntry - get the log event from the job log.
DSGetLogSummary - get a number of log events on the specified subject from the job log.
DSGetNewestLogId - get the newest log event, of a specified type, from the job log.
DSLogEvent - log an event to the job log of a different job.
DSLogInfo - log an information message in a job's log file.
DSLogWarn - log a warning message in a job's log file.
DSLogFatal - log a fatal error message in a job's log file and abort the job.
DSLogToController - put an info message in the job log of a job controlling the current job.
DSStopJob - stop a controlled job.
DSDetachJob - return a job handle previously obtained from DSAttachJob.
DSCheckRoutine - check whether a BASIC routine is cataloged, either in the VOC as a callable item or in the catalog space.
DSExecute - execute a DOS or DataStage Engine command from a before/after subroutine.
DSMakeJobReport - generate a string describing the complete status of a valid attached job.
DSMakeMsg - insert arguments into a message template.
DSPrepareJob - ensure a job is in the correct state to be run or validated.
DSSendMail - interface to the system send mail facility.
DSSetUserStatus - set a status message for a job to return as a termination message when it finishes.
DSTransformError - log a warning message to a job log file from a transform.
DSTranslateCode - convert a job control status or error code into an explanatory text message.
DSWaitForFile - suspend a job until a named file either exists or does not exist.
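A minimal sketch of how a few of these calls fit together in a job control routine; the job name "LoadCustomerDim" and the parameter "RunDate" are hypothetical, and error handling is reduced to a single status check:

    * Attach, parameterize, run and wait for a controlled job
    hJob = DSAttachJob("LoadCustomerDim", DSJ.ERRFATAL)
    ErrCode = DSSetParam(hJob, "RunDate", "2006-10-14")
    ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
    ErrCode = DSWaitForJob(hJob)
    Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
    If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
       Call DSLogWarn("LoadCustomerDim did not finish cleanly", "JobControl")
    End
    ErrCode = DSDetachJob(hJob)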
What are Routines?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following programming components are classified as routines: transform functions, before/after subroutines, custom UniVerse functions, ActiveX (OLE) functions and Web Service routines.

What is the Hash File stage and what is it used for?
It is used for lookups. It is like a reference table. It is also used in place of ODBC or OCI tables, for better performance.

What are static hash files and dynamic hash files?
As the names suggest, a static hashed file keeps the structure it was created with, while a dynamic hashed file resizes itself as data is added. In general we use Type 30 dynamic hash files. The data file has a default size of 2 GB, and the overflow file is used if the data exceeds the 2 GB size.

What are the types of hashed file?
Hashed files are classified broadly into 2 types:
a) Static - subdivided into 17 types based on the primary key pattern.
b) Dynamic - subdivided into 2 types: i) Generic ii) Specific.
The default hashed file is dynamic, "Type Random 30 D".

How did you handle reject data?
Typically a reject link is defined and the rejected data is loaded back into the data warehouse. Rejected data is typically bad data, such as duplicates of primary keys or null rows where data is expected. A reject link has to be defined for every output link from which you wish to collect rejected data.

What other performance tunings have you done in your last project to increase the performance of slowly running jobs?
- Staged the data coming from ODBC/OCI/DB2UDB stages, or any database on the server, in hashed or sequential files for optimum performance and for data recovery in case a job aborts.
- Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
- Tuned the 'Project Tunables' in Administrator for better performance.
- Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
- Sorted the data as much as possible in the database and reduced the use of DS Sort for better performance of jobs.
- Removed the data not used from the source as early as possible in the job.
- Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
- If an input file has an excessive number of rows and can be split up, used standard logic to run jobs in parallel.
- Before writing a routine or a transform, made sure that the required functionality is not already in one of the standard routines supplied in the sdk or ds utilities categories.
- Used sorted data for the Aggregator.
- Used a SELECT ... WHERE ... in the database rather than a constraint to filter a record set, which is much slower.
- Tried to have the constraints in the 'Selection' criteria of the jobs itself; this eliminates unnecessary records before joins are made.
- Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code the overhead will be minimal.
- Tried not to use a Sort stage when an ORDER BY clause in the database would do.
- Made every attempt to use the bulk loader for the particular database; bulk loaders are generally faster than using ODBC or OLE.
- Tuning should occur on a job-by-job basis.
- Use the power of the DBMS.

Tell me the environment in your last projects.
Give the OS of the server and the OS of the client of your most recent project.

How did you connect to DB2 in your last project?
Most of the time the data was sent to us in the form of flat files; the data is dumped and sent to us. In some cases where we needed to connect to DB2 for lookups, we used ODBC drivers to connect to DB2 (or DB2-UDB) depending on the situation and availability, e.g. the 'iSeries Access ODBC Driver 9.00.02.02'. Certainly DB2-UDB is better in terms of performance, as native drivers are always better than ODBC drivers.

Tell me one situation from your last project where you faced a problem and how you solved it.
1. The jobs in which data is read directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the Transformer to make the jobs run faster.
2. A job aborted in the middle of loading some 500,000 rows. We had the option of either cleaning/deleting the loaded data and then running the fixed job, or running the job again from the row at which it aborted. To make sure the load was proper we opted for the former.

What are Routines, where/how are they written, and have you written any routines before?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are the different types of routines: 1. Transform functions 2. Before/After job subroutines 3. Job control routines.

How did you handle an 'Aborted' sequencer?
In almost all cases we have to delete the data inserted by it from the database manually, fix the job and then run the job again.

Read the string functions in DS.
Functions like [] (the substring function) and ':' (the concatenation operator).
Syntax: string [ [ start, ] length ] or string [ delimiter, instance, repeats ]

What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run a job?
Under Windows: use the 'WaitForFileActivity' under the Sequencers and then run the job.
Under UNIX: poll for the file. Maybe you can schedule the sequencer around the time the file is expected to arrive. Once the file has arrived, start the job or sequencer depending on the file.
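The same "wait for a file" idea can also be expressed inside a job control routine with the DSWaitForFile call listed earlier. The path and the exact timeout syntax below are assumptions used only to illustrate the idea and should be checked against the DSWaitForFile documentation:

    * Suspend the controlling job until the trigger file appears (path and timeout are hypothetical)
    Reply = DSWaitForFile("/data/incoming/customers.dat timeout:2H")
    If Reply <> DSJE.NOERROR Then Call DSLogWarn("Trigger file never arrived", "JobControl")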
How will you determine the sequence of jobs to load into the data warehouse?
First we execute the jobs that load the data into the dimension tables, then the fact tables, then the aggregator tables (if any). Why do we have to load the dimension tables first and then the fact tables? As we load the dimension tables the (primary) keys are generated, and these keys become the foreign keys in the fact tables.

How do you rename all of the jobs to support your new file-naming conventions?
Create an Excel spreadsheet with the new and old names. Export the whole project as a dsx. Write a Perl program which does a simple rename of the strings, looking them up in the Excel file. Then import the new dsx file, probably into a new project for testing. Recompile all jobs. Be cautious that the names of the jobs have also been changed in your job control jobs or sequencer jobs, so you have to make the necessary changes to those sequencers as well.

Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a TRUNCATE statement to the database, or does it do some kind of delete logic?
There is no TRUNCATE on ODBC stages. "Clear the table" is a DELETE FROM statement. On an OCI stage such as Oracle you do have both Clear and Truncate options. They are radically different in permissions (TRUNCATE requires you to have ALTER TABLE permissions, whereas DELETE doesn't).

How would you call an external Java function which is not supported by DataStage?
Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function, write the return values from the Java program (if any) to a file, and use that file as a source in a DataStage job.

Did you work in a UNIX environment?
Yes. This is one of the most important requirements.

What is the utility you use to schedule the jobs on a UNIX server, other than using Ascential Director?
Use the crontab utility along with the dsjob command, passing the proper parameters.

When should we use ODS?
DWHs are typically read-only and batch updated on a schedule; ODSs are maintained in more real time and trickle-fed constantly.
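Where a scheduler script is not available, the same kind of operating-system command can be fired from BASIC with the DSExecute call described earlier. The command string below (a dsjob invocation with invented project and job names) is only an illustration:

    * Run an OS command from a before/after subroutine and log its output
    Command = "dsjob -run -jobstatus MyProject LoadCustomerDim"
    Call DSExecute("UNIX", Command, Output, SystemReturnCode)
    Call DSLogInfo("dsjob returned " : SystemReturnCode : ": " : Output, "JobControl")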
What other ETL tools have you worked with?
Informatica, and also DataJunction if it is present in your resume.

How good are you with your PL/SQL?
On a scale of 1-10, say 8.5-9.

What versions of DS have you worked with?
DS 7.0.2/6.0/5.2.

What's the difference between DataStage developers and DataStage designers?
A DataStage developer is one who codes the jobs; a DataStage designer is one who designs the job - he will deal with the blueprints and design the stages that are required in developing the code.

What are the command line functions that import and export the DS jobs?
- dsimport.exe - imports the DataStage components.
- dsexport.exe - exports the DataStage components.

How do you handle date conversions in DataStage? Convert mm/dd/yyyy format to yyyy-dd-mm?
We use:
a) the "Iconv" function - internal conversion;
b) the "Oconv" function - external conversion.
The function to convert mm/dd/yyyy format to yyyy-dd-mm is Oconv(Iconv(Fieldname, "D/MDY[2,2,4]"), "D-MDY[2,2,4]").
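As a worked illustration of the Iconv/Oconv pairing (the field names are hypothetical; the conversion codes follow the standard D-format syntax), converting a value such as "10/14/2006" to an ISO-style yyyy-mm-dd string could look like:

    * Convert to the internal date number first, then re-format for output
    OutLink.LOAD_DATE = Oconv(Iconv(InLink.TXN_DATE, "D/MDY[2,2,4]"), "D-YMD[4,2,2]")
    * e.g. "10/14/2006" -> "2006-10-14"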
Did you parameterize the job or hard-code the values in the jobs?
Always parameterize the job. Either the values come from Job Properties or from a 'Parameter Manager' - a third-party tool. There is no way you will hard-code some parameters in your jobs. The often-parameterized variables in a job are: DB DSN name, username, password, and dates with respect to the data to be looked up.

What are the main differences between Ascential DataStage and Informatica PowerCenter?
Chuck Kelley's answer: You are right, they have pretty much similar functionality. However, here are some differences you may want to explore with each vendor:
- Does the tool use a relational or a proprietary database to store its metadata and scripts? If proprietary, why?
- What add-ons are available for extracting data from industry-standard ERP, Accounting and CRM packages?
- Can the tool's metadata be integrated with third-party data modeling and/or business intelligence tools? If so, how and with which ones?
- How well does each tool handle complex transformations, and how much external scripting is required?
- What kinds of languages are supported for ETL script extensions?
Almost any ETL tool will look like any other on the surface; the trick is to find out which one will work best in your environment. It all depends on what you need the ETL to do. If you are small enough in your data sets, then either would probably be OK. Think about what process they are going to do. Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday? Are they requiring you to load yesterday's file into a table and do lookups? If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one.

Les Barbusinski's answer: Without getting into specifics, the best way I've found to make this determination is to ascertain how successful each vendor's clients have been using their product, especially clients who closely resemble your shop in terms of size, industry, in-house skill sets, platforms, source systems, data volumes and transformation complexity. Ask both vendors for a list of their customers with characteristics similar to your own that have used their ETL product for at least a year. Then interview each client (preferably several people at each site) with an eye toward identifying unexpected problems, benefits or quirkiness with the tool that have been encountered by that customer. You will not want the vendor to have a representative present when you speak with someone at the reference site, and it is also not a good idea to depend upon a high-level manager at the reference site for a reliable opinion of the product: managers may paint a very rosy picture of any selected product so that they do not look like they selected an inferior product. Ultimately, ask each customer - if they had it all to do over again - whether or not they'd choose the same tool and why. You might be surprised at some of the answers.

Joyce Bischoff's answer: You should do a careful research job when selecting products. You should first document your requirements, identify all possible products and evaluate each product against the detailed requirements. There are numerous ETL products on the market and it seems that you are looking at only two of them. If you are unfamiliar with the many products available, you may refer to www.tdan.com, the Data Administration Newsletter, for product lists. If you ask the vendors, they will certainly be able to tell you which of their product's features are stronger than the other product; ask both vendors and compare the answers, which may or may not be totally accurate. After you are very familiar with the products, call their references and be sure to talk with technical people who are actually using the product.

In how many places can you call routines?
You can call routines in four places:
1. Transform of routine: a. Date transformation b. Upstring transformation
2. Transform of the Before & After subroutines
3. XML transformation
4. Web base transformation

What is the batch program and how can you generate it?
A batch program is a program DataStage generates at run time and maintains itself, but you can easily change it on the basis of your requirement (Extraction, Transformation, Loading). Batch programs are generated depending on the job nature, either a simple job or a sequencer job; you can see this program under the job control option.

Suppose a job sequencer synchronises or controls 4 jobs (job 1, job 2, job 3, job 4), but job 1 has a problem. How can you sort out the problem?
If job 1 has 10,000 rows and after running the job only 5,000 rows have been loaded into the target table, the remaining rows are not loaded and the job aborts. If the job fails it may be a data type problem or a missing column action; in that case go to the Director and check what type of problem is being shown - data type problem, warning message, job fail or job aborted. Then go to the Run window -> Click -> Tracing -> Performance, or in your target table -> General -> Action, where there are two options: (i) On Fail - Commit, Continue; (ii) On Skip - Commit, Continue. First check how much data has already loaded, then select the On Skip option and continue; for the remaining data that was not loaded select On Fail and Continue. Run the job again and you should get a success message.

What happens if RCP is disabled?
In such a case OSH has to perform import and export every time the job runs, and the processing time of the job is also increased.

What are Sequencers?
Sequencers are job control programs that execute other jobs with preset job parameters.

How can I achieve constraint-based loading using DataStage 7.5? My target tables have interdependencies, i.e. a primary key / foreign key relationship. I want my primary key tables to be loaded first and then my foreign key tables, and the primary key tables should be committed before the foreign key tables are executed. How can I go about it?
1) Create a job sequencer to load your tables in sequential mode. In the sequencer, call all the primary key table loading jobs first, followed by the foreign key table jobs. When triggering the foreign key table load jobs, trigger them only when the primary key load jobs have run successfully (i.e. use an OK trigger).
2) To improve the performance of the job, you can disable all the constraints on the tables and load them. Once loading is done, check the integrity of the data; records which do not meet it are raised as exceptional data and cleansed. This is only a suggestion; normally when loading with constraints enabled, performance drops drastically.
3) If you use star schema modeling, when you create the physical DB from the model you can delete all constraints, and the referential integrity is maintained at the ETL process level by looking up all your dimension keys while loading the fact tables. Once all dimension keys are assigned to a fact, the dimension and fact can be loaded together. At the same time RI is being maintained at the ETL process level.

What is the difference between the Filter stage and the Switch stage?
There are two main differences, and probably some minor ones as well:
1) The Filter stage can send one input row to more than one output link; the Switch stage cannot (the C switch construct has an implicit break in every case).
2) The Switch stage is limited to 128 output links; the Filter stage can have a theoretically unlimited number of output links.

How do you pass a filename as a parameter for a job?
While developing the job we can create a parameter 'FILE_NAME', and the value can be passed at run time.

How do you eliminate duplicate rows?
DataStage provides us with a Remove Duplicates stage in Enterprise Edition. Using that stage we can eliminate the duplicates based on a key column.

What is the difference between the JOIN stage and the MERGE stage?
JOIN: performs join operations on two or more data sets input to the stage and then outputs the resulting data set.
MERGE: combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required. A master record and an update record are merged only if both of them have the same values for the merge key column(s) that we specify. Merge key columns are one or more columns that exist in both the master and the update records.

Is there a mechanism available to export/import individual DataStage ETL jobs from the UNIX command line?
Try dscmdexport and dscmdimport. You can only export full projects from the command line, so this won't handle the "individual job" requirement. You can find the export and import executables on the client machine, usually someplace like C:\Program Files\Ascential\DataStage.

How do you merge two files in DS?
Either use a Copy command as a before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.

Advantages of DataStage?
Business advantages:
- Helps with better business decisions.
- It helps to understand new and already existing clients; we can collect data on different clients with it.
- It makes the research of new business possibilities possible.
- We can analyze trends in the data read by it.
Technological advantages:
- It handles all company data and adapts to the needs.
- It offers the possibility of organizing complex business intelligence.
- It is able to integrate data coming from all parts of the company.
- Flexible and scalable.
- Easily implementable.
- It accelerates the running of the project.

How do you do row transposition in DS?
The Pivot stage is used for transposition.
Pivot is an active stage that maps sets of columns in an input table to a single column in an output table.

What is version controlling in DS?
In DS, version control is used to back up the project or jobs. This option is available from DS version 7.1 onwards. Version control tools are of 2 types: 1. VSS (Visual SourceSafe) 2. CVSS (Concurrent Visual SourceSafe). VSS is designed by Microsoft, but the disadvantage is that only one user can access it at a time; other users have to wait until the first user completes the operation. With CVSS many users can access it concurrently; compared to VSS, the cost of CVSS is high.

I developed one job with 50 stages; at run time one stage is missing. How can you identify which stage is missing?
By using the usage analysis tool, which is available in DS Manager, we can find out what items are used in the job.

If a job is locked by some user, how can you unlock that particular job in DS?
We can unlock the job by using the Clean Up Resources option, which is available in DS Director. Otherwise we can find the PID (process id) and kill the process on the UNIX server.

What is the difference between clear log file and clear status file?
Clear log: we can clear the log details by using DS Director. Under the Job menu the Clear Log option is available; by using this option we can clear the log details of a particular job.
Clear status file: lets the user remove the status of the records associated with all stages of the selected jobs (in DS Director).

I am getting an input value like X = Iconv("31 DEC 1967", "D"). What is the X value?
The X value is zero. Iconv takes 31 Dec 1967 as day zero and counts days from that date (31-dec-1967).

My job takes 30 minutes to run and I want it to run in less than 30 minutes. What steps do we have to take?
By using the performance tuning aspects available in DS we can reduce the time:
- In DS Administrator: in-process and inter-process row buffering.
- In between passive stages: the IPC (inter-process) stage.
- OCI stage: array size and transaction size.
- Also use the Link Partitioner and Link Collector stages in between passive stages.

I have some jobs and every month I want to automatically delete the log details. What steps do I have to take?
We have to set the auto-purge option in DS Administrator.

I want to run multiple jobs in a single job. How can I handle that?
In the job properties set the option ALLOW MULTIPLE INSTANCES.

What is the architecture of DataStage?
Basically the architecture of DS is a client/server architecture, with client components and server components.
Client components are of 4 types:
1. DataStage Designer - used to design the jobs.
2. DataStage Manager - used to import and export the project and to view and edit the contents of the repository.
3. DataStage Administrator - used for creating and deleting projects and setting the environment variables.
4. DataStage Director - used to run, validate, schedule and monitor the jobs.
Server components:
- DS server: runs executable server jobs, under the control of the DS Director, that extract, transform and load data into a DWH.
- Repository or project: a central store that contains all the information required to build a DWH or data mart.
- DS Package Installer: a user interface used to install packaged DS jobs and plug-ins.

What are the stages you worked on?

What are unit testing, integration testing and system testing?
Unit testing: as for DS, a unit test will check for data type mismatches, the size of the particular data types and column mismatches.
Integration testing: according to the dependencies we put all jobs, integrated, into one sequence; that is called a control sequence.
System testing: system testing is nothing but the performance tuning aspects in DS.

How many hashing algorithms are available for static hash files and dynamic hash files?
There are sixteen hashing algorithms for static hash files and two hashing algorithms for dynamic hash files (GENERAL or SEQ.NUM).

What happens when you have a job that links two passive stages together?
Obviously there is some process going on. Under the covers DS inserts a cut-down Transformer stage between the passive stages, which just passes data straight from one stage to the other.

I have three jobs A, B and C which are dependent on each other. I want to run A and C daily and B only on Sunday. How can you do it?
First schedule the A and C jobs Monday to Saturday in one sequence. Next take the three jobs, according to the dependency, in one more sequence and schedule that sequence only on Sunday.

What is the use of the Nested Condition activity?
Nested Condition allows you to further branch the execution of a sequence depending on a condition.

What are the ways to execute DataStage jobs?
A job can be run using a few different methods:
- from DataStage Director (menu Job -> Run now...);
- from the command line using a dsjob command;
- a DataStage routine can run a job (DSRunJob command);
- by a job sequencer.

How to invoke a DataStage shell command?
DataStage shell commands can be invoked from:
- DataStage Administrator (Projects tab -> Command);
- a telnet client connected to the DataStage server.

How to stop a job when its status is running?
To stop a running job go to DataStage Director and click the stop button (or Job -> Stop from the menu). If it doesn't help, go to Job -> Cleanup Resources, select a process which holds a lock and click Logout. If it still doesn't help, go to the DataStage shell and invoke the ds.tools command; it will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).

How to release a lock held by jobs?
Go to the DataStage shell and invoke the ds.tools command. It will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).

How to run and schedule a job from the command line?
To run a job from the command line use a dsjob command.
Command syntax: dsjob [-file <file> | [-server <server>][-user <user>][-password <password>]] [<project> <job>]
The command can be placed in a batch file and run from a system scheduler.

User privileges for the default DataStage roles?
The role privileges are:
- DataStage Developer - a user with full access to all areas of a DataStage project;
- DataStage Operator - has privileges to run and manage deployed DataStage jobs;
- <none> - no permission to log on to DataStage.

What is the command to analyze a hashed file?
There are two ways to analyze a hashed file:
- the FILE.STAT command
- the ANALYZE.FILE command
Both should be invoked from the DataStage command shell.

Is it possible to run two versions of DataStage on the same PC?
Yes, even though different versions of DataStage use different system dll libraries. To dynamically switch between DataStage versions, install and run the DataStage Multi-Client Manager. That application can unregister and register the system libraries used by DataStage.

Error in Link Collector - Stage does not support in-process active-to-active inputs or outputs.
To get rid of the error just go to Job Properties -> Performance and select Enable row buffer, then select Inter process, which will let the Link Collector run correctly. A buffer size of 128 Kb should be fine; however, it's a good idea to increase the timeout.

What is the DataStage equivalent to the LIKE option in Oracle?
The following statement in Oracle: select * from ARTICLES where article_name like '%WHT080%';
can be written in DataStage (for example as a constraint expression) as: incol.empname matches '...WHT080...'

What is the difference between the logging text and the final text message in a Terminator stage?
Every stage has a 'Logging Text' area on its General tab which logs an informational message when the stage is triggered or started. Informational messages appear as green lines in the log (DSLogInfo() type messages), while fatal messages appear as red lines. The 'Final Warning Text' is the message which is included in the sequence abort message.

Error in STP stage - STDPROC property required for stage xxx.
The error appears in the Stored Procedure (STP) stage when the 'Procedure name' field is empty. It occurs even if the procedure call syntax is filled in correctly. To get rid of the error, fill in the 'Procedure name' field.

Error in STP stage - SOURCE Procedures must have an output link.
The error appears in the Stored Procedure (STP) stage when there are no stages going out of that stage. To get rid of it, go to stage properties -> 'Procedure type' and select Transform.

How to invoke an Oracle PL/SQL stored procedure from a server job?
To run a PL/SQL procedure from DataStage a Stored Procedure (STP) stage can be used.
However, it needs a flow of at least one record to run. It can be designed in the following way:
- a source ODBC stage which fetches one record from the database and maps it to one column, for example: select sysdate from dual;
- a Transformer which passes that record through;
- a Stored Procedure (STP) stage as the destination. Fill in the connection parameters, type in the procedure name and select Transform as the procedure type. In the input tab select 'execute procedure for each row' (it will be run once).

Design of a DataStage server job with an Oracle PL/SQL procedure call.

Is it possible to run a server job in parallel?
Yes, even server jobs can be run in parallel. To do that go to Job Properties -> General and check the Allow Multiple Instance button. The job can then be run simultaneously from one or many sequence jobs. When that happens DataStage will create new entries in Director, and each new job instance will be named with an automatically generated suffix (for example a second instance of a job named JOB_0100 will be named JOB_0100.JOB_0100_2). An instance can be deleted at any time and will be automatically recreated by DataStage on the next run.

DataStage routine to open a text file with error catching.
Note: work_dir and file1 are parameters passed to the routine.

* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
   CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
   CALL DSLogInfo("Unable to open file", "JobControl")
   ABORT
END

DataStage routine which reads the first line from a text file.
Note: work_dir and file1 are parameters passed to the routine.

* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
   CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
   CALL DSLogInfo("Unable to open file", "JobControl")
   ABORT
END
READSEQ FILE1.RECORD FROM H.FILE1 ELSE
   Call DSLogWarn("******************** File is empty", "JobControl")
END
firstline = Trim(FILE1.RECORD[1,32], " ", "A")  ;* will read the first 32 chars
Call DSLogInfo("******************** Record read: " : firstline, "JobControl")
CLOSESEQ H.FILE1

How to test a DataStage routine or transform?
To test a DataStage routine or transform go to DataStage Manager, navigate to Routines, select the routine you want to test and open it. First compile it and then click 'Test...', which will open a new window. Enter test parameters in the left-hand column and click 'Run all' to see the results. DataStage will remember all the test arguments during future tests.

How to construct a container, deconstruct it, or switch between local and shared?
To construct a container go to DataStage Designer, select the stages that should be included in the container, and from the main menu select Edit -> Construct Container, then choose between local and shared. Local will be visible only in the current job; shared containers can be re-used. Shared containers can be viewed and edited in DataStage Manager under the 'Routines' menu. Local DataStage containers can be converted at any time to shared containers in DataStage Designer by right-clicking on the container and selecting 'Convert to Shared'. In the same way a container can be converted back to local.

When should hashed files be used? What are the benefits of using them?
Hashed files are the best way to store data for lookups. They're very fast when looking up key-value pairs. Hashed files are especially useful if they store data-dictionary-style information (customer details, countries, exchange rates). Stored this way the data can be spread across the project and accessed from different jobs.
Corresponding DataStage data types to Oracle types?
Most of the DataStage variable types map very well to Oracle types. The biggest problem is to map correctly the Oracle NUMBER(x,y) format. The best way to do that in DataStage is to convert the Oracle NUMBER format to the DataStage Decimal type and to fill in the Length and Scale columns accordingly. There are no problems with string mappings: Oracle Varchar2 maps to DataStage Varchar, and Oracle Char to DataStage Char.

Database update actions in the Oracle stage.
The destination table can be updated using various update actions in the Oracle stage. Be aware of the fact that it's crucial to select the key columns properly, as they determine which columns appear in the WHERE part of the SQL statement. Available actions:
- Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
- Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
- Insert rows without clearing - only adds new rows (INSERT statement).
- Delete existing rows only - deletes matched rows (issues only the DELETE statement).
- Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
- Update existing rows only - updates existing rows (UPDATE statement).
- Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.
- Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
- User-defined SQL - the data is written using a user-defined SQL statement.
- User-defined SQL file - the data is written using a user-defined SQL statement from a file.

What is the use of the INROWNUM and OUTROWNUM DataStage variables?
@INROWNUM and @OUTROWNUM are internal DataStage variables which do the following:
- @INROWNUM counts incoming rows to a Transformer in a DataStage job;
- @OUTROWNUM counts outgoing rows from a Transformer in a DataStage job.
These variables can be used to generate sequences, primary keys and ids, for numbering rows, and also for debugging and error tracing. They play a similar role to sequences in Oracle.

How to adjust the commit interval when loading data to the database?
In earlier versions of DataStage the commit interval could be set up in General -> Transaction size (in version 7.x this is obsolete). Starting from DataStage 7.x it can be set up in the properties of the ODBC or Oracle stage in Transaction handling -> Rows per transaction. If set to 0 the commit will be issued at the end of a successful transaction.

DataStage Trim function cuts out more characters than expected.
By default the DataStage Trim function works this way: Trim(" a  b  c  d ") returns "a b c d", while in many other programming/scripting languages a result that keeps the internal spacing would be expected. That is because by default an R parameter is assumed, which is R - remove leading and trailing occurrences of the character, and reduce multiple occurrences to a single occurrence. To remove only the leading and trailing blanks, use the Trim function in the following way: Trim(" a  b  c  d ", " ", "B").
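A hypothetical pair of Transformer output-column derivations tying together the row counters and the Trim behaviour described above (link and column names are invented; each line stands for one derivation as typed into the Transformer grid):

    OutLink.ROW_ID    = @OUTROWNUM                        ;* simple running number per output row
    OutLink.CUST_NAME = Trim(InLink.CUST_NAME, " ", "B")  ;* strip leading/trailing blanks only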
Use and examples of the ICONV and OCONV functions?
ICONV and OCONV functions are quite often used to handle data in DataStage. ICONV converts a string to an internal storage format and OCONV converts an expression to an output format.
Syntax: Iconv(string, conversion code) and Oconv(expression, conversion).
Some useful Iconv and Oconv examples:
Iconv("10/14/06", "D2/") = 14167
Oconv(14167, "D-E") = "14-10-2006"
Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006"
Oconv(12003005, "MD2$,") = "$120,030.05"
This expression formats a number and rounds it to 2 decimal places: Oconv(L01.TURNOVER_VALUE*100, "MD2")
Iconv and Oconv can be combined in one expression to reformat a date easily: Oconv(Iconv("10/14/06", "D2/"), "D-E") = "14-10-2006"

ERROR 81021 Calling subroutine DSR_RECORD ACTION=2
Basically the cause of the problem is a failure in the communication between the DataStage client and the server. The problem appears when a job sequence is used and it contains many stages (usually more than 10), and very often when a network connection is slow. The solution to the issue is: do not log in to DataStage Designer using the 'Omit' option on the login screen. Type in the username and password explicitly and the job should compile successfully.

How to check DataStage internal error descriptions?
To check the description of an error number (for example 081021 above), go to the DataStage shell (from Administrator or telnet to the server machine) and invoke the following command:
SELECT * FROM SYS.HELP.MESSAGE WHERE @ID='081021';
where in that case the number 081021 is an error number. The command will produce a brief error description which probably will not be helpful in resolving the issue, but it can be a good starting point for further analysis.

Error "timeout waiting for mutex"
The error message usually looks as follows:
... dsrpc: Error writing to Pipe ...
... ds_ipcgetnext() - timeout waiting for mutex ...
There may be several reasons for the error, and thus several solutions to get rid of it. The error usually appears when using Link Collector, Link Partitioner and Interprocess (IPC) stages. It may also appear when doing a lookup with the use of a hashed file, or if a job is very complex, with the use of many Transformers.
There are a few things to consider to work around the problem:
- increase the buffer size (up to 1024 Kb) and the Timeout value in the job properties (on the Performance tab);
- ensure that the key columns in active stages or hashed files are composed of allowed characters - get rid of nulls and try to avoid language-specific characters which may cause the problem;
- try to simplify the job as much as possible (especially if it's very complex); consider splitting it into two or three smaller jobs;
- review fetches and lookups and try to optimize them (especially have a look at the SQL statements);
- execute the DS.REINDEX ALL command from the DataStage shell if the above does not help.

ERROR 30107 Subroutine failed to complete successfully
DataStage system help gives only a brief description for message 930107. The solution to the issue is to rebuild the repository index by executing the DS.REINDEX ALL command from the DataStage shell.

DataStage Designer hangs when editing job activity properties
This appears when running DataStage Designer under Windows XP after installing patches or Service Pack 2 for Windows. After opening a job sequence and navigating to the job activity properties window, the application freezes and the only way to close it is from the Windows Task Manager. The solution to the problem is very simple: just download and install the "XP SP2 patch" for the DataStage client. It can be found on the IBM client support site (you need to log in): https://www.ascential.com/eservice/public/welcome.do - go to the software updates section and select an appropriate patch from the Recommended DataStage patches section. Sometimes users face problems when trying to log in (for example when the license doesn't cover the IBM Active Support); then it may be necessary to contact IBM support, which can be reached at WDISupport@us.ibm.com.

Can DataStage use Excel files as a data input?
Section 1.01 Differences between Datastage Enterprise and Server Edition

1. Parallel processing. The major difference between Infosphere Datastage Enterprise and Server edition is that Enterprise Edition (EE) introduces parallel jobs. In most cases parallel jobs and stages look similar to the Datastage Server objects, however their capabilities are quite different. Parallel jobs are highly scalable due to the implementation of parallel processing: Datastage EE is able to execute jobs on multiple CPUs (nodes) in parallel and is fully scalable, which means that a properly designed job can run across the resources within a single machine or take advantage of parallel platforms like a cluster, GRID, or MPP architecture (massively parallel processing). The EE architecture is process-based (rather than thread-based), platform independent and uses the processing node concept.

2. Partitioning and pipelining. Partitioning means breaking a dataset into smaller sets and distributing them evenly across the partitions (nodes); each partition of data is processed by the same operation and transformed in the same way. The main outcome of using a partitioning mechanism is linear scalability - this means, for instance, that once the data is evenly distributed, a 4 CPU server will process the data four times faster than a single CPU machine. Pipelining means that each part of an ETL process (Extract, Transform, Load) is executed simultaneously, not sequentially; the key concept of ETL pipeline processing is to start the Transformation and Loading tasks while the Extraction phase is still running. Datastage Enterprise Edition automatically combines pipelining, partitioning and parallel processing. The concept is hidden from a Datastage programmer: the job developer only chooses a method of data partitioning and the Datastage EE engine executes the partitioned and parallelized processes.

In rough outline:
- Parallel jobs are executable Datastage programs, managed and controlled by the Datastage Server runtime environment.
- Parallel jobs have a built-in mechanism for pipelining, partitioning and parallelism.
- Parallel jobs are compiled into OSH (Orchestrate Shell script language). OSH executes operators - instances of executable C++ classes, pre-built components representing the stages used in Datastage jobs. Server jobs are compiled into BASIC, which is an interpreted pseudo-code. This is why parallel jobs run faster, even if processed on one CPU.
- Datastage EE adds functionality to the traditional server stages, for instance record and column level format properties.
- Datastage EE also brings completely new stages implementing the parallel concept, for example: Enterprise database connectors for Oracle, Teradata and DB2; Development and Debug stages such as Peek, Head, Tail, Row Generator, Column Generator and Sample; Data Set, File Set, Lookup File Set and Complex Flat File; and processing stages such as Join, Merge, Lookup, Funnel, Copy, Modify and Remove Duplicates.
- Parallel jobs are a lot faster in ETL tasks like sorting, filtering and aggregating.
- Sequence jobs are the same in Datastage EE and Server editions.
When a company has both Server and Enterprise licenses, both types of jobs can be used. When processing large data volumes, Datastage EE jobs would be the right choice; when dealing with a smaller data environment, using Server jobs might simply be easier to develop, understand and manage.

What is the difference between DS 7.5 and 8.x?
The new version of DS is 8.0 and it supports Quality Stage, Profile Stage and so on. There is no separate Manager client in version 8 - its import/export functions are incorporated into the Designer itself - and it also includes a web browser based console. Other new features include: 1. a separate SCD stage to implement slowly changing dimensions; 2. parameter sets, so there is no need to hardcode the parameters for every job - a parameter set can be called for the whole project or for a single job; 3. data connection objects.

What happens when a job is compiling?
During compilation of a DataStage parallel job there is very high CPU and memory utilization on the server, and the job may take a very long time to compile.

What is APT_CONFIG in DS?
APT_CONFIG is just an environment variable used to identify the *.apt configuration file. The variable actually used by the engine is APT_CONFIG_FILE (not just APT_CONFIG): it points to the configuration file that defines the nodes, the disk storage and the scratch (temp) areas for a specific project, and DataStage understands the architecture of the system based on this file.
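As an illustration only - the node names, host name and paths below are invented examples, not taken from this document, and real installations vary - a minimal two-node configuration file of the kind APT_CONFIG_FILE points to typically looks something like this:

{
  node "node1"
  {
    fastname "etlserver"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etlserver"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}

Here fastname is the host name of the processing node, resource disk is where persistent data sets are written and resource scratchdisk is the temporary work area; adding another node block is what "adding an extra node" to the configuration means in practice.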
Is it possible to add extra nodes in the configuration file?
Yes. Edit the configuration file (for example through the configuration management tool) and define the required number of node entries; for parallel processing normally at least two nodes are defined.

What is RCP and how does it work?
Run time column propagation (RCP) is used in the case of partial schema usage. If RCP is enabled in the project we define only the columns we are interested in, and DataStage sends the rest of the columns through the various stages; this ensures that such columns reach the target even though they are not used in between the stages. According to the documentation, RCP allows DataStage to be flexible about the columns you define in a job: if RCP is enabled for a project you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages, so such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between. Sequential files, unlike most other data sources, do not have inherent column definitions, so DataStage cannot always tell where there are extra columns that need propagating. You can therefore only use RCP on sequential files if you have used the Schema File property to specify a schema which describes all the columns in the sequential file, and you need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that require a schema file are: Sequential File, File Set, External Source, External Target, Column Import and Column Export. Run time column propagation can be used with the Column Import stage.
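For illustration only - the column names, lengths and delimiter below are invented for this sketch and the exact record properties depend on your data - a schema file referenced by the Schema File property of a Sequential File stage is a small text file in the Orchestrate schema format, roughly like this:

record
{final_delim=end, delim=',', quote=double}
(
  cust_id: int32;
  cust_name: string[max=50];
  region: string[max=10];
)

The record properties describe the file layout (comma delimited, double-quoted, no delimiter after the last field) and the bracketed list describes every column in the file, which is what RCP needs in order to know which extra columns to propagate.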
Star schema and snowflake schema - what is the difference?
Star schema: de-normalized data structure; category-wise single dimension tables; more data dependency and redundancy; no need to use complicated joins; query results are faster; no parent tables; simple DB structure.
Snowflake schema: normalized data structure; dimension tables split into many pieces; less data dependency and no redundancy; complicated joins; some delay in query processing; may contain parent tables; more complicated DB structure.

Difference between OLTP and a data warehouse?
The OLTP database records transactions in real time and aims to automate the clerical data entry processes of a business entity. Addition, modification and deletion of data in the OLTP database is essential, and the semantics of the application used in the front end impact the organization of the data in the database. The data warehouse, on the other hand, does not cater to the real-time operational requirements of the enterprise; it is more a storehouse of current and historical data and may also contain data extracted from external data sources. However, the data warehouse supports the OLTP system by providing a place for the latter to offload data as it accumulates, and by providing services which would otherwise degrade the performance of the OLTP database.

Differences between a data warehouse database and an OLTP database:
Data warehouse database - designed for analysis of business measures by categories and attributes; optimized for bulk loads and large, complex, unpredictable queries that access many rows per table; loaded with consistent, valid data and requires no real-time validation; supports few concurrent users relative to OLTP.
OLTP database - designed for real-time business operations; optimized for a common set of transactions, usually adding or retrieving a single row at a time per table; optimized for validation of incoming data during transactions and uses validation data tables; supports thousands of concurrent users.

What is data modelling?
Data modelling is the analysis of data objects and their relationships to other data objects - the process of identifying entities, the relationships between those entities and their attributes. There is a range of tools used to achieve this, such as data dictionaries, decision tables, decision trees, schematic diagrams and the process of normalisation. Data modelling involves a progression from a conceptual model to a logical model to a physical schema, and it is often the first step in database design and object-oriented programming, as the designers first create a conceptual model of how data items relate to each other.

How do you remove duplicates from a table?
select count(*) from MyTable and select distinct * from MyTable show the extent of duplication; to copy all distinct values into a new table: select distinct * into NewTable from MyTable. To list the duplicated values:
SELECT email, COUNT(email) AS NumOccurrences FROM users GROUP BY email HAVING (COUNT(email) > 1)

How do you find the second highest salary?
select ename, esal from (select ename, esal from hsal order by esal desc) where rownum <= 2;
select max(salary) from emp where salary < (select max(salary) from emp);
select max(sal) from emp where level = 2 connect by prior sal > sal group by level;

Difference between egrep and fgrep?
There is a difference. egrep ("Extended GREP") can search with regular expressions - it uses fancier regular expressions than basic grep and is usually the fastest of the three programs, since it has some more sophisticated internal algorithms; many people use egrep all the time. fgrep ("Fixed GREP") cannot search for regular expressions at all - it is used for plain, fixed string matching. The "f" does not stand for "fast": in fact "fgrep foobar *.c" is usually slower than "egrep foobar *.c" (yes, this is kind of surprising - try it). fgrep still has its uses, though, and may be helpful when searching a file for a larger number of strings than egrep can handle. grep itself is a combination of both: without the -E or -F option it uses basic regular expressions, with -E it behaves like egrep and with -F like fgrep, so the single grep command covers all three uses.
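A few illustrative shell invocations (the file names and patterns are made-up examples, not from this document):

# fixed-string search (fgrep behaviour)
grep -F 'foobar' *.c
fgrep 'foobar' *.c

# extended regular expression search (egrep behaviour)
grep -E 'foo(bar|baz)[0-9]+' *.c
egrep 'foo(bar|baz)[0-9]+' *.c

# search for a large list of fixed strings kept one per line in a file
fgrep -f patterns.txt datafile.txt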
CHMOD command?
Symbolic permissions: u - the user who owns the file; g - the group that owns the file; o - other; a - all. r - read the file; w - write or edit the file; x - execute or run the file as a program. CHMOD can also be used with numeric permissions: 400 read by owner, 040 read by group, 004 read by anybody (other); 200 write by owner, 020 write by group, 002 write by anybody; 100 execute by owner, 010 execute by group, 001 execute by anybody.

What is the difference between DS 7.5 and 8.1?
The main difference is that the DS Manager client is combined with the Designer in 8.1, and the following are new in 8.1: the SCD2 stage, range lookup, QualityStage, ProfileStage and data connection objects.

Difference between internal sort and external sort?
Performance-wise an internal (link level) sort is best because it does not use an extra buffer, whereas an external Sort stage takes buffer memory to store the records.

How many types of parallelism and partitions are there?
Hardware-wise there are two types, SMP and MPP. Within a job DataStage uses pipeline parallelism, where each stage works on a separate processor as rows flow through, and partition parallelism, where the data itself is split across the nodes.

What is the difference between job control and a job sequence?
Job control is specifically used to control jobs - through it we can pass parameters and conditions to the jobs it runs. A job sequence is used to run a group of jobs based on conditions; for final or incremental processing we keep all the jobs in one sequence and run them together, driven by triggers.

Is it possible to run multiple jobs as a single job?
Yes - go to Job Properties and tick the "Allow multiple instances" option.

How is data moved from one stage to another?
In the form of virtual data sets.

How do you pass only a required number of records through the partitions?
Go to Job Properties -> Execution, enable the tracing options and give the required number of records per partition.

What is RCP (in short)?
Run time column propagation is used to propagate the columns which are not defined in the metadata. When we only know the columns to be processed and want all other columns propagated to the target as they are, we enable the RCP option (in the Administrator, on the stage General tab or on the output Columns tab) and specify only the schema of the tables we are concerned with.

What is APT_CONFIG in DS (in short)?
It is the environment variable pointing to the configuration file which defines the nodes, and hence the parallelism, for our jobs; without it a parallel job cannot run. It is possible to add extra nodes by editing the configuration file and defining the required number of nodes.

What happens when a job is compiling?
1. All processing stages generate OSH code. 2. Transformers generate C++ code in the background. 3. The job information is updated in the metadata repository. 4. The job is compiled. (Environment variables used by the job are administered in the Administrator.)

What is APT_DUMP_SCORE?
APT_DUMP_SCORE writes the job score to the log: it shows the operators, processes, datasets, nodes, partitions and operator combinations used in a job, which is useful information when analysing how a job actually ran.
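For illustration only - the installation paths, project name and job name below are invented and depend on the environment - these reporting variables can also be switched on for a single command-line run by exporting them before invoking the engine on a Unix server:

# source the DataStage environment for this shell session (path varies by installation)
. /opt/IBM/InformationServer/Server/DSEngine/dsenv
# enable the score dump and point at a specific configuration file for this run
export APT_DUMP_SCORE=1
export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/default.apt
dsjob -run -jobstatus MyProject MyParallelJob

The same variables can equally be added as job parameters of type environment variable, which keeps the setting visible in the job design.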
How do you develop an SCD using the LOOKUP stage?
We can implement an SCD by using the LOOKUP stage: take the source (a file or database) and a data set as the reference link for the lookup, then the LOOKUP stage. In the lookup we compare the source with the data set and set the lookup failure condition to "continue, continue". After that, in a transformer we apply the business condition and take two targets, one for inserts and one for updates; in those targets the SQL insert and update statements have to be written manually. This approach handles SCD type 1 only, not type 2.

What is the difference between IBM WebSphere DataStage 7.5 (Enterprise Edition) and standard Ascential DataStage 7.5?
I think there is no version called "standard Ascential DataStage 7.5" as such; the advanced edition I know is IBM WebSphere DataStage and QualityStage 8.0, released by IBM itself. IBM Information Server (DS 8) has more features, such as QualityStage and MetaStage, and it maintains its repository in DB2, unlike the file-based repository in 7.5. In version 8 there are only three client tools (Administrator, Designer, Director): the Manager has been removed and its import/export functions are included in the Designer, and some extra stages have been added, such as the SCD stage, with which we can implement SCD1 and SCD2 directly. QualityStage, which can be thought of as a separate tool for the data warehouse, is used for data validation, which is very important for a DWH. If you look at the design of the two versions you can easily see the differences.

What are the main differences between a server job and a parallel job in DataStage?
In server jobs we have few stages, the logic is mainly built in transformers, and MPP systems are not used. In parallel jobs we have many more stages - for most tasks there is a built-in stage - and they make use of SMP/MPP systems. In server jobs there is no option to process the data on multiple nodes as there is in parallel jobs, where we have the advantage of processing data in pipelines and by partitioning. There are also many differences when using the same stages: in a parallel job a sequential file can have either an input link or an output link, whereas in a server job it can have both (and more than one). Server jobs compile and run within the DataStage server engine and typically extract all rows from the source into a stage before that stage becomes active and passes rows on towards the target; this maintains only one node between source and target and takes time. Parallel jobs compile and run on the DataStage Unix server with two kinds of parallelism: pipeline parallelism, where downstream stages become active and pass rows to the target while the source is still being read, and partition parallelism, which maintains more than one node between source and target.

What errors have you experienced with DataStage?
The job log contains warnings and fatal errors. A fatal error aborts the job; warnings do not abort it, but they still have to be handled, and the log should finish with no warnings. Typical examples are "Parameter not found in job", failures during load or recovery, a child job failing for some reason, or a control job failing because one of its jobs failed; many different errors appear in different jobs.
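One convenient way to review those warnings and errors without opening the Director is the dsjob command. The sketch below is illustrative only - the project name, job name and event id are invented:

# summary of the log entries for the latest run of the job
dsjob -logsum MyProject MyLoadJob

# full text of one log entry, using an event id taken from the summary above
dsjob -logdetail MyProject MyLoadJob 25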
What is the maximum size of the Data Set stage (PX)?
There is no fixed limit.

How do you improve performance in the Sort stage?
If the source is an Oracle database, write a user-defined query to sort and remove duplicates in the source itself. If that is not possible, use key-based partitioning in the sort, remove duplicates on a unique partition key, and keep the same partitioning as the previous stage; maintaining sensible key partitioning improves performance.

When you are able to handle null handling and data type changes in ODBC stages, why do you need the Modify stage?
The Modify stage is used to change data types - for example when the source column is varchar and the target column is integer - and to handle nulls and length changes according to the requirement, without the overhead of a Transformer.

What is the difference between the Sequential File stage and the Data Set stage, and when do you use them?
a) The Sequential File stage handles plain sequential text file formats, while a data set is DataStage's own internal format. b) Parallel jobs use data sets to manage data within a job - you can think of each link in a job as carrying a data set. The Data Set stage allows you to store the data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs; using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the Designer or Director. A data set stores data in an internal format: the .ds control file refers to a set of operating system files, and the data can only be viewed through the View Data facility in DataStage, not directly from Linux or the back-end system, whereas a sequential file can be viewed anywhere. Extraction of data from a data set is much faster than from a sequential file.

How can we improve the performance of a job handling a huge amount of data?
a) Minimize the number of Transformer stages. b) Tune at job level or server level. At job level: use a Join for a huge reference data set rather than a Lookup, and use a Lookup only when the reference table holds a small amount of data; use a Modify stage rather than a Transformer for simple transformations; sort the data before a Remove Duplicates stage. Server-level tuning can only be done with adequate knowledge of the server-level parameters that influence execution performance.

How can we create read-only jobs in DataStage?
a) Use a protected project: in a protected project all jobs are read-only and cannot be modified. b) A job can also be made read-only as follows: export the job in .dsx format, change the attribute which stores the read-only flag from 0 (editable) to 1 (read-only), then import the job again and overwrite or rename the existing job.

Error while connecting to DS Administrator?
Go to Settings -> Control Panel -> User Accounts and create a new user with a password. Restart the machine, log in with the new user name and use that user in DataStage; you should then be able to connect.

How many kinds of routines are there in DataStage?
There are three kinds: 1. server routines, used in server jobs and written in BASIC; 2. parallel routines, used in parallel jobs and written in C/C++; 3. mainframe routines, used in mainframe jobs. See also: http://blogs.ittoolbox.com/dw/soa/archives/datastage-parallel-routines-made-really-easy-20926

How will you determine the sequence of jobs to load into the data warehouse?
First execute the jobs that load the dimension tables, then the fact tables, then the aggregator tables (if any). The sequence can also be determined from the parent-child relationships between the target tables: a parent table always needs to be loaded before its child tables.

How can we implement Slowly Changing Dimensions in DataStage?
a) 1. Type 1 SCD: insert else update in the ODBC/target stage. 2. Type 2 SCD: insert a new row when the business key already exists, setting the effective-from date to the job run date and the to-date to some maximum date. 3. Type 3 SCD: move the old value into a history column and update the existing column with the new value. b) SCDs can also be implemented using the Lookup stage together with the Change Capture stage. In general there are three types of SCD: type 1 maintains the current values only, type 2 maintains both current and historical values, and type 3 maintains the current value and partial history.
Differentiate database data and data warehouse data?
Data in a database is detailed or transactional, both readable and writable, in OLTP form and used for transactional processing. Data warehouse data is in OLAP form and used for analysis: it contains current and historical data, largely summarised, follows a denormalized dimensional model and is non-volatile.

How do you delete the header and footer on a source sequential file, and how do you create a header and footer on a target sequential file?
In the Designer palette, under Development/Debug, we can find the Head and Tail stages. The header and footer can also be stripped from the source file before the job reads it, for example with a Unix command called from a before-job subroutine, as sketched below.
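As a sketch only - the file names are invented and a Unix DataStage server is assumed - an ExecSH before-job call could strip the first (header) and last (footer) lines of the incoming file, and a small after-job command could wrap the generated target file:

# remove the first and last line before the Sequential File stage reads the file
sed '1d;$d' /data/in/customers_raw.txt > /data/in/customers_clean.txt

# add a simple header and trailer to the target file produced by the job
{ echo "HDR customers extract"; cat /data/out/customers.txt; echo "TRL end of file"; } > /data/out/customers_final.txt

The command string given to the ExecSH before/after-job subroutine would simply be the relevant line above.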
What is the difference between DataStage and Informatica?
a) One view is that the main difference is scalability and the number of built-in functions. b) Another view is that DataStage is also scalable and that the richer set of specialised stages makes it more user friendly, while others find that having fewer transformer-style components than Informatica causes difficulties while working. c) The main difference is the vendor; both are ETL tools and the choice should be based on the business needs, since each has strengths coming from its architecture. d) Parallelism: DataStage implements parallelism through the node configuration file, whereas Informatica does not support full pipeline parallelism (although it claims to). e) Having used both, DataStage is more powerful and scalable when it comes to performance, while Informatica has more developer-friendly features. Areas where Informatica is weaker: 1. Partitioning - DataStage PX provides many more robust partitioning options, and you can re-partition the data whichever way you want. 2. Parallelism - Informatica does not support full pipeline parallelism. 3. File lookup - Informatica supports flat-file lookup, but the caching is poor; DataStage supports hash files, lookup file sets and data sets for much more efficient lookups. 4. Merge/funnel - DataStage has very rich functionality for merging or funnelling streams, whereas in Informatica the only way is a Union, which is always a union-all. f) Another difference is the repository: for Informatica the repository (the container of metadata) is a database, while for DataStage 7.x it is file based; accessing a file is faster than a database, but data is more secure in a database, so DataStage may perform faster while Informatica is better on that security point. g) One further opinion is that SAS DI Studio beats both, as it generates SAS code at the back end and is highly flexible.

Difference between a hash file and a sequential file?
A hash file stores data based on a hash algorithm and a key value, so it holds no duplicates on the key and is used as a reference for lookups; it can be cached in DataStage server memory (buffer). A sequential file has no key column and cannot be cached in that way.

Why is a sequential file not used in a hash lookup?
Because a sequential file has no key. For a hashed lookup, the data is loaded into a keyed hashed file and that hashed file is used as the lookup reference.

How do you fix the error "OCI has fetched truncated data" in DataStage?
This kind of error typically occurs when there is a CLOB in the back end and a Varchar in DataStage, and it sometimes affects only part of the data. Check the back end and use LongVarchar in DataStage with the maximum length used in the database. (Some have also suggested using a Change Capture stage to identify the truncated rows.)

How do you connect two stages which do not have any common columns between them?
If the two stages do not have the same column names, put a Transformer stage in between and map the required columns.

What is an Invocation ID?
This only appears if the job identified by Job Name has Allow Multiple Instance enabled. You enter a name for the invocation, or a job parameter that supplies the instance name at run time. An invocation ID is what makes a multi-instance job unique at runtime: with normal jobs you can only have one instance running at any given time, while multi-instance jobs allow you to run multiple copies of the same job as long as the currently running invocation IDs are unique. They are still a normal job under the covers, so the "one at a time" restriction still applies - it is just that "one" now includes the invocation ID.
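As an illustrative sketch - the project, job and invocation names are invented - a multi-instance job is addressed from the command line as jobname.invocationid, so two copies can be started side by side:

# start two invocations of the same multi-instance job in parallel
dsjob -run -jobstatus MyProject LoadCustomers.FRANCE &
dsjob -run -jobstatus MyProject LoadCustomers.GERMANY &
wait

Each invocation can of course be given its own -param values, which is the usual reason for running several copies at once.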
Which partition do we have to use for the Aggregator stage in parallel jobs?
By default the stage uses Auto partitioning, and the best partitioning depends on the operating mode of this stage and of the preceding stage: in sequential mode the stage first collects the data using the default Auto collection method, while in parallel mode any partitioning method can be chosen from the list, generally Auto or Hash. More precisely: 1) identify the grouping keys you want to aggregate on; 2) in a stage prior to the Aggregator, do a hash partition on those grouping keys, which ensures that all rows with the same group key lie in the same partition; 3) the result of the aggregation will then be correct. With Auto partitioning there is no guarantee that the key columns you are grouping on lie in the same partition, so the result may not be useful. The Entire partition method can also work, but it carries a higher overhead than hash partitioning. Most of the time you will be running the Aggregator stage in parallel mode.

How do we create an index in DataStage?
It depends on what type of index you are looking for; if it is only a row number, use the @INROWNUM or @OUTROWNUM system variables.

What is the sequence of loading a data warehouse?
The data is extracted from the different source systems into the staging layer; an ODS (Operational Data Store) and the staging area are the two kinds of layer between the source systems and the target system, and the ODS is used to store the recent data. 1. The data arrives periodically in the staging layer. 2. After extraction the data is cleansed in the staging layer (LTRIM/RTRIM and so on). 3. From the staging area the data is loaded into the dimension and lookup tables. 4. Finally the fact tables are loaded from the corresponding staging tables, after the data has been transformed according to the business needs with the help of the ETL transformations, and the result is loaded into the target data warehouse.
What is a sequential file that has a single input link?
A sequential file always has a single link on each side because it cannot accept multiple links or threads; data in a sequential file is always processed sequentially.

Is a hashed file an active or a passive stage? When is it useful?
The Hashed File stage is a passive stage; stages which do some processing of the data, such as Transformer, Aggregator and Sort, are called active stages. Hashed files are created as dynamic files using a hashing algorithm and are typically used as lookup references.

What is a hashing algorithm?
Hashing is a technique for deciding where a record is stored in a dynamic file based on its key. There are several algorithms for doing this; data structure textbooks describe the algorithm models.

How do you load partial data after a job failed? The source has 10000 records, the job aborted after 5000 were loaded - how can we resume the load?
There are many ways of doing this, and some database tools do it automatically. To do it manually, keep track of the number of records processed in a hash file or a text file and update it as each record is inserted; if the job fails in the middle, read the number from the file and process only the records after that point, ignoring the earlier ones (the @INROWNUM function helps here). Instead of removing the 5000 records already in the target, the job then resumes from record 5001. Generally a load job should not fail unless there is a data issue, and data issues should already have been cleared in the transform step.

Aggregators - what does the warning "Hash table has grown to 'xyz' ..." mean?
The Aggregator cannot spill its data to disk the way the Sort stage can, so the data flowing into it occupies system memory; when memory fills up you get this kind of message. Common workarounds are to land the data first, or to process the data in multiple chunks with multiple aggregators.

How do you extract job parameters from a file?
Through the User Variables activity in a sequence job, for example by calling a routine that reads the file and assigns the values; a command-line equivalent is sketched below.
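Outside of a sequence, the same idea works from the command line. The sketch below is illustrative only - the file name, parameter names, project and job names are invented - and it assumes the dsjob client is on the PATH and that the parameter values contain no spaces:

# params.txt holds one name=value pair per line, for example:
#   LOAD_DATE=2008-03-31
#   SRC_DIR=/data/in
PARAM_ARGS=""
while IFS= read -r line; do
  PARAM_ARGS="$PARAM_ARGS -param $line"
done < params.txt

# pass every pair from the file to the job as a -param argument
dsjob -run -jobstatus $PARAM_ARGS MyProject MyLoadJob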
Some short questions and answers:
1. What about system variables? System variables are a set of built-in, read-only variables starting with @ which return system information; they can be accessed from a transformer or a routine.
2. How can we create containers? A container is a group of stages and links; there are two types, local containers and shared containers.
3. How can we improve the performance of DataStage? Project tunables can be set through the Administrator; manage the array size and transaction size; turn on in-process row buffering; use IPC stages and hashed files where appropriate.
4. What are job parameters? Values that are required during the job run, passed in at run time rather than hard-coded.
5. What is the difference between a routine and a transform and a function? Routines call jobs or perform actions in DataStage; transforms are the manipulation of data during the load.
6. What are the third party tools used with DataStage? There are lots of them, mainly schedulers.
7. How can we implement a lookup in DataStage server jobs? Using a hashed file.
8. How can we join one Oracle source and a sequential file? Using a hash file, or through the target Oracle stage with the appropriate update action.
9. How can we implement Slowly Changing Dimensions in DataStage? Using the Change Capture (CDC) approach with target Oracle stages and the appropriate update action, and a row id or sequence-generated number as the surrogate key.
10. What are the Iconv and Oconv functions? Powerful functions for date (and other) transformations; Iconv is the internal conversion, Oconv the external conversion.
11. Difference between a hash file and a sequential file? Access speed is slower on a sequential file than on a hashed file.
12. What is the maximum length of a job name in DataStage? Job names can be any length; they can contain alphanumeric characters and underscores and must begin with an alphabetic character. Job category names can be any length and consist of any characters, including spaces.

What are the difficulties faced in using DataStage, or what are its constraints?
a) Typical concerns are what to do when the number of lookups is very large, and what happens when a job aborts part-way through a load. b) In my experience the most difficult part is understanding the Director job log error messages: the error codes are not very readable and, as a fresher, I did not know what they stood for; the built-in help was also rather general instead of specific. DataStage is also a little peculiar compared with other ETL tools in its function names - most databases and ETL tools use UPPER to convert to upper case, while DataStage uses UCASE. Other than that I do not see real issues with DataStage; it was simple to use, so I liked working with it in spite of the above.

Have you ever been involved in upgrading DataStage versions, for example from DS 5.X? If so, tell us some of the steps you followed.
Yes. The install process collects project information during the upgrade and in principle no rework of existing jobs and routines is needed afterwards, but these are the steps I have taken (they also apply when moving DataStage from one machine to another):
1) Definitely take a backup of the whole project(s) by exporting each project as a .dsx file.
2) Use the same parent folder for the new version so that jobs using hard-coded file paths keep working.
3) After installing the new version, import the old project(s) and compile them all again; the "Compile All" tool can be used for this.
4) Make sure all your database DSNs are created with the same names as the old ones.
5) If you are also upgrading the database from Oracle 8i to Oracle 9i, there is a tool on the DS CD that can do this for you.
6) Do not stop the 6.0 server before the upgrade, since the install process collects project information during the upgrade.
What are XML files and how do you read data from XML files, and which stage should be used?
In the palette there are real-time stages for this: XML Input, XML Output and XML Transformer. First import the XML file metadata into the Designer repository using the XML metadata importer (the XML source definition). Then use the XML Input stage: in the input tab, under the XML source, give the XPath expression for each element of the XML; on the output tab you can import the metadata (columns) of the XML file and use them as input columns for the rest of the job. The XML stage documentation explains this clearly. One practical approach is to define the XML file path as an environment parameter in the Administrator, read that value with a Transformer stage (without an input link) in the server job, and pass it to the XML Input stage.

How do you track performance statistics and enhance them?
Through the Monitor we can view the performance statistics. You can also right-click a server job and select the performance statistics option, which shows the throughput in rows per second while the job runs.

What is the default cache size and how do you change it if needed?
The default read cache size is 128 MB, which is primarily used for the hash file data cache in the server. It can be increased in the DataStage Administrator on the Tunables tab; this setting can only be made in the Administrator, not at job level - at job level only the buffer size can be tuned.

Types of views in DataStage Director?
From what I know there are four views: 1) Status, 2) Schedule, 3) Log and 4) Detail. A "Job view" is not one of them.

How do you catch bad rows from an OCI stage?
The question is a little ambiguous, but there are two common approaches. a) Place conditions inside the OCI stage itself: say there are four departments, 501 through 504 - one output link can select where deptno <= 503 and a second output link where deptno > 503, so only the wanted rows are output through each link. b) Send the rows into a Transformer, place a constraint on the main output, and take another output link to a sequential file (or other stage) using the reject-row mechanism to capture the rejected rows.
What is the use and advantage of a procedure in DataStage?
To trigger database operations before or after the database stage is accessed.

What are the important considerations when using a Join stage instead of Lookups, and when do you use each?
If the volume of reference data is high, use a Join stage instead of a Lookup; if the reference table holds a small amount of data, a Lookup is fine and is the easier option.

What are Quality Stage and Profile Stage?
QualityStage is used for data cleansing; ProfileStage is used for profiling, i.e. analysing the data and the relationships within it.

How do you implement a type 2 slowly changing dimension in DataStage? Give an example.
Slowly changing dimensions are a common problem in data warehousing. For example, a customer called Lisa in company ABC lives in New York; later she moves to Florida, and the company must update her address. In general there are three ways to solve this problem:
Type 1: the new record replaces the original record, leaving no trace of the old record at all.
Type 2: a new record is added to the customer dimension table, so the customer is treated essentially as two different people and both the original and the new record are present; the new record gets its own surrogate (primary) key. The advantage is that history is kept, but the dimension table grows, so storage and performance can become a concern - type 2 should only be used if the data warehouse really needs to track the historical changes.
Type 3: the original record is modified to reflect the change - for example a new column is added which keeps the original address (New York) while the current address column shows Florida. This keeps part of the history without growing the table, but when the customer later moves from Florida to Texas the New York information is lost, so type 3 should only be used if changes occur a finite number of times.
In DataStage you can use the Change Capture stage: it tells you whether the source record is an insert, update or unchanged after comparing it with the data warehouse record, and you then choose the action accordingly. Type 2 is typically implemented with the Change Capture and Change Apply stages.

How do you pass parameters to the job sequence if the job is running at night?
1. Set the default values of the parameters in the job sequence and map these parameters to the job. 2. Run the job in the sequence using the dsjob utility, where the value for each parameter can be specified. You can also insert the parameter values into a table and read them when the job runs using an ODBC or plug-in stage, pass the parameters using DSSetParam from the controlling job (a batch job or job sequence) or a job control routine, or use dsjob -param from a shell script or DOS batch file when running from the command line.
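For an unattended night run on a Unix server, one common pattern is a small wrapper script scheduled from cron. This is a sketch only - the installation path, script path, project, sequence and parameter names are all invented:

# /home/dsadm/run_nightly.sh - source the environment, then start the sequence with its parameters
. /opt/IBM/InformationServer/Server/DSEngine/dsenv
dsjob -run -jobstatus -param LOAD_DATE=$(date +%Y-%m-%d) MyProject seqNightlyLoad

# example crontab entry: run at 01:30 every day and keep a log
# 30 1 * * * /home/dsadm/run_nightly.sh >> /home/dsadm/nightly.log 2>&1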
What are static hash files and dynamic hash files?
As the names suggest, hashed files can be static or dynamic. A hashed file's size is established by its modulus (the number of groups) and separation (the group size) when it is created. Dynamic hash files (Type 30, the ones generally used) automatically adjust their modulus and separation based on the incoming data; static hash files do not adjust their modulus or create hashing groups automatically, so they are best when the data is static. The data file has a default maximum size of 2 GB, and the overflow file is used when data grows beyond the space reserved for one of the groups (sectors) within the file; overflow groups are also used when the data row size is equal to or greater than the specified Large Record size - when a group cannot accommodate a row, the row goes to overflow. Overflow should be minimised as much as possible for optimal performance.

What is "insert for update" in DataStage?
The question is not entirely clear; "insert to update" usually means the updated value is inserted so that history is maintained. There is also a "lock for updates" option in the Hashed File stage, which locks the hashed file for updating when the search key of the lookup is not found.

What is the order of execution done internally in the transformer, with input links on the left hand side and output links on the right?
Stage variables are evaluated first, then constraints, then the column derivations (expressions). There is only one primary input link to a Transformer, but there can be many reference input links and many output links; you can output to multiple links by defining constraints on them, and you can edit the order of the input and output links from the Link Ordering tab in the transformer stage properties dialog.

How did you connect to DB2 in your last project?
Using DB2 ODBC drivers. The stages that can connect to a DB2 database are the ODBC stage, the DB2 plug-in stage and the Dynamic Relational stage.

What is the difference between DataStage server jobs and DataStage parallel jobs?
The basic difference is that a server job usually runs on the Windows platform engine and executes on one node, whereas a parallel job runs on the Unix platform and can run on more than one node.

How do you merge two files in DS?
Either use a copy/concatenate command as a before-job subroutine if the metadata of the two files is the same, or create a job to concatenate the two files into one if the metadata is different. Within a job you can merge the data from two files with a Funnel stage, or read more than one file with a single Sequential File stage provided both files have the same format.
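As a sketch of the before-job route - the file names are invented and a Unix server is assumed - the ExecSH before-job subroutine can be given a one-line command that concatenates two identically formatted files into the file the job actually reads:

cat /data/in/sales_part1.txt /data/in/sales_part2.txt > /data/in/sales_all.txt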
What happens if the job fails at night?
The job sequence aborts. If you are on call, you will be called to fix and rerun the job. You can also design the sequence to send an email using a notification (SMTP) activity when a job fails, log the failure to a file using DSLogFatal or DSLogEvent from the controlling job or an after-job routine, or use dsjob -log from the command line.

How will you call an external function or subroutine from DataStage?
There are DataStage options to call external programs: the ExecSH and ExecDOS before/after stage or job subroutines, the Command Stage plug-in, or the Execute Command activity in a job sequence.

Types of parallel processing?
Hardware-wise there are three types of parallel processing systems: 1. SMP (symmetric multiprocessing) - multiple CPUs with shared memory and a single OS; 2. MPP (massively parallel processing) - multiple CPUs, each with its own set of resources (memory, OS), but physically housed in the same machine; 3. clusters - the same idea as MPP but physically dispersed (not in the same box) and connected via high-speed networks. Within a job, some also distinguish data (partition) parallelism, pipeline parallelism and round-robin distribution; DataStage itself offers two types of parallelism to take advantage of the hardware: pipeline parallelism and partition parallelism.

How do you do an Oracle 4-way inner join if there are 4 Oracle input files?
The question is asked incorrectly - there is no such thing as an Oracle input file; it is an Oracle table or view object. Please clarify what the actual question is.

How do you pass a filename as a parameter for a job?
a) While developing the job, create a parameter such as FILE_NAME, use it in the source, target or lookup stage, and supply the value when running the job. b) Alternatively define it as a user-defined environment variable: go to DataStage Administrator -> Projects -> Properties -> Environment -> User Defined and enter the parameter name and the file path in the grid; then in the job go to Job Properties -> Parameters -> Add Environment Variable, keep the project default, and in the stage click "Use Job Parameter" and select the parameter (its name appears in the text box); the file name can then be supplied at run time.

What is DS Administrator used for? Did you use it?
The Administrator enables you to set up DataStage users, control the purging of the repository and, if National Language Support (NLS) is enabled, install and manage maps and locales. It is primarily used to create and configure DataStage projects, set project-level defaults and parameters for jobs, and assign users and their roles to a project.
If you are running 4-way parallel and you have 10 stages on the canvas, how many processes does DataStage create?
One answer is 40: 10 stages, each partitioned to run on 4 nodes, gives 40 processes. More precisely, it depends on the number of active stages on the canvas and how they are linked, since only active stages create processes: for example, with 6 active stages (such as transformers) linked by passive stages, the total number of processes is 6 x 4 = 24.

How do you populate source files?
There are many ways; writing a SQL statement in Oracle is one of them.

How do you handle date conversions in DataStage - for example converting an mm/dd/yyyy format to yyyy-dd-mm?
We use the Iconv function (internal conversion) and the Oconv function (external conversion), for example: Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]"). In parallel jobs a function such as ToChar(%date%, %format%) can be used, specifying the format you want, i.e. 'yyyy-dd-mm'.

Differentiate primary key and partition key?
A primary key is defined on a table column or set of columns (a composite primary key) to make sure all the rows in the table are unique - a combination of unique and not null. A partition key is just the key used when partitioning a table or the data flowing through a stage; there are several partition methods, such as hash and random. We define the partitioning based on the stages (in DataStage) or transformations (in Informatica) used in the job or mapping, and we use partitioning to improve the target load process.

What are the third party tools used with DataStage?
Autosys, TNG and an event coordinator are some that I know of and have worked with; the Maestro and Control-M schedulers are other third party tools commonly used.

How do you eliminate duplicate rows?
a) Use the Remove Duplicates stage: it takes a single sorted data set as input, removes all duplicate records and writes the results to an output data set. b) If you do not have a Remove Duplicates stage, you can use a hash file, since a hashed file by nature does not allow duplicate keys.

What is the difference between a routine, a transform and a function?
Routines and transforms sound similar, but a routine describes a piece of business logic, whereas a transform changes the data from one form to another by applying transformation rules. Routines can return values, while transforms used in a transformer cannot.

Is it possible to calculate a hash total for an EBCDIC file and have the hash total stored as EBCDIC using DataStage?
Currently the total is converted to ASCII, even though the individual records are stored as EBCDIC.

Explain the differences between Oracle 8i and 9i?
The answers usually given are fairly vague: 9i improves multiprocessing, database features and support for dimensional modelling, and allows more columns per table than 8i (commonly quoted as 256 versus up to 1000 columns).
What is an environment variable?
Environment variables are predefined variables that we can use while creating DataStage jobs; we can set them at project level or job level, and once set, a variable is available to the whole project or job. We can also define new environment variables, for example $APT_CONFIG_FILE. In Job Properties, on the Parameters tab, click "Add Environment Variable" to see most of them. There are two types, string and encrypted. They are used to configure jobs - for example to associate the configuration file (without which a parallel job cannot run) or to increase the sequential file or data set read/write buffers.

Here is a fuller FAQ on this topic. Creating project-specific environment variables: start DataStage Administrator, choose the project and click "Properties"; on the General tab click the "Environment..." button and use the "User Defined" folder to see and add job-specific environment variables. An encrypted environment variable appears as "*******" in the Administrator and as scrambled text when saved to the DSParams file or displayed in a job log; you need to type its value twice into the password entry box. Using environment variables as job parameters: open a job, go to Job Properties -> Parameters and click the "Add Environment Variables..." button. When the job parameter is first created it has a default value equal to the value entered in the Administrator; changing this default to $PROJDEF instructs DataStage to retrieve the latest value for the variable at job run time. An encrypted environment variable should also be an encrypted job parameter with its value set to $PROJDEF. These job parameters are then used just like normal parameters, by adding them to stages in the job enclosed by the # symbol, for example:
Database=#$DW_DB_NAME#
Password=#$DW_DB_PASSWORD#
File=#$PROJECT_PATH#/#SOURCE_DIR#/Customers_#PROCESS_DATE#.csv
Migrating project-specific job parameters: it is possible to set or copy job-specific environment variables directly in the DSParams file in the project directory. There is also a DSParams.keep file in that directory, and if you make manual changes to DSParams you may find that the Administrator rolls them back to DSParams.keep; when copying parameters between projects it may be safer to replace only the User Defined section of these files, not the General and Parallel sections.

What is the use of environment variables?
A typical example is a database connection: the user id, password and schema are defined as environment variables because they are constant throughout the project, and they are then used wherever needed with #Var#. If the password or schema changes, you change it once at the environment variable level and all jobs pick it up; there is no need to touch every individual job.

How do you find duplicate records using the transformer stage in server edition?
This question has several answers because eliminating duplicates is situation-specific. You can use a hash file, which by nature does not allow duplicates, or write a SQL query on the relevant fields, and attach a reject link to see the duplicates for verification. With a transformer stage alone, you can identify duplicates and send unique rows to one output while directing the duplicate rows to another output (the "rejects"); this approach requires sorted input.

What is a phantom error in DataStage?
For every job in DataStage a phantom (background) process is generated for the job as well as for every active stage which contributes to the job. These phantoms write logs about the stage or job, and the logs are stored in the &PH& folder of the project. If any abnormality occurs, an error message is written; these errors are called phantom errors. If a process is killed, it sometimes keeps running in the background - this is called a phantom process. Phantoms can be killed through the DataStage Administrator or at server level, for example using the operating system's process tools.
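As a hedged illustration only - the project path below is invented, and you should check your own housekeeping policy before deleting anything - the phantom log files accumulate under the project's &PH& directory and can be listed or aged out from the shell; note the quoting needed around the & characters:

# list phantom log files older than 30 days in an example project
find '/opt/IBM/InformationServer/Server/Projects/MyProject/&PH&' -type f -mtime +30
# remove them once you are sure they are no longer needed
find '/opt/IBM/InformationServer/Server/Projects/MyProject/&PH&' -type f -mtime +30 -exec rm {} \;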
How can we run a batch using the command line?
The dsjob command is used to run DataStage jobs from the command line; with dsjob you can run any DataStage job in the DataStage environment. In older architectures people used to create a batch job to control the remaining DataStage jobs in the process (as in the KEN BLEND architecture).

Explain a specific scenario where we would use range partitioning?
a) It is used when the data volume is high; it partitions the data column-wise.
b) If the data is large and you cannot process the full data in one pass, you will generally use range partitioning.

What is job commit in DataStage?
a) Job commit means it saves the changes made.
b) DataStage commits each record in the general case, but you can force DataStage to take a set of records and then commit them. In the case of the Oracle stage, the Transaction Handling tab lets you set the number of rows per transaction.

What is fact load?
a) You load the fact table in the data mart with the combined input of ODBC (or DataStage engine) data sources. You also create transformation logic to redirect output to an alternate target, the REJECTS table, using a row constraint.
b) In a star schema there are fact and dimension tables to load in any data warehouse environment. You will generally load the dimension tables first and then the facts; the facts will have the related information of the dimensions.

What is fact loading, and how do you do it?
a) First you run the hash-file jobs, secondly the dimension jobs and lastly the fact jobs.
b) Once we have loaded our dimensions, then, as per business requirements, we identify the facts (the columns or measures on which the business is measured) and load them into the fact tables.

How can we remove duplicates using the Sort stage?
a) Set the "Allow Duplicates" option to false.
b) Outside DataStage, the same idea in Java 5 is enough for sorting and removing duplicate elements (names is an existing String[] array):
   import java.util.*;
   // A TreeSet keeps its elements sorted and silently discards duplicates
   TreeSet<String> set = new TreeSet<String>(Arrays.asList(names));
   for (String name : set) System.out.println(name);

What is the difference between RELEASE THE JOB and KILL THE JOB?
Release the job releases it from any dependencies and runs it. Kill the job kills the job that is currently running or scheduled to run.

What is the repository?
The repository resides in a specified database. It holds all the metadata (information): raw data, mapping information and all the respective mapping details.

Can you convert a snowflake schema into a star schema?
Yes. We can convert it by attaching one hierarchy to the lowest level of another hierarchy.

What is the alternative way in which we can do job control?
a) Job control is possible through scripting; how you control jobs depends on the requirements and the needs of the job.
b) Job control can be done using DataStage job sequencers, DataStage custom routines, scripting, and scheduling tools such as Autosys. Maestro is another third-party scheduler, as is the Control-M job scheduler.
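Where the job control is done with plain scripting rather than a sequencer, the dsjob calls described above can simply be chained in a shell script. This is only a sketch: the project name (dwproj) and job names (LoadDimensions, LoadFacts) are hypothetical, and the exact exit-code mapping of -jobstatus varies by release, so check it against your dsjob documentation.

   #!/bin/sh
   # -jobstatus makes dsjob wait for the job and reflect its finishing status in the exit code
   dsjob -run -jobstatus dwproj LoadDimensions
   if [ $? -ne 0 ]; then
       echo "Dimension load did not finish cleanly - skipping the fact load" >&2
       exit 1
   fi
   # Load the facts only after the dimensions are in place
   dsjob -run -jobstatus dwproj LoadFacts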
Where can we use the Link Partitioner, Link Collector and Inter Process Communication (IPC) stages: in server jobs or in parallel jobs? And is SMP parallel or server?
You can use the Link Partitioner and Link Collector stages in server jobs to speed up processing. Suppose you have a source and a target with a Transformer in between that does some processing: you can speed it up by using a Link Partitioner to split the data from the source into different links, apply the business logic (functions etc.) on each link, and then collect the data back using a Link Collector and pump it into the output. The IPC stage is also intended to speed up processing.

Do you know about MetaStage?
a) In simple terms, metadata is data about data, and in MetaStage that can be anything like a DataStage dataset; you can perform analysis on dependencies, change impact and so on. Metadata defines the type of data we are handling.
b) MetaStage is used to handle the metadata, which is very useful for data lineage and data analysis later on.
c) MetaStage is a metadata repository in which you can store metadata (DDLs etc.). These data definitions are stored in the repository and can be accessed with the use of MetaStage.
d) MetaStage is DataStage's native reporting tool; it contains lots of functions and reports.
e) MetaStage is a persistent metadata directory that uniquely synchronizes metadata across multiple separate silos. Based on patented technology, it provides seamless cross-tool integration throughout the entire Business Intelligence and data integration lifecycle and toolsets, eliminating rekeying and the manual establishment of cross-tool relationships.

Where can you output data using the Peek stage?
In DataStage Director: look at the Director log.
b) The output of the Peek stage can be viewed in the Director log, and it can also be saved as a separate text file.
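Because the Peek output lands in the job log, it can also be pulled without opening Director by using the dsjob log options. A minimal sketch; the project and job names are hypothetical, and the -logsum and -logdetail options and their arguments should be checked against the dsjob client shipped with your release.

   # Summarise the most recent log entries for the job (Peek rows show up here)
   dsjob -logsum -max 50 dwproj LoadCustomers

   # Then fetch the full text of one entry by its event id from the summary
   # dsjob -logdetail dwproj LoadCustomers <event id>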