Design Approach to Handle Late Arriving Dimensions and Late Arriving Facts

March 25, 2018 | Author: boddu_raghunarayana | Category: Data Warehouse, Grid Computing, Scalability, Load Balancing (Computing), Database Index






Johnson Cyriac | Dec 29, 2013 | DW Design | ETL Design

In a typical data warehouse load, dimensions are processed first and facts are loaded later, on the assumption that all required dimension data is already in place. This is not always true: the nature of the business process or the behavior of the source application can delay dimension data. Fact data, too, can reach the warehouse long after the actual transaction was created. This article discusses several options for handling late arriving dimensions and late arriving facts.

What is a Late Arriving Dimension

A late arriving dimension (sometimes described as an early arriving fact) occurs when dimension data reaches the data warehouse later than the fact data that references that dimension record. For example, an employee who takes medical insurance through his employer is eligible for coverage from the first day of employment, but the employer may not pass the enrollment information to the insurance provider for several weeks. If the employee undergoes medical treatment during this period, his claims arrive as fact records without the corresponding patient dimension details.

Design Approaches

Depending on the business scenario and the type of dimension in use, we can take different design approaches.

Hold the fact record until the dimension record is available
One approach is to place the fact row in a suspense table and hold it there until the associated dimension record has been processed. This is relatively easy to implement, but the fact row is not available for reporting until the dimension record has been handled. It is most suitable when the warehouse is refreshed as a scheduled batch process and the business can accept a delay in loading fact records.

'Unknown' or default dimension record
Another approach is to assign an "Unknown" dimension member to the fact record. The fact is recorded during the ETL run, but it is not associated with the correct dimension value. The "Unknown" fact records can also be kept in a suspense table; when the dimension data eventually arrives, the suspense data is reprocessed and associated with the real dimension record.

Inferring the dimension record
A third method is to insert a new placeholder dimension record with a new surrogate key and use that surrogate key to load the incoming fact record. This only works if the fact record carries enough detail to construct the dimension's natural key; without it, you would never be able to go back and update the placeholder row with the complete attributes. In the insurance claim example, the "patient id", which is the natural key of the patient dimension, is almost certain to be part of the claim fact, so we can create a placeholder patient record with a new surrogate key and that natural key.

Note : When the full set of patient attributes arrives later, apply an SCD Type 1 update to the placeholder the first time and SCD Type 2 from then on.
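As a rough illustration of the inferred-member approach, here is an Oracle-style SQL sketch that creates placeholder rows and then loads the facts against them. The table, column and sequence names (CLAIM_STG, PATIENT_DIM, CLAIM_FACT, PATIENT_SK_SEQ) are hypothetical and not taken from the article.

  -- Create a placeholder row for every patient referenced by a staged claim
  -- but not yet present in the dimension.
  INSERT INTO patient_dim (patient_sk, patient_id, patient_name, inferred_flag)
  SELECT patient_sk_seq.NEXTVAL, s.patient_id, 'Unknown', 'Y'
  FROM  (SELECT DISTINCT c.patient_id
         FROM   claim_stg c
         WHERE  NOT EXISTS (SELECT 1
                            FROM   patient_dim d
                            WHERE  d.patient_id = c.patient_id)) s;

  -- Load the facts; every claim now resolves to a surrogate key, real or inferred.
  INSERT INTO claim_fact (patient_sk, claim_id, claim_amount)
  SELECT d.patient_sk, c.claim_id, c.claim_amount
  FROM   claim_stg c
  JOIN   patient_dim d ON d.patient_id = c.patient_id;

The inferred_flag column makes it easy to find the placeholder rows later, when the full patient attributes arrive and the Type 1 overwrite described above is applied.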
Late Arriving Dimension with multiple historical changes

A late arriving dimension with SCD Type 2 changes is more complex to handle. In the insurance claim example above, the placeholder record is created as soon as the first claim arrives. Before the full dimension details come in from the source system, there may already be multiple SCD Type 2 changes to that placeholder record. If a late arriving fact then needs to be associated with an SCD Type 2 dimension, you have to scan forward in the dimension to see whether there are subsequent Type 2 rows, which can lead to creating a new dimension record with a new surrogate key and modifying the surrogate key of any subsequent fact records to point to it.

Late Arriving Dimension with retro effective changes

You can also receive dimension records from the source system with retro effective dates. For example, you might update your marital status in the HR system long after your actual marriage date, and that update reaches the data warehouse with a retroactive effective date. This is where the situation becomes messy: we have to search back through the dimension history to decide which dimension keys were in effect when the activity occurred in the past. The change leads to a new dimension record with a new surrogate key, adjustments to the effective dates of the affected dimension rows, and updates to the surrogate key of any subsequent fact records so that they point to the new row.

What is a Late Arriving Fact

A late arriving fact occurs when the transaction or fact data reaches the data warehouse long after the transaction actually occurred in the source application.

Design Approaches

Unlike late arriving dimensions, late arriving fact records can be handled relatively easily. When loading the fact record, the associated dimension table history has to be searched to find the surrogate key that was in effect at the time the transaction occurred; the data flow below describes this design approach.

Hope you enjoyed this article and that it gave you some new insights into late arriving dimension and fact scenarios in the data warehouse. We would also like to hear how you have handled late arriving dimensions and facts in your warehouse; leave us your questions and comments.
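A minimal SQL sketch of that dimension-history search, assuming a hypothetical SALES_STG staging table and an SCD Type 2 PRODUCT_DIM with EFF_START_DT and EFF_END_DT columns (these names are illustrative, not from the article):

  -- For a late arriving fact, pick the dimension version that was in effect
  -- on the transaction date, not the current version.
  INSERT INTO sales_fact (product_sk, sale_date, sale_amount)
  SELECT d.product_sk, s.sale_date, s.sale_amount
  FROM   sales_stg s
  JOIN   product_dim d
         ON  d.product_id = s.product_id
         AND s.sale_date BETWEEN d.eff_start_dt AND d.eff_end_dt;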
SOFT and HARD Deleted Records and Change Data Capture in Data Warehouse
Johnson Cyriac | Dec 8, 2013

In a couple of earlier articles we spoke about change data capture. In this article we will dig deeper into the different aspects of change data in a data warehouse, including soft and hard deletions in source systems, different techniques to capture change data, and a change data capture framework.

Revisiting Change Data Capture (CDC)

When we talk about Change Data Capture (CDC) in a data warehouse, we mean capturing the changes that have happened on the source side since the last time we ran our job. In Informatica we call our ETL code a 'Mapping', because we map the source data (OLTP) into the target data (DW), along with some transformations in between; the purpose of running the ETL code is to keep source and target in sync. Data can change at the source in three different ways:

1. NEW transactions happen at the source.
2. CORRECTIONS happen on old transactional or measured values.
3. INVALID transactions are removed from the source.

Usually our ETL takes care of the first and second cases (insert/update logic). The third change is not captured in the DW unless it is specifically called out in the requirement specification. Here we are interested in what was DELETED at the source because the transaction was not valid. Note that purging might be enabled on your OLTP, i.e. the OLTP keeps data only for a fixed historical period, but that is a different scenario.

Effects in DW of Source Data Deletion

DW tables fall into three categories with respect to deleted source data:

1. When the DW table's load nature is 'Truncate & Load' or 'Delete & Reload', there is no impact.
2. When the DW table does not track history on data changes and deletes are allowed against the source table, a record deleted in the source is also deleted in the DW, since the requirement is to keep an exact snapshot of the source table at any point of time.
3. When the DW table tracks history on data changes and deletes are allowed against the source table, the DW table retains the record that was deleted in the source system, but the record is either expired based on the change-capture date or a 'Soft Delete' is applied against it.

Types of Data Deletion

Academically, data is deleted from the source in two ways:

- Logical Delete : the source table has a specific flag, such as STATUS, with values 'ACTIVE' or 'INACTIVE'. Some OLTPs keep the field name as ACTIVE with values 'I', 'U' or 'D', where 'D' means the record is deleted or inactive. This approach is quite safe and is also known as a Soft DELETE.
- Physical Delete : the record related to the invalid transaction is fully deleted from the source table by issuing a DML statement. This is also known as a Hard DELETE. In this case it is always safer to keep a copy of the old record in an AUDIT table, as it helps to track any defects in the future.

ETL Perspective on Deletion

When Soft DELETE is implemented on the source side, it becomes very easy to track the invalid transactions and tag them in the DW: we filter the records from the source using the STATUS field and issue an UPDATE in the DW for the corresponding records, as illustrated in the sketch below. If only ACTIVE records are supposed to be used in ETL processing, we add specific filters while fetching the source data. Sometimes INACTIVE records are pulled into the DW and carried up to the ETL data warehouse layer, and only the ACTIVE records are pushed to the Exploration Data Warehouse for reporting.

For Hard DELETE we need other ways to track the removed records and update the corresponding rows in the DW. If an audit table is maintained at the source for the deleted transactions, we can source it, join the audit table and the source table on the natural key (NK), and logically delete the matching records in the DW as well. It becomes quite cumbersome and costly when no account is kept of what was deleted at all.

Deletion in Data Warehouse : Dimension Vs Fact

In most cases, deleting records from DW tables is forbidden, and deletion of data warehouse records is a rare scenario. If your business requires deletion, it has to be done after thorough discussion with the business users, and the related business rules must be strictly followed. In practice, we mostly see only transactional records being deleted from source systems.
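As a sketch of how a soft delete might be propagated, assuming a hypothetical CUSTOMER source table with a STATUS flag and a CUSTOMER_DIM target with its own ACTIVE_FLAG and audit column (none of these names come from the article):

  -- Tag DW rows whose source counterpart has been logically deleted.
  UPDATE customer_dim d
  SET    active_flag = 'N',
         update_date = SYSDATE
  WHERE  d.active_flag = 'Y'
  AND    EXISTS (SELECT 1
                 FROM   customer c
                 WHERE  c.customer_id = d.customer_id
                 AND    c.status      = 'INACTIVE');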
Deletion in Dimension Tables

If we have deletion enabled for dimensions in the DW, we need to keep an audit of what was deleted. Take the ORDERS table as an example: we can create a history table, e.g. ORDERS_Hist, to store the records deleted from ORDERS. A simple DELETE trigger works fine for this (as sketched below), and since deletion hardly ever happens, the trigger will not degrade performance much. The audit fields convey when a particular record was deleted and by which user. The drawback is that such a table has to be created for each and every DW table whose deletions we want to audit. If the entire record is not needed and the fields involved in the Natural Key (NK) are enough, we can instead keep one consolidated table for all the dimensions, where the RECORD_IDENTIFIER field holds the NK column values separated by '#' and the OBJECT_NAME field holds the name of the table they came from.

Sometimes we face a situation where a FACT table record carries a Surrogate Key (SK) that the dimension table no longer owns. The fact record becomes an orphan, and it will hardly ever appear in any report, because the reporting layer always uses an INNER JOIN between dimensions and facts; referential integrity (RI) is broken. Suppose we want to track the orphan records in the SALES fact with respect to the Product dimension: a query that joins the fact to the dimension and keeps the unmatched fact rows will list them (also sketched below). Such a query returns only the orphan fact records; it cannot tell you which records were deleted from PRODUCT_Dimension. One feasible solution is to populate an EVENT table with the SKs (and NKs) of the PRODUCT_Dimension rows that are being deleted, provided we never reuse surrogate keys. With both the SKs and NKs of the deleted entries available in the EVENT table, we achieve much better control over the data warehouse data.

Another useful but rarely used approach is to enable auditing of DELETE statements on a table in an Oracle database:

  AUDIT DELETE ON SCHEMA.TABLE;

The DBA_AUDIT_STATEMENT view will then hold the related details, for example the user who issued the statement and the exact DML statement, but it cannot give you the record that was deleted. Since it does not directly tell you which record was removed, it is not very useful for our current discussion, so I will leave the topic there.

Deletion in Fact Tables

That was all about deletion in DW dimension tables. Regarding deletion of FACT data, I would refer you to what Ralph Kimball has to say on physical deletion of facts from the DW.

Change Data Capture & Apply for 'Hard DELETE' in Source

Whether we should track the records deleted from the source depends on the type of table and its load nature.
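The original post shows the trigger and the orphan-record query as images that are missing from this copy; the Oracle-flavored sketch below is a reconstruction of what they might look like, with assumed column names (ORDER_ID as the NK, DELETED_BY and DELETED_DATE as the audit fields).

  -- Copy every deleted ORDERS row into ORDERS_Hist together with audit fields.
  CREATE OR REPLACE TRIGGER trg_orders_delete
  BEFORE DELETE ON orders
  FOR EACH ROW
  BEGIN
    INSERT INTO orders_hist (order_id, order_date, order_amount,
                             deleted_by, deleted_date)
    VALUES (:OLD.order_id, :OLD.order_date, :OLD.order_amount,
            USER, SYSDATE);
  END;
  /

  -- List orphan fact rows: sales that reference a product SK
  -- the dimension no longer owns.
  SELECT f.*
  FROM   sales_fact f
  LEFT JOIN product_dim d ON d.product_sk = f.product_sk
  WHERE  d.product_sk IS NULL;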
I will share a few genuine scenarios that are usually faced in any DW and discuss the solutions accordingly.

1. Records are deleted from the source for a known time period, and no audit trail was kept.

In this case the ideal solution is to delete the entire record set in the DW target table for that time period and pull the source records for the period once again. This brings the DW back in sync with the source, and the deleted records will no longer be available in the DW. The time period is usually expressed in terms of Ship_DATE, Invoice_DATE or Event_DATE, i.e. a DATE field from the actual source dataset, so the same WHERE clause we use to filter the extraction from the source table can be applied to the DW table as well. Obviously we are not able to capture the 'Hard DELETE' history this way, but at least the source and the DW are brought back in sync. This approach is recommended only when the situation occurs once in a while, not on a regular basis.

2. Records are deleted from the source on a regular basis, with no known timeframe and no audit trail.

The possible solution here is a FULL OUTER JOIN between the source and the target table, joined on the fields involved in the Natural Key (NK). This approach tracks all three kinds of changes to the source data in one shot; the logic is easiest to explain with a Venn diagram. Out of the joiner (kept in Full Outer Join mode):

- Records that have values for the NK fields only from the source and not from the target are new records coming from the source; they go to the INSERT flow.
- Records that have values for the NK fields from both the source and the target are existing source records; they go to the UPDATE flow.
- Records that have values for the NK fields only from the target are the records that were somehow deleted from the source table; they go to the DELETE flow.

What we then do with the deleted records, i.e. whether we apply a 'Soft DELETE' or a 'Hard DELETE' in the DW, depends on the requirement specification and the business scenario. The severe disadvantage of this approach is ETL performance: a full outer join uses the entire data set from both ends, which obstructs smooth processing as data volume grows.
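A compact SQL sketch of that comparison, assuming hypothetical CUSTOMER_SRC and CUSTOMER_DIM tables keyed on CUSTOMER_ID (an ETL tool joiner would implement the same idea):

  -- Classify every natural key as INSERT, UPDATE or DELETE in one pass.
  SELECT COALESCE(s.customer_id, t.customer_id) AS customer_id,
         CASE
           WHEN t.customer_id IS NULL THEN 'INSERT'   -- only in source
           WHEN s.customer_id IS NULL THEN 'DELETE'   -- only in target
           ELSE 'UPDATE'                              -- present in both
         END AS change_type
  FROM   customer_src s
  FULL OUTER JOIN customer_dim t
         ON t.customer_id = s.customer_id;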
3. Records are deleted from the source, and an audit trail was kept.

This is mainly related to incorrect transactions in legacy systems, e.g. mainframes, which usually send data in flat files. Although I am calling it a deletion, it is not the kind of physical deletion we discussed previously. When some old transactions become invalidated, the source team sends those transaction records to the DW again, but with inverted measures, i.e. the sales figures are the same as the old ones but negative. The DW then contains both the old set of records and the newly arrived records, which together represent the real scenario: first the transaction happened (the older record) and then it became invalid (the newer record). The aggregated FACT table contains the correct data at the summarized level, since the inverted measures cancel the originals and diminish the impact of the invalid transactions to nothing; the only disadvantage is that the transactional FACT table carries the dual set of records.

Hope you enjoyed this article and that it gave you some new insights into change data capture in the data warehouse. We would like to hear how you have handled change data capture in your warehouse; leave us your questions and comments.

People who read this also read :

Surrogate Key Generation Approaches Using Informatica PowerCenter
Johnson Cyriac | Nov 21, 2013 | ETL Design | Mapping Tips

A Surrogate Key (SK) is a sequentially generated unique number attached to each and every record in a dimension table in a data warehouse. We discussed surrogate keys in detail in a previous article; here we will concentrate on different approaches to generating the surrogate key for different types of ETL process.

Surrogate Key for Dimensions Loading in Parallel

When a single dimension table loads in parallel from different application data sources, special care should be taken to make sure that no keys are duplicated. Let's look at the design options.

1. Using Sequence Generator Transformation

This is the simplest and most preferred way to generate the surrogate key. We create a reusable Sequence Generator transformation and map its NEXTVAL port to the SK field of the target table in the INSERT flow of the mapping. The start value is usually 1, incremented by 1.

Note : Make sure to create a reusable transformation, so that the same transformation can be reused in multiple mappings which load the same dimension table.

2. Using Database Sequence

We can create a SEQUENCE in the database and use it to generate the SKs for any table. It can be invoked from a SQL Transformation or a Stored Procedure Transformation. First we create the sequence:

  CREATE SEQUENCE DW.Customer_SK
    MINVALUE 1
    MAXVALUE 99999999
    START WITH 1
    INCREMENT BY 1;

Using SQL Transformation : You can create a reusable SQL Transformation that takes the name of the database sequence and the schema name as input and returns SK numbers. The schema name (DW) and sequence name (Customer_SK) are passed in as input values, and the output is mapped to the target SK column.

Using Stored Procedure Transformation : We use the sequence DW.Customer_SK inside an Oracle function, which in turn is called via a Stored Procedure transformation. Create the database function as below:

  CREATE OR REPLACE FUNCTION DW.Customer_SK_Func
  RETURN NUMBER
  IS
    Out_SK NUMBER;
  BEGIN
    SELECT DW.Customer_SK.NEXTVAL INTO Out_SK FROM DUAL;
    RETURN Out_SK;
  EXCEPTION
    WHEN OTHERS THEN
      raise_application_error(-20001, 'An error was encountered - ' || SQLCODE || ' -ERROR- ' || SQLERRM);
  END;

You can import the database function as a Stored Procedure transformation, as shown in the image below.
Now, just before the target instance of the insert flow, we add an Expression transformation with an output port that calls the stored procedure:

  GET_SK = :SP.CUSTOMER_SK_FUNC()

This output port GET_SK is connected to the target surrogate key column.

Note : The database function can be parametrized, and the stored procedure can also be made reusable, to make this approach more effective.

Surrogate Key for Non Parallel Loading Dimensions

If the dimension table is not loading in parallel from different application data sources, we have a couple more options to generate SKs.

1. Using Dynamic LookUp

When we implement a dynamic lookup on the target in a mapping, we may not even need a Sequence Generator to produce the SK values. For a dynamic lookup on the target, we have the option of associating any lookup port with an input port, an output port, or a Sequence-ID. When we associate a Sequence-ID, the Integration Service generates a unique integer value for each row inserted into the lookup cache; this applies to ports of Bigint, Integer or Small Integer data type. Since an SK is usually an integer, we can exploit this. The Integration Service generates sequence IDs as follows:

- When it creates the dynamic lookup cache, it tracks the range of values of each port that has a sequence ID.
- When it inserts a row of data into the cache, it generates a key for the port by incrementing the greatest sequence-ID value by one.
- When it reaches the maximum number for a generated sequence ID, it starts over at one and increments by one until it reaches the smallest existing value minus one. If it runs out of unique sequence-ID numbers, the session fails.

For any record already present in the target, the SK value is taken from the target dynamic lookup cache based on the matching of the associated ports. So if we connect this port to the target SK field, there is no need to generate SK values separately, since the new SK value (for records to be inserted) or the existing SK value (for records to be updated) is supplied by the dynamic lookup. The configuration shown above is a dynamic lookup generating the SK for CUST_SK. The disadvantage of this technique is that we do not have a separate SK generating area; the source of the SK is completely embedded in the code.

2. Using Expression Transformation

Suppose we are populating CUSTOMER_DIM, where the SK column is CUSTOMER_KEY and the NK column is CUSTOMER_ID. Here we generate the SKs based on the previous value generated: for the first row we look up the dimension table to fetch the maximum SK currently available in the table, and then keep incrementing that value by one for each incoming row.

First create an unconnected lookup on the dimension table, say LKP_CUSTOMER_DIM, with CUSTOMER_KEY as the return port and the lookup condition

  CUSTOMER_ID = IN_CUSTOMER_ID

and use the SQL override below:

  SELECT MAX(CUSTOMER_KEY) AS CUSTOMER_KEY, '1' AS CUSTOMER_ID
  FROM   CUSTOMER_DIM

Next, in the mapping after the Source Qualifier, use an Expression transformation with the following ports to compute the SK value:

  VAR_COUNTER = IIF(ISNULL(VAR_INC), NVL(:LKP.LKP_CUSTOMER_DIM('1'), 0) + 1, VAR_INC + 1)
  VAR_INC     = VAR_COUNTER
  OUT_COUNTER = VAR_COUNTER

Here OUT_COUNTER supplies the SKs to be populated in CUSTOMER_KEY.
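For comparison, the same "seed from the current MAX and increment once per row" idea can be expressed directly in SQL; this is only a hedged sketch, and CUSTOMER_STG is an assumed staging table name, not part of the mapping-based approach described above.

  -- Assign surrogate keys starting from the current maximum key in the dimension.
  INSERT INTO customer_dim (customer_key, customer_id, customer_name)
  SELECT (SELECT NVL(MAX(customer_key), 0) FROM customer_dim)
         + ROW_NUMBER() OVER (ORDER BY s.customer_id),
         s.customer_id,
         s.customer_name
  FROM   customer_stg s
  WHERE  NOT EXISTS (SELECT 1
                     FROM   customer_dim d
                     WHERE  d.customer_id = s.customer_id);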
3. Using Mapping & Workflow Variables

Here again we use an Expression transformation to compute the next SK, incrementing the value stored in a variable port by one for each incoming row, but we fetch the maximum available SK in a different way. In the mapping we define a mapping variable, say $$MAX_CUST_SK, which will be set to the MAX(SK) of the Customer dimension table.

In the workflow, before the session s_New_Customer (which loads the Customer dimension), we add a dummy session s_Dummy whose only purpose is to get the maximum SK value from the dimension table. In s_Dummy, CUSTOMER_DIM is the source and the target is a simple flat file which is not used anywhere. We pull the MAX(SK) from the Source Qualifier,

  SELECT MAX(CUSTOMER_KEY) AS CUSTOMER_KEY FROM CUSTOMER_DIM

and then in an Expression transformation assign this value to the mapping variable using the SETVARIABLE function, with the following ports:

  INP_CUSTOMER_KEY = INP_CUSTOMER_KEY  -- the MAX of SK coming from the Customer dimension table
  OUT_MAX_SK = SETVARIABLE($$MAX_CUST_SK, INP_CUSTOMER_KEY)  -- output port

The output port is connected to the flat file port; the file itself is never used, but the value assigned to the variable persists in the repository.

How do we pass this value from one session to the other? This is where the workflow variable comes into the picture. We define a workflow variable $$MAX_SK, and another mapping variable $$START_VALUE in the session s_New_Customer. The sequence is:

- Post-session on success variable assignment of the first session (s_Dummy): $$MAX_SK = $$MAX_CUST_SK
- Pre-session variable assignment of the second session (s_New_Customer): $$START_VALUE = $$MAX_SK

Now, in the actual mapping, we add an Expression transformation with the following ports to compute the SKs one by one for each record being loaded into the target:

  VAR_COUNTER = IIF(ISNULL(VAR_INC), $$START_VALUE + 1, VAR_INC + 1)
  VAR_INC     = VAR_COUNTER
  OUT_COUNTER = VAR_COUNTER

OUT_COUNTER is connected to the SK port of the target, so the second mapping starts generating SKs from $$MAX_CUST_SK + 1.

Hope you enjoyed this article and learned some new ways to generate surrogate keys for your dimension tables. Please leave us a comment or feedback if you have any; we are happy to hear from you.

People who read this also read :

Surrogate Key in Data Warehouse - What, When, Why and Why Not
Johnson Cyriac | Nov 13, 2013 | DW Design | ETL Design

Surrogate keys are a widely used and accepted design standard in data warehouses.
What Is a Surrogate Key

A Surrogate Key (SK) is a sequentially generated, meaningless, unique number attached to each and every record in a table in a data warehouse (DW). It is the join between the fact and dimension tables and is necessary for handling changes in dimension table attributes.

- It is UNIQUE, since it is a sequentially generated integer assigned to each record inserted in the table.
- It is SEQUENTIAL, since it is assigned in sequential order as new records are created in the table, starting with one and going up to the highest number needed.
- It is MEANINGLESS, since it carries no business meaning about the record it is attached to.

Surrogate Key Pipeline and Fact Table

During the FACT table load, the dimensional attributes on the incoming record are looked up in the corresponding dimensions and the SKs are fetched from there. These SKs should be fetched from the most recent versions of the dimension records. Finally, the FACT table in the DW contains the factual data along with the corresponding SKs from the dimension tables. The diagram below shows how the FACT table is loaded from the source, and how the different dimensions join to the fact using SKs in a typical star schema.
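As an illustration of that surrogate key pipeline, here is a hedged SQL sketch that resolves natural keys to surrogate keys while loading a fact; SALES_STG, CUSTOMER_DIM, PRODUCT_DIM and the CURRENT_FLAG column are assumed names, not taken from the article.

  -- Resolve each natural key to the current dimension row's surrogate key.
  INSERT INTO sales_fact (customer_sk, product_sk, sale_date, sale_amount)
  SELECT c.customer_sk,
         p.product_sk,
         s.sale_date,
         s.sale_amount
  FROM   sales_stg s
  JOIN   customer_dim c
         ON c.customer_id = s.customer_id AND c.current_flag = 'Y'
  JOIN   product_dim p
         ON p.product_id = s.product_id AND p.current_flag = 'Y';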
It has become standard practice to associate an SK with every table in the DW, be it a Dimension, Fact, Bridge or Aggregate table. The SKs of the different dimensions are stored as Foreign Keys (FK) in the fact tables to maintain Referential Integrity (RI).

Why Should We Use a Surrogate Key

Basically it is an artificial key used as a substitute for the Natural Key (NK). We should already have NKs defined in our tables as per the business requirements, and they may be able to uniquely identify any record, but an SK is much needed when the NK is very long or its data type is not suitable for indexing. The SK is just an integer attached to a record for the purpose of joining the different tables of a star or snowflake schema based DW.

Ralph Kimball emphasizes the abstraction of the NK. As per him, surrogate keys should NOT be:

- Smart, where you can tell something about the record just by looking at the key.
- Composed of natural keys glued together.
- Implemented as multiple parallel joins between the dimension table and the fact table, so-called double or triple barreled joins.

As per Thomas Kejser, a "good key" is a column that has the following properties:

- It is forced to be unique
- It is small
- It is an integer
- Once assigned to a row, it never changes
- Even if the row is deleted, the key is never reused to refer to a new row
- It is a single column
- It is stupid
- It is not intended to be remembered by users

If these features are taken into account, the SK is a great candidate for a good key in a DW. A few more reasons for choosing the SK approach:

- If we replace the NK with a single integer, we save a substantial amount of storage space: instead of storing big or composite NKs, we store concise SKs.
- The UNIQUE index built on the SK takes less space than a UNIQUE index built on an NK, which may be alphanumeric.
- A four-byte integer key can represent more than 2 billion different values, which is enough for any dimension, even a big or monster dimension, so the SK will not run out of values.
- Replacing big, ugly NKs and composite keys with tight integer SKs is bound to improve join performance, since joining two integer columns is faster; it also gives an extra edge to ETL performance by speeding up data retrieval and lookups.
- The SK is usually independent of the data contained in the record, which provides data abstraction on top of the critical business data involved in the NK.

Why Shouldn't We Use a Surrogate Key

There are a number of disadvantages as well while working with SKs. Let's see them one by one:

- The value of an SK has no relationship with the real-world meaning of the data held in the row; we cannot understand anything about a record simply by seeing its SK. Over-use of SKs therefore leads to the problem of disassociation.
- The generation and attachment of the SK creates extra ETL burden. Sometimes the actual piece of code is short and simple, but generating the SK and carrying it forward to the target adds overhead.
- Data migration and replication from one environment to another become difficult, since the SKs of the dimension tables are used as FKs in the fact table and SKs are DW specific; any mismatch in the SK of a particular dimension results in no data or erroneous data when the tables are joined in the star schema.
- During Horizontal Data Integration (DI), where multiple source systems load data into a single dimension, we have to maintain a single SK generating area to enforce the uniqueness of the SK, which is an extra overhead on the ETL.
- If duplicate records come from the source, there is a potential risk of duplicates being loaded into the target, since the unique constraint is defined on the SK and not on the NK.
- Query optimization becomes harder, since the SK takes the place of the PK and the unique index is applied to that column; any query based on the NK leads to a Full Table Scan (FTS), because it cannot take advantage of the unique index on the SK.

So the SK should not be implemented just in the name of standardizing your code. An SK is required when we cannot use the NK to uniquely identify a record, or when the NK is not a good fit for the PK.

Reference : Ralph Kimball, Thomas Kejser

People who read this also read :

Informatica PowerCenter on Grid for Greater Performance and Scalability
Johnson Cyriac | Oct 31, 2013 | ETL Design | Performance Tips

Informatica has developed a solution that leverages the power of grid computing for greater data integration scalability and performance. The grid option delivers load balancing, dynamic partitioning, parallel processing and high availability to ensure optimal scalability, performance and reliability. In this article let's discuss how to set up an Informatica workflow to run on a grid.
What is PowerCenter on Grid

When a PowerCenter domain contains multiple nodes, you can configure workflows and sessions to run on a grid. When you run a workflow on a grid, the Integration Service runs a service process on each available node of the grid to increase performance and scalability. When you run a session on a grid, the Integration Service distributes session threads to multiple DTM processes on the nodes in the grid.

Domain : A PowerCenter domain consists of one or more nodes in the grid environment. A domain is the foundation for PowerCenter service administration.

Node : A node is a logical representation of a physical machine that runs a PowerCenter service. PowerCenter services run on the nodes.

Admin Console with Grid Configuration

Below shown is an Informatica Admin Console with a two-node grid configuration. We can see the two nodes, Node_1 and Node_2, and the grid Node_GRID created from them. The integration service Int_service_GRID is running on the grid.

Setting up a Workflow on Grid

When you set up a workflow to run on a grid, the Integration Service distributes the workflow across the nodes in the grid. It also distributes the Session, Command and predefined Event-Wait tasks within the workflow across the nodes. You assign the integration service that is configured on the grid to run the workflow, as shown in the image below.

Setting up a Session on Grid

When you run a session on a grid, the Integration Service distributes session threads across the nodes in the grid; the Load Balancer distributes the session threads to DTM processes running on different nodes. You might want to configure a session to run on a grid when the workflow contains a session that takes a long time to run. The session is set up to run on the grid as shown in the image below.

Workflow Running on Grid

The workflow monitor screenshots below show a workflow running on the grid: in the 'Task Progress Details' window you can see two of the sessions in the workflow wf_Load_CUST_DIM running on Node_1 and the other one on Node_2.

Key Features and Advantages of Grid

- Load Balancing : when facing spikes in data processing, load balancing guarantees smooth operations by switching the data processing between the nodes of the grid.
- High Availability : the grid complements the High Availability feature of PowerCenter by switching the master node in case of a node failure. This keeps monitoring intact and shortens the time needed for recovery.
- Dynamic Partitioning : dynamic partitioning makes the best use of the nodes currently available on the grid. The node is chosen dynamically based on process size, CPU utilization, memory requirements and so on. By adapting to the available resources, it also increases the performance of the whole ETL process.

Hope you enjoyed this article; please leave us a comment or feedback if you have any, we are happy to hear from you.

People who read this also read :

Time Zones Conversion and Standardization Using Informatica PowerCenter
Johnson Cyriac | Oct 23, 2013 | ETL Design | Mapping Tips

When your data warehouse is sourcing data from multi-time-zoned data sources, it is recommended to capture a universal standard time as well as the local times. This design enables analysis on the local time along with the universal standard time. (The same consideration applies to transactions involving multiple currencies.) In this article let's discuss the implementation using Informatica PowerCenter; we will concentrate only on the ETL part of the time zone conversion and standardization, not on the data modeling part, which you can learn more about from Ralph Kimball.

Business Use Case

Let's consider an ETL job
which is used to integrate sales data from different global sales regions into the enterprise data warehouse. Sales transactions happen in different time zones and come from different sales applications, and the local sales applications capture sales in local time. Data in the warehouse needs to be standardized, and each sales transaction needs to be captured in local time as well as GMT. The time standardization is done as part of the ETL which loads the warehouse.

Solution : create a reusable expression to convert the local time into GMT time. This reusable transformation can then be used in any mapping that needs the time zone conversion.

Building the Reusable Expression

You create the reusable transformation in the Transformation Developer. In the expression transformation, create the ports below; be sure to create them in the same order, data type and precision.

  LOC_TIME_WITH_TZ : STRING(36)  (Input)
  DATE_TIME        : DATE/TIME   (Variable)
  TZ_DIFF          : INTEGER     (Variable)
  TZ_DIFF_HR       : INTEGER     (Variable)
  TZ_DIFF_MI       : INTEGER     (Variable)
  GMT_TIME_HH      : DATE/TIME   (Variable)
  GMT_TIME_MI      : DATE/TIME   (Variable)
  GMT_TIME_WITH_TZ : STRING(36)  (Output)

Now create the expressions below for the ports:

  DATE_TIME        : TO_DATE(SUBSTR(LOC_TIME_WITH_TZ,0,29), 'DD-MON-YY HH:MI:SS.US AM')
  TZ_DIFF          : IIF(SUBSTR(LOC_TIME_WITH_TZ,30,1) = '+', -1, 1)
  TZ_DIFF_HR       : TO_DECIMAL(SUBSTR(LOC_TIME_WITH_TZ,31,2))
  TZ_DIFF_MI       : TO_DECIMAL(SUBSTR(LOC_TIME_WITH_TZ,34,2))
  GMT_TIME_HH      : ADD_TO_DATE(DATE_TIME, 'HH', TZ_DIFF_HR * TZ_DIFF)
  GMT_TIME_MI      : ADD_TO_DATE(GMT_TIME_HH, 'MI', TZ_DIFF_MI * TZ_DIFF)
  GMT_TIME_WITH_TZ : TO_CHAR(GMT_TIME_MI, 'DD-MON-YY HH:MI:SS.US AM') || ' +00:00'

Note : The expression is based on the timestamp format 'DD-MON-YYYY HH:MI:SS.FF AM TZH:TZM'. If you are using a different Oracle timestamp format, this expression might not work as is.

Expression Usage

The reusable transformation takes one input port and returns one output port; the input port should be a date timestamp with time zone information. It can be used in any mapping that needs the time zone conversion, as shown in the sample mapping and output data images below.

Note : A timestamp with time zone is processed as a STRING(36) data type in the mapping. The source and target should use the VARCHAR2(36) data type, and all the transformations should use STRING(36).

Download

You can download the reusable expression we discussed in this article; click here for the download link. Hope this tutorial was helpful and useful for your project. Please leave your questions and comments; we will be more than happy to help you.

People who read this also read :

Informatica Performance Tuning Guide, Performance Enhancements - Part 4
Johnson Cyriac | Nov 30, 2013 | ETL Admin | Performance Tips

In our performance tuning article series we have so far covered the performance tuning basics, identification of bottlenecks and resolving the different bottlenecks. In this article we will cover the different performance enhancement features available in Informatica PowerCenter.

Performance Tuning Tutorial Series
Part I : Performance Tuning Introduction.
Part II : Identify Performance Bottlenecks.
Part III : Remove Performance Bottlenecks.
Part IV : Performance Enhancements.

Performance Enhancement Features

The main PowerCenter features for performance enhancement are described below.

1. Pushdown Optimization

The Pushdown Optimization option enables data transformation processing to be pushed down into any relational database, to make the best use of the database's processing power. It converts the transformation logic into SQL statements that execute directly on the database, which minimizes the movement of data between servers and utilizes the power of the database engine. (Read more about Pushdown Optimization.)

2. Session Partitioning

The Informatica PowerCenter Partitioning option increases performance through parallel data processing. It lets you split a large data set into smaller subsets that can be processed in parallel to get better session performance. (Read more about Session Partitioning.)

3. Dynamic Session Partitioning

With dynamic session partitioning, PowerCenter can decide the degree of parallelism at run time. The Integration Service scales the number of session partitions based on factors such as the source database partitions or the number of CPUs on the node, resulting in significant performance improvement. (Read more about Dynamic Session Partitioning.)

4. Concurrent Workflows

A concurrent workflow is a workflow that can run as multiple instances concurrently; a workflow instance is a representation of a workflow. We can configure concurrent workflows with the same instance name or with unique instance names. (Read more about Concurrent Workflows.)

5. Grid Deployments

When a PowerCenter domain contains multiple nodes, you can configure workflows and sessions to run on a grid. When you run a workflow on a grid, the Integration Service runs a service process on each available node of the grid; when you run a session on a grid, it distributes session threads to multiple DTM processes on the nodes in the grid to increase performance and scalability. (Read more about Grid Deployments.)

6. Workflow Load Balancing

Informatica load balancing is a mechanism which distributes the workloads across the nodes in the grid. When you run a workflow, the Load Balancer dispatches the different tasks in the workflow, such as Session, Command and predefined Event-Wait tasks, to the different nodes running the Integration Service. It matches task requirements with resource availability to identify the best node to run a task, and it may dispatch tasks to a single node or across nodes on the grid. (Read more about Workflow Load Balancing.)

7. Other Performance Tips and Tricks

Throughout this blog we have discussed other tips and tricks to improve ETL load performance, and we reference them here for convenience. (Read more about Other Performance Tips and Tricks.)

Hope you enjoyed these tips and tricks and that they are helpful for your project needs. Leave us your questions and comments; we would like to hear about any other performance techniques you have used in your projects.

People who read this also read :

Informatica PowerCenter Load Balancing for Workload Distribution on Grid
Johnson Cyriac | Nov 8, 2013 | ETL Admin | Performance Tips

Informatica PowerCenter workflows run on a grid, and PowerCenter uses the Load Balancer to distribute workflows and session tasks to different nodes. It distributes the Session, Command
and predefined Event-Wait tasks within workflows across the nodes in a grid. This article describes how to use the Load Balancer to set workflow priorities and how to allocate resources, so that tasks are dispatched to the right nodes with the right priority.

What is Informatica Load Balancing

Informatica load balancing is a mechanism which distributes the workloads across the nodes in the grid. When you run a workflow, the Load Balancer dispatches the different tasks in the workflow, such as Session, Command and predefined Event-Wait tasks, to the different nodes running the Integration Service. It matches task requirements with resource availability to identify the best node to run a task, and may dispatch tasks to a single node or across nodes on the grid. You can adjust two things:

- Assign service levels : you assign service levels to workflows. Service levels establish priority among workflow tasks that are waiting to be dispatched.
- Assign resources : you assign resources to tasks. Session, Command and predefined Event-Wait tasks require PowerCenter resources to succeed. If the Integration Service is configured to check resources, the Load Balancer dispatches these tasks to nodes where the resources are available.

Identifying the Nodes to Run a Task

The Load Balancer matches the resources required by a task with the resources available on each node. It dispatches tasks in the order it receives them; when multiple tasks are waiting to be dispatched, it dispatches high-priority tasks before low-priority tasks.

Assigning Service Levels to Workflows

Service levels determine the order in which the Load Balancer dispatches tasks from the dispatch queue when multiple workflows are running in parallel. Give a higher service level to the workflows that need to be dispatched first. Service levels are created, and their dispatch priorities configured, in the Administrator tool (Admin console), and you assign a service level to a workflow on the General tab of the workflow properties, as shown below.

Assigning Resources to Tasks

If the Integration Service runs on a grid and is configured to check for available resources, the Load Balancer uses resources to dispatch tasks: it matches the resources required by the tasks in a workflow with the resources available on each node in the grid to determine which nodes can run those tasks. You can configure the resource requirements of a task as shown in the image below.
The configuration below, for example, shows that the Source Qualifier needs a source file from the file directory NDMSource, which is accessible from only one node; the resources available on the different nodes are configured from the Admin console.

Hope you enjoyed this article and that it helps you prioritize your workflows to meet your data refresh timelines. Please leave us a comment or feedback if you have any; we are happy to hear from you.

People who read this also read :

Dynamically Changing ETL Calculations Using Informatica Mapping Variable
Johnson Cyriac | Oct 16, 2013 | ETL Design | Mapping Tips

Quite often we deal with ETL logic that is very dynamic in nature, such as a discount calculation that changes every month or a special weekend-only rule. There is a lot of practical difficulty in pushing such frequent ETL changes into a production environment. The best option to deal with this kind of dynamic scenario is parametrization. In this article let's discuss how we can make ETL calculations dynamic.

Business Use Case

Let's start the discussion with a real-life use case. The sales department wants to build a monthly sales fact table, refreshed after the month-end closure. Sales commission is one of the fact table data elements, and its calculation is dynamic in nature: it is a factor of sales, sales revenue or net sales, and the calculation to be used by the month-end ETL is decided by the sales manager before the monthly load. For example, the sales commission calculation can be:

1. Sales Commission = Sales * 18 / 100
2. Sales Commission = Sales Revenue * 20 / 100
3. Sales Commission = Net Sales * 20 / 100

Note : The expression can be as complex as the business requirement demands.

Mapping Configuration

Now that we understand the use case, let's build the mapping logic. We will build the dynamic sales commission calculation with the help of a mapping variable, and the changing expression will be passed into the mapping through a session parameter file.

Step 1 : Create a mapping variable $$EXP_SALES_COMM and set the isExpVar property to TRUE, as shown in the image below.

Note : The precision of the mapping variable should be big enough to hold the whole expression.

Step 2 : In an Expression transformation, create an output port and provide the mapping variable as its expression, as shown in the screenshot below.

Note : All the ports used inside $$EXP_SALES_COMM should be available as input or input/output ports in the Expression transformation.

Workflow Configuration

In the workflow configuration we create the parameter file with the expression for the sales commission and set it up in the session.

Step 1 : Create the session parameter file with the expression for the sales commission calculation:

  [s_m_LOAD_SALES_FACT]
  $$EXP_SALES_COMM=SALES_REVENUE*20/100

Step 2 : Set the parameter file in the session properties as shown below.

With that we are done with the configuration. You can update the expression in the parameter file whenever a change is required in the sales commission calculation, which eliminates the need for an ETL code change. Hope you enjoyed this article; please leave us a comment or feedback if you have any.

People who read this also read :

Informatica HTTP Transformation, The Interface Between ETL and Web Services
Johnson Cyriac | Sep 30, 2013 | Transformations

In a matured data warehouse environment you will see all sorts of data sources, such as mainframe, ERP, message queues, web services, machine logs and Hadoop. Informatica provides a variety of connectors to extract data from such sources, and using the Informatica HTTP transformation you can make web service calls and get data from web servers. We will explain this transformation in this article with a use case.

What is HTTP Transformation

The HTTP transformation enables you to connect to an HTTP server to use its services and applications. When you run a session with an HTTP transformation, the Integration Service connects to the HTTP server and issues a request to retrieve data from or update data on the HTTP server. For example, you can get the currency conversion rate between USD and EUR by calling this web service:

  http://rate-exchange.appspot.com/currency?from=USD&to=EUR

Using the HTTP transformation you can :

1. Read data from an HTTP server : it retrieves data from the HTTP server and passes the data to a downstream transformation in the mapping.
2.com/currency?from=USD&to=EUR Using HTTP Transformation you can : Ads not by this site 1.It retrieves data from the HTTP server and passes the data to a downstream transformation in the mapping. 2013 Transformations     inShare12 In a matured data warehouse environment. Web Services. Message Queues. Machine Logs. Using Informatica HTTP transformation. Hadoop etc. The Interface Between ETL and Web Services Johnson Cyriac Sep 30.appspot. Developing HTTP Transformation Like any other transformation. you can make Web Service calls and get data from web servers. all the configuration required for this transformation in on the HTTP tab. you can get the currency conversion rate between USD and EUR by calling this web service call. Read data from an HTTP server :. As shown in below image. What is HTTP Transformation The HTTP transformation enables you to connect to an HTTP server to use its services and applications.Informatica HTTP Transformation. Header. Input.. In the above shown image. Used to construct the final URL for the GET method or the data for the POST request. Contains data from the HTTP response.Read or Write data to HTTP server As shown in the image. Select GET method to read data and POST or SIMPLE POST method to write data to an HTTP server. we have two input ports for the GET method and the response from the server as the output port Configuring a URL . Passes responses from the HTTP server to downstream transformations. Contains header data for the request and response. you can configure the transformation to read data or write data to the HTTP server. Configuring Groups and Ports Base on the type of the HTTP method. on the HTTP tab. you choose and the port group and port in the transformation in the HTTP tab.    Output. "from" and "to" currency. Solution : Here in the ETL process lets us use a web service call to get the real time currency conversion rate and convert the foreign currency to USD. which is used to integrate sales data from different global sales regions in to the enterprise data warehouse.Create the HTTP Transformation like any other transformation in the mapping designer. This web service call is to get the currency conversion and we are passing two parameters to the base url. The Designer constructs the final URL for the GET method based on the base URL and port names in the input group. you can see the base url and the constructed URL. Below shown is the configuration. http://rate-exchange.appspot.com/ for the demonstration. Data in the warehouse needs to be standardized and all the sales figure need to be stored in US Dollars (USD). you can create an HTTP connection object in the Workflow Manager. This connection can be used in the session configuration to connect the HTTP server. We need to configure the transformation for the GET HTTP method to access currency conversion data. HTTP Transformation Use Case Lets consider an ETL job. We will use HTTP Transformation to call the web service. We will be using the web service from http://rate-exchange. This web service take two parameters. which includes the query parameters. "from currency" and "to currency" and returns a JSON document.The web service will be accessed using a URL and the base URL of the web service need to be provided in the transformation. Connecting to the HTTP Server If the HTTP server requires authentication.appspot. Ads not by this site . For the demo. In the above shown image. we will concentrate only on the HTTP transformation.com/currency?from=USD&to=EUR Step 1 :. 
with the exchange rate information. We need to configure the transformation for the GET HTTP method to access the currency conversion data.

Step 1 : Create the HTTP transformation like any other transformation in the Mapping Designer.

Step 2 : Create two input ports as shown in the image below. The ports need to be of string data type, and the port names should match the URL parameter names.

Step 3 : Provide the base URL for the web service; the Designer will construct the complete URL with the parameters included.

Step 4 : The output from the HTTP transformation will look similar to what is given below.

  {"to": "USD", "rate": 1.3522000000000001, "from": "EUR"}

Finally, plug the transformation into the mapping as shown in the image below, and parse the output of the HTTP transformation in an Expression transformation to do the calculation that converts the currency to USD.

Hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying out the HTTP transformation, or share any different use cases you want to implement using it.

People who read this also read :

Informatica SQL Transformation, SQLs Beyond Pre & Post Session Commands
Johnson Cyriac | Sep 24, 2013 | Transformations

SQL statements can be used as pre- or post-SQL commands in a PowerCenter workflow, but those are static SQLs and can run only once, before or after the mapping pipeline runs. With the help of the SQL transformation we can use SQL statements much more effectively to build ETL logic. In this tutorial let's learn more about the transformation and its usage with a real-time use case.

What is SQL Transformation

The SQL transformation processes SQL queries midstream in a mapping. You can execute any valid SQL statement using this transformation; it can run external SQL scripts or SQL queries created within the transformation. The SQL transformation processes the query and returns the rows and any database errors.

Configuring SQL Transformation

The SQL transformation can run in two different modes:

- Script mode : runs SQL scripts from text files that are located externally. You pass a script name to the transformation with each input row, and it outputs the script execution status and any script error.
- Query mode : executes a query that you define in a query editor. You can pass strings or parameters to the query from the transformation input ports to change the SQL query statement or the query data.
Hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying out the HTTP transformation, or share any different use cases you want to implement using it.

Informatica SQL Transformation, SQLs Beyond Pre & Post Session Commands

Johnson Cyriac Sep 24, 2013 | Transformations

SQL statements can be used as part of pre- or post-SQL commands in a PowerCenter workflow, but these are static SQLs and can run only once, before or after the mapping pipeline is run. With the help of the SQL transformation, we can use SQL statements much more effectively to build ETL logic. In this tutorial let's learn more about the transformation and its usage with a real-time use case.

What is SQL Transformation

The SQL transformation can be used to process SQL queries midstream in a mapping. You can execute any valid SQL statement using this transformation; this can be an external SQL script or a query created within the transformation. The SQL transformation processes the query and returns rows and database errors, if any. You can output multiple rows when the query has a SELECT statement.

Configuring SQL Transformation

The SQL transformation can run in two different modes.

Script mode : Runs SQL scripts from text files that are externally located. You pass a script name to the transformation with each input row. The transformation outputs the script execution status and any script error.
Query mode : Executes a query that you define in a query editor. You can pass strings or parameters to the query from the transformation input ports to change the SQL query statement or the query data.

Script Mode

An SQL transformation running in script mode runs SQL scripts from text files. In script mode, you pass the script file name, with its complete path, from the source to the SQL transformation ScriptName port. The transformation creates an SQL procedure and sends it to the database to process; the database validates the SQL and executes the query. You cannot use scripting languages such as Oracle PL/SQL or Microsoft/Sybase T-SQL in the script.

Above shown is an SQL transformation in Script Mode, which has a ScriptName input port and ScriptResult and ScriptError output ports. The ScriptResult port gives the status of the script execution, either PASSED or FAILED. ScriptError returns errors that occur when a script fails for a row.

Query Mode

When an SQL transformation runs in query mode, it executes an SQL query defined in the transformation. The SQL query can be static or dynamic.

Static SQL query : The query statement does not change, but you can use query parameters to change the data. The Integration Service prepares the SQL statement once and executes it for each row, using the data passed in through the input ports of the transformation.
Dynamic SQL query : You can change the query statements and the data. The Integration Service prepares the SQL for each input row. You can pass strings or parameters to the query to define dynamic queries.

SQL Transformation Use Case

Let's consider the ETL for loading dimension tables into a data warehouse. The surrogate key for each of the dimension tables is populated using an Oracle sequence. The ETL architect needs to create an Informatica reusable component which can take the name of the Oracle sequence generator and pass the sequence number as the output, so that it can be reused in different dimension table loads to populate the surrogate key.

Solution : Let's create a reusable SQL transformation in query mode, which takes two input values (schema name, sequence name) and returns one output value (sequence number).

Step 1 : Once you have the Transformation Developer open, you can start creating the SQL transformation like any other transformation. It opens up a window like the one shown in the image below.

Step 2 : This screen lets you choose the mode, the database type, the database connection type, and whether the transformation is active or passive. If the SQL query returns more than one record, you need to make the transformation active. If the database connection type is dynamic, you can pass the connection details into the transformation at run time.

Step 3 : Now create the input and output ports as shown in the image below. We are passing in the database schema name and the sequence name, and the transformation returns the sequence number as an output port. Using the SQL query editor, we build the query to get the next value from the sequence generator; using 'String Substitution' ports we make the SQL dynamic, passing the schema name and the sequence name in as input ports.
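As an illustration of what that dynamic query could look like, assuming the two string substitution ports are named SCHEMA_NAME and SEQ_NAME (the port names are assumptions, and the tilde string-substitution syntax should be confirmed against your PowerCenter version):

    SELECT ~SCHEMA_NAME~.~SEQ_NAME~.NEXTVAL FROM DUAL

At run time the Integration Service replaces the substitution ports with the incoming values, which is consistent with the runtime statement shown in the next step, so the same reusable transformation can serve any Oracle sequence.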
Step 4 : That is all we need for the reusable SQL transformation. Below shown is the completed SQL transformation, which takes the schema name and sequence name as input ports and returns the sequence number. We can use this transformation just like any other reusable transformation to populate the surrogate key of the dimension table, as shown below. As per the above example, the Integration Service will convert the SQL as follows during the session runtime:

    SELECT DW.S_CUST_DIM.NEXTVAL FROM DUAL

Hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying it out, or share any different use cases you want to implement using the SQL transformation.

Informatica Java Transformation to Leverage the Power of Java Programming

Johnson Cyriac Sep 17, 2013 | Transformations

Java is one of the most popular programming languages in use, particularly for client-server web applications. With the introduction of the PowerCenter Java transformation, ETL developers can get their feet wet with Java programming and leverage the power of Java. In this article let's learn more about the Java transformation, its components and its usage with the help of a use case.

What is Java Transformation

With the Java transformation you can define transformation logic using the Java programming language, without advanced knowledge of Java or an external Java development environment. The PowerCenter Client uses the Java Development Kit (JDK) to compile the Java code and generate byte code for the transformation, and stores the byte code in the PowerCenter repository. When the Integration Service runs a session with a Java transformation, it uses the Java Runtime Environment (JRE) to execute the byte code, process input rows and generate output rows.

Developing Code in Java Transformation

You can use the code entry tabs to enter Java code snippets that define the Java transformation functionality. The image below shows the different code entry tabs under 'Java Code'. Using these tabs you can import Java packages, write helper code, define Java expressions, and write Java code that defines transformation behavior for specific transformation events.

Import Packages : Import third-party Java packages, built-in Java packages, or custom Java packages.
Helper Code : Define variables and methods available to all tabs except Import Packages. After you declare variables and methods on the Helper Code tab, you can use them on any code entry tab except the Import Packages tab.
On Input Row : Define transformation behavior when the transformation receives an input row. The Java code in this tab executes once for each input row.
On End of Data : Define transformation logic for when the transformation has processed all input data.
On Receiving Transaction : Define transformation behavior when the transformation receives a transaction notification. You can use this only with active Java transformations.
Java Expressions : Define Java expressions to call PowerCenter expressions. You can use these in multiple code entry tabs.

Java Transformation Use Case

Let's take a simple example for our demonstration. The employee data source contains the employee ID, name, age, employee description and the manager ID. We need to create an ETL transformation that finds the manager name for a given employee based on the manager ID and generates an output file that contains the employee ID, name, employee description and the manager name. Below shown is the complete structure of the mapping to build the functionality described above. We are using only a Java transformation, other than the source, source qualifier and target.

Step 1 : Once you have the source and source qualifier pulled into the mapping, create the Java transformation input and output ports as shown in the image below. Just like in any other transformation, you can drag and drop ports from other transformations to create new ports.

Step 2 : Now move to the 'Java Code' tab and, from the 'Import Packages' tab, import the external Java classes required by the Java code, which will be written in 'On Input Row'. This tab can be used to import any third-party or built-in Java classes. As shown in the image above, here is the import code used:

    import java.util.Map;
    import java.util.HashMap;

Step 3 : In the 'Helper Code' tab, define the variables and objects required by the Java code. Here we have created four objects:

    private static Map<Integer, String> empMap = new HashMap<Integer, String>();
    private static Object lock = new Object();
    private boolean generateRow;
    private boolean isRoot;

Step 4 : In the 'On Input Row' tab, define the ETL logic, which will be executed for every input record. Below is the complete code we need to place in the 'On Input Row' tab.
    generateRow = true;
    isRoot = false;

    // Employee ID and name are mandatory; reject the row if either is missing.
    if (isNull("EMP_ID_INP") || isNull("EMP_NAME_INP")) {
        incrementErrorCount(1);
        generateRow = false;
    } else {
        EMP_ID_OUT = EMP_ID_INP;
        EMP_NAME_OUT = EMP_NAME_INP;
    }

    if (isNull("EMP_DESC_INP")) {
        setNull("EMP_DESC_OUT");
    } else {
        EMP_DESC_OUT = EMP_DESC_INP;
    }

    boolean isParentEmpIdNull = isNull("EMP_PARENT_EMPID");

    // A record without a manager ID is the root of the hierarchy.
    if (isParentEmpIdNull) {
        isRoot = true;
        logInfo("This is the root for this hierarchy.");
        setNull("EMP_PARENT_EMPNAME");
    }

    // Look up the manager name collected from earlier rows and register
    // the current employee so later rows can find it.
    synchronized (lock) {
        if (!isParentEmpIdNull)
            EMP_PARENT_EMPNAME = (String) (empMap.get(new Integer(EMP_PARENT_EMPID)));
        empMap.put(new Integer(EMP_ID_INP), EMP_NAME_INP);
    }

    if (generateRow)
        generateRow();

With this we are done with the coding required in the Java transformation and are only left with the code compilation. The remaining tabs in this Java transformation do not need any code for our use case.

Compile the Java Code

To compile the full code for the Java transformation, click Compile on the Java Code tab. The Output window displays the status of the compilation. If the Java code does not compile successfully, correct the errors in the code entry tabs and recompile the Java code. After you successfully compile the transformation, save the transformation to the repository.

Completed Mapping

The remaining tabs do not need any code for our use case, and all the ports of the Java transformation can be connected from the source qualifier and to the target. Below shown is the completed structure of the mapping.

Hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying out this Java code and the Java transformation, or share any different use cases you want to implement using the Java transformation.

Informatica Performance Tuning Guide, Identify Performance Bottlenecks - Part 2

Johnson Cyriac Sep 8, 2013 | ETL Admin | Performance Tips

Performance Tuning Tutorial Series
Part I : Performance Tuning Introduction.
Part II : Identify Performance Bottlenecks.
Part III : Remove Performance Bottlenecks.
Part IV : Performance Enhancements.

In our previous article in the performance tuning series, we covered the basics of the Informatica performance tuning process and the session anatomy. In this article we will cover the methods to identify different performance bottlenecks. Here we will use session thread statistics, session performance counters and Workflow Monitor properties to help us understand the bottlenecks.

Identify Source, Target and Mapping Bottlenecks Using Thread Statistics

Gathering Thread Statistics

You can get thread statistics from the session log file. When you run a session, the session log file lists run time information and thread statistics for the reader, transformation and writer threads, with the details below. The session log provides enough run time thread statistics to help us understand and pinpoint the performance bottleneck.

Run Time : Amount of time the thread runs.
Busy Time : Percentage of the run time the thread is busy. It is (run time - idle time) / run time x 100.
Idle Time : Amount of time the thread is idle. It includes the time the thread waits for other threads' processing.
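As a quick worked example with made-up numbers: if a transformation thread reports a run time of 100 seconds and an idle time of 30 seconds, its busy percentage is (100 - 30) / 100 x 100 = 70%. Comparing this figure across the reader, transformation and writer threads shows which stage is doing the most work.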
Hint : The thread with the highest busy percentage is the bottleneck.

Note : A session log file with normal tracing level is required to get the thread statistics.

Understanding Thread Statistics

When you run a session, the session log lists run information and thread statistics similar to the following text. You will see the reader, transformation and writer threads, how much time is spent on each thread and how busy each thread is. In addition to that, the transformation thread shows how busy each transformation in the mapping is.

If you read it closely, the total run time for the transformation thread is 506 seconds and the busy percentage is 99.7%, which means the transformation thread was never idle during the 506 seconds. The reader and writer busy percentages were significantly smaller, about 9.6% and 24%. In this session, the transformation thread is the bottleneck in the mapping.

To determine which transformation in the transformation thread is the bottleneck, view the busy percentage of each transformation in the thread work time breakdown. Here the transformation RTR_ZIP_CODE had a busy percentage of 53%.

Thread Work Time : The percentage of time taken to process each transformation in a thread.

Session Bottleneck Using Session Performance Counters

All transformations have counters to help measure and improve the performance of the transformations. The Integration Service tracks the number of input rows, output rows and error rows for each transformation. Analyzing these performance details can help you identify session bottlenecks.

Gathering Performance Counters

You can set up the session to gather performance counters in the Workflow Manager. The image below shows the configuration required for a session to collect transformation performance counters.

Understanding Performance Counters

The image below shows the performance counters for a session, which you can see from the Workflow Monitor session run properties. You can see the transformations in the mapping and the corresponding performance counters.

Errorrows : Transformation errors impact session performance. If a transformation has a large number of error rows in any of the Transformation_errorrows counters, you should eliminate the errors to improve performance.
Readfromdisk and Writetodisk : If these counters display any number other than zero, it indicates sub-optimal settings for the transformation index or data caches and may indicate the need to tune the session transformation caches manually; you can increase the cache sizes to improve session performance.
Readfromcache and Writetocache : Use these counters to analyze how the Integration Service reads from or writes to cache.
Rowsinlookupcache : Gives the number of rows in the lookup cache. To improve session performance, tune the lookup expressions for the larger lookup tables.

Session Bottleneck Using Session Log File

When the Integration Service initializes a session, it allocates blocks of memory to hold source and target data. Not having enough buffer memory for the DTM process can slow down reading, transforming or writing and cause large fluctuations in performance. If the session is not able to allocate enough memory for the DTM process, the Integration Service writes a warning message into the session log file and gives you the recommended buffer size. Below is a sample message seen in the session log.
Message : WARNING: Insufficient number of data blocks for adequate performance. The recommended value is xxxx. Increase the DTM buffer size of the session.

System Bottleneck Using the Workflow Monitor

You can view the Integration Service properties in the Workflow Monitor to see the CPU, memory and swap usage of the system when you are running task processes on the Integration Service. Use the following Integration Service properties to identify performance issues:

CPU% : The percentage of CPU usage includes other external tasks running on the system. A high CPU usage indicates the need for additional processing power on the server.
Memory Usage : The percentage of memory usage includes other external tasks running on the system. If the memory usage is close to 95%, check whether the tasks running on the system are using the amount indicated in the Workflow Monitor or if there is a memory leak. To troubleshoot, use system tools to check the memory usage before and after running the session and then compare the results to the memory usage while running the session.
Swap Usage : Swap usage is a result of paging due to possible memory leaks or a high number of concurrent tasks.

What is Next in the Series

The next article in this series will cover how to remove bottlenecks and improve session performance.

Hope you enjoyed this article. Please leave us a comment or feedback if you have any; we are happy to hear from you.

Implementing Informatica PowerCenter Session Partitioning Algorithms

Johnson Cyriac Jul 20, 2013 | Mapping Tips | Performance Tips

Partition Tutorial Series
Part I : Partition Introduction.
Part II : Partition Implementation.
Part III : Dynamic Partition.

Informatica PowerCenter session partitioning can be effectively used for parallel data processing and to achieve faster data delivery. Parallel data processing performance depends heavily on the additional hardware power available. In addition to that, it is important to choose the appropriate partitioning algorithm, or partition type. In this article let's discuss the optimal session partition settings, using a business use case to explain the implementation of the appropriate partition algorithms and configuration.

Business Use Case

Daily sales data generated from three sales regions needs to be loaded into an Oracle data warehouse. The sales volume from the three regions varies a lot, hence the number of records processed for each region varies a lot. The warehouse target table is partitioned based on product line. Below is the simple structure of the mapping that implements the assumed functionality.

Pass-through Partition

A pass-through partition at the source qualifier transformation is used to split the source data into three parallel processing data sets. The image below shows how to set up a pass-through partition for the three sales regions. Once the partition is set up at the source qualifier, you get an additional Source Filter option to restrict the data that corresponds to each partition; the image below shows the three additional Source Filters, one per partition. Be sure to provide the filter conditions such that the same data is not processed through more than one partition and no data is duplicated, as in the sketch below.
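Purely as an illustration, if the source table had a region column named SALES_REGION with values 'NA', 'EMEA' and 'APAC' (the column name and values are assumptions, not taken from the original mapping), the three source filters could look like:

    Partition #1 : SALES_REGION = 'NA'
    Partition #2 : SALES_REGION = 'EMEA'
    Partition #3 : SALES_REGION = 'APAC'

Because the three conditions are mutually exclusive and together cover all regions, no row is read twice and no row is dropped.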
Round Robin Partition

Since the data volume from the three sales regions is not the same, use the round robin partition algorithm at the next transformation in the pipeline, so that the data is equally distributed among the three partitions and the processing load is evenly spread. Round robin partitioning can be set up as shown in the image below.

Hash Auto Key Partition

At the Aggregator transformation, the data needs to be redistributed across the partitions to avoid the potential splitting of aggregator groups; processing records of the same aggregator group in different partitions will produce wrong results. The hash auto key partition algorithm makes sure the data from the different partitions is redistributed such that records with the same key end up in the same partition. This algorithm identifies the keys based on the group key provided in the transformation.

Key Range Partition

Use key range partitioning when you need to distribute the records among partitions based on the range of values of one or more ports. Here the target table is range partitioned on product line, so create a range partition on the target definition on the PRODUCT_LINE_ID port to get the best write throughput. The images below show the steps involved in setting up the key range partition: click on Edit Keys to define the ports on which the key range partition is defined, choose the required ports from the pop-up window that lists the ports in the transformation, and then give the start and end range values for each partition as shown below and illustrated in the sketch that follows.
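Purely as an illustration (the boundary values below are assumptions, not taken from the article), if PRODUCT_LINE_ID values run from 1 to 300, the three ranges might be entered as:

    Partition #1 : start 1,   end 101
    Partition #2 : start 101, end 201
    Partition #3 : start 201, end 301

The ranges should line up with the database partitioning of the target table, and whether the start or end boundary is treated as inclusive should be checked against your PowerCenter version before relying on it.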
We did not have to use the Hash User Key Partition and Database Partition algorithms in the use case discussed here. The hash user key partition algorithm lets you choose the ports used to group rows among partitions and can be used in most of the places where the hash auto key algorithm is appropriate. The database partition algorithm queries the database system for table partition information and reads partitioned data from the corresponding nodes in the database; it can be applied either on the source or the target definition.

Hope you enjoyed this article. Please leave your comments and feedback.

Informatica Performance Tuning Guide, Resolve Performance Bottlenecks - Part 3

Johnson Cyriac Oct 8, 2013 | ETL Admin | Performance Tips

Performance Tuning Tutorial Series
Part I : Performance Tuning Introduction.
Part II : Identify Performance Bottlenecks.
Part III : Remove Performance Bottlenecks.
Part IV : Performance Enhancements.

In our previous article in the performance tuning series, we covered different approaches to identify performance bottlenecks. In this article we will cover the methods to resolve them. We will talk about session memory, cache memory, and source, target and mapping performance tuning techniques in detail.

I. Buffer Memory Optimization

When the Integration Service initializes a session, it allocates blocks of memory to hold source and target data. Not having enough buffer memory for the DTM process can slow down reading, transforming or writing and cause large fluctuations in performance. Sessions that use a large number of sources and targets might require additional memory blocks. You can tune this by adjusting the buffer block size and the DTM buffer size.

Note : You can identify a DTM buffer bottleneck from the session log file; check the previous article for details.

1. Optimizing the Buffer Block Size

Depending on the source and target data, you might need to increase or decrease the buffer block size. Ideally, a buffer block should accommodate at least 100 rows at a time. To identify the optimal buffer block size, sum up the precision of the individual source and target columns; the largest row precision among all the sources and targets is the buffer requirement for one row.

    Buffer Block Size = Largest Row Precision * 100

You can change the buffer block size in the session configuration as shown in the image below.

2. Increasing DTM Buffer Size

When you increase the DTM buffer memory, the Integration Service creates more buffer blocks, which improves performance. Adding extra memory blocks can keep the threads busy and improve session performance. You can identify the required DTM buffer size based on the calculation below.

    Session Buffer Blocks = (total number of sources + total number of targets) * 2
    DTM Buffer Size = Session Buffer Blocks * Buffer Block Size / 0.9

You can change the DTM Buffer Size in the session configuration as shown in the image below.
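As a worked example with assumed numbers: if the widest row across all sources and targets has a precision of 1,000 bytes, the buffer block size works out to about 1,000 * 100 = 100,000 bytes. For a session with one source and one target, Session Buffer Blocks = (1 + 1) * 2 = 4, so DTM Buffer Size = 4 * 100,000 / 0.9, roughly 444,445 bytes. The figures are only illustrative; use the precisions from your own mapping.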
II. Caches Memory Optimization

Transformations such as Aggregator, Rank, Joiner and Lookup use cache memory to store transformed data, which includes the index and data caches. If the allocated cache memory is not large enough to store the data, the Integration Service stores the data in a temporary cache file, and session performance slows each time the Integration Service reads from that temporary cache file.

Note : You can examine the performance counters to determine which transformations require cache memory tuning; check the previous article for details.

1. Increasing the Cache Sizes

You can increase the allocated cache sizes so that the transformation is processed in cache memory itself and the Integration Service does not have to read from the cache file. You can calculate the memory requirements for a transformation using the Cache Calculator; below shown is the Cache Calculator for the Lookup transformation. You can then update the cache size in the session property of the transformation as shown below.

2. Limiting the Number of Connected Ports

For transformations that use a data cache, limit the number of connected input/output and output-only ports. Limiting the number of connected ports reduces the amount of data the transformations store in the data cache.

III. Optimizing the Target

The most common performance bottleneck occurs when the Integration Service writes to a target database. Small database checkpoint intervals, small database network packet sizes, or problems during heavy loading operations can cause target bottlenecks.

Note : A target bottleneck can be determined with the help of the session log file; check the previous article for details.

1. Using Bulk Loads

You can use bulk loading to improve the performance of a session that inserts a large amount of data into a DB2, Sybase ASE, Oracle, or Microsoft SQL Server database. When bulk loading, the Integration Service bypasses the database log, which speeds performance. Without writing to the database log, however, the target database cannot perform rollback; as a result, you may not be able to perform recovery.

2. Using External Loaders

To increase session performance, configure PowerCenter to use an external loader for the target database. External loaders can be used for Oracle, DB2, Sybase and Teradata.

3. Dropping Indexes and Key Constraints

When you define key constraints or indexes in target tables, you slow the loading of data into those tables. To improve performance, drop indexes and key constraints before you run the session; you can rebuild them after the session completes.

4. Increasing Database Checkpoint Intervals

The Integration Service performance slows each time it waits for the database to perform a checkpoint. To decrease the number of checkpoints and increase performance, increase the checkpoint interval in the database.

5. Increasing Database Network Packet Size

If you write to Oracle, Sybase ASE, or Microsoft SQL Server targets, you can improve the performance by increasing the network packet size, which allows larger packets of data to cross the network at one time.

6. Minimizing Deadlocks

Encountering deadlocks can slow session performance. You can increase the number of target connection groups in a session to avoid deadlocks. To use a different target connection group for each target in a session, use a different database connection name for each target instance.

IV. Optimizing the Source

Performance bottlenecks can occur when the Integration Service reads from a source database. An inefficient query or small database network packet sizes can cause source bottlenecks.

Note : Session log file details can be used to identify a source bottleneck; check the previous article for details.

1. Optimizing the Query

If a session joins multiple source tables in one Source Qualifier, you might be able to improve performance by optimizing the query with optimizer hints. Usually the database optimizer determines the most efficient way to process the source data; however, you might know properties about the source tables that the database optimizer does not. The database administrator can create optimizer hints to tell the database how to execute the query for a particular set of source tables, as in the sketch below.
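As a minimal, purely illustrative sketch (the table, index and column names are invented, and the hint shown is standard Oracle syntax rather than something taken from the article), a Source Qualifier SQL override that asks the optimizer to use a specific index could look like:

    SELECT /*+ INDEX(ORD ORD_ORDER_DT_IDX) */
           ORD.ORDER_ID, ORD.CUSTOMER_ID, ORD.ORDER_AMOUNT
    FROM   ORDERS ORD
    WHERE  ORD.ORDER_DATE >= TO_DATE('2013-01-01', 'YYYY-MM-DD')

Whether such a hint actually helps depends on the data and should be decided together with the database administrator.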
2. Increasing Database Network Packet Size

If you read from Oracle, Sybase ASE, or Microsoft SQL Server sources, you can improve the performance by increasing the network packet size, which allows larger packets of data to cross the network at one time.

V. Optimizing the Mappings

Mapping-level optimization may take time to implement, but it can significantly boost session performance. Focus on mapping-level optimization after you optimize the targets and sources. Generally, you reduce the number of transformations in the mapping and delete unnecessary links between transformations to optimize the mapping. Configure the mapping with the least number of transformations and expressions to do the most amount of work possible, and delete unnecessary links between transformations to minimize the amount of data moved.

Note : You can identify a mapping bottleneck from the session log file; check the previous article for details.

1. Optimizing Datatype Conversions

You can increase performance by eliminating unnecessary datatype conversions. For example, if a mapping moves data from an Integer column to a Decimal column and then back to an Integer column, the unnecessary datatype conversions slow performance. Where possible, eliminate unnecessary datatype conversions from mappings.

2. Optimizing Expressions

You can also optimize the expressions used in the transformations. When possible, isolate slow expressions and simplify them.

Factoring Out Common Logic : If the mapping performs the same task in multiple places, reduce the number of times the mapping performs the task by moving the task earlier in the mapping.
Minimizing Aggregate Function Calls : When writing expressions, factor out as many aggregate function calls as possible; each time you use an aggregate function call, the Integration Service must search and group the data. For example, SUM(COL_A + COL_B) performs better than SUM(COL_A) + SUM(COL_B).
Replacing Common Expressions with Local Variables : If you use the same expression multiple times in one transformation, you can make that expression a local variable.
Choosing Numeric Versus String Operations : The Integration Service processes numeric operations faster than string operations. For example, if you look up large amounts of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.
Using Operators Instead of Functions : The Integration Service reads expressions written with operators faster than expressions written with functions; a small sketch is given below.
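As an illustration of the operator-versus-function point (the port names are assumptions), concatenating with the || operator,

    FIRST_NAME || ' ' || LAST_NAME

is read faster by the Integration Service than the equivalent nested function call,

    CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)

while both produce the same result.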
3. Optimizing Transformations

Each transformation is different, and the tuning required for each transformation is different.

Note : Tuning techniques for the different transformations will be covered in a separate article.

What is Next in the Series

The next article in this series will cover the additional features available in Informatica PowerCenter to improve session performance.

Hope you enjoyed this article. Please leave us a comment or feedback if you have any; we are happy to hear from you.

Informatica PowerCenter Pushdown Optimization a Hybrid ELT Approach

Johnson Cyriac Jul 30, 2013 | Mapping Tips | Performance Tips

Performance Improvement Features
Pushdown Optimization
Pipeline Partitions
Dynamic Partitions
Concurrent Workflows
Grid Deployments
Workflow Load Balancing

The Informatica Pushdown Optimization Option increases performance by providing the flexibility to push transformation processing to the most appropriate processing resource. Using Pushdown Optimization, data transformation logic can be pushed to the source database or the target database, or processed through the PowerCenter server. This gives the ETL architect the option to choose the best of the available resources for data processing.

What is Pushdown Optimization

The Pushdown Optimization Option enables data transformation processing to be pushed down into any relational database to make the best use of database processing power. It converts the transformation logic into SQL statements which can execute directly on the database. This minimizes the need to move data between servers and utilizes the power of the database engine.

How Pushdown Optimization Works

When you run a session configured for pushdown optimization, the Integration Service analyzes the mapping and transformations to determine the transformation logic it can push to the database. It converts that transformation logic into SQL statements and sends them to the source or the target database to perform the data transformation. The amount of transformation logic you can push to the database depends on the database, the transformation logic, and the mapping and session configuration. If the mapping contains a mapplet, the Integration Service expands the mapplet and treats the transformations in the mapplet as part of the parent mapping.

Different Types of Pushdown Optimization

You can configure pushdown optimization in the following ways.

1. Source-side pushdown optimization
2. Target-side pushdown optimization
3. Full pushdown optimization

Source-side pushdown optimization

When you run a session configured for source-side pushdown optimization, the Integration Service analyzes the mapping from the source to the target, or until it reaches a downstream transformation it cannot push to the database. It generates a SELECT statement based on the transformation logic for each transformation it can push to the database. When you run the session, it executes the generated SQL, reads the results of this SQL statement and continues to run the session. If you run a session that contains an SQL override or lookup override, the Integration Service generates a view based on the override, then generates a SELECT statement and runs it against this view. When the session completes, the Integration Service drops the view from the database.

Target-side pushdown optimization

When you run a session configured for target-side pushdown optimization, the Integration Service analyzes the mapping from the target to the source, or until it reaches an upstream transformation it cannot push to the database. It generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database, starting with the first transformation in the pipeline it can push. The Integration Service processes the transformation logic up to the point that it can push the transformation logic to the target database, then executes the generated SQL.

Full pushdown optimization

The Integration Service pushes as much transformation logic as possible to both the source and target databases. When you run a session configured for full pushdown optimization, the Integration Service analyzes the mapping starting with the source and analyzes each transformation in the pipeline until it analyzes the target. It generates SQL statements that are executed against the source and target databases based on the transformation logic it can push. If the Integration Service cannot push all the transformation logic to the database, it performs partial pushdown optimization instead. To use full pushdown optimization, the source and target must be on the same database. If the session contains an SQL override or lookup override, the Integration Service generates a view and runs a SELECT statement against this view.

Configuring Session for Pushdown Optimization

A session can be configured to use pushdown optimization from the Informatica PowerCenter Workflow Manager. You can open the session and choose Source, Target or Full pushdown optimization as shown in the image below. You can additionally choose a few options to control how the Integration Service pushes the data transformation into SQL statements; the screenshot below shows the available options.

Allow Temporary View for Pushdown : Allows the Integration Service to create temporary view objects in the database when it pushes the session to the database.
Allow Temporary Sequence for Pushdown : Allows the Integration Service to create temporary sequence objects in the database.
Allow Pushdown for User Incompatible Connections : Indicates that the database user of the active database has read permission on the idle databases.

Using Pushdown Optimization Viewer

Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database. Select a pushdown option or pushdown group in the Pushdown Optimization Viewer to view the corresponding SQL statement that is generated for the specified selections; an illustrative example is sketched below. You can invoke the viewer from the highlighted 'Pushdown Optimization' menu item as shown in the image below.
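To give a feel for the kind of statement the viewer displays, consider a hypothetical mapping that filters a staging table and loads a target on the same database; a full pushdown might collapse the whole pipeline into a single statement roughly like the one below. The table and column names and the filter are invented for illustration only; the real SQL is generated by the Integration Service and depends entirely on your mapping.

    INSERT INTO T_SALES_FACT (SALE_ID, SALE_AMOUNT_USD)
    SELECT STG.SALE_ID, STG.SALE_AMOUNT * STG.CONVERSION_RATE
    FROM   S_SALES_STAGE STG
    WHERE  STG.SALE_AMOUNT > 0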
The Pushdown Optimization Viewer pops up in a new window and shows how the Integration Service converts the data transformation logic into SQL statements for a particular mapping. When you select a pushdown option or pushdown group in the viewer, you do not change the pushdown configuration; to change the configuration, we must update the pushdown option in the session properties.

Things to Consider before Using Pushdown Optimization

When you run a session for full pushdown optimization and the session contains a large quantity of data, the database must run a long transaction. Consider the following database performance issues when you generate a long transaction.

A long transaction uses more database resources.
A long transaction locks the database for longer periods of time, which reduces database concurrency and increases the likelihood of deadlock.
A long transaction increases the likelihood that an unexpected event may occur.

Hope you enjoyed this article and found it informative. Please leave us your comments and feedback.

Mapping Debugger to Troubleshoot your Informatica PowerCenter ETL Logic

Johnson Cyriac Aug 17, 2013 | Mapping Tips | Transformations

The Debugger is an integral part of the Informatica PowerCenter Mapping Designer, which helps you troubleshoot ETL logic errors or data error conditions in an Informatica mapping. The Debugger user interface shows the step-by-step execution path of a mapping and how the source data is transformed in the mapping. Features like breakpoints and evaluate expression make the debugging process easy.

Understand Debugger Interface

The Debugger user interface is integrated with the Mapping Designer. Once you invoke the Debugger, you get a few additional windows to display the debugging information, such as the transformation instance window, which shows how the data is transformed at a transformation instance, and the target window, which shows what data is written to the target. Below shown are the windows in the Mapping Designer that appear when you run the Debugger.

1. Mapping Window : Shows the step-by-step execution path of a mapping. It highlights the transformation instance being processed and shows the breakpoints set up on different transformations.
2. Target Window : Shows what data is processed into the target instance; you can see whether the record is going to get inserted, updated, deleted or rejected. If there are multiple target instances, you can choose the target instance name from the drop-down window to see its data.
3. Instance Window : Shows how data is transformed at a transformation instance. You can choose a specific transformation from the drop-down list to see how the data looks at that particular transformation instance for a particular source row. This window gets refreshed as the Debugger progresses from one transformation to another.
4. Debugger Log : This window shows messages from the Debugger.
The image above shows the mapping with one breakpoint set on the expression transformation, paused while processing the record with CUST_ID = 1001. The Target instance window shows the first two records set for update, and the Instance window shows how the third record from the source is transformed in the expression EXP_INSERT_UPDATE.

Configuring the Debugger

Before the Debugger can run, we need to set up the breakpoints and configure the session. Setting up breakpoints is optional for running the Debugger, but this option helps narrow down the issue faster, especially when the mapping is big and complex.

Creating Breakpoints

When you are running a Debugger session, you may not be interested in seeing the data transformations in all the transformation instances, but only in specific transformations where you expect a logical or data error. By setting a breakpoint, you can pause the Debugger on a specific transformation or when a specific condition is satisfied. You can open the Breakpoint window from Mapping -> Debugger -> Edit Breakpoints (Alt+F9) as shown in the image below. You can set two types of breakpoints.

Error Breakpoints : When you create an error breakpoint, the Debugger pauses when the Integration Service encounters error conditions such as a transformation error. You also set the number of errors to skip for each breakpoint before the Debugger pauses.
Data Breakpoints : When you create a data breakpoint, the Debugger pauses when the data breakpoint condition evaluates to true. You can set the number of rows to skip, a data condition, or both. For example, you might want to see what is going wrong in the expression transformation EXP_INSERT_UPDATE for a specific customer record, say CUST_ID = 1001. Below shown is a data breakpoint created on EXP_INSERT_UPDATE with the condition CUST_ID = 1001; with this setting the Debugger pauses on the transformation EXP_INSERT_UPDATE. In the same way, we can create error breakpoints on any transformation.

Configuring the Debugger

In addition to setting breakpoints, you must also configure the Debugger. Use the Debugger Wizard in the Mapping Designer to configure the Debugger against a saved mapping. When you configure the Debugger, you enter parameters such as the Integration Service and an existing reusable session, an existing non-reusable session, or a debug session instance for the mapping you are going to debug. You can start the Debugger Wizard from Mapping -> Debugger -> Start Debugger (F9) as shown in the image below.

From the window shown below, you choose the Integration Service to be used by the mapping being debugged. The next window gives the option to choose the sessions attached to the mapping being debugged: you choose an existing non-reusable session, an existing reusable session, or create a debug session instance.

You can choose to load or discard target data when you run the Debugger. If you discard target data, the Integration Service does not connect to the target. You can also select the target instances you want to display in the Target window while you run a debug session. With these settings the mapping is ready to be debugged.

Running the Debugger

When you complete the Debugger Wizard shown in the configuration phase above, the Integration Service starts the session and initializes the Debugger. After initialization, the Debugger moves in and out of running and paused states based on breakpoints and commands that you issue from the Mapping Designer. When the Debugger is in the paused state, you can see the transformation data in the Instance window. After you review or modify the data, you can continue the Debugger in the following ways. The different commands to control the Debugger execution are shown in the image below; this menu is available under Mapping -> Debugger.
Continue to the next break : To continue to the next break, click Continue (F5). The Debugger continues running until it encounters the next break.
Continue to the next instance : To continue to the next instance, click the Next Instance (F10) option. The Debugger continues running until it reaches the next transformation or until it encounters a break. If the current instance has output going to more than one transformation instance, the Debugger stops at the first instance it processes.
Step to a specified instance : To continue to a specified instance, select the transformation instance in the mapping, then click the Step to Instance (Ctrl+F10) option. The Debugger continues running until it reaches the selected transformation in the mapping or until it encounters a break.

Evaluating Expression

When the Debugger pauses, you can use the Expression Editor to evaluate expressions using mapping variables and ports in a selected transformation. You can access the Evaluate Expression window from Mapping -> Debugger -> Evaluate Expression. This option is helpful for evaluating and rewriting an expression in case you find the expression result is erroneous, or for checking what the result would be if the input were different from the current value.
Modifying Data

When the Debugger pauses, the current instance displays in the Instance window. You can modify the data from the Instance window; data modifications can be made to the current instance when the Debugger pauses on a data breakpoint.

Hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying out the mapping Debugger, and subscribe to the mailing list to get the latest tutorials in your mailbox.

Informatica Performance Tuning Guide, Tuning and Bottleneck Overview - Part 1

Johnson Cyriac Aug 25, 2013 | ETL Admin | Performance Tips

Performance Tuning Tutorial Series
Part I : Performance Tuning Introduction.
Part II : Identify Performance Bottlenecks.
Part III : Remove Performance Bottlenecks.
Part IV : Performance Enhancements.

The performance tuning process identifies the bottlenecks and eliminates them to get an acceptable ETL load time. Tuning starts with the identification of bottlenecks in the source, target and mapping, and moves on to session tuning; it might also need further tuning of the system resources on which the Informatica PowerCenter services are running. This performance tuning article series is split into multiple articles, each of which goes over a specific area of performance tuning. In this article we will discuss the session anatomy and more about bottlenecks.

Determining the best way to improve performance can be complex. An iterative method of identifying one bottleneck at a time, eliminating it, then identifying and eliminating the next bottleneck until an acceptable throughput is achieved is the most effective approach. The first step in performance tuning is to identify performance bottlenecks. Performance bottlenecks can occur in the source and target, the mapping, the session, and the system. Before we look at the different bottlenecks, let's see the components of an Informatica PowerCenter session and how a bottleneck arises.

Informatica PowerCenter Session Anatomy

When a PowerCenter session is triggered, the Integration Service starts the Data Transformation Manager (DTM), which is responsible for starting the reader thread, the transformation thread and the writer thread. Above shown is the pictorial representation of a session. The reader thread reads data from the source, the transformation threads process the data according to the transformation logic in the mapping, and finally the data is loaded into the target by the writer thread, which connects to the target. Any data processing delay in these threads leads to a performance issue.

Source Bottlenecks

Performance bottlenecks can occur when the Integration Service reads from a source database. An inefficient query or small database network packet sizes can cause source bottlenecks. Slowness in reading data from the source leads to a delay in filling enough data into the DTM buffer, so the transformation and writer threads wait for data. This delay causes the entire session to run slower.

Target Bottlenecks

Small database checkpoint intervals, small database network packet sizes, or problems during heavy loading operations can cause target bottlenecks. When a target bottleneck occurs, the writer thread is not able to free up space for the reader and transformation threads, so the reader and transformation threads wait for free blocks. This causes the entire session to run slower.

Mapping Bottlenecks

Complex or poorly written mapping logic can lead to a mapping bottleneck. With a mapping bottleneck, the transformation thread runs slower, causing the reader thread to wait for free blocks and the writer thread to wait for blocks filled up for writing to the target.

Session Bottlenecks

If you do not have a source, target, or mapping bottleneck, you may have a session bottleneck. A session bottleneck normally occurs when the session memory configuration is not tuned correctly. Small cache sizes, low buffer memory, and small commit intervals can cause session bottlenecks.

System Bottlenecks

After you tune the source, target, mapping, and session, consider tuning the system to prevent system bottlenecks. The Integration Service uses system resources to process transformations, run sessions, and read and write data. It also uses system memory to create cache files for transformations such as Aggregator, Joiner, Lookup, Sorter, Rank and XML.

What is Next in the Series

In the next article in this series, we will cover how to identify different bottlenecks using session thread statistics, session performance counters and more.

Hope you enjoyed this article. Please leave us a comment or feedback if you have any; we are happy to hear from you.

Session Logfile with Verbose Data for Informatica Mapping Debugging

Johnson Cyriac Aug 31, 2013 | Mapping Tips | Transformations

The Debugger is a great tool to troubleshoot your mapping logic, but there are instances where we need a different troubleshooting approach for mappings. A session log file with verbose data gives much more detail than the Debugger tool, such as what data is stored in the cache files and how variable ports are evaluated; this information helps in complex, tricky troubleshooting.

For our discussion, let's consider a simple mapping. In this mapping we have one lookup transformation and an expression transformation, and we will set up the session to debug these two transformations. Below shown is the structure of the mapping.

Setting Up the Session for Troubleshooting

Before we can run the workflow and debug, we need to set up the session to produce a log file with detailed verbose data. We can set the session to get verbose data from all the transformations in the mapping or only from specific transformations. For our demo, we are going to collect verbose data from the lookup and expression transformations. We can set up the session for debugging by changing the Tracing Level to Verbose Data as shown in the image below.
Here we are setting the tracing level for the lookup transformation. As mentioned, we are setting the Tracing Level to Verbose Data for the expression transformation as well, as shown below.

Note : We can override the tracing level for all the individual transformations at once from the Configuration Object -> Override Tracing property.

Read and Understand the Log File

Once you open the session log file with verbose data, you are going to notice a lot more information than we normally see in a log file. Since we are interested in the data transformation details, we can scroll down through the session log and look for the transformation thread. The part of the log file shown below details what data is stored in the lookup cache file: the highlighted section shows that the data is read from the lookup source LKP_T_DIM_CUST{{DSQ}} and is built into the LKP_T_DIM_CUST{{BLD}} cache, and further down you can see the values stored in the cache file.

Further down in the transformation thread, you can see three records passed on to LKP_T_DIM_CUST from the source qualifier SQ_CUST_STAGE. The lookup transformation output is sent out to the next transformation, EXP_INSERT_UPDATE, and you can see what data is received by EXP_INSERT_UPDATE from the lookup transformation. You can also see the Rowid in the log file; the Rowid is helpful for tracking the rows between the transformations.

Since we have enabled verbose data for the expression transformation as well, in addition to the above details you will see how data is passed into and out of the expression transformation, including how variable ports are evaluated, which is useful in detailed debugging but skipped from this demo.

Pros and Cons

Both the Debugger tool and debugging using verbose data have their own pluses and minuses.

Pros
Faster : Once you get the hang of verbose data, debugging using the session log file is faster than using the Debugger tool.
Detailed info : Verbose data gives much more detail than the Debugger, such as what data is stored in the cache files and how variable ports are evaluated, which is useful in detailed debugging.
All in one place : We get all the detailed debugging info in one place, which helps you go through how rows are transformed from source to target. You do not have to wait to get info from each transformation as with the Debugger tool, where you can see only one row at a time.

Cons
Difficult to understand : Unlike the Debugger tool, it requires an extra bit of effort to understand the verbose data in the session log file.
No user interface : No user interface is available; all the debugging info is provided in text format, which might not be the preferred way for some.
Lot of info : The session log file with verbose data gives much more detail than the Debugger tool, some of which is irrelevant for your troubleshooting.

Hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying out this debugging approach, or share any different methods you use for your debugging.