SYBASE REPLICATION SERVER PERFORMANCE AND TUNING
Understanding and Achieving Optimal Performance with Sybase Replication Server
Final v2.0.1

Table of Contents

Author's Note
Introduction
    Document Scope
    Major Changes in this Document
Overview and Review
    Replication System Components
    RSSD or Embedded RSSD (eRSSD)
    Replication Server Internal Processing
    Analyzing Replication System Performance
Primary Dataserver/Database
    Dataserver Configuration Parameters
    Primary Database Transaction Log
    Application/Database Design
Replication Agent Processing
    Secondary Truncation Point Management
    Rep Agent LTL Generation
    Replication Agent Communications
    Replication Agent Tuning
    Replication Agent Troubleshooting
Replication Server General Tuning
    Replication Server/RSSD Hosting
    RS Generic Tuning
    RSSD Generic Tuning
    STS Tuning
    RSM/SMS Monitoring
    RS Monitor Counters
    Impact on Replication
    RS M&C Analysis Repository
    RS_Ticket
Inbound Processing
    RepAgent User (Executor)
    SQM Processing
    SQT Processing
    Distributor (DIST) Processing
    Minimal Column Replication
Outbound Queue Processing
    DSI SQM Processing
    DSI SQT Processing
    DSI Transaction Grouping
    DSIEXEC Function String Generation
    DSIEXEC Command Batching
    DSIEXEC Execution
    DSIEXEC Execution Monitor Counters
    DSI Post-Execution Processing
    End-to-End Summary
Replicate Dataserver/Database
    Maintenance User Performance Monitoring
    Warm Standby, MSA and the Need for RepDefs
    Query Related Causes
    Triggers & Stored Procedures
    Concurrency Issues
Procedure Replication
    Procedure vs. Table Replication
    Procedure Replication & Performance
    Procedure Transaction Control
    Procedures & Grouped Transactions
    Procedures with "Select/Into"
Replication Routes
    Routing Architectures
    Routing Internals
    Routing Performance Advantages
    Routing Performance Tuning
Parallel DSI Performance
    Need for Parallel DSI
    Parallel DSI Internals
    Serialization Methods
    Transaction Execution Sequence
    Large Transaction Processing
    Maximizing Performance with Parallel DSI's
    Tuning Parallel DSI's with Monitor Counters
Text/Image Replication
    Text/Image Datatype Support
    RS Implementation & Internals
    Performance Implications
Asynchronous Request Functions
    Purpose
    Implementation & Internals
    Performance Implications
Multiple DSI's
    Concepts & Terminology
    Performance Benefits
    Implementation
    Business Cases
Integration with EAI
    Replication vs. Messaging
    Integrating Replication & Messaging
    Performance Benefits of Integration
    Messaging Conclusion

Author's Note

Thinking is hard work – "Silver Bullets" are much easier.

Several years ago, when Replication Server 11.0 was fairly new, Replication Server Engineering (RSE) collaborated on a paper that was a help to us all. Since that time, Replication Server has gone through several releases and Replication Server Engineering has been too busy keeping up with the advances in Adaptive Server Enterprise and the future of Replication Server to maintain the document. However, the requests for a paper such as this have been a frequent occurrence, both internally as well as from customers. Hopefully, this paper will satisfy those requests. But as the above comment suggests, reading this paper will require extensive thinking (and considerable time). Anyone hoping for a "silver bullet" does not belong in the IT industry.

This paper was written for and addresses the functionality in Replication Server 12.6 and 15.0 with Adaptive Server Enterprise 12.5.2 through 15.0.1 (Rep Agent and MDA tables). As the Replication Server product continues to be developed and improved, it is likely that later improvements to the product may supersede the recommendations contained in this paper. It is assumed that the reader is familiar with Replication Server terminology, internal processing and, in general, the contents of the Replication Server System Administration Guide. In addition, basic Adaptive Server Enterprise performance and tuning knowledge is considered critical to the success of any replication system's performance analysis.

This document could not have been achieved without the considerable contributions of the Replication Server engineering team, Technical Support Engineers, and the collective Replication Server community of consultants, educators, etc. who are always willing to share their knowledge. Thank you.

Document Version: 2.0.1, January 7, 2007
Introduction

"Just How Fast Is It?"

This question gets asked constantly. In the past, the stock RSE reply used to be 5MB/min (or 300MB/hr) based on their limited testing on development machines (small ones at that). Lately, some customers have claimed that by using multiple DSI's they have achieved 10,000,000 transactions an hour!! Although this sounds unrealistic, a monitored benchmark in 1995 using Replication Server 10.x and ASE 11.0.3 (on Compaq Alpha GS140's, for the curious) sustained 4,000,000 transactions (each with 10 write operations) a day from the source replicating to three destinations (each with only 5 DSI's), for a total delivery of 12,000,000 transactions per day containing 120,000,000 write operations. Significantly, RS 12.6 has been able to sustain ~3.4GB/hr into a 1.2TB database, and more than 40GB has been replicated in a single day into that same 1.2TB database. Replication Server has also been clocked at 2,000 rows/sec on a dual 3.0 GHz P4 XEON with internal SCSI disks. Expectations do, however, need to be realistic: product management recently got a call from a customer asking if Replication Server could achieve replicating 20GB of data in 15 minutes. The reality is that this is likely not even achievable using raw file IO streaming commands such as the unix dd command – let alone via a process that needs to inspect the data values and decide on subscription rules.

So, just how fast is it? It all depends. As usual, your results may vary. And every other trite caveat muttered by a tuning guru/educator/consultant. Replication Server is a highly configurable and highly tunable product. However, that places considerable responsibility on the system designers and implementers to design and implement an efficient data movement strategy – as well as on operations staff to monitor, tune and adjust the implementation as necessary. Implementers also need to be realistic about what the overall system can deliver. The goal of this paper is to educate so that the reader understands why they may be seeing the performance they are, and to suggest possible avenues to explore with the goal of improved performance, without resorting to the old tried-and-true trial-and-error stumble-fumble.

Document Scope

Before we begin, it is best to lay some ground rules about what to expect or not to expect from this paper. It is expected that the reader is already familiar with Replication Server internal processing and basic replication terminology as described in the product manuals. Because performance and tuning is so situationally dependent, it is doubtful that attempting to read this paper at a single sitting will be beneficial; those familiar with Replication Server may want to skip to the specific detail sections that are applicable to their situation. Focusing first on what this paper will not cover:

• This paper will not discuss database server performance and tuning (although it frequently is the cause of poor replication performance) except as required for replication processing.
• This paper will not discuss non-ASE RepAgent performance (perhaps it will in a future version) except where such statements can be made generically about RepAgents. This paper focuses heavily on Replication Server in an Adaptive Server Enterprise environment.
• This paper will not discuss Replication Server system administration.
• This paper will not discuss Replication Server Manager.
• This paper will not discuss how to "benchmark" a replicated system. Unfortunately, there are no standard benchmarks such as TPC-C for replication technologies, and RSE has neither the bandwidth nor the resources to do benchmarking.

Now that we know what we are going to skip, what we will cover:

• This paper will discuss all of the components in a replication system and how each impacts performance.
• This paper will discuss the internal processing of the Replication Server, ASE's Replication Agent, and the corresponding tuning parameters that are specific to performance.

In the future, it is expected that this paper will be expanded to cover several topics only lightly addressed in this version or not addressed at all. In the past, this list mostly focused on broader topics such as routing and heterogeneous replication. Routing has since been added, while heterogeneous replication has since been documented in the Replication Server documentation. Consequently, future topics will likely be new features added to existing functionality – much like the discussions on DSI partitioning and DSI commit control that have been added to the Parallel DSI chapter.
Major Changes in this Document

Because many people have read earlier versions of this document, the following sections list the topics added to the respective sections. This will aid those readers by allowing them to skip to the applicable sections to read the updated information. An attempt was made to red-line the changed sections. However, this document is produced using MS Word, which provides extremely rudimentary, inconsistent (and sometimes not persistent) and unreliable red-lining capabilities (it also crashes frequently during spell checking and hates numerical list formats… one wonders how Microsoft produces their own documentation with such unreliable tools). As a result, red-lining will not be used to denote changes.

Updates 2.0

The following additions were made to this document in v2.0 as compared to v1.9, including minor changes not noted here:

• RS Overview: Added a description of the embedded RSSD.
• RS Internals: Discussion of the SMP feature and internal threading.
• Application Design: Impact of "chained mode" on RepAgent throughput and RS processing.
• Application Design: Further emphasized the impact of high-impact SQL statements and the fact that the latency is driven by the replicate DBMS vs. RS itself.
• Rep Agent Tuning: Added a description of sp_help_rep_agent dbname, 'scan' with an example to clarify the output of the start/end/current markers and log records scanned.
• RS General Tuning: Discussion about the embedded RSSD and its tuning; discussion of the SMP feature and its impact on configuration parameters such as num_mutexes, etc.; added information about the join to rs_databases and a recommendation to increase the RSSD size; added a discussion about rs_ticket.
• Monitors & Counters: Added 12.6 and 15.0 counters to each section with samples; added the RS M&C analysis repository section, including a view to span the counter tables.
• Routes: Added 12.6 and 15.0 counters and a discussion about load balancing using multiple routes in multi-database configurations.
• Parallel DSI: Updated for commit control; expanded the discussion on transaction execution sequence to cover "disappearing updates" more thoroughly.
• Replicate Dataserver/Database: Added a discussion about MDA-based monitor tables to detect contention, SQL tracking, and RS performance metrics. Removed the somewhat outdated section on Historical Server and added new material on monitoring with the MDA tables, in particular details on using the WaitEvents and the monOpenObjectActivity/monSysStatement tables. Because of the depth of detail, this not only replaces the section on the legacy Historical Server, but also the section on replicate DBMS resources.
• Procedure Replication: Added a discussion on using procedures to emulate dynamic SQL (fully prepared statements) and the resulting performance gains at the replicate database.
• Text Replication: Added a discussion about changes in ASE 15.0.1 that allow the use of a global unique nonclustered index on the text pointer instead of the mass TIPSA update when marking tables with text for replication.
Updates 1.9

The following additions were made to this document in v1.9 as compared to v1.6:

• Batch processing: Added an informal NT benchmark with 750-1,000 rows/second.
• Batch processing: Added a trick showing how to replicate the SQL statement itself instead of the individual rows.
• Batch processing: Added additional benchmark results, including a benchmark from a financial trading system.
• Rep Agent processing: Added a discussion of sp_sysmon repagent output as well as using the MDA tables.
• Rep Agent processing: Added a discussion about ignore_dupe_key and CLR records and their impact on RS.
• Rep Agent User Thread: Expanded the section to include processing details and a diagram.
• SQM Thread: Added a diagram to illustrate the message queues.
• DIST Thread: Expanded the discussion on the SRE, TD and MD modules.
• Parallel DSI: Expanded the discussion on transaction execution sequence.
• RS & EAI: Added section.
Whether for performance reasons or due to architectural requirements. parallel DSI’s. After the system has been in operation. Along the way.Final v2. DBA's now have a choice of using the older ASE-based RSSD implementation or the new embedded RSSD.eliminating RS crashes due to log suspend. Consequently. Starting with version 12. ASE will "spin" looking for work.Final v2.a useful feature when doing extensive monitoring using monitor counters o The eRSSD transaction log is automatically managed . This includes: o RS will automatically start and stop the eRSSD DBMS. migrating RS between different platforms is much simpler than the cross-platform dump/load (XPDL) procedure for ASE (although manual steps may be required in either situation). or the dangerous practice of ‘truncate log on checkpoint’ Reduced impact on smaller single or dual cpu implementations – ASE as a DBMS is tuned to consume every resource it can – and even when not busy. ASE as a RSSD platform can lead to a "heavy" cpu and memory footprint in smaller implementations – robbing memory or cpu resources from the RS itself.0. the only reason that might tip a DBA to using ASE for the RSSD for new installation using RS 15 is simply due to familiarity.such as the ASE Sybase 6 . an RSSD have shown no difference in performance impact. RS’s RSSD primary user activity does not reach the levels that would distinguish the two. some customers have progressed to multi-level tree-like structures or virtual networks exploiting high-speed bandwidth backbones to form information buses.6. Benchmarks using an eRSSD vs. the added capability of routing with an embedded RSSD removes any architectural advantage over using ASE Since an ASA database is bi-endian. One other difference is that tools and components shipped with ASE . With RS 15. RSSD or Embedded RSSD (eRSSD) Those familiar with RS from the past have always been aware that the RS required an ASE engine for managing the RSSD. o The eRSSD will automatically grow as space is required . The eRSSD is an ASA based implementation that offers the following benefits: • Easier to manage – much of the DBA tasks associated with managing the DBMS for the RSSD have been built-in to the RS. While theoretical design and architectures would allow an ASE system to outscale an ASA based system. • • • • As a result.0. Today.1 LOG PRS RSSD LOG RRS RSSD RSM PRS RSSD DS PDB LOG RA PDS PRS RRS RSSD DS RRS RDS LOG RDB IRS IRS RSSD DS LOG IRS RSSD Figure 2 – Components of a Replication System Involving More Than One RS The above is still fairly basic. 0. which is slightly more accurate than those documented in the Replication Server Administration Guide: Figure 3 – Replication Server Internal Processing Flow Diagram Replicated transactions flow through the system as follows: 1. The dAIO polls the O/S for completion and notifies the SQM that the I/O completed. it is strictly the starting point to beginning to understand how Sybase Replication Server processes transactions. Most of these are extremely similar and only differ in the relationships between the SQM. the Replication Agent can safely move the secondary truncation point forward (based on scan_batch_size setting). most Replication Server administrators immediately picture the internal threads. The similar Sybase Central ASA plug-in is not shipped with Replication Server.1 Central Plug-in . For the sake of this paper. Again. Consequently. 
Replication Agent User thread functions as a connection manager for the Replication Agent and passes the changes to the SQM. This is especially useful when wanting to reverse engineer RSSD procedures or quickly view data in one of the tables. many Replication Server administrators stop there. While understanding the internal threads is an important fundamental concept. Replication Agent forwards logged changes scanned from the transaction log to the Replication Server. Additionally. Closed. and as a result never really understand how Replication Server is processing their workload. which as of this writing. we will be using the following diagram.Final v2. Unfortunately. The Stable Queue Transaction (SQT) thread requests the next disk block using SQM logic (SQMR) and sorts the transactions into commit order using the 4 lists Open. Replication Server Internal Processing When hearing the terms “internal processing”. SQT and dAIO threads. The SQM notifies that Asynchronous I/O daemon (dAIO) that it has scheduled an I/O. Replication Server Threads There are several different diagrams that depict the Replication Server internal processing threads. One way of obtaining the same tools is to simply download the SQL Anywhere Developer’s Edition. Details about what is happening within each thread as data is replicated will be discussed in later chapters. Once written to disk. 4. The Stable Queue Manager (SQM) writes the logged changes to disk via the operating systems asynchronous I/O routines. 5.allows DBA’s to connect to the ASE RSSD to view objects and data. is free. the 3. and Truncate. 2. Transactions from source systems are stored in the inbound queue until a copy has been distributed to all subscribers (outbound queue). this leaves the administrator ill equipped to resolve issues and in particular to analyze performance bottlenecks within the distributed system. Read. it filters and normalizes the replicated transactions according to the replication definitions. 7 . since the threads were internal to RS and execution could be controlled by the OpenServer scheduler. grouping. Transact SQL) and applies the transaction to the replicated database. it is illustrated here for future discussion. Further discussion about RS SMP capabilities and the impact on performance will be discussed later. SQT/SQM and queue data flow and the lack of a SQT thread reading from the outbound queue (instead. Once the delivery strategy is determined. the transaction is put in the closed list and the SQT alerts the Distributor thread that a transaction is available. Inter-Thread Messaging Additionally. the reader is referred to the Replication Server System Administration Guide for details of internal processing for replication systems involving routing or Warm Standby. inter-thread communications is not accomplished via a strict synchronous API call. whether subscription migration is necessary.5. those in the product manuals is the inclusion of the System Table Services (STS). the native threading improved the RS throughput. This point in the process serves as the boundary between the inbound connection process and the outbound connection processing. The state of “Locking Resource” corresponds more to this condition – the thread in question is attempting to grab exclusive access to a shared resource and is waiting on another thread to release the mutex.6 improved this by reducing the internal contention from the initial 12. In RS 12. 
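As a practical companion to the inbound pipeline just described, the following is a minimal sketch (assuming an isql session connected to the Replication Server itself) of the admin commands commonly used to watch these threads and their queues at work. The commands are standard RCL; interpreting the output columns is covered in the SQM/SQT sections later in this paper.

    -- list all RS threads (RepAgent User, SQM, SQT, DIST, DSI, etc.) and their states
    admin who
    go
    -- stable queue manager detail: segments/blocks written and the next read point per queue
    admin who, sqm
    go
    -- SQT detail: open/closed/read/truncate transaction counts and SQT cache usage per queue
    admin who, sqt
    go

A growing gap between what the SQM has written and what its readers have read is usually the first visible symptom of a downstream bottleneck.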
each thread simply writes a message into one of the target thread’s OpenServer message queue (standard OpenServer in memory message structures for communicating between OpenServer threads) specific for the message type. The DSI Executor translates the replicated transaction functions into the destination command language (i. The Distributor reads the transaction and determines who is subscribing to it. The DSI Scheduler uses the SQM library functions (SQMR) to retrieve transactions from the outbound queue. 9. Replication Server was a single process using internal threads for task execution along with kernel threads for asynchronous I/O.6 – with or without SMP enabled – the native threading implementation allows the thread execution to be controlled by the OS – consequently mutexes had to be added to several shared resources. threads are required to “lock” the resource for their exclusive use – typically by grabbing the mutex that controls access to the resource. Once the target 8 . you may sometimes see a state of “Locking Resource” when issuing an admin who command. Version 12. etc.) 11. 12.6 prior to attempting SMP. 8. However. the only difference here vs. Each of the main threads discussed above were implemented as full native threads. in RS. the DSI-S is illustrated making SQMR/SQT library calls). there are resources – typically memory structures – that are shared among the different threads. Beginning with version 12. conflicting access to shared resources could often be avoided simply due to the fact that only one thread would be executing at a time. Typically in most multi-threaded applications. While the difference is slight. For example. Once all of the subscribers have been identified. Instead. the SQM writes to the queue using the async i/o interface and continues working. By itself. 6. the DSI Scheduler then passes the transaction to a DSI Executor. the SQT cache is shared between the SQT thread and an SQT client such as a Distributor thread (this shared cache will be important to understanding the hand-off between DSI-S and DSIEXEC threads later). To coordinate access to such shared resources (so that one thread does not delete it while another is using it – or one be reading while another has not finished writing and get corrupted values). parallelism. Once the commit record for a transaction has been seen.e. 7. then uses SQT library functions to sort them into commit order (in case of multiple source systems) and determines delivery strategy (batching. Similar to the inbound queue.Final v2.5 implementation – consequently DBA's should consider upgrading to version 12.0. In RS 12. Because mutex allocation is so quick.1 read request is done via async i/o by the SQT’s SQM read logic and the SQT notified by the dAIO when the read has completed. Asynchronous I/O daemon (dAIO). RS is undergoing a significant state change – for example switching the active in a Warm Standby. even without enabling SMP. etc. at which point the requesting thread is blocked and has to wait. 10. The dAIO will notify the SQM when the write has completed. which could run on multiple processors.6 and higher. Grabbing a mutex really does not take but a few milliseconds – unless someone else has it already. Transactions are stored in the outbound queue until delivered to the destination. one new aspect of this from an internals perspective is that shared resources now required locks or mutexes. Again. it is likely that when you see this. 
a SMP version of RS exploiting native OS threads was available via an EBF.5 and earlier non-SMP environments. Replication Server SMP & Internal Threading In the past. Keeping these differences in mind. The SMP capabilities could be enable or disable through configuring the Replication Server. the Distributor thread forwards the transaction to the SQM for the outbound queue for the destination connections. In RS 12. For Sybase ASE. Similarly at the replicate. By now. it can detect if the transaction has already been applied. Consequently. although the current design for Replication Server doesn’t require such. Consequently. Note that the message queues are not really tied to a specific thread . Since this section was strictly intended to give you a background in Replication Server internals. those familiar with many of the Replication Server configuration parameters will have realized the relationship between several fairly crucial configuration parameters: num_threads. OQID Processing One of the more central concepts behind replication server recovery is the OQID – Origin Queue Identifier. num_msgqueues and num_msgs (especially why this number could be a large multiple of num_msgqueues). in addition to using message queues.1 thread has processed each message. The OQID is used for duplicate and loss detection as well as determining where to restart applying transactions during recovery. the SQT and DIST threads can communicate using Callback routines.0. As a result. the specifics of this relationship will be discussed later in the section discussion Replication Server tuning. any time the RS detects an OQID lower than the last one. when the DSI compares the ODID in the rs_lastcommit table with the one current in the active segment. log page timestamp and log record row id (rid). callbacks are used primarily between threads in which one thread spawned the other and the child thread needs to communicate to the parent thread. Due to the fact the OQID contains log specific information. 9 . Accordingly. This resembles the following: OpenClient Callback Rep Agent User SQM OpenServer Message Queues Figure 4 – Replication Server Inter-Thread Communications Those familiar with multi-threaded programming or OpenServer programming will recognize this as a common technique for communication between threads – especially when multiple threads are trying to communicate with the same destination thread. it can somewhat safely assume that it is a duplicate. it is possible to have more message queues than threads. The OQID is generated by the Replication Agent when scanning the transaction log from the source system. it can use standard callback routines or put a response message back into a message queue for the sending thread. An example of this in Replication Server is the DIST and SQT threads. the OQID is a 36 byte binary value composed of the following elements: Byte 1-2 3-8 9-14 15-20 21-28 29-30 31-32 33-34 35-36 Contents Database generation id (from dbcc gettrunc()) Log page timestamp Log page rowid (rid) Log page rid for the oldest transaction Datetime for oldest transaction Used by RepAgent to delete orphaned transactions Unused Appended by TD for uniqueness Appended by MD for uniqueness Through the use of the database generation id. As a result.Final v2. each OQID format will be dependent upon the source system. The SQT thread for any primary database is started by the DIST thread. a single thread may be putting/retrieving messages from multiple message queues. 
ASE guarantees that the OQID is always increasing sequentially.but rather to a specific message. the rest of this document will be divided into sections detailing how these components work in relation to possible performance issues.0. Instead. The getdate() function is a direct poll of the system clock on the replicate. Note that the oldest open transaction position is also part of the ASE. comes from the getdate() function call in rs_update_lastcommit.1 Why would there be duplicates?? Simply because the Replication Server isn’t updating the RSSD or the rs_lastcommit table with every replicated row. Consequently. the rs_lastcommit commit time is for the last command in the batch – and not necessarily the one issued that you are testing with. The major sections will be: • • • • • • Primary Dataserver/Database Replication Agent Processing Replication Server and RSSD General Tuning Inbound Processing Outbound Queue Processing Replicate Dataserver/Database After these sections have been covered in some detail. since transactions are grouped when delivered via RS (topic for later). The best mechanism to determining latency is to simply run a batch of 1. Those familiar with creating intermediate replication routes and concept of logical network topology provided by the intermediate routing capability will recognize the benefit of this behavior. If the Replication Server is keeping the system up to the point a stop watch would be necessary. At the replicate. on the other hand. This ‘pause’ is built in so that subsequent transactions can be batched into the buffer for similar processing.e. then you don’t have a latency problem. as we will see later. This may include system transaction id’s or other system generated information that uniquely identifies each transaction to the Replication Agent. This is deliberate. On the other hand. which is synched with the system clock about once per minute. this document will then cover several special topics related to DSI processing in more detail. the OQID ensures that only a single copy of a message is delivered in the event that the routing topology changes. If. however.000 normal business transactions (can be simulated with atomic inserts spread across the hot tables) into the primary and monitor the end time at the primary and replicate. Additionally.Final v2. Should the system be halted mid-batch and then restarted. Analyzing Replication System Performance Having set the stage. it is updating every so often after a batch of transactions has been applied. but not more than a minute as it is re-synched each minute. An important aspect of the OQID is the fact that each replicated row from a source system is associated with only one OQID and vice versa.. The dest_commit time in the rs_lastcommit table. network outage). This is extremely inaccurate. it may appear to be worse than it is. the Replication Agent may have to restart at the point of the oldest transaction and rescan to ensure that nothing is missed. As discussed later. There can be drift obviously. For large sets of transactions. if the last command was a long running procedure. if the Replication system is shutdown. much like network packeting.. the oldest open transaction position is necessary for recovery. The danger is that some people have attempted to use the OQID or origin commit time in the rs_lastcommit table for timing. For heterogeneous systems. 
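A minimal sketch of the timing approach described above follows. It assumes a replicated table named latency_test (a hypothetical name) exists and is subscribed at the replicate, and that the clocks of the two servers are reasonably in sync; as the text cautions, clock drift and transaction grouping make any single measurement only a rough indicator.

    -- at the primary: ~1,000 small atomic transactions against the replicated table
    declare @i int
    select @i = 1
    while @i <= 1000
    begin
        insert latency_test (id, note, crdate) values (@i, 'timing run', getdate())
        select @i = @i + 1
    end
    go

    -- at the replicate: how far behind is delivery?
    select count(*) as rows_arrived,
           datediff(ss, max(crdate), getdate()) as lag_secs
      from latency_test
    go

    -- or look at what RS itself recorded for this origin
    select origin, origin_time, dest_commit_time,
           datediff(ss, origin_time, dest_commit_time) as grouped_tran_latency_secs
      from rs_lastcommit
    go

Remember that rs_lastcommit reflects the last command of the last transaction group applied, not necessarily the specific row you are watching.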
Those familiar with TCP programming will recognize this buffering as similar to the delay that is disabled by enabling TCP_NO_DELAY as well as other O/S parameters such as tcp_deferred_ack_interval on Sun Solaris. the database generation (bytes 1-2) and the RS managed bytes (33-36) are the same. but also in replication routing. the origin commit time comes from the timestamp in the commit record (a specific record in the transaction log) on the primary. however the other components depend on what may be available to the replication agent to construct the OQID. This time is derived from the dataserver’s clock. This is key to not only identifying duplicates for recovery after a failure (i. obviously a stop watch is not even necessary. the ASE Rep Agent does not actually ever read the secondary truncation point. it finishes at the primary in 1 minute and at the replicate in 5 minutes – then you have a problem – maybe. Since the Replication Agent could be scanning past the primary truncation point and up to the end of the log.. From this aspect. This includes: • • • Procedure Replication Replication Routes Parallel DSI Performance 10 . The resulting difference between the two could be quite large in one sense or even negative if the replicate’s clock was slow. First. a similar situation occurs in that the Replication Server begins by looking at the oldest active segment in the queue – which may contain transactions already applied. it is possible that the first several have already been applied. In any case. the Replication Agent and Replication Server both have deliberate delays built in when only a small number of records are received. 1 • • • • Text/Image Replication Asynchronous Request Functions Multiple DSI’s Integration with EAI 11 .0.Final v2. . ASE engineers began tapping in to this resource by caching subquery results. system administrators who have tuned the procedure cache to the minimal levels prior to implementing replication may need to increase it slightly to accommodate Replication Agent usage if using an earlier release of ASE.Final v2. this has changed. “9204” -. One of those ways is proper tuning of the database engine’s system configuration settings. When the Replication Agent thread was internalized within the ASE engine (ASE 11. etc. For example.9. replication can impact system administration in many ways. including: • • • • • Poor transaction management. etc. the default of 20% often meant that ~400MB of memory was being reserved for procedure cache. have a direct impact on the performance of the Replication Agent or in processing transactions within the Replication Server. High-impact SQL statements . in a system with 2GB of memory dedicated to the database engine.) will quickly point to significant flaws in the primary database design or implementation. Dataserver Configuration Parameters While Sybase has striven (with some success) to make replication transparent to the application. It also used procedure cache.2 or earlier.0) have moved this requirement from procedure cache to additional memory grabbed at startup similar to additional network memory. a bad design will also cause replication performance to suffer. text/image replication states.) Note that all of these have problems in a distributed environment – whether using Replication Server or MQSeries messaging. “<filepathname>” sp_config_rep_agent <db_name>. However.such as a single update or delete statement that affects a large number of rows (>10. in recent years. 
implementing database replication or other forms of distributing database information (messaging. You can see how much memory a Replication Agent is using via the 9204 trace flag (additional information on enabling/disabling Replication Agent trace flags is located in the Replication Agent section).Caching LTL statements pending transfer to the Replication Server As a result. Transaction Cache .0. “traceon”. In this section. The Replication Agent uses memory for several critical functions: Schema Cache . Consequently.monitor for a few minutes sp_config_rep_agent <db_name>. While it is true that many replication performance problems can be resolved there. real procedure cache used by stored procedure plans is less than 10MB. in procedure cache. the procedure cache was grossly oversized. particularly with stored procedures. batch processes. this may not be as great of a problem as ASE 11. Later releases of ASE (from ASE 12. it was no different. The reason is than in most large production systems. the proper design of a database system for distributed environments is beyond the scope of this paper. In addition to the Replication Agent Thread (even though significantly better than the older LTM’s as far as impact on the dataserver). we tend to focus quickly at the replicate. sp_config_rep_agent <db_name>. A truer statement has never been written. such as table.5). duplicate rows. In fact.e. Procedure Cache Sizing A common misconception is that procedure cache is strictly used for caching procedure query plans. However. we will begin with basic configuration issues and then move into some of the more problematic design issues that affect replication performance. While they may “work”. sort buffers. Often. used in the construction of LTL. if using ASE 12. “traceoff”. Inappropriate design for a distributed environment (heavy reliance on sequential or pseudo keys) Improper implementation of relational concepts (i. Single threaded batch processes. “9204” 13 . “trace_log_file”.1 Primary Dataserver/Database It is Not Possible to Tune a Bad Design The above comment is the ninth principal of the “Principals of OLTP Processing” as stated by Nancy Mullen of Andersen Consulting (now Accenture?) in her paper OLTP Program Design in OLTP Processing Handbook (McGrawHill). but in most cases. Several settings that would not normally be associated with replication. consequently under utilized and contributed to the lack of resources for data cache. column names.000). In many cases when replication performance is bad. etc. synchronization. lack of primary keys. it is not transparent to the database server. the primary database often also plays a significant role. they are not scalable.Caching for database object structures. nonetheless. Not only can you not fix it by replication.5. Rather than trying to 14 . The reason is that if the named cache is for mixed use (log and data). The reason stems from the fact that the Replication Agent cannot read a log page until it has been flushed to disk. a good RAID-based log device will be sufficient to enable the SSD to be used as a stable device or other requirement for general server performance (tempdb). User Log Cache (ULC) User (or Private) Log Cache was implemented in Sybase SQL Server 11. the Replication Agent performance may drop to as low as 1GB/hr. One reason is that it is extremely rare that the SQT workload is the performance bottleneck. 
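The trace-flag sequence mentioned above is easier to read laid out as an isql session. This is simply a sketch of the same steps (the database name my_pdb and the output file path are placeholders), not an addition to them; details on enabling and disabling Replication Agent trace flags are in the Replication Agent section.

    -- send Rep Agent trace output to a file rather than the ASE errorlog
    sp_config_rep_agent my_pdb, 'trace_log_file', '/tmp/repagent_trace.out'
    go
    -- turn on trace 9204 to report Rep Agent memory usage
    sp_config_rep_agent my_pdb, 'traceon', '9204'
    go
    -- ...let it run for a few minutes under normal load, then turn it back off
    sp_config_rep_agent my_pdb, 'traceoff', '9204'
    go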
In theory.0 as a means of reducing transaction log semaphore contention and the number of times that the same log page was written to disk.” While the intention may have been the largest size buffers were used. One aspect of this that could have had a large impact on the performance of replication server was that this would mean that a single transaction’s log records would be contiguous on disk vs. the ULC would be flushed much more frequently than normal. more than likely other buffer pools larger than 4K have been established. while in others it appears to use different pools almost exclusively for different periods of time. etc.). a little known fact is stated: “Regardless of how many buffer pools are configured in a named data cache. but as of this writing not enough statistics are available to tell how much of a positive impact this has on throughput by reducing the SQT workload. particularly the speed at which the Replication Agent can read and forward transactions to the Replication Server. the quicker Replication Agent will be able to scan the transaction log on startup. the best configuration is a separate dedicated log cache with all but 1MB allocated to 4K buffer pools.Final v2.0. binding the transaction log to a named cache can have significant performance benefits. If forced to read from disk. it needs access to the object’s metadata structures. syscolumns. In fact. Adaptive Server only uses two of them. experience monitoring production systems suggests that it is the buffer pool with the largest buffer space instead in some cases. some DBA’s simply assume that any 4KB I/O’s must be the transaction log. the probability is much higher that the Replication Agent can read the log from memory vs. a decision was made in the design of SQL Server 11. this can lead to higher transaction log contention as well as negating the potential benefit to the SQT thread. the Replication Agent processing will be slowed while waiting for the disk I/O to complete. In the Adaptive Server Enterprise Monitor Historical Server User’s Guide. As one would suspect. If forced to read this from disk. Some installations have opted to use Solid State Disks (SSD’s) as transaction log devices to reduce user transaction times. This would significantly reduce the amount of sorting that the SQT thread would have to do within the Replication Server. Unfortunately. Operating Systems have matured considerably. in order to ensure low latency and due to an Operating System I/O flushing problems. etc. Metadata Cache The metadata cache itself is important to replication performance. It uses the 2K buffer pool and the pool configured with the largest-sized buffers. that if the OSTAT_REPLICATED flag was on. As will be discussed later. would the records be written to the physical transaction log. eliminating the primary cause and hence the need for this.x. if a named cache is available. A word of caution. Primary Database Transaction Log As you would assume. recovery and during processing when physical i/o is required. While it may be tempting to simply allocate a small 4K pool in an existing cache. in some cases. Careful monitoring of the metadata cache via sp_sysmon during periods of peak performance will allow system administrators to size the metadata cache configurations appropriately. a properly sized ULC would mean that only when a transaction was committed. as the Replication Agent reads a row from the transaction log.5. For example. While such devices would help the Replication Agent. 
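One way to set up the kind of dedicated log cache described above is sketched below for a 50MB cache on a 2K-page server (so the 4K pool matches the log I/O size the discussion assumes). The cache name, database name and sizes are illustrative; also note that, depending on the ASE version, binding syslogs may require quiescing the database first.

    -- create a named cache reserved for transaction log pages
    sp_cacheconfig 'pdb_log_cache', '50M', 'logonly'
    go
    -- carve most of it into a 4K buffer pool, leaving ~1MB in the default 2K pool
    sp_poolconfig 'pdb_log_cache', '49M', '4K'
    go
    -- bind the transaction log to the cache and set the log I/O size
    sp_bindcache 'pdb_log_cache', my_pdb, 'syslogs'
    go
    use my_pdb
    go
    sp_logiosize '4'
    go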
when it could be query activity – counters available through sp_sysmon do not differentiate log I/O from data pages. interspersed with other user’s transactions. In ASE 12. the system behaves as if it did not have any ULC. The faster the device. the primary transaction log plays an integral role in replication performance. Named Cache Usage Along with log I/O sizing. Over the years.1 Generally speaking. a 50MB dedicated log cache would have 49MB of 4K buffers and 1MB of 2K buffers. Physical Location The physical location of the transaction log plays a part in both the database performance as well as replication performance. if resources are limited. While this does happen immediately after the page is full due to recovery reasons. A rule of thumb if sizing a new system for replication might be to use the metadata cache requirements as a starting point. this ULC flush was removed. However. the Replication Agent’s memory requirements will be less than normal server’s metadata cache requirements for system objects (sysobjects. disk. Purportedly. books. undoubtedly the best way to improve replication performance from the primary database perspective is the application or primary database design itself. fetch. all data retrieval and modification commands (delete. A classic case of this can be witnessed during large bcp operations (100. Cross-database referential integrity would be required (for which a trigger vs. since the transactions are committed vs.0. insert. Simple transactions that only involve queries vs. what would be lost by separating the database into two physical databases – one containing the authors. Newer versions of Replication Server have reduced the impact by removing empty transactions earlier – those from chained transactions as well as system transactions such as reorgs. while the other functions strictly as the sales order processing database? The answer is not much. That means the other 20% of the transactions are administering the lists of authors. the former is better. etc. While some would say that it would involve cross-database write operations. etc. 15 .5. Let’s assume that 80% of the transactions are store orders. book prices. books. etc. select. Application driven security implementations (screen navigation permissions. or is it more important to enforce referential integrity at all points and force recovery of both systems?? Obviously. is it more important to have a record of a sale to a store in the dependent database even if the parent store record is lost due to recovery. Multiple Physical Databases One of the most frequent complaints is that the Replication Agent is not reading the transaction log fast enough. etc. it is a database meant to track the sales of books to stores from a warehouse. when ever normal OLTP processing causes the Replication Agent to lag behind. etc. shipment tracking events.. control states. books and even stores would be entered into the system outside the scope of the transaction recording book sales. DML operations result in empty transactions. With the exception of bulk operations. new authors. In ASE 12. While some might think that the User Log Cache would filter these empty log transactions from even reaching the transaction log itself.000 or more rows) in which the overhead of constructing LTL for each row is significant enough to cause the Replication Agent to begin to lag behind. Besides the obvious negative impact on application performance. which are committed as usual. And yet. 
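A quick way to check whether the metadata cache descriptors discussed above are sized adequately is sp_monitorconfig, sketched here with the standard ASE configuration names; this complements the sp_sysmon approach mentioned in the text.

    -- a non-zero reuse count or a consistently high Pct_act suggests the pool is undersized
    sp_monitorconfig 'open objects'
    go
    sp_monitorconfig 'open indexes'
    go
    sp_monitorconfig 'open databases'
    go
    -- or review every configurable resource at once
    sp_monitorconfig 'all'
    go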
they have a negative impact in replication as well as these empty transactions are forwarded to the Replication Server. Application/Database Design While the above configuration settings can help reduce performance degradation. Chained Mode Transactions In chained mode. Appropriately designed. As a result. every time that a separate log cache has been enabled. an application that uses chained mode will degrade Replication Agent throughput as well as increase the processing requirements for Replication Server. there is a point where the Replication Agent is simply not able to keep up with the logged activity.) Static information such as tangible business objects including part lists. The biggest impact on RS is from the implicit transactions that result from select statements – which in most applications accounts for 75-80% of all activity in a DBMS. it makes sense to separate a logical database into several physical databases for the following types of data groupings: • • • • Application object metadata such as menu lists. The real crux of the matter is. Business event data such as sales records. etc. stores and other fairly static information. Although some times this can be alleviated by properly tuning the Replication Agent thread. declarative integrity may be more appropriate). Earlier versions of Replication Server would filter these empty transactions at the DSI thread due to the way transaction grouping works.2. the mythical pubs2 application. user actions that result in empty transactions will still result in empty begin/commit pairs sent to the RS. However.1 second guess this. the most frequent cause is the failure on the part of the database designers to consider splitting the logical database into two or more physical databases based on logical data groups. As a result. In most cases where the RepAgent was lagging. it is much simpler to simply restrict any named cache to only 2 sizes of buffer pools and use a dedicated log cache for this purpose.Final v2. open. rolled back. the current threading model. and update) implicitly begin a transaction. but even this does not pose a recovery issue except to academics. customers have witnessed an immediate 100% improvement in Replication Agent throughput as long as the RepAgent stayed within the log cache region. If maintained in the same database. however. these empty transactions are instead flushed to the transaction log. the real answer is not really. suppliers. this extra 20% of the transactions could be just enough to cause a single Replication Agent to lag behind the transaction logging. Consider for example. prompting calls for the ability to have more than one Replication Agent per log or a multi-threaded Replication Agent vs. the replication agent has been improved to eliminate the empty transactions from system transactions. adjusting the above configuration settings. Avoid Unnecessary BLOBs The handling of BLOB (text/image) data is becoming more of a problem today as application developers faced with storing XML messages in the database are often choosing to store the entire message as a BLOB datatype (image for Sybase if using native XML indexing). anytime a minimally logged function is executed in a database. consider the types of data that you may be storing in a text or image column. an applicant may be required to explain late payments to a specific credit account. Stored as text datatype. consider the following: • • Text/Image data is typically static. or other infrequently access reference data. 
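A short sketch of checking for, and stepping out of, chained mode is shown below; the procedure name is only an example. The point is simply to keep select-only activity from generating implicit transactions that the Rep Agent and RS must then discard.

    -- 1 means the current session is in chained (implicit transaction) mode
    select @@tranchained
    go
    -- applications can switch to unchained mode explicitly
    set chained off
    go
    -- procedures can be registered as 'anymode' so they run correctly
    -- for both chained and unchained clients
    sp_procxmode my_proc, 'anymode'
    go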
the ability to perform transaction log dumps is voided. However. you also will increase the degree of parallelism on the inbound side of Replication Server processing. the primary data related to business processing can support transaction log dumps allowing up to the minute recovery as well as be brought online faster after a system shutdown. Now. consider replication. a common requirement may be to determine the number of credit accounts and balances with any reported late payments for customers who are late in paying their current bill. If a person’s credit report is stored as a single text datatype. number of delinquent payments. during a Warm Standby failure. It is probably advisable to put such tables in a separate database with a view in the original for application transparency purposes. storing structured data in a BLOB datatype is actually orders of magnitude less efficient for the application. text/image or other types of BLOBs can significantly slow Rep Agent performance due to having to also scan the text chains – a slow process in any event. For example. This might allow a bank to reduce it’s risk of exposure either dynamically or avoid it altogether by refusing credit to someone who’s profile would suggest a greater chance of defaulting on the loan. First of all. consider the “credit report” instance alluded to earlier. So how does a separate database improve recoverability? First. Once inserted. This last is important from a different perspective. The point of this discussion is not to discourage storing XML documents when necessary – in fact storing the credit report as an entire entity might be needful – particularly if exchanging it with other entities. The last item might catch many people by surprise and immediate generate cautions about cross database transactions. For example. it can detract from the business’s ability to perform business analysis functions critical to profitability. Consequently.1 • One-up/sequential key tables used to generate sequential numbers Not only does this naturally lend itself to the beginnings of shareable data segments reusable by many applications. In addition. databases containing text/image data often must be backed up using full database dumps. in some cases splitting a database can be highly recommended for other reasons. By separating the text/image data. The first two are obvious solutions to replication performance degradation as result of text processing. most applications will use minimally logged functions such as writetext (or the CT-Library equivalent ct_send_data() function) to insert the text. dedicating one to reading text data Enable separate physical connection at the replicate to write the data – improving overall throughput as nontextual data is not delayed while text or image data is processed by the DSI thread. annotations about a specific charge are difficult to record. As an example. For instance. but the number of real transactions stranded may be able to be determined with more accuracy – and the associated key sequences preserved. etc. As a result. the tendency of 16 . Some financial institutions store loan applicant credit reports as text datatypes (although not recommended). The latter comment is not so obvious. Consequently. by doing so. As will be illustrated later. By placing the one-up key tables in a separate database. it is rarely updated and the most common write activity post-insert will be a delete operation performed during archival. 
this will require significant time to perform – depending on the quantity and speed of backup devices.0. digitized applications containing signatures. the application must then perform the parsing to determine such items as the credit score. In addition. However. the gap of missing rows can be determined from the key table. they effectively have a dedicated Replication Agent – and simple path through the Replication Server. For any large database. if applying for a mortgage. Additionally. Improve overall application/database recoverability. one-up/sequential key tables will have considerably less latency than the main data tables. under any recovery scenario – either the correct next value could be determined by scanning the real data or. it is less likely that any transactions were stranded. the number of open charge accounts.Final v2. The reasons for this are: • • • Enable multiple Replication Agents to work in parallel – in effect. Consider the common problem of databases containing large text or image objects. it would be difficult to link the applicant’s rebuttal (which would be a good use of text) with the specific account. Other organizations will frequently store customer emails. In most cases. To avoid transaction log issues with text/image. or even the location of specific shipments would require the XML document to be parsed. lastdate) (dependent info) (lastuser. Questions such as whether ground facility capacity had been exceeded. Several of the more common inefficiencies are discussed below.0. re-routing of shipments due to delays. lastdate) (lastuser. finding shipments). lastdate) (lastuser. An inefficient application not only increases the I/O requirements of the primary database. While serving an extremely useful purpose in providing the means to communicate with other systems. On top of which. etc. lastdate) (lastuser. it can seriously degrade overall application performance if XML messages are stored as a single text datatype. instead of a single record. lastdate) As a result. While some of this is unavoidable to insure business requirements are met.e. lastdate) (property info) (lastuser. lastdate) (dependent info) (lastuser. lastdate) (lastuser. insert update update update update update loan_application loan_application loan_application loan_application loan_application loan_application (name. lastdate) Some may question the reality of such an example. Consider the query “What scheduled flights or delayed flights are scheduled to arrive in the next 1 hour?” Transaction Processing After the physical database design itself. While doable. insert update update update update update update update update update update update loan_application loan_application loan_application loan_application loan_application loan_application loan_application loan_application loan_application loan_application loan_application loan_application (name. address) update loan_application (property info) update loan_application (dependent info) Now. if a cargo airplane’s schedule and load manifest were stored in XML format as a text datatype. 5. User inserts basic loan applicant name. Replication Server normalization/distribution/subscription processing.1 some is to think of the RDBMS as a big bit bucket to store all of their data as “objects” in XML format without recognizing the futility of doing so. the next largest contributor is how the application processes transactions. 
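As a simple illustration of this point (table and column names are hypothetical), carrying the searchable facts as ordinary columns next to the full document keeps questions like the late-payment query as plain SQL, while the complete report remains available for exchange with other systems:

create table credit_reports (
    applicant_id    int       not null,
    credit_score    smallint  null,
    open_accounts   smallint  null,
    delinquent_pmts smallint  null,
    full_report     text      null    -- complete document, stored once, rarely updated
)
go
-- answered without retrieving or parsing the text column
select applicant_id, open_accounts, delinquent_pmts
from credit_reports
where delinquent_pmts > 0
go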
often such operations require the retrieval of a large number of data values and subsequent parsing to find the desired information. one of Sybase’s mortgage banking customers had a table containing 65 columns requiring 8-10 application screens before completely filled out. including user auditing information (last_update_user). 17 . Consider the following mortgage application scenario: 1. address information As user transitions to next screen for property info. consider what happens at the replicate (if triggers are not turned off for the connection) – local trigger firings at the replicate are bolded. While remaining unnamed. address) (lastuser. Avoid Repeated Row Re-Writes One of the more common problems brought about by forms-based computing is that the same row of data may be inserted and then repeatedly updated by the same user during the same session. it also can significantly degrade replication performance. 3. 4. XML is mainly an application layer communications protocol. order totals. the business’s routing/scheduling and in transit visibility functions would be extremely hampered. User adds the property information (stored in same database table). lastdate) (lastuser. A classic scenario is the scenario of filling out an application for loans or other multi-part application process. consider the actual I/O costs if the database table had a trigger that recorded the last user and datetime that the record was last updated.Final v2. 6. lastdate) (property info) (lastuser. lastdate) (lastuser. It is real. and text indexing/XML indexing may assist in some efforts (i. As user transitions to the next screen. the property information is saved to the database User adds dependent information (store in same table in denormalized form) User hits save before asking credit info (not stored in same table) Just considering the above scenario. the info is saved to the database. 2. etc. Similarly. For example. A second common scenario is one in which fields in the “record” are filled out by database triggers. it may add extra work to the replication process. address) (lastuser. the following database write operations would be initiated by the application: insert loan_application (name. the Replication Agent must process 6 records – each of which will incur the same LTL translation. Final v2. The last one typically is not a problem for replicated systems. however. The simple fact of the matter is that any batch SQL statement logs each row individually in the transaction log. This can be seen in the following graph which compares a straight bcp in.1 After each screen.000 100. Consider what happens for each SQL statement as it hits ASE: • • • • • • • • SQL statement is parsed by the language processor SQL statement is normalized and optimized SQL is executed Task is put to sleep pending lock acquisition and logical or physical I/O Task is put back on runnable queue when I/O returns Task commits (writes commit record to transaction log) Task is put to sleep pending log write Task sends return status to client When this much overhead is executed for every row affected in a batch process. This is extremely rare and usually only is present in extremely high OLTP systems where contention avoidance is paramount.000 250. Batch Insert Speeds 800 700 600 Seconds 500 400 300 200 100 0 0 25.0. any distributed system is left with the unenviable task of moving the individual statements enmass (and frequently as one large transaction). an insert/select statement. 
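To make the cost concrete, an audit trigger of the kind described in the loan_application example might look like the sketch below (the key column application_id is hypothetical). Every screen "save" then produces the application's own write plus this second update, and the Replication Agent must translate both into LTL:

create trigger loan_application_audit_trig
on loan_application
for insert, update
as
begin
    -- stamp the row with the last user and time of change
    update loan_application
    set    lastuser = suser_name(),
           lastdate = getdate()
    from   loan_application la, inserted i
    where  la.application_id = i.application_id
end
go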
rather than filling out a structure/object in memory. Understanding Batch Processing Most typical batch processes involve one of the following types of scenarios: • • • Bulkcopy (bcp) of data from a flat file into a production table. During normal database processing. the process slows to a crawl. A single or multiple stream of individual atomic SQL statements affecting one row each.000 150. Replication was enabled in a Warm-Standby configuration for availability purposes. a bcp in using a batch size of 100. Bulk SQL statement via insert/select or massive update or delete statement. bulk SQL statements. Consequently. this led to an extremely high amount of contention within the table made worse by the continual page splitting to accommodate the increasing row size. This is more common than it should be as bcp-ing data is inherently problem-prone. what’s the problem with this? The problem is the dismal performance of executing atomic SQL statements vs. each screen saved the data to the database. you can guess the performance implications within Replication Server from such a design.in an unreplicated system . Although successful.000 bcp in bcp -b100 insert/select 100 grouped inserts Rows Figure 5 – Non-replicated Batch Insert Speeds on single CPU/NT 18 . the first two are – and it has nothing to do with Replication Server.000 200. So.000 50. and atomic inserts grouped in batches of 100 . of course. check constraints nor fire triggers.000 180. In fact.000 rows at a time as atomic transactions. the primary ASE can execute the batch SQL along the performance lines as indicated above – easily completing 250. a single update or delete operation could easily affect 100’s of thousands of rows. The problem is that a typical batch process may contain dozens to hundreds of such bulk SQL statements . While slow bcp is an order of magnitude slower than fast bcp. This leads to the first key concept that is indisputable. If the messaging system treats each transaction as a singular message to maintain transactional consistency. Notice that the results are fairly linear and show a marked difference between the grouped atomic inserts and any of the bulk statements (a factor of 700%). it would have the same problem as RS . a recent test with a common financial trading package that had a single delete of ~800. Consider for example.1 The above test was run on a small NT system.Final v2. the relative difference holds. it is limited to 50 statements per batch). So why is this important? One of the biggest causes in latency within a replicated environment is bulk SQL operations during batch processing . Using the above as an indication. As a result.000 Latency N/A 7-12 min 5-7 min 53 min Inbound Queue DSI Replicate ASE It is extremely important to realize.however this is even not attainable as it is unlikely that RS could group 100 inserts into a single batch (as we will see later. compile. RS was able to execute the same volume of inserts and achieve the same throughput. and finally the message system applies the data as SQL statements to the destination system.in particular high-impact update and delete statements. To further illustrate that this is not just a Replication Server issue. the message bus stores the messages to disk (if durable messaging is used). the lack of concurrency at the primary translates directly into replication performance problems at the replicate. the parse. people find it surprising that Replication Server has difficulty keeping up.000 15.000 120. 
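A minimal sketch of the "100 grouped inserts" case charted in Figure 5 (table name and row counts are illustrative): by committing once per 100 single-row inserts instead of once per row, the per-row commit and log-flush overhead listed above is removed. This is also essentially the transaction profile Replication Server itself replays at the replicate when transaction grouping is working well:

declare @i int, @j int
select @i = 0
while @i < 100000
begin
    begin tran
    select @j = 0
    while @j < 100
    begin
        insert into target_table (row_id, payload)
        values (@i + @j, replicate('x', 40))
        select @j = @j + 1
    end
    commit tran
    select @i = @i + 100
end
go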
This is clearly illustrated above in the insert batch 19 . it then repopulated the table using inserts of 1. optimize steps are either eliminated or only executed once. the bcp utility translates the rows of data into individual insert statements. In these cases. with parallel DSI’s. the replicate system unfortunately has to follow the atomic SQL statement route – and suffers mightily as it attempts to execute 250.0. Parallel DSI’s and smaller transactions must be used to avoid latency. the ever-common bulkcopy problem. the premise is false. Had the delete (above) note been clogging the system.slow execution by the target server.but rather the inability of the target dataserver to process each statement quickly enough that causes the latency. but for some reason is unbelievable as so many are quick to blame RS for the latency: Key Concept #1: Replication Server with a single DSI/single transaction will be limited in its ability to achieve any real throughput by the replicate data server’s (DBMS) performance. however. Consequently. there would have been near-zero latency for the inserts.5 . In the first place. it is still several orders of magnitude faster than individual insert statements that Replication Server will use at the replicate.each one compounding the problem. At this point. the message agent (such as TIBCO’s ADB) polls the messages from this table (similar to the RepAgent). The problem is that all that is in the transaction log is the 250. As a result. It was interesting to note that while the financial package used a single delete statement to remove the rows. Batch Process/Bulkcopy Concurrency In some cases. consider the typical messaging implementation: a message table is populated within ASE (similar to the transaction log). declarative referential integrity. Only if transactional consistency is ignored and the messages applied in parallel could the problem be overcome.000 individual inserts.000 rows in less than 2 minutes.not the SQL statement that caused the problem. it is not the Replication Server that can’t achieve the throughput . the best it could hope for would be 12 minutes of execution instead of 1. Beyond that point. To see the impact of this in real life. “Net gouge” for years has stated that during slow bcp.000 row images . the only difference between “slow” bcp and “fast” bcp is that the individual inserted rows are logged for “slow” bcp whereas in “fast” bcp only the space allocations are logged. If you think about what was mentioned earlier. Note that in the cases of the bcp or the single large insert/select. it is still a bulk operation and consequently does not validate user-defined datatypes. since RS is sending individual inserts.000 rows showed the following statistics (over several executions): Component Primary ASE (single delete stmt) Rep Agent RS (Inbound Queue) Outbound Queue Rows/Min 800. 100. Consider the 20 . cause less concurrency problems at the primary and improve replication throughput.hence the comparable performance of the insert/select (which would log each row as well). In fact. as noted earlier in the financial trading system example. An alternative approach in which a delete list is generated and then used to cursor through the main tables using concurrent processes may be more recoverable. As a result. Would Replication Server keep up? It probably would still lag. Replication Agent was tuned appropriately. by sequentially loading the tables. Would the Parallel DSI’s be used/effective? Most assuredly. 
DOL/RLL locking at the replicate database. an insert of ~800. If bcp’d sequentially using slow bcp.0. it would be held in the inbound queue until the commit record was seen by the SQT thread – as a large transaction. it should be easy to see how the Replication Server lagged behind.Final v2. If attempting to use parallel DSI’s. when replication is implemented. Even if bcp batching were enabled. it now may take only 2 hours to load the data (arguably less if not batching) and 3 hours at the replicate. it may take 1-2 hours to load the data. but not as much.000 row transactions executed using 10 parallel DSI’s completed at the replicate in the same amount of time as it took to execute at the primary . no matter how many were configured. a single purge script begins by deleting masses of records using SQL joins to determine which rows can be removed. Typically. Batch Scenario with Parallelism Now. Replication Agent probably was not tuned (batching and ltl_batch_size) as will be discussed in the next section. At the primary.000 statements as a large transaction vs. Checking the replicated database during this time shows extremely little CPU or I/O utilization and the maintenance user process busy only a fraction of the time. All the normal “things” are tried and even parallel DSI’s are implemented – all to no avail. concurrent DSI threads would suffer a high probability of contention.1 test (figure 5) as the bcp in this case was a “slow” bcp . The same scenario is evident in purge operations.000 rows in 1. Unfortunately. • • • • Some of the above will be addressed in the section specific to Parallel DSI tuning. It also illustrates a very key concept: Key Concept #2: The key to understanding Replication Server performance is understanding how the entire Replication System is processing your transaction. Further. however. as it had to apply it as a single transaction. Customer decides that Replication Server just can’t keep up. Optionally. especially on heap tables or indexes – due to working on a single table. DSI serialization was set to “wait_for_start” (see Parallel DSI tuning section). the Replication Server could only ever use a single DSI. Typical Batch Scenario Now. consider the scenario of a nightly batch load of three tables. the batch process at the replicate requires 8-10 hours to complete. The reality of the above scenario is that several problems contributed to the poor performance: • The bcp probably did not use batching (-b option) and as a result was loaded in a single transaction. The problem is of course that this is identical from a replication perspective as a bcp operation – a large transaction with no concurrency. tables partitioned (although not necessary for performance gains – if partitioned. consider what would likely happen if the following scenario was followed for the three tables: • • • • • • All three tables were bcp’d concurrently using a batch size of 100. this will force the use of the less efficient default serialization method of “wait_for_commit”. DOL/RLL is a must). it also meant that Replication Server only considered a small number of threads preserved for large transactions. this may incur multiple scans of the inbound queue to recreate the transaction records due to filling the SQT cache.any latency would be simply due to the RS processing overhead. 
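A sketch of the "delete list" purge approach mentioned above (table, column, and range values are hypothetical): the list of qualifying keys is built once, and several concurrent sessions each work a pre-assigned key range in small transactions, giving the replicate the same concurrency and small transaction sizes as the primary:

-- build the delete list once, using whatever join logic qualifies rows for purging
select order_id
into   purge_list
from   orders
where  order_date < dateadd(yy, -7, getdate())
go
-- each purge worker is assigned its own range; this block is one worker's loop
declare @low int, @high int, @batch int
select @low = 1, @high = 500000, @batch = 250
while @low <= @high
begin
    begin tran
    delete orders
    from   orders o, purge_list p
    where  o.order_id = p.order_id
      and  p.order_id between @low and @low + @batch - 1
    commit tran
    select @low = @low + @batch
end
go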
Lack of batch size in the bcp (-b option) more than likely drove Replication Server to use large transaction threads – while this may have reduced the overall latency in one area due to not having to wait for the DSI to see the commit record. Replication Server was tuned to recognize 1. Would the SQT cache size fill? Probably not. exceeding the time requirements and possibly encroaching on the business day. This was all accomplished with 10 parallel threads in RS with dsi_serialization_method set to ‘isolation_level_3’. the system predefined 10 ranges of rows (i. a single large transaction. 10001-15000. 5001-10000. but specified a rowcount of 100 or 250 as noted above. null. Ignoring the replication aspects. 2.5 and RS 12.5 running on the same host machine.5 can take advantage of a feature introduced with ASE 12. Accordingly.000 rows in a single transaction.).000 Row Bulk Insert Between Two Tables Method Single SQL statement (insert/select) 10 threads processing 1 row at a time 10 threads processing 100 ranged rows at a time* 10 threads processing 250 ranged rows at a time* Time (sec) 1 57 5 1 By ranged rows (*). default getdate() not null. An interesting test (some results were described above) was done on a dual processor (850MHz P3 standard (not XEON)) NT workstation with ASE 12.0 that allows the actual replication of a SQL statement. Replicating SQL for Batch Processing The fundamental problem in batch processing is that a single SQL statement at the primary is translated into thousands of rows at the replicate – each row requiring RS resources for processing and then the typical parse. Several batch inserts of 25. It is possible to achieve the same performance as large bulk statements by running parallel processes using smaller bulk statements on predefined ranges Atomic statement processing is slow This leads to a second key concept: Key Concept #3: The optimal primary transaction profile for replication is concurrent users updating/inserting/deleting small numbers of rows per transaction spread throughout different tables.000 rows were conduction from one database on the ASE engine to another using a Warm Standby implementation.0.000 rows each in batches of 100 than for a single process to delete 100. 1-5000. optimize and sleep pending I/O at the replicate dataserver delays.e. users of ASE 12. the best way to improve replication performance of large batch operations is to alter the batch operation to use concurrent smaller transactions vs. As each thread initialized. For updates and deletes.Final v2.000-100.5 and RS 12. default getdate() not null create unique clustered index rep_sql_idx on replicated_sql (sql_statement_id) go create trigger replicated_sql_ins_trig on replicated_sql for insert as begin 21 .000 row insert into one table from a different table (mimicking a typical insert from a staging table to production table): 50. etc. It then performed the same insert/select. By using 10 processes to perform the inserts in 250 row transactions in pre-defined ranges. It just means it is better from a replication standpoint for 10 processes to delete 1. it was assigned a specific range.1 following benchmark results from a 50.000 rows per second total throughput (and since ASE was configured for 2 engines. That does not mean low volume! It can be extremely high volume. this machine was sorely over utilized).0) varchar(1800) datetime datetime identity. the above benchmark easily demonstrates a couple of key batch processing hallmarks: 1. 
Consider the following code fragment: if exists (select 1 from sysobjects where name="replicated_sql" and type="U" and uid=user_id()) drop table replicated_sql go create table replicated_sql ( sql_statement_id sql_string begin_time commit_time ) go numeric(20. RS was still able to reliably achieve 750-1. byte 6 of @@options & 0x02 = 2 is on -.6. we'd better check if we can turn them on if proc_role('replication_role')=0 begin raiserror 30000 "%1!: You must have replication role to execute this procedure at the replicate". true go if exists (select 1 from sysobjects where name="sp_replicate_sql" and type="P" and uid=user_id()) drop proc sp_replicate_sql go create proc sp_replicate_sql @sql_string varchar(1800) as begin declare @began_tran tinyint. set a save point so we are well-behaved if @@trancount=0 begin select @began_tran=1 begin transaction rep_sql end else begin select @began_tran=0 save transaction rep_sql end -.substring(@@options.in unix.okay. @proc_name if @triggers_state=0 set triggers off return(-1) end else if @began_tran=1 commit tran if @triggers_state=0 set triggers off return (0) end go exec sp_setrepproc 'sp_replicate_sql'. @proc_name varchar(60) select @proc_name=object_name(@@procid) -. 22 .1)) & 0x02 = 0) begin select @triggers_state=0 -.1 declare @sqlstring varchar(1800) select @sqlstring=sql_string from inserted set replication off execute(@sqlstring) set replication on end go exec sp_setreptable replicated_sql.pubs2 With all tables named replicated_sql ( sql_statement_id identity. If already in tran. For NT.check for trigger state. 'function' go Then use the following replication definitions (this example is for a Warm Standby between two copies of pubs2 with a logical connection of WSTBY. @proc_name if @began_tran=1 rollback tran return(-1) end set triggers on end else begin select @triggers_state=1 end -. now we can do the insert insert into replicated_sql (sql_string) values (@sql_string) if @@error!=0 or @@rowcount=0 begin rollback tran rep_sql raiserror 30001 "%1!: Insert failed.pubs2) Create replication definition replicated_sql_repdef With primary at WSTBY.since triggers are off.0. @triggers_state tinyint. Transaction rolled back". the bytes may be swapped if (convert(int.check for tran state.Final v2. Inc. Sybase provided a capability to execute dynamically constructed SQL statements using the execute() function. Since it is a binary number. the only problem is that with Warm Standby. it now becomes a simple matter to replicate a proc that turns on triggers. particularly bcp’s are not able to use this for the simple fact that the source data needs to exist at the replicate already.pubs2 deliver as sp_replicate_sql ( @sql_string varchar(1800) ) send standby all parameters go Now.’CA’)” The trick is in the highlighted portions of the trigger and the stored procedure. it would appear to be a normal implementation. inserts a SQL string into a table. • 23 .5 and RS 12.@@options.5 versions of the products. the answer is it allows you to replicate truncate table or SQL deletes against the table when it begins getting unwieldy. Consider the following code snippet that might be used when moving rows from a staging database to the production system: create proc load_prod_table @batch_size int=250… as begin declare @done_loading tinyint select @done_loading=0 set rowcount @batch_size while @done_loading=0 begin insert into prod_table… select from staging_table if @@rowcount=0 select @done_loading=1 delete staging_table end end This appears to be fairly harmless. 
if you really want to amaze your friends. As stated. This is perhaps the biggest failure that affects performance. However. Rather than enabling triggers for the entire session and cause performance problems during the day. you need to consider the byte order on your host. if placed directly in a replicated procedure. simply execute something like the following: Exec sp_replicate_sql “insert into publishers values (‘9990’. the bulk insert into the production database using ‘insert into … select…’ can be replicated in this fashion as well.and without fully understanding the internal workings of ASE – implement an easy work around. if the execute() function is in a trigger. Of course. this is a neat trick for handling updates and deletes.5. However. Accordingly.0. particularly with partitioned tables.Final v2. which in turn triggers the execution of the string. Rather than simply indiscriminately turning the triggers off and then on and the beginning and end of the procedure. while it has been stated that this is limited to the 12. if batch feeds are bcp’d into staging databases on both systems (which should be done in WS situations).x version. The assumption is that the insert only READ ‘rowcount’ rows from the source data. if worker threads are involved. Inserts. but the SQL statement would be limited to 255 characters due to the varchar(255) limitation prior to ASE 12. two things are wrong with it: • The assumption is that the same rows selected for insert will be the same rows deleted. Batch Processing & Ignore_dupe_key Some of the more interesting problems arise when programmers make logical assumptions . this also provides us a way to audit the execution of batch SQL and compare commit times for latency purposes (even replicated SQL statements could run for a long time). Rep Agent behaves fine. why replicate both the table and the proc? Well.1 sql_string varchar(1800) ) primary key (sql_statement_id) send standby replication definition columns go create function replication definition sp_replicate_sql with primary at WSTBY. Remember. we simply borrow a trick that dsi_keep_triggers simply calls the ‘set triggers off’ command. Now then. However. @@options is an undocumented global variable that stores session settings – such as ‘set arith_abort on’. however. the Rep Agent stack traces and fails (a nasty recovery issue for a production database).’Sybase. Starting in ASE 12. the delete could affect other rows than those inserted. By the way. it will in fact work with any 12. we employ trick #2 .’. and assuming that the proc is NOT replicated. As a result.0. triggers are turned off by default via the dsi_keep_triggers setting (and it probably is off for most other normal replication implementations as well). and then the proc returns triggers to the original setting and exits. we simply insert the desired SQL statement in a table. this may not be the case. etc. However.’Dublin’. Additionally. Accordingly.00 records in which 50% of them are duplicates (every other row) or already exist in the target table. that setting rowcount affects the final result set – and does not limit any subqueries. on the second pass through the loop.0. the delete would only remove 250 of them."expected batch=2") insert into test_table_staging values (6. col_2 varchar(40) null ) go insert into test_table_staging values (1.Final v2. 
the insert would scan 250 rows it had already scanned and then an addition 500 rows to get 250 unique ones that it could insert."expected batch=3") go if exists (select 1 from sysobjects where name="lsp_insert_test_table" and type="P" and uid=user_id()) drop proc lsp_insert_test_table go CREATE PROC insert_test_table @batchsize INT = 5 AS BEGIN DECLARE @cnt @myloop @err @del SELECT INT. And the delete would remove 250."expected batch=1") insert into test_table_staging values (3. @myloop INSERT test_table (col_1.1 Why is the last bullet so important? Remember. @cnt = -1."expected batch=1") insert into test_table_staging values (4. it may require ASE to scan hundreds or thousands of rows to generate ‘rowcount’ unique rows for a table in which ignore_dupe_key is set for the primary key index. col_2 varchar(40) null ) go create unique nonclustered index test_table_idx on test_table (col_1) with ignore_dup_key go if exists (select 1 from sysobjects where name="test_table_staging" and type="U" and uid=user_id()) drop table test_table_staging go create table test_table_staging ( col_1 int not null. So."expected batch=3") insert into test_table_staging values (11.added to track deletes."expected batch=3") insert into test_table_staging values (11. even though 100."expected batch=1") insert into test_table_staging values (5."expected batch=2") insert into test_table_staging values (7. why is this a problem? Lets assume that we have a batch of 100."expected batch=2") insert into test_table_staging values (7. However. etc. @myloop = 1 SET ROWCOUNT @batchsize WHILE BEGIN @cnt != 0 select "Loop ----------. And so forth."expected batch=3") insert into test_table_staging values (10. INT."expected batch=1") insert into test_table_staging values (3.". Hence ‘select sum(x) from y group by a’ will return ‘rowcount’ rows despite the fact it may have to scan millions to generate the sums."expected batch=1") insert into test_table_staging values (2."expected batch=3") insert into test_table_staging values (12.750 rows already scanned plus the final 500 (with 250 unique). int -.000 rows with 50% unique and a batch size of 250 would suggest a fairly smooth 200 iterations through the loop. A reproduction of this problem (for the confused or interested) is as the below: use tempdb go if exists (select 1 from sysobjects where name="test_table" and type="U" and uid=user_id()) drop table test_table go create table test_table ( col_1 int not null. the insert would scan 500 rows already processed plus 500 new rows. by the last iteration. int."expected batch=2") insert into test_table_staging values (9. Assuming rowcount is set to 250. Essentially. As a result."expected batch=2") insert into test_table_staging values (8. it would mean that the insert would have to scan 500 rows in order to generate 250 unique ones to be inserted. the insert would be scanning 49. col_2) 24 . On the third pass. @err = 0. isql CHINOOK ---------col_1 col_2 ----------.---------------------------------------1 expected batch=1 ==> actual batch=1 2 expected batch=1 ==> actual batch=1 25 . executing the procedure without any parameter value should result in a ROWCOUNT limit of 5 rows: use tempdb go select * from test_table_staging go exec insert_test_table go select * from test_table go The output from this as executed is: ---------. consider the procedure execution – loop iteration 1 is contained below: ----------------.@myloop) test_table_staging @cnt = @@ROWCOUNT.0. --------test_table: (1 row affected) col_1 col_2 ----------.. 
@cnt set rowcount @batchsize DELETE test_table_staging set rowcount 0 select "test_table_staging:" select * from test_table_staging -. showing the original 15 rows containing 3 duplicates (3.@del set rowcount @batchsize select @myloop = @myloop + 1 END RETURN 0 END go Consider the following sample execution – since the default is set to 5...7.9.added to show what is inserted to this point. Now.added to show what is left select "Delete Rowcount = ". Note the highlighted rows (5.1 SELECT FROM SELECT col_1. select "Rowcount = " .----------Loop ----------1 (1 row affected) Duplicate key was ignored. and 10) and their expected batch.Final v2.---------------------------------------1 expected batch=1 2 expected batch=1 3 expected batch=1 3 expected batch=1 4 expected batch=1 5 expected batch=2 6 expected batch=2 7 expected batch=2 7 expected batch=2 8 expected batch=2 9 expected batch=3 10 expected batch=3 11 expected batch=3 11 expected batch=3 12 expected batch=3 (15 rows affected) The above is the output from the first select statement. and 11). col_2+" ==> actual batch="+convert(varchar(3). @err = @@ERROR set rowcount 0 select "test_table:" select * from test_table -. --------test_table: (1 row affected) col_1 col_2 ----------.----------Rowcount = 5 (1 row affected) ----------------test_table_staging: (1 row affected) col_1 col_2 ----------.Final v2. because the delete is an independent statement.---------------------------------------5 expected batch=2 6 expected batch=2 7 expected batch=2 7 expected batch=2 8 expected batch=2 9 expected batch=3 10 expected batch=3 11 expected batch=3 11 expected batch=3 12 expected batch=3 (10 rows affected) -----------------. it simply deletes the first 5 rows.----------Loop ----------2 (1 row affected) Duplicate key was ignored. the subquery select in the insert statement had to read 6 rows – consequently row_id 5 was actually inserted as part of the first batch.0. which contains the duplicate. However. Because of the duplicate row for row_id 3. Now. consider what happens with loop iteration #2: ----------------.---------------------------------------1 expected batch=1 ==> actual batch=1 2 expected batch=1 ==> actual batch=1 3 expected batch=1 ==> actual batch=1 4 expected batch=1 ==> actual batch=1 5 expected batch=2 ==> actual batch=1 6 expected batch=2 ==> actual batch=2 7 expected batch=2 ==> actual batch=2 8 expected batch=2 ==> actual batch=2 9 expected batch=3 ==> actual batch=2 10 expected batch=3 ==> actual batch=2 (10 rows affected) ----------.----------Delete Rowcount = 5 (1 row affected) Note what occurred.1 3 expected batch=1 ==> actual batch=1 4 expected batch=1 ==> actual batch=1 5 expected batch=2 ==> actual batch=1 (5 rows affected) ----------.----------Rowcount = 5 (1 row affected) ----------------test_table_staging: (1 row affected) col_1 col_2 ----------.---------------------------------------9 expected batch=3 10 expected batch=3 11 expected batch=3 26 . leaving row_id 5 in the list. Of course.----------Loop ----------3 (1 row affected) Duplicate key was ignored. Because of the row_id 5 is repeated and the duplicate for row_id 7.----------Delete Rowcount = 5 (1 row affected) ----------------. Finally. notice what occurred.----------Delete Rowcount = 5 (1 row affected) Again. 
the insert scans 7 rows to achieve the rowcount of 5.----------Loop ----------4 (1 row affected) --------test_table: (1 row affected) col_1 col_2 ----------.Final v2.---------------------------------------1 expected batch=1 ==> actual batch=1 2 expected batch=1 ==> actual batch=1 3 expected batch=1 ==> actual batch=1 4 expected batch=1 ==> actual batch=1 5 expected batch=2 ==> actual batch=1 6 expected batch=2 ==> actual batch=2 7 expected batch=2 ==> actual batch=2 8 expected batch=2 ==> actual batch=2 9 expected batch=3 ==> actual batch=2 10 expected batch=3 ==> actual batch=2 11 expected batch=3 ==> actual batch=3 12 expected batch=3 ==> actual batch=3 (12 rows affected) ----------.0.1 11 expected batch=3 12 expected batch=3 (5 rows affected) -----------------.---------------------------------------(0 rows affected) -----------------. leaving rows 9 &10 still in the staging table. the delete only removes the next five. we come to the last loop iteration: ----------------.---------------------------------------1 expected batch=1 ==> actual batch=1 2 expected batch=1 ==> actual batch=1 3 expected batch=1 ==> actual batch=1 4 expected batch=1 ==> actual batch=1 5 expected batch=2 ==> actual batch=1 6 expected batch=2 ==> actual batch=2 7 expected batch=2 ==> actual batch=2 8 expected batch=2 ==> actual batch=2 9 expected batch=3 ==> actual batch=2 27 . --------test_table: (1 row affected) col_1 col_2 ----------.----------Rowcount = 2 (1 row affected) ----------------test_table_staging: (1 row affected) col_1 col_2 ----------. ----------Rowcount = 0 (1 row affected) (return status = 0) col_1 col_2 ----------.---------------------------------------1 expected batch=1 ==> actual batch=1 2 expected batch=1 ==> actual batch=1 3 expected batch=1 ==> actual batch=1 4 expected batch=1 ==> actual batch=1 5 expected batch=2 ==> actual batch=1 6 expected batch=2 ==> actual batch=2 7 expected batch=2 ==> actual batch=2 8 expected batch=2 ==> actual batch=2 9 expected batch=3 ==> actual batch=2 10 expected batch=3 ==> actual batch=2 11 expected batch=3 ==> actual batch=3 12 expected batch=3 ==> actual batch=3 (12 rows affected) Normal Termination Output completed (1 sec consumed). band-aids don’t stick.0. again): • • Each loop iteration causes and additional 250 duplicate insert rows to be replicated along with 250 CLR records over the previous iteration By the last iteration. Consequently. it turns out to be tremendous performance boost. this actually did happen at a major bank.000 row delete using 250 row iterations.250 total CLR records (250+500+750+…49. yes. RS must then remove the duplicate inserts that the CLR records point to.925. Sometimes. given the data quality.1 10 expected batch=3 ==> actual batch=2 11 expected batch=3 ==> actual batch=3 12 expected batch=3 ==> actual batch=3 (12 rows affected) ----------.850. The point of this discussion is that even though the SQL to remove the duplicates from the staging table appeared to be a slower design than the quick “band-aid” of ignore_dupe_key. The implementation does not check to ensure that the rows inserted are the rows being deleted.000 row insert of 50. By this point you can determine that the following is occurring (assuming the 50. “SET ROWCOUNT” affects the number of rows affected by the statement vs. Can you guess the impact on: • • Your transaction log at the primary system (remember. In this case. as soon as each log page is flushed.750 duplicate insert records. the rows processed by subquery or other individual parts of the statement. 
and may have happened at least one more that we are aware of. bulk SQL first logs the affected rows and THEN applies them).750) and a duplicate number of inserts for a whopping total of 9. Consequently. Since the log page contains the duplicate rows for those being inserted (remember.500 unnecessary records on top of the 50. in reality. Consequently an insert limited by SET ROWCOUNT to 5 rows may have to read 6 or more rows if a duplicate is present. all those CLR and inserts are logged)!!! The Replication Server performance as it also removes all the duplicates!!! Oh. So what’s the problem?? A couple of points are key to understanding what is happening: • • When a duplicate is encountered. With all 200 iterations. causing subsequent batches to begin with duplicates. this seemingly innocent 100.750 CLR records plus 250 duplicate inserts from the last batch along with the 250 CLR records and then (last but not least) the 250 actually inserted rows.00 new rows results in an astounding 4. it replicates records for uncommitted transactions as well as committed. Because of the implementation. the server uses a Compensation Log Record (CLR) to undo a previous log record – in this case.Final v2. RS receives ~49. This is all in one transaction. some rows could be “dropped” without even being inserted. it also reads the CLR records – which is needful. the duplicate insert. the Rep Agent can read it. • Now then. since the Rep Agent can be fully caught up.000 rows really wanted. 49. each duplicate compounds the problem.500+49. 28 . however. these are not necessarily accurate. In reality. The Rep Agent starts scanning from the location provided by the Replication Server The Replication Agent scans for a configurable number (scan_batch_size) log records. The Rep Agent asks the Replication Server where the secondary truncation point should be. The Rep Agent asks the Replication Server who the maintenance user is for that database. The Replication Server looks the maintenance user up in the rs_maintusers table in the RSSD database and replies to the Rep Agent. The Replication Server looks up the locater in the rs_locaters table in the RSSD database and replies to the Rep Agent. After reaching scan_batch_size log records.3 using normal RAID devices (vs. The Replication Agent logs in to the Replication Server and requests to “connect” the source database (via the “connect source” command) and provides a requested LTL version. In this section we will be examining how the Replication Agent works – and in particular.5. Note that there are many factors that contribute to RepAgent performance – cpu load from other users. etc. The Rep Agent moves the secondary truncation point to the log page containing the oldest open transaction received by Replication Server.2 RepAgent thread on a single cpu NT machine is capable of sending >3. “Zero-ing the LTM” resets the secondary truncation point back to the beginning of the transaction log.0. The Replication Agent cannot read past the primary truncation point. two bottlenecks quite easily overcome by adjusting configuration parameters. 7. However. Secondary Truncation Point Management Every one knows that the ASE Replication Agent maintains the ASE secondary truncation point. the discussions on Log Transfer Language and the general Rep Agent communications are common to all replication agents as all are based on the replication agent protocol supported by Sybase. As mentioned earlier. since this paper does not yet address many of the aspects of heterogeneous replication. 
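For completeness, a sketch of the "remove the duplicates from the staging table first" design referred to above, reusing the test_table example (the predicates are illustrative). Once the staging data contains one row per key and no keys already present in the target, the batched insert/delete loop moves exactly the rows it reads and ignore_dup_key never has to discard anything:

-- collapse the staging table to one row per key value
select col_1, max(col_2) as col_2
into   test_table_staging_clean
from   test_table_staging
group  by col_1
go
-- drop any keys that already exist in the target table
delete test_table_staging_clean
from   test_table_staging_clean c, test_table t
where  c.col_1 = t.col_1
go
-- the batched load can now insert and delete matching key ranges safely
insert test_table (col_1, col_2)
select col_1, col_2 from test_table_staging_clean where col_1 between 1 and 250
delete test_table_staging_clean where col_1 between 1 and 250
go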
Readers should expect to achieve the same results if their system is notoriously cpu or network bound (for example). However. there is a lot more communication and control from the Replication Server in this process than realized. When this request is received. comments will be made that the ASE Rep Agent is not able to keep up with logging in the ASE. 29 . 5.6. 4. 3. Replication Agent performance is crucial.000 updates/second to Replication Server 12. for systems with large direct electronic feeds or sustained bulk loading. In a different type of test.0 is capable of maintaining over 2GB/Hr from a single database in ASE 11. SSD’s). there are a lot of misconceptions about the secondary truncation point and the Replication Agent. This is especially true if the bulk of the transactions originate from GUI-base user screens since such applications naturally tend to have an order of magnitude more reads than writes. including: • • • The Replication Agent looks for the secondary truncation point at startup and begins re-reading the transaction log from that point.1 Replication Agent Processing Why is the Replication Agent so slow??? Frequently. 8. At this writing.9. a properly tuned Rep Agent on a properly tuned transaction log/system will have no trouble keeping up. 6. As you would guess from the previous sentence. 2. the ASE 12. Repeat step 5. the Replication Server responds with the cached locater which contains the log page containing the oldest open transaction received from the Replication Agent.Final v2. For most normal user processing. Replication Agent Communication Sequence The sequence of events during communication between the Replication Agent and the Replication Server is more along the lines of: 1. In addition. network capabilities. the Replication Server writes this cached locater to the rs_locaters table in the RSSD. the Replication Agent requests a new secondary truncation point for the transaction log. Replication Server responds with the negotiated LTL version and upgrade information. this section should be read in the context of the ASE Replication Agent thread. a complete replication system based on Replication Server 12. 000 records. 30 . setting it to 20..e. In contrast to the last paragraph. most of the time that the Rep Agent asks the RS for something.000 improved overall RS throughput by 30%.db 300 [mode] lti 300 get maintenance user for ds.db 0x0000aaaa0000bbbbbbb select from rs_users where. the tradeoff to this is that the secondary truncation point stays at a single location in the log – translates to a higher degree of space used in the transaction log. if you notice.0. For example. Benchmarks have show that raising scan_batch_size can increase replication throughput significantly. a better solution would have been to improve the DSI or other throughput to allow it to keep up without throttling back the RepAgent. the RS has to check with the RSSD – or update the RSSD (i..1) Replication Agent Scanning The second can be overcome with a willingness to absorb more log utilization.db db_name_maint get truncation site.1 An interaction diagram for this might look like the following: RepAgent ct_connect(ra_user. the locater).0. As the other threads are able to keep up more now. 
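Relating to the scan_batch_size behavior in the sequence above, the commands below are commonly used to adjust it and to observe the secondary truncation point (database name and value are examples; sp_config_rep_agent changes generally take effect when the Rep Agent thread is restarted):

-- raise the number of log records scanned between truncation point requests
exec sp_config_rep_agent primary_db, 'scan batch size', '20000'
go
exec sp_stop_rep_agent primary_db
exec sp_start_rep_agent primary_db
go
-- from within the primary database: where is the secondary truncation point?
use primary_db
go
dbcc gettrunc
go
-- oldest open transaction currently holding up log truncation
select spid, name, starttime from master..syslogshold where dbid = db_id()
go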
don’t put the RSSD to far (network wise) from the RS.db 0x0000aaaa0000bbbbbbb insert into rs_locaters values (0x000aaaa0000…) Figure 6 – Replication Interaction Diagram for Rep Agent to RSSD The key elements to get out of this are fairly simple: • • • Keep the RSSD as close as possible to the RS Every scan_batch_size rows. In addition. the Rep Agent stops forwarding rows to move secondary truncation point.Final v2. The secondary truncation point is set to the oldest open transaction received by Replication Server – which may be the same as the oldest transaction in ASE (syslogshold) or it may be an earlier transaction as the Rep Agent has not yet read the commit record from the transaction log. While not definite.DB get truncation site.. there is considerable thought within Sybase that this has the same impact of exec_cmds_per_timeslice in that it "throttles" the RepAgent back and allows other threads to have more access time. select from rs_locaters. at an early Replication Server customer.. some have reported better performance with lower scan batch size – particularly in Warm Standby situations.000 log records happen pretty quickly. The default scan_batch_size is 1. there is less contention for the inbound queue (SQM reads are not delaying SQM writes)....ra_pwd) cs_ret_succeed connect source lti ds.. log_scan() LTL SQL Replicate DS. The result is that the Rep Agent is frequently moving the secondary truncation point.0. As anyone who has read the transaction log will tell you. Regarding the first. So. While decreasing the RepAgent workload is one way to solve the problem. database recovery time as well as replication agent recovery time will be lengthened as the portion of the transaction log that will be rescanned at database server and replication agent startup will be longer. The best place is on the same box and have the primary network listener for the RSSD ASE be the TCP loopback port (127. Of course. Rep Server RSSD select from rs_sites. select from rs_maintusers.1. 002PM'. Any agent that wishes to replicate data via Replication Server must use this protocol.001PM'. The basic commands are listed in the table below. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000005. @imagecol=hastext rep_if_changed.added for clarity distribute @origin_time='Apr 15 1988 10:23:23.@text_col=hastext always_rep.@moneycol=$1.005PM'.@floatcol=3. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000004.added for clarity distribute @origin_time='Apr 15 1988 10:23:23.12.@decimalcol=.rs_writetext append first last changed with log textlen=30 @text_col=~.!!?This is the text column value. request to retrieve maintenance user name to filter transactions applied by the replication system.added for clarity distribute @origin_time='Apr 15 1988 10:23:23.@charcol='first insert'.3. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000006. @binarycol=0xaabbccddeeff.added for clarity distribute @origin_time='Apr 15 1988 10:23:23. @datetimecol='4-15-1988 10:23:23.006PM'. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000003. 
get maintenance user get truncation distribute begin transaction commit/rollback transaction applied execute sqlddl append dump purge Used to distribute begin transaction statements Used to distribute commit/rollback statements Used to distribute insert/update/delete SQL statements Used to distribute both replicated procedures as well as request functions Used to distribute DDL to WS systems Used to distribute the dump database/ transaction log SQL commands Used during recovery to notify Replication Server that previously uncommitted transactions have been rolled back.56.2.@bitcol=1 -. this is a very simple protocol with very few commands. @tran_id=0x000000000000000000000001 applied 'ltltest'. @tran_id=0x000000000000000000000001 begin transaction 'Full LTL Test' -. @tran_id=0x000000000000000000000001 applied 'ltltest'.@identitycol=1.0. Fortunately. @numericcol=2. -.@smallintcol=1.rs_insert yielding after @intcol=1.56.1. @tran_id=0x000000000000000000000001 applied 'ltltest'.D *@4ª -. @smallmoneycol=$0.@rsaddresscol=1. @tran_id=0x000000000000000000000001 applied 'ltltest'. request to retrieve a log pointer to the last transaction received by the Replication Server.rs_writetext append @imagecol=~/!!7Ufw@4ª»ÌÝîÿðÿ@îO@Ý@y@f -.001PM'.rs_writetext append first changed with log textlen=119 @imagecol=~/!"!gx"3DUfw@4ª»ÌÝîÿðÿ@îO@Ý@y@f9($&8~'ui)*7^Cv18*bhP+|p{`"]?>.@realcol=2.Final v2. @smalldatetimecol='Apr 15 1988 10:23:23. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000001. 31 .added for clarity distribute @origin_time='Apr 15 1988 10:23:23.@tinyintcol=1.@varcharcol='first insert'.1 Rep Agent LTL Generation The protocol used by sources to replication server is called Log Transfer Language (LTL). @varbinarycol=0x01112233445566778899. A sample of what LTL looks like is as follows: distribute @origin_time='Apr 15 1988 10:23:23.003PM'. LTL Command connect source Subcommand Function request to connect a source database to the replication system in order to start forwarding transactions. much the same way that RS must use SQL to send transactions to ASE.004PM'.002PM'. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000002. the distribute command will make up most of the communication between the Rep Agent and the Rep Server. @numericcol=2. for a DML operation is determined by the replication agent from the transaction log.12. once retrieved. @binarycol=0xaabbccddeeff. @varcharcol='first insert'. @imagecol=notrep rep_if_changed.@tinyintcol=1. @smalldatetimecol='Apr 15 1988 10:23:23. @text_col=notrep always_rep.007PM'. @floatcol=3. you will notice several things: • • The appropriate replicated function (rs_update. the Rep Agent can simply skip to the next record. This information. 8. @binarycol=0xaabbccddeeff. If it was. @datetimecol='Apr 15 1988 10:23:23.@text_col=notrep always_rep.. then the XLS checks to see if the DML logged event is nested inside a stored procedure that is also replicated.002PM'. rs_insert.3. 32 . 3. If not. @bitcol=0 Although it looks complicated.2. @smalldatetimecol='Apr 15 1988 10:23:23. Although currently beyond the scope of this paper.@smallmoneycol=$0. 7.@rsaddresscol=1.2.@charcol='first insert'. @varbinarycol=0x01112233445566778899. <col name>=<value>. 2. …]] As you could guess. @moneycol=$1. can be cached for subsequent records. @imagecol=notrep rep_if_changed. 6. If not. @tinyintcol=1. (XLS) If so. @realcol=2. @rsaddresscol=1. 
The basic syntax for a distribute command for a DML operation is as follows: distribute <commit time> <OQID> <tran id> applied <table>. the obvious question is how does it get to that point? The answer is based on two separate processes – the normal ASE Transaction Log Service (XLS) and the Rep Agent. the Rep Agent proceeds with constructing LTL. (XLS) The XLS receives a log record to be written from the ASE engine (XLS) The XLS checks object catalog to see if logged object’s OSTAT_REPLICATED bit is set. @tran_id=0x000000000000000000000001 applied 'ltltest'. @moneycol=$1.56.1 @tran_id=0x000000000000000000000001 applied 'ltltest'.002PM'. @bitcol=1 after @intcol=1. @smallintcol=1. This improves Replication Agent performance by reducing the size of the LTL to be transmitted and allowing the Replication Agent to drop columns not included in the replication definition. @varcharcol='first insert'. Looking closely at what is being sent. (XLS) If not. @charcol='updated first insert'.rs_writetext append last @imagecol=~/!!Bîÿðÿ@îO@Ý@y@f9($&8~'ui)*7^Cv18*bh -. @floatcol=3. in general.@identitycol=1.added for clarity distribute @origin_time='Apr 15 1988 10:23:23. dump records. <col name>=<value>.@datetimecol='Apr 15 1988 10:23:23.3. @identitycol=1. then the XLS sets the log record’s LSTAT_REPLICATE flag bit (XLS) The XLS writes the record to the transaction log (RA) Some arbitrary time later. etc.002PM'.@decimalcol=.Final v2. 9. etc. the LTL distribute command illustrated above does leave us with another key concept: Key Concept #4: Ignoring subscription migration. The process is similar to the following: 1. The DIST/SRE determines which functions are sent according to migration rules. Currently. 5. Having determined what the Replication Agent is going to send to the Replication Server. the above is fairly simple – all of the above are distribute commands for a part of a transaction comprised of multiple SQL statements. etc. @decimalcol=.12. the ASE Replication Agent does not support this interface. 4. the Rep Agent determines if the log record is a “special log record” such as begin/commit pairs.56. If it is set. However.) is part of the LTL (highlighted above) The column names are part of the LTL The latter is not always the case as some heterogeneous Replication Agents can cheat and not send the column names (assuming Replication Definition was defined with columns in same order or through a technique called “structured tokens”. the XLS simply skips to writing the log record.1.0.56. while the DSI determines the SQL language commands for that function. @varbinarycol=0x01112233445566778899. rs_update.002PM'. Rep Agent proceeds to LTL generation.rs_update yielding before @intcol=1. the appropriate replication function rs_insert.@realcol=2. the XLS simply skips to writing the log record.1.56. @smallmoneycol=$0. (RA) If so.@smallintcol=1.@numericcol=2. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000007. this is achieved by the Replication Agent directly accessing the RSSD to determine replication definition column ordering. …]] [after <col name>=<value> [. (RA) If not.<function> yielding [before <col name>=<value> [. the Rep Agent reads the log record (RA) The Rep Agent checks to see if the log record’s LSTAT_REPLICATE bit is set. If ‘batch_ltl’ is true. This latter is due to the fact that the Rep Agent can only read flushed log pages (flushed to disk). it reads the objects metadata from system tables (syscolumns). 
LTL batching can significantly improve Rep Agent processing as it can scan more records prior to sending the rows to the Rep Server (effectively a synch point in Rep Agent processing). This process is illustrated below. more procedure cache may be necessary on systems with a lot of activity on large numbers of tables. consequently. (RA) Rep Agent constructs LTL statement for the logged operation 14. This TIPSA is then used to find the data row for the text modification.ASE XLS and Replication Agent Execution Flow The following list summarizes this into key elements how this affects replication performance and tuning.0. the Rep Agent waits until the LTL buffer is full prior to sending the records to the Rep Server. ASE XLS Service Receive log record NO Is OSTAT_REPLICATED set? NO Rep Agent Processing Read next record from transaction log Does record have LSTAT_REPLICATE set NO YES YES Nested in Store Procedure? NO Is record BT/CT or schema change YES Store Procedure OSTAT_REPLICATED set? NO YES Set log record’s LSTAT_REPLICATE bit Is logged operation an update YES Read before/after image from log NO Write record to transaction log Is operation a writetext YES Find datarow for writetext NO Is replicated object metadata in RA cache NO YES Read object metadata from syscolumns Construct LTL rs_datarow_for_writext LTL for text/image chain Is LTL batching on YES Pause for LTL buffer to fill NO Send LTL to Replication Server Figure 7 . The data row for writetext is then constructed in LTL. In addition. Replicating text/image columns can slow down Rep Agent processing of the log due to reading the text/image chain. (RA) The Rep Agent checks to see if the logged row was a text chain allocation. • Replication Agent has a schema cache to maintain object metadata (schema cache) for constructing LTL as well as tracking transactions (transaction cache). (RA) The Rep Agent checks to see if the operation was an update. 13. If so. Then the text chain is read and constructed into LTL chunks of text/image append functions.e. (RA) If ‘batch_ltl’ parameter is false (default). for which no subscriptions or Warm Standby exists) has a negative impact on Rep Agent performance as it must perform LTL generation needlessly. 11. Marking objects for replication that are not distributed (i. 12. the Rep Agent passes the LTL row to the Rep Server using the distribute command. it will always be working on a different log page than the XLS service. Rep Agent checks it’s own schema cache (part of proc cache) to see if the logged object’s metadata is in cache.1 10.Final v2. If so. As a result. In • • • 33 . careful monitoring of the system metadata cache to ensure that physical reads to system tables are not necessary. The two services are shown side-by-side due to the fact that they are independent threads within the ASE engine and execute in parallel on different log regions. If not. it reads the text chain to find the TIPSA. (RA) LTL Generation begins. it also reads the next record to construct the before/after images. the server can begin parsing the message. it can send the next packet. in practice. replicating procedures for large impact transactions could improve performance significantly. ASE engineering added several new configuration parameters to the replication agent. 34 . Unfortunately. nowhere does it say that enabling replication slows down the primary by resorting to all deferred updates vs.5). they do have to synchronize periodically for the client to receive error messages and statuses. however. 
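Where text/image replication is not actually needed by any subscription, the columns can be altered so that the Rep Agent does not have to walk the text chains at all – a minimal sketch, using illustrative pubs2 table and column names:

-- stop replicating an image column that no subscription needs
use pubs2
go
exec sp_setrepcol 'au_pix', 'pic', 'do_not_replicate'
go
-- or only send the text value when it has actually changed
exec sp_setrepcol 'blurbs', 'copy', 'replicate_if_changed'
go
-- with no status argument, sp_setrepcol simply reports the current setting
exec sp_setrepcol 'blurbs', 'copy'
go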
As this was a frequent cause of criticism. the best way to improve Rep Agent performance is to minimize it’s workload.5. achieving greater parallelism between the two processes If the Rep Agent configuration batch_ltl is true. the messages are sent via passthru mode to the Rep Server. Note that in the above list. it is sent to the Rep Server. replicating the table will require 1. Rep Agent will batch LTL to optimize network bandwidth (although the TDS packet size is not configurable prior to ASE 12. For example. The last sentence may not make sense yet.1 addition. The default is "false" according to the manuals. a deferred one. In addition. the commands are sent in batches. However. If not. However. Replication Agent Tuning Prior to ASE 12. Asynchronously. The destination server can begin parsing the messages (but not executing) as received. By replicating the procedure only a single LTL statement will need to be generated and processed by Replication Server. When the client is done. a client can send multiple packets to the server without having to wait for the receiver to process them fully. typical client connections to Adaptive Server Enterprise are not passthru connections. Every 2K. In passthru mode. A common question is “What does it mean by passthru mode?” The answer lies in how the server responds to packets. Replication Agent Communications The Rep Agent connects to the Replication Server in “PASSTHRU” mode. for the before and after images respectively. consequently. While an update will generate two log records. these “extra” rows will consume space in the inbound stable queue and valuable CPU time for the distributor thread. a replicated procedure only requires a single row for the Replication Agent to process no matter how many rows are affected by it.5* Explanation Specifies whether RepAgent sends LTL commands to Replication Server in batches or one command at a time. the Rep Agent synchs with the Rep Server by sending an EOM (at an even command boundary – EOM can not be placed in the middle of an LTL command).000 LTL statements to be generated (and compared in the distributor thread).000 rows. This can be achieved by not replicating text/image columns where not necessary and ensuring only objects for which subscriptions exist are marked for replication. it sends an End-Of-Message (EOM) packet that tells the server to process the message and respond with status information. In either case. How this is achieved as well as the benefits and drawbacks are discussed in the Procedure Replication section. Key Concept #5 – In addition to Rep Agent tuning. By contrast. if a procedure modifies 1. the existence of the two log records has led many to mistakenly assume that replication reverts to deferred updates. This technique provides the Rep Agent/Rep Server communication with a couple of benefits: • • Rep Agent doesn’t have to worry if the LTL command spans multiple packets. in-place updates.0. A way to think of it is that the client can simply start sending packets to the server and as soon as it receives packet acknowledgement from the TDS network listener. The reason is that this was always a myth. • Procedure replication can improve Rep Agent throughput by reducing the number of rows for which LTL generation is required. as each LTL row is created. the ASE server processes the commands immediately on receipt and passes the status information back to the client. most current ASE’s default this to “true”. 
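A quick way to verify and, if necessary, enable LTL batching for a given primary database (the database name is illustrative; most RepAgent parameters only take effect after the thread is restarted):

-- report the current RepAgent configuration, including 'batch ltl'
exec sp_config_rep_agent pubs2
go
-- enable LTL batching if it is still reported as 'false'
exec sp_config_rep_agent pubs2, 'batch ltl', 'true'
go
-- restart the RepAgent thread so the change is picked up
exec sp_stop_rep_agent pubs2
exec sp_start_rep_agent pubs2
go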
Some of these new parameters as well as other pre-existing parameters are listed below: Parameter (Default) batch ltl Default: True Suggest: True (verify) ASE 11. the Replication Agent thread embedded inside ASE could not be tuned much. When set to "true".Final v2. the actual modification can be a normal update vs. 1 or earlier as this may cause RepAgent to skip or truncate wide data. it is normally the primary database. Specifies whether." For Replication Server versions 12.5 and later.RepAgent shuts down if it encounters log records containing widedata. ·off . This option requires the Replication Server Advanced Security option as well as the Security option for ASE to enable SSL-based data integrity. wider columns and parameters. ·skip . For Replication Server versions 12." Specifies whether to encrypt all messages sent to Replication Server. the default value is "off. it is normally the data server for the primary database.1 and earlier.1 Parameter (Default) connect database Default: [dbname] Suggest: [dbname] connect dataserver Default: [dsname] Suggest: [dsname] data limits filter mode Default: stop or off Suggest: truncate ASE 11. 11.5 12. Specifies how RepAgent handles log records containing new." Specifies the amount of time after Rep Agent has reached the end of the transaction log and no activity has occurred before the Rep Agent will fade out it’s connection to the Replication Server.5 fade_timeout Default: 30 11.RepAgent skips log records containing wide data and posts a message to the error log. before attempting to send them to Replication Server.RepAgent truncates wide data to the maximum the Replication Server can handle.Final v2.RepAgent allows all log records to pass through. This is the database name RepAgent uses for the connect source command. or larger column and parameter counts. Specifies whether all messages exchanged with Replication Server should be checked for tampering. Warning! Sybase recommends that you do not use the "data_limits_filter_mode. ·stop .0 12.0 12.0. or to stop. the default value is "stop. This command is still supported as of ASE 12. Specifies whether to check the source of each message received from Replication Server. when Sybase Failover has been installed. This option requires the Replication Server Advanced Security option as well as the Security option for ASE to enable SSL-based data encryption.0 12. The default is "true. This is the data server name RepAgent uses for the connect source command.5 Explanation Specifies the name of the temporary database RepAgent uses when connecting to Replication Server in recovery mode.5.0 12.2 although not reported when executing sp_config_rep_agent to get a list of configuration parameters and their values. Specifies the name of the data server RepAgent uses when connecting to Replication Server in recovery mode. ·truncate . The default value of data limits filter mode depends on the Replication Server version number.5* ha failover Default: true Suggest: true msg confidentiality Default: false Suggest: false msg integrity Default: false Suggest: false msg origin check Default: false Suggest: false msg out-of-sequence check Default: false Suggest: false 12. Specifies whether to check the sequence of messages received from Replication Server. RepAgent automatically starts after server failover.0 35 . off" setting with Replication Server version 12. 5 11.5* rs username 11. 
Specifies the number of seconds RepAgent sleeps before attempting to reconnect to Replication Server after a retryable error or when Replication Server is down. Specifies the maximum number of log records to send to Replication Server in each batch. RepAgent again queries Replication Server for a secondary truncation point after scan timeout seconds. Specifies whether RepAgent should require mutual authentication checks when connecting to Replication Server. Controls the duration of time table or stored procedure schema can reside in the RepAgent schema cache before expiring.0 Explanation Specifies whether messages received from Replication Server should be checked to make sure they have not been intercepted and replayed.Final v2.5* 11. This is stored in encrypted form in the sysattributes table. If network-based security is enabled and you want to establish unified login. The new or existing password that RepAgent uses to connect to Replication Server. RepAgent asks Replication Server for a new secondary truncation point.5* scan_batch_size Default: 1000 Suggest: 10. This option is not implemented. When the maximum number of records is met. The name of the Replication Server to which RepAgent connects and transfers log transactions.0. This is a factor. The default is 1000 records. The new or existing user name that RepAgent thread uses to connect to Replication Server.5* schema_cache_growth_factor Default: 1 Suggest: 1-3 12. If Replication Server has acknowledged all records and no new transaction records have arrived at the log. Range is 1 to 10. scan timeout 'scan_timeout_in_seconds' RepAgent continues to query Replication Server until Replication Server acknowledges previously sent records either by sending a new secondary truncation point or extending the transaction log. The default is 60 seconds. The thread execution priority for the Replication Agent thread within the ASE engine. you must specify NULL for repserver_password when enabling RepAgent at the database.1 Parameter (Default) msg replay detection Default: false Suggest: false mutual authentication Default: false Suggest: false priority Default: 5 Suggest: 4 retry_time_out Default: 60 ASE 12.5* rs servername 11. 12. Specifies the network-based security mechanism RepAgent uses to connect to Replication Server. The default is 15 seconds.0 12. This should not be adjusted for low volume systems. so setting it to ‘2’ doubles the size of the schema cache. Larger values mean a longer duration and require more memory. This is stored in the sysattributes table.5* rs password 11. Specifies the number of seconds that RepAgent sleeps once it has scanned and processed all records in the transaction log and Replication Server has not yet acknowledged previously sent records by sending a new secondary truncation point.5 Security mechanism 12. Accepted values are 4-6 with the default being 5. This is stored in the sysattributes table.000+ for high volume systems only scan_time_out Default:15 Suggest: 5 11.0 36 . RepAgent sleeps until the transaction log is extended. 5 skip unsupported features Default: false Suggest: false trace flags Default: 0 trace log file Default: null Suggest: [filename as needed] Traceoff Traceon unified login Default: false Suggest: false 11. with the default of 2K. Disables Replication Agent tracing activity. and system transactions to the warm standby database. Larger send buffer sizes will reduce network traffic. Enables Replication Agent tracing activity. Valid values are true/false.5* 11. 
Since every LTL command contains the oqid. 37 . Could severely degrade Rep Agent performance due to file I/O. The default is "false. this has the ability to significantly reduce network traffic.Final v2. Specifies whether RepAgent sends information about maintenance users.1 Parameter (Default) send_buffer_size Default: 2K Suggest: 8-16K ASE 12. This option is normally used if Replication Server is a lower version than Adaptive Server. See example later.5.5* 11." Instructs RepAgent to skip log records for Adaptive Server features unsupported by the Replication Server.5 was the first ASE with the Rep Agent Thread internalized.5 Explanation Determines both the size of the internal buffer used to buffer LTL as well as the packet size used to send the data to the Replication Server." Specifies whether RepAgent ignores errors in LTL commands. or 16K (case insensitive)." RepAgent logs and then skips errors returned by the Replication Server for distribute commands.5. ** In ASE 12.0.5* 12. as it has to do less sends. specifies whether RepAgent seeks to connect to other servers with a security credential or password.5* 11." send maint xacts to replicate Default: false Suggest: false (don’t change) send structured oqids Default: false Suggest: true 11." Specifies whether the Replication Agent will send queue IDs (OQIDs) to the Replication Server as structured tokens or as binary strings (the default). this may be ‘fixed’ in a later EBF – if so. the short_ltl_keywords parameter seemed to operate in the reverse – setting ltl_short_keywords to ‘true’ resulted in the opposite of what was expected. When a network-based security system is enabled. subcommands. The default is "false.5 send_warm_standby_xacts Default: false for most. but sometimes used different names. When set to "false.1. schema." RepAgent shuts down when these errors occur.5* 12. default is false. corrective action may be required. true for Warm Standby 11.5* short ltl keywords Default: false** Suggest: false** ( true)** skip ltl errors Default: false Suggest: false 12. The default is "false." This is a bitmask of the RepAgent traceflags that are enabled. etc." Similar to "send structured oqids". Accepted values are: 2K. The valid traceflags are in the range 9201-9220 (not all values are valid).5 11. However. an external Log Transfer Manager (LTM) was used – it had similar parameters for those above. Specifies the full path to the file used for output of the Replication Agent trace activity. The default is "false. Note that this is not tied to the ASE server page size.0 * Some parameters above are noted as having been first implemented in ASE 11. whether using this parameter or not.5. LTL keywords are commands. 4K.0. this specifies whether the Replication Agent will use abbreviated LTL keywords to reduce network traffic. 8K. This is due to the fact that ASE 11. The default value is "false. This option should be used only with the RepAgent for the currently active database in a warm standby application. Specifies whether RepAgent should send records from the maintenance user to the Replication Server for distribution to subscribing sites.5 11. This option is normally used in recovery mode. When set to "true. Prior to ASE 11. The default is "false. one customer had a number of larger OLTP systems and the usual collection of lesser volume systems. take care in assuming that these settings can be adopted as “standard configurations” and applied unilaterally. While your optimal configuration may differ. 
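As a hedged illustration of how these parameters are actually applied – the procedure calls are standard sp_config_rep_agent syntax, but the values shown are examples for a high-volume primary and should be validated against your own workload:

-- example tuning pass for a busy OLTP primary (values are illustrative only)
exec sp_config_rep_agent pubs2, 'scan batch size', '10000'
exec sp_config_rep_agent pubs2, 'send buffer size', '8K'
exec sp_config_rep_agent pubs2, 'priority', '4'
go
-- restart the RepAgent thread for the new settings to take effect
exec sp_stop_rep_agent pubs2
exec sp_start_rep_agent pubs2
go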
there are 8 priority levels with the lower levels having the highest execution priority (similar to operating system priorities). while adjusting scan_batch_size (and other settings) to drastically higher values may help in high-volume situations. a couple of the new parameters take a bit more explanation and are detailed in the following paragraphs. priorities 3-6 are the only ones associated with user tasks with 4-6 corresponding to the Logical Process Manager’s EC1-EC3 Execution Classes. Scan_Batch_Size As mentioned in the description. each RepAgent being started will be bound to the next available engine. several of the configuration parameters that will have the most impact on performance have been high-lighted. For example. The RepAgent is then placed at specified priority on the runnable queue of the affinitied engine. What if more than one database is being replicated? How are the cpu’s distributed to avoid cpu contention with one engine attempting to service multiple Rep Agents running at “highest” priority level of 3? At start-up. care should be taken to avoid monopolizing a cpu.000 batch size at which point the secondary truncation point would finally be moved. Best approach is for an OLTP system is to set the priority initially to 4 and see how far the Rep Agent lags (after getting caught up in the first place). As a result. it checks to see if the log has been extended. If user processes begin to suffer. if multiple engines are available. a suggested configuration setting is mentioned. Although a setting of “3” allows a Replication Agent thread to be scheduled more often than user threads. The reason is that when the RepAgent reaches the end of the log portion it was scanning.000 as it did benefit the larger systems. the secondary truncation point may not move during that time period – significantly impacting log space. These levels are: Level 0 1 2 3 4 5 6 7 Priority Kernel Reserved Reserved Highest High Medium Low Idle CPU EC1 Execution Class EC2 Execution Class EC3 Execution Class Maintenance Tasks Housekeeper Default for all users/processes Rep Agent highest in 12. A discussion about these is not included here as in each of the above. this is possible. the RepAgent is affinity bound to a specific ASE engine. the most frequently asked for feature to the ASE Replication Agent thread. However. However.Final v2. If ASE is unable to affinity bind the RepAgent process to any available engines. Subsequent Replication Agents are then bound in order to the engines.5. the LPM EC Execution Classes did not apply to the Replication Agent Threads (nor any other system threads). In an attempt to adopt “standard configurations” (always a hazardous task). there was no way to control a Replication Agent’s priority. it does so without requesting a secondary truncation point if the scan_batch_size has not been reached. than additional cpu’s and engines may have to be added to the primary to avoid 38 . in one of the lesser systems. but are a low enough volume that it would take hours or days to reach the scan_batch_size. ASE error 9206 is raised. this setting should be left at the default or possibly decreased. in high volume environments. Then.0.1 In the above tables. Within ASE. only if necessary bump the priority up to 3. It turned out that the system only had about 140 transactions per hour – which would take about 48 days to reach the 20. they had adopted a scan_batch_size setting to 20. If so. Rep Agent Priority Beyond a doubt. it simply starts scanning again – while not starting over. 
the first RepAgent will be bound to engine 0 and the second RepAgent will be bound to engine 1. As of ASE 12. Although attempted by many. Consequently.5 Priority Class Kernel Processes As illustrated above. In addition. The reason should be clear from the description – the RepAgent stops scanning to request a secondary truncation point less often. setting scan_batch_size higher can have a noticeable improvement on Replication Agent throughput. the transaction log started filling and could not be truncated. For example: if max online engines = 4. was the ability to increase the priority. Ouch!! Consequently.5. these are a good starting point. until ASE 12. in very low volume environments. if the system is experiencing “trickle” transactions which always extend the log. the “distribute” command is represented by the token “_ds”. before the packet size could be adjusted. -. significant work is involved in preparing data for transfer. The Replication Server does not acknowledge that the data from the Replication Agent has been received until it has been written to disk. even without the scan_batch_size. some of these capabilities have been introduced in the Replication Agent thread internal to ASE.we wish – make us an offer commit tran 39 . As a result. While the 2K setting at first glance may seem the logical choice. price. increasing the priority of the RepAgent may help. As of ASE 12. 3. send_structured_oqids and short_ltl_keywords. As a result. For example. optimal network utilization can be achieved.contrary to belief. by increasing the send_buffer_size.0. The more data is segmented into packets. title. when the Replication Agent was internalized in ASE 11. managing the TCP/IP layers and handling network interrupts requires significant CPU involvement. TCP/IP typically penalizes systems that transmit a large number of small packets. 0. using short LTL keywords.5. specifically by reducing the amount of overhead in the LTL protocol and compressing the data values. Additionally.’Dublin’. We would use the following SQL statements: Begin tran add_book Insert into publishers values (‘9990’.5. Structured tokens are a mechanism for dramatically reducing the network traffic caused by replication. In terms of effort. this could involve an update to the RSSD to record the new space allocation. structured tokens for data values. In the full structured token implementation. By allowing the user to specify the size of the internal buffer/packet size. These counters are described in much more detail in the "Replication Agent Troubleshooting: Using sp_sysmon' section below Structured Tokens Heterogeneous Replication Agents have had the capability for a while to send the Replication Server structured tokens and shortened key words. total_sales. advance. For example. 2000’. the maximum frame size supported by the networking link layer has an impact on CPU utilization. These two new parameters.’.make up a number for number of times downloaded ‘This what happens on sabbaticals taken by geeks – and why Sybase still offers them’. let’s say we want to add this white paper to the list of titles in pubs2 (ignoring the author referential integrity to keep things simple). increasing the priority will do little as the actual cause is elsewhere. Inc.0. the number of sync points is decreased and overall network efficiency improved. As a result.1 Rep Agent lag while maintaining performance. the internal buffer also had to be adjusted. 
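The two LTL-trimming options discussed here can be enabled per database with sp_config_rep_agent – a minimal sketch (database name illustrative; as cautioned in the parameter table, verify the behavior of short LTL keywords on your specific ASE release first):

exec sp_config_rep_agent pubs2, 'send structured oqids', 'true'
exec sp_config_rep_agent pubs2, 'short ltl keywords', 'true'
go
-- restart the RepAgent so the new LTL formatting is used
exec sp_stop_rep_agent pubs2
exec sp_start_rep_agent pubs2
go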
The last has been an extremely frequent request – to be able to control the size of the packets the Replication Agent uses – similar to the db_packet_size DSI tuning parameter. To aid in this. focus strictly on reducing the overhead of the LTL protocol and do not attempt to reduce the actual column values themselves.’CA’) Insert into titles (title_id. notes. The process of dividing data into multiple packets for transfer. it may not be the optimal setting. the average LTL distribute command would be shortened by a total of 20 bytes. etc. 0) –. pubdate. If a significant amount of time is spent waiting on the cpu (WaitEventID’s 214 & 215).00. there is an implicit sync point every 2K of data from servers previous to ASE 12. however. changing the priority will only have a positive effect when the ASE engine cpu time is being monopolized by user queries.0.00. ASE 12. contract) values (‘PC9900’. pub_id. including using shortened LTL key words. within the Replication Server. If a new segment needs to allocated.’Replication Server Performance & Tuning’. It should be noted that the earlier LTM’s already had an internal buffer of 16K.’9990’. we didn’t get paid extra 100. this buffer was reduced to 2K – more than likely to reduce the latency during low to mid volume situations.Final v2.’Sybase. The transport layer limits the TCP packet size to the maximum network interface frame size to avoid fragmentation. type. Send_buffer_size As noted above. There is a word of caution about this – you may not see any improvement in performance by raising the execution priority in current ASE releases as the main bottleneck isn't the ASE cpu time. but rather the ASE internal scheduling for network access and the RS ability to process the inbound data to the queue fast enough.7+ and 12.free to all good Sybase customers 0. 2. Consequently.5. The size of the internal buffer used to hold LTL until sent to the Replication Server The amount of LTL sent each time The packet size used to communicate with the Replication Server. this is achieved in a number of ways.5. Consequently. This can be determined by monitoring monProcessWaits for the RepAgent spid/kpid. If not.3+ added several new sysmon counters. the processing of the Replication Agent user thread and SQM is nearly synchronous for recovery reasons.’popular_comp’. -. While a savings of 7 bytes for one command may not appear that great. the more CPU resources are needed.0. -. the send_buffer_size parameter really affects three things: 1. for high volume systems. ‘November 1. 6 0x000000000000445800034348494e4f4f4b7075627332 _cm tran Turning on both short_ltl_keywords and structured oqids.~$)pub_name=~"-Sybase. it uses it's own memory outside of the main ASE pool.6 0x000000000000445800034348494e4f4f4b7075627332 _ap owner =~"$dbo ~"'titles. As you can see by the first example..~$)con REPAGENT(4): [2002/09/08 17:55:46.~$%%city=~"'Dublin.0.7[000000000000]DX[00]'CHINOOKpubs2 applied owner =~"$dbo ~"'titles.A[000000000000]DX[00000c]@[00]([00000c]@[00]'[0000928101]'wO[00000000].6 ~.~$.~$&price=~(($0. the LTL command verbs are replaced with what kind of looks almost like abbreviations. REPAGENT(4): [2002/09/08 17:55:46.1 Tracing the LTL under normal replication (see below). When the rep agent reads a DML before/after image from the log it first checks this cache. and were traced from the EXEC module consequently.A[000000000000]DX[00000c]@[00]+[00000c]@[00]'[0000928101]'wO[00000000]. add another column.~$'pub_id=~"%%9990.6 ~.6 ~.~$%%type=~"popular_comp.~$%%city=~"'Dublin. 
length tokens and data values remain untouched in both streams.6 0x000000000000445800034348494e4f4f4b7075627332 _ap owner =~"$dbo ~"+publishers.7[000000000000]DX[00]'CHINOOKpubs2 commit transaction ** A couple of comments – this is ASE 12. the ‘false’ setting appears to be backwards for the short_ltl_keywords as setting it to ‘true’ along with structured_oqids results in the second sequence.4 ~. with short_ltl_keywords set to ‘false’.A[000000000000]DX[00000c]@[00]'[00000c]@[00]'[0000928101]'wO[00000000]. As a result.~$&title=~"HReplication Server Performance & Tuning.24] [00000000]DX[00000c]@[00])[00000c]@[00]'[0000928101]'wO[00000000].total_sales=100 .a hash table lookup in schema cache is quicker than a logical i/o in metadata cache). if you insert a row. It used to be (11. REPAGENT(4): [2002/09/08 17:55:12.~!*rs_insert _yd _af ~$'pub_id=~"%%9990.~$(pubdate=~*620001101 00:00:00:000.24] tract=0 distribute 1 ~*620020908 17:55:45:543. the schema cache may grow (somewhat). As a result. we get the following LTL stream: REPAGENT(4): [2002/09/08 17:55:12.~$&title=~"HReplication Server Performance & Tuning.~$&price=~(($0.~$(advance=~(($0.e.~$)contract=0 _ds 1 ~*620020908 17:55:32:543.~$'pub_id=~"%%9990.7[000000000000]DX[00]'CHINOOKpubs2 applied owner =~"$dbo ~"+publishers.consequently.7[000000000000]DX[00]'CHINOOKpubs2 begin transaction ~")add_book for ~"#sa distribute 4 ~. The LAN replication agent used for heterogeneous replication is capable of stripping out the column names as it reads the column order from the replication definition and formats the columns in the stream accordingly.5 LTL (version 300) – some examples in this document use older LTL versions.most transactions only impact <10 tables). Inc.i.~$)pub_name=~"-Sybase.~!*rs_insert yielding after ~$)title_id=~"'PC9900. it follows an LRU/MRU chain much like any cache in ASE . however.0000.(A) either a large number of objects are replicated and the transaction distribution is fairly even across all objects (rare .total_sales=100 .4 REPAGENT(4): [2002/09/08 17:55:12. 40 .x) made up from proc cache.and why Sybase still offers them.23] 3000092810127681300000000. The transaction cache is used to store open transactions.24] distribute 1 ~*620020908 17:55:45:543. nor ignore it for the new one.a schema cache and a transaction cache. Customers with the most issues similar to (A) are those replicating a lot of procs as you can have a lot of procs modifying a small number of tables.0000.if it stays consistent then you are fine.~!*rs_insert yielding after ~$'pub_id=~"%%9990.0.and why Sybase still offers them.4 0x000000000000445800000c40000300000c400003000092810127681300000000.~$&state=~"#CA distribute 4 ~. Each cache item essentially is a row from sysobjects and associated child rows from syscolumns in a hash tree.Final v2.23] 0x000000000000445800000c40000700000c400003000092810127681300000000. Customers with the most issues similar to (B) are those that tend to change the DDL to tables/procs frequently..A[0000] REPAGENT(4): [2002/09/08 17:55:46.~$(pubdate=~*620001101 00:00:00:000.~$(advance=~(($0.23] The LTL packet sent is of length 1097. it may look slightly different. then it has to do a look up in sysobjects and syscolumns (hopefully in metadata cache and not physical i/o . Note that the column names. The schema cache can "grow" in one of two ways .6 ~. If not found. the Rep Agent contains 2 caches . Inc.4 ~. Schema Cache Growth Factor As mentioned earlier. 
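A hedged sketch of putting this to work – watch the schema cache with trace flag 9208 for a few minutes (remember that RepAgent tracing writes to a file and is itself a performance drag), then raise the growth factor only if the cache is clearly churning. The file path and factor value are illustrative:

exec sp_config_rep_agent pubs2, 'trace_log_file', '/tmp/ra_schema_cache.trc'
exec sp_config_rep_agent pubs2, 'traceon', '9208'
go
-- ...after observing the trace output for a few minutes...
exec sp_config_rep_agent pubs2, 'traceoff', '9208'
exec sp_config_rep_agent pubs2, 'schema cache growth factor', '2'
go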
As mentioned in the table.0000.0000.23] _ds 1 ~*620020908 17:55:32:543.~$.~$&state=~"#CA _ds 4 0x000000000000445800000c40000500000c40000 REPAGENT(4): [2002/09/08 17:55:12.~!*rs_insert _yd _af ~$)title_id=~"'PC9900. Accordingly. more frequently hit tables will be in cache while those infrequently will get aged out. RA needs to send the appropriate info for each . You can watch the growth with RA trace 9208 .6 0x000000000000445800034348494e4f4f4b7075627332 _bg tran ~")add_book for ~"#sa _ds 4 0x000000000000445800000c40000400000c400003000092810127681300000000. datatype tokens. The other cache (the topic of this section) basically caches components from sysobjects and syscolumns. don't send the new column for the old row. The reason is that RA needs to send the correct version of the schema at the point that the DML happened.~$&notes=~#"3This what happens on sabbaticals taken by geeks . as of 12.~$&notes=~#"3This what happens on sabbaticals taken by geeks . insert another row. we get the following: REPAGENT(4): [2002/09/08 17:55:46. and/or (B) the structure of tables/columns are being modified.~$%%type=~"popular_comp.24] The LTL packet sent is of length 958. several trace flags exist. “traceoff”. the logical page id’s are assigned in device fragment order. Consider the following example database creation script (assume 2K page server): create database sample_db on data_dev_01=4000 log on log_dev_01=250 go alter database sample_db on data_dev_02=4000 log on log_dev_01=2000 go This would more than likely result in a sysusages similar to (dbid for sample_db=6): Dbid 6 6 6 6 Segmap 3 4 3 4 Lstart 0 2048000 2176800 4224800 Size 2048000 128800 2048000 1024000 Vstart … (…) Executing sp_help_rep_agent sample_db could yield the following marker positions: 41 . This can lead to frequent accusations that the Replication Agent is always #GB behind. and “current marker”. Depending on the hardware platform. this can be considerable. for tougher problems. In the case of LTL tracing (9201). At a basic level.0. However. while 3 triples the size. other processing on the box.monitor for a few minutes sp_config_rep_agent <db_name>. RepAgent Trace Flags However. “end marker”. “trace_log_file”. “9204” -. setting it to 2 doubles the size of the cache. sp_help_rep_agent can help track where in the log and how much of the log the Rep Agent is processing. The problem is that these are reported as logical pages on the virtual device(s). In other words. Of particular interest are the columns that report the transaction log endpoints and the Rep Agent position – “start marker”. Rep Agent trace flags should only be used when absolutely necessary. For NT.Final v2.log' Determining RepAgent Latency Another useful command when troubleshooting the Replication Agent is the sp_help_rep_agent procedure call. “traceon”. sp_sysmon ‘RepAgent’ or the MDA based monProcessWaits table are the best bets. setting this above 1 is probably useless. anything above 3 may not be recommended. Remember. unless you have over 100 objects being replicated per database. The trace flags and output file are specified using the normal sp_config_rep_agent procedure as in the following: sp_config_rep_agent <db_name>. As a result. 'c:\\ltl_verify. “<filepathname>” sp_config_rep_agent <db_name>. “9204” However.1 The RepAgent config "schema cache growth factor" is a factor . for performance related issues. note that you will need to escape the file path with a double slash as in: exec sp_config_rep_agent pubs2. Hence. 
Trace Flag 9201 9202 9203 9204 9208 Trace Output Traces LTL generated and sent to RS Traces the secondary truncation point position Traces the log scan Traces memory usage Traces schema cache growth factor Output from the trace flags is to the specified output file. etc. tracing the Rep Agent has a considerable performance impact as the Rep Agent must also write to the file.not a percentage .consequently it is extremely sensitive. 'trace_log_file'. Replication Agent Troubleshooting There are several commands for troubleshooting the Rep Agent. Final v2.000 or 25.20) (134923.11) (133849.20) (134923. once the XLS wakes up the Rep Agent from a sleeping state.20) (134923.22) (133278.22) (133278.20) (137410.2) (137410. However.7) (134378.22) (133278.20) (134923.0.20) (134923.19) (134562. Some administrators have attempted to run sp_help_rep_agent every second and were extremely surprised to see little or no change in the “log records scanned” (or even a drop).and may get a new end marker position. What the “scan” section of sp_help_rep_agent is reporting in the “log records scanned” is the number of records scanned towards the “scan batch size”.20) (134923. The “log records scanned” works similar but on a more predictable basis.20) (134923. Consider the following output from a sample scan: start marker (133278. This is what causes the confusion .5) (133931.7) (135725. the start and end markers are set to the current log positions.5) (135000.22) (133278.9) (134037. 2063800/512=4031MB) – a good trick when the transaction log is only slightly bigger than 2GB. One of the most understood aspects of sp_help_rep_agent is the “scan” output .9) (135084. which has a default value of 1.22) (133278.3) (134471. In reality. one of the Rep Agent configuration settings is “scan batch size”.23) (135447.13) (135642. As it nears the end marker.20) (134824.22) (133278.20) (134923.assuming that the end maker points to the final page of the log.20) (134923.20) (134923.20) (134923. The Rep Agent commences scanning from that point.2) (137410.22) (133278. the counter is reset. the Replication Agent is only 31MB behind ( (42298424224800)+(2176800-2166042)=5042+10758=15800.22) (134923. At the default value.particularly when just using the default as the Rep Agent is capable of scanning 1. The reason is that the Rep Agent was working on subsequent scan batches.22) (133278.2) (137410.20) (134923.000 clears it up. it requests an update .6) (134116.22) (133278.20) (134923.1 Start Marker 2148111 End Marker 4229842 Current Maker 2166042 Those quicker with the calculator than familiar with the structure of the log would erroneously conclude that the Rep Agent is running ~4GB behind (4229842-2166042=2063800.20) (134923.22) (133278.000.0) log rec scanned 3587 4807 5841 6849 7856 8810 10083 11038 12048 13163 14171 15286 16375 17516 18341 19509 20437 21605 22613 23621 24790 1061 1963 3184 4300 5283 recs/ sec 0 1220 1034 1008 1007 954 1273 955 1010 1115 1008 1115 1089 1141 825 1168 928 1168 1008 1008 1169 1271 902 1221 1116 983 scan cnt 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 tot recs 3587 4807 5841 6849 7856 8810 10083 11038 12048 13163 14171 15286 16375 17516 18341 19509 20437 21605 22613 23621 24790 26061 26963 28184 29300 30283 42 . Once the “scan batch size” number of records are reached.22) (133278.20) (134923.20) (134923.20) (134923.20) (134923. 
For the first part.20) end marker (134923.22) (133278.14) (133765.15) (134201.22) (133278.20) (134923.3) (133594.20) (134923.0) (135266.20) (134923.000 records a second from the transaction log.20) (134923.2) (137410.2) (137410.22) (133278.5) (134726. 15800/512=31 ) .0) (134902. If you remember.2) current marker (133493.2) (137410.2) (137410.8) (134658.20) (134923.5) (135371.20) (134923.22) (133278.23) (135549. monitoring this value can be extremely confusing.5) (135169.the makers (as listed above) as well as the “log records scanned”.15) (134294. if setting “scan batch size” to a more reasonable value of 10.15) (133681.2) (137410.22) (133278. all recs. Adding more fun to the problem of determining the latency is the fact that the transaction log is a circular log. 43 .2) end marker (137410.20) (134923. As you can see from the first high-lighted section.20) (134923. Determine the distance from the end of the current segment in sysusages Add the space for all other segments between the current segment and the segment containing the end marker.2) (137410. Otherwise.23) (135985. row=0.20) (134923.7) (137416.2) (137410.2) (137410. If DBCC printed error messages.15) (136277.2) (137410. The above output was taken from an NT stress test of a 50.20) (134923. the end marker was updated.2) (137410.2) (137410.12) (135904. obj=0.20) (134923.0. update some row – and then check in master.8) (136895. LOG SCAN DEFINITION: Database id : 5 Backward scan: starting at end of log maximum of 1 log records.16) (136712.20) (134923. consequently.2) (137410.9) (137339. there isn't a built-in function that returns the last log page. 1) go DBCC execution completed.11) (136091. 3.2) (137410. -1.17) (136451. header only dbcc log(5. 0. If there are not any other open transactions.. 2.dbid=5.Final v2.9) (137068.20) (134923. -1.9) current marker (135815.20) (134923. 0.syslogshold.2) (137410.20) (134923.11) (137244.2) (137410.2) (137416.0.9) log rec scanned 6371 7433 8389 9663 10832 11894 13009 13965 15346 16195 17098 18187 19275 20284 21346 22463 23447 24588 513 recs/ sec 1088 1062 956 1274 1169 1062 1115 956 1381 849 903 1089 1088 1009 1062 1117 984 1141 925 scan cnt 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 tot recs 31371 32433 33389 34663 35832 36894 38009 38965 40346 41195 42098 43187 44275 45284 46346 47463 48447 49588 50513 Before asking about where the 3 right-most columns are in your sp_help_rep_agent output. Add the distance for the end marker within the end marker’s segment Unfortunately. it is possible for the markers to have wrapped around. The second high-lighted area illustrates the ‘scan batch size’ rollover effect on ‘log recs scanned’.20) (134923.11) (136188. one way to find the last log page is to use dbcc log as in: use pubs2 go begin tran mytran rollback tran go dbcc traceon(3604) go -.20) (134923.23) (136370.2) (137410. recs = last one.2) (137410. the above output is from a modified version of sp_help_rep_agent.1 start marker (134923. page=0. The logic to calculating the latency is: 1.11) (136980.18) (136803. as the current marker approached the end marker.2) (137410.0) (136636.20) (134923.20) (134923.2) (137410.2) (137410. perhaps the easiest way is to begin a transaction.000 row update done by 10 parallel tasks.2) (137410.20) (137410. The Rep Agent was configured for a ‘scan batch size’ of 25.2) (137410.20) (134923.17) (137161.20) (134923.20) (134923.20) (134923.000. contact a user with System Administrator (SA) role.5) (136566. Consequently. 
Normal Termination Output completed (0 sec consumed).… Where the first number in parenthesis is current log page (and row) – note that the sessionid points to the log page and row where the transaction begin record is (since this was an empty tran.0. If a user began a transaction and went to lunch. Using sp_sysmon Most DBA’s are familiar with sp_sysmon – until the advent of the MDA monitoring tables in 12.getdate()) from master. Simply because the STP and the current oldest open transaction is 30 minutes does not mean that the Rep Agent will take 30 minutes to scan that much of the log – consider if the Rep Agent is down or a low volume system.3.syslogshold where name = '$replication_truncation_point' and dbid=db_id() This tells how far behind in minutes the Replication Server is (kind of).starttime.1 LOG RECORDS: ENDXACT (13582.13 attcnt=1 rno=14 op=30 padlen=0 sessionid=13582. the STP won’t move until the transaction is committed. Another alternative that measures instead the difference in time between the last time the secondary truncation point was updated is by using the following query: -. ~50% of the log records scanned were sent to the RS. The syntax is: -. If DBCC printed error messages. A little known fact is that while the default output for sp_sysmon does not include RepAgent performance statistics. this may give the impression that the Replication Agent is lagging.0. the STP points to the page containing the oldest open transaction that the Replication Server has processed. The problem is that it can be highly inaccurate. Unfortunately.executed from the current database select db_name(dbid). it is immediately preceding). The difference between ‘Log Records Scanned’ and ‘Log Records Processed’ – is fairly obvious – ‘Processed’ records were converted into LTL and sent to the RS.14) sessionid=13582.5. when in reality. The second reason that this can be inaccurate is a matter of interpretation. the suggestion made earlier to either invoke your own transaction (add a where clause of 'spid = @@spid' to the above) or just grab the latest one and hope that it isn't a user gone to lunch. this procedure was the staple for most database monitoring efforts (unfortunately so. contact a user with System Administrator (SA) role. per sec -----------Log Scan Activity Updates Inserts Deletes Store Procedures DDL Log Records n/a n/a n/a n/a n/a per xact -----------n/a n/a n/a n/a n/a count ---------101317 19 0 0 0 % of total ---------n/a n/a n/a n/a n/a 44 . In the example above. some the main points of interest are repeated below (header lines repeated for clarity): per sec -----------Log Scan Summary Log Records Scanned Log Records Processed n/a n/a per xact -----------n/a n/a count ---------206739 105369 % of total ---------n/a n/a The log summary section is a good indicator of how much work the RepAgent is doing – and how much information is being sent to the Replication Server. as Historical Server provided more useful information and yet was rarely implemented). executing the procedure and specifically asking for the “repagent” report does provide more detailed information than what is available via sp_help_rep_agent. “repagent” While the output is described in chapter 5 of the Replication Server Administration Guide. stp_lag=datediff(mi.sample the server for a 10 minute period and then output the repagent report exec sp_sysmon “00:01:00”.Final v2. Remember.. 
the current marker value may be very near the end of the transaction log.13 len=28 odc_stat=0x0000 (0x0000) loh_status: 0x0 (0x00000000) endstat=ABORT time=Oct 22 2004 11:26:39:166AM xstat=0x0 [] Total number of log records 1 DBCC execution completed. NULL.0. IndexID int. NULL. NULL. Since the above is at the index level.3+ -specifically monOpenObjectActivity table which has the following definition: -.0. you will need to isolate IndexID=[0|1] to avoid picking up rows inserted into the index nodes (which of course are not replicated). NULL. NULL. though.Final v2. DBName varchar(30) ObjectName varchar(30) LogicalReads int PhysicalReads int APFReads int PagesRead int PhysicalWrites int PagesWritten int RowsInserted int RowsDeleted int RowsUpdated int Operations int LockRequests int LockWaits int OptSelectCount int LastOptSelectDate datetime UsedCount int LastUsedDate datetime ) materialized at "$monOpenObjectActivity" go NULL.000 records – since each transaction is a begin/commit pair. “replicate_if_changed”. NULL. CLR’s refer to Compensation Log Records – and clearly point to a design problem as earlier discussed with indexes using ignore_dup_row or ignore_dup_key (discussed earlier in the Primary Database section on Batch Processing & ignore_dup_key). Normally. More detail about which tables were updated/inserted/deleted can be gotten from the MDA monitoring tables in 12.1 definition create table monOpenObjectActivity ( DBID int. NULL. Most of the above statistics should be fairly obvious. the last three should bear some attention. NULL. this should be zero. except ‘Prepare’ – which refers to twophased commit (2PC) ‘prepare transaction’ records that are part of the commit coordination phase. ‘DDL Log Records’ refers to DDL statements that were replicated – generally this should be zero. NULL. NULL.369 processed – 101. ObjectID int. NULL. NULL. with only minor lifts in a Warm Standby when DDL changes are made – hence we exclude this from concern. ‘Writetext Log Records’ will show how many writetext operations are being replicated. While the first four are fairly obvious (updates. If you see a large number of text rows being replicated.000 log records sent to the RS were not DML statements (105.336 DML = 4.0. deletes and proc execs replicated). ‘Text/Image Log Records’ is similar but a bit different in that it displays how many row images are processed (we need to confirm whether this is rs_datarow_for_writetext or the actual number of text rows). NULL. ‘Maintenance User’ refers of course to maintenance user applied transactions that are in turn re-replicated.5. per sec -----------Transaction Activity Opened Commited Aborted Prepared Maintenance User n/a n/a n/a n/a n/a per xact -----------n/a n/a n/a n/a n/a count ---------2015 2016 0 0 0 % of total ---------n/a n/a n/a n/a n/a Here are the missing 4. NULL. but if a logical Warm Standby is also the target of a different replication source. then the primary database in the logical pair is responsible for re-replicating the data to the standby database.1 Writetext Log Records Text/Image Log Records CLRs n/a n/a n/a n/a n/a n/a 0 0 0 n/a n/a n/a The Log Scan Activity contains some useful information if you think something is occurring out of the norm.ASE 15. NULL.033) – most of these are transaction records as seen in the next section. inserts. The transaction flow is SourceDB RS PrimaryDB RS/WS StandbyDB as illustrated below: 45 . NULL. 2015+2016=4031 records sent to the Replication Server. 
you may want to investigate whether a text/image column was inappropriately marked or left at “always_replicate” vs. NULL. Some may have notice in the Scan Activity that ~4. In reality. remember that the RepAgent wait count may reflect the before and end state of the benchmark run. The next section of the RepAgent sp_sysmon report is only handy from the unique perspective of DDL replication. the RepAgent may have to scan forward/backward to determine the correct column names. then the 100. the likely 46 . Obviously a count of zero is not desired. So if doing benchmarking.0 % of total ---------n/a n/a n/a n/a Here. datatypes. to send to the Replication Server. it will have to scan backwards (possibly) to find the alter table record to determine the appropriate columns to send.0 0 0 0 0. the RepAgent was waiting when the sp_sysmon started.0 n/a n/a n/a n/a n/a n/a n/a n/a When a table is altered via alter table. per sec -----------Schema Cache Lookups Forward Schema Count Total Wait (ms) Longest Wait (ms) Average Time (ms) Backward Schema Count Total Wait (ms) Longest Wait (ms) Average Time (ms) per xact -----------count ---------% of total ---------- n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a 0 0 0 0. the RepAgent caught up twice and waited ~7 seconds each time for more information to be added to the log. In the above example from a 1 minute sysmon taken during heavy update activity. On startup. this is done using an auxiliary scan from the main log scan and is why often the Rep Agent will be seen with two scan descriptors active in the transaction log. Or it seems so.Final v2.000 updates occurred in ~2. this is reporting the number of times the RepAgent has asked the RS for a new secondary truncation point and then moved the secondary truncation point in the log.0. The next section is one of the more useful: per sec -----------Truncation Point Movement Moved Gotten from RS n/a n/a per xact -----------n/a n/a count ---------107 107 % of total ---------n/a n/a As expected. waiting is not ‘bad’ as it refers to the time that the Rep Agent was fully caught up and was waiting for more log records to be added to the transaction log. Consequently. One way to think of this is from the perspective of someone doing an alter table and then shutting down the system. the RepAgent can’t just use the schema from sysobjects/syscolumns because some of the log records may contain rows that had extra or fewer columns. If ‘Moved’ is more than one less than ‘Gotten’. etc.1 Remote Site Direct Replication Re-Replicated HQ Warm Standby Figure 8 – Path for External Replicated Transactions in Warm Standby The next section of the Rep Agent sp_sysmon output is the ‘Log Extension’ section: per sec -----------Log Extension Wait Count Amount of time (ms) Longest Wait (ms) Average Time (ms) n/a n/a n/a n/a per xact -----------n/a n/a n/a n/a count ---------2 14750 14750 7375.000 transactions – taking 45 seconds to process – then the RepAgent was caught up again. Incidentally. Remember. that is about 2/sec – so increasing scan_batch_size in this case should not have too much a detrimental impact on recovery.2 per sec ------------ per xact -----------n/a per xact ------------ count ---------726 count ---------- % of total ---------n/a % of total ---------- 47 .Final v2. However. We were using the default 2K packets between the RepAgent and the Replication Server and nearly 90% of them were full.6 % of total ---------n/a n/a n/a n/a In this sample. 
normalized the LTL according to the replication definitions. Note too that we are sending 2K buffers which include column names – hence the number of packets in this case is probably much bigger than the number of log pages scanned – perhaps 20% more.0.8 % I/O Busy -------2. The bottom statistic ‘Average Packet’ is simply ‘Amount of Bytes Sent’ – ‘Packets Sent’ and can be misleading. The connection section is really only useful when you are having network problems – and should be accompanied by the normal errors in the ASE errorlog: per sec -----------Connections to Replication Server Success n/a Failed n/a per xact -----------n/a n/a count ---------0 0 % of total ---------n/a n/a The next section is also when to pay attention to – the network activity: per sec -----------Network Packet Information Packets Sent Full Packets Sent Largest Packet Amount of Bytes Sent Average Packet n/a n/a n/a n/a n/a per xact -----------n/a n/a n/a n/a n/a count ---------16860 14962 2048 30955391 1836. For those who have been following this.000 records processed.0 % of total ---------n/a n/a n/a n/a n/a In the above case. this 45 second update sent ~30MB to the RS – a rate of 40MB/min or >2GB/hr. but then has to wait for the RS to parse the LTL.000 in this case. the RepAgent waited on the RS to conduct I/O nearly 17. Sent. check the above statistic on the number of packets and you will see the problem with RepAgent performance – a lot of hurry-up-and-wait. Note in this discussion we are talking about ‘Log Records Processed’ and not ‘Scanned’ – while the record count can happen quickly. Let’s take a look at an actual snapshot from a customer’s system.2 % Transaction Summary ------------------------Committed Xacts Transaction Detail ------------------------- per sec -----------1. It can scan from the log at fairly tremendous speed. The following statistics are from a 10 minute sp_sysmon – only the transaction and RepAgent sections are reported here: Engine Busy Utilization -----------------------Engine 0 CPU Busy -------3.000 times. the RepAgent requested a new truncation point – requests sent in separate packets from the LTL buffers – requests that can skew the average. The final section of the report offers perhaps the biggest clues into why the RepAgent may be lagging: per sec -----------I/O Wait from RS Count Amount of Time (ms) Longest Wait (ms) Average Wait (ms) n/a n/a n/a n/a per xact -----------n/a n/a n/a n/a count ---------16966 11002 63 0.1 % Idle -------94. it shows that possibly bumping up the send_buffer_size may help. Now then. the RepAgent isn’t foolish enough to ask for a new truncation point every 1.000 records scanned – it actually is based off of the number of records sent to the RS.1 cause is that a large or open transaction exists from the Replication Server’s perspective (either it is indeed still open in ASE or the RepAgent just hasn’t forwarded the commit record yet). The more important statistic to watch is Full vs. so 107 is not abnormal. you would expect at least 105 truncation point movements plus one when it reached the end of the log. The number above is not necessarily high – you can gauge the number by dividing the ‘Log Records Processed’ by the RepAgent ‘scan_batch_size’ configuration – which was the default of 1. Of course having the RS on a box with fast CPU’s greatly helped in this case as will be discussed later. With 105. pack the LTL into a binary format and send to the SQM – an average of ½ of a second wait every time a packet is sent (and yes. 107 times. 
we did ask 2x a second for a truncation point as well). 0 ============ 5.0 0 ---------0 n/a ---------0.0 -----------0.7 0.4 20 4 4 ---------28 ========== 3216 71.4 419 468 2301 ---------3188 13.0 -----------0.0 0.9 % Replication Agent ----------------- count ---------Log Scan Summary Log Records Scanned Log Records Processed Log Scan Activity Updates Inserts Deletes Store Procedures DDL Log Records Writetext Log Records Text/Image Log Records CLRs Transaction Activity Opened Commited Aborted Prepared Maintenance User Log Extension Wait Count Amount of time (ms) Longest Wait (ms) Average Time (ms) Schema Cache Lookups Forward Schema Count Total Wait (ms) Longest Wait (ms) Average Time (ms) Backward Schema Count Total Wait (ms) Longest Wait (ms) Average Time (ms) 81061 19015 0 15845 0 0 0 0 0 0 1585 1585 0 0 0 0 0 0 0.0 0 ---------0 n/a ---------0.8 -----------5.0 ============ 4.0 0.0 0.0 -----------0.3 % 14.6 0.0 48 .4 0.0.0 0.8 3.2 % ---------99.1 % 14.0 ------------------------.Final v2.6 3.1 % 0.0 -----------0.0 0 0 0 0.0 0.2 -----------4.0 % Data Only Locked Updates Total Rows Updated 0.4 % 14.-----------Total DOL Rows Updated 0.7 % 72.0 0 0 0 0.3 % ---------0.0 Deletes APL Deferred APL Direct DOL ------------------------Total Rows Deleted ========================= Total Rows Affected 0.3 0.0 % 0.1 Inserts APL Heap Table APL Clustered Table Data Only Lock Table ------------------------Total Rows Inserted Updates Total Rows Updated ------------------------Total Rows Updated 0.0 -----------0. we see that ASE is only using ~4% of the cpu – idle for the other 96% of the time. Upgrading to ASE 12. Interestingly enough. don’t put RS on the Sun6900 that is apparently under utilized as it is the DR host machine. a 1 minute sp_sysmon showed the following: I/O Wait from RS Count Amount of Time (ms) Longest Wait (ms) Average Wait (ms) n/a n/a n/a n/a n/a n/a n/a n/a 4869 54363 323 11. enabling SMP. As you can see from above. Even worse.3 did solve the problem and a 15 minute stress test dropped to 3 minutes. The problem is believed to be caused by the ASE scheduler not processing the RepAgent’s network traffic fast enough. The problem is in the waits on sending to the Replication Server – from the above. this can now be refuted easily). cpu time spent on other threads (DSI. Looking at the system. 49 . The point to be made here is that the RepAgent speed is directly proportionate to the speed of ASE processing the network send requests coupled with the speed of the Replication Server processing.1 Truncation Point Movement Moved Gotten from RS Connections to Replication Se Success Failed Network Packet Information Packets Sent Full Packets Sent Largest Packet Amount of Bytes Sent Average Packet I/O Wait from RS Count Amount of Time (ms) Longest Wait (ms) Average Wait (ms) 19 19 0 0 9794 8698 2048 18436223 1882. It not only would be cheaper – but better performance to have bought a smaller SMP machine for the standby ASE server and a 2-4 way screamer entry level server for RS to run on.2 n/a n/a n/a n/a Ugly. Additionally. etc. it waited ~2 minutes (107 seconds) of the 10 to send – the key is the long wait and high average wait. the application (bcp in this case) only inserted 3.2GHz CPU’s when a quad cpu Opteron based Linux box or small SMP with fast CPU’s (less than $20. even for a Warm Standby may boost RepAgent performance 10-15% by eliminating CPU contention.4 9813 107316 400 10. 
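Since the very large “I/O Wait from RS” numbers above point at the path to the Replication Server rather than at the log scan itself, a quick RS-side health check is a reasonable next step – a hedged sketch, issued from isql connected to the Replication Server (the sts_cachesize value is illustrative only):

-- connected to the Replication Server, not ASE
admin who_is_down
go
-- check the SQM threads servicing the inbound/outbound queues for backlog
admin who, sqm
go
-- repdef/subscription metadata cache; too small a value forces RSSD lookups
configure replication server set sts_cachesize to '2000'
go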
Consider the usual suspects: “RepAgent not scanning the transaction log fast enough” – common myth closely followed with a “multithreaded RepAgent is needed”.5. “RepAgent is contending for cpu – need to raise the priority” – another commonly blamed problem (with sp_sysmon.Final v2. delay in getting repdefs into cache (too small sts_cache_size).2 from 12.0.5. The RepAgent is literally waiting 54 seconds of the 1 minute – so it is only scanning for 6 seconds.9 Now the interesting thing about the above – of course the RepAgent was lagging – in fact it was way behind. So…don’t put RS on the old Sun v880 or v440 with the ancient 1. the issue was not Replication Server – traceflags were enable that turned RS into a datasink with no appreciable impact on performance. Key Concept #6 – The biggest determining factor in RepAgent performance is the speed of ASE sending network data and the speed of the Replication Server – hence. faster CPU’s is much better than slow CPU’s on a monster SMP machine.000 inserts during the same period – about 5x the rate – so the RepAgent scan isn’t the issue. from an RS perspective fewer. In fact.000).).0.000 rows in the 10 minutes at a rate of 5 rows/sec. DBA’s have often fallen into the trap of buying a bigger SMP machine for the DR site and hosting both RS and the standby ASE server on it. directly slows down the RepAgent speed. The RepAgent processed 15. however. Any contention at the inbound queue (readers delay writers). WaitTime Tot. In each case.----------.-------------------------------------------------29 2 wait for buffer read to complete 31 3 wait for buffer write to complete 171 8 waiting for CTLIB event to complete 214 1 waiting on run queue after yield 222 9 replication agent sleeping during flush Wait Time from Mon Tables on ASE 12.----------31 3 120 117 0 100 100 171 2178 75597 73419 21800 747900 726100 222 2 4 2 17000 54300 37300 Wait Time from Mon Tables on ASE 12. In this case the before and after samples are illustrated side by side with the first column being the first sample and the second column being the end sample. there was a 400% drop in the waits moving from 12.Waits totWaits WaitTime WaitTime totWaitTime ----------. a rollover occurs and it re-increments from the rollover.sysprocesses where program_name=’repagent’ and dbid=db_id(‘<tgt_db_name>’) select * From monProcessWaits where KPID=@ra_kpid and SPID=@ra_spid waitfor delay “00:05:00” select * from monProcessWaits where KPID=@ra_kpid and SPID=@ra_spid There are very few RepAgent specific wait events as most of the causes of RepAgent waits are due to generic ASE processing vs.5. This can be especially useful for RepAgent performance analysis when determining whether the hold up is the time spent doing the log scan or whether it is due to waiting on the network send aspect.----------.. Consequently the time waiting is the difference between samples.5.3 WaitEventID t1.WaitTime t2. the key is to realize that it requires at least two samples to be effective.----------.2 WaitEventID t1.0..5.----------. RepAgent specific.----------. consider the following event descriptions and counter values.3 to 12.5.Waits Tot. @ra_spid=spid from master.2 finally showing the 50 . it will be much faster to first retrieve the RepAgent’s KPID and SPID from master. The other key is that the monProcessWaits table has two parameters – KPID and SPID. only WaitEventID’s that showed a difference between the samples is reported.3 is the monProcessWaits table. @ra_spid int select @ra_kpid=kpid. 
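If you want to confirm these RepAgent-specific event IDs and their descriptions on your own server, they can be pulled from the monWaitEventInfo MDA table (this assumes the MDA proxy tables are installed and the login has mon_role):

select WaitEventID, WaitClassID, Description
  from master..monWaitEventInfo
 where WaitEventID in (221, 222, 223)
go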
If the counter hits 2 billion.0. when focusing on a specific Replication Agent. the RepAgent spid is spending much more time waiting on CT-Lib events than anything else.sysprocesses and supply it as SARG values such as: declare @ra_kpid int.WaitTime ----------. Consequently.0.Waits t2. The reason for this is that the output values are counters that are incremented from the server boot infinitely until shutdown.----------.----------. as is evident.5.1 Utilizing monProcessWaits One of the key MDA monitoring tables added in 12.Waits t2.Waits t1.----------.----------.0. However.Final v2.----------. To understand what how to use monProcessWaits. WaitEventID WaitClassID Description ----------.-----------29 2 2 0 0 0 0 31 283 403 120 1900 3200 1300 171 223623 306426 82803 410700 593700 183000 214 3 3 0 0 0 0 222 13988 13990 2 1032636100 1032659600 23500 In both cases illustrated above. The RepAgent wait events are: Wait Event 221 222 223 Event Class 9 9 9 Description replication agent sleeping in retry sleep replication agent sleeping during flush replication agent sleeping during rewrite To illustrate the point about the most frequent causes of RepAgent waits.----------. 0. Same as above (cpu contention). – including vendors. One method is to use Replication Server’s Monitors & Counters feature – focusing on the EXEC. SQM. the application performance stress test also improved from 13-14 minutes to 3 minutes – a matching 400%. and STS thread counters. the next step to determine the cause of the slow LTL transfer process. If you see a significant number of 171’s (as illustrated above) and as will be the most common. It is much better to take samples at timed intervals – such as every minute or so. 215 waiting on run queue after sleep 222 replication agent sleeping during flush An important point about WaitEventID=222 – if you are doing benchmarking. if you sample the counters prior to the stress test starting and then at the end. you will minimally see “2” waits – the reason is likely the RepAgent was at the end of the log when the first sample was taken and the last sample was taken after the RepAgent had finished reading out the test transactions. Fault Isolation Questions (FIQ’s) Most of us are familiar with FAQ’s – which strive to serve as a loosely defined database of previously asked questions and answers. end of log) 51 . This will help eliminate false highs/lows at the boundary of the tests such as 222. etc. adjusting the RepAgent priority is likely not going to help throughput issues. this often leads to phone calls to Technical Support with “it’s slow” and no information about what may be going on to help identify why it may be slow. A common problem is that most programmers and database administrators today have poor fault isolation skills largely due to the lack of organized fault isolation tree use by the DBA’s. the RepAgent was sleeping (due to sending data to RS . the only writing RepAgent does is to update the dbinfo structure with the new secondary truncation point – so any large values here could be indication of more serious problems This corresponds directly to RepAgent transferring data to the RS.e. The following questions are useful in helping to isolate the potential causes of RepAgent performance: • How far behind is the Replication Agent (MB)? (current marker vs. We’ll morph that a bit to make FIQ’s – questions you should ask when troubleshooting. strictly a before and after snapshot is not that informative. 
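Pieced back together, the two-sample technique described above looks like the following script (the five minute delay and the 'repagent' program_name filter are as given in the fragments above; '<tgt_db_name>' is a placeholder for the database whose Replication Agent you are watching):

declare @ra_kpid int, @ra_spid int
select @ra_kpid = kpid, @ra_spid = spid
  from master..sysprocesses
 where program_name = 'repagent'
   and dbid = db_id('<tgt_db_name>')
-- first sample of the cumulative wait counters for the RepAgent spid
select * from master..monProcessWaits
 where KPID = @ra_kpid and SPID = @ra_spid
waitfor delay "00:05:00"
-- second sample; the difference between the two samples is the time
-- actually spent waiting during the interval
select * from master..monProcessWaits
 where KPID = @ra_kpid and SPID = @ra_spid
go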
Some of the wait events and how they could be interpreted are summarized in the following table: ID 29 WaitEvent Description wait for buffer read to complete Possible Intepretation/Reaction Waiting on log scan – check to see if log page for scan point is within the log cache – possibly use more cache partitions Typically. While the first two may be obvious. the RepAgent was scanning and got bumped off the cpu at the end of its timeslice (i. Unfortunately. In this case.1 RepAgent waiting for a buffer read (logical IO from the log cache) and cpu access. This typically is an indication of the rep agent reaching the end of the transaction log and sleeping on the log flush.Final v2. Not surprisingly. the STS counters is useful to determine if the RS is hitting the RSSD server to read in repdefs – which are used by the EXEC thread during normalization. it had to wait on other users before it could reclaim the cpu and continue scanning. In this case.any network or physical disk io results in the spid being put on the sleep queue) and when the network operation was complete. This will be the most common and can be as the result of several things: • • • 214 waiting on run queue after yield Slow network access from ASE Slow network access at RS Slow inbound queue SQM (exec cache full) 31 wait for buffer write to complete 171 waiting for CTLIB event to complete CPU contention with other processes – unless you see a fair number of waits. As with any monitoring activity. didn’t reach the scan_batch_size before the timeslice) and had to wait to regain access to the cpu. sqt)? Where is the RSSD in relationship to the RS? Is the RepAgent waiting a long time for the secondary truncation point for the RS at the end of each ltl_batch_size? What do the contents of the last several blocks of the inbound queue look like? Were lengthy DBA tasks such as “reorg reclaim_space <tablename>” issued without turning replication off for the session (via “set replication off”)? The last question is a bit strange. not knowing if the transaction will contain replicated commands.1 • • • • • • • • • • What is the rate at which Replication Agent appears to be processing log pages (MB/min)? What is the rate at which pages are being appended onto the transaction log (MB/min)? (monDeviceIO) How much cpu time is the RepAgent getting? (monSysWaits/monProcessWaits) Is there a named cache specifically for the transaction log with most of the cache defined for a 4K (or other) pool and sp_logiosize set to the pool size? What are the configuration values for the RepAgent? Do any columns in the schema contain text? If so. etc.)? Is there a named cache pool for the text columns (later discussion)? Is it the latency the result of executing large transactions (how many commands show up in admin who. As a result.Final v2. As of ASE 12. 52 . RepAgent prior to ASE 12. As was described earlier.2 forwards these BT/CT pairs to the RS where eventually they are filtered out.5.0. what is the replication status for the text columns (always.2. but it turns out that some DBA tasks issue a huge number of empty begin/commit transaction pairs. it is a good practice to put “set replication off” at the beginning of most DBA scripts such as dbcc or reorg commands. if changed. the RepAgent is smart enough to filter out empty BT/CT pairs caused by system transactions.5. catalog lookups and other RSSD accesses increase this load considerably.0. due to environment specific requirements. 
you may achieve better performance with different configurations than those mentioned here.0. This leads a critical key performance concept for RSSD hosting: Key Concept #7: Always place the RSSD database in an ASE database engine on the same physical machine host as the Replication Server or use the Embedded RSSD (eRSSD). when using the localhost entry. Replication Server/RSSD Hosting A common mistake is placing the RSSD database in one of the production systems being replicated to/from.0.Final v2. In addition. %systemroot%\system32\drivers\etc\hosts for WindowsNT). how many replicate databases involved and how much latency the business is willing to tolerate. An illustration of this is shown below: 53 .1) entries. Add the queue processing. Additionally. If using the NIC IP address. the localhost IP address refers to the host machine itself. The recommendations in this section are based on the assumption of an enterprise production system environment and consequently are significantly higher than the software defaults.0. one of the main problems stemming from this is that this frequently places the RSSD across the network from the Replication Server host. you should see an entry similar to: 127. is all depends – it depends on the transaction volume of the primary sites. the volume of interaction between the Replication Server and RSSD can be substantial – just in processing the LTL. packets destined for the machine name may not only have to hit the NIC card. It should be noted that these are general recommendations that apply to many situations. While this in itself has other issues. The answer. If you took in the host file on any platform (/etc/hosts for Unix.0. As a result. your specific business or technology requirements may prevent you from implementing the suggestions completely. On the other hand. The difference is in how communication is handled when addressing the machine via the NIC IP address or the localhost IP address. make sure that the first network addresses in the interfaces file for that ASE database engine are ‘localhost’ (127. As you saw earlier in the Rep Agent discussion on secondary truncation point management. the TCP/IP protocol stack knows that no network access is really required. but may also require NIS lookup access or other network activity (routing) that result in minimally the NIC hardware being involved. The object of this section is to cover basic Replication Server tuning issues. of course.1 localhost #loopback on IBM RS6000/AIX In addition the Network Interface Card (NIC) IP addresses. however. The latter part of the concept may take a bit of explaining. the protocol stack implements a “TCP loopback” in which the packets are essentially routed between the two applications only using the TCP stack.1 Replication Server General Tuning How much resources will Replication Server require? The above is a favorite question – and a valid one – of nearly every system administrator tasked with installing a Replication Server. implementing multiple network listeners – especially one on localhost – could result in severe performance degradation (especially when attempting large packet sizes). this can be implemented by modifying the Sybase interfaces file to include listeners at the localhost address. Even so. However. One such was AIX 4. RS queries to the RSSD may by-pass the “traffic jam” on the network listener used by all the other clients. suggestion is 2-3 for the RS. Typically. Additionally. 
this has substantial performance and network reliability improvements over using the network interface. localhost protocol routing As you could guess. By using the localhost address. If planning on high volume with multiple connections and using RS 12. (min 2. these must be the first addresses listed in the interfaces file in order for this to work. For example: NYPROD master master query query NYPROD_RS master master query query tcp tcp tcp tcp /dev/tcp /dev/tcp /dev/tcp /dev/tcp localhost nymachine localhost nymachine 5000 5010 5000 5010 tcp tcp tcp tcp /dev/tcp /dev/tcp /dev/tcp /dev/tcp localhost nymachine localhost nymachine 5500 5510 5500 5510 Note that many of today’s vendors have added the ability for the TCP stack to automatically recognize the machine’s IP address(es) and provide similar functionality without specifically having to use the localhost address.3 with ASE 11.1 RS RSSD RS RSSD Hostname:port localhost:port CT-Lib NetLib TCP IP Sybase Hybrid Stack CT-Lib NetLib TCP IP Network Network Figure 9 – hostname/NIC IP vs.6/SMP. A word of warning. 54 . but probably are the minimum a decent production system should consider to avoid resource contention or swapping): Resources # of CPU’s Recommendation 1 for each RS and ASE installed on box plus 1 for OS and monitoring (RSM).0. the machine should have the following minimal specifications (NOTE: The following specifications are not the bare minimums.Final v2. On some systems. 1-2 for ASE/RSSD and 1 for the OS & RSM (4-6 cpu’s). there may be a benefit to using the localhost address on machines in which the RSSD is co-hosted with application databases and the RS would have to contend with application users for the network listener with ASE.2 (neither of which is currently supported having been end-of-lifed by both IBM and Sybase years ago). Using the eRSSD reduces the cpu load significantly such that it would be rare to need more than 4 cpu’s unless 3 or more active DB connections are in the RS. 3 preferred).9. max(DSIE)=max parallel DSI threads (max(dsi_num_threads)).x The number of internal processing threads. RS Generic Tuning Generally speaking. the faster the disk I/O subsystem and the more memory . 4. Switched Gigabit Ethernet or better (10GB Ethernet or infiband) Disk Space Network The rationale behind these recommendations will be addressed in the discussions in the following sections. Each Replication Server license includes the ability to implement a “limited use” ASE solely for the purpose of hosting the RSSD (“limited use” means that the ASE server could only be used the RSSD – no application data. consequently it is not shipped as part of the RS product set. Based on the above considerations. recommend 256-512MB data and 128-256MB log due to monitoring tables.0 and higher. Recommendation is minimally 100.customers may have to coordinate with Sybase Customer service to ensure that the correct number of keys are available. 120MB sybsystemprocs.) Of course. it can grow rapidly if logging exceptions. Author’s Note: As of this writing. there should be no licensing concern to restricting the use of an ASE for the RSSD. The answer really depends on several factors: 1. life isn’t that simple.e. DSIQ=Outbound Queues.0. no 20MB master. 100+MB tempdb). you have to adjust several configuration settings within the Replication Server for optimal settings with certain minimums required based on your configuration. particularly for RS 12. client connection threads. etc. 
In the following sections Replication Server resource usage and tuning will be discussed in detail.1 Resources Memory Recommendation 128-256MB for ASE (32-64 for eRSSD instead) plus memory for each RS (64-128MB min) and operating system (32-64MB). 2. For ASE 15. Some of these are documented below: Parameter Replication Server Settings num_threads Default: 50 Suggest: 100+ 10. the SySAM 2 license manager will require a restricted use license key . etc. Replication Server Memory Utilization A common question is how much memory is necessary for Replication Server performance to achieve desired levels. 3. The old formula for calculating this was (#PDB * 7) + (#RDB * 3) + 4 + (num_client_connections) + (parallel DSI’s) + (subscriptions) + … Ver. because of the autoexpansion. Although the eRSSD uses significantly less system space (i.5+ 55 . the faster Replication Server will be. Min of 256MB with 1GB recommended ASE requirements plus RAID 0+1 device for stable queues – separate controllers/disks for ASE and RS. each RS implemented at a site could have it’s own ASE specifically for the RSSD. subscriptions. RSIQ=Route queues. etc. Transaction volume from primary systems Number of primary and replicate systems Number of parallel DSI threads Number of replicated objects (repdefs. Consequently. permitted). is assumed you already have the ASE software. Although the default creation for the RSSD is only 20MB (2KB pages). However. daemons.Final v2. Explanation The new formula is: 30 + 4*IBQ + 2*RSIQ + DSIQ*(3+max(DSIE)) IBQ=Inbound queues. RSI_USER=inbound routes. This should be at least twice the number of database connections + num_concurrent_subs. RSI_SENDER=outbound routes.000 Num_stable_queues Default: 32 Suggest: 32* num_client_connections Default: 30 Suggest: 20 num_mutexes Default: 128 pre 12.152 11. Max is 60 blocks or 983. num_msgs may need to be set to 128000 Minimum number of stable queues.5*num_threads) Num_msgs Default: 45.x 10.x Maximum SQT interface cache memory for a specific database connection. depending on transaction volume. The default of 30 is probably a little high for most systems . 0.194.x The number of messages that can be enqueued at any given time between RS threads.304(4MB) 11. Amount of memory available to a Rep Agent User/Executor thread for messages waiting in the inbound queue before the SQM writes them out. The old formula for calculating this was: 12+(#PDB * 2) + (#RDB) 10.1 Parameter Num_msgqueues Default: 178 Suggest: 250+ Ver. given that this number must always be larger than num_threads.6 Suggest: see formula 10.6. in bytes. It is mentioned here to emphasize a connection setting that should be changed as the result of changing sqt_max_cache_size).0. Connection (DSI) Settings dsi_sqt_max_cache Default: 0 Suggest: 2. consider num_dsi_threads * 64KB (Note that this is a per connection. The default settings suggest a 1:256 ratio. Based on the above settings. a simpler formula would be: num_threads*2 Recommendation is 250 (2.040 exec_sqm_write_request _limit Default: 16384 Suggest: 983. RSM and other client connections (non-Rep Agent or DSI connections). means the current setting of the sqt_max_cache_size parameter is used as the maximum cache size for the connection. 1024 in 12. 10. The default.040 12.1 56 . Serious consideration should be give to setting this to 4-8MB or higher.097. Settings above 16MB are likely counter productive. 
the formula was changed to: 200 + 15*RA_USER + 2*RSI_USER + 20*DSI + 5*RSI_SENDER + RS_SUB_ROWS + CM_MAX_CONNECTIONS + ORIGIN_SITES RA_USER=RepAgents connecting. To calculate a start point.586 Suggest: 128.x Explanation Specifies the number of OpenServer message queues that will be available for the internal RS threads to use.072 Suggest: 4. although a 1:512 may be more advisable. The old formula for calculating this was: 2 + (#PDB * 4) + (#RDB * 2) + (#Direct Routes) However.x Maximum SQT (Stable Queue Transaction) interface cache memory (in bytes) for each connection (Primary and Replicate). ORIGIN_SITES=number of inbound queues sqt_max_cache_size Default: 131.x As of RS12.Final v2.x Number of isql. RS_SUB_ROWS and CM_MAX_CONNECTIONS are from rs_config. If sqt_max_cache_size is fairly high.5 and native threaded OpenServer. Must be set in even multiples of 16K (block size). you may want to set this at the 2-4MB range to reserve memory.20 may be a more reasonable starting point Used to control access to connection and other internal resources. 10. 120 19.192 19.Final v2.312 40. 4 of the replicate databases have had the dsi_sqt_max_cache_size set to 3MB The RSSD contains about 5MB of raw data due to the large number of tables involved. Must be set in even multiples of 16K (block size).040 (max) 960K@ if maxed 960K@ if maxed Memory 36KB default 2.040 Ver.040.x Explanation Amount of memory available to a DIST thread’s MD module for messages waiting in the outbound queue before the SQM writes them out.125 205 684 1. 11.0. let’s assume we are trying to scope a Replication Server that will manage the following: • • • • • • 20 databases (10 primary. the memory requirements can be quickly determined. md_sqm_write_request_limit and exec_sqm_write_request_limit are maxed at 983. the easy way is to just use the values below as starting points (assumes a normal number of databases ~10 or less . For the sake of discussion. Each of these resources consumes some memory. sqt_max_cache_size to 1MB.1 from md_memory_pool to the current name to correspond with the exec_sqm_write_request_limit parameter. However.960 8.if more/less adjust memory by same ratio): Normal sqt_max_cache_size dsi_sqt_max_cache_size memory_limit 1-2MB 512KB 32MB Mid Range 1-2MB 512KB 64MB OLTP 2-4MB 1MB 128MB High OLTP 8-16MB 2MB 256MB The definitions of each of these are as follows: Normal – thousands to tens of thousands of transactions per day Mid Range – tens to hundreds of thousands of transactions per day OLTP – hundreds of thousands to millions of transactions per day 57 .5MB default 205KB default 140KB default 1 MB min Example (KB) 100 7.1 Parameter md_sqm_write_request _limit Default: 16384 Suggest: 983. Max is 60 blocks or 983.040.200 Minimum Memory Requirement (MB) ~128MB Of course.200 19. once the number of databases and routes are known for each Replication Server. Note that the name was changed in RS 12.200 5. 10 replicate) along with 5 routes 2 of the 10 replicate databases have Warm Standby configurations as well. 
num_threads is set to 250 for good measure (system requires nearly 200) The memory requirement would be: Configuration value/formula num_msgqueues * 205 bytes each num_msgs * 57 bytes each num_mutexes * 205 bytes each num_threads * 2800 bytes each # databases * 64K + (16K * # Warm Standby) # databases * 2 * sqt_max_cache_size dsi_sqt_max_cache_size – sqt_max_cache_size exec_sqm_write_request_limit * # databases Md_sqm_write_request_limit * # databases size of raw data in RSSD (STS cache) exec_sqm_write_request_limit * #databases 983. In high-volume systems. The maximum write delay for the Stable Queue Manager if the queue is not being read. one WS pair) with 500MB of memory. waiting for the RSSD.5 sqt_init_read_delay Default: 2000 Suggest: 1000 for 12. 11. if exec_sqm_write_request limit is set appropriately.6. the built-in recoverability within RS mitigates most of the risk to the point that in testing.c(493) Additional allocation would exceed the memory_limit of '524288000' specified in the configuration. For example. however.1 High OLTP – millions to tens of millions of transactions per day Now. SQT doubles its sleep time up to the value set for sqt_max_read_delay.Final v2. it is probably a clue that you have sqt_max_cache_size set too high (checked it during a large transaction perhaps) – or you need to raise the memory_limit due to the number of connections. there are several other server-level Replication Server configuration parameters that should be adjusted.6 58 .5+ sqm_write_flush Default: on Suggest: off 12. 100 for 15. no data has ever been lost.5? 12. T. bumping it up when admin who. The length of time an SQT thread sleeps while waiting for a Stable Queue read to complete before checking to see if it has been given new instructions in its command queue. a system with 7 connections (3 Warm Standby pairs + RSSD) with a sqt_max_cache_size of 32MB may run great as long as only 2 of the connections are active (i.e.0.5? Explanation Write delay for the Stable Queue Manager if queue is being read. While a theoretical data loss can occur with this off (and UFS devices). F. 2004/09/30 00:11:55.0 12. As soon activity starts on another. (111): Additional allocation of 496 bytes to the currently allocated memory of 524287924 bytes would exceed the memory_limit of 524288000 specified in the configuration. See above for discussion about why this should be set lower. sqt doesn’t show any removed doesn’t help and also can cause you to run out of memory fairly quickly. if the command queue is empty. Setting it higher allows the RS to spend more time writing to the queue vs. Typically. With each expiration. FATAL ERROR #7035 REP AGENT(SERVER. but in low volume systems. General RS Tuning In addition to the memory configuration settings. 2004/09/30 00:11:55. the following happens: T. the rate of space being released from the queue may be impacted negatively. this parameter controls whether RS waits for I/O’s to be flushed to disk for stable queues (effectively. this could impact performance by a factor of 30% or greater (the price of insuring recoverability). One of the best ways to improve RS speed if there is latency in the inbound queue (other than Warm Standby) is to increase the sqt_max_cache_size. This should be ignored for raw partitions.DB) – s/prstok. The impact of this is that the RepAgent (inbound) and DIST (outbound) threads are slowed down to provide time for the SQM to read data from disk for the SQT or DSI threads. However. During periods of high volume activity. 
it is easy to run out of memory if you are not careful. For example, a system with 7 connections (3 Warm Standby pairs + RSSD) with a sqt_max_cache_size of 32MB and 500MB of memory may run great as long as only 2 of the connections are active (i.e. one WS pair). As soon as activity starts on another, the following happens:

T. 2004/09/30 00:11:55. FATAL ERROR #7035 REP AGENT(SERVER.DB) - s/prstok.c(493)
	Additional allocation would exceed the memory_limit of '524288000' specified in the configuration.
T. 2004/09/30 00:11:55. (111): Additional allocation of 496 bytes to the currently allocated memory of 524287924 bytes would exceed the memory_limit of 524288000 specified in the configuration.
T. 2004/09/30 00:11:55. (111): Exiting due to a fatal error

If you get the above error, it is probably a clue that you have sqt_max_cache_size set too high (checked it during a large transaction perhaps), or that you need to raise the memory_limit due to the number of connections. One of the best ways to improve RS speed if there is latency in the inbound queue (other than Warm Standby) is to increase the sqt_max_cache_size. Typically, though, bumping it up when admin who, sqt doesn't show any removed transactions doesn't help, and it also can cause you to run out of memory fairly quickly.

General RS Tuning

In addition to the memory configuration settings, there are several other server-level Replication Server configuration parameters that should be adjusted. These configuration settings include (note that this list does not include the previously mentioned memory configuration settings):

init_sqm_write_delay (default 1000, suggest 50) - Write delay for the Stable Queue Manager if the queue is being read. The impact of this is that the RepAgent (inbound) and DIST (outbound) threads are slowed down to provide time for the SQM to read data from disk for the SQT or DSI threads. Typically, if the queue is being read heavily, the SQT is likely rescanning a large transaction, so increasing this value to favor queue reading will likely result in larger overall latency.

init_sqm_write_max_delay (default 10000, suggest 100) - The maximum write delay for the Stable Queue Manager if the queue is not being read.

sqm_recover_segs (default 1, suggest 10) - Specifies the number of stable queue segments Replication Server scans during initialization. This also impacts how frequently RS updates the rs_oqid table as segments are allocated/deallocated; the default setting can result in near-OLTP loads of 2+ updates/sec in the RSSD. Setting it higher allows the RS to spend more time writing to the queue vs. waiting for the RSSD. During periods of high volume activity this may not have much of an impact, but in low volume systems the rate of space being released from the queue may be impacted negatively.

sqm_write_flush (default on, suggest off; RS 12.6) - Similar to the dsync option in ASE, this parameter controls whether RS waits for I/O's to be flushed to disk for stable queues (effectively, RS uses the O_SYNC flag). This should be ignored for raw partitions. However, if using UFS devices, this could impact performance by a factor of 30% or greater (the price of insuring recoverability). While a theoretical data loss can occur with this off (and UFS devices), the built-in recoverability within RS mitigates most of the risk to the point that in testing, no data has ever been lost.

sqt_init_read_delay (default 2000, suggest 1000 or lower) - The length of time an SQT thread sleeps while waiting for a Stable Queue read to complete before checking to see if it has been given new instructions in its command queue. With each expiration, if the command queue is empty, SQT doubles its sleep time up to the value set for sqt_max_read_delay. See above for the discussion about why this should be set lower.
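As a sketch only (the values are the suggested starting points from the discussion above, not requirements, and some parameters only take effect after the Replication Server is restarted or the affected connection is suspended and resumed - check your version's administration guide), the settings are applied from isql connected to the Replication Server:

configure replication server set memory_limit to '128'
go
configure replication server set sqt_max_cache_size to '4194304'
go
configure replication server set init_sqm_write_delay to '50'
go
configure replication server set init_sqm_write_max_delay to '100'
go
configure replication server set sqm_recover_segs to '10'
go
-- per-connection DSI SQT cache; the connection name is hypothetical
alter connection to NYPROD.pubs2 set dsi_sqt_max_cache_size to '2097152'
go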
there is not much manual tuning you can do to improve I/O performance.not supported for Max OSX nor Tru-64 (DEC) configure replication server set ‘smp_enable’ to ‘on’ is probably beneficial even in uniprocessor boxes (although probably only slightly with 1 cpu). Different from ASE (which loves kernel threading for I/O and context switching for user processes. The net result is that writing to a UFS effectively single threads the process as the operating system performs a synchronous write and blocks the process. OpenServer native threading can only take advantage of multiple processors when the O/S schedules a thread on another CPU. As a result. As a result. but in low volume systems. Now. There is a separate white paper on RS SMP that describes this in detail. this would lend itself extremely well to UFS (file system) storage as UFS drivers and buffer cache are tuned to sequential i/o – especially using “read-ahead” logic to reduce time waiting for physical I/O.6 Explanation The maximium length of time an SQT thread sleeps while waiting for a Stable Queue read to complete before checking to see if it has been given new instructions in its command queue.Final v2. Stable Device Tuning Stable Queue I/O As you are well aware. Still today. from an I/O perspective. Replication Server uses the stable device(s) for storing the stable queues. will the 1MB be deallocated from queue. While most vendors (HP included) have enabled the ability to specify raw I/O (unbuffered) to UFS devices.6 or RS 15. enabling SMP capabilities with -. 100 for 15. Only when all of the blocks within a 1MB allocation unit have been marked for deletion and their respective save intervals expired. for example. 59 . Individual replication messages are stored in “rows” in the block. it may already be on as it is the default). and Direct I/O In an earlier version of this document.6 and 15. As a result. of course. when 5 concurrent users were attempting database operations. in even low concurrent environments.0 ESD #1).0 manuals. Note that the OQID that is provided to the previous stage of the pipeline (RepAgent. As a result. On earlier versions of HP-UX (9. One thing should be noted after all this discussion.0 via the sqm_write_flush configuration (configure replication server set ‘sqm_write_flush’ to ‘on’ .0 training materials) that UFS devices are faster than raw devices. UFS devices. Actually. While the buffer cache can reduce the I/O wait times for highly serialized access and consequently have in the past provided performance improvements for single threaded processes or areas where a spinlock or other access restriction single threads the access. However. but rather the purpose was to illustrate how O/S vendors are changing their respective UFS device implementations to mirror the concurrent I/O capabilities of raw devices. is still raw partitions. performance improvements can be achieved.x). however. Unbuffered I/O. RS typically is not I/O bound (unless a large number of highly active connections are being supported. In fact. guaranteed recoverability. of course. Raw. RS flushes each block as it is full. the better the performance of raw devices vs. one customer who had been using UFS devices immediately noticed a 30% drop in performance. however. When the dsync flag is enabled. RS reads the RSSD to find its location within each queue and then begins checking each block after that point to see if it is still active or if already processed and comparing to the OQID last received by the next stage. 
but only updates the RSSD every sqm_recover_seg .x & 10. this is largely due to the buffer cache and not due to the UFS device implementation. it is a very good book for describing the O/S features that enable top performance as well as understanding the architectures of the major DBMS vendors and their implementation on the Solaris platform. Quick I/O has been certified with ASE. it has not been tested with RS (RS engineering typically does not certify hardware such as solid state disks or third party O/S drivers as these features should be transparent to the RS process and managed by the O/S). the more concurrent the I/O activity.html). or when processing large transactions). even with the buffer cache.1 Async. unless you have reason to believe that you will be I/O bound.com/books/catalog/Packer/index. The preference. For years. that a boldface warning was even placed in the ASE 12. For Solaris customers. HP and others have implemented a version of asynchronous I/O for UFS devices.4 for Sybase: Performance Brief – OLTP Comparison on Solaris 8”.and typically should be set to 10MB. As stated. as part of recovery. one way to get around this and get similar performance on UFS as with raw partitions was to use a tool such as Veritas’s Quick I/O product – which enables asynchronous I/O for UFS devices. The advantage comes from the fact that raw partitions allow concurrent I/O from multiple threads or processes by using the asynchronous I/O libraries.1 did implement the dsync flag similar to ASE 12. Sun. Unfortunately. called “Direct I/O”. With an EBF released in late 2001 (EBF number is platform dependent). UFS devices. Asynchronous I/O provided concurrent I/O and consequently scalability for parallel processing. Additionally. The same was true on SGI’s IRIX at 75% write activity for a single user. This can easily be illustrated using the transaction log. While it does not give a lot of detailed device about database tuning from a DBA’s perspective. In fact. Currently. FSync I/O RS Future Enhancement This will change in a feature being considered for a future RS release. inbound SQM. for now (12. using UFS devices with the sqm_write_flush option ‘off’ may be a usable implementation if raw partitions are not available – particularly in Warm Standby implementations where only a single SQM thread may be performing I/O. bcp or other environment in which I/O concurrency is not involved. the buffer cache is forced to flush each write – and consequently the performance advantage immediately disappears. many were tempted to switch to UFS devices using the dsync flag to ensure recoverability under the hope of getting greater performance. by Alan Packer published by Sun (http://www. UFS devices still use synchronous I/O operations. the degree of concurrency is limited in some operating systems to only 64 concurrent I/O operations – far below the capacity of ASE or RS. RS 12. UFS. As RS 12. titled: “Veritas Database Edition 1. select/into (single threads I/O’s due to system table locks). raw partitions were able to outpace UFS devices. server processes such as ASE or RS can submit large I/O requests in parallel for the same internal user task. Raw devices historically have been used to provide unbuffered I/O as well as multi-threaded/concurrent I/O against the same device – two distinctly different features. an interesting benchmark clearly illustrating the problem with UFS devices was published by Veritas a while ago.which defaults to 1MB . 
What that last bullet on the previous page referred to was that over the past two years. The answer simply is false. does provide the capability for RS to use UFS devices in a recoverable fashion. this is not quite correct. it may require changes to the O/S kernel to enable Direct I/O for UFS I/O activity. This. Consequently. the last bullet caused a bit of misunderstanding that UFS devices would be faster than raw partitions for stable queue devices. as mentioned above. but RS was not engineered to take advantage of it. As a result. buffered UFS devices using dsync suffer such performance degradation.sun. etc. a really good book on devices you can use to justify the use of raw partitions over file systems to paranoid Unix admins is the book Database Performance and Tuning.x was engineered prior to “Direct I/O” availability.) is 60 . Overall. typically do not allow concurrent I/O operations against the same device when using buffered IO.0.Final v2. The misunderstanding is due to a common misconception (that unfortunately was further spread in early ASE 12. this should not pose a problem.Final v2. for any source system. This will help alleviate the I/O contention between: • • Between two different high volume sources or destinations. each previous stage of the pipeline will often start reprocessing from that point and the next stage will simply treat any repeats as duplicates. hopefully few. However. was the concept of “partition affinity”. actual writes will have to occur. This would allow RS to leverage the file system buffering to speed I/O processing by caching most of the writing to the file system buffer cache. Warm Standby inbound queue and other replication sources and targets. 61 .1 based on the even segment point defined by sqm_recover_seg. The future enhancement to RS is to use the fsych i/o call on UFS devices in synchronization with the sqm_recover_seg. this means that any data processed after the OQID is written to the RSSD and until the RS crashed or was shutdown is redundant since if it was not there. unless two high volume source systems are replicating to each other. SQT and DSI/SQT cache will effectively be extended by the file system buffer cache. As a result.1 release. Effectively. you can get similar behavior through an undocumented behavior. The difference between this and adding the database connections prior to adding the extra partitions is illustrated below. True. the inbound queue is used for data modifications from the source while the outbound queue is used for data modifications from other systems destined to the connection. Partition affinity refers to the ability to physically control the placement of individual database stable queues on separate stable device partitions. one queue may end up “migrating” onto another connection’s partitions. Stable Queue Placement One often requested feature new to the 12. the situation on the right is more preferable. if any. eliminating expensive physical reads when the sqt_max_cache_size is exceeded by large transactions. However. One place where it could occur (and consequently bears some monitoring) is if a corporate roll-up system also supports a Warm-Standby. part1 part2 part3 part1 part2 part3 Connections created prior to stable devices part2 and part3 Connections created after stable devices part2 and part3 Figure 10 – Stable Device Partition Assignment & Database Connection Creation Obviously. some future release of Replication Server will likely recommend file system devices over raw partitions. As a result. 
another or longer save interval. it would get resent. Consequently. Since the O/S destages these as necessary to the devices. Some people would quickly point out that separation between the inbound and outbound queues for the same connection is not possible with this scheme.1. It is anticipated that this technique will achieve the following benefits: • • • RepAgent throughput will increase as the write to the inbound queue will be faster. DIST throughput will increase for a similar reason as the writes to the outbound queue will be faster.0. Default Partition Behavior Previous to 12. the Replication Server will round-robin placement of the database connections on the individual partitions. even though it may start this way. this is not necessarily a problem. when the fsych is invoked. If all of the stable device partitions are added prior to creating the database connections. Remember. due to much higher transaction volume to/from one connection vs. you can specifically assign stable queues to disk partitions through the following command syntax: Alter connection to dataserver. The real reason is that the limit in the RSSD system table rs_diskpartitions which tracks space allocations. While RAID 0+1 is preferable. create table rs_diskpartitions ( name varchar(255). logical_name varchar(30). vstart int ) go In the above DDL for rs_diskpartitions. Well. allocated_segs int. it doesn’t quite work that way with Replication Server. try as one might. Assigning disk affinity is actually more of a “hint” than a restriction. then space will be allocated according to the default behavior Stable Partition Devices Another common mistake that system administrators make. you will understand the reason for the next concept: Key Concept #8: Replication Server Stable Partitions should be placed on RAID subsystems with significant amounts of NVRAM. This has nothing to do with 32-bit addressing or the Rep Server could address a full 2GB (2048MB). Those who are familiar with vstart would be quick to claim this could be overcome simply by specifying a ‘large’ vstart and allowing 2-3 stable devices per disk partition. then the space will be allocated for that stable queue on that partition. 62 .Final v2. First. Those quick with math realize that 255 bytes*8 bits/byte=2040 bits – and hence the reason for the partition sizing limits. the following command only creates a 19MB device instead of a 20MB device 1MB inside the specified partition (and the above command would have attempted a partition of –1MB!!). add partition part2 on ‘/dev/rdsk/c0t0d1s1’ with size=20 starting at 1 Now that we understand the good. For example. As each 1MB allocation is allocated/deallocated within the device a bit is set/cleared within this column.database set disk_affinity to ‘partition_name’ [ to ‘off’] or Alter route to replication_server set disk_affinity to ‘partition_name’ [to ‘off’] Any disk partition can have multiple connections queues assigned to it.0. each connection can only be affinitied to a single partition. status int. the column allocation_map is highlighted. not pending to be dropped or dropped). Logical volumes should be created in such a way that I/O contention can be controlled through queue placement. If the space is not available. RAID 5 can be used if there is sufficient cache. Consequently. allocation_map binary(255). Rep Server will never be able to use all of the space in a 40GB drive. id int.1. If space is available on the partition and the partition exists (i. 
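A minimal sketch of the ordering point made above (device paths and sizes are purely illustrative, and the add partition syntax follows the examples shown later in this section): add all of the stable devices first, then create the database connections, so that the round-robin assignment can spread each connection's queues across the partitions.

-- add the stable devices before creating any database connections
add partition part2 on '/dev/rdsk/c1t0d0s4' with size=2000
go
add partition part3 on '/dev/rdsk/c2t0d0s4' with size=2000
go
-- afterwards, verify queue placement and free segments
admin disk_space
go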
Partition Affinity

In Replication Server 12.1, you can specifically assign stable queues to disk partitions through the following command syntax:

alter connection to dataserver.database set disk_affinity to 'partition_name' [to 'off']

or

alter route to replication_server set disk_affinity to 'partition_name' [to 'off']

Any disk partition can have multiple connection queues assigned to it; however, each connection can only be affinitied to a single partition. Assigning disk affinity is actually more of a "hint" than a restriction. If space is available on the partition and the partition exists (i.e. not pending to be dropped or dropped), then the space will be allocated for that stable queue on that partition. If the space is not available, then space will be allocated according to the default behavior.

Stable Partition Devices

Another common mistake that system administrators make is placing the Replication Server on a workstation with only a single device (or on a server, but only allowing the Rep Server to address a single large disk). The 7 partition limit in Unix would restrict it to ~14GB of space, so, try as one might, Rep Server will never be able to use all of the space in a 40GB drive: while a Rep Server can manage large numbers of stable partition devices, each one is limited to 2040MB (less than 2GB). This has nothing to do with 32-bit addressing, or the Rep Server could address a full 2GB (2048MB). The real reason is a limit in the RSSD system table rs_diskpartitions, which tracks space allocations. As documented in the Replication Server Reference Manual:

create table rs_diskpartitions (
	name           varchar(255),
	logical_name   varchar(30),
	id             int,
	num_segs       int,
	allocated_segs int,
	status         int,
	allocation_map binary(255),
	vstart         int
)
go

In the above DDL for rs_diskpartitions, note the column allocation_map. As each 1MB allocation is allocated or deallocated within the device, a bit is set/cleared within this column. Those quick with math realize that 255 bytes * 8 bits/byte = 2040 bits, and hence the reason for the partition sizing limit.

Those who are familiar with vstart would be quick to claim this could be overcome simply by specifying a 'large' vstart and allowing 2-3 stable devices per disk partition. However, it doesn't quite work that way with Replication Server. Consider the following sample of code:

add partition part2 on '/dev/rdsk/c0t0d1s1' with size=2040 starting at 2041

The above command will fail. The reason is that the vstart is subtracted from the size parameter to designate how far into the device the partition will start (the above command would have attempted a partition of -1MB!!). For example, the following command only creates a 19MB device instead of a 20MB device starting 1MB inside the specified partition:

add partition part2 on '/dev/rdsk/c0t0d1s1' with size=20 starting at 1

This restriction can be a bit of a nuisance where multiple high volume connections need more than 2040MB of queue space (particularly where the save_interval creates such situations). Now that we understand the good, bad, and ugly of Replication Server physical storage, you will understand the reason for the next concept:

Key Concept #8: Replication Server Stable Partitions should be placed on RAID subsystems with significant amounts of NVRAM. While RAID 0+1 is preferable, RAID 5 can be used if there is sufficient cache. Logical volumes should be created in such a way that I/O contention can be controlled through queue placement.

RSSD Generic Tuning

You knew this was coming - at least you should have, after all the discussions on the number and frequency of calls between the Replication Server and the RSSD. If you are using the embedded RSSD, you can skip this section (go to STS Tuning below) as it really only applies to ASE based RSSD's. In the earlier discussion of RS internals, no assumptions were made regarding ASE features that could reduce contention between queries or enhance performance, which leads to a critical key concept:

Key Concept #9: Normal good database and server tuning should also be performed on the RSSD database and host ASE database server.

What does this mean? Consider the following points:

• Place the RSSD database in a separate server from production systems. The main reason is that it reduces or eliminates CPU contention that the RSSD primary user might have with production system long running queries (don't let parallel table scans hold your replication system hostage). This also provides the best situation for maintaining flexibility should a reboot of the production database server or RSSD database server be required. If multiple Replication Servers are in the environment, it might make sense to consolidate the RSSDs on a single host (providing enough CPU's exist) and share a common ASE; in such a case, the common ASE may only need 2 engines vs. the minimum of 1 for individual installations, and the ASE can be tuned specifically for RSSD operations (i.e. turn on TCP no delay).
• Dedicate a CPU to the RSSD database server and monitor CPU utilization.
• Raise the priority for the RSSD primary user.
• Place the RSSD catalog tables in a named cache separate from the exceptions log (although rs_systext presents a problem - put it in with the system catalog tables) and also use a different cache for the queue/space management tables. This is as much to decrease spinlock contention as it is to ensure that repeated hits on one RS system table don't flush another from cache.
• Place the tempdb in a named cache.
• Set the log I/O size (i.e. bind the log to a log cache with a 4K pool) for the RSSD.

There are a few triggers in the RSSD, including one on the rs_lastcommit table (fortunately not there in primary or replicate systems), that are used to ensure users don't accidentally delete the wrong rows from the RSSD.

STS Tuning

In the illustration of RS internals, the System Table Services (STS) module is shown as the interface between the Replication Server and the RSSD database. The STS is responsible for submitting all SQL to the RSSD - object definition lookups, subscription materialization progress, recovery progress, oqid tracking, segment allocations, configuration parameters, etc. As you can imagine, this interaction could be considerable. While it is not exactly possible to improve the speed of writes to the RSSD from the STS perspective, any improvement that caches RSSD data locally will help speed RS processing of replication definitions, subscriptions and function strings.

STS Cache Configuration

Prior to RS 12.1, only a single tuning parameter was available - sts_cache_size - while in version 12.1 another set of parameters was added to enforce a much desired behavior (sts_full_cache_XXX), as described below:

sts_cache_size (default 1000, suggest 5000; 11.x+) - Controls the number of rows from each table in the RSSD that can be cached in RS memory. Recommended setting is the number of rows in rs_objects plus some padding.

sts_full_cache_{table_name} (default/suggest: see notes; 12.1) - Controls whether a specific RSSD table is fully cached. If a table is fully cached, the sts_cache_size limit does not apply. Note that the default is on for rs_repobjs and rs_users, but off for all other tables; suggest enabling it for rs_objects, rs_columns, and rs_functions in addition to the defaults (see the discussion below). However, when creating a large number of replication definitions or subscriptions at once,
you may want to disable sts_full_cache as the cache refresh mechanism effectively rescans the RSSD after each object creation – noticeably slowing object creation. but rather that the RS only ensured that the rs_repobjs and rs_users tables were fully cached. Additionally. most RSSD tables could be specified to be fully cached in the STS memory pool. only the rs_repobjs (stores autocorrection status of replication definitions at replicate RS’s for routes) and rs_users tables could be fully cached. On the other hand. rs_publications (if using publications). If using a version prior to 12. but blocking) using Historical Server or sp_object_stats. That is not to say that a lot of contention exists within the RSSD.1. only 2-3 i/o’s should be required to fulfill the request including index tree traversal.1 also modified some indexes within the RSSD for faster access.1 Non-unique index on (objid) Deleted indexes in 12. specifying sts_full_cache_rs_routes is probably not effective as it likely would be fully cached anyhow (most likely with <10 rows). That does not infer that other RSSD table rows were not in cache. tables frequently updated could have contention when multiple sources or destinations are involved as the tables modified often have rowsize far less than 1000 bytes (as anything with a rowsize minimally half of 1962 bytes would result in 1 row per page anyhow). it also is updated frequently – which involves updates to the RSSD anyhow (similar to rs_diskpartitions). 10. each query will retrieve an atomic row through specifying discrete primary key values. you may also want to cache rs_functions. in fact. Additionally. Consequently. While rs_locater is small and likely would be fully cached anyhow.. if creating subscriptions. RS 12. if memory permits.1 64 . you may wish to alter the tables to a datarows locking scheme: rs_diskpartitions rs_queues rs_locater rs_segments rs_oqid You also may want to closely monitor the amount of i/o required to fulfill a STS request. Some tables it doesn't make any sense to fully cache as they are considerably smaller that the sts_cache_size parameter. rather the opposite. it is recommended that you cache rs_objects. STS RSSD Table Access The STS module is literally just that. you may need to issue sp_recompile against the table to ensure that stored procedures will pick up the new index – although few stored procedures are issued by the STS (most are admin procedures issued by users directly in the RSSD such as rs_helpsub).0. phys_objowner) Non-unique index on (parentid) Non-unique index on (classid) Deleted indexes in 12. 65 . these counters have been expanded from the initial 8 to the following 9 with the addition of STSCacheExeceeded: Counter QueriesTotal SelectsTotal SelectDistincts InsertsTotal UpdatesTotal DeletesTotal BeginsTotal CommitsTotal STSCacheExceed Explanation Total physical queries sent to RSSD. – along with including the RSSD in any normal maintenance activities such as running update statistics on a periodic basis or using optdiag to monitor tables with data only locking schemes (you should never allow rows to be forwarded in the RSSD). but the other write activity is necessary for recovery and can't be reduced much (except inserts due to counters). In addition to the usual error in the errorlog. Obviously the goal is to reduce the amount of select statements issued – updates possibly can be reduce by sqm_recover_segs. simply modifying a 12. Total Update statements sent to RSSD. Total Commit Tran statements sent to RSSD.1. 
it is highly recommended that you contact Sybase Technical Support before making such changes and that you clearly think through all the impact of the changes to ensure that RS correct operation is not compromised. In addition. it is also advisable to run update statistics after any large RCL changes – such as adding or deleting large batches of replication definitions. Consequently.Final v2.1 Table rs_databases Added indexes in 12. Total number of time STS cached was exceeded. while others – particularly any direct row modifications – could result in loss of replicated data. it is somewhat useful to know they exist. STS Monitor Counters In RS 12. You should always verify index changes through proper tuning techniques before and after the modification. On the subject of indexes. Total Select Distinct statements sent to RSSD. but for now. dbname) Unique index on (ltype.0 installation with the above changes may degrade performance. etc. Total Begin Tran statements sent to RSSD. phys_tablename. objtype. dbid) Non-unique index on (ltype) Clustered index on (funcname) Non-unique index on (dbid.1 Unique clustered index on (objid. Total Delete statements sent to RSSD. dsname. the following monitor counters were added to track RSSD requests from the RS via the STS. Note that some specific types of STS activity can be monitored with counters for other modules. Later we will discuss how to set up these counters and how to sample them. Adding indexes or changing the locking scheme are fairly benign operations (assuming the RS is shutdown during the modification and taking into consideration the extra i/o required to maintain the new indexes). subscriptions. As of 12. colnum) Unique clustered index on (ltype. After making any indexing changes.1 rs_functions rs_objects rs_systext rs_translation Clustered index on (objid) It should be noted that the above indexes where added/deleted due to observed behavior and changes in SQL submitted by the STS. Total Select statements sent to RSSD. Total Insert statements sent to RSSD. For example.6. the SQM module includes a counter for tracking the number of updates to the rs_oqid table. Keep in mind that any RSSD changes you make will be lost during an RS upgrade or re-installation. you can watch the STSCacheExceed to determine if you need to bump up the sts_cache_size configuration parameter. All interaction to the RSM monitoring “agents” will be done through the SMS RSM “domain manager” Configure RSM Client (Sybase Central Plug-In) or other monitoring scripts to connect to the SMS RSM “domain managers”. let’s say you activate M&C for all the modules – and notice a huge number of inserts via the STS. You will have no record of these changes and it is too easy to make mistakes. will fail at the replicate due to a replicate database/ASE issue. 3. you will after the first time you have to recreate them. These RSM Servers will function as the monitoring “agents”. If Backup Server. if you don’t have a record of your replication system. is it depends. This RSM will function as the RSM “domain manager”. If more than one RS is on a host. the next question is “How is it best implemented?” The answer.1 A key point about the STS counters. consider adding multiple RSM monitoring agents every 3-5 RS’s (depending on RS load). Following the above. consider having one of the RSM “monitoring agents” on that host also monitor the process if no other monitoring capability has been implemented. Do NOT allow changes to the replication system be implemented through RSM. 
such as a nightly batch job.0. 2. 4. Any site that is using Replication Server in a production system without using RSM or equivalent third-party tool (such as BMC patrol) has made a grave error that they will pay for within the first 3 months of operation. Rather than think the RSSD is getting hammered by inserting into general RSSD tables. The main reason for this is that it is a GUI. 6. however. This virtually guarantees that a transaction. is that they will reflect the STS activity to record M&C activity. However.Final v2. Have developers create scripts that can be thoroughly tested and run with high assurance that “fat fingers” won’t crash the system. the classic “ran out of locks” error from the replicate ASE during batch processing. Configure one RSM Server on primary SMS monitoring workstation per replication domain. OpenSwitch or other OpenServer process is critical to operation. The last bullet is important. Why is this true?? Simply because most sites don’t test their applications today and as a consequence the transaction testing which is crucial to any distributed database implementation is missed. 5. For example. a sample environment might look like the following: 66 . consider the following guidelines: 1. of course. RSM/SMS Monitoring Installing Replication Server Manager (RSM) is an often neglected part of the installation. RSM Implementation Having established the need for it. you need to subtract the number of counter values inserted during that time period from the InsertsTotal to derive the non-statistics related insert activity. Configure one RSM Server on each host where a Replication Server or ASE resides. RSM load ratios: 1 RS = 3 ASE = 20 OpenServers. Similar to not keeping database scripts. For example. but did not quite make it in time. using an existing ASE in high volume environments. Currently. The statistics are implemented via a system of counters that can either be viewed through an RS session or can be flushed to the RSSD for later viewing/analysis. As a result.0.3+. • On one production system with a straight Warm Standby implementation. If the polling cycle is set too small – or too many individual administrators are independently attempting to monitor the system.0 and 11. Performance RSM or other SMS software monitoring can impact Replication performance in several ways: • Unlike ASE’s shared memory access to monitor counters. Similar to the partition affinity feature. in reality they are closer to Historical Server or the MDA monitoring tables in ASE 12.000) during a single day of monitoring.5. While in discussion it has often been compared to sp_sysmon. However.5 with the M&C for testing and debugging purposes only. All of this is leading up to one point: Key Concept #10 – Monitoring is critical – but make the heart beat. in RS12. the Replication Server and RSSD must be “polled” to determine system status information. The rationale is that sp_sysmon in ASE simply reports the total of any counter during the entire monitoring period. not race! RS Monitor Counters One of the major enhancements to Replication Server 12.000.0 release. Historical Server and Replication Server have both implemented a “sample interval” type mechanism in that counter values are flushed to disk on a periodic basis during the sample run. Excessive use of the heartbeat feature can interfere with normal replication. between the RS accesses and the RSM accesses to the RSSD. the monitors & counters (M&C) were originally slated for the 12. 
special EBF’s have been created to “backfit” RS 12.1 was performance monitoring counters. replication increased tempdb utilization by 10% (100. This allows peaks to be identified as well as actual cost of individual activities.0. it is clearly enough of a load to consider the separate RSSD server vs. with 300 counters.Final v2.6. it is difficult to document them all in the product 67 . Because the way RSM “re-uses” many of the same users. Obviously. nearly 300 counters exist.000 inserts out of 1. with the possibility of more being added in future releases. it was impossible to differentiate between RS and RSM activity. this polling could degrade RS and RSSD performance.1 PDS Monitor Server RSM RSSD RDS Monitor Server RSM RSM Historical Server SMS Trends Database DBA SMS Server Figure 11 – Example Replication System Monitoring RSM vs. A comparison of the RS 12. and last counter values. 2. three additional tables were added to the RSSD to track the counter values and store counter specifics.0 rs_statdetail table is illustrated below: 68 . 3.6 are illustrated below (along with rs_databases due to the relationship with rs_statdetail): rs_databases dsname dbname dbid dist_status src_status attributes errorclassid funcclassid prsid rowtype sorto_status ltype ptype ldbid enable_seq varchar(30) <pk> varchar(30) <pk> int tinyint tinyint tinyint rs_id rs_id int tinyint tinyint char(1) char(1) int int dbid = instance_id rs_statdetail run_id instance_id instance_val counter_id counter_val label counter_id = counter_id run_id = run_id rs_statcounters counter_id counter_name module_name display_nam e counter_type counter_status description <pk> int varchar(60) varchar(30) varchar(30) int int varchar(255) rs_statrun run_id run_date run_interval run_user run_status <pk> rs_id datetime int varchar(30) int rs_id int int int int varchar(255) <pk.1 documentation. this document will discuss the counters in detail as well as provide a list of counters that apply to each of the applicable threads in later sections.fk2> <pk. Monitor counter system tables in the RSSD RCL commands to enable and sample the counters SQL commands to sample counters flushed to the RSSD RCL commands to reset the counters The dStats daemon which performs the statistics sampling RSSD M&C System Tables. These tables for RS 12. RS Counters Overview The monitoring counters implementation and their use can be divided into five basic areas: 1.fk3> <pk> <pk. 5. etc.fk1> Figure 12 – RSSD Monitor & Counter Tables In Replication Server 15. In addition to the logic and RCL commands added to implement the counters. you can view descriptive information about the current counters by using the rs_helpcounter stored procedure. Since it is extremely applicable to performance and tuning.Final v2.0. However.0. max. 4.x rs_statdetail table and the RS 15. This section will provide an overview of the counters as well as those counters specific to RSSD activity. the rs_statdetail table changed slightly due to a different method of recording average. RS 15. Information about the individual counters is stored in the rs_statcounters. and memory utilization for the various modules. For example. If using RS 15.0 has a single counter DSIEResultTime. The rs_statcounters table is highly structured: Column Name counter_id counter_name module_name display_name counter_type counter_status description instance_id Example Value 4000 RSI: Bytes sent RSI BytesSent 1 140 Total bytes delivered by an RSI sender thread. similarly for counter_max. 
while counter values from each run are stored in the rs_statdetail table with the run itself stored in the rs_statrun table. the last value for the counter (counter_last) and the maximum for the counter (counter_max). you simply select counter_last. you would simply derive the average by selecting counter_total/counter_obs. the counter ids are arranged by internal RS module that the counter is used for. the total for the counter (counter_total). the counter id ranges and modules used in the rs_statcounters table: Counter Id Range 4000-4999 5000-5999 6000-6999 11000-11999 13000-13999 24000-24999 Module RSI DSI SQM STS CM SQT 69 . multiple instances of DSI-E).6 there is a single counter_val column. The following table lists.6 had counters for the last.0 rs_statdetail table comparison The main difference is that while in RS 12. 2 Explanation The id for the counter – counter id’s are arranged by module as detailed below.x and 15.0 records the number of observations (counter_obs).0 and you want the last value for DSIEResultTime.1 Figure 13 – RS 12. As mentioned earlier. RS 15. Descriptive external name for the counter Module that the counter applies to Used to identify the counter through RCL The type of counter as detailed below The relative impact of the counter on RS performance as detailed below The counter explanation The particular instance of the module or thread. time. max and total for some counters such as DSIEResultTimeLast. with a minimum of 2 connections. As a result. where RS 12. DSIEResultTimeMax and DSIEResultTimeAve.to get the average DSIEResultTime. This difference mainly effects counters tracking rates.Final v2. you will have 2 instances of DSI-S threads (or with parallel DSI.0. The average is the only change . Counters used by admin statistics.. is sampled even when sampling is not enabled.1 Counter Id Range 30000-30999 57000-57999 58000-58999 60000-60999 61000-61999 62000-62999 Module DIST DSIEXEC RepAgent (EXEC) Sync (SMP sync points) Sync Elements (mutexes) SQMR (SQM Reader) The counter type and status designate whether the counter is a total sampling. From this. many of the values are encoded. Counters that measure rate. as well as the impact of the counter on performance and other status information. These are described in the following table: Value Variable Explanation Counter Types (Enumerated) 1 2 3 4 @CNT_TOTAL @CNT_LAST @CNT_MAX @CNT_AVERAGE Keeps the total of values sampled Keeps the last value of sampled data Keeps only the largest value sampled Keeps the average of all values sampled Counter Status (Bitmask) 1 2 4 8 16 32 64 128 256 @CNT_INTRUSIVE @CNT_INTERNAL @CNT_SYSMON @CNT_MUST_SAMPLE @CNT_NO_RESET @CNT_DURATION @CNT_RATE @CNT_KEEP_OLD @CNT_CONFIGURE Counters that may impact Replication Server performance. and retains the both the current and previous value. average. sysmon command (counter_status=140 ⇐ 128 & 8 & 4 = CNT_KEEP_OLD & CNT_MUST_SAMPLE & CNT_SYSMON). For example.Final v2.0. Counters that keep the run value of a Replication Server configuration parameter. etc. in addition to being a RSI counter. sysmon command. Counters that keep both the current and previous value. For example. Counters that measure duration. When looking at rs_statrun and rs_statdetail. Counters that are not reset after initialization. consider the following example run_id and the decomposition: Figure 14 – Example rs_statrun value and composition 70 . Counters that sample even if sampling is not enabled. Counters used by Replication Server and other counters. 
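A quick way to browse the counters for a particular module or counter id range is simply to query rs_statcounters directly in the RSSD (or in the analysis repository discussed later). A minimal sketch using the columns shown above – the DSI module is chosen purely as an example:

select counter_id, display_name, counter_type, counter_status, description
from rs_statcounters
where module_name = 'DSI'
order by counter_id
go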
run_id itself is composed of two components – the monitored RS’s site id (from rs_sites) in hex form and the run sequence also in hex. keeps a running total of bytes sent (counter_type=1). you can determine that the sample counter listed above (RSI: Bytes Sent). and is also used by the admin statistics. @prsid) failed as it appears to suffer from byte swapping issues). inbound queue. 1 Warm Standby DSI corresponds similarly to 0/1 for outbound/inbound SQM queue identifiers 1 . sp_helpcounter prints every column in rs_statcounters for each counter. To view a list of modules that have counters and the syntax of the sp_helpcounter procedure.0. Warm Standby DSI reader outbound queue (DSI SQT). 71 . to isolate statistics from one RS from the other. and the external name of each counter. The instance_id column values depend on the thread and more specifically the counter module. One slight gotcha with this formula is that the strtobin() function is as of yet undocumented in ASE – but also unfortunately is the only way of performing this criteria comparison (attempts to use convert(binary(4). you need to focus on the RS site_id by using a where clause similar to: strtobin(inttohex(@prs_id))=substring(run_id. sp_helpcounter prints the display name. inbound queue (not applicable) SQT DIST DSI ldbid dbid dbid 0 normal DSI. If you enter short. If you enter long. enter: sp_helpcounter To view descriptive information about all counters for a specified module. Consider the following table that illustrates the various thread and instance_id values: Counter Module REPAGENT SQM SQMR Instance_id ldbid for RS 12. module name. -1 (not applicable) DSIEXEC dbid RSI rsid You can view descriptive information about the counters stored in the rs_statcounters table using the sp_helpcounter system procedure. {type | short | long} ] If you enter type.1. If you do not enter a second parameter. counter type. while rs_helpcounter is used to find a counter by keyword in the name or by a particular status. the individual connection queues. the module name. the syntax is: rs_helpcounter { ’intrusive’ | ’sysmon’ | ’rate’ | ’duration’ | ’internal’ | ’must_sample’ | ’no_reset’ | ’keep_old’ | ’configure’ } Note the difference between the two procedures – sp_helpcounter is used to list the counters for a module (or all modules). enter: sp_helpcounter module_name [. If you need to do this. and counter status for each of the module’s counters.0 ldbid for inbound dbid for outbound ldbid Instance_val column value -1 0 1 10 11 21 0 1 -1 (not applicable) outbound queue.1 The site id is especially needed if trying to analyze across a route and you have combined statistics from more than one RSSD to perform the analysis.#dsi_num_threads This number is the specific DSIEXEC thread number. With warm standby systems. the queue related instance_val values will be reported for the logical connection due to the single queue used for the each inbound and outbound queue vs.Final v2. inbound queue outbound queue. {type | short | long} ] To list counters with a specified status. The instance_val typically maps to the connection’s dbid or rsid (for routes). and counter descriptions for each counter. sp_helpcounter prints the display name. sp_helpcounter prints the display name. module name. To list all counters that match a keyword. Probably the two most confusing values to decode are the instance_val and instance_id values.4) In which @prs_id is the site id of the RS in question from rs_sites.6 dbid for RS 15. enter: rs_helpcounter keyword [. 2. 
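Because counter_status is a bitmask, the status values listed above can be tested with a bitwise AND. As an illustration (a sketch only, using the same rs_statcounters columns described earlier), the following lists the counters reported by the admin statistics, sysmon command (CNT_SYSMON, bit 4) and flags which of them are also intrusive (CNT_INTRUSIVE, bit 1):

select counter_name, module_name,
       intrusive = case when counter_status & 1 = 1 then 'yes' else 'no' end
from rs_statcounters
where counter_status & 4 = 4     -- CNT_SYSMON counters
order by module_name, counter_name
go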
Generically. Enabling flushing Use the configure replication server command with the stats_flush_rssd option to enable or disable flushing. It turned out in reality that these counters had much much less impact than anticipated . For general monitoring. the counters do not record data and no metrics can be flushed to the RSSD. However. The default is “on.especially in determining the holdup of the DSI/DSIEXEC processing. You can enable or disable intrusive counters using the admin stats_intrusive_counter command. Configuring the flush interval for specific modules or connections Each of these steps will be discussed in more detail in the following paragraphs. Setting seconds between flushes You set the number of seconds between flushes at the Replication Server level using the configure replication server command with the stats_daemon_sleep_time option.Final v2. As far as the rest of this section. the maximum value is 3153600 seconds (365 days). The syntax is: configure replication server set ’stats_daemon_sleep_time’ to sleeptime The minimum value for sleeptime is 10 seconds. Enable sampling of non-intrusive counters Enable sampling of intrusive counters Enable flushing of counters to the RSSD (if desired) Enable resetting of counters after flush to the RSSD Set the period between flushes to the RSSD (in seconds). 3. 72 . Enabling sampling of non-intrusive counters You enable or disable all sampling at the Replication Server level using the configure replication server command with the stats_sampling option. 4. The default is “off. Initially.” The syntax is.0. The default is “off. much of the information was pulled straight from the RS 12.6) The very first thing that must be done prior to enabling M&C is to increase the size of the RSSD – hopefully you did this when you installed the RS or you will need to now. Enabling reset after flushing Use the configure replication server command with the stats_reset_afterflush option to specify that counters are to be reset after flushing . however. this may have to be decreased to 60-120 seconds (1-2 minutes) to ensure accurate latency and volume related statistics. This step is optional in a sense that you can view the statistics without flushing them. Counters that may affect performance—intrusive counters—are enabled separately so that you can enable or disable them without affecting the settings for non-intrusive counters.and in RS 15. and routes to flush.” The syntax is: configure replication server set ’stats_reset_afterflush’ to { ’on’ | ’off’ } Certain counters. connections. The default is 600 seconds. are never reset.0. such as rate counters with CNT_NO_RESET status.” The syntax is: configure replication server set ’stats_sampling’ to { ’on’ | ’off’ } If sampling is disabled.1 release bulletin and is simply repeated here for continuity (one of the benefits of working for the company is that plagiarism is allowed).” The syntax is: configure replication server set ’stats_flush_rssd’ to { ’on’ | ’off’ } You must enable flushing before you can configure individual modules. 6. these were some of the more useful counters .1 Enabling M&C Sampling (RS 12. { ’on’ | ’off’ } It is highly recommended that you enable intrusive counters. admin stats_intrusive_counter. the most beneficial use of the monitors will only be achieved via flushing them to the RSSD for later analysis and baselining configuration settings. Enabling sampling of intrusive counters Most counters sample data with minimal effect on Replication Server performance. Additionally. 
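Putting the pieces together – the instance_id mapping to rs_databases.dbid and the strtobin() filter on the run_id – a query reporting flushed counter values by connection name for a single RS might look like the sketch below. The @prs_id value here is hypothetical (take it from rs_sites for the RS of interest), and counter_val is the RS 12.6 column; for RS 15.0 substitute counter_obs, counter_total, counter_last or counter_max as appropriate. Route (RSI) counters, whose instance_id is an rsid rather than a dbid, would need a join to rs_sites instead:

declare @prs_id int
select @prs_id = 16777317    -- hypothetical site id of the RS of interest (from rs_sites)
select d.dsname, d.dbname, c.display_name,
       s.instance_val, s.counter_val, r.run_date
from rs_statdetail s, rs_statcounters c, rs_statrun r, rs_databases d
where s.counter_id = c.counter_id
  and s.run_id = r.run_id
  and s.instance_id = d.dbid
  and strtobin(inttohex(@prs_id)) = substring(s.run_id, 1, 4)
order by d.dsname, d.dbname, c.display_name, r.run_date
go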
for narrower performance tuning related issues. it was assumed that these counters would impact performance as they primarily tracked execution times of various processing steps. The default is “on. enabling the monitors and counters for sampling is accomplished through a series of steps outlined below: 1. the notion of intrusive counters was eliminated. the default may be fine. 5. { ’on’ | ’off’ } where: • • • • • • data_server is the name of a data server. and SQT modules. Hint. and routes A hierarchy of configuration options limit the flushing of counters to the RSSD. The command admin stats_config_module lets you configure flushing for a particular module or for all modules. Consequently. The syntax is: admin stats_config_connection. You can configure flushing for individual threads or groups of threads.0. { module_name | all_modules }. Routes .x Script The typical performance analysis session might use the following series of commands to fully enable the counters: admin statistics. for a particular connection. and module_name is sqm or rsi. but all new threads will have flushing turned on. {’on’ | ’off’ } where module_name is dist. you can choose to flush metrics from a matrix of available counters.Use the admin stats_config_connection command to enable flushing for threads related to connections.” The syntax is: admin stats_config_module. database | all_connections }. rsi. {’on’|’off’} where rep_server is the name of the remote Replication Server. Connections . or sqt. The syntax is: admin stats_config_route. For multithreaded modules. database is the name of a database. This does not turn on flushing for existing threads of that module. repagent. it would be a good idea to place frequently used configurations used for counter flushing in a script file. Configuration parameters that configure counters for flushing are not persistent. { data_server. Hint. This command is most useful for single or non-threaded modules. You can set flushing on for all counters of individual modules or all modules using the command admin stats_config_module. you have greater control over which threads are set on if you use the admin stats_config_connection and admin stats_config_route commands. sqt.1 Configuring modules. SQM. dsi. Note: If you configure flushing for a thread. 'on' go configure replication server set 'stats_flush_rssd' to 'on' go configure replication server set 'stats_reset_afterflush' to 'on' go configure replication server set 'stats_daemon_sleep_time' to '60' go 73 . Replication Server also turns on flushing for the module.You can use the admin stats_config_route command to save statistics gathered on routes for the SQM or RSI modules. this will produce a lot of output. sts.Final v2. all_routes specifies all routes from the current Replication Server. { module_name | ’all_modules’ }. The number of threads for a multithreaded module depends on the number of connections and routes Replication Server manages. For example. REPAGENT. they do not retain their values when Replication Server shuts down. The default is “off. all_modules specifies the DIST. dsi. connections.x does not flush counters that have a value of zero. inbound | outbound identifies SQM or SQT for an inbound or outbound queue. reset go configure replication server set 'stats_sampling' to 'on' go admin stats_intrusive_counter.{ rep_server | all_routes }. this too will produce a lot of output. Note If a module’s flushing status is set on. DSI. counters for all new threads for that module will be set on also. 
Before you can configure a counter for flushing. Note Replication Server 12. { module_name | all_modules }. or sts. sqm. Example RS 12. cm. [ 'inbound' | 'outbound' ]. For multithreading modules. you can configure flushing for a module. module_name is dist. which have only one thread instance. repagent. or for all connections. all_connections specifies all database connections. make sure that you first enable the sampling and flushing of counters. sqm. 0 would look like the following: admin statistics. To determine the display name of a counter. The biggest problem with this syntax is that typically you know the interval you want (15 seconds or 1 minute) but either don’t know how long you wish to collect data for .the admin statistics. To display information. What makes this tricky is the last parameter . there was a conscious effort to simplify the commands needed to implement counter sampling. status go One word of warning . you should reset the counters after each flush – this helps prevent counter rollover during sampling – particularly for the byte-related counters. Viewing current counter values RS monitor counter values can be either viewed interactively via RCL submitted to the replication server via isql or other program or by directly querying the RSSD rs_statrun and rs_statdetail tables if the statistics were flushed to the RSSD. save. Using this duration. 10800 go admin statistics. and given a number of observations. 'all_modules'. "all".800 seconds with 720 observations yields a 15 second sample interval. As noted in the sample script above. One other difference between RS 12. For example. <module>.6. enter: admin statistics.0 is that in RS 12.x and RS 15. display_name] where module-name is the name of the module and display_name is the display name of the counter. • To view current values for all enabled counters. module_name [. dCM.0. <num observations>. SQM.collect stats for “all” modules -save them to the RSSD -collect for 3 hours at 15 sec interval admin statistics. Consequently.0) In Replication Server 15.or you know it terms of hours and minutes. a sample script to enable monitoring for RS 15. you can derive the sample interval. DSIEXEC. <num observations> is the number of observations to make. The first issue is that it is measured in seconds. use sp_helpcounter. enter: 74 . Enabling M&C Sampling (RS 15.0. Each of these methods will be discussed. reset command will truncate rs_statrun and rs_statdetail . 'on' go The first line ensures that the first sample is reset vs. MD. 3 hours translates into 10. MEM. DIST. Replication Server can display information about these modules: DSI. <save>. The other difference is that currently there is no explicit start/stop command. <sample period> The first two parameters are fairly self explanatory.0.see manual for all options/parameters): admin statistics. and MEM_IN_USE. Instead the admin statistics command uses the syntax (note this is an abbreviated syntax . REPAGENT.Final v2. counters with a value of zero will be flushed if the number of observations are greater than zero for that counter. 720. 10. As a result. you often find out that before you execute the command. reset go -.800 seconds. use the admin statistics command as specified below: • To view the current values for one or all the counters for a specified module.1 admin stats_config_module. 
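As a concrete illustration of the connection- and route-level commands described above, a session that flushes only the inbound SQT counters for one primary connection, the DSI counters for one replicate connection, and the RSI counters for all routes might add lines such as the following (the server and database names here are purely illustrative):

admin stats_config_connection, PDS, pdb1, 'sqt', 'inbound', 'on'
go
admin stats_config_connection, RDS, rdb1, 'dsi', 'on'
go
admin stats_config_route, all_routes, 'rsi', 'on'
go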
you are deriving the parameter values with formulas such as: Sample_period = time in hrs * 60 * 60 Num_observations = sample_period / sample_interval (in seconds) Because of the usability issues with this. a subsequent release may support an enhancement to change the syntax to accepting the sample interval directly and accepting the sample period using a notation such as 3h or 120m for entering as a number of hours/minutes. In RS 15.the sample period. Along with this.so be sure to preserve the rows if you wish to keep them. counters with a value of zero were not flushed to the RSSD. it should be noted that smallest sample interval supported is 15 seconds. SQT. In RS 15. Viewing counter values via RCL Replication Server provides a set of admin statistics commands that you can use to display current metrics from enabled counters directly to a client application (instead of saving to the RSSD). RSI. holding over the cumulative counts which can distort the very first sample in the run. However the last two take a bit of getting used to.0. Additionally. instance_id. Calculating derived values when the daemon thread wakes up.1 admin statistics.Final v2. historical trend data could take up considerable space within the RSSD.module_name = ’CM’ order by counter_name. Admin statistics. In addition. and RSI modules. such as rate counters. reset to zero. sysmon [. it may not be the best option. module_name.counter_id and d. <sample period> command for RS 15. You can reset all counters except counters with CNT_NO_RESET status. reset command to reset all counters. flush_status Viewing values flushed to the RSSD You can view information flushed to the rs_statdetail and rs_statrun tables using select and other Transact-SQL commands. you want to display flushed information from the dCM module counters. and prints the results. rs_statrun r where c.0. Counters that can be reset. sysmon [. If. by setting the flush 75 . while enabling these counters require executing special routines including system clock function calls. For example. instance_id. Use the configure replication server set ’stats_reset_afterflush’ to ’on’ command. sample_period] prints the current values of the counters. it attempts to calculate derived statistics such as the number of DSI-thread transactions per second or the number of RepAgent bytes delivered per second. you can not do a full analysis. the counters have been configured to save to the RSSD either by the configure replication server set 'stats_flush_rssd' to 'on' command for RS 12. run_date from rs_statcounters c. rs_statdetail d. the best option is to use an external repository to collect the RSSD statistics and necessary information to perform analysis Resetting counters Counters are reset when a thread starts. When the daemon wakes up. The difference is that the normal counter execution code is executed regardless. for example. run_date In this instance. You can configure a sleep time for dSTATS using the configure replication server command and the stats_daemon_sleep_time parameter. intrusive counters impact RS performance within the RS binary codeline itself. You can reset counters by: • • Configuring Replication Server to ensure that counters are reset after sampling data is flushed to the RSSD.run_id and c. DSIEXEC. Another good reason is that extensive querying of the RSSD will put a load on the server that may impact the ability for it to respond as quickly to the normal RSSD processing of RS requests. samples for the specified sample period. <num observations>. 
One word of caution: the counters can impact RS performance indirectly.counter_id = d. enter: admin statistics.run_id = r. counter_val. ’all_modules’ • To view a summary of values for the DSI. While you can view the counter data directly in the RSSD. dSTATS manages the interface when Replication Server has been configured to flush statistics to the RSSD using the configure replication server command and the stats_flush_rssd parameter.if a route is involved. sample_period] where sample_period is the number of seconds for the run. enter: admin statistics. • To display counter flush status.6 or the admin statistics. you might enter: select counter_name. However. sample_period] zeros the counters. The biggest reason is that the counters in the RSSD only represent that Replication Server’s values . dSTATS daemon thread The dSTATS daemon thread supports Replication Server’s new counters feature by: • • Managing the interface for flushing counters to the RSSD. REPAGENT. save.0. Issuing the admin statistics. Consequently. Impact on Replication Obviously. "all". admin statistics. the impact is not as great as the name would imply – probably less than 15%. If sample_period is 0 (zero) or not present. sysmon [. which are never reset. some counters are reset automatically at the beginning of a sampling period. rs_sites out . By having these tables.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" A8192 bcp %RSSD%.rs_databases out . rs_objects. other tables such as rs_routes. RS M&C Analysis Repository The second thing you will want to do (after increasing the size of the RSSD above). The reason for rs_sites and rs_databases is that the counter instance_id’s use the RS id instead of the connection name. In addition. having a current copy from the RS allows you to quickly look at configuration values during analysis without having to log back in to the RS in question.. You will need to bcp out all the data from the RS’s involved in the replication if routes are involved.\%RSSD%\rs_statrun. analysis is much easier as the connection names can be used instead of continually looking up the corresponding dbid or prsid. While rs_config can change due to configuration changes. use separate databases in the same server due to differences in rs_statcounters (counter values and names) and rs_statdetail (counter columns). etc.contains a list of the statistics sample periods. there are a couple of notes: • • If using a mixed RS 12. rs_statdetail . is to create a repository database to upload the counter data to after collection. you may wish to create a copy of your rs_ticket table in the repository as well if you use the rs_ticket feature. As mentioned earlier. etc. could be included in the extraction if doing the analysis “blind” (without knowing the topology if routing is involved or if the proper subscriptions have been created)..bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192 76 .bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192 bcp %RSSD%.1 interval to 1 second and collecting a wide-range of counters. the reasons for this is due to: • • • • Enable you to perform analysis of replication performance when using routes Avoid consuming excessive space in the RSSD retaining historical data Prevent cpu load of analysis queries from impacting RSSD performance for RS activities Prevent loss of statistics in RS 15. Additionally.. 
that many inserts/second could fill the transaction log much quicker which could result in a log suspend (definitely impeding RS performance) – but do not turn on 'truncate log on checkpoint' – or the next words you will hear from Tech Support will be "I hope you had a backup – and you know how to rebuild queues".. This function is the undocumented strtobin() and bintostr() functions that are used primarily systems involving routes.Final v2. this could slow down queue/oqid updates.. rs_subscriptions.rs_statcounters out . This can be done even with an embedded RSSD that uses ASA as a bcp out is nothing more than a select of all columns/rows from the desired table. rs_statcounters .\%RSSD% bcp %RSSD%.contains the list of all the connections rs_config . Populating the Repository The easiest way to populate the repository is to use bcp to extract all of the above tables from the RSSD. A sample repository is available in Sybase’s CodeXchange online along with stored procedures that can help with the analysis.0 environment.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192 bcp %RSSD%.0.contains the list of all the Replication Servers rs_databases . • Additionally. this may not be that much of a problem. On a healthy ASE.contains the counter values. The repository should contain the following tables: rs_statrun .rs_config out .contains all the configuration values In addition to these tables.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192 bcp %RSSD%.0 when the reset command truncates the tables Creating the Repository It is recommended that you create the repository in ASE due to a function that is not available currently in ASA or Sybase IQ.\%RSSD%\rs_sites.\%RSSD%\rs_databases. The structure and indexes for these tables can be obtained from the rs_install_systables_ase script in the RS directory. About the repository in general. you can add indexes to facilitate query performance.contains the counters descriptions/names rs_sites .rs_statrun out . A sample bcp script to do this might resemble: set RSSD=CHINOOK_RS_RSSD set DSQUERY=CHINOOK mkdir .x and RS 15.\%RSSD%\rs_statcounters. but on most RSSD's. you will notice that the RSSD has a sharp increase of 100+ inserts per second as measured by STS InsertsTotal.\%RSSD%\rs_config. \%RSSD%\rs_databases. the RepAgent processing time. the only timing mechanisms were the RSM heartbeat mechanism or the use of a manually created “ping”/”latency” table. only each individual duplicate row is rolled back. If you use bi-directional routes.. Additionally. So the sequence for bcp-ing in rs_config values is to do the following: 1. you will need to get very familiar with the counter schema.without the -b1 setting. – as it records a timestamp for every thread that touches the rs_ticket record – from the execution time in the primary database. you may need to bcp in each one. loading it can be a bit tricky. consider the following bcp command for rs_databases: bcp rep_analysis. remember to run update statistics for all the tables . bcp will not abort all processing due to the number of duplicates.likely the RRS for a connection.even when multiple RS’s are involved. Analyzing the Counters It should be noted that both Bradmark and Quest have GUI products that provide an off-the-shelf solution for monitoring RS performance. null 77 . bcp in the rs_config table from the RS of interest .bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192 Once you have extracted all the counter data. by setting -m to an arbitrarily high value. After populating the tables. etc.6.0. 
By constraining the batch size to 1. an error in a batch will cause the batch to rollback. the RepAgent User counters are described in the section on the RepAgent User processing. However.1 bcp %RSSD%. This is important . RS specific tuning parameters and default values have an objid value of 0x0000000000000000 while connection specific parameters will have the connection id in hex in the first four bytes such as 0x0000007200000000 (0x72h is 114d . but use the -m and -b switches to effectively ignore the errors.0 documentation). If you use uni-directional routes.\%RSSD%\rs_statdetail.bcp -Usa -P -S %DSQUERY% -b1 -m200 -c -t"|" The difference is that now bcp will commit every row and ignore up to 200 errors before bcp aborts. null.latency check table definition create table rep_ping ( source_server varchar(32) source_db varchar(32) test_string varchar(255) source_datetime datetime dest_datetime datetime default default default default default "SLEDDOG" db_name() "hello world" getdate() getdate() not not not not not null. update statistics is important considering that a few hours of statistics gathered every minute could mean nearly 1 million rows in the rs_statdetail table.this can be done using a script such as: update update update update update update index index index index index index statistics statistics statistics statistics statistics statistics rs_statdetail using 1000 values rs_statrun using 1000 values rs_config using 1000 values rs_statcounters using 1000 values rs_databases using 1000 values rs_sites using 1000 values Using a higher step count and using update index statistics vs. but this time use -b1 and -m1000. Similarly. these tables are replicated between the RS’s. Pre. null. These are described in more detail in the appropriate section later in this document . a collection of stored procedures along with the repository schema has been uploaded to CodeXchange. RS_Ticket In Replication Server 12.rs_databases in .rs_statdetail out . if you don’t have these utilities.Final v2. However. Perhaps it will become one of the most useful timing mechanisms – replacing heartbeats. even if starting from truncated tables. then each server will have a full complement and you only will need to load one copy.for example. The important thing to remember is that dbid’s are unique within a replication domain . rs_config is a bit strange. the various threads with Replication Server – and finally the destination database.so this configuration value corresponds to dbid 114).. although the PRS may be of interest if analyzing a route bcp in the other rs_config tables using the trick above. The reason for the -m1000 is due to the number of default configuration values and server settings. An example of the latter is illustrated below: -. rs_sites and rs_databases may have duplicates when loading data from multiple RS’s simply due to the fact that when routes are created.across all the RS’s within that domain. a new feature called rs_ticket was added – unfortunately too late for the documentation (it is documented in the RS 15. 2. the individual counters and their relationships. The counter data tables (rs_statdetail and rs_statrun) should have no issues . null. If not using one of the vendor tools to facilitate your analysis.RS_Ticket Prior to RS_Ticket. For instance. source_db varchar(32). A sample is below. test_string varchar(255). source_datetime datetime ) primary key (source_server.6 rsinspri (rs_install_primary) script in $SYBASE/$SYBASE_REP/scripts. ticket_date datetime not null. 
Customize the rs_ticket_report procedure at the replicate database(s). such implementations were sorely lacking as it really didn’t help identify where the latency was occurring. this would cause the replicate field to be populated using the default value – which would be the current time at execution.rep_ping ( source_server varchar(32). any insert into the primary server would be propagated to the replicate(s). The full setup procedure is as follows: 1.1 ) go create unique clustered index rep_ping_idx on rep_ping (source_server. ticket_payload varchar(1024) null. While useful. Verify that rs_ticket is in the primary database – if not. constraint PK_RS_TICKET_HISTORY primary key (ticket_num) ) lock datarows go /*==============================================================*/ /* Index: ticket_date_idx */ /*==============================================================*/ create index ticket_date_idx on rs_ticket_history (ticket_date ASC) go if exists (select 1 from sysobjects 78 .Final v2. Enable rs_ticket at the replicates by 'alter connection to srv.source_datetime) searchable columns (source_server. extract from RS 12. Hence RS engineering decided to add a more explicit timing mechanism that would help identify exactly where the latency is. Rs_ticket setup Rs_ticket is implemented as a stored procedure at the primary as well as a corresponding procedure (and usually a table) at the replicate.smp_db with all tables named dbo.rdb1 set “dsi_rs_ticket_report” to “on”' 2. It should not be marked for replication as it uses the rs_marker routine that is marked for replication.source_db.source_datetime) go --latency check table repdef create replication definition rep_ping_rd with primary at SLEDDOG_WS.0) identity.0.source_db. Since the destination datetime column was excluded from the definition.source_db. You will also need to develop a parsing stored procedure (also below).source_datetime) send standby replication definition columns replicate minimal columns go By creating this set and subscribing to it for normal replication. 3. A sample rs_ticket_report procedure is as follows: if exists (select 1 from sysobjects where id = object_id('rs_ticket_history') and type = 'U') drop table rs_ticket_history go /*==============================================================*/ /* Table: rs_ticket_history */ /*==============================================================*/ create table rs_ticket_history ( ticket_num numeric(10. This could be significantly more accurate than using the rs_lastcommit table’s values which may reflect a long running transaction. .ddd ** ** Note: ** 1.we need to check first for a rollover situation to do this. 8) + ".ddd" select @c_time = getdate() select @n_param = @rs_ticket_param + ". ** ** Parameter ** rs_ticket_param: rs_ticket parameter in canonical form.@c_time)) .0. time output varchar(20). ** 4." + right("00" + convert(varchar(3). time. ** 2. @c_time.for rollovers. datetime.datepart(ms. we are going to cheat and add a date since time datatype is a physical clock time vs.Final v2.3) -. This is an example stored procedure that demonstrates how to ** add RDB timestamp to rs_ticket_param.@time_begin)>datepart(hh.@time_end)) begin 79 . add date and see if greater than getdate() -. datetime first get the hours.. One should customize this function for parsing and inserting ** timestamp to tables. DSI calls rs_ticket_report if DSI_RS_TICKET_REPORT in on. ** 3.<section> ** section ::= <tagxxx>=<value> ** tag ::= V | H | PDB | EXEC | B | DIST | DSI | RDB | .. 
** DSI calls rs_ticket_report if DSI_RS_TICKET_REPORT in on. a duration (in otherwords 35 hours can not be stored in a time datatype) (datepart(hh. datetime -..1 where id = object_id('rs_ticket_report') and type = 'P') drop procedure rs_ticket_report go create procedure rs_ticket_report @rs_ticket_param varchar(255) as begin /* ** Name: rs_ticket_report ** Append PDB timestamp to rs_ticket_param.@n_param) end go if exists (select 1 from sysobjects where id = object_id('parse_rs_tickets') and type = 'P') drop procedure parse_rs_tickets go if exists (select 1 from sysobjects where id = object_id('sp_time_diff') and type = 'P') drop procedure sp_time_diff go create proc sp_time_diff @time_begin @time_end @time_diff as begin declare @time_char @begin_dt @end_dt ----if time. ** ** rs_ticket_param Canonical Form ** rs_ticket_param ::= <section> | <rs_ticket_param>. Don't mark rs_ticket_report for replication.@n_param = "@rs_ticket_param.print @n_param insert into rs_ticket_history (ticket_date.RDB(" + db_name() + ")=" + convert(varchar(8). ** Version value ::= integer ** Header value ::= string of varchar(10) ** DB value ::= database name ** Byte value ::= integer ** Time value ::= hh:mm:ss.RDB(name)=hh:mm:ss. */ set nocount on declare @n_param @c_time varchar(2000). ticket_payload) values (@c_time. 0). null. int.@time_char) return 0 end go create proc parse_rs_tickets @last_two_only as begin declare @pos @ticket_num @ticket_date @rs_ticket @head_1 @head_2 @head_3 @head_4 @pdb @pdb_ts @exec_spid @exec_ts @exec_bytes @dist_spid @dist_ts @dsi_spid @dsi_ts @rdb @rdb_ts @last_row @next_last @ra_latency @rs_latency @tot_latency bit=1 int.0) varchar(10) varchar(10) varchar(10) varchar(50) varchar(30) time int time int time int time int time time varchar(30) time time not null.108)) select @end_dt=convert(datetime.-1) from rs_ticket_history where ticket_num < @last_row 80 . time. numeric(10. null. time create table #tickets ( ticket_num head_1 head_2 head_3 head_4 pdb pdb_ts exec_spid exec_ts exec_bytes exec_delay dist_spid dist_ts dsi_spid dsi_ts rs_delay rdb rdb_ts tot_delay ) numeric(10. abs(datediff(mi.@begin_dt.@end_dt))).@time_end. null.1 select @begin_dt=convert(datetime. time.108)) end else begin select @begin_dt=convert(datetime. null.0). time. varchar(30). null select @last_row=isnull(max(ticket_num)."Jan 1 1900 " + convert(varchar(20).2)+":" select @time_char=@time_char + right("00"+convert(varchar(2). time. varchar(10).@end_dt))%60).@time_begin. null. varchar(10). not null. varchar(50). int."Jan 1 1900 " + convert(varchar(20).abs(datediff(hh.108)) select @end_dt=convert(datetime. int.2)+":" select @time_char=@time_char + right("00"+convert(varchar(2)."Jan 1 1900 " + convert(varchar(20).0). null. null. varchar(4096). null. time. datetime. null.0) from rs_ticket_history select @next_last=isnull(max(ticket_num). int. numeric(10. null."Jan 2 1900 " + convert(varchar(20).2) select @time_diff=convert(time.@begin_dt.@time_end. null.Final v2. time.@begin_dt. null. null. null. abs(datediff(ss. time. null.@time_begin.0. numeric(10. varchar(10). null.@end_dt))%60). varchar(30).108)) end select @time_char=right("00"+convert(varchar(2). 4096) -.1.@pos-1).".@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket.@rs_ticket)-1)).@pos+1.1.@rs_ticket)+7.@rs_ticket) select @head_3=substring(@rs_ticket.1.Final v2. 
else use null select @head_3=null.@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket.@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket.1.@pos+5.4096) select @exec_spid=convert(int. @pos=charindex("DIST".substring(@rs_ticket.parse out Heading 4 if it exists.charindex('.charindex('. @pos=charindex("H4".1 declare rs_tkt_cursor cursor for select ticket_num.4096) end -.substring(@rs_ticket.parse out Heading 2 if it exists.@rs_ticket)-1)).@pos+1. @pos=charindex("H3".substring(@rs_ticket.1.@rs_ticket)+1.@pos+3.4096) -.@rs_ticket) select @head_1=substring(@rs_ticket.'.".". else use null select @head_4=null.@pos+1.charindex(')'. @rs_ticket=substring(@rs_ticket. @pos=charindex("H2".0.@pos-1).@rs_ticket)+4. @rs_ticket=substring(@rs_ticket. @rs_ticket=substring(@rs_ticket. else use null select @head_2=null.@pos-1).@rs_ticket) select @head_2=substring(@rs_ticket.@pos-1).charindex("B(".4096) -.4096) end -.10)).@rs_ticket)+3.charindex('='. else use null select @dist_spid=null.4096) select @pos=charindex(".@rs_ticket)+5.@rs_ticket)-1)). @rs_ticket=substring(@rs_ticket.charindex('='.parse the first heading and then strip preceeding characters select @rs_ticket=substring(@rs_ticket.4096) select @exec_bytes=convert(int.parse the EXEC select @rs_ticket=substring(@rs_ticket.parse the EXEC bytes select @rs_ticket=substring(@rs_ticket. charindex('='.4096) select @pos=charindex(".@rs_ticket)+1.4096) -.@pos+3.substring(@rs_ticket.@rs_ticket) select @head_4=substring(@rs_ticket.charindex("EXEC". @rs_ticket)-1)).parse the PDB select @rs_ticket=substring(@rs_ticket.'.4096) select @pos=charindex(".". @rs_ticket while (@@sqlstatus=0) begin -.'.parse out DIST if it exists.12)). @dist_ts=convert(time.@pos+1.@pos+3. @rs_ticket=substring(@rs_ticket.charindex('.parse out Heading 3 if it exists.charindex(')'.4096) select @pdb=convert(varchar(30). @ticket_date.1.charindex('. @rs_ticket=substring(@rs_ticket.@rs_ticket)+1.charindex(')'.charindex("PDB".charindex("H1".4096) end 81 . @rs_ticket=substring(@rs_ticket. ticket_payload from rs_ticket_history where ((@last_two_only = 0) or ((@last_two_only=1) and ((ticket_num=@last_row) or (ticket_num=@next_last))) ) for read only open rs_tkt_cursor fetch rs_tkt_cursor into @ticket_num. @rs_ticket=substring(@rs_ticket.'.10)). @pdb_ts=convert(time.@rs_ticket)+1.@rs_ticket)+1. ticket_date.1.substring(@rs_ticket.substring(@rs_ticket.@rs_ticket)+1. @exec_ts=convert(time.@rs_ticket)+1.4096) select @dist_spid=convert(int.4096) end -.substring(@rs_ticket.'.charindex('.@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket. @dist_ts=null.4096) select @pos=charindex(".1. @rs_ticket)+1. null.8).@ra_latency. [email protected]) select @rdb=convert(varchar(30).@exec_ts.@rs_ticket)-1)). dist_time=convert(varchar(15).parse the DIST if present fetch rs_tkt_cursor into @ticket_num.charindex('.@exec_spid. @exec_ts. head_3.head_4.exec_bytes.exec_spid. @pdb_ts.<section> ** section ::= <tagxxx>=<value> ** tag ::= V | H | PDB | EXEC | B | DIST | DSI | RDB | . @rdb_ts.@dsi_spid.charindex('='.parse the DSI select @rs_ticket=substring(@rs_ticket.rdb_ts. dist_spid.dsi_ts.9).rs_delay.rdb_time=convert(varchar(15).exec_ts.exec_ts.@tot_latency) -.4096) -. exec_time=convert(varchar(15).8).dsi_ts. pdb_time=convert(varchar(15).10)). @rs_ticket=substring(@rs_ticket. head_1.9).substring(@rs_ticket. null. @dsi_ts=convert(time.charindex("RDB".@head_2.@head_4. @ra_latency output exec sp_time_diff @exec_ts.dist_ts.1.@rs_latency.substring(@rs_ticket.@dist_ts.. 
rs_delay=convert(varchar(15). @dist_spid.charindex("DSI".@rs_ticket)+4.substring(@rs_ticket.dsi_spid.substring(@rs_ticket. @rdb.1.pdb. exec_delay=convert(varchar(15).charindex(')'.@rs_ticket)-1)).9).@rs_ticket)+4.0. tot_delay=convert(varchar(15). ** Version value ::= integer ** Header value ::= string of varchar(10) ** DB value ::= database name ** Byte value ::= integer 82 .charindex(')'.tot_delay.exec_delay..@dsi_ts.12)) -. @rs_ticket end close rs_tkt_cursor deallocate cursor rs_tkt_cursor select ticket_num.rdb_ts.@rs_ticket)+1. head_2. @rs_latency output exec sp_time_diff @pdb_ts.pdb_ts.@head_3. null The full "ticket" when built and inserted into the replicate database may look like the following: ** rs_ticket parameter Canonical Form ** rs_ticket_param ::= <section> | <rs_ticket_param>.parse the RDB select @rs_ticket=substring(@rs_ticket. pdb_ts. @tot_latency output insert into #tickets (ticket_num.9). @ticket_date.1 -.head_1.head_2. rdb.@exec_bytes. dsi_time=convert(varchar(15).calculate horizontal latency exec sp_time_diff @pdb_ts.@rdb_ts.charindex('='.head_3.@head_1.tot_delay) values (@ticket_num. @rdb_ts=convert(time.4096) select @dsi_spid=convert(int.8) from #tickets order by ticket_num drop table #tickets return 0 end go Executing rs_ticket Executing the rs_ticket proc is easy – it takes four optional parameters that become the headers for the ticket records: create procedure rs_ticket @head1 varchar(10) @head2 varchar(10) @head3 varchar(10) @head4 varchar(50) as begin … = = = = "ticket".rs_delay.exec_delay. exec_bytes.Final v2.'. @dsi_ts.@rs_ticket)+1. head_4.9).dist_ts. they can be off by seconds by the end of the day.486. RS & RDS hosts should be within 1 sec of each other. DIST(24)=21:25:29. For example. and DSI (Data Server Interface).Final v2. Currently.1 ** Time value ::= hh:mm:ss. trying to use a header such as “Bob’s Test” will fail whereas “Bobs Test” is fine.PDB(pdb1)=21:25:28.211.H1=start. The reason for this is that within the RRS DIST thread. print_rs_ticket -.310.EXEC(41)=21:25:28.846 The description is as follows: Tag V H1 H2 H3 H4 PDB EXEC B DIST DSI RDB Description Rs_ticket version Header #1 Header #2 Header #3 Header #4 Primary Database RepAgent User Thread Bytes Distributor Thread DSI Thread Replicate Database (parenthesis) n/a n/a n/a n/a n/a DB name EXEC RS spid EXEC RS spid DIST RS spid DSI RS spid RDB name Value 1 (current version of format) First header value Second header value Third header value Fourth header value Timestamp of PDB rs_ticket execution Timestamp processed by EXEC Bytes process by EXEC Timestamp processed by DIST Timestamp processed by DSI-S Timestamp of insert at RDB The “Header” values are optional values supplied by the user to help distinguish which rows bracket the timing interval. DIST will not send rs_ticket to DSI unless there is at least one subscription from replicate site Do not use apostrophe/single or double quotation marks within the headers.327. Considering that the parsing routines look for semi-colons. A sample execution might look like: exec rs_ticket “start” (run replication benchmarks.examples: 83 . you should avoid using semi-colons within the headers to avoid parsing problems. whatever) exec rs_ticket “stop” rs_ticket tips There are a couple of pointers about rs_ticket that should be discussed: • Synchronize the clocks in the ASE & RS hosts!!!! The PDS.0. Tracing can be enabled in the three modules that update the rs_ticket: EXEC (Rep Agent User). If using routes. 
Rs_ticket processing occurs prior to then in the DIST processing sequence. • • • • • • rs_ticket Trace Flags The rs_ticket can be printed into the Replication Server error log when tracing is enabled. only the PRS DIST timestamps the ticket. the RDB timestamp is the time of the parallel DSI execution – which may be in advance of other statements that will need to be committed ahead of it. DSI time includes RSI & RRS DIST.DSI(39)=21:25:29.B(41)=324.RDB(rdb1)=21:25:30. due the uptime or due to high clock drift. If using parallel DSI’s. only the MD module is executed. This means that the RDB time may be a few seconds off. DML.ddd V=1. [“EXEC” | “DIST” | “DSI”]. DIST (Distributor). The syntax for the trace command is: trace [ “on” | “off” ]. This may have to be repeated often . The DSI timestamp is the time that the DSI read the rs_ticket – which could be a few seconds before execution if there is a large DSI SQT cache.while some systems automatically sync the clocks during boot. DIST(24)=21:33:43.H1=start.RDB(rdb1)=21:34:20.5 minutes of processing to the overall problem.EXEC(41)=21:32:03. This is termed “pipeline delay” as it shows the latency between threads within the pipeline.PDB(pdb1)=21:25:28.103 Note the two highlighted timestamps for each row.200.DSI(39)=21:34:08. Taking one step further. there is a difference of ~6. that what is printed in to the errorlog is the contents of the ticket at that point .EXEC(41)=21:32:03.846 -.DIST(24)=21:25:29.327. Vertical Vertical calculations show the time it takes for a single thread to process all of the activity between the two timestamps. vertical and diagonal. consider the following rs_ticket output (from two executions): -. Horizontal Horizontal calculations refer to the difference in time between two threads in the same rs_ticket row.DSI(39)=21:25:29. print_rs_ticket trace “on”. print_rs_ticket Note. In the bottom example. The commands between the two RS tickets included a large transaction – which likely could delay the DIST receiving the commands as the SQT has to wait to see the commit record before even starting to pass the commands to the DIST (likelihood: 60%) The outbound queue SQM is overburdened for the associated device speed. we notice that the time between when the command was executed and when the RS received it from the RepAgent was nearly immediate in the top example.EXEC(41)=21:25:28.327.H1=stop. “DIST”.DIST(24)=21:25:29.200.486.103 By comparing the PDB timestamps between the two.beginning V=1.end V=1.32 3. With 6. we notice that the DIST thread adds about 1.406. This could be due to either a bulk operation (i.H1=start. the EXEC trace will only include the PDB and EXEC timestamp information.H1=stop.for example.0.B(41)=20534.RDB(rdb1)=21:34:20.310.PDB(pdb1)=21:25:28. it gets a bit tricky. Using the same output as above.simply invoke RS ticket and wait for the DSI trace record to appear in the errorlog. thus slowing the delivery rate of the DIST to the outbound queue (likelihood: 35%) 2.466. -. This technique can be extremely useful when running benchmarks or trying to see when a table is quiesced .B(41)=324.406.DSI(39)=21:25:29. If we subtract the two in the “beginning” row.5 minutes of latency within the RepAgent processing.1 trace “on”.211 . the RepAgent was running approximately 6.5 minutes as we noted earlier from the horizontal calculation. a slow inbound queue write speed or just poor configuration.211 .PDB(pdb1)=21:25:39.beginning V=1.5 minutes – thus showing that by the end of the sample period. “DSI”. 
Reviewing RS monitor counter data will help to determine the actual cause.e.32 3. we can notice that the DIST vertical calculation is ~8 minutes.DIST(24)=21:33:43.310.DSI(39)=21:34:08. however. This may be an indication of one of three possibilities (in order of likelihood): 1.PDB(pdb1)=21:25:39.B(41)=20534. Each of these are described in the following sections. attempting to tune the RS components will not achieve a significant improvement. This is termed “module time” as it shows how long a particular module was active.466.Final v2. If we subtract the two. this is a latency figure and does not imply that the module was completely consuming all cpu during that time – the delay may be been caused by a pipeline delay.EXEC(41)=21:25:28.B(41)=324.000 rows) that actually resulted in the RepAgent being behind temporarily.486.5 minutes behind transaction execution. there are three types of calculations that can be performed: horizontal. we notice that the total test time was approximately 11 seconds of execution time at the primary.end V=1. Now then.RDB(rdb1)=21:25:30. a single update that impacted 100. we will see a delay of ~6. Analyzing RS_Ticket When comparing RS tickets. For example. “EXEC”. consider the various threads. If we further look at the EXEC vertical calculation. 84 . print_rs_ticket trace “on”. Note.846 -. Overall end-to-end latency can be observed by comparing the PDB & RDB (blue highlighted value)values in the “end” row – which shows 9 minute latency overall.RDB(rdb1)=21:25:30. 32 3. As we noted earlier.beginning V=1.DIST(24)=21:33:43.EXEC(41)=21:32:03.5 minutes prior to the DSI receiving the last of the rows from the DIST.B(41)=324.EXEC(41)=21:25:28. the RepAgent latency was about 6.H1=start. in the above. Diagonal In the last example. this is important.PDB(pdb1)=21:25:28.486. This means that the DIST saving the data to the outbound queue and the DSI reading the commands from the outbound queue only added about 30 seconds to the overall processing.103 For example.327. we can determine which of these are applicable.PDB(pdb1)=21:25:39.5 minutes for a total of 8 minutes. 85 .Final v2. Due to insufficient STS cache.406. -. we came close to performing a diagonal calculation.200.RDB(rdb1)=21:25:30. As you can see.DSI(39)=21:25:29.RDB(rdb1)=21:34:20.846 -. the most useful aspect of diagonal calculations will be in determining the impact of the modules which we don’t have timestamps for – namely the SQM module(s).1 3. the DIST starts sending data to the DSI ~8.DIST(24)=21:25:29.211 .end V=1.466.310. the DIST had to resort to fetching repdef & subscription metadata from the RSSD (likelihood 5%) By analyzing RS monitor counters.H1=stop. A diagonal calculation is termed “cross module time” and refers to the latency that can be the result of waiting access to the thread (messages cached in thread queues). In this case.DSI(39)=21:34:08.0.5 minutes while the DIST processing added 1.B(41)=20534. . 87 . Such commands will often refer to this RS thread module as EXEC.0. Column mapping needs to be performed for any renamed columns. consequently it is now included. However. Consequently. however. 
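If you have implemented the rs_ticket_history table and parse_rs_tickets procedure shown earlier, these horizontal and vertical calculations do not have to be done by hand. A typical bracketing session might look like the following sketch – passing 1 for the @last_two_only parameter restricts the report to the most recent start/stop pair:

-- at the primary database
exec rs_ticket "start"
go
-- ... run the workload or benchmark being measured ...
exec rs_ticket "stop"
go
-- at the replicate database, once both tickets have been replicated
exec parse_rs_tickets 1
go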
An important distinction – and one than many do not understand – is that the inbound threads used for a replication from a source to a primary belong to a different connection than the outbound group of threads.1 Inbound Processing What comes in… Earlier we took a look at the internal Replication Server threads in a drawing similar to the following: Figure 15 – Replication Server Internals: Inbound and Outbound Processing In the above copy of the diagram. In previous versions of this document. normalization includes: a. followed by the LTM User thread. the same set of inbound threads are used to deliver the data to all of the various sets of outbound threads for each connection. Replication Server will determine which type of thread each Executor is simply by the “connect source” command that is sent. and those excluded from the repdef need to be excluded from the queue. In the sections below. parses and normalizes the LTL and then packs it into binary format and then passes it to the SQM to be written to disk. RepAgent User (Executor) The RepAgent User thread has been named various things during the Replication Server’s lifetime. and lastly. Columns in the LTL stream need to be matched with those in the repdef. the RepAgent User thread. many of the trace flags and configuration commands are specified at the “Executor” thread generically and affect both threads. b. note that the threads have been divided into inbound and outbound processing along the dashed line from the upper-left to lower-right. The reason for this is that there actually are two different types of Executor threads – LTM-User for Replication Agents and RSI-User for Replication Server connections. as multiple destinations are added. some additional tuning parameters were added specifically for it. It simply receives LTL from the Replication Agent. we will simply be discussing the LTM-User or RepAgent User type of Executor thread. RepAgent User Thread Processing The executor thread’s processing is extremely simple. Parse LTL received from Rep Agent Normalize LTL – this involves comparing columns and datatypes in LTL to those in replication definition. 2. the RepAgent User thread was not discussed. For this module. with RS 12.Final v2. An extremely important part and fairly cpu intensive. we will be addressing the three main threads in the inbound processing within Replication Server. The full explanation of these steps can be viewed as follows: 1. It originally started as the Executor thread.1. Accordingly. The reason for this behavior is that the RepAgent User Thread often has very little work to do – as will be illustrated in the section on the monitor counters later. non-key columns eliminated from the stream.either the RepAgent User thread or the Distributor’s MD module. Once this limit has been reached. Periodically. For Rep Agent User threads. Consequently. Duplicate detection (OQID comparison) needs to be done to ensure that duplicate records are not written to the queue. you can control this by adjusting the parameter exec_cmds_per_timeslice (RS 12. this was set to a single block or 16K with a maximum of 60 blocks or 983040 bytes (2GB now in RS 12. f. this cache limit is controlled by the exec_sqm_write_request limit.0). d. While lowering it may have some impact.1 was that writers to the SQM could cache pending writes in the respective writer’s cache . update rs_oqids & rs_locater table in the RSSD with the latest OQID to ensure recovery. Packs commands in binary form and places on the SQM’s queue.1+). 
If the SQM’s pending writes are greater than exec_sqm_write_request_limit (RS 12. the Rep Agent User thread is put to sleep. updates need to be translated into separate delete followed by inserts. g. 4. By default.0. the EXEC thread needs to put multiple rows in the inbound queue. While it is true that Open Server messages are used to prevent it from being completely synchronous. but also (in effect) that it was written to disk as the RepAgent User thread does not acknowledge to the Replication Agent that the transfer was complete until then. e. If more than one replication definition is available. at the end of each transfer. the Replication Agent waits an acknowledgement not only that the LTL was received. further attempts to insert write requests on the SQM Write Request queue will be rejected and Rep Agent User Thread put to sleep. the simple fact is that each transfer from the Replication Agent must be written to disk. This may seem duplicative given the scan_batch_size and 88 .Final v2. c. Minimal column comparisons need to be performed and unchanged. 3.1 Multiple repdefs – if more than one repdef exists for the object. one command for each will be written to the queue. This is illustrated in the following diagram: Figure 16 – Rep Agent User Thread Processing A key feature added in RS 12. raising it frequently has little impact.1+) which controls how often the Rep Agent User thread will yield the cpu. If autocorrection is enabled for the particular repdef. Primary key columns need to be located as they are stored separately in the row to speed SQL generation at the DSI. the small buffer size (exec_sqm_write_request_limit by default is 16K) in the Rep Agent User thread essentially required a flush to disk. The parsing and normalization process can be fairly cpu intensive and is essentially synchronous in processing transactions from the Replication Agent all the way to the SQM.6 ESD 7 and RS 15. Recommendation: 20) RS 12. You can set exec_cmds_per_timeslice for all Replication Server Executor threads using configure replication server or for a particular connection using configure connection. the max has been increased to 2GB. Controls the amount of memory available to an LTI or RepAgent Executor thread for messages waiting in the inbound queue before the SQM writes them out. Recommendation for these versions is 2-4MB 12. the more work the Executor thread can perform before it must sleep until memory is released. With RS 12. RepAgent User Thread Monitor Counters The full list of RS 12. Max: 983040 (60 SQM blocks). You can set exec_sqm_write_request_limit for the Replication Server using configure replication server. On the other hand. the parsing and normalization process can be CPU intensive. several tuning configuration parameters were added: Parameter exec_cmds_per_timeslice (Default: 5.which should take less than a second to save to the inbound queue.040 bytes of exec_sqm_write_request_limit is likely less than 1.e. it may be robbing CPU time from the DIST or DSI threads. 89 .0 and prior. nor tuning configurations. ensuring that the setting is an even number of SQM blocks (i. ensuring that the LTL is written to disk for each transfer means a faster recovery RepAgent User Tuning Unlike the SQM.6. As a result.0. 
there were no specific commands to analyze the performance of the executor thread.6 counters are: Display Name CmdsTotal Explanation Total commands received by a Rep Agent thread.1 Explanation Specifies the number of LTL commands an LTI or RepAgent Executor thread can process before it must yield the CPU to other threads. The only downside to increasing the exec_sqm_write_request_limit is that if the RepAgent connection fails and the RepAgent tries to reconnect. it will not be able to until the full cache of write requests have been saved to the inbound queue. the exec_cmds_per_timeslice is a bit more difficult. The larger the value you assign to exec_sqm_write_request_limit. Max: 2147483648.1 secondary truncation point movement.1. In RS 12. Given that the average production system table is likely 1KB per row or more as formatted by the RS. you may want to raise exec_cmds_per_timeslice. The secondary truncation point and OQID synchronization take more work as the RSSD update is involved and a specific log page correlation is made. If the amount of memory allocated by the LTI or RepAgent Executor thread exceeds the configured pool value. a multiple of 16384) to ensure that memory is effectively utilized.0 ESD #1.1 Setting exec_sqm_write_request_limit is easy – set it to the maximum that memory will allow. Don’t use multiple repdefs for high volume tables unless absolutely necessary (doubles I/O) Do not leave autocorrection on any longer than necessary (doubles I/O for insert and update statements) RepAgent User Thread Counters In RS 12. exec_sqm_write_request_limit (Default/Min: 16384 (1 SQM block). if it should appear that data is backing up in the inbound queue and all applicable SQT tuning (below) has been performed. and frees memory in the pool. However. a full 983. it is not quite. On the other hand. since it may always have work to do in a high volume situation.000 replicated commands . Consider the following: • • • Create repdefs in the same column order as the table definition (speeds normalization). you may want to lower this number. Recommendation: 983040) Note in 12. SQT threads. there are a few implementation considerations that can also improve performance. the thread sleeps until the SQM writes some of its messages.Final v2. Min: 1. but in a sense.1. or if the DSI connections show a lot of “awaiting command” at the replicate (taking into account the dsi_serialization_method as discussed in the section on Parallel DSI). in all likelihood. Consequently. As mentioned earlier. 8 additional counters were added and some of the original counters were renamed for clarity. if the Replication Agent is getting behind (a much more normal problem). several counters specifically for the RepAgent User thread were added. in RS 12.6 ESD #7 and 15. Given that LTL could exceed the log page or due to text/image replication. In-coming connection packet size. For packet size. Total number of times a RepAgent Executor thread yielded it's time on the processor while handling LTL commands. Lang 'chunk' size is fixed at 255 bytes. Enable marker is sent by executing the rs_marker stored procedure at the active DB. or language 'chunks' when not in 'passthru' mode. Total request commands written into an inbound queue by a Rep Agent thread. The average amount of time the RepAgent spent yielding the processor while handling LTL commands each time the processor was yielded. Total CHECKPOINT records processed by a Rep Agent thread. see counter 'PacketSize'. These are 'forced' EOM's. 
SAVEXACT records processed by a Rep Agent thread). Total Repserver system commands written into an inbound queue by a Rep Agent thread. Total number of times a RepAgent Executor thread had to wait for the SQM Writer to drain the outstanding write requests below the threshold. Total updates to RSSD. Later releases will allow you to change the packet size.rs_locater where type = 'e' executed by a Rep Agent thread. Total number of protocol packets rcvd by a Rep Agent thread when in passthru mode. RepServer receives chunks of lang data at a time. Total bytes received by a Rep Agent thread. See counter 'PacketsReceived' for these numbers. See counter 'PacketsReceived' for these numbers. CHECKPOINT instructs Repserver to purge to a specific OQIQ value.1 Display Name CmdsApplied CmdsRequest CmdsSystem CmdsMiniAbort Explanation Total applied commands written into an inbound queue by a Rep Agent thread. Total enable replication markers written into an inbound queue by a Rep Agent thread. Total Repserver SQLDDL commands written into an inbound queue by a Rep Agent thread.0 or earlier versions used a hard coded 2K packet size. Buffers are broken into packets when in 'passthru' mode. Request Commands are applied as the executing request user. CmdsDumpLoadDB CmdsPurgeOpen CmdsRouteRCL CmdsEnRepMarker UpdsRslocater PacketsReceived BytesReceived PacketSize BuffersReceived EmptyPackets RAYields RAYieldTimeAve (intrusive) RAWriteWaits RAWriteWaitsTimeAve (intrusive) CmdsSQLDDL 90 . Route requests are issued by RS user. The average amount of time the RepAgent spent waiting for the SQM Writer thread to drain the number of outstanding write requests to get the number of outstanding bytes to be written under the threshold. RepAgent/ASE 12. SYNCLDDB records processed by a Rep Agent thread. and alter route requests written into an inbound queue by a Rep Agent thread.Final v2. Total number of command buffers received by a RepAgent thread. Applied Commands are applied as the maintenance user.. This size includes the TDS header size when in 'passthru' mode. When not in passthru mode. SYNCDPDB records and 'load database log' (in ASE. Mini-abort instructs Repserver to rollback commands to a specific OQIQ value. drop.0. Total 'mini-abort' commands (in ASE. Total 'dump database log' (in ASE. Total create. Total number of empty packets received in 'passthru' mode by a Rep Agent thread. Lang 'chunk' size is fixed at 255 bytes. Mini-abort instructs Repserver to rollback commands to a specific OQIQ value. Enable marker is sent by executing the rs_marker stored procedure at the active DB. For packet size. Create.1 Display Name RSTicket Explanation Total rs_ticket markers processed by a Rep Agent's executor thread. Number of command buffers received by a RepAgent thread. Updates to RSSD. or language 'chunks' when not in 'passthru' mode. Replication Server 15. and alter route requests written into an inbound queue by a Rep Agent thread. This size includes the TDS header size when in 'passthru' mode. The amount of time the RepAgent spent waiting for the SQM Writer thread to drain the number of outstanding write requests to get the number of outstanding bytes to be written under the threshold. These are 'forced' EOM's. When not in passthru mode. In-coming connection packet size. Enable replication markers written into an inbound queue by a Rep Agent thread. Route requests are issued by RS user. see counter 'PacketSize'. RepAgent/ASE 12. 'mini-abort' commands (in ASE.0 or earlier versions used a hard coded 2K packet size. 
SAVEXACT records) processed by a Rep Agent thread.0. For a typical source database. Buffers are broken into packets when in 'passthru' mode. SYNCDPDB records) and 'load database log' (in ASE. drop.. CHECKPOINT instructs Repserver to purge to a specific OQIQ value. RepServer receives chunks of lang data at a time. Number of protocol packets rcvd by a Rep Agent thread when in passthru mode. See counter 'PacketsReceived' for these numbers. Request commands written into an inbound queue by a Rep Agent thread. Request Commands are applied as the executing request user. SYNCLDDB records) processed by a Rep Agent thread. Applied commands written into an inbound queue by a Rep Agent thread. The amount of time the RepAgent spent yielding the processor while handling LTL commands each time the processor was yielded. Applied Commands are applied as the maintenance user.Final v2. Number of empty packets received in 'passthru' mode by a Rep Agent thread. the highlighted counters are the ones to watch. Later releases will allow you to change the packet size. Bytes received by a Rep Agent thread.0 had a few differences and added a few counters: Display Name CmdsRecv CmdsApplied CmdsRequest CmdsSystem CmdsMiniAbort Explanation Commands received by a Rep Agent thread. CmdsDumpLoadDB CmdsPurgeOpen CmdsRouteRCL CmdsEnRepMarker UpdsRslocater PacketsReceived BytesReceived PacketSize BuffersReceived EmptyPackets RAYieldTime RAWriteWaitsTime CmdsSQLDDL 91 . CHECKPOINT records processed by a Rep Agent thread. See counter 'PacketsReceived' for these numbers.rs_locater where type = 'e' executed by a Rep Agent thread. RepServer SQLDDL commands written into an inbound queue by a Rep Agent thread. Repserver system commands written into an inbound queue by a Rep Agent thread. 'dump database log' (in ASE. Note that as of MRA 12. Note also that counters RAYields and RAWriteWait appear to have been removed . Consider the following list (note that most are derived by combining more than one counter): CmdsPerSec = CmdsTotal/seconds CmdsPerPacket = CmdsTotal/PacketsReceived CmdsPerBuffer = CmdsTotal/BuffersReceived (Mirror Rep Agent & Heterogeneous Rep Agents) PacketsPerBuffer = PacketsReceived/BuffersReceived (Mirror Rep Agent & Hetero Rep Agents).6. then increasing the scan batch size to drive UpdsRslocaterPerMin towards 1 (likely impossible to get there) is the goal.0. “Avg” and other aggregate suffixes (and counters) have been removed as these are available from the counter_total. you will still see 10 or more updates per minute – which means recovery is only affected by a few seconds. setting scan_batch_size to really high values can be detrimental on low volume systems.which artificially lowers the CmdsPerPacket ratio to considerably less than 1. This is a factor of how much cache is available (exec_sqm_write_request_limit) as well as the values for init_sqm_write_delay/init_sqm_write_max_delay. CmdsPerPacket is an interesting statistic.using tens of packets per buffer .000. However. in 100ths of a second. Minimally. However.for a ratio of 1. counter_last and counter_avg=counter_total/counter_obs columns in the rs_statdetail table for RS 15. the MRA appears to be a bit “chatty” . this does relate to recovery speed of ASE – but think about it. For the MRA and heterogeneous replication agents. you will find out that even if you set scan batch size to 20. Yes. Increasing the Rep Agent packet size by changing the ASE rep agent ‘send buffer size’ configuration parameter helps this out tremendously. First. 
The amount of time. the number of yields per second gives a good indication of how much or how little cpu time the RA User thread is getting. However. raising the MRA rs_packet_size to 8192 or 16384 is suggested. If during peak processing. etc. you would expect the PacketsPerBuffer to be the same (and they are) . For ASE Rep Agent Thread. Obviously. you may look at these two counters and determine if tuning them is appropriate. For example. UpdsRslocaterPerMin = UpdsRslocater/minutes ScanBatchSize = CmdsTotal/UpdsRslocater RAYieldsPerSec = RAYields/seconds RA_ECTS = CmdsTotal/RAYields RAWriteWaits The first one (CmdsPerSec) should be fairly obvious – we are getting a normalized rate that we can use to track the throughput into RS.which may be surprising considering the relative importance of them. the goal would be to increase the number of commands processed during a given period – assuming the commands are equal and transaction rate the same. when compared with the number of commands received (via RA_ECTS). Secondly. waiting on writes. UpdsRslocaterPerMin and ScanBatchSize work together to identify when the Rep Agent scan batch size configuration should be adjusted. on really busy systems.1 Display Name RSTicket RepAgentRecvPcktTime Explanation rs_ticket markers processed by a Rep Agent's executor thread. both counters can be obtained as the number of observations for RAYieldTime and RAWriteWaitTime (counter_obs). One would suspect this to be fairly high. the MRA has a default ltl_batch_size of 40.000 bytes and a default rs_packet_size of 2048. 92 . This can be interesting to use to determine how busy the RepAgent is on network processing time vs. we can see how the configuration parameter exec_cmds_per_timeslice (aka ECTS) is helping or hurting us. counter_max. spent receiving network packets.0. you likely have scan_batch_size set too high. you don’t see any updates to the rs_locater within 2-3 minutes. Note that heterogeneous replication agents and the Mirror Replication Agent (MRA) all use the concept of an LTL buffer that is different in size than the packet size. Is the difference of 1 minute really a big problem?? If not. most production sites find themselves only processing 2-3 commands per packet – and since this includes begin/commit commands. A good goal to have is to get 8-10 commands per packet – but what good is that goal if the default exec_cmds_per_timeslice is still at 5 – which means that part way through processing the packet. RAYields is the number of times the RS RA User thread yielded the cpu to another module – and is very interesting. There is one new counter added . since the packet and buffer size are the same. RA thread yields the cpu?? However. the one that is most interesting is RAWriteWaits – it signals how often the RA thread had to wait when writing to the inbound queue. The RA thread has a number of counters that are of special interest to us and can help us try to improve this rate.Final v2. Note that the “Total”. but most often with the default 2K packet size and fairly large table sizes (when column names are included). really identifies the first bottleneck.the last one in the list: RepAgentRecvPcktTime. Final v2.0.1 RepAgent User Thread Counter Usage Perhaps the best way to use the counters is to look at them in terms of progression of the data from the source DB to the next thread (SQM). Consider the following sequence for RS 12.6: 1. RepAgent User Thread receives a batch of LTL from the RS. 
Each LTL batch is a single LTL buffer that is sent using one or more packets to the RS. This causes the “network” counters BuffersReceived, PacketsReceived, BytesReceived, EmptyPackets to be incremented The RepAgent User Thread then parses the commands out of the buffer and the commands are evaluated for type (i.e. is it a DML command the RepAgent has to pass to SQM or is a locater request). This updates the various “Cmd” counters such as CmdsTotal, CmdsApplied, CmdsRequest, CmdsSystem, CmdsMiniAbort, CmdsDumpLoadDB, CmdsPurgeOpen, CmdsRouteRCL, CmdsEnRepMarker, CmdsSQLDDL to be incremented accordingly. Depending on the command, what happens next: a. In normal operations, it is likely that the command was a DML, DDL or system statement (miniAbort, Dump/Load, PurgeOpen, Route RCL, Enable Replication marker (rs_marker)). If so, a write request is issued to the SQM (assuming num_messages or exec_sqm_write_request_limit hasn’t been reached) and processing continues. b. If the command was a request for a new locator, the RepAgent determines which record was the last written to disk and updates the RSSD locater appropriately. This also increments the UpdsRslocater counter. c. The command could be one of several different commands that the RepAgent User Thread needs to pass to other threads. For example, if a checkpoint record was received, in addition to the incrementing of the CmdsPurgeOpen, the RA User Thread coordinates with the inbound SQM to purge all the open transactions to that point (this happens during ASE database recovery). Similar behaviors for MiniAborts, Dump/Loads, etc. d. If the command was an Enable Replication Marker (rs_marker), then the Rep Agent coordinates setting the replication definition to the marker state (i.e. valid). e. If the command was an rs_ticket (a form of rs_marker), the RepAgent User Thread appends it’s timestamp info along with byte counts and process id unto the rs_ticket record and sends it through to the SQM. This also updates the RSTicket counter. Periodically, of course, the RepAgent User Thread will need to yield the CPU. This can happen for several reasons, but in each case, if intrusive counters are enabled, the counters RAYields and RAYieldTimeAve are incremented. The types of yields include: a. The number of cmds processed has exceeded the exec_cmds_per_timeslice. b. As mentioned in 3(a), the exec_sqm_write_request_limit has been reached – at which point the SQM won’t accept anymore write requests, the counters RAWriteWaits and RAWriteWaitsTimeAve are incremented. c. RS scheduler driven yield – which is why setting exec_cmds_per_timeslice high may be of no effect as the RS may still slice out the RA User Thread to provide time for the other threads to run. 2. 3. 4. From this point processing is handed off to the SQM. Let’s take a look at some sample data. Note: in each section, the first set of data will be from real customer data and the second set will be from a wide row (30+ columns) insert speed test. 
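Before looking at the sample data, note that the truncation point handshake in step 3(b) can also be observed directly: the rs_locater row of type 'e' in the RSSD holds the last OQID acknowledged back to the Replication Agent. A minimal sketch, run against the RSSD (output columns vary slightly by version):

select * from rs_locater where type = 'e'
go

Watching how often this row changes gives essentially the same information as the UpdsRslocater counter discussed above.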
For the first consideration, let's look at the efficiency of the network processing between the RepAgent and the RepAgent User Thread for the customer data set:

Sample Time  Packets Received  CmdsTotal  Cmds/Pckt (derived)  Cmds/Sec (derived)  UpdsRslocater  Scan_batch_size (derived)  UpdsRslocater/Min (derived)
0:29:33      79,356            267,882    3.3                  889                 268            999.5                      53
0:34:34      93,852            364,632    3.8                  1,207               365            998.9                      72
0:39:37      71,669            253,283    3.5                  841                 254            997.1                      50
0:44:38      63,173            266,288    4.2                  881                 266            1,001.0                    52
0:49:40      63,086            253,531    4.0                  839                 253            1,002.0                    50
0:54:43      56,570            164,249    2.9                  545                 164            1,001.5                    32
0:59:45      108,667           375,512    3.4                  1,243               375            1,001.3                    74
1:04:47      101,507           450,749    4.4                  1,492               451            999.4                      89
1:09:50      92,022            326,619    3.5                  1,085               327            998.8                      65
1:14:52      81,852            325,148    3.9                  1,076               326            997.3                      64
1:19:54      78,507            317,559    4.0                  1,055               317            1,001.7                    63

As you can see from the derived columns above, sometimes the most useful information from the monitor counters is obtained by comparing two of them. Let's explore some of these:

Cmds/Pckt - derived by dividing CmdsTotal by PacketsReceived. In this case we are seeing about 3 commands per packet. You have to admit, processing 3 commands per packet does not represent a lot of work, nor is it very efficient. This system would likely benefit from raising the RepAgent configuration ltl_buffer_size, which controls the packet size sent to Replication Server.

Cmds/Sec - derived by dividing CmdsTotal by the number of seconds between samples (rs_statrun). Note that this is an average - in other words, during the ~5 minute intervals, there may have been higher spikes and lulls in activity. However, it does show that the Replication Agent is feeding roughly 1,000 commands per second to the Replication Server. To sustain this without latency, we will need to ensure that each part of Replication Server can also sustain this rate.

Scan_batch_size - derived by dividing CmdsTotal by UpdsRslocater to get a representative number of commands sent to RS before the Replication Agent asks for a new truncation point. While this is an average, it does provide insight into the probable setting for the Replication Agent scan_batch_size - which in this case is likely set to 1,000. To see the effect of this, consider the next metric.

UpdsRslocater/Min - derived by dividing UpdsRslocater by the number of minutes between samples. This metric represents the SQL activity RS inflicts on the RSSD just to keep up with the truncation point. As you can see, it is updating the RSSD practically once per second. Again, this corresponds to the Replication Agent scan_batch_size configuration parameter. Some DBA's are reluctant to raise this for fear of the extra log space that may impact recovery times, etc. But if you think about it, in its current state, I am moving the secondary truncation point every second - a bit of overkill. Increasing this to 10,000 would reduce the RSSD overhead considerably while reducing the secondary truncation point movement to every 10 seconds or so - certainly not a huge impact on the transaction log.

Now, let's look at a test system in which a small desktop system was stressed by doing a high rate of inserts on wide rows (32 columns). Ideally, we would like to compare to the same system after Replication Agent configuration values have been changed; however, this was not possible to obtain from the customer. So while not a true apples-to-apples comparison, it will be useful to compare the counter behavior. The Replication Agent configuration differences are: ltl_buffer_size=8192; scan_batch_size=20,000.
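For reference, a sketch of how the equivalent settings are applied to an ASE RepAgent (the database name pdb1 is illustrative; on ASE the parameter that corresponds to the LTL packet/buffer size is 'send buffer size', and some RepAgent parameters only take effect after the RepAgent thread is restarted):

sp_config_rep_agent pdb1, 'scan batch size', '20000'
go
sp_config_rep_agent pdb1, 'send buffer size', '8192'
go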
Using the same metrics from above, we see: 94 Scan_batch_size (derived) Upds Rslocater Sample Time CmdsTotal Cmds/Pckt (derived) Final v2.0.1 Packets Received Scan_batch_size (derived) 11:37:57 11:38:08 11:38:19 11:38:30 11:38:41 149 1,096 637 2,865 78 1,027 7,781 4,512 20,322 553 6.8 7 7 7 7 93 778 410 2,032 50 0 0 0 1 1 0 0 0 20,322 553 UpdsRslocator/ Min (derived) 0 0 0 6 5 WriteWait% (derived) 11.9 9.7 8.0 5.6 8.1 13.7 10.2 7.3 8.6 6.8 4.7 To see how these differences impact the system, let’s take a look at the CPU and write wait metrics from the RepAgent User Thread perspective – again looking at the customer system first: WriteRequessts (SQM) RAWrite Waits 32,040 35,479 20,243 14,859 20,673 22,528 38,279 32,790 28,127 22,201 14,817 Sample Time CmdsTotal 0:29:33 0:34:34 0:39:37 0:44:38 0:49:40 0:54:43 0:59:45 1:04:47 1:09:50 1:14:52 1:19:54 79,356 93,852 71,669 63,173 63,086 56,570 108,667 101,507 92,022 81,852 78,507 267,882 364,632 253,283 266,288 253,531 164,249 375,512 450,749 326,619 325,148 317,559 42,984 58,811 36,820 39,084 39,804 25,347 59,447 72,149 45,778 47,273 39,971 RA ECTS (derived) RAYields Packets Received 6 6 6 6 6 6 6 6 7 6 7 268,187 364,705 253,283 266,334 253,684 164,566 376,184 450,809 326,750 325,340 317,674 Note that some of the columns are repeated for clarity - again we have some derived statistics. RA ECTS – derived from dividing CmdsTotal by RAYields. This compares to the exec_cmds_per_timeslice configuration parameter, which has a default of 5. Note that in this case, using the default exec_cmds_per_timeslice, we are getting about 6 commands processed before the RA User thread slices. It may be that the exec_cmds_per_timeslice may be affecting the system since we are so close to the default or it may be just the thread scheduling. WriteWait% - derived by dividing the SQM counter WriteRequests by the RAWriteWaits. This is partially due to the fact we have a default exec_sqm_write_request_limit of 16384 (1 block). Some of these waits are undoubtedly influencing the RA User Thread time slices Now, let’s look at the insert stress test. For this system, exec_cmds_per_timeslice is set to 20, exec_sqm_write_request_limit is set to 983040 (the max) – other than the Rep Agent configurations mentioned earlier, no other tuning was done to the Rep Agent User configurations Upds Rslocater Sample Time CmdsTotal Cmds/Pckt (derived) Cmds/Sec (derived) 95 Final v2.0.1 WriteRequess ts (SQM) Sample Time 11:37:57 11:38:08 11:38:19 11:38:30 11:38:41 149 1,096 637 2,865 78 1,027 7,781 4,512 20,322 553 34 264 156 748 22 30 29 28 27 25 1,027 7,788 4,512 20,336 553 0 0 0 0 0 As you can see, the SQM WriteRequests are much lower, so that may be why there are no RAWriteWaits – however, maxing the sqm_write_request_limit may have helped as well. The interesting thing is that the average RA ECTS (derived by dividing the CmdsTotal by RAYields again) shows considerably higher than the configuration value suggesting that the raising the exec_cmds_per_timeslice may be a limit when less than the default, but when cpu time is available, the Rep Agent User can exceed the default cap. This suggests from the customer viewpoint above, raising the exec_cmds_per_timeslice – while a suggestion – may not help. However, some customers have reported benefits when exec_cmds_per_timeslice is set as high as 100 – unknown if these were non-SMP systems, which could influence the behavior. Either the write waits or other cpu demands are causing the RA User thread to timeslice. 
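Pulling the two samples together, a sketch of the Replication Server side of that tuning (the connection name PDS.pdb1 is illustrative; per the earlier discussion, exec_cmds_per_timeslice can be set per connection while exec_sqm_write_request_limit is set with configure replication server):

configure connection to PDS.pdb1 set exec_cmds_per_timeslice to '20'
go
configure replication server set exec_sqm_write_request_limit to '983040'
go

As the counter data suggests, raising exec_sqm_write_request_limit pays off mainly when RAWriteWaits are non-zero, while exec_cmds_per_timeslice acts more as a scheduling hint than a guarantee of extra CPU time.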
RepAgent User/EXEC Traces There are a number of trace flags that can be used to diagnose RepAgent and or inbound SQM related performance issues. Module EXEC EXEC EXEC EXEC EXEC Trace Flag EXEC_CONNECTIONS EXEC_TRACE_COMMANDS EXEC_IGNORE_PAK_LTL EXEC_IGNORE_NRM_LTL EXEC_IGNORE_PRS_LTL Diag Description Traces LTM/Rep Agent connections Traces LTL commands received by EXEC RS behaves as data sink Ignores Normalization in the LTL Ignores Parsing of LTL commands Note that each of the above requires use of the diag binary for Replication Server. As a result, it should only be used in a debugging environment as the extra diagnostic code will have an impact on performance and log output (which can slow down the system). Some of the more useful traces are described below. For best understanding, refer back to the earlier illustration (pg 76) at the modules the EXEC thread performs. EXEC_CONNECTIONS If the RepAgent is having problems connecting to the RS, this trace can be useful to determine if the correct password is being used, etc. The output in the errorlog is the RepAgent user login followed by the password – which can be compared to the RSSD values. Care should be taken as the password will be output into the errorlog in clear text – you will probably want to change the errorlog location for any diagnostic binary boot just due to the volume of output. If so, you will want to delete it if you use this trace to avoid having passwords exposed. EXEC_IGNORE_PAK_LTL (WARNING: Results in data loss). At first glance, this seems misnamed, however, realizing that the step immediately prior to the RepAgent user thread passing the LTL to the SQM is packing it into packed binary format. Consequently, by enabling this traceflag, the LTL output will not be written to the inbound queue – however, the RepAgent user thread will still parse and normalize the LTL stream. This can be useful for eliminating SQM performance issues when debugging RepAgent performance problems (especially when the waits on CT-Lib are high). 96 WriteWait% (derived) 0.00 0.00 0.00 0.00 0.00 CmdsTotal RA ECTS (derived) RAYields RAWrite Waits Packets Received Final v2.0.1 EXEC_IGNORE_NRM_LTL (WARNING: Results in data loss). This trace flag disable the normalization step with-in the RepAgent user thread. If you are positive that the replication definitions precisely match the table’s ordinal column definition, disabling this can be done without exec_ignore_pak_ltl. However, it is most useful in continuing to “step backward” to isolate RepAgent performance problems. By first disabling writes to the queue via exec_ignore_pak_ltl and then disabling normalization, you have eliminated the SQM and any normalization overhead (such as checking replication definitions from RSSD) from the RepAgent LTL transmit sequence. EXEC_IGNORE_PRS_LTL (WARNING: Results in data loss). This traceflag disables parsing the LTL commands received by the RepAgent user thread. When used with exec_ignore_pak_ltl and exec_ignore_nrm_ltl, the RepAgent user effectively is throwing the data away without even looking at it. Any RepAgent performance issues that are network oriented that remain at this point are likely caused by network contention within ASE, the host machine(s), or the OCS protocol stack within the RS binary. SQM Processing The Stable Queue Manager (SQM) is the only module that interacts with the stable queue. As a result, it performs all logical I/O to the stable queue and as one would suspect is then one of the focus points for performance discussions. 
However, SQM code is present in both the SQM and SQT on the inbound side of the connection, and in the SQM and DSI for the outbound (and Warm Standby) side of a connection. It is best to get a better understanding of the SQM module to better see that, in itself, the SQM thread may not be contributing to slow downs in inbound queue processing. The SQM is responsible for the following:

Queue I/O - All reads, writes, deletes and queue dumps from the stable queue. Reads are typically done by an SQM Reader (SQT or DSI) using SQM module code, while the SQM is responsible for all write activity.

Duplicate Detection - Compares OQID's from LTL to determine if an LTL log row is a duplicate of one already received.

Features of the SQM thread include support for:

Multiple Writers - While not as apparent in inbound processing, if the SQM is handling outbound processing, multiple sources could be replicating to the same destination (i.e. a corporate rollup).

Multiple Readers - More a function of inbound processing, an SQM can support multiple threads reading from the inbound queue. This includes user connections and Warm Standby DSI threads along with normal data distribution.

For the purpose of this discussion, we will be focusing strictly on the SQM thread, which does the writing to the queue. The SQM write processing logic is similar to the following:

1. Waits for a message to be placed on the write queue
2. Flushes the current block to disk if:
   a. Message on queue is a flush request
   b. Message on queue is a timer pop AND there is a queue reader present
   c. Message on queue is a timer pop AND the current wait time exceeds "init_sqm_write_max_delay"
   d. The current block is full
3. Adds message to current block

The flushing logic (where the physical I/O actually occurs) is performed in the following steps:

1. Attempts platform-specific async write
2. If retry indicated, yields then tries again
3. Once the write request is successfully posted, places write result control block on AIO Result daemon message queue and sleeps
4. Expects to be awakened by AIO Result daemon when that thread processes this one's async write result
5. Awakens any SQM Read client threads waiting for a block to be written

It is important to note the distinction - the SQM actually writes the block to disk and then simply tells the dAIO thread to monitor for that I/O completion. The dAIO detects the completion by using standard asynchronous I/O polling techniques and, when the I/O has completed, wakes up the SQM, which can then update the RSSD with the last OQID in the block that was written. This ensures system recoverability as it is this OQID that is returned to the RepAgent when a new truncation point is requested (as described earlier). This is illustrated as follows:

Figure 17 - SQM Thread Processing

SQM Performance Analysis

One of the best and most frequent commands for SQM analysis is the admin who, sqm command (sample output below extracted from the Replication Server Reference Guide):
admin who, sqm Spid State -------14 Awaiting 15 Awaiting 52 Awaiting 68 Awaiting Duplicates ---------0 0 0 0 Info ---101:0 TOKYO_DS.TOKO_RSSD 101:1 TOKYO_DS.TOKYO_RSSD 16777318:0 SYDNEY_RS 103:0 LDS.pubs2 Reads ----0 8867 2037 0 Bytes ----0 9058 2037 0 Message Message Message Message Writes ------ 0.1 0.1.0 B Writes -------0 0 0 0 B Filled ------0 34 3 0 B Reads ------44 54 23 B Cache ------0 2132 268 0 Save_Int:Seg -----------0:0 0:33 0:4 strict:O Next Read --------0.1.0 33.11.0 4.13.0 0.1.0 First Seg.Block --------------0.1 33.10 4.12 0.1 Readers ------1 1 1 1 Truncs -----1 1 1 1 Last Seg.Block -------------0.0 33.10 4.12 0.0 98 The most useful uses of this column are to track bytes/min throughput and to explain why the queue usage may be different than estimated (i. Obviously if the blocks were always full. This metric is from the viewpoint of the SQM thread and not the endpoint (DIST or DSI) that we think it is. block and row to be read.0. there is a substantial amount of 99 . but if continues to increase. sqt. Prior to the true endpoint. Note the word “rough” is underlined in the high-lighted sentence regarding calculating latency by subtracting Last Seg and Next Read. if state shows “Active” or “Awaiting I/O”. May surge high at startup due to finding the next row. the result would be close to 16K. A frequent command for inbound queue determination is admin who. in normal processing.Final v2. The efficiency of the block usage can be calculated by dividing “Bytes” by “B Writes”. a rough idea of the latency can be determined from the amount of queue to be applied ~ Last Seg – Next Read (answer in MB) Number of readers Number of truncation points Info Duplicates Writes Reads Bytes B Writes B Filled B Reads B Cache Save Int:Seg First Seg. However. these are indications – further commands will be necessary to determine exactly what the problem is. messages are being reread from the queue due to large transactions or SQT cache too small. after startup. First undeleted segment and block in the queue. it is caught up and not necessarily part of the problem.Block. while for outbound queues.e. If Replication Server is behind. If continually behind. the size of the queue can be quickly calculated via Last Seg – First Seg (answer in MB) The next segment. Although a more detailed discussion is in the Reference Guide. low block density). tuning exec_cmds_per_timeslice may help Number of messages read from queue. As such. Last segment and block written to the queue. If consistently higher than Reads. If it points to the next block after Last Seg. Queue id and database connection for queue Number of LTL records judged as already received – can increase at Rep Agent startup. a quick summary of the output is listed here for easy reference. The reason for the highlighting is that this method is not exactly accurate. As a result.1 Now that we understand how Replication Server allocates space (1MB allocations) and performs I/O (16K blocks – 64 blocks per 1MB). However. the SQM is busy writing data to/from disk. Column Spid State Meaning RS internal thread process id – equivalent to ASE’s spid Current state of SQM – Awaiting message. If the inbound queue and not a warm standby. Number of actual bytes written to queue. Number of messages (LTL rows) written to the queue.Block Last Seg. it most likely will be a look at the replicate database.Block Next Read Readers Trunc In the above table. However. then the queue is quiesced (caught up). performance indicators were highlighted. 
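To make the backlog arithmetic concrete before the column descriptions that follow (figures are from the sample output above; each segment is 1MB, made up of 64 16K blocks): for queue 101:1 (TOKYO_DS.TOKYO_RSSD), Last Seg.Block is 33.10 and Next Read is 33.11.0 - the reader is positioned at the block immediately following the last block written, so this queue is quiesced. If instead Last Seg were, say, 120 while Next Read remained at 33.11.0, the backlog still to be read would be roughly 120 - 33 = 87 MB.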
Number of 16K blocks written to queue Number of 16K blocks written to queue that were full Number of 16K blocks read from queue Number of 16K blocks read from queue that are cached Save interval in minutes (left of colon) and oldest segment (1MB allocation) for which save interval has not yet expired. you will most likely be seeing a backlog develop. this is often not the case as transactions tend to be more sporadic in nature. if this number starts outpacing writes by any significant number. then reading is not keeping up with writes. it is a sign of someone recovering the primary database without adjusting the generation id. the above starts to make a bit more sense. The range is 1 to 100.x sqm_recover_segs (Default: 1. Max: 100) sqm_warning_thr_ind (Default: 70. Given that IO operations today are in the low ms range. Given that IO operations today are in the low ms range. Recommendation: 10) sqm_warning_thr1 (Default: 75. Of course.Min: 1.or if the queue is being read.Final v2. To realize what this means. Max: 100) sqm_warning_thr2 (Default: 90.x 11. the SQM writer will check if there are actually readers waiting for this block. DSI or RSI threads. this default value probably should be lowered – see next configuration for rationale. but lengthening the recovery time due to more segments needing to be analyzed during recovery.x 12.x Meaning Write delay for the Stable Queue Manager if queue is being read.x 11. if after successive queries the Next Read/Last Seg shows no latency. we will excessively delay Replication Agent processing and have a bigger impact on the system overall. As we discuss the SQT thread and DSI SQT module. Percent of partition segments used to generate a second warning. However. When the delay time has expired. The range is 1 to 100. The likely cause of waiting for the queue to be read would be rescanning for large transactions. To get the smallest possible latency you’ll have to set init_sqm_write_delay to 100 or 200 milliseconds and batch_ltl to false (sp_config_rep_agent). Specifies whether or not writes to memory buffers are flushed to the disk before the write operation completes. Decreasing init_sqm_write_delay will cause more I/O to 100 .1 cache likely in the SQT or DSI (dsi_sqt_max_cache_size) that can be masking the latency.0. init_sqm_max_write_delay (Default: 10000. However. improving throughput. then SQM will adjust this time and make it longer for the next wait time. You may want to change this parameter if you have special latency requirements and the updates to the primary database are done in bursts. we will explain in more detail the times and conditions when this could be inaccurate. Recommendation: 100) 11. SQM Tuning To control the behavior of the SQM. and the SQM needs to close the block so that the reader can access it immediately. The stable queue manager waits for at least init_sqm_write_delay milliseconds for a block to fill before it writes the block to the correct queue on the stable device . then it will be frequently attempting to read from the write block. this is the initial wait time.which again causes the SQM to double the time and wait before it again tries to write. The range is 51 to 100. it will delay writing by this initial delay. If the reader is caught up. then they will not be waiting for this block and consequently. there are a couple of configuration parameters available: Parameter init_sqm_write_delay (Default: 1000. if the reader is behind and is still processing previous blocks. 
then it is in fact waiting for the disk block. Init_sqm_write_delay should be less than init_sqm_write_max_delay. delaying rows from being appended to it. Recommendation: 50) RS 11. If there are no readers waiting for the block. you have to remember that the reader for the block typically will be the SQT. the SQM will write less frequently. the SQM can wait a bit longer to see if the block can be filled before flushing it to disk. The other option is that the queue is still being read . Values are "on" and "off. Percent of partition segments (stable queue space) to generate a first warning. this should be lowered. Max: 100) sqm_write_flush (Default: “on”.1 The first two take a bit of explaining. Percent of total partition space that a single stable queue uses to generate a warning. Recommendation: “off”) 12. If we allow up to a 10 sec delay due to rescanning a large transaction. By increasing. Controls how often the SQM updates rs_oqid’s.Min: 51. The maximum write delay for the Stable Queue Manager if the queue is not being read. The downside is that if the SQT is completely caught up." Essentially allows file system devices to be used safely (ala ASE’s dsync option).1 11.Min: 1. then it likely is that true that no latency exists (exception is Warm-Standby). and the block is not full. Note that this delay does not mean the SQM is “sleeping” . but are rather bursty may degrade performance as the Rep Agent effectively has a synch point with the SQM – basically another block can not be forwarded until the first one is on disk. it will append them to the block. from cache. These seem confusing. it will have to be flushed. it waits init_sqm_write_delay before writing the current block to disk. However. Assume the default settings and that it the rows are 1KB each – so it will take 16 rows to fill a block. a little bit different. 1. 3. forcing the SQT to read it from disk vs. Rep Agent sends an ltl_buffer_size block of LTL. Finally. if the SQM reader is not up or is lagging.delaying RepAgent User throughput. If the RS is fully caught up. The key here is that this is how long the SQM will wait before writing to the queue if the DSI. it is able to retrieve the block from the SQM cache as discussed above if the next block to read is the current block.e. This is the final condition if a block wasn’t written yet because of a full condition or the init_sqm_write_delay. suspend the DSI or suspend distribution (the DIST thread starts/stops the SQT thread)). the block is simply read from cache vs. SQM begins receiving LTL rows and begins to build a 16K block. If so. Each time RS hits init_sqm_write_delay. init_sqm_write_max_delay is how long a block will be held due to the fact that the DSI. The question some may ask is what happens if other replicated rows arrive from the Replication Agent. consequently block is not flushed to disk unless it is full. 101 . but consider the following scenario: 1. However. 4. Increasing this value in situations in which the transactions do not quite fill up a full block. If the reader does not come back up within init_sqm_max_write_delay. the SQM readers (when up) may be requesting to read the same disk block as was just written. RSI or SQT threads are suspended and not reading from the queue or the reader was not waiting for the block so the SQM delayed past init_sqm_write_delay. Once the block is full and the wait has expired. it is likely that the real delay in writing to the queue when the queue is being read is init_sqm_write_max_delay and not init_sqm_write_delay.0. 
so block is written to disk. Due to normalization. but is quiescent on weekends and evenings. the block is still cached in memory of the SQM. This parameter has to do more with when the block will be flushed from memory. Assuming the DSI. On the other hand. As a consequence in many systems it is a good idea to reduce init_sqm_write_max_delay. If the block was not full and the readers were not waiting for it. Let’s assume we have a system that is being updated 10 times per second during normal working hours. DSI. 3.1 occur as a small init_sqm_write_delay will write blocks that are not filled completely. RSI or SQT threads are active to ensure full blocks. the block is flushed to disk regardless of full status. If RS is fully caught up. the next block will wait longer (to a maximum of init_sqm_write_max_delay). However. A better solution than to increase this parameter is to simply ensure that batch_ltl is on at the Rep Agent (if on. the SQM will flush it to disk. RSI or SQT are up and the SQT is actively reading the queue. this may be less space in the queue. the rapid polling read cycle against the SQM write block will cause the SQM to delay appending new rows to the block . 2. the SQM at the end of the “wait” cycle will check to see if there are more write requests. SQM begins receiving LTL rows and begins to build a 16K block. this parameter controls how long the SQM will keep the block in cache waiting for the reader to resume or catch up. RSI. but under normal circumstances it will be sufficient). let’s consider what likely happens in real life. The reader will have to do a physical I/O to retrieve the disk block. readers are not up. If the SQT is completely caught up. the block it is requesting is the one just written.if the block is not full. for increased throughput. Let’s kill the SQM reader (i. A flush to the queue is guaranteed to happen after waiting for init_sqm_write_max_delay. This is important – it means that the SQM will delay writing partially full blocks when the SQT is busy reading – consequently: • A large transaction that is removed from the SQT cache and is being re-read (and keeping the SQT busy reading) may reduce throughput as it is likely that once the block is full. • The other important aspect is that the configuration value is the initial wait time. Now. Init_sqm_write_delay expires. you may wish to increase this parameter in bursty environments with low transaction rates to ensure more full blocks are written and consequently less i/o required to read/write to queue. The SQM cheats and simply reads the block from cache. As a result. Init_sqm_write_delay expires. after RS has been in operation for any length of time. however. 2. the copy flushed to disk. If the reader comes back up within init_sqm_max_write_delay. it will double the time up to init_sqm_max_delay.Final v2. This will fill up the stable queue faster with less dense blocks. or SQT reads the next block. To avoid unnecessary disk I/O. 000ms – a very rare situation in which queue space may be at a premium and write activity is very low in the source system. we would be flushing a block containing 10 rows of data. the intent here is to reduce the impact of updating the RSSD – not that the RSSD can’t handle the load. 5. From an input standpoint.Final v2. 2. at some point. or any other event that suspends SQM processing. be aware that increasing this parameter may also increase recovery time after a shutdown. the SQM does the following 1. 
the most common cause of SQM contributing to performance issues is simply if the SQM can’t write to disk fast enough. the block will in all likelihood fill and get flushed to disk. you could try decreasing these values as well as looking at the cumulative writes (in MB) for all the queues on the same disk partition or look at the sqm_recover_segs to see if you can speed up the SQM processing. every 10 seconds. Normal SQM processing is fairly fast – however. So. However. the SQM will update the rs_oqid less frequently.5 seconds. Update the rs_oqid with the last oqid processed for the segment Check if there is space on the current partition being used Check to see if the current partition has been marked to be dropped Check if a new disk_affinity setting specifies a different location Update the disk partition map and allocate the new segment If a large number of connections exist or in a high volume system. RSI or SQT) is down for any length of time. 5. if we use the suggested value of 1 second (from the table above). it is more probable that the queue will begin to back up if the SQM reader is down. if there are RAWriteWaits and you have already maximized exec_sqm_write_request_limit. if we waited for a full block we’d wait for ~16 seconds before the block flushed. keep this in mind. Increasing the init_sqm_max_write_delay beyond 10 seconds is probably not useful.6 . init_sqm_write_delay is doubled from it’s initial 1 second delay until init_sqm_max_write_delay (10 seconds) As activity starts. necessitating a physical I/O. reducing this value if there are no RAWriteWaits is likely not a going to help. This will show up as a RAWriteWait event (in RS 12. 4. But since we have a timer. Even though the timer has not expired. Consequently. While this sounds easy. The only time increasing this may make sense is if increasing the init_sqm_write_delay to greater than 10.1 1. so adjusting it from 1 to 2 may not show any appreciable impact. the SQM will need to allocate a new segment. Process repeats with the SQM block being written at a rate of 1 every ~1. Someone looking at the replicate database might notice the 10 second delay and make some wrong assumptions about why the delay and try tuning different areas of RS – especially if they have a desire to see RS latency in the 1-2 second range.0. updates to the RSSD have the worse effect of degrading RS throughput at that point in time. if you see that the transaction log’s rate exceeds the SQM rate – it may be an indication that the Rep Agent is not able to keep up. If the SQM ‘waits’ too long. A new block is allocated and the timer reset to 0. the SQM actually has to do a bit of checking. the counter_obs for RAWriteWaitTime will be incremented). 3. Other than the “lucky” instances where you might see the state column in the admin who. Much like changing the Replication Agent scan_batch_size to reduce the updates to the rs_locater. hibernation. the end of the current 1MB segment will be reached. Also. the other steps will still have to be performed (but the time to do so will likely nearly be cut in half). If the SQM reader (DSI.in RS 15. By increasing this value. but since this is done inline with RS processing.0. remember that this reduces the updates to rs_oqid only – during a segment allocation. As a result. Additionally. reducing both the init_sqm_write_delay and init_sqm_max_write_delay can help. Since there is no activity. 3. Generally speaking. the block will be flushed to disk. after a short time. 
And that is why it probably is useful to reduce init_sqm_write_max_delay for low throughput systems – while the blocks will be flushed nearly empty. the SQM write 102 .5 seconds. the latency will be reduced. What happens if the transaction rate slows to 1 per second? At 1KB rows and 16KB blocks. However. the cache of write requests (exec_sqm_write_request_limit) will be filled and the RepAgent User will be forced to wait. From a performance perspective. setting this value to 10 can help as SQM flushes to the RSSD are reduced yet for recovery the most that will have to be scanned is 10 blocks (~160KB). sqm command stating “Awaiting I/O” this may be difficult to detect as the bytes written to the queue may be more than what was written to the transaction log. 2. At slightly more than 1. At that point. RS is booted/re-booted on a weekend. For example. enough rows have arrived that the block is full. Note that the SQM does not currently update the RSSD with every block anyhow. each block would only contain 1 row of data at 1 transaction per second activity rates. you may wish to adjust sqm_recover_segs. However. However. the first rows arrive – since the block is not full. 4. Consequently. the SQM delays writing the block (the timer will expire in 10 seconds). the Rep Agent or DIST will still be supplying data to the SQM. the block will be flushed at init_sqm_write_max_delay regardless of whether or not it is full. Whenever a segment is full and new one is allocated. Obsolete. Total segments allocated to a queue during the current statistical period. Total srv_sleep() calls by an SQM Writer client while waiting to write an enable rs_marker into the inbound queue. Total active segments of an SQM queue: the number of rows in rs_segments for the given queue where used_flag = 1.1 and 12. Total messages that have been rejected and ignored as duplicates by an SQM thread. remember that a stable device may be used by more than one connection. Total number of full blocks written by an SQM thread. this was supplemented by adding counters from the SQM Reader and some of the SQM module counters were shifted to the SQM Reader module counters (listed as deprecated/obsolete in the counter description as you will see below). Total srv_sleep() calls by an SQM Writer client while waiting to write a drop repdef rs_marker into inbound queue. Total srv_sleep() calls by an SQM Writer client due to waiting for the SQM thread to get a free segment. From a write speed aspect. Obsolete. In 12. Total bytes written to a stable queue by an SQM thread. Obsolete. Obsolete. Consequently if experiencing a high rate on one or more connections. Average command size written to a stable queue. Total segments deallocated from a queue during the current statistical period.6 are: Counter Name AffinityHintUsed BlocksFullWrite Explanation Total segments allocated by an SQM thread using user-supplied partition allocation hints. This SQM module thread counters for RS 12. See CNT_SQMR_SLEEP_Q_WRITE. BlocksRead BlocksReadCached BlocksWritten BPSaverage BPScurrent BPSmax BytesWritten CmdSizeAverage CmdsRead CmdsWritten Duplicates SegsActive SegsAllocated SegsDeallocated SleepsStartQW SleepsWaitSeg SleepsWriteDRmarker SleepsWriteEnMarker SleepsWriteQ 103 .1 is likely the largest cause of Replication Agent latency – however. so concentrating on this is likely not going to help reduce overall latency much. Current byte deliver rate to a stable queue. See CNT_SQMR_BLOCKS_READ. See CNT_SQMR_BLOCKS_READ_CACHED. 
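Pulling the SQM recommendations above together, a sketch of the corresponding commands (the values and the PDS.pdb1 / part2 names are illustrative; the maximum-delay parameter appears in some releases as init_sqm_write_max_delay and in others as init_sqm_max_write_delay, so verify the exact name in rs_config for your version):

-- smaller write delays for a low-volume system where latency matters
configure replication server set init_sqm_write_delay to '100'
go
configure replication server set init_sqm_write_max_delay to '1000'
go
-- fewer rs_oqid updates during segment allocation on a busy system
configure replication server set sqm_recover_segs to '10'
go
-- give a high-volume connection its own stable device
alter connection to PDS.pdb1 set disk_affinity to 'part2'
go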
SQM Monitor Counters SQM Thread Monitor Counters In RS 12. See CNT_SQMR_COMMANDS_READ. it is likely advisable to use disk_affinity to spread the writes across different devices for different connections. Total number of 16K blocks written to a stable queue by an SQM thread Average byte deliver rate to a stable queue.5 there was only a single group of counters that applied to the SQM thread. This includes separating inbound and outbound connections as well.0. the latter use the SQMR module. Individual blocks can be written due either to block full state or to sysadmin command 'show_queue' (only one message per block). While the former still use the module name of SQM. Total commands written into a stable queue by an SQM thread.6.Final v2. Maximum byte deliver rate to a stable queue. Total srv_sleep() calls by an SQM Writer client due to waiting for SQM thread to start. the biggest probably cause of latency is likely at the DSI. Elapsed time. Timer stops when the next segment is allocated. Timer starts when a segment is allocated. However. Timer stops when the segment is deleted. TimeMaxNewSeg (intrusive) TimeMaxSeg (intrusive) UpdsRsoqid WriteRequests WritesFailedLoss WritesForceFlush WritesTimerPop XNLAverage XNLInterrupted XNLMaxSize XNLPartials XNLReads XNLSkips XNLWrites Replication Server 15. Total updates to the RSSD. Timer starts when a segment is allocated or RepServer starts. See CNT_SQMR_XNL_INTR. SQM writer thread initiated a write request due to timer expiration. Total message writes requested by an SQM client.rs_oqid table by an SQM thread. so care should be taken when attempting to time RS speed using this counter. This does not count skipped large message in mixed version situation. so care should be taken when attempting to time RS speed using this counter. See CNT_SQMR_XNL_READ.. in 100ths of a second. Timer starts when a segment is allocated or RepServer starts. The maximum elapsed time. The maximum size of large messages written so far. to allocate a new segment. Total large messages skipped so far. Each new segment allocation may result in an update of oqid value stored in rs_oqid for recovery purposes. to process a segment. Obsolete. Timer stops when the segment is deleted. Timer stops when the next segment is allocated. to process a segment. SQM_WRITE_LOSS_I. in 100ths of a second. in 100ths of a second. See CNT_SQMR_XNL_PARTIAL. Timer starts when a segment is allocated or RepServer starts.5. in 100ths of a second. Timer stops when the next segment is allocated. Average elapsed time.1 Counter Name SleepsWriteRScmd TimeAveNewSeg (intrusive) TimeAveSeg (intrusive) TimeLastNewSeg (intrusive) TimeLastSeg (intrusive) Explanation Total srv_sleep() calls by an SQM Writer client while waiting to write a special message. Includes time spent due to save interval. which is typically associated with a rebuild queues operation.0. Obsolete. Total large messages written successfully so far.0 has slightly different SQM counters: Counter Name CmdsWritten Explanation Commands written into a stable queue by an SQM thread. Obsolete. to allocate a new segment. to allocate a new segment. The maximum elapsed time. Timer starts when a segment is allocated. in 100ths of a second. This only happens when site version is lower than 12. The elapsed time. SQM writer thread has forced the current block to disk when no real write request was present.Final v2. Timer stops when the segment is deleted. Average elapsed time. such as synthetic rs_marker. 
Total writes failed by an SQM thread due to loss detection. typically by quiesce force RSI or explicit shutdown request. 104 . Includes time spent due to save interval. in 100ths of a second. to process a segment. there is data to write and we were asked to do a flush. Average size of large messages written to a stable queue. Timer starts when a segment is allocated. Command size written to a stable queue. AffinityHintUsed UpdsRsoqid WritesFailedLoss WritesTimerPop WritesForceFlush WriteRequests BlocksFullWrite CmdSize XNLWrites XNLSkips XNLSize SQMWriteTime 105 .. The amount of time taken for SQM to write a block. to process a segment. SQM_WRITE_LOSS_I. Large messages written successfully so far. which is typically associated with a rebuild queues operation. Timer stops when the segment is deleted. Large messages skipped so far. Elapsed time.5. srv_sleep() calls by an SQM Writer client due to waiting for SQM thread to start. Each new segment allocation may result in an update of oqid value stored in rs_oqid for recovery purposes. Number of full blocks written by an SQM thread. Writes failed by an SQM thread due to loss detection. in 100ths of a second. Segments deallocated from a queue during the current statistical period. srv_sleep() calls by an SQM Writer client while waiting to write an enable rs_marker into the inbound queue. Segments allocated by an SQM thread using user-supplied partition allocation hints. This does not count skipped large message in mixed version situation. Message writes requested by an SQM client. to allocate a new segment. Messages that have been rejected and ignored as duplicates by an SQM thread.1 Counter Name BlocksWritten BytesWritten Duplicates SleepsStartQW SleepsWaitSeg SleepsWriteRScmd SleepsWriteDRmarker SleepsWriteEnMarker SegsActive SegsAllocated SegsDeallocated TimeNewSeg TimeSeg Explanation Number of 16K blocks written to a stable queue by an SQM thread Bytes written to a stable queue by an SQM thread. The elapsed time. srv_sleep() calls by an SQM Writer client due to waiting for the SQM thread to get a free segment. Segments allocated to a queue during the current statistical period. SQM writer thread has forced the current block to disk when no real write request was present. such as synthetic rs_marker.0. Timer starts when a segment is allocated or RepServer starts. typically by quiesce force RSI or explicit shutdown request. Active segments of an SQM queue: the number of rows in rs_segments for the given queue where used_flag = 1. The size of large messages written so far. Timer stops when the next segment is allocated. This only happens when site version is lower than 12.Final v2. in 100ths of a second. srv_sleep() calls by an SQM Writer client while waiting to write a special message. Updates to the RSSD.rs_oqid table by an SQM thread. srv_sleep() calls by an SQM Writer client while waiting to write a drop repdef rs_marker into inbound queue. there is data to write and we were asked to do a flush. SQM writer thread initiated a write request due to timer expiration. Individual blocks can be written due either to block full state or to sysadmin command 'show_queue' (only one message per block). Timer starts when a segment is allocated. However. you can see where the updates to the RSSD are a lot higher than we would like. counters – are more for just tracking the space utilization – although ideally. 
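As recommended earlier, when several busy connections share the same stable devices it is advisable to use disk_affinity to separate them. A hedged sketch of what that looks like follows; the partition name, device path, size and connection name are all hypothetical, and older releases use the add partition syntax instead of create partition.

create partition inbound_p1 on '/dev/rrs_part1' with size 2048
go
alter connection to PDS.pdb set disk_affinity to 'inbound_p1'
go

Keep in mind that disk_affinity is a hint rather than a guarantee; if the preferred partition has no free segments, the SQM will allocate from another partition, which is one reason the AffinityHintUsed counter is worth watching.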
There are a number of counters which, when compared with their counterparts, can provide insight into what the possible causes of performance issues might be. In the above list, one new counter of interest is SQMWriteTime; while a byte delivery rate is possibly useful, this counter may help more, since it shows how long each 16K I/O takes for a full block. The SQM counter values can be viewed in at least two different comparisons. The first, and the normal one, is to compare the current sampling's values with the previous interval's values; this establishes an idea of the rate of a single activity, for example, CmdsWritten compared with itself could demonstrate a rate (when normalized) of 100 commands/second. The second way is to compare multiple counters within the same sample interval. For instance, consider the following:
RAWriteWaitPct = RAWriteWaits / WriteRequests
CmdsWritten, CmdSizeAverage
BlocksFullPct = BlocksFullWrite / BlocksWritten
SegsActive, SegsAllocated, SegsDeallocated
UpdsRsoqidSec = UpdsRsoqid / Sec
RecoverSeg = SegsAllocated / UpdsRsoqid
The first (RAWriteWaitPct) is a derived value from taking the RAWriteWaits counter from earlier and dividing it by the number of SQM WriteRequests. This gives a rough percentage of the time that the RepAgent User had to wait in order to write; when this happens, the RepAgent User is forced to sleep, which in turn forces the Replication Agent to stop scanning. Even a low value such as 5-10% could be indicative of a problem once you realize that the default init_sqm_write_delay is 1 second, which causes the ASE RepAgent to have to wait. For most people, since they have not adjusted exec_sqm_write_request_limit from the default of 16384, the RepAgent User thread is forced to go to sleep once its outstanding write requests exceed little more than what the SQM Writer can pack into one block. Increasing it to the maximum of 983,040 (60 16K blocks) provides a lot more cushioning for the RepAgent User to keep processing write requests before it is forced to sleep by the SQM. The key to all of this is realizing that the SQM writes and reads 16K blocks (this is not configurable).
The next sequence of counters (CmdsWritten, CmdSizeAverage) tells us how many commands actually were written into the queue. CmdsWritten should compare fairly closely with CmdsTotal from the RepAgent, although it may not be exactly equal, as purge commands during a recovery and other system commands are not written to the queue; CmdsTotal is itself broken down by CmdsApplied, CmdsRequest, CmdsSystem and so on, so the comparison is approximate, and you may not be able to directly compare CmdsWritten to the DIST counter values either. For an outbound queue this could be different again, as the same outbound queue may be receiving transactions from more than one source (a corporate rollup implementation). CmdSizeAverage is the first place that we get a look at how big each command is from the source when packed into SQM format; given that the inbound queue often has a 2-4x space explosion, this can literally mean that every 4-8KB of log data turns into considerably more queue space. You need to look at this realistically: if the primary activity was a bcp of 200 rows/second, the obvious implication is that the RepAgent can only read that particular table's rows out at about half the speed of the bcp, and where a single connection is involved, the replication to other destinations will take at least twice as long as the original bcp.
The next two sets (BlocksFullPct, and SegsActive/SegsAllocated/SegsDeallocated) are ones to watch. In most busy systems BlocksFullPct will likely be close to 100%, as every block is written because it is full rather than because the timer popped; numbers less than 100% indicate that not a lot of commands are coming into RS on a throughput basis. The SegsActive, SegsAllocated and related counters are more for tracking space utilization; ideally the goal is to see SegsAllocated and SegsDeallocated matching. While this is a way of tracking disk space utilization, it should not be used as an indication of latency (it could be, but it also could be due to something else).
The last two (UpdsRsoqidSec and RecoverSeg) are related and likely a big factor in the performance of the SQM. As the SQM tracks its progress, it updates the OQID in the RSSD; when RS is restarted or a connection resumed, RS uses the OQID from the RSSD to locate the current segment and block. The more frequently this is updated, the shorter the distance RS has to scan from the point the RSSD was last updated to the current working location. In this case, however, we are concerned about the cost of the updates versus the speed of recovery for RS, and in the sample data you can see where the updates to the RSSD are a lot higher than we would like.
A sub-second recovery interval is likely overkill – and yet most DBAs are surprised to find out that during busy periods. just like with the Rep Agent scan batch size. the normal is to compare the current sampling’s values with the previous interval’s. So. the shorter RS has to scan from the point the RSSD was last updated to the current working location. However. For example. When you add in the outbound queue and multiply across the number of connections. the timer pop. by default. BlocksFullPct will likely be 100% as every block is written when full vs. etc. The more frequently this is updated. it shouldn’t be used as an indication of latency (it could be – but it also could be just due to something else). increasing it to the maximum of 983. the obvious implication is that the RepAgent can only read that particular table’s rows out at half the speed of bcp. SegsDeallocated UpdsRsoqidSec = UpdsRsoqid / Sec RecoverSeg = SegsAllocated/UpdsRsoqid The first counter (RAWriteWaitPct) is a derived value from taking the RAWriteWaits from earlier and dividing it by the number of SQM WriteRequests. Regardless. However. SegsAllocated. Adjusting sqm_recover_seg from its default of 1 to 10 or another value and watching both UpdsRsoqidSec and RecoverSeg to fine tune it is likely a good course of action. In most busy systems. Given that the inbound queue often has a 2-4x space explosion. This tells us a rough percentage of the time that the RA had to wait in order to write. If the primary activity was a bcp of 200 rows/second. Again. that many of the averages. This establishes an idea of the rate of a single activity. From the earlier table. or timeout interruptions.6: Counter CmdsRead BlocksRead BlocksReadCached SleepsWriteQ XNLReads XNLPartials XNLInterrupted Explanation Total commands read from a stable queue by an SQM Reader thread. or timeout interruptions. Number of interruptions so far when reading large messages with partial read. Partial large messages read so far. As a result. First let’s look at the counters from RS 12. This does not count partial message. in queue processing.6 and 15.0. these would have the instance_val column value of 11 for the SQT SQMR and 21 for the WS-DSI SQMR. Total srv_sleep() calls by an SQM Reader client due to waiting for SQM thread to start. srv_sleep() calls by an SQM read client due to waiting for the SQM thread to write. For the inbound queue. For instance. These can be distinguished via the counter structures. srv_sleep() calls by an SQM Reader client due to waiting for SQM thread to start. it will either be a DSI or an RSI thread. Total number of 16K blocks from cache read by an SQM Reader thread. the readers are the SQT and/or the WS DSI threads. The number of segments yet to be read. a Warm Standby that doesn’t have distribution disabled or is replicating to a third site will have both a DSI set of SQMR’s (for the Warm Standby DSI which reads from the inbound queue) and a SQT set of SQMR’s. unexpected wakeup. The amount of time taken for SQMR to read a block. we are often comparing the read rate to the write rate. the counters below are actually from the respective reader thread in RS 12. Counter CmdsRead BlocksRead BlocksReadCached SleepsWriteQ XNLReads XNLPartials XNLInterrupted Explanation Commands read from a stable queue by an SQM Reader thread. Number of interruptions so far when reading large messages with partial read. SleepsStartQR SQMRReadTime SQMRBacklogSeg SQMRBacklogBlock 107 . you might think they are in the wrong location. 
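If the counter samples are being flushed to the RSSD, the derived ratios discussed above can be computed with plain SQL rather than by hand. The query below is only a sketch under several assumptions: it assumes the observations land in rs_statdetail and the counter metadata in rs_statcounters, the column names used (counter_id, counter_name, run_id, counter_obs) should be checked against the RSSD system table definitions for your release, and a production version would need to guard against zero denominators and restrict the rows to the queue of interest via the instance columns.

select d.run_id,
       RAWriteWaitPct = 100.0 * sum(case when c.counter_name = 'RAWriteWaits' then d.counter_obs else 0 end)
                              / sum(case when c.counter_name = 'WriteRequests' then d.counter_obs else 0 end),
       BlocksFullPct  = 100.0 * sum(case when c.counter_name = 'BlocksFullWrite' then d.counter_obs else 0 end)
                              / sum(case when c.counter_name = 'BlocksWritten' then d.counter_obs else 0 end),
       RecoverSeg     = 1.0 * sum(case when c.counter_name = 'SegsAllocated' then d.counter_obs else 0 end)
                            / sum(case when c.counter_name = 'UpdsRsoqid' then d.counter_obs else 0 end)
from rs_statcounters c, rs_statdetail d
where c.counter_id = d.counter_id
group by d.run_id
order by d.run_id
go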
or nonblock read request which is marked as READ_POSTED. we will discuss them here. However.1 SQMR Counters After describing where these counters are located. For the outbound queue. or nonblock read request which is marked as READ_POSTED. Such interruptions happen due to time out. Total srv_sleep() calls by an SQM read client due to waiting for the SQM thread to write.0 and not actually part of the SQM thread. The number of blocks within a partially read segment that are yet to be read. RS 15. and given the name. Number of 16K blocks read from a stable queue by an SQM Reader thread. we saw that in rs_statdetail. Total large messages read successfully so far.Final v2. SleepsStartQR Similar to the SQM counters. The SQMR actually refers to the SQM code executed by the reader. unexpected wakeup. Such interruptions happen due to time out. Total number of 16K blocks read from a stable queue by an SQM Reader thread. Total partial large messages read so far. Number of 16K blocks from cache read by an SQM Reader thread.0 has a few modifications for SQM Readers as well. Large messages read successfully so far. This does not count partial message. the defaults) could cause both the writer and reader to spend a lot of time sleeping vs. helps to look at the counters in terms of the progression of data through the replication server. However because of rescanning. While the SQM counters SegsAllocated. it is most likely that BlocksReadCachedPct will start high and rapidly drop to zero as the backlog in the DSIEXEC causes the DSI to lag far behind in reading the queue vs.1 The last three (which are new in RS 15. and no RepAgent latency. this ratio might drop. while reading from cache is ‘good’ for the reader. you should be concerned that the writer is not flushing blocks fast enough so that the reader is constantly have to wait for the next write – see counter SleepsWriteQ. For the inbound queue. and SegsActive would appear to give that information. The final SQMR counter takes a bit of explanation. What could be happening is that the SQM writes a block. a segment could have been read a long time before it is deallocated. you may want to reduce sqm_init_write_delay (and the max) to reduce the sleep time. If between the time that the writer requests the block to be written and it starts to re-use the memory to build the next block. The problem with SQMR for 12. 2.particularly the Backlog counters .0) are interesting. but it doesn’t look like reading is fast enough (usually indicated by the fact the SQT cache is not full and BlocksReadCachedPct < 30%). Ideally we would like to see this higher than 75%. the best counters to consider for the SQMR include: CmdsRead BlocksReadCachedPct = BlocksReadCached/BlocksRead SleepPct = SleepsWriteQ/BlocksRead Ideally. doing work – resulting in RepAgent latency (as high RAWriteWaits as eventually exec_sqm_write_request_limit fills). 108 . BPScurrent. the SQMRReadTime could be used as means of determining the length of time it will take to read it at the current rate (although this is likely an idealistic number). One aspect to remember. sqm next. SegsDeallocated. SleepsWriteQ itself refers to the number of times the reader was put to sleep while waiting for the SQM to write. a possible cause is that the writer is constantly waiting on read activity – and when it does. 1. you may frequently see a much higher value – especially when rescanning large transactions that were removed from the SQT cache. 
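Since the same SQMR counters are kept once per reader, it helps to keep the readers separate when looking at saved samples; otherwise a caught-up SQT can mask a lagging WS-DSI, or vice versa. A hedged sketch follows, reusing the instance_val convention mentioned above (11 for the inbound queue SQT reader, 21 for the WS-DSI reader); as with the previous query, the rs_statdetail and rs_statcounters column names are assumptions to verify against your RSSD.

select reader = case d.instance_val when 11 then 'SQT SQMR'
                                    when 21 then 'WS-DSI SQMR'
                                    else convert(varchar(12), d.instance_val) end,
       c.counter_name,
       total_obs = sum(d.counter_obs)
from rs_statcounters c, rs_statdetail d
where c.counter_id = d.counter_id
  and c.counter_name in ('CmdsRead', 'BlocksRead', 'BlocksReadCached', 'SleepsWriteQ')
group by d.instance_val, c.counter_name
order by 1, 2
go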
a cause may be the configuration values sqt_init_read_delay and sqt_max_read_delay. this number (SleepPct) should be in the 300%-700% range – as long as the BlocksRead are nearly identical to BlocksWritten (or a decent BlocksReadCachedPct). the SQT reads it…forcing the SQM to sleep sqm_init_write_delay seconds before it can write the next one. it could delay the writer. Keeping this in mind. once again we will take a look at the customer data used earlier in the RepAgent User Thread discussion. Since this has a lower priority.0. the issue was that a segment is active until it is deallocated. You can see quickly how that large settings (i. The cache referred to for queue reads is an unconfigurable 16k of memory that the writer uses to build the next block to be written. at the same time if RepAgent latency exist (in ASE). Again.e. once the number of segments in the backlog is obtained.6. these were defaulted to 2000ms and 10000 ms respectively which meant that if the reader went to read and it was caught up. then it is able to “read from cache” rather than from disk. So if BlocksReadCached is high (i. but the SQT tries to read the next one during that time and is put to sleep sqt_init_read_delay seconds. On the other hand. While you would like to see high BlocksReadCachedPct numbers.read and last. this is best looked at in conjunction with (SQM) BlocksWritten (earlier) – but expressed as a ratio of how often it had to sleep for each block read. This caused so many problems with upgrades to RS 12. the SQMR may have to re-read significant numbers of blocks to re-create it later.Final v2. This increments the WriteRequests counter. If the SQT starts to lag and reading then gets behind. although anything higher than 30% is fine. These new counters .seg columns to determine a latency. The next counter (BlocksReadCachedPct) is the most important for the inbound queue reading. The counters BPSaverage.could be used much like the admin who. and BPSmax effectively measure the bytes per second rate of delivery of the write requests to the SQM while CmdSizeAverage records the average size of the commands in the write requests to the SQM. it sleeps sqm_init_write_delay to sqm_init_write_max_delay. To see how this works. of course.0. In RS 12. is that if a transaction is removed from SQT cache due to size.e. 100%) and there is RepAgent latency. Even better. This indicates that the SQMR is caught up. one aspect to watch is if the writing seems to be going fine.6 is that it could not be used to derive a relative latency. the defaults for these values was set at 1ms each – which is likely overkill in the other direction and could be causing DIST servicing problems from the SQT. The first thing that happens is that the SQM Writer client puts a write request message on the internal message queue (as discussed in the earlier section detailing the OpenServer structures). a reader requests a message from that block. constantly >700%) then it is likely that the sqm_init_write_delay is too high.6. though. SQM Thread Counter Usage Again. For the outbound queue. it would most likely sleep for 2 or more seconds – now causing it to be behind. This wait is likely caused by the SQMR (SQT or DSI) being caught up and therefore is waiting on more data to be written. Consequently. we would like to see CmdsRead equal to the SQM counter CmdsWritten. Alternatively. that in RS 15.e. The SQM checks each incoming message to see if it is a duplicate or if a loss was detected. if SleepPct is too high (i. So. the SQM writing. 
it can not be read by a SQM Reader client (SQT or WS-DSI). the counter AffinityHintUsed is incremented. Eventually. ii. If loss was detected. High values here may indicate that the maintenance activity is affecting throughput. This is useful to detect a bad configuration when the replicate is getting out of sync on tables using XNL Datatypes. 4. the counter BlocksFullWrite is incremented. corresponds to XNL Datatypes). populated. the counter WritesTimerPop is incremented. b. these counters are likely not as effective as the write request rate may not be driving new segments to be allocated fast enough. 3. If the block was written to disk because the init_sqm_write_delay or the init_sqm_max_write_delay write timer expired. If this happens. If the RS site version is 12. the time is measured from the last new segment allocated and the counters TimeAveNewSeg. the SleepsWriteQ counter is incremented. If this happens. or the inbound stream of data is not that high of a volume. hibernation or other maintenance activity that suspends or shuts down the SQM thread. d. This is an indication that either the SQM is not getting data from the RepAgent User Thread fast enough (i. Use of these counters are interesting in that they show the time it takes for each 1MB segment to be allocated. If the message is considered to be large (i. If the block was written to disk because it was full (essentially the next message would not fit in the space that was left). the counter UpdsRsoqid is incremented. b. This is an a. SleepsWriteEnMarker. If the RS site version configuration value is less than 12. this demonstrates the disk throughput in MB/milliseconds. The SQM is continuously performing space management activities. If the new segment is allocated according to the disk affinity setting. the SQM has to process these records. so it sleeps while the enablement or disablement occurs. This increments the SleepsWriteDRmarker. While the block is being filled. This can be overridden through a ‘rebuild queues’ command. the counter SegsActive is incremented.1 If it is a duplicate. the XNLWrites.5. As new requests come in. it will attempt to read the next block from the queue or SQM cache. and SleepsWriteRScmd accordingly. it may have to allocate additional segments. you may want to adjust the sqm_recover_segs configuration to reduce this. the block will get flushed to disk (reasons and counters below). If the SQM has to wait for the segment allocation.0. Since a segment is allocated only when needed.5 or greater. 109 . this will cause the counters BlocksWritten. including CmdsWritten and in some situations others as discussed below. a. If this value is fairly high and SQM write speed is blocking the EXEC or DIST rate. it receives the command records and begins filling out a 16K block in memory. BytesWritten to be incremented.e. If the command was a replication definition or subscription marker (rs_marker). If intrusive counters are enabled. Depending on the configuration values for sqm_recover_segs. the message is skipped and the XNLSkips counter is incremented. incrementing the SegsAllocated counter. indicating the number of segments that contain undelivered commands. Now that the SQM has space it can use to write to. Writes issued by such maintenance activities will cause the WritesFailedLoss counter to be incremented. and TimeMaxNewSeg are updated accordingly. c. typically the processing suspends. Regardless of the reasons. etc) e.Final v2. or a synthetic rs_marker. a. the counter SleepsWaitSeg is incremented. 
When an SQM Reader finishes processing its previous command(s). b. In low volume systems. it is discarded. RA User is starved for cpu time). While there is no counter that tracks how long it waits. 5. the time is built in to the above counters (TimeAveNewSeg. 6. i. the XNL related counters are affected. c.e. the new segment allocation may have to update the OQID in the RSSD. If the block was written to disk due to a RS shutdown. in a steady state high volume system. b. and XNLAverage counters are incremented. TimeLastNewSeg. the counter WritesForceFlush is incremented. the Duplicates counter is incremented and the SQM starts processing the next write request. written to disk – in other words. a. XNLMaxSize. This causes several counters to be affected. When the block is read. Let’s take a look at some sample data. disabling the DIST will provide more accurate values for these counters. in nonWS environments. If intrusive counters have been enabled. the counter XNLReads is incremented. c. admin sqm_readers command or the SegsActive can be an indication of how far the WS-DSI may be lagging behind. Obviously. and TimeMaxSeg counters are adjusted. it is not lagging behind. you may need to check the replicate_if_changed state of text/image columns or whether their replication is necessary. 7. If you see large values for XNLInterrupted. the segment is deallocated from the particular queue. TimeLastSeg. since the SQM is a single thread. the ratio of BlocksReadCached:BlocksRead is similar to the cache hit ratio in ASE and can indicate when the exec_sqm_write_request_limit/md_sqm_write_request_limit are too small – or that a SQM reader is lagging behind. the timers started when the segment was allocated (3(b) above) are sampled and the TimeAveSeg.Final v2. If the message contains XNL data. If the XNL data record spans more than one 16K block. 9. 8. However. the reader parses out the commands.000 character comment fields to a reporting system may not be necessary. one may be caught up (and incrementing BlocksReadCached) while the other is lagging. additional command records may need to be read as follows: a. the XNLPartials counter is incremented. we will use the customer data as well as in the insert stress test – starting with the customer data below. the next block will try to be fetched and processed. In cases where there are multiple readers. the SleepsWriteQ and BlocksReadCached may be the effect of the SQT processing the messages if distribution has not been disabled for the connection. the reading which are the SQMR counters). Once again.0. Once the block is read successfully. the counters BlocksRead and BlocksReadCached are incremented accordingly.e. a. This causes the counter CmdsRead to be incremented. b. This increments the SegsDeallocated timer. However. the write timers may have popped necessitating and write operation. When all SQM readers signal that they are finished with all the blocks on a particular segment. In such cases. In strict Warm Standby’s with no other replicate involved. derived statistics are in red. it may be an indication that the large message reading is blocking the SQM writes – which in turn may be slowing down the RepAgent processing. the segment is marked inactive and the SegsActive counter is decremented.1 indication that the SQM Reader is reading the blocks at the same rate that they are being written – i. It this occurs frequently. while the other may be lagging (see next point below). the SQM reader tells the SQM that they are finished with that block. Again. 
The same could be true for large comment columns – while these may be necessary for WS systems. Once the last row is read for the large message. Once the segment has been marked inactive and any save interval has expired. One of them (typically the SQT) may be reading fast enough to read the blocks from cache and may be tripping this counter. First. This continues for all 64 blocks in the segment. we will look at the writing side by looking at the SQM counters (vs. the reading of large messages is interrupted and the XNLInterrupted counter is incremented. remember that you may have multiple readers for an inbound queue. 110 . 10. For each partial XNL data record read. When this happens. replicating 16. Otherwise. Once all the commands have been read from a block and successfully processed. 322 22.99 511 569 370 287 358 387 623 536 497 393 312 0 0 0 0 0 0 0 0 0 0 0 511 569 370 287 358 387 622 535 497 393 312 1. SegsAllocated & SleepsWaitSeg – taken together.809 326.783 25.6 1. However.000 bytes per command. While not a high volume.98 99.355 1. these two can illustrate when the segment allocation process is hindering replication performance.566 376. This metric is useful when trying to determine how wide the rows are being replicated (for space projections) and especially compared to the RepAgent counter PacketsReceived.. this shows the interruption in RS processing to record recovery information.865 34. again. however.187 364.200 1.2 0.99 99.750 325. Sqm_recover_seg – this metric is derived by dividing the SegsAllocated by the UpdsRsoqid. we are averaging about 2 updates/second.153 19.e.674 1. For inserts. let’s compare this to the insert stress test: Sqm_recover_seg (derived) 1 1 1 1 1 1 1 1 1 1 1 CmdSizeAverage WritesTimerPop BlocksFullWrite UpdsRsoqid/sec (derived) SleepsWaitSeg BlocksWritten SegsAllocated CmdsWritten Sample Time BlocksFull% (derived) UpdsRsoqid 111 .975 2 3 1 2 2 0 1 1 0 0 1 32. CmdSizeAverage – This metric records the number of bytes necessary to store each command.392 23. Earlier.684 164. this is a good indication of the actual RS configuration parameter sqm_recover_seg.395 23. At ~1.032 1.664 18. This metric should be fairly close to the RepAgent counter CmdsTotal – although it may not be exact as some RepAgent User thread commands are system commands not written to the queue (such as truncation point fetches).759 39.691 36.0.99 99. this is the after row image. both of these are a good indication of how busy the input stream is to Replication Server.380 1. Much like the RA ECTS value. Specifically.Final v2.723 1.153 19.693 36. If we couple this metric with the RepAgent counter UpdsRslocater from above.340 317. while for updates. etc.974 99. both the after row image and the before row image – less identical values when minimal columns is enabled.283 266. this value is actually fairly important when looking at read activity and SQMR counters. 2.862 34. While this may not appear to be as useful given that CmdsTotal is broken down by CmdsApplied. that is all that will fit in the default packet size. it is just as likely to be caused by RSSD performance issues. we noted that we were getting about 3 RepAgent commands per packet (which includes begin/commit transaction commands) and this metric demonstrates why.097 1.759 39.334 253. Before we look at the SQM read (SQMR) counters.2 2 1.184 450.907 24.6 1.9 1. This system is consistently busy with very marginal timer driven flushes. 
UpdsRsoqid/sec – this metric is derived by dividing UpdsRsoqid by the number of seconds between sample intervals.011 825 32.903 24. CmdsSystem.99 99. WritesTimerPop & BlocksFull% – the second metric is derived by dividing BlocksWritten by BlocksFullWrite.1 0:29:33 0:34:34 0:39:37 0:44:38 0:49:40 0:54:43 0:59:45 1:04:47 1:09:50 1:14:52 1:19:54 268.7 1.246 31. again it shows the impact on the RSSD.8 1.99 100 100 99. If the CmdSizeAverage is large – i.3 1 Let’s take a look at some of these: CmdsWritten – This corresponds to the number of commands actually written to the queue.99 99.1 1.98 100 99.000 bytes – this could result in a single command per packet being sent using the default packet size. Adjusting this slightly could improve RS throughput.320 22.655 1. CmdsRequest.248 31.663 18. Any writes caused by a timer pop indicate that the SQM block wasn’t full indicating a lull in activity from the Replication Agent User thread.705 253.783 25. A non-busy system would likely have a lot more and correspondingly a lower full %.190 893 1. The actual cause of the delay could be I/O related. 67 20.07 18.Final v2.134 19.458 105 817 471 2.705 253.615 5. let’s view the customer data metrics (note that segment allocation metrics are SQM and not SQMR counters): BlocksReadCached Cached Read % 0:29:33 0:34:34 0:39:37 0:44:38 0:49:40 0:54:43 0:59:45 1:04:47 1:09:50 1:14:52 1:19:54 268.120 56 99. anytime the SQMR. Consequently.11 253.781 33.054 194.340 317. when the commit was finally seen.96 144.26 5.03 54.828 54. Now let’s take a look at some read statistics by looking at the SQMR counters.70 33.398 40.438 10. First.404 73.CmdsWritten.860 947.04 303 38 2 2 25 2 41 57 29 3 2 511 569 370 287 358 387 623 536 497 393 312 621 835 403 287 335 412 583 522 523 422 312 Let’s take a look at some of these: CmdsWritten (SQM) vs. What has happened is that the SQT cache was filled causing large transactions to get removed from cache. the SQT had to re-read the entire transaction from disk – and consequently had to re-request the commands from the SQM.975 17.29 398.364 19. Consequently.0.61 40.065 352.283 266. Also note that sqm_recover_seg is 10 and the derived value is showing the fluctuation induced by averaging across time periods – for example.909 24.184 450.027 7.69 16.334 253.210 6.07 725.199 8.336 553 1.035 84.06 49.462 8.165 44.35 19.809 326.1 0 0.51 38.1 11:37:57 11:38:08 11:38:19 11:38:30 11:38:41 1. you should look to the SQT metrics as the SQM is re-scanning the disks. 112 SegsDeallocated Wrire Wait % SegsAllocated SleepsWriteQ CmdsWritten (SQM) Sample Time BlocksRead CmdsRead (SQMR) SegsActive Sqm_recover_seg (derived) 0 12 7 11 0 CmdSizeAverage WritesTimerPop BlocksFullWrite UpdsRsoqid/sec (derived) SleepsWaitSeg BlocksWritten SegsAllocated CmdsWritten Sample Time BlocksFull% (derived) UpdsRsoqid .273 19.386 365.12 44.344 2. the 11:38:08 sample likely had an update to rs_oqid at 8 (2 from previous sample period + 8 = 10) and then the next four were combined with six of the seven in sample 11:38:19 and so forth.998 28.788 4.932 144.657 11.55 21.00 100.85 166.728 7.00 98.69 95. As you will see in some of the later metrics.3 0 Again.566 376. This is partially true.73 230.025 32.750 325.808 318.611 282.153 36.187 364.491 1.120 57 1 0 0 0 1 104 817 471 2.996 19. CmdsRead (SQMR) – it looks like the SQM is reading a lot more than writing.684 164.683 73.786 40.419 73.309 19.165 7.491 1.017 39.CmdsRead counter is appreciably higher than SQM.887 99.674 587. this has an impact on system performance.04 100. 
we see mostly full blocks with exception of the beginning and end of the test run – which illustrates how WritesTimerPop can be used to indicate a lull in Replication Agent user thread activity.37 20.369 79.00 100.231 43.465 1.844 400.24 2 12 7 33 1 0 0 0 0 0 0 1 1 3 0 0 0.435 522.512 20.34 36.396 42.491 1.958 277.656 317. For instance. The SegsActive is climbing as the DSI is falling behind due to the replicate ASE not being able to receive the commands fast enough (most often this is the biggest source of latency).this metric is derived by dividing the BlocksRead by the SleepWriteQ.5 810. the SQM had to flush the blocks it had to disk – resulting in physical reads most of the time. However. the large transaction began 300+ segments back – and when it had be successfully read out and distributed.336 553 1.this metric is derived by dividing the BlocksReadCached by BlocksRead. we notice that the Cache Read % is in the high 90’ initially (it drops off later due to the fact the DSI is not keeping up – so the Cache Read % is artificially high at the beginning as the DSI SQT cache is filled). Nearly all customers who call into Sybase Tech Support complaining about latency in RS and think RS is the problem due to the “backup being in the inbound queue” forget that as a Warm Standby. it is when the SQM read client sleeps while waiting for the SQM write client to write.947. in this case a SQM read client may have to wait more than once for the current block to be written. Ideally. so these counters are from the SQM read client for the WS-DSI that was reading from the inbound queue. let’s take a look a the same counters from the insert stress test.027 7.798 3. numbers below 300% seem to indicate a latency. Consequently.Final v2. low numbers would be desirable here. However. Ideally. This is actually an interesting metric. However. This last point is interesting.084 575 105 817 471 2. they only have an inbound queue – which also functions as the outbound queue. SegsDeallocated 0 3 4 4 4 Wrire Wait % SegsAllocated SleepsWriteQ CmdsWritten (SQM) Sample Time BlocksRead CmdsRead (SQMR) SegsActive 113 .0.14 22.018 7. but anything in the 90’s acceptable. Even when it appears to “catch up” (around samples 3.509 106.1 Cached Read % . sqm – the amount of active segments indicates latency. the SQM could then drop those segments (a better description of the process is contained in the SQT processing section regarding the Truncate list). In this case we see rather dismal numbers – largely the fault of all the rescanning. the higher above 100% this value.03 112 7. Similar (and if fact the same metrics) to admin who. Likely.816 6. The only caveat is that the insert stress test was a Warm Standby implementation. between the first two sample periods.795 4.90 98.5 3 12 16 45 43 2 12 7 33 1 Comparing this to the above.2 322.514 20. Write Wait % . 4 & 5). While normally 100% is considered “complete”. The reason is simple is that when the SQMR had to re-read. The following diagram depicts the flow of data through the SQT starting with the inbound queue SQM and the Distributor to the outbound queue. SegsActive – this metric shows how much space is being consumed in the stable queue. the cache hit rate is low.742 3.7 954. the latency may not be as large as the actual number of segments active. It is desirable that SleepWriteQ is high – by definition. note that the Write Wait % is very high – which is desirable. BlocksReadCached Cached Read % 11:37:57 11:38:08 11:38:19 11:38:30 11:38:41 1.512 20.1 5. 
we would like this to be in the high percentages with 100% being perfect.93 30. Now. SQT Processing The Stable Queue Transaction (SQT) is responsible for sorting the transactions into commit order and then sending them to the Distributor to determine the distribution rules. the number of active segments drop from 303 to 38. the stronger the indication that the SQM read client is caught up to the SQM writer.04 92.788 4.093 59 104 759 466 631 13 99. This will be evident more when looking at the insert stress test metrics. this queue is a list of transactions for which the commit record has not been processed or seen by the SQT thread yet. 114 . sqt). large amounts of SQT cache will not even be utilized and will result in over-allocating DSI SQT cache if dsi_sqt_max_cache_size is still at the default. the SQT thread was a common cause of problems because the default SQT cache was only 128KB and DBA’s would forget to tune it. Only after all of the transactions on a block have had the commit statements read and been processed by the DIST and placed on the read queue can the SQT request the SQM to delete the block. the transaction is moved to the “Read” queue. the transaction structure record is placed on the Truncate queue. Unfortunately. These lists are: Open – The first linked list that transactions are placed on. this became almost a “silver bullet” that became relied on by DBA’s to simply keep raising the SQT cache any time there was latency – and then complaining when it no longer helped. For this section. In early releases of Replication Server. the SQT thread is responsible for sorting the transactions into commit order. the problem isn’t here – and adding more cache will likely just contribute to the problem at the DSI. it is best to understand how the SQT works. Likely. In any case. this problem is very easy to address by adding cache. Truncate – Along with the Open queue. Key Concept #11: SQT cache is dynamically allocated – for small transactions. Even today’s default (1MB) may not be sufficient.Final v2.and that performance of this ‘pipeline’ of data flowing between the queues depends on the performance of each thread along the path. often referred to (confusingly enough) as “queues”. although this is internal to the DSI as will be discussed later in the outbound section). the transaction is moved from the “Open” list to the closed list and a standard OpenServer callback is issued to the Distributor thread (or DSI. As mentioned earlier.1 Figure 18 – Data Flow Through Inbound Queue and SQT to DIST and Outbound Queue It is good to think of the SQT as just one step in the process between the two queues . Read – Once the DIST or DSI threads have read the transaction from the SQT. if the SQT cache is already above 4-8MB. Closed – Once the commit record has been seen. DBA’s should resist raising it further without first seeing if the cache is being exceeded. when a transaction is first read in to the system. thankfully.0. Today. In order to better understand the performance implications of this (and the output of admin who. SQT Sorting Process The SQT sorts transactions by using 4 linked lists. we will focus strictly on the SQT thread on the left side of the above diagram. Final v2. 
consider the following example of three transactions committed in the following order at the primary database: CT1 D19 I 18 I 17 I 16 D15 U14 I13 I12 I11 BT1 CT2 U27 I 26 I25 I24 I23 I 22 I21 BT2 CT3 D35 D34 I 33 U32 U31 BT3 T17 T16 T15 T14 T13 T12 T11 T10 T09 T08 T07 T06 T05 T04 T03 T02 T01 T00 BT3 / CT3 U31 Begin/Commit Transaction Pair (with tran id) Statement ID DML Operation (Update. In fact. the transactions were committed in the order 2-3-1. the transactions may be stored in the inbound queue in blocks similar to the following (assuming blocks were written due to timing and not due being full): 115 . Due to the commit order. Insert.0. Delete) Transaction ID Figure 19 – Example Transaction Execution Timeline In this example. the transaction log is not that neat.1 To get a better idea how this works. the transactions might as well have been applied similar to: CT1 D19 I18 I17 I16 D15 U14 I13 I12 I11 BT1 CT2 U27 I26 I25 I24 I23 I22 I21 BT2 CT3 D35 D34 I33 U32 U31 BT3 T17 T16 T15 T14 T13 T12 T11 T10 T09 T08 T07 T06 T05 T04 T03 T02 T01 T00 Figure 20 – Example Transactions Ordered by Commit Time However. however. it would probably look more like the following: CT1 D19 I18 I17CT3 D35 D34 CT2 I33 U27 U32 I26 U31 I25 BT3 I24 I16 I23 D15 I22 U14 I21 I13 BT2 I12 I11 BT1 End of Log Beginning of Log Figure 21 – Transaction Log Sequence for Example Transaction Execution After the Rep Agent has read the log into the RS. 3.4 Beginning of Queue Figure 22 – Inbound Queue from Example Transactions with Sample Row Id’s The following diagrams illustrate the transactions being read from the SQM by the SQT. So.3 U3 I2 U3 I2 BT3 2 6 1 5 0.1 2 1 0. Note that immediately after reading the transaction from the SQM. we encounter the second transaction.6 CT1 D19 I18 I1 0.0). we begin a second linked list for its statements as well as continuing to build the first transactions list with statements belonging to it read from the second block.4 D3 CT2 4 0. Continuing on and reading the next block from the SQM yields: Open TX1 TX2 BT1 BT2 I11 I12 I13 U14 I21 I22 Closed Read Truncate TX1 TX2 Figure 24 – SQT Queues After Reading Inbound Queue Block 0.0 Note that the transaction is given a transaction structure record (TX1 in above) and statements read thus far along with the begin transaction record have been linked in a linked list to the Open queue. Continuing on and reading the next block from the SQM yields: 116 .0 D35 I33 U27 I1 I2 D1 I2 U1 I2 I1 BT2 I1 I1 BT1 End of Queue Row 0.1 Having read the second block from the SQM. we add that transaction to the Truncate queue.5 7CT3 0.0 Row 0.3.3.3 Row 0.3. sorted via the Open.1 0. Read and Truncate queues within the SQT.1 Row 0. After reading the first block (0. Additionally.0. Closed.2 Row 0. the transaction id is recorded in the linked list for the Truncate queue.2 I24 6 3 5 2 4 1 3 0.Final v2.3. these four queues will look like the below: Open TX1 BT1 I11 I12 Closed Read Truncate TX1 Figure 23 – SQT Queues After Reading Inbound Queue Block 0. which moves it to the “Closed” queue. Since we now have a commit. we have all three transactions in progress.2 No new transactions were formed.Final v2.3 At this point. This yields an SQT organization similar to: Open TX1 BT1 I 11 I 12 I1 3 4 Closed TX3 BT3 U31 U32 TX2 BT2 I 21 I 22 I 23 I 24 I 25 I 26 U27 CT2 Read Truncate TX1 TX2 TX3 U1 D15 I 16 Figure 27 – SQT Queues After Reading Inbound Queue Block 0. 
the transaction’s linked list of statements is simply moved to the “Closed” queue and the DIST thread notified of the completed transaction.0. This yields an SQT organization similar to: 117 . Continuing with the next block read from the SQM yields the first commit transaction (for TX2).4 Continuing with the next read from the SQM.1 Open TX1 TX2 BT1 BT2 I 11 I 12 I 13 U1 4 5 Closed Read Truncate TX1 TX2 I 21 I 22 I 23 I 24 D1 I 16 Figure 25 – SQT Queues After Reading Inbound Queue Block 0. the DIST is able to read TX2 which causes it to get moved to the “Read” queue and the commit record for TX3 is read. Continuing on yields the following SQT organization: Open TX1 TX2 TX3 BT1 BT2 BT3 I11 I12 I1 3 Closed Read Truncate TX1 TX2 TX3 I21 I22 I2 3 U31 U32 U14 I24 D15 I25 I16 I26 Figure 26 – SQT Queues After Reading Inbound Queue Block 0. so we are simply adding statements to the existing transaction linked lists. yields the following: Open Closed TX1 BT1 I11 I12 I13 U14 D1 I1 5 6 Read TX2 BT2 I 21 I 22 I 23 I 24 I2 I2 5 6 Truncate TX1 TX2 TX3 TX3 BT3 U31 U32 I 33 D 34 D 35 CT3 I17 I18 D19 CT1 U27 CT2 Figure 29 – SQT Queues After Reading Inbound Queue Block 0. all transactions have been closed.6 At this stage. we have to wait until it has been read by the DIST. Since the statements that make up TX2 are scatter among the blocks and statements for transactions for which the commit has not been seen yet. the space must be freed contiguously from the front of the queue (block 0. however. Once that is done. then the deletes could still be done for blocks 0. you might think that we could remove TX2 from the inbound queue. Remember. we would lose TX1. the SQT doesn’t attempt to resort and resend transactions already processed. Consequently. sqt command (example copied from Replication Server Reference Manual).Final v2. we still cannot remove them from the inbound queue. in order to free the space. Consequently on restart after recovery. if we removed them from the inbound queue now and a system failure occurred. If however.0 in this case).1 Open TX1 BT1 I 11 I 12 I 13 U1 4 5 Closed TX3 BT3 U 31 U 32 I 33 D 34 D3 5 Read TX2 BT2 I 21 I 22 I 23 I 24 I 25 I 26 U27 CT2 Truncate TX1 TX2 TX3 D1 I1 I1 I 16 7 8 CT3 Figure 28 – SQT Queues After Reading Inbound Queue Block 0.0 through 0. the deletion of TX2 must wait. In addition.5 At this juncture. 118 . Continuing on with the last block to be read. block 0. if you remember. this is strictly memory sorting (SQT cache). SQT Performance Analysis Now that we see how the SQT works. this should help explain the output of the admin who. However. it simply starts with the first active segment/row and begins sorting from that point.5.6 also contained a begin statement for TX4. consequently. all I/O is done at the block level. How? The answer is that the SQM flags each row in the queue with a status flag that denotes whether it has been processed. Instead.0. all three transactions would be in the “Read” queue and consequently a contiguous block of transactions could be removed since all of the transactions on the blocks have been read. e. spids 10 & 0 above represent DSI threads performing SQT library calls. we are going to concentrate on the SQT thread aspect – however. Number of transactions removed from cache. Differences will be discussed in the section on the DSI later.1 admin who. Trunc is the sum of the Closed. you may wish to monitor the "removed" counter to detect if transactions are getting removed due to cache being full. Since this is a transient counter. 
remember that it applies to the DSI SQT module as well.TOKYO_RSSD 103:1 DIST LDS. but the ones for outbound queues (designated qid:0) are missing. Transactions are removed if the cache becomes full or the transaction is a large transaction (discussion later) Denotes if the SQT cache is currently full.TOKYO_RSSD 106 SYDNEY_DSpubs2sb Trunc ----0 0 0 0 First Trans ----------0 0 0 0 Parsed ------0 0 0 0 SQM Blocked ----------1 1 0 0 Change Oqids -----------0 0 0 0 Detect Orphans -------------0 0 1 1 The observant will say that not all the SQT threads are listed as the ones for the inbound queues (designated with qid:1) are present. Consequently. Read Open Trunc Removed Full 119 .0.Final v2. sqt Spid State -------17 Awaiting 98 Awaiting 10 Awaiting 0 Awaiting Closed -----0 0 0 0 Removed ------0 0 0 0 SQM Reader ---------0 0 0 0 Read ---0 0 0 0 Full ---0 0 0 0 Wakeup Wakeup Wakeup Wakeup Open ---0 0 0 0 Info ---101:1 TOKYO_DS. then the next thread (DIST or DSI-Exec) is the bottleneck as the SQT is simply waiting for the reader to read the transactions. Well. and Open columns (due to reasons discussed above). For this section. Read. Number of transactions in the “Open” queue for which commit has not been seen by SQT yet (although SQM may have written it to disk already) Number of transactions in the “Truncate” queue – essentially an ordered list of transactions to delete once processed in disk contiguous order. A high number in this block may point to a long transaction that is still “Open” at the very front of the queue (i. This essentially explains the number of transactions process not yet deleted from the queue. the reality is that there is not a SQT thread for outbound queues. Instead. user went to lunch) as deleting queue space is fairly quick.pubs2 101 TOKYO_DS. Number of transactions in the “Read” queue. the DSI (Scheduler) calls SQT routines. The output for the columns are described in the below table: Column Spid State Info Closed Meaning Process Id for each SQT thread State of the processing for each SQT thread Queue being processed Number of transactions in the “Closed” queue waiting to be read by DIST or DSI. If a large number of transactions are “Closed”. sqt is one of the key commands to determining problems on the inbound queue performance.e. As we will see later. when a large transaction is encountered.5. Indicates that it is doing orphan detection. R (read).23 – which basically tells you that at this stage. but the space has not been truncated from the queue yet by the SQM. Indicates that the origin queue ID has changed. this field can give you an idea of the transaction volume over time. This column contains information about the first transaction in the queue and can be used to determine if it is an unterminated transaction. the first transaction in the queue is still “Open” (no commit read by SQT) and so far it has 3.1 Column SQM Blocked First Trans Meaning 1 if the SQT is waiting on SQM to read a message. Closed. if someone does not close their transaction when the system crashes. This state should be transitory unless there are no closed transactions. The index of the SQM reader handle.5. it is extremely useful for determining when you have encountered a large transaction – or. Along with statistics. This is especially true if the sqt_init_read_delay and sqt_max_read_delay to are not set to 1000 milliseconds (1 second). If multiple readers of an SQM. 
Consider the following tips for this column (First Trans):

ST    Cmds         Qid          Possible Cause
O     increasing   same         Large transaction
O     same         same         Rep Agent down or uncommitted transaction at primary
O     changes      increasing   SQT processing normally
C     changes      slow         SQT reader not keeping up (DIST or DSI)
C     same         same         DIST down, outbound queue full
R     same         same         Transaction on same block/queue still active

It is important to recognize that this is the first transaction in the queue, which, especially for the outbound queue, could have been delivered already. The inbound queue is even more confusing; the transaction may have already been processed, but the space has not been truncated from the queue yet by the SQM.

Common Performance Problems

The most common problems with the SQT are associated with 1) large transactions, and 2) slow SQT readers (i.e. DIST or DSI). The first deals with the classic 10,000 row delete: if the SQT attempted to cache all of the statements for such a delete in its memory, it would quickly be exhausted. The second problem is common as well.

SQT Performance Tuning

There are two main ways of improving SQT performance. The first is rather obvious: increase the amount of memory that the SQT has by changing the value for sqt_max_cache_size.
Note that this is a maximum – RS dynamically allocates this as needed by the connection and then deallocates when no longer needed. In the past.000 (24 hrs).0+ dsi_sqt_max_cache_size (Default: 0. SQT doubles its sleep time up to the value set for sqt_max_read_delay. Recommendation: 4MB) 12. discards the statements (keeping the transaction structure – similar to a large transaction) and then keeps processing. By default. With each expiration. it effectively pauses the scanning where the SQT had gotten to until that transaction is fully read back off disk and sent to the DIST. the SQT has a decision to make. Recommendation: 1MB) 11.5+ 12. the Closed queue continues to grow until all available DSI SQT cache is used. the SQT simply halts reading the SQM until the transaction is complete and queue can be truncated. The maximum length of time an SQT thread sleeps while waiting for an SQM read before checking to see if it has been given new instructions in its command queue. This new parameter was added in RS 12. However.6 ESD #7 as well as RS 15.0. The default of 0 is clearly inappropriate if you start adjusting sqt_max_cache_size. It also impacts Replication Agent performance as this likely will involve a large number of read requests to refetch all of the same blocks – adding to the workload of the SQM that is busy trying to write. this is frequently oversized and customers often don’t understand why continuing to increase it has no effect. See discussion below. a large number of transactions in the SQT cache may have to be rescanned – further slowing down the overall process. if the command queue is empty. Most medium production systems need 2MB SQT caches with high volume OLTP systems using any where from 121 .1 the SQT simply discards all of its statements and merely keeps the transaction structure in the appropriate list. The reason is that over sizing this can drive the DSI to be filling cache more than issue SQL due to the default value of dsi_sqt_max_cache_size.x Meaning Memory allocated per connection for SQT cache. for a total of 2 source and 5 destination databases we would have 14 (2 source inbound/outbound and 5 destination inbound/outbound) 1MB memory segments for SQT cache.Final v2. 1MB is typically too little. Recommendation: 50) 12. However. Consequently. If other than zero.x sqt_init_read_delay (Default: 2000. individual inbound queue SQT cache sizes can be tuned similar to DSI SQT cache sizes. etc.0 ESD #1. or DSI threads cannot keep up. Values above 4MB need to be considered very cautiously and only when transactions are being removed and cache has been exceed. So.6+ 15. Min: 1000. Max: 86. Recommendation: 10) sqt_max_read_delay (Default: 10000. the SQT has 1MB for each inbound and outbound queue. the SQT must go back to the beginning of the transaction and physically rescan the disk. If there are transactions in the Closed or Read queue. the more this impacts overall Replication Server memory settings.Final v2.6. Starting with 12.0. to avoid problems caused by rescanning. However. The rationale for the above statement was that in implementing the SMP logic. the DSI may need SQT cache to keep up with the multiple DSI’s parsing requirements. If you think about it. With a 4MB sqt_max_cache_size setting. providing cached transactions to clients such as the DIST and DSI threads. if using Parallel DSI. the SQT cache may not be fully needed for the DSI for transaction sorting – and it can be adjusted down on a connection-by-connection basis via the dsi_sqt_max_cache_size. 
notice that for most OLTP systems. Arguably."SQTCacheLowBnd".6 as SQT cache sizing was frequently oversized on many systems. we are referring to DML based transactions (unfortunately sp_sysmon reports all). As a result. whenever SQT cache contains no closed or read transactions (that is. SQT will remove the statements of undistributed transactions from cache in order to make room for more transactions until it is able to cache one that can be distributed or until some distributed transactions can be deleted. the more connections you have. However. if each source database is replicating to individual destination systems (1 to 2 and the other to 3). In these cases. The rationale is that normal OLTP transactions will cycle through the SQT cache so quickly that the SQT cache will likely not use very much memory. Typically. the outbound queue will contain “sorted” transactions provided that no other DIST thread is replicating into the destination. below which transactions would have been removed. prompting the following statement: Prior to RepServer 12. The later (Parallel DSI) is best dealt with by adjusting the dsi_sqt_max_cache_size separately from sqt_max_cache_size. the logic for the SQT processing was altered to favor filling the cache vs. if this counter does not report more than 1 removed transaction in any 5-minute period. The tendency to oversize SQT cache has lent to some concern from within Sybase Replication Server engineering. However. but not much larger than that. This counter captures the minimum SQT cache size at any given moment. Earlier we had the following table: Configuration sqt_max_cache_size dsi_sqt_max_cache_size memory_limit Normal 1-2MB 512KB 32MB Mid Range 1-2MB 512KB 64MB OLTP 2-4MB 1MB 128MB High OLTP 8-16MB 2MB 256MB In which these were defined by: Normal – thousands to tens of thousands of transactions per day Mid Range – tens to hundreds of thousands of transactions per day OLTP – hundreds of thousands to millions of transactions per day High OLTP – millions to tens of millions of transactions per day By transactions. Due to SQT behavior modifications associated with the SMP feature. Monitor this value frequently during a period of typical transaction flow. To help determine the proper setting of sqt_max_cache_size and dsi_sqt_max_cache_size. the earlier example of 2 source/5 destinations would require 52MB strictly for SQT cache – providing that all SQT caches are completely full. only a 2-4MB sqt_max_cache_size is truly all that is necessary. this was true even prior to RS 12.1 4MB+ of cache. Higher than this is really only necessary in very high volume systems that have periodic/regular large transactions. and configure SQT cache to be no more than about 20% greater than the largest value observed. transaction removal rate may be considered acceptable. Transactions are removed from SQT cache forcing them to be re-read from the queue when needed. Obviously. no transactions to be distributed or to be deleted after having been distributed) and cache is full. sizing the SQT cache to contain the periodic large transactions will allow the SQT to avoid the hit. refer to counter 24019 . As a result. typical tuning advice was to increase sqt_max_cache_size so that there are plenty of closed transactions ready to be distributed or sent to the replicate database when RepServer resources handling those efforts became available. You can monitor the removed transaction count by watching counter 24009 . 122 . 
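For reference, both caches are adjusted with standard RCL. The sketch below reuses the SYDNEY_DS.pubs2 connection from the examples elsewhere in this paper; the byte values are illustrative only and should be derived from the counters discussed above, and depending on the Replication Server version the connection may need to be suspended/resumed (or the server recycled) before the new values take effect.

-- server-wide inbound SQT cache (value in bytes)
configure replication server
set sqt_max_cache_size to '4194304'
go
-- per-connection DSI override so the DSI does not simply inherit the larger server-wide value
suspend connection to SYDNEY_DS.pubs2
go
alter connection to SYDNEY_DS.pubs2
set dsi_sqt_max_cache_size to '1048576'
go
resume connection to SYDNEY_DS.pubs2
go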
latency sometimes was introduced simply by the SQT thread waiting to fill huge caches allocated by the DBA vs."TransRemoved". Even 2-4MB SQT cache may be a bit excessive. the best advice for correctly sizing SQT (for either the sqt_max_cache_size or the dsi_sqt_max_cache_size configuration) is to set it large enough so that transactions removed from SQT cache never occur or only infrequently.6 that advice no longer applies. Counter CacheExceeded (a useless counter) CacheMemUsed Explanation Total number of times that the sqt_max_cache_size configuration parameter has been exceeded. Maximum memory consumed by one transaction. It became crucial. then it is a possible indication that SQT cache is undersized – particularly from the inbound processing side. Total transactions removed from the Read queue. Total commands in the last transaction completely scanned by an SQT thread. XREC_CHECKPT. SQT thread memory use.0 to the following set: 123 . if no transactions are active in SQT. sqt command. If the “Removed” column is growing and the transactions are not large. Total commands read from SQM. Total empty transactions removed from queues. then it is probable that the cache was filled to capacity several times and multiple transactions normally not considered large were removed to make room. Average number of commands in a transaction scanned by an SQT thread. Total transactions whose constituent messages have been removed from memory. For this reason. Maximum number of commands in a transaction scanned by an SQT thread. SQT Counters SQT Thread Monitor Counters The following counters are available in RS 12. Total transactions added to the Read queue. oversizing it. However.1 passing the transactions on. As taught/mentioned in the manuals. Total transactions removed from the Closed queue.6 to monitor the SQT thread.0. However. One way to detect either of these two situations is to watch the system during periods of peak activity via the admin who.Final v2. XREC_COMMIT. the absolute best way (and most accurate) to determine cache sizing is to use the monitor counters. the best indication from the admin who. Total transactions removed from the Open queue. Total memory consumed by the last completely scanned transaction by an SQT thread. SQT cache usage is zero. then. Total transactions removed from the Truncation queue. Total transactions added to the Closed queue. if the “Full” column is set to a 1.0 . ClosedTransRmTotal ClosedTransTotal CmdsAveTran CmdsLastTran CmdsMaxTran CmdsTotal EmptyTransRmTotal MemUsedAveTran MemUsedLastTran MemUsedMaxTran OpenTransRmTotal OpenTransTotal ReadTransRmTotal ReadTransTotal TransRemoved TruncTransRmTotal TruncTransTotal These changed in RS 15.6 and RS 15. Each command structure allocated by an SQT thread is freed when its transaction context is removed. Total transactions added to the Open queue.to “right-size’ the SQT cache vs. Removal of transactions is most commonly caused by a single transaction exceeding the available cache. Total transactions added to the Truncation queue.sqt command is the “Removed” column. in RS 12. Commands include XREC_BEGIN. Average memory consumed by one transaction. Removal of transactions is most commonly caused by a single transaction exceeding the available cache. Current open transaction count. Current read transaction count.Final v2. MemUsedAveTran CachedTrans = CacheMemUsed/MemUsedAveTran 124 . Transactions removed from the Closed queue. 
The time taken by an SQT thread (or the thread running the SQT library functions) to delete messages from SQT cache. The time taken by an SQT thread (or the thread running the SQT library functions) to read messages from SQM. The smallest size to which SQT cache could be configured before transactions start being removed from cache.1 Counter CmdsRead OpenTransAdd CmdsTran CacheMemUsed Explanation Commands read from SQM. Commands include XREC_BEGIN. ReadTransTotal CmdsAveTran. Transactions added to the Read queue. the average. Transactions added to the Truncation queue. and SQTDelCacheTime) could be interesting if there is a latency within the SQT.0 with individual columns in rs_statdetail. MemUsedTran TransRemoved TruncTransAdd ClosedTransAdd ReadTransAdd OpenTransRm TruncTransRm ClosedTransRm ReadTransRm EmptyTransRm SQTCacheLowBnd SQTWakeupRead SQTReadSQMTime SQTAddCacheTime SQTDelCacheTime SQTOpenTrans SQTClosedTrans SQTReadTrans As mentioned earlier. Transactions removed from the Truncation queue. if no transactions are active in SQT. XREC_CHECKPT. SQT cache usage is zero. An SQT client awakens the SQT thread who is waiting for a queue read to complete. ClosedTransTotal. For this reason. Transactions removed from the Read queue. Transactions removed from the Open queue. However. CmdsMaxTran CacheMemUsed. Commands in transactions completely scanned by an SQT thread. Transactions added to the Closed queue. Transactions whose constituent messages have been removed from memory. XREC_COMMIT.0. SQT thread memory use. the three new time tracking counters above (SQTReadSQMTime. SQTAddCacheTime. Each command structure allocated by an SQT thread is freed when its transaction context is removed. Transactions added to the Open queue. Current closed transaction count. total and max counters are replaced in RS 15. Memory consumed by completely scanned transactions by an SQT thread. The most important counters SQT counters are: CmdsPerSec = CmdsTotal/seconds OpenTransTotal. The time taken by an SQT thread (or the thread running the SQT library functions) to add messages to SQT cache. Empty transactions removed from queues. as transactions are read and the truncated from the SQT cache. If TransRemoved is 0.e. you will find that sites have left dsi_sqt_max_cache_size at 0 which means it inherits the value for sqt_max_cache_size – which is likely oversized and now the DSI/SQT module is spending time stuffing the cache vs.000 – you likely don’t have enough memory to cache it anyhow. Closed. the TransRemoved counter is likely the most critical (and why it is high-lighted). sqt_max_cache_size is a server setting that applies to the all connections – so before decreasing. MemUsedAveTran gives us a good starting point to use as a multiple to increase by (i. 60Hz.5. If CmdsMaxTran is high. A third alternative is that the SQT cache is too big and since the SQT prioritizes reading over servicing the DIST (and freeing space from the SQM dead last). However. we don’t need a lot of detail here as the Open. let’s skip to looking at the customer data: 125 . we will need to increase dsi_sqt_max_cache_size). For example. Read and Trunc prefixes make the counters fairly intuitive. the real goal is to see that all three are nearly identical. Instead.Final v2. From this. it is either because everything is being done in isolation level 3 or chained mode. Most often. Notice that we focused on TransRemoved. 
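To make the CachedTrans arithmetic concrete (the figures are purely illustrative): if CacheMemUsed peaks at 2,097,152 bytes (a 2MB cache fully used) and MemUsedAveTran averages 16,384 bytes, then CachedTrans = 2,097,152 / 16,384, or roughly 128 transactions. A connection running 5 parallel DSI's with dsi_max_xacts_in_group at 20 needs at least 100 cached transactions for full grouping, so this cache would only just suffice; adding room for another 100 transactions means adding approximately 100 x 16,384 bytes, or about 1.6MB, to dsi_sqt_max_cache_size.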
The latter can be especially unplanned with java applications as the default behavior is to execute all procedures in chained mode. if TransRemoved is occasionally > 0 but the CmdsMaxTrans is 1. CacheExceeded is kind of like the admin who. we likely will only care about the CacheMemUsed – monitoring to see how much memory we actually are using and if we need to increase this (if TransRemoved > 0). either the DIST is not able to keep up (due to bad STS cache settings or slow outbound queue) or a large number of large transactions were committed and in order to pass them to the DIST (which is when it moves from Closed to Read). However. Closed. the empty transaction is flushed to the transaction log (think of the performance implications there – and the log semaphore) and then replicated. this value rapidly change as the new space available is filled quickly by the SQT. the whole transaction has to be rescanned from disk. Of all of these.0.2 was system commands – such as reorgs – which use a plethora of small empty transactions to track progress. The key here is that just about any non-zero value occurring often is a problem – so thinking that just because it is low (like a steady value < 10) means it is not a problem is just plain wrong. However. increasing sqt_init_read_delay slightly may help (as the SQT will be forced to find something else to do). This is so useless a metric. If ClosedTransTotal starts to lag behind OpenTransTotal. The way to find out the cause is to look at the next set of counters. SQT Thread Counter Usage After the fairly lengthy discussion of how the SQT works. we will need enough cache for at least 100 transactions – and likely double that number (so if CachedTrans is <100.3+ or hunt your developers down to see if they are running everything in chained mode or isolation level 3 for some reason. the most likely culprit is a series of large transactions. ReadTransTotal) all refer to the Open. giving the DSIEXEC’s transactions to work on. ClosedTransTotal. this is not as common as when ReadTransTotal is lagging Closed. The second set (OpenTransTotal. These report the average number of commands per transaction as well as the max. CmdsMaxTran) is also very interesting – especially when combined with the next one ‘CachedTrans’. Read and Truncate transaction lists used by the SQT for sorting. However. these counters are the most useful for DSI tuning. If using admin who. However. you may want to check all your connections and do not decrease if any show TransRemoved > 0 that are not attributable to the once nightly batch job or other obscenely large transaction. This can be really useful to spot those bcp’s that someone is not using –b on as well as to get a picture of the transaction profile from the origin from a sizing perspective (as will be useful for DSI tuning). you may want to add more SQT cache by increasing sqt_max_cache_size. In the latter case. Another common occurrence of this prior to ASE 12.1 TransRemoved (vs. it is time to either upgrade to ASE 12. sqt cache full column – it merely is an indication that the cache was full at some point (which the SQT is busy trying to do). since a commit was registered. than it is likely a transaction was removed from cache and that may be the cause of ReadTransTotal lagging (more on this later). Additionally. adding SQT cache is a useless exercise and may actually contribute to the problem. we can not group transactions if they are not in cache – so if we are using 5 parallel DSI’s and have dsi_max_xacts_in_group at 20. 
If this happens. The last counter (EmptyTransRmTotal) is good as a bad-application design counter. Even if no rows were modified in the proc (selects only). Additionally. if you frequently see TransRemoved >0. CacheExceeded) EmptyTransRmTotal Again.5.000. The third set of counters (CmdsAveTran.0. From the inbound queue’s perspective. If you see a lot of empty transactions. If we need to increase it. to add cache for another 100 transactions – simply multiply MemUsedAveTran by 100). So if the RA and/or SQMR is lagging and you have a high number of EmptyTransRmTotal. that this counter was removed in RS 15. the full column likely blinked between 0 & 1 so fast it is like a light-bulb in your house – blinks so fast you think it is constantly on vs. sqt. the first one (CmdsPerSec)is establishing a rate – hopefully it should compare to the rate from the RA thread. too much SQT cache could be a problem as well. we can see how much SQT cache was actually used by this SQT and the average memory per transaction. Final v2.313 3.790 65.880 2. reading the commands almost as soon as they arrive (when the SQM writes them).240 324. While it might be tempting to use CmdsAveTran.871 374.382 13.776 1. In this case.0. that does not mean that it is tuned properly.448 3.187 364.892 10.213 432.498. At this point. CmdsMaxTran – This is a very interesting statistic as it indicates the largest transaction processed during that sample period.597 73.462 3.334 253.441 733 4.097.430.528 280.767 65.465 84. and it is passed to the DIST thread quick enough that no discernable lag is evident. This isn’t obvious in the above statistics as the stats were from RS 12. it is not untypical for the ReadTransTotal to lag behind ClosedTransTotal until sqt_max_cache_size is reached.213 76.674 268.091.443 76.014 38.1 vs.750.608 1.4MB and then drops back down to 600K before growing successively until the max is reached.1 ClosedTransTotal OpenTransTotal ReadTransTotal SQT CmdsTotal CacheMemUsed CacheExceeded CmdsMaxTran 0:29:33 0:34:34 0:39:37 0:44:38 0:49:40 0:54:43 0:59:45 1:04:47 1:09:50 1:14:52 1:19:54 268.806 73.379. 126 TransRemoved 6 2 7 3 7 3 10 5 15 17 0 CmdsWritten (SQM) Sample Time .994 327. SQM CmdsWritten – This represents the lag that the SQT in reading from the inbound queue as commands occur. SQT CmdsTotal vs. Also. and as a result. the commit is found.600 1.040 1.442 76. but then caught back up quickly.818 83.535 347. This grows to 1.661 65.184 450. Note especially the extremely large transaction at the beginning.933 81. we have 2MB configured – but at the beginning we are only using about 300K. Not only does it help in sizing sqt_max_cache_size by showing the high-water mark during each sample interval.471 84.502 336.967.222 21.029 24.104 1.442 1.159 27.014 38.943 81.111 215.597 73.196 9.152 bytes (2MB) and dsi_sqt_max_cache_size at 0.462 9.750 325.016 632.246 165. ReadTransTotal – It should be fairly obvious what these refer to – the “Open”.809 326.766 65. 12.469 84.031 29.6 when the change in SQT processing was influenced by the SMP implementation.941 65.678 83. Any latency in the system is not due to the SQT.684 164. however.566 376.644 5.103.442 1. The reason why is that the SQT can’t put any more transactions into the cache until it removes one – as a result.193 59.035 38.320 857.283 266. However. The goal is that these should be fairly identical during the sample period – meaning that transactions are added to the SQT cache.528 Now then. 
the processing (once the cache is full) is that a new transaction can’t be read from the inbound queue until one is read by the DIST.525 21. this represents a ‘steady state” of the server. the ReadTransTotal will start mimicking the ClosedTransTotal. the problem is that a lot of small transactions could skew when a large transaction hit.944 81.038 318. CacheMemUsed – This is a very interesting counter. let’s take a look at these metrics.787 65.705 253. The problem is that the SQT gives priority to filling the cache over servicing the DIST.586 266.723 72. monitoring had been ongoing for >10 hours when this slice of the sampling was taken – and the system was busy the entire time.257 59.031 29.832 2.438 10.340 317. The most useful aspect to this metric is used in conjunction with TransRemoved to determine if raising the sqt_max_cache_size would be of benefit.528 253.723 93 21.817 83. ClosedTransTotal. OpenTransTotal.705 15. With this in mind. it got behind in the 1:00am time frame when the cache filled.192 59. that often the best starting point is to compare the “Cmds” in each counter module through the RS. the SQT is keeping up. it also shows the dynamic allocation and deallocation of memory within each SQT cache.131 29. We said earlier.944. In this case.196 215.840 1. “Closed” and “Read” transaction lists. the fairly consistently large transactions throughout and the small transaction at the end.525 21.297 10. this customer had sqt_max_cache_size set at 2. As a result. 000 command transactions assuming 1. increasing it to 32MB is likely not to have any benefit over 16MB.000 row range (suggesting a 4-10MB cache).0.000 rows. you don’t tune sqt_max_cache_size to fully cache extraordinarily large transactions that occur periodically. we allocate 16MB of cache for the DSI thread – which really doesn’t need it. However. this may not be true. However.777. the SRE performs the following functions: • • Unpacks each operation in the transaction. this system would benefit from increase sqt_max_cache_size to 16MB (16. It has been shown that oversizing the SQT cache can lead to performance degradation as a result. a 200. In order to determine the routing of the messages. Additionally. As a result. The reason for this is that the DIST thread is the “brains” behind the Rep Server – determining where the replicated data needs to go. Distributor (DIST) Processing Earlier we showed the inbound process flow from the inbound queue to the outbound queue using the following diagram: Figure 30 – Data Flow Through Inbound Queue and SQT to DIST and Outbound Queue This time. Of all the processes in the Replication Server. While an 8MB SQT cache may be usable. you would need 200MB of SQT cache to contain it. the DIST thread will call three library routines . yielding its time to the DSI EXEC threads to process the SQL statements. the DIST thread is probably the most CPU intensive.000 row transaction averaging 1K command size (SQM counter CmdSizeAverage). Subscription Resolution Engine (SRE) The Subscription Resolution Engine (SRE) is responsible for determining whether there any subscribers to each operation. However. if we do raise this. This is impractical as the next large transaction (likely a bcp as it was in this case) may have 500. clearly indicating the SQT cache is undersized. Consequently. we note that nearly every sample interval has transactions removed. TD and MD as depicted above.Final v2. we will be focusing on the Distributor (DIST) thread. Overall. 
the DSI Scheduler will spend it’s time filling the DSI SQT cache vs. if transactions were only removed during the first several sample intervals. These library routines are discussed below. Without doing this.the SRE. This value is actually high but is based on providing padding over the largest transaction that is expected that we really want to cache (the 9.216). Looking at the above. Consequently. the cache is completely full twice around 1:00am when the number of transactions peak at ~80. If you think about it.000 transactions.500 byte command size). Checks for subscriptions to each operation 127 . in the above case we see that we have a fairly constant transaction sizes in the 3. we should make sure that dsi_sqt_max_cache_size is explicitly set to 1-2MB.000-9.1 TransRemoved – this is one of the more important counters. 1 • • Checks for subscriptions to publications containing articles based on the repdef for each operation Performs subscription migration where necessary. However. • It should also be pointed out that the SRE does not check to see if a site subscribes more than once. a given replication definition could specify that last name. this trick is borrowed from the SQL optimization trick of switching column!=null to column>char(0) – the ANSI C equivalent for NUL. <>) comparators. they are reluctant to add more. This causes a problem described later in this section. the SRE simply has to do a row-by-row comparison for each row in the transaction. each searchable column can only participate in a single conditional (a range condition constructed by two where clauses is considered a single conditional). CA or have a last name of ‘Smith’ care needs to be taken to avoid a duplicate row situation. Boolean OR conditionals are easily accomplished via simply creating two subscriptions – one for each side of the OR clause. A point to consider is that the begin/commit pairs in the transaction were effectively removed by the SQT thread and the transaction information (transaction name. commit time. subscribing where site_id & 64 ) extends this near infinitely. It should be noted that the next discussion – while focusing on rs_address columns – has a secondary purpose in illustrating how subscription rules can impact implementation choices. Of course. as their business grows. For example (col_name != “New York”) becomes (col_name < “New York” OR col_name > “New York”) which is handled simply by using two subscriptions. city. the SRE simply has to check for subscriptions on the individual operations.Final v2. The bit-wise AND operation for the subscription behaves as col_name & value > 0 vs. While one rs_address column is easily managed. Not equals (!=. Many Replication System Administrators complain that the rs_address column isn’t as useful as it could be for several reasons: • • • It only supports 32 bits – restricting them to 32 sites in the organization. NOT.0. This is important as the TD module will make use of this information. however. they have to add more rs_address columns causing considerable logic to be programmed in to the application or database triggers to support replication. user. A special type of equality is permitted using rs_address columns is bit-wise comparisons with the logical AND (&) function. An alternative solution (but one that doesn’t work as we will see why) might be to think of the bits in the rs_address columns as components similar to class B & class C Internet addresses. this is easily bypassed by treating the situation like a noninclusive range. 
For the most part. but for now. As a result. using the rs_address column as an integer and subscribing with a normal equality (for example. Subscription Conditions To maintain performance. A valid complaint if you think of the bits one dimensionally with sites. if the data modification is projected for multiple sites. the SRE is a very lean/efficient set of library calls that only supports the following types of conditionals: • • • • • • Equality – for example col_name = constant. formulas or operators (other than & with rs_address columns) are not supported Boolean OR. and state are subscribable columns. The reason the SRE looks at the individual operations is that not all tables may be subscribed to by all the sites – consequently a transaction that affects multiple tables would still need to have the respective operations forwarded accordingly. A good example of how this impacts replication can be seen in the treatment of rs_address columns. For “not null” comparisons. For example. Columns contained in the primary key can not have rs_address datatypes. XOR conditionals. If a destination wants to subscribe to all authors in Dublin. col_name & value = value. Warm Standby feature. Incidentally. If the only column changed. then it is not replicated – problematic for standby databases using repdef & subscriptions vs. Range (unbounded and bounded) – for example col_name < constant or col_name > low_value and col_name < high_value Boolean AND conditionals Note that several (sometimes disturbing to those new to Replication Server) forms of conditionals are not supported: Functions.) are all part of the transaction control block in the SQT cache. a single subscription based on col_name > ‘’ (note the empty string and use of single quotation marks) is sufficient. Simply creating two subscriptions: one specifying last_name=’Smith’ and the other specifying city=’Dublin’ and state=’CA’ will result in an overlapping subscription – and cause the destination to receive duplicate rows. The biggest restriction is that for any subscription. High order bytes could be associated with 128 . this could require multiple updates to the same rows and subscription migration issues. etc. subscribing where site_id = 123 vs. Additionally. and 0).Good subscription create subscription titles_sub for titles_rep with replicate at SYDNEY_DS.0.my_rsaddr_col is an rs_address column create subscription titles_sub for titles_rep with replicate at SYDNEY_DS. This is simply due to the fact that the original intent of the rs_address column was a single dimension of bits. then the number of sites addressable with each scheme extends another order of magnitude. State 0: Server 'SYBASE_RS': Duplicate column named 'my_rsaddr_col' was detected. etc. Level 12. it logically fits with distribution rules the application may be trying to implement and therefore mentally easier to implement. rs_address columns may only appear once in the where clause of a subscription. the way rs_address columns are treated. the reason should be obvious – a separate condition would be necessary for each column. Otherwise. However. the same subscription migration issue would occur that plagues a single integer based scheme – updates setting the value to first one value and then another in an attempt to send to more than one location migrates the data from one location to the other instead of sending it to both. although other columns can appear more than once in a where clause. in the above. This includes rs_address values of 3. Consequently. 
It results in: Msg 32027. However. Since “71” is 64+4+2+1 (bits 6. using multiple rs_address columns or “dimensioning” the rs_address column will result in more conditionals for the SRE to process. Combining several bits as in (column & 71) could have some unexpected results. we can’t do that!!! Unlike other columns (in a sense). the union of the conditions must produce a single valid range (single pair of low & high values). For multiple columns. when a condition such as (column & 64) returns a non-zero number. Consider the following examples of bit-wise addressing: Bit Addressing 4 – 28 8 – 24 16 – 16 4 – 4 – 24 4 – 8 – 16 4 – 4 – 4 – 20 4 – 4 – 8 – 16 4–4–8–8–8 Total Sites 112 192 256 384 512 1280 2048 8192 Comments Could be 4 World Regions – each with 28 locations World Region – Location Country/Region – Location World Region – Country – Location Hemisphere – Country/Region – Location Hemisphere – Country – Region – Location Hemisphere – Country – Region – Location Hemisphere – Country – Region – District – Office While this does expand the number of conditions that must be checked.pubs2 where int_col > 32 and int_col < 64 -. 2. However. If the last bit address represented a region or “cell”. you might think that this would achieve the goal. we treated each as separate individual locations.Final v2. the row is replicated. any column which has bits 6. a single column can only participate in a single rule (rs_rules table has a unique index on subscription and column number).pubs2 where my_rsaddr_col & 64 and my_rsaddr_col & 4 and my_rsaddr_col & 2 and my_rsaddr_col & 1 BUT.2. The reason is that for any single subscription. 129. Since we are allowed to AND conditions together. As mentioned earlier. you might think the way to ensure that exactly the desired value is met is to use multiple conditions as in: -. Consequently. the same is true for rs_address columns that have been dimensioned – a separate condition would be necessary for each “dimension” at a minimum.Good subscription (effectively !=32) 129 . 1 or 0 on would get replicated to that site – effectively a bitwise “OR”.1. it should be noted that this scheme (if it worked) would only work in cases where data is intended solely to be distributed to a single Region or District (next to last division) or a single location.1 countries or regions while the low order bits with specific sites within those regions. For example: -. The 130 . and ‘AND’ clause behaves the same as a normal subscription. Consider the following: create publication rollup_pub with primary at HQ.legal article definition create article titles_art for rollup_pub with primary at HQ.pubs2 where int_col = 30 and int_col = 31 Among other things. While a theoretical 1. if a WS system is involved. Introduced in RS 11. the most common method for updating rs_address columns to set the desired bits is in a trigger. a single replication would require 2 updates to the same row – the first being the regular update and the second setting the appropriate bits for distribution.Bad range subscription – should be an OR (two subscriptions) create subscription titles_sub for titles_rep with replicate at SYDNEY_DS. subscriptions and where clauses.db go -. the original row modification plus the modification in which the bits are set are both processed by replication server. SRE Performance Performance of the SRE depends on a number of issues that should be fairly obvious: • • • Number of replication definitions per table. 
Even if attempting to use the second rs_address column as the Region/District dimension as depicted above in the 2 dimensional break-out.illegal article definition create article titles_art for rollup_pub with primary at HQ. However.0. Use articles/publications overlaying replication definitions/subscriptions.db with replication definition titles_rep where my_rsaddr_col & 64 and my_rsaddr_col & 8 go -.pubs2 where int_col < 32 and int_col > 63 -. of course. Additionally. remember.Bad range subscription – should be an OR (two subscriptions) create subscription titles_sub for titles_rep with replicate at SYDNEY_DS.024 sites could be addressed if each dimension supported an even 32 locations in each. This leads to n+1 DML operations at the primary for every intended operation – not a good choice then. the SRE makes use of the System Table Services (STS) cache. you could incur problems. the references to the same column must use an OR clause as within the RSSD.Final v2. Additional destinations would require additional updates. Number of subscriptions per replication definition Number of conditions per subscription In order to reduce the number of physical RSSD lookups to retrieve replication definitions. if performance is of consideration. Configurable through the replication server configuration sts_cachesize. while an OR clause constructs multiple where clauses conditions in the RSSD. Additionally. only a single Region/District or location could be the intended target.5.db with replication definition titles_rep where my_rsaddr_col & 64 or where my_rsaddr_col & 8 go It seems frustrating that there seems to be no way to bypass the 32 site limit with a single rs_address column. Consequently.1 create subscription titles_sub for titles_rep with replicate at SYDNEY_DS. you can see that this condition restricts Replication Server from supporting Boolean “OR” conditionals and forces designers to implement multiple rs_address columns. As a result. the STS caches rows from each RSSD table/key combination in a separate hash table.pubs2 where int_col < 32 and int_col > 32 -. it ignores updates in which the only changes were to rs_address columns – consequently – after a failover – you may not have an accurate reflection of the last site updates were distributed to in the processing. articles allow Boolean OR’s as well as referring to the same column multiple times in the same where clause. if you think about it for a second. There is a work-around for the ‘OR’ problem. ”54783L”.discount. the TD is responsible for “packaging” these modifications into a transaction and requesting the writes to the outbound queue. Transaction Delivery The Transaction Delivery (TD) library is used to determine how the transactions will be delivered to the destinations.”732189H”.00) insert into order_items (order_num.$444.”NY”.1.total) values (123456789.total) values (123456789. consider the following transaction: begin transaction web_order insert into orders (customer. ship_zip) values (1122334.$250.60. For most systems. “Chamois Shirt”.$25.item_num. ship_state. you can simply use the provided monitor counters.1.60 commit transaction Now.”987652W”.desc.discount.”54783L”. 123456789.total) values (123456789.price.item_num.21100) insert into order_items (order_num.item_num. 123456789.1.1. “Welcome Mat”.0. order_num.price.1.qty.item_num.qty.item_num.1.price.qty.total) values (123456789.desc.$494. For example.price.total) values (123456789.discount.”889213T”. 
A better starting point might be to set sts_cachesize to the max of the number of columns in repdefs managed by the current Rep Server or the number of subscriptions on the repdefs managed by the current Rep Server. “Volley Ball Set”.item_num. a sqt_max_cache_size of 4MB was resulting in considerable latency in processing large transactions being distributed to two different reporting system destinations.1.$129. STS.$12.”30345S”.$12. while RDB2 with transactions for household goods.qty.00) insert into order_items (order_num.00.$250.desc. This would result in the following replicate database transactions: -.00) insert into order_items (order_num. The best way to think of this is that while the SRE decides who gets which individual modifications.”31245Q”.00. the default sts_cachesize configuration of 100 is far too low. the sqt_max_cache_size setting is crucial to overall inbound processing.item_num.21100) insert into order_items (order_num.price.price.”Anytown”.00) insert into order_items (order_num.0.item_num.00. ship_addr.total) values (123456789.00. “Bed Spread Set”.$50.price.1. and RDB3 focusing on sporting items. This would restrict the system to only retaining the most current 100 rows of subscription where clauses.$79. if greater. trace “on”. order_num.21100) insert into order_items (order_num. ship_state. is to turn on the cache statistics trace flag.00) insert into order_items (order_num.qty. 123456789. “Leather Jacket”.item_num.desc.$79. order_shipcost=$20.desc. ship_city.Collects STS Statistics Which works prior to RS 12.0. For example. “Leather Jacket”.discount.0.discount.00.total) values (123456789. With RS 12.40.desc. at one customer.1 sts_cachesize parameter refers to the number of rows for each RSSD table.”987652W”. “Welcome Mat”.desc. “123 Main St”.0.discount. order_num. “123 Main St”. Key Concept #12: The single largest tuning parameter to improve Distributor thread performance is increasing the sts_cachesize parameter in order to reduce physical RSSD lookups. As you can imagine.item_num.2. Replicate Database 1 (RDB1) might be concerned with clothing transactions (shipping warehouse for clothing). order_total=$984.00) update orders set order_subtotal=$964.60 commit transaction -.$129.0.price.1.00.qty. ship_addr.replicate database 2 (household goods) begin transaction web_order insert into orders (customer.$129. Consequently.”Anytown”.discount.1. ship_state.0.qty.”NY”. ship_addr.qty. order_total=$984.0.total) values (123456789.”31245Q”.$25.$12. order_shipcost=$20.total) values (123456789.00.Final v2. “Bed Spread Set”.$129. picture what happens in a normal replication environment if the source system was replicating to three destinations – each concerned with its own set of rules.00) 131 .$50.replicate database 1 (clothing items) begin transaction web_order insert into orders (customer.total) values (123456789.”NY”.discount.desc. “Chamois Shirt”.”732189H”.0. ship_city.desc.qty. One way to determine how effective the STS cache is.60. “6 Man Tent”.00.discount. ship_city.$250. For example. Setting the sqt_max_cache_size to 16MB resulted in the inbound queue draining at over 100MB/min. etc.discount.price.0.”Anytown”. The biggest bottleneck of the SRE will actually be getting the transactions from the SQT fast enough. This speed is even more notable when considering that the DIST thread had to write each transaction from the inbound queue to two different outbound queues.qty.00.00) insert into order_items (order_num. ship_zip) values (1122334. 
the largest impact that you can have is by increasing sts_cachesize to reduce the physical lookups.desc. STS_CACHESTATS .60) update orders set order_subtotal=$964.00.00) insert into order_items (order_num.$49. ship_zip) values (1122334. “123 Main St”.2.price.$250.$12. your begin tran (and other rows would have lower OQID’s and would really mess things up).discount. Packs the command into packed ASCII format and writes the command to each of the destination queues (via the MD module) Writes a commit record to each of the queues once the entire list of operations has been processed. TD library appends a 2 byte counter to COMMIT record of the transaction for all the commands which are distributed by TD.$79.qty.qty. as each transaction is processed.$494.21100) insert into order_items (order_num.60.price.”889213T”. order_shipcost=$20.desc.$444. order_shipcost=$20. It accomplishes this through the following steps: • • • • • Looks up the correct queue for each of the destination databases – it is passed a bitmap of the destination databases from the DIST thread (based on SRE).0. The SQM thread relies on the increasing OQID’s to perform its duplicate detection. ship_addr.60) update orders set order_subtotal=$964. Consider the following scenario in which transaction T1 begins prior to transaction T2. The job of the SQT thread is to pass transactions to DIST thread in the COMMIT order.total) values (123456789.”30345S”.”Anytown”. Only DIST thread calls TD.$49.40.item_num.00. “6 Man Tent”.”NY”.60 commit transaction -. ship_zip) values (1122334.discount. ship_city. it would result in a de-sorting all the work done by the SQT thread if they were just sent through as normal. Earlier. we discussed the fact that the TD module added two bytes for uniqueness. 123456789. Writes a begin record for each transaction to the destination queue (using the commit OQID) For each operation received. order_total=$984. The answer is in the simple fact that transactions could overlap begin/commit times and since the original OQID’s are generated in order.1.1. adds two bytes to the commit OQID and replaces the operations OQID with the new OQID based off of the commit record.replicate database 3 (sporting goods) begin transaction web_order insert into orders (customer. In order to prevent the outbound SQM rejecting the commands.Final v2.1 update orders set order_subtotal=$964.item_num. however. yet commits after: OQID 0x04010000 0x04020000 0x04030000 0x04040000 0x04050000 0x04060000 0x04070000 0x04080000 0x04090000 0x040A0000 0x040B0000 Operation begin t1 insert t1 begin t2 delete t2 insert t1 update t2 insert t2 insert t1 commit t2 insert t1 commit t1 The TD would receive T2 first and then T1 and would renumber the OQID’s as follows: 132 . in the makeup for the OQID. order_total=$984.desc. the TD uses the commit record’s OQID and simply adds a sequential number in the last two bytes. A frequent question is “Why?”. Consider the following points: • • • • When the Rep Agent forwards commands to the Replication Server it generates unique 32 byte monotonically increasing OQID’s. o Why the commit record??? Because if your transaction began before someone else’s who committed before you. therefore the commands the DIST forwards to the TD module may not have increasing OQID’s. “123 Main St”. o So we use the CT oqid and add 0001-ffff to each row in the tran The counter is reset when a NEW begin record is passed to TD • Consequently.total) values (123456789.00) insert into order_items (order_num. ship_state. order_num.60. 
it is the TD that “remembers” that each is within the scope of the outer transaction “web_order” and requests the rows to be written to each of the outbound queues.0.price.60 commit transaction The SRE physically determines what DML rows go to which of the replicates.00.$79. “Volley Ball Set”. You need to first find the commit record for that transaction in the inbound queue – a feat that is not made simple in that it is not always identified which transaction the commit record belongs to. the message is written to the outbound queue via the SQM for the RSI connection to the Replicate Replication Server. New Delhi. NY would only have to send a single message to cover Chicago. Take. Sydney Australia. Taiwan. the following scheme.e.0. Further. Mexico City. if a transaction needs to be replicated to all of the European sites.Final v2. The DIST thread passes the transaction row and the destination ID to the MD module. Message Delivery The Message Delivery (MD) module is called by the DIST thread to optimize routing of transactions to data servers or other Replication Servers. the NY system only needs to send a single message with all of the European destinations in the header to the London system. this performance advantage gained by distributing the outbound workload may make it feasible to implement replication routing even to Replication Servers that may reside on the same host. the primary key values). If so. This should also explain why some people have a difficult time identifying the same transaction in the outbound queue as one in the inbound queue when attempting to ensure that it is indeed there.1 OQID 0x04090001 0x04090002 0x04090003 0x04090004 0x04090005 0x040B0001 0x040B0002 0x040B0003 0x040B0004 0x040B0005 0x040B0006 Operation begin t2 delete t2 update t2 insert t2 commit t2 begin t1 insert t1 insert t1 insert t1 insert t1 commit t1 As a result. now the destination queues have transactions in commit order with increasing OQID’s to facilitate recovery. Hong Kong. from a technical perspective. In addition. this has often been touted as a means to save expensive trans-oceanic bandwidth. due to the multi-tiered aspects of the Pacific arena above. the new destination is simply appended to the existing message. In the past. Peking. the module determines where to send the transaction: • • If the current Replication Server manages the destination connection. it almost always easier to search by values in each record (i. San Francisco. Tokyo. If the destination is managed by another Replication Server (via an entry in rs_repdbs). As a result. the MD module checks to see if it is already sending the exact same message to another database via the same route. for example. 133 . If not. While this may be true. Dallas. MD & Routing This last point is crucial to understanding a major performance benefit to routing data – consider the following architecture Figure 31 – Example World Wide Topology In the above diagram. Using this information and routing information in the RSSD. the biggest savings is in the workload required of any one node – allowing unparalleled scalability. the message is written to the outbound queue via the SQM for the outbound connection. in this case. they offer a tremendous performance benefit to Replication Server by reducing the workload on the primary Replication Server. This can be used to effectively create a MP Replication scenario for load balancing in local topologies. that RS can concentrate strictly on inbound queue processing and subscription resolution. 
along with a single ASE implementation for the RSSD databases could start making more effective use of larger server systems that they may be installed on. Key Concept #13: While replication routes offer network bandwidth efficiency. While none of the systems are very remote from the POS system.0. if the Replication System begins to lag.1 Pay Roll Accounting Marketing CRM POS Billing Supply Purchasing DW Shipping Figure 32 – Example Retail Sales Data Distribution In this scenario. the RS that manages the POS connection does not then manage any other database connections. the POS system may be impacted due to the affect the Replication Server could have on the primary transaction log if the Replication System’s stable devices are full. The other three can concentrate strictly on applying the transactions at the replicates. Consequently. it may make sense to implement a MP Rep Server implementation by using multiple Replication Servers to balance the load.Final v2. Pay Roll Accounting Marketing CRM POS Billing Supply Purchasing DW Shipping Figure 33 – Retail Sales Data Distribution using Multiple Replication Servers Note that in the above example solution. 134 . An additional performance advantage in inconsistent network connectivity environments is that network problems that occur during Replication Server applying the transactions at the replicate can degrade performance due to frequent rollbacks/retry due to loss of connection. all four Replication Servers. With a 6-way SMP box. 1).1+. Total number of destinations. Note that similar to the exec_sqm_write_request_limit.1 MD Tuning Other than the sts_cachesize and replication routing. Number of messages received and processed. Frequently. Fortunately. but rather the source connection. when a replicate database has more than one source database (i. md command as illustrated below: admin statistics. only 12 blocks of cache will be available for each destination’s SQM (assuming each are experiencing same performance traits). Consequently.x & 12. The reason for this is that we are still discussing the Distributor thread. With previous versions of RS (i. By adjusting the md_sqm_write_request_limit/md_source_memory_pool. Indicates whether the current Replication Server can send messages: 0 .This Replication Server cannot send messages 1 . customers have noted that when the inbound queue experiences a backlog. 11. The problem is that it is a single pool and the blocks (if you will) are for single connection each. once the SQT cache is resized. which is part of the inbound side of replication server internal processing. The number of messages sent to the SQM without acknowledgment.0 ESD #1. it is often misunderstood as it does not change destination connections.6 ESD #7 and RS 15. the inbound queue 135 .1. md Source -----SYDNEY_DS TOKYO_DS TOKYO_DS Pending_Messages ---------------0 0 0 SQM_Writes ---------34 551 1452 Is_RSI_Source? -------------0 0 0 Memory_Currently_Used --------------------0 0 0 Destinations_Delivered_To ------------------------34 551 1452 Messages_Delivered -----------------34 551 1452 Max_Memory_Hit -------------0 0 0 Each of these values are described below: Column Source Pending_Messages Meaning The Replication Server or data server where the message originated. Prior to RS 12.e. the md_sqm_write_request_limit can be set through the standard alter connection command. this parameter was frequently missed as the only way to set it was through using the rs_configure stored procedure in the RSSD database.0). 
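The route itself is created with a single RCL command at the source Replication Server. The sketch below assumes a hypothetical destination Replication Server named LONDON_RS with an RSI login already created for it; the exact set clauses vary slightly between versions:

-- executed at the source Replication Server
create route to LONDON_RS
set username to LONDON_RS_rsi
set password to LONDON_RS_rsi_ps
go
-- the RSI sender thread for the new route can then be checked with
admin who, rsi
go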
this occurs because Replication Server is processing the messages before writing them to disk. Number of messages delivered. in RS 12. the limit for md_sqm_write_request_limit was raised from 983040 (60 blocks) to 2GB (recommendation is 24MB). you allow the source connection’s distributor thread to cache its writes when the outbound SQM is busy and to enable more efficient outbound queue space utilization. corporate rollup). the other performance tuning parameter that directly affects the distributor thread is md_sqm_write_request_limit (formerly known as md_source_memory_pool prior to RS 12. if replicating to 5 different destinations.0. with RS 12. This is especially useful when a source system is replicating to multiple destinations without routing. This is a memory pool specifically for the MD to cache the writes to the SQM for the outbound queues. While md_sqm_write_request_limit is a connection scope tuning parameter. not much tuning is needed. or for the remote replication server when multiple destinations exist for the same source system. Not yet implemented.Final v2. Memory used by pending messages. even with 60 blocks available for caching. the only visibility into this memory was via the admin statistics.e. Usually.This Replication Server can send messages Memory_Currently_Used Messages_Delivered SQM_Writes Destinations_Delivered_To Max_Memory_Hit Is_RSI_Source? Beyond tuning the md_sqm_write_request_limit and sts_cache_size. If the number of pending commands is high. to monitor the performance or throughput of the Distributor thread. This is the opposite of the above (PendingCmds). then the DIST could be a bottleneck as it is not reading commands from the SQT in a timely manner.SYDNEY_RSSD 106 SYDNEY_DS.Final v2. Column PrimarySite Type Status PendingCmds Meaning The ID of the primary database for the SQT thread. This essentially certifies that the DIST is not a cause for performance problems. dist command admin who. The number of commands that have been processed by the thread. The likely culprit is either the STS cache is not large enough and repeated accesses to the RSSD is slowing processing – or the outbound queue is slow. delaying message writes. The number of commands that are pending for the thread." You should only see “ignoring” during initial startup of the Replication Server. we covered tuning issues specific to that module. Whether or not the thread is waiting for the SQT.0. The number of commands belonging to the maintenance user. This is a testament to the performance and efficiency of the DIST thread. dist Spid ----21 22 PrimarySite ----------102 106 Duplicates ---------0 290 NoRepdefCmds -----------0 0 State ---------------Active Active Type ---P P Status -----Normal Normal Info --------------------102 SYDNEY_DS. The thread has a status of "normal" or "ignoring. SqtBlocked Duplicates TransProcessed CmdsProcessed MaintUserCmds 136 . The number of duplicate commands the thread has seen and dropped. DIST Performance and Tuning Within each of the Distributor module discussions above. Overall. This should stop climbing once the Replication Server has fully recovered and the Status (above) changed from “ignoring” to “normal”.pubs2 PendingCmds ----------0 0 CmdsProcessed ------------1430 293 CmdMarkers ---------0 1 SqtBlocked ---------1 1 MaintUserCmds ------------0 0 TransProcessed -------------715 1 CmdsIgnored ----------0 0 The meaning for each of the columns is described below. The thread is a physical or logical connection. you can use the admin who. 
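For reference, on RS 12.1 and later the parameter is set against the source connection with the standard alter connection command (the connection name is reused from earlier examples; the value shown is the 983,040 byte/60 block ceiling that applied before RS 12.6 ESD #7 and RS 15.0 ESD #1):

-- set at the SOURCE connection; enlarges the MD write cache toward the outbound SQMs
-- (depending on version, the connection or its distributor may need to be suspended
-- and resumed before the new value is picked up)
alter connection to SYDNEY_DS.pubs2
set md_sqm_write_request_limit to '983040'
go
-- observe the effect on pending messages and SQM writes
admin statistics, md
go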
The number of transactions that have been processed by the thread. This should be 0 unless the Rep Agent was started with the “send_maint_xacts_to_replicate” option.1 drains quite dramatically – at a rate exceeding 8GB/hr. Total commands read from an inbound queue by a DIST thread. This is also when request functions are identified.6 and higher. Total commands ignored by a DIST thread.Final v2. SQM disk i/o. this can be a key insight into why there may be database inconsistencies between a primary and replicate system. rs_markers are enable replication. 137 . etc.1 Column NoRepdefCmds Meaning The number of commands dropped because no corresponding replication definitions were defined – or in RS 12. request functions have a replication definition specifying the real primary database which would not be the current connection processing the logged procedure execution. and dump markers. Total rs_markers placed in an inbound queue. This counter is incremented each time a new SUB is dropped. activate. however." The number of special markers (rs_marker) that have been processed. In either case. The way this is detected is described in more detail later. This counter is incremented for each new SUB. a large number of occurrences of NoRepdefCmds can mean one of several things: • • • Database replication definition was created (for MSA implementation possibly) for a specific source system. If this is the case. Total rs_ticket markers processed by a DIST thread. Total commands rejected as duplicates by a DIST thread. this means that the databases are probably suspect as they are definitely out of synch. DIST Thread Monitor Counters The Distributor thread counters added in RS 12. it could include commands replicated using database repdefs (MSA) for which no table level repdef exists.1 are listed below: Counter CmdsDump CmdsIgnored CmdsMaintUser CmdsMarker CmdsNoRepdef CmdsTotal Duplicates RSTicket SREcreate SREdestroy Explanation Total dump database commands read from an inbound queue by a DIST thread. then a good. In either case.0. CmdsIgnored CmdMarkers As noted from the above command output. it discards the log row at this stage. Total commands executed by the maintenance user encountered by a DIST thread. Total SRE creation requests performed by a DIST thread. Total SRE destroy requests performed by a DIST thread. If a procedure. validate. If the replication definition does not exist. cheap performance improvement is to simply unmark the tables or procedure for replication. Normally only noticed during replication system implementation such as adding a subscription or a new database. The number of commands dropped before the status became "normal. Or… Tables or procedures were needlessly marked for replication. In any case. if you remember from classes you have taken (or reading the manual). SQT and DIST CPU time. the DIST thread is responsible for matching LTL log rows against existing replication definitions to determine which columns should be ignored. This will reduce Rep Agent processing. Total commands encountered by a DIST thread for which no replication definition exists. this is an indication that a table/procedure is marked for replication but lacks a replication definition (as table level repdefs should be created even for MSA implementations). but individual table-level replication definitions were not created (a performance issue) A replication definition was mistakenly dropped or never created. activate. Update commands encountered by a DIST thread and resolved by SRE. 
SRE requests performed by a DIST thread to fetch a SRE object. This counter is incremented each time a new SUB is dropped. Total update commands encountered by a DIST thread and resolved by SRE. This counter is incremented each time a DIST thread fetches an SRE object from SRE cache. Total deletes commands encountered by a DIST thread and resolved by SRE. Insert commands encountered by a DIST thread and resolved by SRE.Final v2. Updates to RSSD. This counter is incremented for each new SUB.rs_locater table by a DIST thread. Total Commit or Rollback commands processed by a DIST thread. Total DIST commands with no subscription resolution that are discarded by a DIST thread. Commands executed by the maintenance user encountered by a DIST thread. SRErebuild SREstmtsDelete SREstmtsDiscard SREstmtsInsert SREstmtsUpdate TDbegin TDclose TransProcessed UpdsRslocater The counters in RS 15. This counter is incremented each time a DIST thread fetches an rs_subscriptions row from RSSD. Commands ignored by a DIST thread while it awaits an enable marker. SRErebuild SREstmtsInsert SREstmtsUpdate 138 . SRE rebuild requests performed by a DIST thread. Total SRE rebuild requests performed by a DIST thread.0.rs_locater table by a DIST thread. Commands rejected as duplicates by a DIST thread. and dump markers.. Total updates to RSSD. A DIST thread performs an explicit synchronization each time a SUB RCL command is executed. Commands encountered by a DIST thread for which no replication definition exists. Transactions read from an inbound queue by a DIST thread. SRE destroy requests performed by a DIST thread. Dump database commands read from an inbound queue by a DIST thread.1 Counter SREget Explanation Total SRE requests performed by a DIST thread to fetch an SRE row. Total transactions read from an inbound queue by a DIST thread.. This implies either there is no subscription or the 'where' clause associated with the subscription does not result in row qualification. validate. Total insert commands encountered by a DIST thread and resolved by SRE. rs_markers are enable replication. rs_markers placed in an inbound queue. Total Begin transaction commands propagated by a DIST thread.0 are: Counter CmdsRead TransProcessed Duplicates CmdsIgnored CmdsMaintUser CmdsDump CmdsMarker CmdsNoRepdef UpdsRslocater SREcreate SREdestroy SREget Explanation Commands read from an inbound queue by a DIST thread. SRE creation requests performed by a DIST thread. A DIST thread performs an explicit synchronization each time a SUB RCL command is executed. Commit or Rollback commands processed by a DIST thread. This also is a good place to again find application driven problems.1 Counter SREstmtsDelete SREstmtsDiscard Explanation Deletes commands encountered by a DIST thread and resolved by SRE. The second set is useful as now we can get a glimpse as to how many transactions vs. For instance. the last two counters are new and can be helpful in determining why a latency might occur between the DIST and the SQT . CmdsNoRepdef is a bit interesting. a high value here is to be expected. this shouldn’t afflict much damage – besides. This implies either there is no subscription or the 'where' clause associated with the subscription does not result in row qualification.0.0 ESD #1 servers) as well. SREstmtsUpdate.0 only) Again. it does have a similar cache configuration – called md_sqm_write_request_limit (replaces the deprecated md_memory_source_pool) – which should be increased to the current maximum of 983. 
these counters are only incremented if using standard table repdefs – a database repdef without table repdefs will cause these to be ignored. but if DISTReadTime is high. CmdsPerSec = CmdsTotal/seconds TransProcessed. just commands are flowing through – which can then be compared to the DSI transaction rate later. 139 . DISTReadTime DISTParseTime The amount of time taken by a Distributor to read a command from SQT cache. there is no real way to control UpdsRslocater – but by reducing everything else. If using RS 12. The next three are useful if trying to learn how many inserts/updates/deletes are flowing through the system. Key counters include: CmdsTotal. However.0 to help track how much time the DIST spends on these activities.Final v2. SREstmtsDelete DISTReadTime. rs_ticket markers processed by a DIST thread. the first one helps us identify the rate and compare this back to the SQT and RA modules to see if we are running up to speed. Either way. the average. TDbegin TDclose RSTicket dist_stop_unsupported_cmd dist_stop_unsupported_cmd config parameter. The DIST thread will generally have two sources of problems. The second source (and most common) is that the outbound queue is not keeping up (or we are writing to too many outbound queues in a fan-out – time to add routes and spread the load a smidgen). if you see that the number of inserts and deletes are nearly identical.6 ESD #7 and pre 15. However. TranPerSec = TransProcessed/seconds CmdsNoRepdef UpdsRslocater (again!!!) SREstmtsInsert. it is possible that either autocorrection is turned on – or the application developers used a delete followed by insert instead of an update. First. DIST commands with no subscription resolution that are discarded by a DIST thread. Typically. this is lower than the updates to the OQID – typically less than 1 per second in any case. This time. The amount of time taken by a Distributor to parse commands read from SQT. The last two are new counters added in RS 15.040 (for pre 12. As with the other modules. In all other cases. of course we have the SQM for the outbound queue(s) which have the same counters as the inbound queue – the only difference is that the DIST does not have a WriteWaits style counter like the RA thread. However.6 and a database replication definition (MSA) with no table level repdefs. either not enough STS cache was provided or sts_full_cache_ is not enabled for rs_objects and rs_columns. total and max counters have been combined into a single counter with the different columns in rs_statdetail. Begin transaction commands propagated by a DIST thread.other than the obvious problem of the SQM outbound slowing things down. it points to a table marked for replication for which there is no repdef. DISTParseTime (RS 15. it may point to a problem with the SQT. However. this in itself should also point out that it is ALWAYS a good idea to use repdefs from a performance perspective – even when not necessary (MSA or WS). this should be minimal. After the DIST thread. the DIST counters also are fairly handy for finding application problems as well. 409 250. SREstmtsInsert/Update/Delete – This is the first location within the monitor counters where you begin to get a picture of what the source transaction profile looked like – especially if combined with DIST. 
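One quick configuration note before turning to real data: as mentioned above, the DIST's equivalent of a write-request cache for the outbound queue is md_sqm_write_request_limit (which replaced the deprecated md_memory_source_pool). A minimal sketch of raising it to the older maximum of 983,040 bytes follows; this is assumed here to be a server-level parameter, and the maximum value (and whether a restart is required for it to take effect) varies by RS version, so verify against the release in use:

configure replication server set md_sqm_write_request_limit to '983040'
go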
let’s take a look at some of these counters in action using the customer data we’ve been discussing: CmdsWritten (SQM) Sample Time DIST CmdsTotal SQMR CmdsRead Cmds/Sec SREstmts Insert SREstmts Update 0:29:33 0:34:34 0:39:37 0:44:38 0:49:40 0:54:43 0:59:45 1:04:47 1:09:50 1:14:52 1:19:54 268. table level replication definitions ought to be used if database consistency (think float datatype problems) and DSI performance is of any consideration. Note that in this case.847 24. The second possible cause is that the table is marked for replication – or the database is marked for standby replication – but the table(s) involved at this point don’t have corresponding subscriptions.577 95.674 587. a very curious phenomenon was observed that lead to the second problem 140 CmdsNo RepDef .280 459. Note that not exactly all commands will be distributed.261 299 2 9 87 4 3 14 4 469 7 44 SREstmts Delete 0 3.726 19. CmdsNoRepDef – Here is where we begin to see the first problem – we have significantly large values for this counter where logically we should expect none.753 26.076 83.340 317.050 84.683 286.077 373. in this case.435 522.CmdsWritten counter. so a precise match is likely not possible. you could not compare it to the previous stage.860 947. This value is useful in observing the impact of tuning on the overall processing by the DIST – particularly if adjustments are made to the STS cache (in addition to observing the STS counters as well). despite all the rescanning for large transactions.334 253. a database replication definition being used for a standby database implementation via the Multiple Standby Architecture (MSA) method is similar to a Warm Standby implementation in that table level replication definitions are not required. Without subscriptions and lacking a database repdef/subscription – the DIST has not choice but to discard these statements.965 29.432 33.586 317.065 352. However. it does indicate that overall system performance could be improved by not replicating this data in the first place – either by unmarking the tables for replication.TransProcessed. While not required.714 325.611 282.684 164. However.554 21.386 365.750 325.168 430.687 24.CmdsTotal counter to the SQM.768 125.844 400.481 393. This does not mean that the SQT cache does not need to be resized – it suggests that if any latency is observed.540 20.949 203.0. However. if instead you tried to compare with SQMR CmdsRead. DIST CmdsTotal – The best way to identify latency in the SQT DIST pipeline is to compare the DIST.808 318. There are two possible causes for this.662 25.958 277.187 364. this value is derived by dividing the CmdsTotal by the number of seconds between sample intervals.054 194.705 253.707 22. or other technique of ensuring that the Replication Agent doesn’t process the rows.139 1.238 15.926 29.152 165. This will become apparent as we look at these counters SQM CmdsWritten vs.710 22.520 932 882 828 549 1.261 This one sample period actually was useful as it illustrated two different problems at this customer site.470 951 1. Cmds/Sec – Much like other derived rate fields.184 450.757 26.432 35.054 243. using the ‘set replication off’ command prior to the batch submission.607 57. In this case.698 25.934 157. the DIST thread is keeping pace with the SQM Writer.078 1.250 15. you would have a negative influence based on the re-scanning of removed transactions (as illustrated above) – plus if there was any latency.247 19.1 DIST Thread Counter Usage Again.677 266.375 344.Final v2.915 136.283 266. First. 
it would significantly reduce the workload of the SQM (inbound) and the SQT.566 376.013 110.424 1.809 326.313 280. increasing the SQT cache size is not likely to have a significant impact on throughput or reduce the latency as not much exists at this stage.408 0 3.656 317.241 1. not null. this workload is not as apparent thanks to user concurrency. appending the clause “replicate minimal columns” to replication definitions is often forgotten. not null. The second choice is that the application itself is doing delete/insert pairs vs. update. A common misconception is that minimal column replication chiefly benefits the RS throughput by reducing the amount of space consumed in the inbound (and outbound) queues. the inserts and deletes are nearly identical while the number of updates are noise level. note that the salesdetail table has a trigger that updates the title. Replication Server constructs a default function string containing an update for all 10 columns of the table. a replicated update would be submitted as a delete followed by an insert. The most likely choice is that the ‘autocorrection’ setting has been accidentally left enabled for a replication definition. earlier versions of some GUI application development tools such as PowerBuilder used to do this by default. As a result. If you notice. create table titles (title_id tid title varchar(80) type char(12) pub_id char(4) price money advance money total_sales int notes varchar(200) pubdate datetime contract bit go not null. While this sounds illogical. the STS counters and SQM (outbound) counters may also need to be looked at to determine what may be driving DIST thread performance. from the second sample interval on. For example.Final v2. but it also causes slower performance at the DSI as rows are removed not only from the table – but also the indices – and then readded. This could be legitimate – for example. This workload reduction specifically is the probable reduction in unnecessary index maintenance at the replicate as well as a reduction in contention caused by index maintenance when parallel DSI’s are used and the dsi_serialization_method is set to isolation_level_3. not null.1 identification. you first have to understand what happens normally. performing an update. not null ) create unique clustered index titleidind on titles (title_id) go create nonclustered index titleind on titles (title) go For further fun. This leaves two other possible choices. With Replication Server by default using a single DSI. To understand the impact of this. this workload delays replication as a whole. In that mode. the DIST thread does not perform any checking of what columns have been changed for an update statement. it can dramatically reduce the workload of the DSI thread as it can tremendously reduce the work at the replicate dataserver. Normal Replication Behavior Under normal (non-minimal column) replication.0. While not reducing the workload of the DIST thread so much. null. delete as /* Save processing: return if there are no rows affected */ if @@rowcount = 0 begin return 141 . At the primary. Minimal Column Replication Unfortunately. this is unlikely. It turned out that this indeed was the application logic – and while not a simple fix – rewriting the application to use updates instead would immediately have the replication latency. However. 
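Where the unnecessary rows come from a purge or other batch maintenance job whose changes are simply not needed downstream, the 'set replication off' approach mentioned above can be wrapped around the job itself. A minimal sketch follows; the table name and retention period are purely hypothetical, and remember that set replication off requires replication_role and only affects the current session:

-- run in the primary database by a login with replication_role
set replication off
go
-- maintenance work whose changes should not be replicated
delete purge_staging where purge_date < dateadd(dd, -30, getdate())
go
set replication on
go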
The issue is that this not only doubles the workload in Replication Server in having to process twice the number of commands.total_sales column: create trigger totalsales_trig on salesdetail for insert. null. consider the following table (from pubs2 sample database shipped with Sybase ASE) and associated indexes. when working off of a job queue – new jobs could be added as old jobs are removed. setting the column values equal to the new values with a where clause of the primary key old values. While it does reduce the space – and tighter row densities allow more rows to be processed by the SQM/SQT per I/O and this can improve performance. the biggest benefit of minimal column replication is the performance gain through reducing the workload involved at the replicate DBMS – aiding in DSI performance (where typically the problem is). null. null. null. if an update of only 2 columns of a 10 column table occurs. In addition to the DIST counters. 0. contract = ?contract!new? where title_id = ?title_id!old? ' The result is rather drastic. "price" money. price = ?price!new?. is of course. Consider: Aggregate columns – Such as the titles example.title_id = inserted. for an update statement. In this example. 142 .title_id) where title_id in (select title_id from deleted) go By now some of you may be already seeing the problem. "type" char(12). Auditing columns – this includes such columns as last_update_user. any index values that are updated automatically cause the index to be treated as “unsafe” and therefore also needing updated. "pub_id" char(4).rs_update for rs_sqlserver_function_class output language ' update titles set title_id = ?title_id!new?. last_updated_date. title = ?title!new?. every time a new order is inserted into the salesdetail table. But this is minor compared to what really impacts DSI delivery speed. this is not desirable behavior.pubs2 with all tables named 'titles' ( "title_id" varchar(6).title_id) where title_id in (select title_id from inserted) /* remove all values being deleted or updated */ update titles set total_sales = isnull(total_sales. As mentioned previously. it occurs much more often than you would think. Clearly. – similar to the trigger issue mentioned previously. The first problem.assuming the notes column was filled out.(select sum(qty) from deleted where titles.title_id = deleted. Unfortunately. if ANSI constraints were used.1 end /* add all the new values */ /* use isnull: a null value in the titles table means ** "no sales yet" not "sales unknown" */ update titles set total_sales = isnull(total_sales. Consider a mythical replication definition like: create replication definition CHINOOK_titles_rd with primary at CHINOOK. Worse yet.Final v2. the related foreign key tables would have holdlocks placed on the related rows. total_sales = ?total_sales!new?. "notes" varchar(200). etc. Status columns – shipping/order status information for order entry or any workflow system. "advance" money. "contract" bit ) -. advance = ?advance!new?. "title" varchar(80). the corresponding update at the replicate not only updates the entire row .it also performs index maintenance.Primary key determination based on: Primary Key Definition primary key ("title_id") searchable columns ("title_id") This means the function string (if you were to mimic it by altering the function string) would resemble: alter function string CHINOOK_titles_rd. increasing the probability of contention. any time a row is updated. "pubdate" datetime. "total_sales" int. type = ?type!new?. 
notes = ?notes!new?. For those of you familiar with database server performance issues. 0) . that the outbound queue will contain significantly more data than actually was updated . 0) + (select sum(qty) from inserted where titles. RS will generate a full update of every column. pubdate = ?pubdate!new?. pub_id = ?pub_id!new?. Under normal replication rules. the rs_update function is processed and sent to the RS. One option would be for RS to ignore any update for which only columns not being replicated were updated.rs_update for rs_sqlserver_function_class output language ' update titles set title_id = ?title_id!new?. title = ?title!new?. "price" money.rs_insert for rs_sqlserver_function_class output language ' update titles set total_sales = ?total_sales!new? where title_id = ?title_id!old? ' Which more than likely will execute much quicker in high volume environments. Consider a regional chain store that wants to replicate price changes to 60+ stores for 100’s of products. "advance" money. the behavior is much different. The RepAgent User thread simply strips out any columns not being replicated as part of the normalization process and the resulting functions are generated as appropriate. However. under minimal columns. Now add in the overhead of changing every column and index maintenance – and the associated impact that could have on store operations. pubdate = ?pubdate!new? where title_id = ?title_id!old? ' Now. the full update function string would now be: alter function string CHINOOK_titles_rd. "pubdate" datetime ) -. notes = ?notes!new?. "pub_id" char(4). most of the updates to the titles table would be executing a function string similar to: alter function string CHINOOK_titles_rd. only the columns with different before and after images – as well as primary key values – are written to the inbound & consequently outbound queue.Primary key determination based on: Primary Key Definition primary key ("title_id") searchable columns ("title_id") Of course. of course). let’s assume that the contract column was excluded from the replication definition as in: create replication definition CHINOOK_titles_rd with primary at CHINOOK. Consequently. the replicate would receive a full update statement of all columns in the replication definition (excluding the contract column. "type" char(12). type = ?type!new?. This 143 . With minimal column replication. An interesting aspect to minimal column replication is what happens if the only columns updated were columns not included in the replication definition.). Undoubtedly. Obviously. advance = ?advance!new?.pubs2 with all tables named 'titles' ( "title_id" varchar(6). For example. setting them to the same values they already are. this behaves differently. there are others you could think of as well. in the above titles table. if a column is updated. pub_id = ?pub_id!new?. what happens is RS submits an update setting the primary key values to after image values – essentially a no-op. price = ?price!new?. consider the following update statement: Update titles set contract=1 where title_id=”BU1234” If this statement was executed at the primary. Minimal Column Replication When the replication definition includes the “replicate minimal columns” phrase.1 Dynamic values – product prices (sale prices. etc. "title" varchar(80). the RS would otherwise attempt to generate an empty “set clause”.Final v2. "notes" varchar(200). if the only column(s) updated were columns excluded from the replication definition.0. "total_sales" int. As you can guess. 
total_sales = ?total_sales!new?. Note that minimal column replication really only applies for updates. the replication definition can be altered to remove minimal column replication (temporarily). Keep in mind that this does impose a number of restrictions: • • Autocorrection can not be used while minimal column replication is enabled. and when autocorrection is necessary due to inconsistencies. minimal column replication will not have a negative impact on RS functionality. Consequently. Key Concept #14: Unless custom function strings exist for update and delete functions for a specific table. then you may want to use two repdefs for the table(s) in question – one for the Warm Standby – supporting minimal column replication. Even if the values haven’t changed. While minimal column replication documentation does include comments about both update and delete operations. For example. all of the values are new and therefore need replication. the functions would never get invoked. If left on. minimal column replication should be considered. Regarding the first restriction. Note that for Warm Standby. if you have a Warm Standby and a Reporting system and the reporting system uses custom function strings (to perform aggregates). Custom function strings containing columns other than the primary keys may not work properly or generate errors. using multiple repdefs may alleviate the pain of not being able to use minimal column replication. 144 . autocorrection should not normally be on. in which guaranteed assurance of replicated transactions held sway over performance (and rightfully so). for most users. this translates to only the primary key values being placed into the outbound queue (vs. While this is easier handled today in a cleaner approach via using multiple replication definitions. By using minimal columns. Again. update operations at the replicate will proceed much quicker by avoiding unnecessary index maintenance and possibly avoiding updates altogether if the only columns updated at the primary are excluded from the replication definition. this implementation no doubt dates back to the earliest implementations of RS. For delete statements.1 can be confusing and lead to a quick call to TS demanding an explanation. and one for the reporting server.Final v2. the full before image as without minimal column replication) – which means any custom function strings (such as auditing) that is recording the values being deleted in a history table will incur problems. only the rs_update function will be impacted. if not using custom function strings on the table. In the case of insert statements. this can have a greater penalty than not using minimal columns as the index maintenance load could be greater due to first removing the index keys (and any corresponding page shrinkage) and then re-adding them (which could cause splits). performance could be seriously degraded as each update translates into a delete/insert pair. minimal column replication is enabled by default as also is true of MSA implementations. Before you pick up the phone – one little consideration – what if a custom function string simply was counting the number of updates to a table?? By excluding the update from replication simply if only non-replicated columns were updated. If using custom function strings. minimal column replication should be enabled by default.0. Final v2. we will begin by looking at the Data Server Interface (DSI) thread group in detail. with the exception of the RSI thread. 
The single biggest bottleneck in the replication system is the outbound queue processing. As hard as this may be to believe, the main reason for this is that the rate of applying transactions at the replicate will often be considerably slower than the rate at which they were originally applied at the primary. While some of this is due to replicate database tuning issues, a considerable part of it is also due to the processing of the outbound queue itself.

A key point to remember is that when discussing the outbound processing of Replication Server internals, you are discussing threads and queues that belong to the replicate database connection and not the primary. If you remember from the earlier internals diagram, the outbound processing basically includes the SQM for the outbound queue, the DSI thread group and the RSI thread for replication routes. These are illustrated below.

Figure 34 – Replication Server Internals: Inbound and Outbound Processing

As you can imagine, the outbound queue SQM processing is extremely similar to the SQM processing for an inbound queue – it basically manages stable device space allocation and performs all outbound queue write activity via the dAIO daemon. Consequently, a closer-in diagram of the DSI processing would look like the following:
the DSIEXEC notifies the DSI that it is ready to submit the first batch 8. the DSI notifies the DSIEXEC to send the batch the replicate DBMS for execution. you can think of the flow through the DSI as having the following stages: 1. We will look at the most appropriate counters during each section. 3. Read from Queue (DSI SQM Processing) Sort Transactions (due to multiple sources) (DSI SQT Processing) Group Transactions (DSI Transaction Grouping) Convert to SQL (DSIEXEC Function String Generation) Generate Command Batches for Execution (DSIEXEC Command Batching) Submit SQL to RDB (DSIEXEC Batch Execution) We will use this list as a starting point to discuss DSI processing.1 Figure 35 . there is at least one major difference. The DSI checks the commit order and notifies the DSIEXEC’s when they can commit. when the thread is ready to commit (rs_get_threadseq returns). As far as the SQM itself. While this is a desirable goal for the outbound queue as well. In addition. the implication is that the previous thread rolled back (due to error or contention) and that this thread needs to rollback as well – in which case step #11 becomes a ‘Rollback’ command (currently implemented as disconnect which causes an implicit rollback). Note that in the above diagram. Subsequent command batches are simply applied except in the case of large transactions in which every dsi_large_xact_size commands. the likelihood is that the latency in executing the SQL at the replicate will result in the cache hit quickly dropping to zero once the DSI SQT cache fills. the DSI would not respond with a ‘Begin Batch’ until it got the ‘Commit Ready’(#10) from the top thread. if the dsi_serialization_method was ‘wait_for_commit’. the seq number from the rs_get_threadseq is passed to the DSI for comparison. Consider the following: 147 .0. The reason we say it begins at rs_get_threadseq (#9) is that in parallel DSI’s. it sends the commit to the replicate DBMS and notifies the DSI that it has committed and is available for another transaction group. Any time lag is likely due to the DSI waiting for a previous thread to respond back ‘Committed (#13)’ which means that it has committed successfully. For example. This illustrated in the below diagram (showing only the communications between the DSI and one DSIEXEC – others implied). the rs_threads table is used for serialization . it is identical to the inbound queue SQM. If the seq number is less than expected. Note that only the first command batch is coordinated with the DSI. 12. the DSI reads from the outbound queue SQM. the DSIEXEC notifies the DSI that it is ready to commit via message queue. DSI SQM Processing Much like the SQT interaction with the inbound queue SQM.Final v2. When all the SQL commands have been submitted. A few items of interest relating to the monitor counters from the above diagram Batch Sequencing Time – (Steps 4 5 6 7) Is the time between when the first command batch is ready (#4 Batch Ready) and when the DSIEXEC receives the Begin Batch message (#5). the bottom thread would get a ‘Begin Batch’ response when the top thread sent the ‘Batch Began’ message (#7) Commit Sequencing Time – (Steps 9 10 11 12 13) This is the time between the ‘Commit Ready’ (#10) and the ‘Commit’ (#11) response. Figure 36 – Logical View of DSI & DSIEXEC Intercommunications As you can tell. a rs_get_thread_seq is sent. if the bottom thread sent a ‘Batch Ready’ message. If you remember from the inbound discussion. As each DSIEXEC receives commit notification. 
if the DSI serialization method is wait_for_commit. If instead the dsi_serialization_method was ‘wait_for_start’. it notifies other DSIEXEC’s that they can send their batch. 13. the primary goal is to be reading the blocks from cache – using BlocksReadCached as the indicator.1 11. This gap is used to control when parallel DSI’s can start sending their respective SQL batches according to the dsi_serialization_method. there is quite a bit of back-and-forth communications between the various DSIEXEC’s and the DSI thread to ensure proper commit sequencing and to also ensure that the command execution sequencing is maintained. when not using commit control. While many of the SQM/SQM-R related counters are the same.and it is in this step that it occurs (as will be discussed later). then you can be sure that the DSI SQT cache is probably full. which means that the most accurate estimate for latency in the outbound queue is: Latency = Last.533 48. Now this also points out a bit of a fallacy.62 100 75.042 19.312 7.075 6.352 2 189 308 185 270 291 530 715 744 405 403 266 418 298 470 2 189 307 185 269 291 401 0 0 240 0 0 0 0 0 100 100 99.751 6 6.496 18.432 2. One aspect to consider is that if there is any latency. that suddenly there is 1MB of backlog in the outbound queue – despite the source being quiescent. If the transactions are not contiguous (replicated rows from the various sources inter-dispersed in the stable queue). if a source database is replicating to a single destination.Read.Cmds Written Deallocated 0 3 4 3 5 4 10 13 9 11 8 10 23 17 13 0 3 4 3 5 4 8 11 12 6 6 4 7 5 7 SegsActive Allocagted 19:02:07 19:07:08 19:12:10 19:17:12 19:22:13 19:27:14 19:32:16 19:37:18 19:42:19 19:47:21 19:52:22 19:57:45 20:02:48 20:07:49 20:12:51 6 6.920 2.058 41.097.098.Read is higher than Last.Seg Block. the inbound SQT effectively sorts the transactions into commit sequence.876 12. However. since each replicate will have its own independent outbound queue that the single DIST thread is writing commit ordered transactions into.331 21.25 0 0 0 0 0 1 1 1 1 1 1 3 5 2 7 9 15 31 44 50 Cache MemUsed 0 1.Final v2. due to MD caching of writes.727 93.67 100 99.66 0 0 59. once the DSI SQT cache fills. the outbound processing does not have a separate SQT thread.803 29.987 7.0.Block from Last Seg.432 2. The number of active segments above is a good estimate as well – however these are not reported in any easily obtained admin who statistics.Block – Next.807 9.432 As you can see from the above. This is largely due to a very simple reason – transactions in the outbound queue are more than likely already in commit order.468 29.sqm command. providing that the transactions are small enough.098. then the outbound queue is automatically in sorted order.Seg Block – Next.432 2. DSI SQT Processing If you notice in the internals diagram above.Block is even more inaccurate than using Next.098.098.499 25. For the outbound queue.432 2. But this may explain to some why when the connection appears to be all caught up and you suspend the connection.405 42. As a 148 .Read value from the Last Seg. The only time this is not true is when multiple primary databases are replicating into the same replicate database – such as corporate rollup topologies.293 7. the SQT will still encounter complete and contiguous transactions from each source system.792 3. For example.328 0 0 143. 
Earlier we stated that one way to determine the amount of latency was to subtract the Next.Block includes segments still allocated due to simply not having been deallocated yet as well as segments preserved by the save interval – so subtracting First Seg.Read + (DSI SQT Cache) If Next. The First Seg.920 2.Read + CacheMemUse.432 2.711 4.1 SQMR Cmds Read Sample Time Cache Hit % Blocks Read BlocksRead Cached SQM.097. it is very likely that the DSI is caught up or nearly so. this does represent a “rough” estimate – what it is lacking is the amount in the DSI SQT cache.539 67. Since this ordering is not overridden anywhere within the rest of the inbound processing.098.098.689 4.570 22. unlike with the inbound processing. the SQT will still only have a single transaction per origin in the Open/Closed/Read linked lists as the transactions are still in commit order respective to the source database. the BlocksReadCached quickly hits bottom. Consequently. This does not change if the primary has multiple replicates.270 18.963 7.432 2.238 40.Block in the admin who.098.140 31.104 2.564 52. the most accurate measurement would be Last Seg. even in this latter case.046 6. not only shut down the DIST thread. with the exception of the SQT cache in a WS DSI thread. From a performance perspective. you should consider the ‘alter logical connection logical_DS. This can save CPU time – especially in pre-12. if the SQT module is so little used. Eliminating CPU and memory consumed by the SQT thread in sorting the transactions So.logical_DB set distribution off’ command. the SQT cache contains the actual commands that comprise the transaction – consequently. sqt Spid ---17 98 10 0 Closed -----0 0 0 0 Removed ------0 0 0 0 SQM Reader ---------0 0 0 0 State ----Awaiting Awaiting Awaiting Awaiting Read ---0 0 0 0 Full ---0 0 0 0 Info ---101:1 TOKYO_DS. causing the transaction to be moved to the “Read” queue.normally called the DSI scheduler or DSI-S) simply calls the SQT functions when reading from the outbound queue via the SQM. This is illustrated in the above drawing in which the DSI EXEC threads read from the SQT cache “Closed” queue and after applying the SQL.TOKYO_RSSD 106 SYDNEY_DSpubs2sb Trunc ----0 0 0 0 First Trans ----------0 0 0 0 Parsed ------0 0 0 0 Wakeup Wakeup Wakeup Wakeup Open ---0 0 0 0 SQM Blocked ----------1 1 0 0 Change Oqids -----------0 0 0 0 Detect Orphans -------------0 0 1 1 In the above example output. admin who.e. etc. the DSI SQT processing is reported in the last two lines lacking the queue designator (:1 or :0). but you also shut down the SQT thread. In the following sections we will take a look at why you might want to do either. DSI SQT Performance Monitoring This does not mean that you cannot monitor the SQT processing within the outbound queue processing. by disabling distribution for a logical connection. Consequently. the admin who. the WS-DSI threads read straight off the inbound queue – effectively duplicating the sorting process carried out by the SQT thread. what is the SQT cache used for by the DSI thread? Remember. notify the DSI of the success. sqt output and careful monitoring of the monitor counters. the main DSI thread queue manager (DSI . the SQT cache is where the DSI EXEC threads read the list of commands to generate SQL for and apply to the replicate database. If your only connection within the replication server is a Warm Standby.6 non-SMP RS implementations by: • • Eliminating CPU consumed by the DIST thread unnecessarily checking for subscriptions. 
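As a quick worked example of the outbound latency estimate described above (all values hypothetical): if admin who, sqm for the outbound queue shows Next.Read at 1040.3 and Last Seg.Block at 1045.12, the unread backlog on disk is roughly five 1MB segments plus a few 16KB blocks, call it 5MB; adding a DSI SQT CacheMemUsed of 2,097,152 bytes gives an estimate of roughly 7MB of undelivered work for that connection.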
One notable difference to this is for Warm Standby DSI’s. if you (hopefully) have tuned the Replication Server’s sqt_max_cache_size parameter (i.pubs2 101 TOKYO_DS. During startup. The way this can easily be verified is by issuing a normal admin who command and comparing the spids (10 and 0 above) with the type of thread reported for those processes in the process list returned by admin who. In a Warm Standby. you may want to adjust the SQT cache for the outbound queue downward or up depending on the status of the removed and full columns in the admin who. the RS first starts the SQM threads then the DSI and DIST threads. If you remember from previous. This lack of workload was the primary driver to simply including the SQT module logic into the DSI vs. to 2-4MB). This can (and must) be done on a connection basis via setting the dsi_sqt_max_cache_size to a number differing from the sqt_max_cache_size. This command shuts down the DIST thread for the logical connection. having a separate SQT thread for the outbound queue.1 result.TOKYO_RSSD 103:1 DIST LDS.0. sqt command reports both the inbound and outbound SQT processing statistics.Final v2. The DIST in turn starts the appropriate SQT thread. 149 . The DIST is more than just a client of the SQT thread – it actually controls it. the standard SQT monitor counters apply. even in these cases. ClosedTransRmTotal ClosedTransTotal CmdsAveTran CmdsLastTran CmdsMaxTran CmdsTotal EmptyTransRmTotal MemUsedAveTran MemUsedLastTran MemUsedMaxTran OpenTransRmTotal 150 . How to size this will be illustrated in the next section. This is extremely unfortunate as DBA’s tend to over allocate sqt_max_cache_size – setting it well above the 4-8MB that is likely all that is necessary even in high volume systems. In such situations. Maximum memory consumed by one transaction. Total empty transactions removed from queues. Total memory consumed by the last completely scanned transaction by an SQT thread. raising the DSI SQT cache allows the DSI to “read ahead” into the queue and begin preparing transactions before they are needed. For this reason. in most common systems. the default dsi_sqt_max_cache_size causes performance degradation. consider the default dsi_max_xacts_in_group setting of 20.Final v2. The proper sizing for the dsi_sqt_max_cache_size is likely 1-2MB at most and can be more accurately determined for parallel DSI configurations by reviewing the monitor counter information (discussed below). Average memory consumed by one transaction. XREC_CHECKPT. Each command structure allocated by an SQT thread is freed when its transaction context is removed. then you would need dsi_sqt_max_cache_size large enough to hold 100 closed transactions at a minimum and probably some number of open transactions that the DSI executer could be working on. As a result. you will probably want to set the DSI SQT cache equal to the SQT cache – or possibly even higher.1 dsi_sqt_max_cache_size < sqt_max_cache_size In most systems. This is especially true in high volume replication environments in which the rate of changes requires more than the default number of parallel DSI threads. Total transactions removed from the Closed queue. Maximum number of commands in a transaction scanned by an SQT thread. In this case. dsi_sqt_max_cache_size >= sqt_max_cache_size A notable exception to this is the Warm Standby implementation. Total commands read from SQM. Total transactions removed from the Open queue. 
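A minimal sketch of the connection-level override just described is shown below; the server and database names are placeholders, and like most connection-level parameters the new value normally only takes effect after the connection is suspended and resumed (verify against the RS version in use):

-- override the server-wide sqt_max_cache_size for this one connection
suspend connection to RDS.rdb
go
alter connection to RDS.rdb set dsi_sqt_max_cache_size to '2097152'
go
resume connection to RDS.rdb
go

For parallel DSI configurations, a rough starting point (expanded on below) is enough cache for about two transaction groups per DSI EXEC thread, that is 2 x number of DSI threads x dsi_max_xacts_in_group transactions at the average transaction size reported by MemUsedAveTran, rather than simply letting the connection inherit sqt_max_cache_size.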
the DSI-S thread will continuously be trying to fill the available DSI SQT cache from the outbound queue – often at the expense of yielding the CPU to the DSI EXEC.0. This could result in a situation where the DSI transaction rate is higher than the amount of rows read from the outbound queue. If the number of parallel DSI’s was set to 5. These are repeated here with DSI appropriate counters highlighted. the default dsi_sqt_max_cache_size setting is 0 – which means the DSI inherits the same cache size as the SQT cache limit (sqt_max_cache_size). However. the DSI thread can effectively process large amounts of row modifications as the load can be distributed among the several available DSI’s. Total commands in the last transaction completely scanned by an SQT thread. SQT cache usage is zero. if no transactions are active in SQT. unless the system only experienced short transactions allowing the primary sqt_max_cache_size setting to remain low at 1-2MB. SQT thread memory use. DSI SQT Monitor Counters Although the DSI SQT is not a separate threaded module. A second exception concerns the use of parallel DSI’s. In fact. When parallel DSI’s are used. As a result. As mentioned earlier. Average number of commands in a transaction scanned by an SQT thread. the dsi_sqt_max_cache_size setting for parallel DSI’s will still likely be less that sqt_max_cache_size. Counter CacheExceeded CacheMemUsed Explanation Total number of times that the sqt_max_cache_size configuration parameter has been exceeded. Commands include XREC_BEGIN. it is the DSI SQT thread that is actually sorting the transactions into commit order. XREC_COMMIT. in a WS topology. Total transactions added to the Closed queue. what we are likely to see is that the CacheMemUsed grows until dsi_sqt_max_cache_size is reached – at which point the CacheExceeded will jump to substantially large values.). this will explain how many possible transaction groups are in cache at a max (exluding partitioning rules. for 5 DSIEXECs and the default of 20 dsi_max_xacts_in_group. these values will be identical as each group of transactions as committed by the DSI makes room for the same number of transactions in to be read into the DSI SQT cache. If divided by the dsi_max_xacts_in_group. different origins. Total transactions removed from the Truncation queue. If the number of commands per transaction is fairly high. Total transactions added to the Truncation queue. DBAs should avoid raising the DSI SQT cache as the latency in processing transactions ahead of them will likely result in their being removed in any case. OpenTransTotal CloseTransTotal ReadTransTotal CacheMemUsed MemUsedAveTran CmdsAveTran Let’s take a look at how these might work by looking at the earlier insert stress test. However. Consequently. you want the DSI SQT cache to contain double the dsi_max_xacts_in_group transactions for each DSI EXEC thread. 151 . you would like to see 2 * 5DSIs * 20Xacts/Group or 200 transactions. Total transactions whose constituent messages have been removed from memory. Since the transactions are nearly all presorted. The number of cached transactions can be derived by dividing the CacheMemUsed by MemUsedAveTran. Unless this happens frequently due to larger transactions. This is useful for helping to size dsi_max_xacts_in_group when using parallel DSI’s. Ideally. These counters are the most appropriate ones to use to size the dsi_sqt_max_cache_size. these counters may differ until the cache fills. Total transactions removed from the Read queue.0. 
we would associate these values with needing to raise the SQT cache setting (i.e. If we have 200 or more transactions in cache. Total transactions added to the Read queue. These counters take on a different perspective. dsi_sqt_max_cache_size). Removal of transactions is most commonly caused by a single transaction exceeding the available cache.1 Counter OpenTransTotal ReadTransRmTotal ReadTransTotal TransRemoved Explanation Total transactions added to the Open queue. Once the cache fills. The only transactions likely to be removed will be large transactions too large to fit into the DSI SQT max cache size. large transaction groups only will compound any contention between the parallel DSI’s. TruncTransRmTotal TruncTransTotal Let’s take a look at some of these counters and how the can be used from the outbound queue/DSI perspective Counters CacheExceeded TransRemoved Performance Indicator Normally. etc. raising dsi_sqt_max_cache_size is likely of no benefit.Final v2. 432 2.144.223 12.6 4. TransTotal DSI.097.223 12.099.097. this would be a good indication that our SQT cache is oversized as we have nearly twice the number of transaction groups in 152 MaxCached Groups 0.4 4. as soon as transactions arrive. DSIXactInGrp – This is the effective dsi_max_xacts_in_group derived by dividing the number of “ungrouped” transactions as submitted by the source system by the number of transaction groups that the DSI-S created.408 2.0 DSI. the DSI SQT cache was quickly filled by the DSI-S – filled in about 10 seconds.1 Sample Time ClosedTrans Toral 11:37:47 11:37:57 11:38:08 11:38:19 11:38:30 11:38:41 11:38:52 11:39:03 11:39:14 11:39:25 0 2.6 4.223 12.223 12.6 37.152.0 2.7 4. However.968 2.7 4.099.223 12.224 2. dsi_max_xacts_in_group was set to 20. CacheMemUsed. then each second one would be moved from Closed Read making room for one more – and at the end of the 10 second sample interval we would show a total of 60 transactions having been “Closed” – the original 50 plus 10 due to processing.098. so these values to not reflect the number of transactions in cache – but the number of transactions that are in cache plus the number of transactions that have been moved to the next stage of the cache (Open Closed Read Truncate). as long as there were transactions in the queue to be delivered.1 37. you also need to realize that the number of Closed & Read transactions are over the full sample period. the cache remained full and the cache was “exceeded” frequently.7 4.223 12. If we were getting our maximum dsi_max_xacts_in_group.101.100. we would need 200 cached transactions to meet the full need.3 38. the derived statistics are in red in the above table.8 4. as the cache became full.504 2. CacheExceeded & TransRemoved – As you can see from the above.100. From that point. However. it helps to know that there were 10 parallel DSI’s. notice that there were 0 transactions removed – implying that this 2MB DSI SQT cache is likely oversized or is correctly sized.729 12. This is the first indication that the DSI SQT cache is possibly oversized from the system performance perspective as we see about 170 transactions in the cache on a regular basis but the DSIEXEC’s are only processing ~30 transactions per second (loosely extrapolating from the NgTransTotal over the time period – NgTransTotal to be discussed later – but it represents the number of original transactions prior to the DSI-S grouping them together).7 To evaluate this. 
CachedTrans – The actual number of transactions in the cache can be roughly derived by dividing the CacheMemUsed by the MemUsedAveTran.Final v2. the cache may be undersized according to our desired target! With 10 DSIEXEC’s active and a dsi_max_xacts_in_group of 20. let say we were delivering transactions at a rate of one per sec – if the cache quickly filled with 50 transactions.0.6 4.200 2.099.223 12. For example. and dsi_sqt_max_cache_size was set to 2.712 2.3 35.224 0 75 289 327 347 319 345 319 345 295 0 10. As you can see.097. Again. dsi_xact_group_size was set to 262.3 72.223 0 195 171 171 171 171 171 171 171 171 0 1 47 54 42 56 64 61 61 45 0 0 0 0 0 0 0 0 0 0 0 54 296 331 339 311 336 333 326 307 0 21 62 68 75 67 64 68 68 67 0 58 287 322 334 315 310 319 316 291 0. ClosedTransTotal & ReadTransTotal – During the first period of activity when the cache was filled (CacheExceeded=1) we see that the DSI SQT cache had 75 “Closed” transactions and only 54 “Read” transactions – demonstrating that the DSIEXEC’s were lagging right from the start. we are not getting anything close to our desired setting of 20 – likely some other DSI configuration value is affecting this. However.1 39.Ng TransTotal ReadTrans Total Cache MemUsed MemUsed AveTran Cache Exceeded Trans Removed DSIXact InGrp Cached Trans . When looking at these numbers. MaxCachedGroups – This metric is derived by dividing the CachedTrans by the number of transactions being grouped (DSIXactInGrp) – which yields the number of transaction groups at the current grouping that are in the DSI SQT cache.8 36. new transactions could only be read from the queue into the SQT cache at the same rate that the DSIEXEC’s could deliver them – resulting in the situation we described before in which the Closed ≈ Read.1 36.920 2. Let’s take a look at what these counters are telling us.2 37. 6 66.493 2.385 10.081 1.696 1.664 3 2.578 5.097.849 1.060 1.490 2.104 2.906 4.120 11.483 2. TransTotal DSI.0 1.578 5.NgTransTotal ≈ DSI.922 10.400 1.097.0 20.098.860 1.0 57.142 2. DSI SQT cache is slightly undersized for the target performance.760 1.0 0. However.574 1.820 19.0 1.Ng TransTotal 2 1.100 11.477 2.328 0 0 143.844 0 791 414 470 845 845 1.413 17.926 1.561 1.333 0 0 0 0 0 0 2.166.0 923.179 2.432 2.137 0 65 84 2 102 357 480 1.800 13.922 1.944 2.0 20.598 3.574 1.536 (default).Final v2.0 68.873 3.0 14.192 1.344. since we are only averaging about 4 transactions per group.579 1.8 31.097.9 66. but is oversized for the way the system is performing – consequently it is some other setting that this restricting processing.348 10.702 1.109 2.482 2.328.7 25.378 3.760 5.6 56.0 1.0 Cached Trans 153 .0 DSI.408 2.746 1.0 1.417 3.0 1.339 1. which should allow grouping.574 1. we can see we aren’t doing any transaction grouping whatsoever – DSI.574 1.344 1.873 3.740 5.012 1.040 11.593 1.9 20. we can see we are grouping transactions – so perhaps the configuration was changed or the transaction profile differs enough to change how transactions are grouped.747 1.430 DSI.580 6.725.034 1.574 13.899 10.120 996 456 0 0 0 0 0 0 0 0 0 0 3 1.442 2 1.738 1. let’s take a look at the customer example we were looking at earlier: Sample Time ClosedTrans Toral MaxCached Groups 0.348 10.9 13.123 2.097.098. But rather DSIXact InGrp 1. As a result.098.0 1.746 1.0 20.0 1.708 1.0 1.023 1.0 1.792 3.573 1.881 3.0 0.478 2.592 2.547 6.920 2.920 2 1. 
Now.030 1.780 ReadTrans Total Cache MemUsed MemUsed AveTran Cache Exceeded Trans Removed 19:02:07 19:07:08 19:12:10 19:17:12 19:22:13 19:27:14 19:32:16 19:37:18 19:42:19 19:47:21 0 1.430 Then the next day. any DSI SQT cache above the bare minimum is excessive.0 20.0.899 10.772 3 148 115 69 101 187 276 652 579 339 11.030 1.605 5.232 1.8 MaxCached Groups 0. TransTotal ReadTrans Total Cache MemUsed MemUsed AveTran Cache Exceeded Trans Removed 19:18:31 19:23:32 19:28:34 19:33:36 19:38:38 19:43:40 19:48:42 19:53:44 19:58:46 20:03:48 0 1.0 1.530 10. it looks like the following: Sample Time ClosedTrans Toral DSI.481 1.0 Ouch!!! In the first sample (day 1).432 2.573 0 0 1 0 0 57 923 1.379 10.0 1.405 3.920 1.580 6.098.1 memory as our effective dsi_max_xacts_in_group.0 1.098.Ng TransTotal 3 1.333.432 2. the number of cached groups would be between 8 & 9.2 59.333 1. But in the second sample (day 2).TransTotal – despite the fact that dsi_max_xacts_in_group=20 and dsi_xact_group_size=65.748 5.273 1.920 1.023.747 1.328 1.567 1.0 1.468 2.432 2.5 60.030 1.599 5.069 0 0 0 0 0 0 0 0 0 0 2 1.098. if we succeed in raising this effective value to even 10 (half of the target dsi_max_xacts_in_group) the number of cached groups drops to 17 (still higher than dsi_num_threads=10 though) – and if we reach our target of 20.0 Cached Trans DSIXact InGrp 1.7 42.432 2.0 0.520 13.5 16. So. 024 172.000 xact/5 mins).794 18.CmdsRead.0.CmdsRead and DSI.744 9.205 26.407 1.632.682. we probably should start by figuring out why transaction grouping is not happening – as well as see if we can’t increase the transaction rate to something above 33 transactions per second (~10.632. Now.424.768 0 0 3 0 741 2.112 442.784 40.524 7.776 8.688 9.871 Source Cmds MaxTran Sample Time Source SQT CmdsTotal 21:40:46 21:42:47 21:44:48 21:46:49 21:48:50 21:50:50 21:52:51 21:54:52 21:56:53 22:02:21 22:04:22 22:06:22 22:08:23 22:10:24 22:12:25 22:14:26 22:16:26 22:18:27 5.524 7.192 442.000 13. Likely.512 13.225 14.768 13.187 8.615 27.869 0 19 19 19 19 0 0 0 0 0 3 0 132 104 105 105 106 105 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 57.442 2.873 3.288 406.564 18.180 13.873 5. The real problem is the latency at the DSIEXEC in delivering and executing the SQL at the replicate DBMS – as can be seen by the lag between the destination SQM.512 0 0 0 0 0 0 638.845 5.256 13.359 4.357 1. The last may seem like a strange comment (how could we know this is attainable?) – but considering the insert stress test target system above was a laptop and it was processing 30 transactions per second (and then barely working) and the customer system is likely a server of considerable more capacity.516 3.298 4.720 8. The problem was the SQT was never using more than about 500KB of cache! Now.125 8.868 5.1 lack some of the more granular details around the SQT Open.744 6. the sqt_max_cache_size was raised from 10MB to 13MB to attempt to get better throughput.999 18.200 481.867 5.1 system and RS 12.664 Source SQT CacheExceeded Source SQT TransRemoved DSICmdsRead 7.632.632.960 13.1 customer – who unfortunately was only collecting a few modules of their RS 12.683 9.326 3. This sample comes to us courtesy of a RS 12.191 8. In fact.008 18. Source SQT CacheMemUsed DSI SQT Cache 13.078 Source SQM CmdsWrirren Dest SQM CmdsWritten 5.256 13.797 324 1 2 2 3 0 6 0 844 3. Closed.648 59.684 0 0 3 0 747 3. the same throughput could be achieved by setting sqt_max_cache_size to 4MB and dsi_sqt_max_cache_size to 2MB.632.187 8. 154 Dest SQM CmdsRead .866 8.632.999 6. 
that doesn’t mean only 500KB is necessary – it means that setting it higher actually wouldn’t help.648 0 0 0 0 0 0 0 0 531.962 18.CmdsWritten or SQM.510 7.632.088 59.411 1.364 2.1 than reducing the DSI SQT cache.256 12.192 8.795 324 0 0 0 0 0 3 0 842 3.075 6. Read and Truncate lists.866 5. as you can see all it did was allow the DSI-S to fill up 13MB of cache waiting for the DSIEXEC to catch up.795 342 0 0 0 0 0 3 0 747 3.366 3.632.837 2.112 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7.999 6. let’s take a look at probably what is a more normal sample that illustrates the point we were making earlier about SQT cache & DSI cache being oversized.000 13.632.Final v2.322 In the above system.240 13. each of the above would get turned into separate individual transactions and submitted as follows (RS functions listed vs.00.00.”Sep 1 1 1 1 1 1 1 1 2000 2000 2000 2000 2000 2000 2000 2000 14:20:36.”Sep (123456789.$119.”Sep (123456789.Chk.000001.Chk.324”.Chk.105) 14:20:36.328”.000008. The obvious question is “Why bother doing this?” The answer simply is to decrease the amount of logging on the replicate system imposed by replication and to improve the transaction delivery rate.$5.103) 14:20:36.1 DSI Transaction Grouping Why Group Transactions One function of the main DSI thread is to group multiple independent transactions from the primary into a single transaction group at the replicate.”Sep (123456789.00.00.321”.000006.000007.”Sep (123456789.”Sep (123456789.Chk.00.Chk.000003.322”.327”.$12.000002.108) As you notice.106) 14:20:36. without transaction grouping.107) 14:20:36.323”.00.000005.$250. Now the question is.Chk. these fictitious transactions all were applied during an extremely small window of time.0.101) 14:20:36. what would Replication Server do? The answer is.32.$1132.104) 14:20:36.$395.000004.325”. Consider the following illustration of the difference between the primary database transaction and the DSI transaction grouping: Primary Database Transactions begin tran order_tran insert into orders values (…) insert into order_items values (…) insert into order_items values (…) update orders set total=… DSI Transaction Grouping begin tran insert into orders values (…) insert into order_items values (…) insert into order_items values (…) update orders set total=… insert into ship_history values (…) update orders set status=… insert into orders values (…) insert into order_items values (…) insert into order_items values (…) update orders set total=… insert into orders values (…) insert into order_items values (…) insert into order_items values (…) update orders set total=… commit tran order_tran begin tran ship_tran commit tran ship_tran begin tran order_tran Insert into ship_history values (…) Update orders set status=… insert into orders values (…) insert into order_items values (…) insert into order_items values (…) update orders set total=… insert into orders values (…) insert into order_items values (…) insert into order_items values (…) update orders set total=… commit tran order_tran begin tran order_tran commit tran commit tran order_tran Figure 37 – Primary vs.”Sep (123456789.00. SQL): rs_begin rs_insert rs_commit rs_begin rs_insert rs_commit rs_begin rs_insert rs_commit rs_begin rs_insert rs_commit rs_begin rs_insert rs_commit rs_begin rs_insert – insert for check 101 – insert for check 102 – insert for check 103 – insert for check 104 – insert for check 105 – insert for check 106 155 .Final v2.102) 14:20:36. 
Replicate Transaction Nesting Impact of DSI Transaction Grouping In the example on the right.326”. Consider the worst-case scenario of several atomic transactions such as: insert insert insert insert insert insert insert insert into into into into into into into into checking_acct checking_acct checking_acct checking_acct checking_acct checking_acct checking_acct checking_acct values values values values values values values values (123456789.$99. Replication Server’s DSI thread has consolidated the individual transactions into another transaction (begin/commit pair underlined) grouping the transactions together.$125.Chk.”Sep (123456789.Chk. So while the primary system can take full advantage of multiple CPU’s. Secondly.102) (….104) (….wait for success update rs_lastcommit … commit transaction -.107) (….wait for success begin tran insert into checking_acct -.wait for success begin tran insert into checking_acct -. At the replicate. Consider each of the following primary database transaction scenarios: Concurrent User – Concurrent users applied each transaction at the primary.wait for success update rs_lastcommit … commit transaction -. the batching is essentially undone as each of the atomic commits results in 2 network operations per transaction. while this will be discussed in more detail in the next section.Final v2.wait for success update rs_lastcommit … commit transaction -.wait for success update rs_lastcommit … commit transaction -.wait for success begin tran insert into checking_acct -.103) (…. the amount of I/O has clearly doubled. a single user applies all the transactions at the primary in a large SQL batch. the replicate simply has no concurrency. the delivered transaction rate would not match that at the primary system. Single User/Batch – In this scenario. In regards to the former. only a single user is applying the transactions. rs_commit calls a stored procedure rs_update_lastcommit. This could be significant as anyone familiar with the performance penalties of not batching SQL can attest. Replication Server does not batch the outer commit statements with the transaction batch if batching is enabled.wait for success begin tran insert into checking_acct -.wait for success (….wait for success update rs_lastcommit … commit transaction -.101) (…. Consequently.1 rs_commit rs_begin rs_insert – insert for check 107 rs_commit rs_begin rs_insert – insert for check 108 rs_commit Which does not look that bad until you realize two very interesting facts: 1) the contents of the rs_commit function.wait for success begin tran insert into checking_acct -.wait for success update rs_lastcommit … commit transaction -. this would add to the problem.105) (….wait for success begin tran insert into checking_acct -.108) Why is this a problem? First. At the replicate.wait for success begin tran insert into checking_acct -. Consequently.wait for success update rs_lastcommit … commit transaction -. group commits for the transaction log and every other feature of ASE to improve concurrency. 156 . which updates the corresponding row in the replication system table rs_lastcommit. and 2) how rs_commit is sent as compared to other functions. As far as the second point. the replicate database would actually be executing something similar to: begin tran insert into checking_acct -. if the replicate system was already experiencing I/O problems.0.106) (….wait for success update rs_lastcommit … commit transaction -. 10. 4. 1. A transaction group will end any time one of the following conditions is met: 1. 15. 
While we can see the benefits of this, some may have been quick to notice that the individual transactions "seem" to have gotten lost. Actually, they are still there and tracked. One reason for this is that if any individual statement in the above group of transactions fails, the entire group is rolled back and the individual transactions are submitted one at a time until the point of failure (again). Grouping also improves throughput because the replication process within the replicate database server spends less time waiting for I/O completion.

Key Concept #15: Transaction grouping reduces I/O caused by updating replication system tables and the corresponding logging overhead at the replicate system.

Simply put, transaction grouping is critical to replication performance – although it can be an issue with parallel or multiple DSI's as discussed later.

So why didn't RS engineering simply submit it as nested transactions? Several reasons:

•  The nested commits would have prevented parallel DSI's from working at all, as it would have guaranteed contention on rs_lastcommit
•  Not all DBMS's support nested transactions (i.e. ODBC interfaces to flat files)
•  Rolling back a nested transaction is not possible (read the ASE docs carefully – you can roll back to a savepoint, but not a nested transaction – described later in procedure replication)

DSI Transaction Grouping Rules

Unfortunately, not every transaction can be grouped together. Aborted, orphan, routing, database/log dump, and subscription transactions cannot be grouped. A transaction group will end any time one of the following conditions is met:

1.  The predefined maximum number of transactions allowed in a group has been reached.
2.  The current or the next transaction will make the total size of the transactions (in bytes) exceed the configured group size.
3.  The next transaction is from a different origin.
4.  The next transaction has a different user/password.
5.  The current or the next transaction is a rollback.
6.  The current or the next transaction is an orphan transaction.
7.  The current or the next transaction is a routing transaction.
8.  The current or the next transaction is a dump/load transaction.
9.  The current or the next transaction is a subscription (de)materialization transaction marker.
10. The current or the next transaction is a subscription (de)materialization transaction queue end marker.
11. The current or the next transaction has no begin command (i.e. it is a special RS-to-RS transaction).
12. The current or the next transaction is on disk rather than cached in the DSI/SQT closed queue.
13. The first transaction has the IGNORE_DUP_M mask on.
14. A transaction partitioning rule determines that the next transaction cannot be grouped with the existing group.
15. There are no more transactions in the DSI queue.
16. A timeout expires.

While this appears to be quite a long list, the rules can simply be paraphrased: in order for transactions to be batched together, all of the following six conditions must be met:

1.  The transactions are from the same origin.
2.  The transactions will be applied at the replicate with the same username and password.
3.  The transactions are cached in the DSI/SQT closed queue (not on disk).
4.  The transaction group stays within the lesser of dsi_xact_group_size and dsi_max_xacts_in_group – the lesser of the two limits will cause the grouping to terminate.
5.  Neither the current nor the next transaction is a system transaction (rollback, orphan, routing, dump/load, or subscription (de)materialization marker).
6.  A transaction partitioning rule does not prevent the next transaction from being grouped with the existing group.
While the first condition makes sense simply from a performance aspect, and the third is fairly easy, the second condition requires some thought. Earlier, one of the conditions that causes transactions not to be grouped was stated as "The next transaction has a different user/password". Some find this confusing, assuming that it refers to the user who committed the transaction at the primary system. It does not. It refers instead to the user that will apply the transaction at the replicate. At this juncture, many might say "Wait a minute, I thought the maintenance user applies all the transactions?" This is mostly true. During normal operations, the maintenance user will be the login used to apply transactions at the replicate – thereby allowing full transaction grouping capabilities. Note, however, that currently some transactions are not applied by the maintenance user. For example, in Warm Standby systems, DDL transactions that are replicated are executed at the standby system by the same user who executed the DDL at the primary; this assures that the object ownership is identical. Asynchronous Request Functions (discussed later) are also applied by the same user as executed at the originating system. In this latter case, it has less to do with the specific user and more to do with ensuring that the transaction is recorded using a different user login than the maintenance user – thereby allowing the changes to be re-replicated back to the originating or other systems without requiring the RepAgent to be configured for "send_maint_xacts_to_replicate". For that reason, it should be extremely rare – and possibly never the case – that a transaction group is closed early due to a different user/password.

Now that we understand this, the next question might be "Why can't we group transactions from different source databases?" The reason that the transactions have to be from the same origin is due to the management of the rs_lastcommit table and how the DSI controls assigning the OQID for the grouped transaction. Remember, when transactions are grouped together, they are grouped in memory – not in the stable queue. When the DSI groups transactions together, it uses the last grouped transaction's begin record to determine the OQID for the grouped transaction; since the rs_commit function updates only a single row in the rs_lastcommit table for the source database of the transaction, the OQID of the last transaction's begin record is used for the entire group. The reason is that on recovery, the Replication Server will issue a call to rs_get_lastcommit to determine the last transaction that was applied. Consider a default grouping of 20 transactions applied to the replicate database server as a single group, after which the replicate database immediately shuts down. On recovery, if the OQID of the first transaction had been used, then the first 19 transactions would all be duplicates – and not detected as such by the Replication Server, as that was the whole reason for the comparison of the OQID in the first place!! As a result, the first 19 transactions would either cause duplicate key errors (if you are lucky) or database inconsistencies if using function strings. In short, not using the last transaction's OQID could result in duplicate row errors or an inconsistent database.

Following that logically along, the DSI does not simply collect all of the closed transactions from the same source. If the third transaction in a series is from a different source database, then the group will end at two – even if the next four transactions are from the same source database as the first two. This leads us to the following concept:

Key Concept #16: Outbound queues that are heavily fragmented with inter-dispersed transactions from different source databases will not be able to effectively use transaction grouping.

This may or may not be an issue. For non-parallel DSI implementations, the smaller the group size, the less efficient the replication mechanism due to rs_lastcommit and processing overhead. If using parallel DSI's and a low dsi_max_xacts_in_group to control concurrency, this mix of transactions may not be an issue – especially if dsi_serialization_method is set to 'single_transaction_per_origin'. As you will see later, in a fragmented queue with considerable inter-dispersed transactions from different databases, the DSI will be applying transactions in very small groups, which suggests that increasing dsi_max_xacts_in_group and similar parameters in such cases may prove fruitless.

This leaves only the first three conditions as ones that apply to most transactions. The fourth condition is discussed in the next section on tuning transaction grouping. The fifth condition is due to system level reprocessing or ensuring the integrity of the replicate system during materialization of subscriptions or routes and is rare – consequently it is not discussed further. The last condition is discussed in the section on parallel DSI's later in this document.

Tuning DSI Transaction Grouping

Prior to Replication Server 12.0, there really wasn't a good way to control the number of transactions in a group. The only tuning parameter available attempted to control transaction batching by limiting the transaction batch size in bytes – a difficult task with tables containing variable width columns and considering the varying row sizes of different tables. With version 12.0 came the ability to explicitly specify the number of original transactions that could be grouped into a larger transaction: Sybase added the dsi_max_xacts_in_group parameter and suggests that you set dsi_xact_group_size to the maximum and control transaction grouping using dsi_max_xacts_in_group. These connection level configuration parameters are listed below.

dsi_xact_group_size
Default: 65,536; Recommended: 2,147,483,647 (max)
Specifies the maximum number of bytes, including stable queue overhead, to place into one grouped transaction. A grouped transaction is multiple transactions that the DSI applies as a single transaction. A value of "-1" means no grouping. At first the default may appear fairly large; however, it includes stable queue overhead – which can be significant, as the queue may require 4 times the storage space of the transaction log – and it can be difficult to control the number of transactions with this parameter due to the varying row widths of different database tables. If you don't adjust dsi_xact_group_size upward, the byte limit rather than dsi_max_xacts_in_group may be what terminates the group.

dsi_max_xacts_in_group
Default: 20; Max: 100
Specifies the maximum number of transactions in a group; a transaction group can contain at most dsi_max_xacts_in_group transactions. Allowing a larger transaction group size may improve data latency at the replicate database. dsi_max_xacts_in_group can be raised from the default of 20 if using a single DSI – and perhaps should be if the system is performing a lot of small transactions. In parallel or multiple DSI situations, however, contention is likely to occur in update heavy environments – for example updates, or inserts with isolation level three due to next key (range) or infinity locks – and this parameter may need to be lowered to reduce inter-thread contention. The default is a good starting point; lower values generally should be considered only if primarily updates are replicated, parallel DSI's are in use, and contention is an issue. In all cases, the lesser of the two limits (dsi_xact_group_size and dsi_max_xacts_in_group) will cause the transaction grouping to terminate.
dsi_sqt_max_cache_size
Default: 0; Recommended: see text
The number of bytes available to the DSI for managing the SQT open, closed, read and truncate queues (a value of 0 means the server-wide sqt_max_cache_size is used). This impacts the DSI SQT processing by also being a limiter on the transaction groups that are cached in memory waiting for the DSIEXEC's; if the DSI SQT cache is too small, the DSIEXEC's may not be able to group transactions to the number specified in dsi_max_xacts_in_group. A good starting point is to figure on 500-750KB per DSIEXEC thread in use, with a minimum of 1MB. This may seem like an awfully small amount, but remember from the earlier example that 2MB was enough to cache ~30 transaction groups for one customer. From this starting point, you will need to monitor the approximate transactions and transaction groups in cache and increase dsi_sqt_max_cache_size only when it can no longer hold 2 * dsi_max_xacts_in_group * num_dsi_threads transactions.

dsi_partitioning_rule
Default: none; Recommended: see text
Specifies the partitioning rules (one or more) the DSI uses to partition transactions among available parallel DSI threads. Valid values are: origin, origin_sessid (if source is ASE 12.2+), time, user, name, and none. This setting will be described in detail in the section on parallel DSI's.
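As a sketch of how these connection-level settings are applied (the connection name RDS.rdb and the cache figure are illustrative only – the cache value assumes roughly 4 DSIEXEC threads at ~750KB each; most DSI parameters require the connection to be suspended and resumed to take effect):

    suspend connection to RDS.rdb
    go
    alter connection to RDS.rdb set dsi_xact_group_size to '2147483647'
    go
    alter connection to RDS.rdb set dsi_max_xacts_in_group to '20'
    go
    alter connection to RDS.rdb set dsi_sqt_max_cache_size to '3145728'
    go
    resume connection to RDS.rdb
    go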
DSI Grouping Monitor Counters

To help determine the efficiency of DSI transaction grouping, the following monitor counters are available.

CmdGroups – Total transaction groups sent to the target by a DSI thread. A grouped transaction is multiple transactions that the DSI applies as a single transaction. This counter is incremented each time a 'begin' for a grouped transaction is executed.

CmdGroupsCommit – Total command groups committed successfully by a DSI thread.

CommitsInCmdGroup – Total transactions in groups sent by a DSI thread that committed successfully.

GroupsClosedBytes – Total transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_xact_group_size.

GroupsClosedLarge – Total transaction groups closed by a DSI thread due to the next transaction satisfying the criteria of being large.

GroupsClosedMixedMode – Total transaction groups closed by a DSI thread because the current group contains asynchronous stored procedures and the next tran does not, or the current group does *not* contain asynchronous stored procedures and the next transaction does.

GroupsClosedMixedUser – Total asynchronous stored procedure transaction groups closed by a DSI thread due to the next tran user ID or password being different from the ones for the current group.

GroupsClosedNoneOrig – Total trxn groups closed by a DSI due to no open group from the origin of the next transaction (i.e. we have a new origin (source db) in the next trxn), or the RS scheduler forced a flush of the current group from the origin, leaving no open group from that origin. Note that the latter condition could cause transaction groups to be flushed prior to reaching dsi_max_xacts_in_group – and likely will be the most common cause for groups closed identified by this metric.

GroupsClosedResume – Total transaction groups closed by a DSI thread due to the next transaction following the execution of the 'resume' command – whether the 'skip', 'display' or execute option was chosen.

GroupsClosedSpecial – Total transaction groups closed by a DSI thread due to the next transaction being qualified as special – orphan, rollback, ddl, marker, etc.

GroupsClosedTranPartRule – Total transaction groups closed by a DSI thread because of a Transaction Partitioning rule.

GroupsClosedTrans – Total transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_max_xacts_in_group.

GroupsClosedWSBSpec – Total transaction groups closed by a DSI thread for a Warm Standby due to the next transaction being special – empty, duplicate, an enable replication marker or subscription materialization marker, or ignored due to duplicate detection.

NgTransTotal – Total non-grouped transactions read by a DSI Scheduler thread from an outbound queue. If grouping is disabled, this is the total transactions in the queue.

PartitioningWaits – Total transaction groups forced to wait for another group to complete (processed serially based on a Transaction Partitioning rule).

TransInCmdGroups – Total transactions contained in transaction groups sent by a DSI thread. The number of trxns in a group is added to this counter each time a 'begin' for a grouped transaction is executed.

TransSucceeded – Total transactions applied successfully to a target database by a DSI thread. This includes transactions that were committed or rolled back successfully.

TransTotal – Total transaction groups generated by a DSI Scheduler while reading the outbound queue. This counter is incremented each time a new transaction group is started.

YieldsScheduler – Incremented each time the main DSI Scheduler body yields following the dispatch of closed transaction groups to DSI Executor threads.
This counter records the number of retry attempts. mainly with the addition of more timing counters: Counter DSIReadTranGroups DSIReadTransUngrouped DSITranGroupsSucceeded Explanation Transaction groups read by the DSI. Transactions in groups sent by a DSI thread that rolled back successfully. I. This includes transactions that were successfully committed or rolled back according to their final disposition. Ungrouped transactions read by the DSI. GroupsClosedMixedUser GroupsClosedMixedMode GroupsClosedTranPartRule Transaction groups closed by a DSI thread because of a Transaction Partitioning rule. Trxn groups closed by a DSI due to no open group from the origin of the next trxn. 161 .e. Transaction groups applied successfully to a target database by a DSI thread. the DSI thread performs postprocessing for the failed command. If grouping is disabled. GroupsClosedTrans CmdGroupsRollback RollbacksInCmdGroup GroupsClosedLarge Transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_max_xacts_in_group. Transactions committed successfully by a DSI thread. Transaction groups closed by a DSI thread due to the next transaction satisfying the criteria of being large. Transactions contained in transaction groups sent by a DSI thread. Command groups rolled back successfully by a DSI thread. If grouping is disabled. Depending on error mapping. We have a new origin in the next trxn. Transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_xact_group_size. Grouped transactions retried to a target server by a DSI thread. Transaction groups sent to the target by a DSI thread. When a command fails due to data server errors. or the Sched forced a flush of the current group from the origin leaving no open group from that origin. grouped and ungrouped transaction counts are the same. DSICmdsSucceed DSICmdsRead GroupsClosedBytes GroupsClosedNoneOrig Commands successfully applied to the target database by a DSI. grouped and ungrouped transaction counts are the same. Grouped transactions failed by a DSI thread.Final v2. This counter is incremented each time a 'begin' for a grouped transaction is executed. A transaction group can contain at most dsi_max_xacts_in_group transactions.0. 1 Counter GroupsClosedWSBSpec Explanation Transaction groups closed by a DSI thread for a Warm Standby due to the next transaction being special . etc.orphan. Time spent by the DSI/S dispatching a regular transaction group to a DSI/E. Similarly. The second is incremented whenever a group is closed due to reaching dsi_max_xacts_in_group. the default value for dsi_large_xact_size of 100 is simply too small – and in fact. it isn’t being used. but someone may have decreased it).0. 'display' or execute option chosen. it is not likely that too many threads will actually be used. TransSucceeded (DSICmdsRead. Even without parallel DSI. The first set point to likely configuration issues. This includes time spent finding a large group to dispatch. If you see very many GroupsClosedBytes. the maintenance user – typically DDL commands. you may have to look to the others for the explanation. However. As a result. rollback. the most often it is referring to the fact the scheduler forced a flush. XactsInGrp.). no matter what you have dsi_max_xacts_in_group set to.0 formulas/names in parenthesis): CmdsRead. but the point is to avoid GroupsClosedBytes and if GroupsClosedNoneOrig or GroupsClosedTran are not where expected. 
Time spent by the DSI/S dispatching a large transaction group to a DSI/E. Time spent by the DSI/S loading SQT cache. it may provide a reason why even though you have a well defined dsi_max_xacts_in_group.empty. The first – while it may refer to the fact that the next transaction is from a different origin (corporate rollup). or a enable replication marker or subscription materialization marker or ignored due to duplication detection. Number of DSI/E threads put to sleep by the DSI/S prior to loading SQT cache. Transaction groups closed by a DSI thread due to the next transaction being qualified as special . The second (GroupsClosedMixedMode) refers to asynchronous request functions. duplicate. the more the merrier. it is likely because you have not adjusted dsi_xact_group_size from its default of 64K to something more realistic such as 256K.Final v2. There are other ‘GroupClosed’ counters. GroupsClosedResume GroupsClosedSpecial DSIFindRGrpTime DSIDisptchRegTime DSIDisptchLrgTime DSIPutToSleep DSIPutToSleepTime DSILoadCacheTime Let’s take a look at some of these counters and how the can be used from the outbound queue/DSI perspective as well as clarifying some of these that appear to be confusing. GroupsClosedNoneOrig and GroupsClosedTrans will be the most common causes. ddl. if the next set appears. the most common counters in the DSI include (15. The next sets of counters will explain why a group of transactions were closed. The first (GroupsClosedMixedUser) happens whenever the DSI has to connect as another user vs. 162 . One of the keys to parallel transaction use is to increase this parameter as much as possible (until contention starts) – at lower settings. GroupsClosedTrans GroupsClosedMixedUser. Time spent by the DSI/S finding a group to dispatch. etc. GroupsClosedLarge GroupsClosedNoneOrig. these are the most common. so they can be ignored if tuned properly. on the other had is clearly tied to configuration settings – specifically dsi_max_xacts_in_group. marker. considering the overhead during the commit phase (updating rs_lastcommit. etc. By comparing the number of ungrouped transactions (NgTransTotal) to the number of grouped transactions (TransTotal) we can observe much transaction grouping is going on. Transaction groups closed by a DSI thread due to the next transaction following the execution of the 'resume' command .whether 'skip'. The first set is mostly (again) monitoring type counters – CmdsRead should match SQM CmdsWritten (for the outbound queue) but likely won’t as the most frequent source of latency is the DSIEXEC due the replicate database. a low value here will prevent the grouping. These DSI/E threads have just completed their transaction. Time spent by the DSI/S putting free DSI/E threads to sleep. DSITranGroupsSucceeded) XactsInGrp = NgTransTotal / TransTotal (DSIReadTransUngrouped/DSIReadTranGroups) GroupsClosedBytes. arguably large transactions are not effective in any case so you should set this to the upper limit of 2 billion and forget about it. Other than the SQT aspects. GroupsClosedMixedMode While there are others. A lot of these may indicate that dsi_max_xacts_in_group is too low (the default of 20 is typically plenty. 3 98.0 0. GroupsClosedBytes – This counter is incremented any time the transaction group is closed because the number of bytes in the transaction group exceed dsi_xact_group_size.097.8 4. 
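As a quick worked example (the figures below are invented purely for illustration), suppose a sample interval shows NgTransTotal = 2,000 and TransTotal = 500:

    XactsInGrp = NgTransTotal / TransTotal = 2,000 / 500 = 4 transactions per group

With dsi_max_xacts_in_group at 20, an average of 4 means most groups are closing early. If, over the same interval, GroupsClosedTrans and GroupsClosedBytes are near zero while GroupsClosedNoneOrig dominates, the configured limits are not the constraint – the scheduler is flushing open groups – so simply raising dsi_max_xacts_in_group is unlikely to help.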
One way to think of the differences between TransTotal/NgTransTotal and CmdGroups/TransInCmdGroups is that TransTotal/NgTransTotal represents the planned transaction grouping where as CmdGroups/TransInCmdGroups represent the actual. some of the more common reasons are listed in the above table.0 0.099.1 95.504 2.5 98.1 Let’s take a look at how these might work by looking at the earlier insert stress test.0 0.0 0.0 97.3 0 17 63 68 75 67 64 68 68 67 0 51 289 322 334 315 310 319 316 291 0.7 3. Slight differences may occur.9 2.0 0.099. CachedTrans & DSIXactInGrp – Repeated from the DSI SQT cache metrics.0 0.224 2.0 0.7 4. In addition – and a more common cause in WS systems . as variable substitution may cause the original grouping to exceed the byte limit on the transaction group.0 104.5 100.0 0.0 0.0 0.0 0.this counter is incremented when the DSI-S can’t find an open transaction group from the same origin – a situation usually caused when the scheduler forces the DSI to close pending transaction groups and send them to the DSIEXEC’s.0 CachedTrans Sample Time TransInCmd Groups Groups ClosedLarge Groups ClosedTrans Groups ClosedBytes Groups ClosedOrig 163 . these derived values are calculations of the number of transactions in the DSI SQT cache (based on average memory used per transaction) and the average number of transactions grouped together by the DSI thread respectively.9 6. CacheMem Used DSI CmdGroups Yields Scheduler 0 103 389 418 433 414 436 416 421 396 DSIXact InGrp 11:37:47 11:37:57 11:38:08 11:38:19 11:38:30 11:38:41 11:38:52 11:39:03 11:39:14 11:39:25 0 2.0 0.7 4.224 0 195 171 171 171 171 171 171 171 171 0.0 4.0 0. DSI.101. this counter is incremented anytime a group is closed due to the number of transactions exceeding dsi_max_xacts_in_group.0 0. DSIXactInGrp (a derived statistic based on dividing NgTransTotal by TransTotal) represents a planned transaction grouping ratio vs.920 2. GroupsClosedLarge – This counter is incremented any time a group of transactions is closed due to the fact that the next transaction is considered large – either because it exceeds dsi_large_xact_size or because it involves text/image data (which automatically qualifies it as a large transaction).100.0 0. however.CmdGroups & TransInCmdGroups – These metrics report the actual number of transaction groups sent by the DSI to the DSI EXEC – and operate very similarly to the metrics DSI. GroupsClosedOrig – This counter is incremented any time a group is closed because the next transaction to be delivered to the destination comes from a different source database (think corp rollup).0 0.0 0. the dsi_max_xacts_in_group was 20 – and we are hoping to determine (if we can) why the actual value is more in the 4-5 range than close to 20.Final v2.144 (256KB) – which although much smaller than the suggested maximum setting. As you may remember. To that extent.0 0.4 4.0 0.1 97. there are some (few) that do reach the maximum and likely many in between.5 6.0 0.0 0.0.712 2.408 2.0 0.0 0.7 4.0 0.099.200 2.098.0 1.6 4.5 The only derived columns above are the same in the previous example from the SQT – in fact the first four columns are repeated – partially to put in context some of the others. That is the case here as the system in GroupsClosed Resume 0. did not contribute to the reason the transaction grouping was less than desired.0 0. GroupsClosedTrans – Similar to above.100.5 93. the dsi_xact_group_size was 262. While there are additional DSI metrics for GroupsClosed______ not listed above.0 0.6 4.0 2. 
In the case above. Note especially. that the GroupsClosed______ metrics are presented as a percentage (of 100%) and not the actual values (rationale is that it is easier to recognize the primary reasons this way) – hence the blue color highlighting the metrics above.TransTotal and NgTransTotal.0 0. we see that 1. actual – while the actual may be slightly deviated.0 0.097.432 2.0 0.5-7% of the groups reached the maximum of 20 – so despite the computed average of 4 transactions per group. Interestingly.6 4.6 2.8 93.8 1.0 0. it is well within a margin of error.0 0.968 2. 0 1.0 0.2 0. From the above.1 question was a WS implementation in isolation – so no other connection existed to cause this counter to be incremented.328 1.396 21. The reason for this is that often times a transaction group needs to be rolled back and applied as individual transactions up to the point of error – and then the DSI is suspended.8 69. Perhaps other DSI or DSI EXEC counters will help us learn why the scheduler is doing this – but we will look at them late.0 1.098.746 1.0 100.348 10.0 52. YieldsScheduler – This metric is illustrated here to show how often the DSI is yielding after a group has been submitted to a DSI EXEC.0 0.055 21.1 115.0 0. Now.920 1.0 0.6 0.0 0.0 0.0 100.0 0.920 2.849 0.0 0.430 0.0 2 1.452 6.0 0.0 164 Yields Scheduler 9 702 638 DSIXact InGrp Yields Scheduler 6 DSIXact InGrp .232 0 791 414 1.0 0.432 2.702 1.397 2.281 Almost instantly we see that most of the transactions were closed because the next transaction followed a ‘resume’ command – rather odd and suggestive of a significant number of errors.104 2.0 0.Final v2.1 100.0 11. As a result.0 0.333 1.0 0.097. Some that are observant might have noted that some of these percentages are above 100% .0 0.574 1.0 0.0 0.0 1.1 98.344 1.899 10. as mentioned earlier – transaction groups are automatically tried individually until the individual transaction with the problem re-occurs.0 1.578 5.TransTotal vs. Note as well that the ratio of YieldsScheduler to transactions ranges from slightly more than 1 to 2. it looks like the scheduler is closing transaction groups prior to reaching dsi_max_xacts_in_group – but otherwise no real indication of what the cause may be.0 0.0 0. However.1 99.0 0. when the DSI is resumed.030 1.0 3 148 115 3 1.0. For now.0 0.0 0.696 1.873 3.0 0.746 1.725.9 100.0 0.6 99. We just need to determine what is driving the scheduler… GroupsClosedResume – This counter is incremented any time a group is closed due to the next transaction following a resume command.030 1. let’s look at the next day: GroupsClosed Resume CachedTrans Sample Time TransInCmd Groups Groups ClosedLarge Groups ClosedTrans DSI CmdGroups Groups ClosedBytes CacheMem Used Groups ClosedOrig 19:18:31 19:23:32 19:28:34 0 1.0 0.899 10.098.0 0.3 CachedTrans Sample Time TransInCmd Groups Groups ClosedLarge Groups ClosedTrans DSI CmdGroups Groups ClosedBytes CacheMem Used Groups ClosedOrig 19:02:07 19:07:08 19:12:10 19:17:12 19:22:13 19:27:14 19:32:16 19:37:18 19:42:19 19:47:21 0 1.873 3.0 0.0 58. DSI.remember.0 0.281 2.328 0 0 143.0 0.0 0.0 0.0 0. let’s take a look at the customer examples from the 2 different days.0 1.792 3.432 2.0 0.0 0.0 0. we see that the number of yields is 4-6x the number of transaction groups which suggests that the DSI was repeated checking to see if the DSI EXEC was finished with the current group and ready for the next.023.574 1.0 1.0 0.0 0.348 10. 
It also simply could be due to calculating the percentage based on DSI.0 0.951 2.920 0 0 1 0 0 57 923 1.0 0.3 100.0 0. The first day’s counter values for DSI grouping are illustrated below: GroupsClosed Resume 100.0 1. the DSI rebuilds transaction groups from that point.0 0.0 0.528 1.7 25.097.794 8.430 2 1.0 0.0 0.920 1.0 1.0 1.0 0.0 100.5 16.0 1.CmdGroups.0 0.578 5.0 0. 6 0.3 36. Yields Scheduler 372 588 DSIXact InGrp 165 .0 0.0 0.040 11.271 1.097.790 Note in this case.0 20.6 72.9 13.592 2. However. the replicated functions (rs_insert.0 1.0 0.166.0 0.0 69.0 0.0 0.405 3.0 0.034 1.192 1.740 5.098. it is the responsibility of the DSI Executer (DSI-E) threads to actually perform the SQL string generation.580 6.0 0.432 2.520 13. very quickly the reasons shift to GroupsClosedTrans as the DSIXactInGrp climbs and eventually reaches the dsi_max_xacts_in_group of 20.2 99.3 0.664 470 845 845 1. The key to the DSI-E is that the DSI-S simply passes the list of transaction id’s in the group to it.0 CachedTrans Sample Time TransInCmd Groups Groups ClosedLarge Groups ClosedTrans DSI CmdGroups Groups ClosedBytes CacheMem Used Groups ClosedOrig 19:33:36 19:38:38 19:43:40 19:48:42 19:53:44 19:58:46 20:03:48 1.3 100. rs_update. This helps the rest of the Replication Server as it does not have to perform SQL language parsing (which is not in the transaction log anyhow – something many people have a hard time understanding – the transaction log NEVER logs the SQL).0 20.097.0 0.0 0.432 2. the DSI-E thread execution looks like the following flow diagram.0 0.0 20.0 0.0 104.0 69 101 187 276 652 579 339 1. command batching and exception handling.7 97.098.0 0.339 1.0 0.0 0. rs_delete.1 GroupsClosed Resume 0. etc. If you remember from the earlier discussion on LTL. the transactions at the beginning are largely closed due to GroupsClosedOrig – likely due to the same scheduler driven reasons as the insert test.3 100.0 0.137 14.0 0.098.098.0.0 0.0 0. As a result. DSIEXEC Function String Generation DSI Executer Processing While the DSI is responsible for SQT functions and transaction grouping.) actually are identified by the Replication Agent.944 2.0 33.432 2. we need to send ASCII language commands to the replicate system (or RPC’s).0 0.418 2.0 0.9 20. However.333 1.0 20.952 2.702 1.Final v2.408 2. The DSI-E then reads the actual transaction commands from the DSI SQT cache region.780 0.0 0. Increasing this number to the number of active replication definitions prevents Replication Server from executing expensive table lookups. Mentioned here as often questions are asked whether changing this would help – short answer “No”. From a DSI Executer performance perspective. Long answer is this was deprecated by sts_full_cache_xxxxx. this one is probably the most critical as insufficient STS cache would result in network and potentially disk i/o in accessing the RSSD. retry. The total number of rows cached for each cached RSSD system table. allow the DSI to continue uninterrupted. Parameter (Default) Replication Server scope fstr_cachesize (obsolete/deprecated) Obsolete and deprecated.1 Transaction group from DSI Translate replicated functions into SQL via fstring definitions Break transaction into dsi_cmd_batch_size Batches of SQL Send SQL batch to Replicate database No Yes “Stop” Errors? Rollback transaction No Done? Yes Suspend connection Commit Transaction Figure 38 – DSI Executer SQL Generation and Execution Logic Note that in the above diagram. 
only "stop" errors cause the DSI to suspend – some error actions, such as ignore (commonly set to handle database change, retry, print and other informational messages), allow the DSI to continue uninterrupted.

DSI Executer Performance

Beyond DSI command batching (next section), the tuning parameters available for the DSI Executer are listed in the following table (other parameters are available; however, they do not specifically address performance throughput). Note that parameters specific to parallel DSI performance are not listed here.

fstr_cachesize (Replication Server scope)
Obsolete/deprecated
In RS 12.x the STS cache could be used to hold RSSD tables such as rs_systext that hold the function string definitions. It was decided that this was not necessary (possibly viewed as duplicative, since function string RSSD rows would be in the STS cache as well) and the parameter was made obsolete (although it is still in the documentation). It is mentioned here because questions are often asked whether changing this would help – short answer "No"; long answer is that it was (essentially) deprecated by sts_full_cache_xxxxx.

sts_cachesize (Replication Server scope)
Default: 100; Suggested: 1000
The total number of rows cached for each cached RSSD system table. Increasing this number to the number of active replication definitions prevents Replication Server from executing expensive table lookups. Of all the parameters listed here, this one is probably the most critical, as insufficient STS cache would result in network and potentially disk I/O in accessing the RSSD.
sts_full_cache_xxxxx (Replication Server scope)
Recommended: on
For DSI performance, the list of tables that should be fully cached includes rs_objects, rs_columns, and rs_functions.

batch (connection scope)
Default: on; Recommended: on
Specifies how Replication Server sends commands to data servers. When batch is "on," Replication Server may send multiple commands to the data server as a single command batch; when batch is "off," Replication Server sends commands to the data server one at a time. This is "on" for ASE and should be on for any system that supports command batching, due to the performance improvements of batching. Some heterogeneous replicate systems – such as Oracle – do not support command batching, and consequently this parameter needs to be set to "off" for them. Note that for Oracle we are referring to the actual DBMS engine – as of 9i and 10g, batch SQL is handled outside the DBMS engine by the PL/SQL engine.

batch_begin (connection scope)
Default: on; Recommended: see text
Indicates whether a begin transaction can be sent in the same batch as other commands (such as insert, delete, and so on). For single DSI systems, this value should be 'on' (the default). If using parallel DSI's and 'wait_for_commit', the value should be 'on' as well; for most other parallel DSI serialization methods (i.e. wait_for_start) this value should be 'off'. The rationale for 'off' is that the DSIEXEC will post the 'Batch Began' message quicker to the DSI, allowing the other parallel threads to begin sooner rather than waiting for the begin and the first command batch (and possibly only command batch) to execute before the message is sent.

dsi_cmd_batch_size (connection scope)
Default: 8192; Recommended: 32768
The maximum number of bytes that Replication Server places into a command batch. You need to be careful with this setting, as too high a setting may exceed the stack space in the replicate database engine. On ASE 15 systems, it should be at least the same as db_packet_size, if not doubled.

db_packet_size (connection scope)
Default: 512; Recommended: 8192 or 16384
The maximum size of a network packet. During database communication, the network packet value must be within the range accepted by the database. You may change this value if you have an Adaptive Server that has been reconfigured for "max network packet size" minimally at the desired size or greater. A recommended packet size of 16,384 on high speed networks, or tuned to the network MTU on lower speed networks, is appropriate. Values less than 2,048 are suspect and should only be used if the target system does not support larger packet sizes; on ASE 15 the connection will automatically be bumped to 2048 as the minimum packet size.

dsi_keep_triggers (connection scope)
Default: "on" for most connections, "off" for standby databases; Recommended: "off"
Specifies whether triggers should fire for replicated transactions in the database. Set to "off" to cause Replication Server to set triggers off in the Adaptive Server database, so that triggers do not fire when transactions are executed on the connection. By default, this is set to "on" for all databases except standby databases – "on" is the typical "safe" approach that Replication Server defaults assume. Arguably it should be off for all databases; there should be compelling reasons not to have this turned "off", including security, as the replication maintenance user could be viewed as a "trusted agent" fully supportable in Bell-LaPadula and other NCSC endorsed security policies. Additionally, having it on is no guarantee of database consistency, as will be illustrated later in the discussion on triggers. Simply put – if you leave this "on" – you WILL have RS latency & performance problems.
Counter Explanation Command (DML or DDL Related) CmdsApplied CmdsSQLDDLRead DeletesRead ExecsGetTextPtr ExecsWritetext InsertsRead UpdatesRead Function String Generation DSIEFSMapTimeAve DSIEFSMapTimeLast DSIEFSMapTimeMax Average time taken. Time. several have to do with command batching which is discussed in the next section.0 equivalent counters are: Counter Read From SQT Cache DSIEReadTime DSIEWaitSQT DSIEGetTranTime The amount of time taken by a DSI/E to read a command from SQT cache. The maximum time taken.1 Parameter (Default) dsi_replication Default: “off” for most – “on” for WS Explanation Specifies whether or not transactions applied by the DSI are marked in the transaction log as being replicated. Total rs_update commands processed by a DSIEXEC thread. This function is executed each time the thread processes a writetext command. Command (DML or DDL Related) TransSched UnGroupedTransSched DSIECmdsRead DSIECmdsSucceed BeginsRead CommitsRead SysTransRead CmdsSQLDDLRead InsertsRead UpdatesRead DeletesRead ExecsWritetext ExecsGetTextPtr Function String Generation DSIEFSMapTime Time. UpdatesRead. Commands successfully applied to the target database by a DSI/E. let’s consider the insert stress test: 169 . 'begin' transaction records processed by a DSIEXEC thread. This function is executed each time the thread processes a writetext command. the real effort at this stage is command batching. ExecsGetTextPtr These are fairly obvious as they help us establish rate information for throughput as well as which commands were being executed. rs_insert commands processed by a DSIEXEC thread. to perform function string mapping on commands. it does mean that if looking across all the DSIEXEC’s. For now.0. Commands read from an outbound queue by a DSIEXEC thread. Internal system transactions processed by a DSI DSIEXEC thread. The amount of time taken by a DSI/E to parse commands read from SQT. we will need to aggregate the counter values per sample period. An important aspect to these counters is to remember that they are per DSI EXEC thread – so with parallel DSI enabled. Transactions groups scheduled to a DSIEXEC thread. rs_update commands processed by a DSIEXEC thread. Invocations of function rs_get_textptr by a DSIEXEC thread.e. However. in 100ths of a second. rs_delete commands processed by a DSIEXEC thread. The last set refer more to text/image processing and can be used to develop profiles (i. Some of the more useful general counters include: CmdsApplied (DSICmdsSucceeded). SQLDDL commands processed by a DSI DSIEXEC thread. Transactions in transaction groups scheduled to a DSIEXEC thread. the largest change is that the DSIEXEC has more counters tracking the time spent retrieving the commands/command groups from the SQT cache in the DSI thread. CmdsPerSec=CmdsApplied/seconds InsertsRead. First. a relative indication of the size of the text/image is WritesPerBlob=ExecsWritetext/ExecsGetTextPtr).Final v2. we will focus on just the function generation and DML aspects – later we will take a look at the parallel DSI aspect of the problem. 'commit' transaction records processed by a DSIEXEC thread. As you can see.1 Counter DSIERelTranTime DSIEParseTime Explanation The amount of time taken by a DSI/E to release control of the current logical transaction. DeletesRead ExecsWritetext. As mentioned earlier in the general discussion about the RS M&C feature.instance_id column corresponds to the thread number for each value – allowing us to also track how efficiently each thread is utilized. 
Let’s take a look at how these counters can be used. the rs_statdetail. rs_writetext commands processed by a DSIEXEC thread. more than one value will be recorded. While these are interesting to monitor (and the number of updates may give a clue to how effective minimal column replication might be). 7 ExecsGet TextPtr MsgChks PerCmd 0. we will see statistics that help support that it is the replicate ASE that is the bottleneck.2 0.299 10.7 0.255 21.491 3.267 2.414 27 203 203 226 195 203 204 201 191 219 200 1.745 6.2 0.059 3.2 0.2 0.119 6.7 0.235 2. The disparity between CmdsApplied and InsertsRead is simple – the begin tran/commit tran commands are counted as well.343 42.210 5.609 1.Final v2.711 0 20 25 13 23 24 51 137 140 72 0 615 0 0 0 0 526 10.0.7 0.2 0.107 2.909 3. Later when we look at the timing information.234 2. Because the insert stress test is rather simplistic.725 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 94 541 567 640 571 556 580 587 584 654 As you can see the cumulative throughput was ~200 commands/sec across all the DSI’s and it was all inserts (no surprise).595 1. And interesting statistic is the message checks per command – which is averaging close to 25%.738 11.431 2 1. Note that the test machine can easily hit 900 inserts/sec using RPC calls and 200 inserts/sec using language commands – consequently the 200 inserts/sec rate may be the max we can get out of the replicate ASE using a Warm Standby configuration.044 31.450 1.150 2.711 31. let’s take a look at some of these metrics – for the most part the description will concentrate on the customer numbers and only refer back to the insert test when necessary.595 41.839 2.357 5.536 1.3 0.491 15.841 1 270 9 0 615 0 0 0 0 430 10.292 7.030 2.299 Now.1 CmdsApplied UpdatesRead Sample Time CmdsPerSec DeletesRead InsertsRead MsgChecks Execs Writetext 11:37:57 11:38:08 11:38:19 11:38:30 11:38:41 11:38:52 11:39:03 11:39:14 11:39:25 11:39:36 305 2.253 2.2 0.504 1.7 0.411 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 4.595 1.347 10.2 0.802 5. let’s next take a quick look at the first day of the customer’s data that we have been looking at before we discuss the counters: CmdsApplied UpdatesRead Sample Time CmdsPerSec DeletesRead InsertsRead MsgChecks Execs Writetext 19:02:07 19:07:08 19:12:10 19:17:12 19:22:13 19:27:14 19:32:16 19:37:18 19:42:19 19:47:21 6 6.620 1.2 0.914 3. 170 MsgChks PerCmd 1.469 5.7 0.3 0.2 ExecsGet TextPtr .212 2.679 4.360 5.7 0.7 0.983 7.7 0.580 1.734 16. CmdsPerSec – This metric is derived by dividing the CmdsApplied by the number of seconds in the sample interval. these counters track the number of inserts.012 3 2. In this case. insert/delete. once it can start.133 3. The first metric refers to the number of text/image columns that are involved.000 commands. A key element here is that we are doing 2 message checks for every 2 commands – which makes sense if these are atomic transactions as each transaction would be 3-4 commands (begin.000 begin tran + 10. with large groups allowed. the system is nearly idle at the beginning and then builds to executing tens of thousands of SQL commands per sample period.2 0.802 7. let’s look at the second day’s metrics: CmdsApplied UpdatesRead Sample Time CmdsPerSec DeletesRead InsertsRead MsgChecks Execs Writetext 19:02:07 19:07:08 19:12:10 19:17:12 19:22:13 19:27:14 19:32:16 9 6.000 deletes + 10.2 0.603 14. commit) and we would need to check group sequencing and batch sequencing (two message checks).758 7. 
we see the curious pattern of inserts/deletes mimicking each other.079 2. As you can see in the above.058 2. MsgChksPerCmd – This metric is derived by dividing the number of MsgChecks by the CmdsApplied to get a ratio of how autonomous the DSIEXE is.3 0. The reason is that the counters for the begin transaction & commit transaction are not shown above. Again.114 6. The reason this can be deduced is that each text/image column per replicated row will require an execution of rs_get_textptr (see section on text replication later). If you see a ratio of 100 or more writetext commands per rs_get_textptr. we would have ~10. we are going to have more coordination between the DSI and the DSIEXEC with each new transaction.2 0.685 2.000 inserts + 10.479 3. For example. we see that we are checking with the parent DSI thread nearly every command – but then remember. you can be fairly confident that the text/image data is fairly substantial – which may be contributing to the slow delivery rate at the replicate database server.085 MsgChks PerCmd 1.171 5.371 4. However. the DSIEXEC can do a lot of processing without having to continuously check for transaction group/batch sequencing with the parent DSI thread. updates and deletes read out of the outbound queue and sent to the replicate database. CmdsApplied as it gives an execution rate.071 0 615 0 0 0 0 3. The second counter is incremented for each writetext operation.2 ExecsGet TextPtr 171 . ExecsGetTextPtr/ExecsWritetext – These counters are related to text/image processing.0. For example.128 0 22 24 13 18 49 73 0 627 0 6 0 0 4. MsgChecks – This metric tracks how often the DSI EXEC threads check for pending commands via the OpenServer message structures referenced in the internals section at the beginning of this document – specifically the batch sequencing and commit sequencing messages (but also the actual transaction commands are posted here as well). then you would know that the amount of text/image data is fairly small (<16KB) and can be issued with a single writetext call.000 commit trans – which does work out to 40. This can be used to gauge the real performance of the DSI threads vs.978 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12 2. then at time 19:37. Note that it peaks at ~140/sec – which really is not all that good (compared to the insert test steadily achieving 200 inserts/sec on a laptop and even that is not ideal) – but then we are dealing with a single DSI thread as well.3 0. Now.095 1. if the delete/insert were a pair in a single transaction.638 4.Final v2.198 1. the number of updates also suggests that minimal column replication should be considered as well. these two also give you a fairly good indication of the amount of text/image data flowing. the calculated dsi_max_xacts_in_group was 1 – which means with only 1 transaction per group. One thing that is interesting is that the sum of the DML commands is only ½ of the CmdsApplied value. Inserts/Updates/DeletesRead – Much like the DIST counters.2 0.1 CmdsApplied – CmdsApplied reports the number of SQL statements issued to the replicated database.959 22. if they were equal. While there are counters available. For example. 520 5. In this case.085 46. This is analogous to executing the following from isql: -. how often – after finding something that helps – do DBA’s go back and retry something that didn’t help previously?? Answer: Not very often.2 0. While some database systems. we know that in the first day.382 73 37 2. 
Whatever the customer did to change the picture for day 2 helped the transaction grouping, but did not help the overall throughput. Notice that although the CmdsPerSec values are in the same order of magnitude, the MsgChksPerCmd is considerably less – transaction grouping is much more efficient (we are now only coordinating with the DSI approximately every fifth command, although with 2 checks per group (batch & transaction sequence) we really are checking every 10 commands). If you remember, it was during this time frame that the transaction grouping was much more effective – hitting the goal of 20 constantly for the later samples. Since the delivery rate still has not improved, something else is now the bottleneck – and may have been the primary limiting factor the day before as well.

This points out an interesting perspective. So many P&T sessions follow the same flawed logic:

1.  Change one setting and run the test (this is already flawed as sometimes settings work together cooperatively).
2.  If it didn't improve anything, reset it to the original setting and try something else.

The question is, how often – after finding something that helps – do DBA's go back and retry something that didn't help previously?? Answer: Not very often. A better way of looking at P&T issues is to think of the system as a pipeline – with at least one or more bottlenecks. Removing the second or third one will not necessarily improve the throughput as the first one is still constricting the flow. Putting it back and then removing the first one doesn't help either, as the re-introduced second bottleneck restricts the flow, making the removal of the first bottleneck appear to be without benefit as well. Consequently, it is best to look at each and, if possible, remove each bottleneck as noticed – and leave it removed. In this case, we know that in the first day transaction grouping was not occurring, so the customer should leave the changes that affected transaction grouping intact – larger groupings of transactions are much more efficient in non-parallel DSI systems.

DSIEXEC Command Batching

In addition to transaction grouping, another DSI feature that is critical for replication throughput is DSI command batching. While some database systems, such as older Oracle (pre-9i), do not allow this feature or have limitations, those that do gain a tremendous advantage in reducing network I/O time by batching all available SQL and sending it to the database server in a single structure. This is analogous to executing the following from isql:

    -- isql script
    insert into orders values (…)
    insert into order_items values (…)
    insert into order_items values (…)
    insert into order_items values (…)
    insert into order_items values (…)
    insert into order_items values (…)
    go

vs. the same isql script without batching:

    -- isql script
    insert into orders values (…)
    go
    insert into order_items values (…)
    go
    insert into order_items values (…)
    go
    insert into order_items values (…)
    go
    insert into order_items values (…)
    go
    insert into order_items values (…)
    go

Anyone with basic performance and tuning knowledge will be able to tell that the first example will execute an order of magnitude faster from the client application perspective. Believe it or not, isql is faster executing batches of SQL than individual statements. How does this apply to Replication Server? The way this works is as follows:

1.  The DSI groups a series of transactions until one of the group termination conditions is hit (for example, the byte or transaction limits described earlier; a large transaction – especially one that gets removed from SQT cache due to cache limitations – will also force the previous transaction group to end).
2.  The DSI passes the entire transaction group to the next DSIEXEC that is ready.
3.  The DSIEXEC executes the grouped transaction by sending dsi_cmd_batch_size (8192 bytes by default) sized command batches until completed – each dsi_cmd_batch_size bytes is sent to the replicate database in turn. The commit is withheld until all transaction statements have executed without error. If no errors, the commit is sent separately; if errors occurred, either a rollback is issued or the DSI connection is suspended (most common), which implicitly rolls back the transaction.

Note that, just like transaction grouping, the command batching limits are upper bounds/goals. Command batches could be flushed from the DSIEXEC for any number of reasons – some of which are tracked by the monitor counters, as you will see when we look at them. Additionally, RS sends a maximum of 50 commands per batch even if dsi_cmd_batch_size will support more. This does NOT mean that multiple transaction groups can be lumped into a large SQL structure and sent in a single batch; it does mean that all of the members of a single transaction group may be sent as a single command batch – with the exception of the final commit (due to recovery reasons).

The common wisdom has been to set dsi_cmd_batch_size to the same size as db_packet_size or some small multiple (i.e. 2x) of db_packet_size. However, this is actually not optimal – optimal is to set it large enough that a single transaction group is sent as one command batch. The dsi_cmd_batch_size should rarely (hesitating to say "never" only to avoid setting a precedent) be set to less than 8,192 no matter what db_packet_size is set to – and never less than 2,048, as a large data row of >1,000 bytes might easily exceed that by the time the column names, etc. are added to the command. So lowering dsi_cmd_batch_size to db_packet_size typically will degrade throughput. The optimal setting is the largest command buffer that your DBMS can handle, letting the network layer break it up into smaller chunks. Command batching is critical for large transaction performance.
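As a sketch of the corresponding connection settings (the connection name is illustrative; db_packet_size also requires the replicate ASE's "max network packet size" to be configured at least that large):

    suspend connection to RDS.rdb
    go
    alter connection to RDS.rdb set dsi_cmd_batch_size to '32768'
    go
    alter connection to RDS.rdb set db_packet_size to '8192'
    go
    resume connection to RDS.rdb
    go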
the effect of command batching – which is on by default for ASE replicates – is that performance of each of the transaction groups is maximized by reducing the network overhead of sending a large number of statements within a single transaction. The optimal size for DSI command batching would allow the entire transaction group to be sent as a single command batch. the entire transaction would be sent to the replicate database in ~8 batches of 8. 2x) of db_packet_size.Final v2. it does NOT mean that multiple transaction groups can be lumped into a large SQL structure and sent in a single batch.and never less than 2. lock time on the replicate is minimized.536 bytes).536 byte limitation and we still had the default of 8. The commit is withheld until all transaction statements have executed without error.0. in 100ths of a second. of a command batch submitted by a DSI.0. in 100ths of a second. Maximum memory consumed by a DSI/S thread for a single transaction group. Memory consumed by a DSI/S thread for the most recent transaction group. The maximum size. in bytes. Number of output commands in the last command batch submitted by a DSI. Number of batch flushes executed because the next command would exceed the batch byte limit.1 Counter Preparation DSIEBatch DSIEBatchSizeAve DSIEBatchSizeLast DSIEBatchSizeMax DSIEBatchTimeAve DSIEBatchTimeLast DSIEBatchTimeMax DSIEICmdCountAve DSIEICmdCountLast DSIEICmdCountMax DSIEOCmdCountAve DSIEOCmdCountLast DSIEOCmdCountMax MemUsedAvgGroup MemUsedLastGroup MemUsedMaxGroup TransAvgGroup Explanation The number of command batches started. For example.Final v2. Average number of input commands in a batch submitted by a DSI. Number of input commands in the last command batch submitted by a DSI. in bytes. If a DSIEXEC thread is capable of utilizing any degree of transaction grouping logic. The maximum number of output commands in a batch submitted by a DSI. Average number of output commands in a batch submitted by a DSI. Average memory consumed by a DSI/S thread for a single transaction group. Number of batch flushes executed because the next command is a 'transaction begin' command and by configuration such commands must go in a seperate batch. of the last command batch submitted by a DSI. in 100ths of a second. to process a command batch submitted by a DSI. TransLastGroup TransMaxGroup Execution DSIEBFBatchOff DSIEBFBegin DSIEBFCommitNext DSIEBFForced Number of batch flushes executed because command batching has been turned off. Average time taken. to process the last command batch submitted by a DSI. or the next command is the first chuck of BLOB DDL. The maximum number of transactions dispatched as a single atomic transaction. If the value of this counter is close to the value of TransMaxGroup. Number of batch flushes executed because the situation forced a flush. Time. in bytes. Number of batch flushes executed because the next command is a get text descriptor command. DSIEBFGetTextDesc DSIEBFMaxBytes 174 . The maximum time taken. Size. an 'install java' command needs to be executed. this counter reports the number of transactions executed in the last grouped transaction. you may want to consider bumping dsi_xact_group_size and/or dsi_max_xacts_in_group. of a command batch submitted by a DSI. to process a command batch submitted by a DSI. The maximum number of input commands in a batch submitted by a DSI. The average number of transactions dispatched as a single atomic transaction. 
Number of batch flushes executed because the next command in the transaction will be a commit. Average size. The maximum time taken. these counters are similar but lack the total.0. DSIEBFResultsProc DSIEBFRowRslts DSIEBFRPCNext DSIEBFSysTran Sequencing DSIESCBTimeAve DSIESCBTimeMax Average time taken. of command batches submitted by a DSI.1 Counter DSIEBFMaxCmds Explanation Number of batch flushes executed because we have a new command and the maximum number of commands per batch has been reached. Size. in 100ths of a second. Number of batch flushes executed because we have a new command and the maximum number of commands per batch has been reached. Time. Number of batch flushes executed because the next command is a get text descriptor command. Number of input commands in command batches submitted by a DSI. Explanation 175 . to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'. Number of batch flushes executed because we expect to have row results to process. Number of batch flushes executed because the next command is to have its results processed in a context different from the current batch. Number of batch flushes executed because the next command is an RPC. to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'.Final v2.0. In RS 15. Number of batch flushes executed because the next command is part of a system transaction. max as per: Counter Preparation DSIEBatchTime DSIEBatchSize DSIEOCmdCount DSIEICmdCount Execution DSIEBFResultsProc DSIEBFCommitNext DSIEBFMaxCmds DSIEBFRowRslts DSIEBFRPCNext DSIEBFGetTextDesc DSIEBFBatchOff DSIEBFMaxBytes DSIEBFBegin DSIEBFSysTran Number of batch flushes executed because the next command is to have its results processed in a context different from the current batch. Number of output commands in command batches submitted by a DSI. average. This limit currently is 50 commands as measured from the input command buffer. Number of batch flushes executed because the next command is an RPC. Number of batch flushes executed because command batching has been turned off. Number of batch flushes executed because the next command would exceed the batch byte limit. in bytes. Number of batch flushes executed because we expect to have row results to process. in 100ths of a second. Number of batch flushes executed because the next command is a 'transaction begin' command and by configuration such commands must go in a seperate batch. Number of batch flushes executed because the next command in the transaction will be a commit. in 100ths of a second. Number of batch flushes executed because the next command is part of a system transaction. to process command batches submitted by a DSI. Sequencing DSIESCBTime Time. the “I” for the similar batch of counters (e. DSIEBFGetTextDesc. the slower the throughput. The next set report the size in bytes of the command batches. then a ‘go’). then the RPC sent.g. the language commands before it being accumulated in a batch have to be flushed. DSIEBFBegin DSIEBFMaxCmds. we first have to get the textpointer value from the replicate server. DSIEBFGetTextDesc .This counter is typically incremented when batch_begin is off. The best way to think of this is after every DSIEOCmdCountAve commands (on average) a “go” is sent ala isql.This counter signals that the end of the transaction group has been reached. you will quickly find that with smaller numbers (i. an 'install java' command needs to be executed. 
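Before reading the individual flush counters, it is worth noting how the batching-related parameters discussed above are actually changed. The following is a minimal sketch only; the connection name RDS.rdb and the values shown are illustrative, and the connection is suspended and resumed so the new settings take effect.

suspend connection to RDS.rdb
go
-- size the batch so a whole transaction group normally fits in one command batch
alter connection to RDS.rdb set dsi_cmd_batch_size to '65536'
go
-- raise the transaction grouping target discussed earlier
alter connection to RDS.rdb set dsi_max_xacts_in_group to '20'
go
resume connection to RDS.rdb
go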
2 or 3 inserts. RPC’s can not be batch. Consequently. If you’ve ever done this test. After SQL generation and variable substitution. The Output commands have the most interest to us. then. The default dsi_cmd_batch_size of 8192 typically is too small and most often results in 4-6 SQL commands per batch. report the number of commands actually sent per batch vs.Final v2.this clearly suggests that dsi_cmd_batch_size is too small as described in the above paragraph. the smaller the batches. to check the sequencing on command batches which required some kind of synchronization such as 'wait_for_commit'. Note that the equivalent of DSIEBatch in RS 15. In other words. The real goal.This counter tells how often a batch was flushed because the next command would be a writetext command. One reason for limiting the number of commands per batch is that some servers would have stack overflow if the number of command batch bytes exceed 64KB (including earlier copies of Sybase SQL Server). Increasing this to 256K is likely advisable as well. or the next command is the first chuck of BLOB DDL. consequently. in 100ths of a second. DSIEOCmdCountAve DSIEBFCommitNext.equivalent RS 15. DSIEBFRPCNext . Obviously. DSIEOCmdCountMax/Ave. Command batching can kind of be compared to how many SQL statements before each ‘go’ you put in a file to be executed by isql. the batch is sent because it exceeded dsi_cmd_batch_size. DSIEICmdCountAve) refers to Output vs. DSIEBFMaxCmds . For example. the DSI submits a transaction grouping of commands – but they are commands in which the SQL generation has not yet happened. DSIEBFBegin . The “O” vs.0. DSIEBFMaxBytes . 176 . you want this counter to be the primary reason for command batch flushes to the replicated database.0 counters can be easily determined): DSIEBatch DSIEBatchSizeMax. The set after that. All the counters beginning with DSIEBF (DSIEXEC Batch Flush). DSIEBFSysTran The first one is fairly simple – the number of command batches used. DSIEBatchSizeAve DSIEOCmdCountMax. Along with the above. the number of bytes per batch or other factor may reduce the actual number of commands sent in the batch to the replicate DBMS. is to try to submit the entire transaction group in one command batch Much like with DSI transaction grouping. DSIEBFCommitNext . these are some of the more important counters. it can be ignored. As mentioned above. you will want to want the following counters (RS 12. Since the writetext requires a text pointer.This counter signals how often a batch was flushed because the next command had an output style of RPC instead of language.e. As a result. If this is deliberate. the bytes.1 Counter DSIEBFForced Explanation Number of batch flushes executed because the situation forced a flush. it is much slower than with 100 or so.this counter tells when the batch size hits the internal limit of 50 commands before function string mapping. Some of the more common ones will be described in the following bullets. Input. with command batching. DSIEBFMaxBytes DSIEBFRPCNext.0 is to get the counter_obs column value for the DSIEBatchSize counter. there can be many reasons why a command batch is terminated. if the goal is to submit the entire transaction group as a single batch.6 listed . In order to replicate DDL statements.170 50 50 12 12 126 137 9 8 0 0 It helps.999 15. The first two are from when the dsi_max_batch_size was set at twice the packet size of 8192. This is because the actual commit is sent in a separate batch. 
2) sending the batch to the replicate database and then 3) processing the results. Let’s use the same time samples from the insert stress test above and see if it can explain why we have latency (or at least which component is to blame): SendTimeMax SendTimeAvg DSIEResult TimePerCmd 2. you can subtract DSIEBatch from DSICommitNext to reach a true DSICommitNext. not only is the batch flushed. We will take a look next at the timing aspect.999 21 21 9 10 131 120 0 0 50 43 dsi_max_batch_size at 65536. Let’s take a look at our insert stress test. Consequently. 1 tran per group avg 16:13:20 16:13:30 136 119 100 100 1 1 3. but the transaction grouping stopped as well. On interesting statistic to keep in mind is that each command batch will have 2 batch flushes at a minimum.864 15.952 39. 1 tran per group avg 16:13:20 16:13:30 136 119 100 100 1 1 16 16 5 5 0 0 0 0 100 100 13 17 DSIEResult TimeAve DSIEBatch DSIEBatch TimeMax DSIEBatch TimeAve DSIEBF MaxBytes 177 . There are three sample periods below. However. ~4 tran per group avg 11:17:39 11:17:50 66 59 100 100 4 5 7.0.004 12. but for now. they are submitted outside the scope of a transaction so in this case.141 8. milliseconds) per ungrouped transaction to process the batch.6 3.536.550 3. 5 inserts/transaction.299 7. the ideal situation would be to have DSIEBFCommitNext ≈ 2 x DSIEBatch – which is what we have. we can see that RS is taking about 10ms (counter is in 1/100ths of a second or 0. increasing the dsi_max_batch_size shifted the ~25% batch flushes due to hitting the configuration limit to a <10% due to hitting the maximum number of commands.01 vs. 5 inserts/transaction.004 16 16 5 5 271 244 0 0 dsi_max_batch_size at 16384. this puts the DSIEBFMaxBytes in more perspective as it suggests that nearly every batch in the middle sample exceeded dsi_max_batch_size – requiring three command batches instead of two. Since we are looking at the execution stage.This counter tells us how often a batch was flushed due to the next command being a DDL command. However. It is interesting to note that the number of CmdsPerSec jumped from ~150 to ~200 (earlier execution statistics for first set not shown here) simply by increasing the number of commands per command batch.170 39. The difference between the first two has to do with the average transactions per group the DSI was submitting. Since each batch that begins will have at least one separate batch flush for the commit record. between #2 and #3. Sample Time DSIEBF CommitNext DSIEOCmd CountMax DSIEOCmd CountAve DSIEBatch DSIEBatch TimeMax DSIEBatch TimeAve DSIEBatch SizeAve DSIEBatch SizeMax DSIEBF MaxCmds 0 0 dsi_max_batch_size at 16384. ~4 tran per group avg 11:38:08 11:38:19 63 68 100 100 4 4 9. 5 inserts/transaction. let’s look at the commands.Final v2. Notice that the average number of commands per batch increased from 5 to ~10 to 12. it would help to take a look at the times for the various stages – RS 1) preparing the batch.1 DSIEBFSysTran . of course to have the intrusive counters for timing purposes turned on.4 Sample Time DSIEOCmd CountMax DSIEOCmd CountAve DSIEResult TimeMax dsi_max_batch_size at 16384.770 12. and the latter when this was increased to 65. 5 inserts/transaction. From the above. 2 dsi_max_batch_size at 65536.491 0 0 0 0 ~16-20 tran per group avg 19:23:32 19:28:34 19:33:36 19:38:38 148 115 69 101 0 0 0 0 11 16 14 13 4.325 2. it is taking RS about 4 times longer to process the results than it does to process the batch internally – and when no grouping.768 5.4 2.189 8.852 5. 
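If the M&C data is being flushed to the RSSD, the same batching metrics can be pulled with a query rather than read from a report. The sketch below is based on the RS 15.0 statistics tables as recalled here (rs_statrun, rs_statcounters, rs_statdetail and the counter_obs/counter_total columns mentioned above); treat the table and column names as assumptions and verify them against your RSSD before relying on the query.

-- run against the RSSD; table/column names are assumptions, verify locally
select r.run_date, c.counter_name, d.counter_obs, d.counter_total,
       avg_per_obs = convert(numeric(18,2), d.counter_total) / d.counter_obs
from rs_statcounters c, rs_statdetail d, rs_statrun r
where c.counter_id = d.counter_id
  and d.run_id = r.run_id
  and c.counter_name in ('DSIEBatch', 'DSIEBatchSize', 'DSIEBatchTime',
                         'DSIEOCmdCount', 'DSIEBFCommitNext', 'DSIEBFMaxBytes')
  and d.counter_obs > 0
order by r.run_date, c.counter_name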
a 10-20x drop in the number of batches sent would suggest a drop in throughput – either because of fewer transactions being replicated or due to slower throughput.185 8.323 2. no matter how much tuning we do to RS. 5 inserts/transaction.1 SendTimeMax SendTimeAvg dsi_max_batch_size at 16384.5 2. we could only hit ~50 commands/sec – of course using parallel DSI’s help some. they barely let us 4x the throughput of this system.148 3. let’s take a look at the customer system. 5 inserts/transaction.574 1. it will be extremely difficult to achieve much faster – we need to speed up the replicate SQL execution first.030 1. so we will only be able to look at the batching efficiency over the two days: Sample Time DSIEBF CommitNext DSIEOCmd CountMax DSIEOCmd CountAve DSIEBatch DSIEBatch TimeMax DSIEBatch TimeAve DSIEBatch SizeAve DSIEBatch SizeMax DSIEBF MaxCmds 0 0 0 0 ~1 tran per group avg 19:07:08 19:12:10 19:17:12 19:22:13 1.1 And we notice a very key clue – it is taking ~20ms (counter is in 1/100ths of a second or 0. milliseconds) per command to process the results from each one. Given this lag.340 3 3 3 3 2 2 2 2 3. Now.Final v2. ~4 tran per group avg 11:17:39 11:17:50 66 59 100 100 4 5 21 21 9 10 0 0 0 0 100 100 14 22 1.279 2.920 1.0.TransAveGroup jump the same amount. In fact. 178 DSIEBF MaxBytes DSIEResult TimePerCmd Sample Time DSIEOCmd CountMax DSIEOCmd CountAve DSIEResult TimeMax DSIEResult TimeAve DSIEBatch DSIEBatch TimeMax DSIEBatch TimeAve .342 2.189 8. this shows the same jump. In this case.01 vs. we need to figure out a way to speed up the individual command processing – increasing the number of DSI’s may not help if the system is already CPU bound.746 0 0 0 0 1 1 1 1 884 1. if we had the space to show DSIEXEC.060 3.840 2.189 61 16 17 16 11 10 9 9 294 230 138 201 0 0 0 0 287 538 304 402 You can see the changes in some of the counters as described below: DSIEBatch – Normally. but it works out to slightly less than 1/100th of a second (10ms) per transaction group – so we are not too concerned here – although we wish we would have had some of the other time based counters such as DSIEResultTimeAve for comparison. DSIEBatchTimeAve – Again.708 8. this is ~15x longer.274 1.271 1. Unfortunately. the customer system did not have the timing counters enabled. but in this case. however. Ideally. At 20ms per command. ~4 tran per group avg 11:38:08 11:38:19 63 68 100 100 4 4 50 50 12 12 0 0 0 0 100 100 29 26 2.849 5. .. we end up with 530 instead of 863 which is a 3:1 ratio for DSIBFMaxBytes – and a truer picture of the problem.Final v2.. the output command language would require: set identity_insert tablename on insert into tablename set identity_insert tablename off Consequently a single command becomes three. during the same period.. So.. While this may seem odd. this system is still using the default dsi_max_batch_size = 8192 – which.. RS execution of SQL statements is effectively very similar to the basic ct_results() looping in sample CT-Lib programs.. If we subtract 333 from DSIEBFCommitNext. case CS_DESCRIBE_RESULT.. additional commands may be necessary. Consequently. this one hits >>50 without tripping DSIEBFMaxCmds... DSIEICmdCountMax was equal to 41..500 bytes. the input counters DSIEICmdCountAve/Max/Last should never exceed 50. /* ** Other values of result_type: */ case CS_CMD_DONE. Curiously. Remember also that only complete commands can be sent – therefore if the average command size is 1. it may be difficult for the max to be hit that often.. case CS_MSG_RESULT. 
running ~10 commands per batch and peaks of ~60 commands per batch. Increasing dsi_max_batch_size is still a good idea. DSIEOCmdCountAve/Max – In the first day’s metrics.. 179 . when the DSIEOCmdCountMax was equal to 61.. it is not the largest and tuning it will help some but not likely as much as some may be looking for. removing other bottlenecks may have greater impact – for example... while you may see DSIEOCmdCountAve/Max/Last higher than 50. it does show that the most common reason for batch flushes is due to hitting the dsi_max_batch_size limit. whereas during SQL generation. Some of you may have noticed that during the 19:23:32 period (first sample in the second group in the table)... while it may not look to be a problem as the average is only 70% of the max – with the average being that close to the max. DSIEBatch – which reports the number of batches began .1 DSIEBatchSizeAve/Max – Obviously. case CS_ROW_RESULT. the most we will be able to send is 5 commands or 7. but command batching is all but ineffective as well. As a result. remember that DSIEBatch is measured at the beginning – and likely some of the command batches exceeded dsi_max_batch_size several times within the same batch – resulting in multiple batch flushes per command batch – in addition to the separate commit flush. not only is transaction grouping an issue. Day two is a lot better. The basic template might look similar to: ct_command() – called to create command batch ct_send() – send commands to the server while ct_results returns CS_SUCCEED (optional) ct_res_info to get current command number switch on result_type /* ** Values of result_type that indicate fetchable results: */ case CS_COMPUTE_RESULT. part of the issue with this system is that the dsi_max_batch_size is undertuned.. if we replicate a table containing identity columns. it tells us that the max is being hit pretty frequently (as DSIEBFMaxBytes does show). we get a total of DSIEBFCommitNext=863 and DSIEBFMaxBytes=1531 or nearly a 2:1 ratio for DSIBFMaxBytes. /* ** Values of result_type that indicate non-fetchable results: */ case CS_COMPUTEFMT_RESULT. Much like the multiple bottlenecks in a pipe. case CS_ROWFMT_RESULT. However. For example. DSIEBFCommitNext/DSIEBFMaxBytes – Unlike the insert stress test system. The command limit is based on replicated commands from the input. case CS_CURSOR_RESULT. Consequently. However. In the case above. 50% of the latency can be eliminated for this system simply by eliminating the delete/insert pairs and replacing with an update statement. case CS_STATUS_RESULT. DSIEXEC Execution Replication Server is simply another client to ASE or any other DBMS – it has no special prioritization nor special command processing. the actual replicated command is the rs_insert – a single command..500 bytes.is only at total of 333. While this may be a big bottleneck. case CS_PARAM_RESULT. If we sum the four values above.0. that the value for DSIOCmdCountMax was 61 – definitely higher than the limit we stated as 50. only a few of these are applicable as most replication environments are fairly basic (consequently values for other counters may be an indication of unexpected behavior that may be contributing to the issue at hand). consequently RS needs to know how to handle the results type. 
ct_send() phase SendTimeAvg SendTimeMax SendRPCTimeAvg SendRPCTimeMax SendDTTimeAvg SendDTTimeMax ct_results() processing DSIEResultTimeAve DSIEResultTimeMax Exception Processing Average time taken..Final v2.1 (optional) ct_res_info to get the number of rows affected by the current command case CS_CMD_FAIL. 180 . in 100ths of a second. Counter Explanation Batch sequencing (repeated from earlier) DSIESCBTimeAve DSIESCBTimeMax Average time taken. Maximum time. to process the results of a command batch submitted by a DSI. case CS_CANCELED. However. but again.. spent in sending command buffers to the RDS. Some of the counters are repeated from earlier sections. Ideally. Those familiar with CT-Lib programming also know that within this ct_results() loop often is a ct_fetch() loop – which RS has to implement as well. in 100ths of a second. Maximum time.. just about any SQL statement could be contained within the replicated procedure. end switch The only real difference would be if an RPC call was made or text/image processing. remember that with stored procedure replication. DSIEXEC Execution Monitor Counters The following monitor counters deal specifically with sending the commands to the replicate DBMS. in 100ths of a second. in the case of stored procedure replication. processing the results (and error handling) during processing... spent in sending RPCs to the RDS. case CS_FAIL. in 100ths of a second. in 100ths of a second.. So why are we discussing all of this? For two main reasons. to help you understand how RS works.. the many variations of result type processing may seem to be a bit overkill as RS really doesn’t need or care about the results – let alone compute-by clause results.. First. Average time. Secondly and most appropriate to this section is the counters that are mostly associated with execution statistics. in 100ths of a second. in 100ths of a second. in 100ths of a second. but since they are applicable here – particularly in light of some of the derived values – they are repeated here for ease of reference. Normally.. there will only be a single result for each DML command.0. case CS_CMD_SUCCEED.. there might be any number of rows to be fetched and/or messages from print statements. spent in sending RPCs to the RDS. The maximum time taken. Average time. Maximum time. To some. Average time. spent in sending chunks of text or image data to the RDS. The maximum time taken. end switch end while switch on ct_results’ final return code case CS_END_RESULTS. in 100ths of a second. spent in sending chunks of text or image data to the RDS. to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'. in 100ths of a second. spent in sending command buffers to the RDS. to process the results of a command batch submitted by a DSI. to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'. spent in sending chunks of text or image data to the RDS. flushing them. Time. etc. Total checks for Open Server messages by a DSIEXEC thread. spent in sending RPCs to the RDS. This process includes creating command batches. in 100ths of a second. The amount of time taken by a DSI/E to execute commands. The maximum time taken. etc. in 100ths of a second.Final v2. The amount of time taken by a DSI/E to prepare commands for execution. 181 . in 100ths of a second. to check the sequencing on a commit. in 100ths of a second. 
to check the sequencing on command batches which required some kind of synchronization such as 'wait_for_commit'. but only when RS deadlocks with another nonRS process. in 100ths of a second. MsgChecks_Fail returns if timer expired before an event is returned. A transaction may span command batches.0. A transaction may span command batches. ErrsLogFull ErrsLogSuspend ErrsNoConn ErrsOutofLock Commit Sequencing DSIESCCTimeAve DSIESCCTimeMax MsgChecks Average time taken. If a timer is specified. in 100ths of a second. handling errors. to process a transaction by a DSI/E thread. sending and processing results. flushing commands. This includes function string mapping. to process a transaction by a DSI/E thread. handling errors. Total times that a DSI thread failed to apply a transaction due to no locks available in the target database (ASE Error 1204). The maximum time taken. MsgChecksFailed DSIETranTimeAve DSIETranTimeMax In RS 15. Note that this does not track the times when deadlocks occur with parallel DSI’s. Time. the counters are similar: Counter Explanation Preparation & Batch Sequencing DSIESCBTime DSIEPrepareTime Ct_send() phase SendTime SendRPCTime SendDTTime DSIEExecCmdTime DSIEExecWrtxtCmdTime Time. Message checks are for group and batch sequencing operations as discussed earlier in association with the dsi_serialization_method Number of MsgChecks_Fail returned when a DSIEXEC thread calls dsie__CheckForMsg(). in 100ths of a second. Total times that a DSI thread failed to apply a transaction due to no available log space in the target database (ASE Error 1105). Total times that a DSI thread failed to apply a transaction due to target the database in log suspend mode (ASE Error 7415). sending and processing results. Total times that a DSI thread failed to apply a transaction due to no connections to the target database (ASE Error 1601).1 Counter ErrsDeadlock Explanation Total times that a DSI thread failed to apply a transaction due to deadlocks in the target database (ASE Error 1205). Average time taken.0. Time. The amount of time taken by a DSI/E to execute commands related to text/image data. to check the sequencing on a commit. in 100ths of a second. spent in sending command buffers to the RDS. This includes function string mapping. This process includes initializing and retreiving text pointers. Total times that a DSI thread failed to apply a transaction due to no locks available in the target database (ASE Error 1204). to process the results of command batches submitted by a DSI. Note that this does not track the times when deadlocks occur with parallel DSI’s. The number of times a data server reported a status in the results of a command batch execution.0 BatchTime =(DSIEBatchTimeAve * DSIEBatch)/100. The number of times a data server reported the results processing of a command batch execution as complete.00 CommitSeqTime=(DSIESCCTimeAve * TransApplied)/100. requiring ‘re-calculating’ the original total that was used in the average: FSMapTime=(DSIEFSMapTimeAve * CmdsApplied)/100. The number of times a data server reported a parameter. Total times that a DSI thread failed to apply a transaction due to target the database in log suspend mode (ASE Error 7415).Final v2. the most useful DSIEXEC counters are the ‘time’ counters. This includes function string mapping. cursor or compute value in the results of a command batch execution.0 ResultTime=(DSIEResultTimeAve * DSIEBatch)/100. Time. The amount of time taken by a DSI/E to finish cleaning up from committing the latest tran. 
to process transactions by a DSI/E thread. to check the sequencing on commits.0 SendTime=(SendTimeAvg * DSIEBatch)/100. The number of times a data server reported failed executions of a command batch. Time. ErrsLogFull ErrsLogSuspend ErrsNoConn ErrsOutofLock Commit Sequencing DSIESCCTime DSIETranTime Time. but only when RS deadlocks with another nonRS process. in 100ths of a second. The number of times a data server reported a message or format information as being returned in the results of a command batch execution.0 182 . Total times that a DSI thread failed to apply a transaction due to no available log space in the target database (ASE Error 1105). In RS 12. sending and processing results.1 Counter ct_results() processing DSIEResSucceed DSIEResFail DSIEResDone DSIEResStatus DSIEResParm DSIEResRow DSIEResMsg DSIEResultTime Exception Processing ErrsDeadlock Explanation The number of times a data server reported successful executions of a command batch. in 100ths of a second. These clean up activities include awaking the next DSI/E (if using parallel DSI) and notifying the DSI/S. Total times that a DSI thread failed to apply a transaction due to deadlocks in the target database (ASE Error 1205).6. A transaction may span command batches. The number of times a data server reported a row as being returned in the results of a command batch execution. Total times that a DSI thread failed to apply a transaction due to no connections to the target database (ASE Error 1601). DSIEFinishTranTime However. the only counters were averages which meant that the most useful way of looking at them was from a total perspective. in 100ths of a second.0. MsgChecksFailed). Consequently. A high time here may indicate inefficient batching or slow response to client applications from the replicate server. it could point to fairly big customized function strings . as described earlier.00 RS 15. this includes the execution time as RS does very little result processing. If there is a lot of time spent in this area. this is the amount of time translating the replicated row functions into SQL commands. If the expected message is not there.0 DSI Post-Execution Processing After the DSIEXEC finishes executing the SQL. If the response is there. this is the amount of time creating the batches. Unfortunately. tuning RS isn’t going to help – you have to either tune the replicate database. we don’t really need to look at the value too closely as RS monitor counters will explicitly tell us how long the batch sequencing and commit sequencing times were. we have also highlighted the two message check counters (MsgChecks.0. we have to multiply the individual timing counters by the number of commands.1 BatchSeqTime=(DSIESCBTimeAve * TransApplied)/100. As discussed at the beginning of this paper. Frequently. BatchSeqTime .0 simplifies this thanks to the counter_total column in the rs_statdetail table. it notifies the DSI – which in turn 183 . If it can commit. ResultTime . The key to all of these is to remember that we are executing command batches with the transaction group currently being dispatched by the DSIEXEC and that multiple groups may be executed by the DSIEXEC within the sample interval. Although it seems odd. From these we can most often find quite clearly where RS is spending the time. Above. All the times reported by these counters are in 100ths of a second.or it also could point to contention within the replicate server . 
these metrics will among the highest and points to a need to speed up the replicate DBMS as the key to improving RS throughput.As noted earlier as well. Again.Most of the time for 12.which you may not be able to do much about. these counters were removed in RS 15.This calculated value can be used to determine the amount of time spent processing results from the replicate server.beyond the mechanics of append the SQL clauses is high enough that when the number of batches is high due to a low batch size setting. the MsgChecksFailed is incremented along with the MsgChecks. it then checks its own message queue for the response. SendTime). However. leaves execution time by the replicate database as the result. However. to get the time spent for each sample interval.Final v2.As noted earlier.This represents the amount of time spent sending the command batch to the replicate data server. it adds up considerably.possibly within the rs_threads group. batches or transactions processed by that DSIEXEC during that interval to get the total time spent on that aspect (note that this changes substantially in RS 15 as it tracks totals already). BatchTime . Let’s take the above ‘times’ in order of the execution and describe the likely causes: FSMapTime . generally when this value is high. it almost always goes hand in hand with dsi_cmd_batch_size being too small.This is the amount of time spent waiting to commit. is the time spent trying to coordinate sending of the first batch in parallel DSI’s. While the number of failures could be an obvious indication of a lengthy batch/commit sequencing issue. In actuality. a high value may indicate a near-serial dsi_serialization_method such was wait_for_commit . One possibility is that the overhead of batch creation . For parallel DSI’s this is done by first sending an rs_get_threadseq or using DSI Commit Control.00 TotalTranTime=(DSIETranTimeAve * TransApplied)/100. use parallel DSI’s (and the key here is to achieve the greatest degree of parallelism without introducing prohibitive contention) or use minimal columns/repdefs to reduce the SQL execution time.6 systems will be reported as TotalTranTime – which when you subtract the other components (FSMapTime. consequently we need to normalize to seconds to make them more readable. the number of message checks is kind of handy from a different perspective. when a DSIEXEC puts a message such as ‘Batch Ready’ on the DSI message queue.This. inter-thread communications are conducted using OpenServer message structures internally – allowing asynchronous processing between the threads. Consequently. To understand how these counters can be useful. And if this is the largest chunk of time. CommitSeqTime . think back to the earlier diagram of the DSI to DSIEXEC intercommunications concerning batch and commit sequencing. TotalTranTime . A lengthy time could indicate that the dsi_serialization_method is wait_for_commit and a previous transaction is running a long time – or that the DSI thread is simply too busy to respond to the Batch Sequencing message. only the MsgChecks counter is incremented. it checks to see if it can commit. A very high number in comparison to the number of transaction groups or command batches processed gives us an indication of whether transaction grouping is effective (along with other explicit counters for this). SendTime . you may wish to ensure that STS cache is sized appropriately. any latency will manifest itself in one of three locations: 1. 
If not sleeping.564 18.078 0 184 DSIEXEC CmdsApplied n/a n/a n/a n/a n/a n/a n/a n/a Src SQM CmdsWritten Dest SQM CmdsWritten Sample Time Dest SQMR CmdsRead RepAgent CmdsTotal DIST CmdsTotal Src SQMR CmdsRead .866 8.795 342 0 0 0 0 7. When you think about it.Block relative positions.008 18. Additionally.795 342 0 0 0 0 5.Read with Last Seg.1 coordinates the commits among the DSIEXEC threads. this will show the Next.797 324 1 2 2 3 5.615 27.sqm at the beginning and start simply by looking at the various “command” metrics across the full path through the RS.Block – although this is not totally accurate. The fastest way to isolate the problem is to do the following: Sp_help_rep_agent: Check the RepAgent state. Once the DSIEXEC has committed.797 324 1 2 2 3 5. sqm: Compare Next. DSIEXEC and RDB.797 324 1 2 2 2 5.962 18.866 5. you can skip the admin who.125 8. get sp_sysmon output to aid in further diagnostics. Once you know where you are beginning. for normal replication. you start by analyzing the SQT. Problem determination begins from that point forward according to the main near-synchronous pipelines: TranLog RepAgent RS RepAgent User SQM (W) Inbound Queue Inbound Queue SQT DIST Outbound SQM Outbound Queue Outbound Queue DSI DSIEXEC RDB Inbound Queue WS DSI WS DSIEXEC WS RDB (Warm Standby only) Outbound Queue RSI RRS DIST Outbound SQM Outbound Queue (Route only) For example.Final v2.Block. 1: For WS applications.524 7. For example: SQT CmdsTotal DSI CmdsRead SQT TransRemoved 21:40:46 21:42:47 21:44:48 21:46:49 21:48:50 21:50:50 21:52:51 21:54:52 5. Focusing on the RS.776 8.510 7.524 7. then the RepAgent is caught up. any latency in minor. End-to-End Summary The two most common questions that are asked are “Where do you begin?” followed closely by “How do you find where the latency is?” The answer actually is the second question. The outcome of this will identify which of the three disk locations mentioned above contains the latency. Alternatively.180 13.999 18. it notifies the DSI that it successfully committed and the DSI in turn notifies the SQM to truncate the queue of the delivered transaction groups.684 0 7. Admin who. DIST and outbound queue SQM threads.794 18. this will typically mean beginning with the SQM commands written. admin who. If the thread is next to commit.205 26.sqm is particularly ineffective. if the dsi_sqt_max_cache size is <4MB. you focus on the WS DSI.Read and Last Seg. Admin sqm_readers.868 5.0. Primary Transaction Log Inbound Queue Outbound Queue So. while for WS implementations. the next step is to verify the latency by using the M&C and comparing the “commands”. the first place to begin is to identify which of those three are lagging.867 5.Read is greater than Last Seg. if the latency is in the inbound queue. queue#. the DSI handles regrouping transactions after a failure. the DSI sends a message to the DSIEXEC telling it to commit.sqm though. Similar to admin who.868 5. it is likely that if Next.225 14.795 324 0 0 0 0 0 0 0 0 0 0 0 0 5.524 7.510 7. with 3 near-synchronous pipelines for normal replication (2 for WS). 3. If sleeping. 2.524 7.866 5.868 5. 869 0 0 0 0 0 0 0 0 0 0 3 0 747 3.744 9.364 2.075 0 3 0 842 3.688 9.407 1.192 8. the second step is to look for the obvious/common bottlenecks for each thread: Thread/Module RepAgent User Common Issues • • • • • • • • • • • • • • • • • • • • • RSSD interaction (rs_locater.999 0 3 0 741 2.357 1.999 0 3 0 747 3.187 8.326 3.0.869 0 6 0 844 3. However.688 9.411 1.837 2.366 3. 
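The quick checks above translate into a handful of commands. A sketch follows, assuming a primary database named pdb; run the first against the primary ASE and the rest against the Replication Server.

-- primary ASE: is the RepAgent sleeping (caught up) or still scanning?
sp_help_rep_agent pdb, 'scan'
go
-- Replication Server: queue positions (compare Next.Read / Last Seg.Block per queue)
admin who, sqm
go
-- SQT cache usage and DSI state for the suspect connection
admin who, sqt
go
admin who, dsi
go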
remember that latency in one thread may be the result of build up in threads further in the pipeline – classically SQT type problems.357 1.683 9.) STS Cache RepAgent Low packet size/scan batch size SQM Write Waits RSSD interaction Slow Disks Read Activity Large Transactions Write Activity Physical Reads vs.187 8.298 4.1 SQT CmdsTotal DSI CmdsRead SQT TransRemoved 21:56:53 22:02:21 22:04:22 22:06:22 22:08:23 22:10:24 22:12:25 22:14:26 22:16:26 0 6 0 844 3.187 8. etc.191 8. but we can not tell from the RS statistics.873 3.873 5.744 6.366 3.516 The DSIEXEC Cmds were not available as the customer who gathered the above did not collect all the statistics.744 9.411 1. Cached Cache Size (too large or too small) Large Transactions DIST/Outbound Queue slow No RepDefs Large Transactions RSSD Interaction STS Cache SQM Write Waits Cache Size (too large or too small) Large Transactions Transaction Grouping configuration SQM (Write) SQM (Read) SQT DIST DSI DSIEXEC CmdsApplied n/a n/a n/a n/a n/a n/a n/a n/a n/a Src SQM CmdsWritten Dest SQM CmdsWritten Sample Time Dest SQMR CmdsRead RepAgent CmdsTotal DIST CmdsTotal Src SQMR CmdsRead 185 .442 2. Regardless of the example above.359 4.688 9. enough is there to quickly determine the following: • • There definitely is latency in the DSI/DSIEXEC pipeline There may be latency at the source RepAgent.075 0 6 0 844 3.192 8.411 1.Final v2.192 8.366 2.999 0 3 0 747 3. After identifying where the problem is.442 2. This is appropriate as probably 90% of all latency problems stem from the SQL execution speed at the replicate database. 186 .000 or higher Max md_sqm_write_request_limit Right-size DSI SQT Cache – cache should be able to hold 1.5-2 times (max) the number of grouped transactions that you execute on average Target dsi_max_xacts_in_group Max dsi_xact_group_size. rs_functions. rs_columns Max sqm_write_request_limit Tune RepAgent Increase sqm_recover_seg Max sqm_write_request_limit Right-size SQT & DSI SQT Cache Right size cache Break up large transactions (app change) Use table RepDefs Sts_full_cache rs_objects. RS M&C includes timers for these actions that allow you to isolate that it is these endpoints that are the problem.0. In any case. we are done looking in detail at the RS aspects to the problem and can focus on the replicate database & replicate DBMS. From the previous sections we have tried to illustrate problems and provide general configuration guidance. A summary of the this guidance is repeated here: Thread/Module RepAgent Thread Common Issues • • • • • • • • • • • • • • • • • • DSIEXEC • • • Use large packet sizes Use larger scan batch sizes Watch RS response time Sts_full_cache rs_objects. dsi_large_xact_size to eliminate their effects Target dsi_cmd_batch_size to full tran group (40KB+ as starting point) or 50 commands Watch RDB DBMS response times Use Parallel DSI’s RepAgent User SQM (Write) SQM (Read) SQT DIST DSI At this point. rs_columns.Final v2. Set sts_cache_size to 1. Even the replicate DBMS is disk space in effect as the DML statement execution depends on changing disk rows and logging those changes.1 Thread/Module DSIEXEC Common Issues • • • • • • Replicate DBMS response time Command Batching configuration Lack of Parallel DSI’s Text/Image replication RRS DIST/SQM slow Network issues RSI These can readily be spotted by looking at the monitor counters detail in the previous sections. One aspect to consider is that each of the pipelines mentioned above begin and end with disk space. 
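Several of the remedies listed for these threads are one-line configuration changes. The commands below are a sketch only, with illustrative values, a replicate connection named RDS.rdb and a primary database named pdb.

-- Replication Server: larger SQT caches for the inbound queue and the DSI
configure replication server set sqt_max_cache_size to '4194304'
go
alter connection to RDS.rdb set dsi_sqt_max_cache_size to '2097152'
go
-- primary ASE: let the RepAgent send larger scan batches
sp_config_rep_agent pdb, 'scan batch size', '2000'
go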
The most frequent source of bottlenecks will be the components that talk to these disks – the RS SQM threads and the Replicate DBMS. it turns out the real cause of the problem is the replicate database. a series of samples using a short sample interval will be necessary to determine Most MDA tables are not stateful . the first point should illustrate that the SPID is likely not too reliable as any disconnect/reconnect can change the SPID. the KPID will differ meaning that counter values for the previous SPID will be lost for all but the stateful tables. Even it reconnects with the same SPID.although this can be skewed by large transactions and other conditions. how even the distribution of the workload for parallel DSI configurations .since all we are executing are DML operations based on primary keys or atomic inserts (ignoring procedure replication). The purpose of this section is not to discuss how to tune a replicate dataserver as that can be extremely situational dependent. several points to consider and common problems associated with replication will be discussed. the lack of a tuned replicate database system really impedes transaction delivery rates. While the goal of this table is not to teach how to monitor ASE using MDA tables . Two things contribute to the Replication Server’s quick blame for this: 1.but only show the cumulative values for the current sample period When querying the MDA tables. However. • • • While it might be tempting to simply look for the maintenance user by SPID. Consider the following: 187 . Key Concept #18: Not only is a well tuned replicate dataserver crucial to Replication Server performance. 2. Because the MDA tables can be accessed directly via SQL and provide process level metrics. when monitoring the maintenance user. the tables and queries contained in this section will focus primarily on the tables most applicable to monitoring maintenance user performance. In fact. poor design is quickly evident administrators will monitor replication delivery rates quicker than DBMS performance. As a strictly write intensive process. using the known parameters can reduce the query time significantly.Final v2. Additionally.0. we primarily are interested in determining the following conditions: • • • How quickly statements are executed . you can get a clear picture of replication maintenance user specific activity. it is an extremely rare database shop these days that regularly monitors their system performance beyond the basic CPU loading and disk I/O metrics. However.existing white papers already cover this topic. most statements should execute extremely quickly What the maintenance user process within ASE is waiting on Possibly. Maintenance User Performance Monitoring For ASE based systems.1 Replicate Dataserver/Database You gotta tune this too !! Often when people are quick to blame the Replication Server for performance issues. As with any client application. it is critical to have the Monitoring Diagnostic API (MDA) Tables set up for performance monitoring of the primary and replicate dataservers (possibly the RSSD as well if located on an ASE with production users). there are a couple of nuances when using MDA based monitoring of the replicate database: • The maintenance user may disconnect/reconnect during the following circumstances: o Errors mapped to stop replication o Parallel transaction failure due to deadlock or commit control intervention o DSI fadeout due to inactivity As with any MDA based monitoring. As a result. 
but a well instrumented primary and replicate dataserver is critical to determining the root cause of performance problems when the do occur. Final v2. we will use this diagram to special points of interest for monitoring the performance of the maintenance user.sysprocesses or from the monProcessLookup table (at the top in the diagram).. Assuming we had used the above query to determine which SPID/KPID’s we are interested in. In the next paragraphs. network I/O. However.0. As a result. prior to each sample the query: declare @SampleTime datetime select @SampleTime=getdate() select SampleTime=@SampleTime. etc. CPU access. the query to retrieve the wait events would be: 188 .0.monProcessActivity where ServerUserID=suser_id(‘<maint_user_name>’) The other tables then can be queried using a join with this table to narrow the results to only the SPID/KPID pairs used by the maintenance user in question. from the above diagram.. you can see ServerUserID is also in the monProcessActivity table which will need to be queried anyhow. Maintenance User Wait Events The best starting point for detecting maintenance user performance issues is to begin by looking at the “Wait Events” from monProcessWaits (bottom center of diagram). * into #monProcessActivity from master.1 Figure 39 . it might be tempting to retrieve the SPID/KPID pairs either from master. This table is key to determining how long the maintenance user task spent waiting for disk I/O.Useful ASE 15.1 MDA Tables for Monitoring Maintenance User Performance The first trick is to identify which of the SPID/KPID combinations we are interested in. Logically. WaitEventID Once we have the “wait events”.monWaitEventInfo e where w.Final v2.. master. Looking at the schema for the monProcessWaits table.SPID and a.KPID = w..EventDescription into #WaitEvents from #monProcessWaits w.*. which by default is 100 milliseconds. a wait event is recorded with a WaitTime of 0. #monProcessActivity a where a. In measuring wait events. Waits.* into #monProcessWaits from master. w.SPID = w. WaitEventID. MaxWaitTime=(case when Waits * 100> WaitTime then Waits * 100 else WaitTime end). WaitTime. there is a slight consideration that may make this not as important. Consequently a handy query for weighting the Waits and the WaitTime equitably might be a query similar to: select SampleTime.monProcessWaits w. A logical assumption might be to focus on the WaitTime. however.order by MaxWaitTime The following table lists some common wait events that you might see for a maintenance user WaitEventID CPU Related 214 215 Disk Read Related 29 waiting for regular buffer read to complete waiting on run queue after yield waiting on run queue after sleep Event Description Memory/Cache Related 33 34 36 37 Disk Write Related 51 52 waiting for last i/o on MASS to complete waiting for i/o on MASS initiated by another task waiting for buffer read to complete waiting for buffer write to complete waiting for MASS to finish writing before changing wait for MASS to finish changing before changing Transaction Log/Write Related 54 55 Network Receive 250 Network Send 171 251 waiting for CTLIB event to complete waiting for network send to complete waiting for incoming network data waiting for write of the last log page to complete wait for i/o to finish after writing last log page 189 . the server simply subtracts the timeslice a process was put to sleep in from the timeslice value when it was woken up. If it is the same timeslice. we need to find the ones of key interest. 
ASE measures time based on a timeslice or “ticks”.1 select [email protected] = e.KPID select w.0.Waits and WaitTime. we see that there are two columns for the metrics . e. EventDescription from #WaitEvents where Waits * 100 > 0 order by 5 -. The most common memory contention issue for maintenance users. one possible cause . if the replicate database is also being used be production users for reporting purposes or in a peer-to-peer fashion. ASE does all I/O write requests using as large of an I/O as possible. This can happen particularly for updates and deletes when the table is missing indexes on the primary key columns and during inserts when the clustered index is not unique and is non-selective (based on low-cardinality columns). Disk Read Delays While delays due to disk reads certainly could be due to slow disk drives or disk contention.particularly when the machine is used by production users . ASE will attempt a 2 or 3 page I/O sized write (46K for 2K page sized servers). if a query results in an APF pre-fetch of an entire extent. the maintenance users are competing for CPU time with the production users. the parallel DSI threads will all be attempting to append rows to heap tables or tables whose clustered index is ordered by a monotonic sequential key (including datetime values). if you see a lot of write based delays. if one parallel DSI just filled one page. When the housekeeper. In the case of the former.Final v2. Using cache partitions may alleviate this problem. consider reducing the number of threads to see if any improvement in throughput occurs. the next insert from a different parallel DSI may have to allocate a new page for the object and may try to append it to the same MASS area. or a checkpoint process flushes the pages based on the recovery interval. the most common form of MASS contention is in a high insert environment. Once in memory. As a result. however. Memory/Cache Contention Normally. Disk Write Delays As mentioned in the previous paragraph. when the wash marker is reached.is too few cache partitions. a much more likely cause for the maintenance user is excessive I/O due to a bad query plan. If they are. CPU Contention If there is a high degree of CPU contention (wait events 214 & 215).0. will be focused on the Memory Address Space Segment (MASS) spinlocks. all 8 pages are read from disk and placed into cache. A good starting rule of thumb is 5-10 threads per engine as a maximum. for IO efficiency. individual logical I/O’s as represented by wait events 33 & 34 will not be a problem. user DML statements may cause several pages to be updated (marked dirty). 190 . As a result. you may first want to look at the monDeviceIO/monIOQueue tables (not in the above diagram) along with OS utilities such as sar to see if slow disk response times. Note that writes of data pages normally only happen when either the housekeeper flushes a page. other users are prevented from trying to use those same pages by the MASS bit. For example. you have a couple of options available: • • • Increase the maintenance user priority to EC1 Use engine grouping to restrict reporting users to a subset of engines as well as focusing the maintenance user at the remaining engines Increase the number of engines If CPU contention is high and parallel DSI threads are being used. A MASS is a way of controlling concurrent access to group of contiguous pages in memory . In the case of replication server maintenance users. While those pages are being placed into cache. 
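The wait-event queries referenced above are fragmented in this copy of the document, so a reconstructed sketch of the intended sequence is shown below. It assumes the ASE monitoring parameters (including wait event timing) are enabled, runs as a single batch so that @SampleTime stays in scope, and uses monWaitEventInfo.Description for the event text; adjust the names if your MDA version differs.

declare @SampleTime datetime
select @SampleTime = getdate()

-- 1. the maintenance user's current SPID/KPID pairs
select SampleTime = @SampleTime, *
into #monProcessActivity
from master..monProcessActivity
where ServerUserID = suser_id('<maint_user_name>')

-- 2. wait rows for just those SPID/KPID pairs
select w.*
into #monProcessWaits
from master..monProcessWaits w, #monProcessActivity a
where a.SPID = w.SPID
  and a.KPID = w.KPID

-- 3. add the event descriptions
select SampleTime = @SampleTime, w.*, EventDescription = e.Description
into #WaitEvents
from #monProcessWaits w, master..monWaitEventInfo e
where w.WaitEventID = e.WaitEventID

-- 4. weight Waits against WaitTime (each wait costs at least one 100ms tick)
select SampleTime, WaitEventID, Waits, WaitTime,
       MaxWaitTime = (case when Waits * 100 > WaitTime
                           then Waits * 100 else WaitTime end),
       EventDescription
from #WaitEvents
where Waits * 100 > 0
order by MaxWaitTime desc
go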
you will need to consider the priority of the maintenance user as well as the numbers of parallel DSI threads being used.1 WaitEventID Event Description Contention/Blocking Related 150 41 Internals/Spinlocks 272 waiting for lock on ULC waiting for a lock wait to acquire latch Some of the more common issues are discussed below. For example. or ASE configuration values are causing the IO times to be longer than normal. ASE will do multi-page write of the pages within the MASS . to safely record the page as having been flushed.again. concurrent user access during the write operation is blocked. if 2 or 3 contiguous pages in a cache MASS area are dirty. checkpoint process or other write operation forces the pages to be flushed. This can be confirmed by looking at the statement and object statistics as will be described in “Query Related Causes” section later.typically 8 pages. If the replication latency is greater than desired. As a whole. one possible solution for this is to enable ‘delayed commit’ . This can be verified by looking at the monOpenDatabases table. The second condition (55) suggests that either the log device is slow in responding or that the number of writes per log page is causing the last log page to be busy. the first restriction was removed by adding a configuration setting to control ‘literal parameterization’ as well as a session setting. In addition. we separated them into different sections for this discussion. on a parallel effort.normal caveats about future functionality apply) the same capability for atomic insert/values statements which should benefit RS environments greatly. Replication Server engineering is looking at an enhancement to RS 15. Sybase introduced statement caching. which has columns that track the AppendLogRequests and the AppendLogWaits. so you will likely need to create a user defined function class that inherits from rs_sqlserver_function_class to minimize the impact and the work involved to implement this capability. the problem can be caused by: • • RS slow in sending commands to the ASE due to spending time on other processes ASE slow in parsing. While Replication Server could be viewed as sending very simplistic SQL statements (atomic inserts.0 (again. This has been proven in test scenarios involving high insert environments in which using fully prepared SQL statements were 3-10 times faster than the equivalent language commands.either for the entire database . transaction log based delays are collectively grouped with disk write activity . Commonly we might associate this with log semaphore contention. you will need to modify one of the class scope function strings executed at the beginning of the DSI connection sequence . Network Receive Delays This is likely the largest single cause of latency and as a result. When enabled. The danger in this is that non-ASE 15. It was further proven that the most expensive part of the delay was due to compilation or optimization as it was determined that language procedure calls did not exhibit the same delays as language DML statements. function string conversion and nearly all the time is spent in the send/execute and results processing windows. Beginning with ASE 12.54 & 55.2 is looking at providing (note this is a future release . if the majority of the write delays are due to waiting for the MASS to complete from a different user.and the housekeeper/checkpoint is forcing a disk flush before the page is completely full.Final v2. the ASE 12. 
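One low-risk way to act on the first bullet, raising the maintenance user priority, is to bind its login to the predefined EC1 execution class. The sketch assumes a maintenance user login of rdb_maint; the stored procedure signature is from memory, so check sp_bindexeclass in your ASE version before using it.

-- bind the maintenance user login to the high-priority class EC1 (login name assumed)
sp_bindexeclass rdb_maint, 'LG', NULL, 'EC1'
go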
it could point to a need to increase the ULC size at the replicate or speed up the physical log I/O of the process that currently has the log semaphore. any real attempt at improving the throughput of a maintenance user will likely need to begin with this. The first one (54) actually is referring to waiting to get access to the transaction log to flush the maintenance user’s ULC to the primary log cache. Transaction Log Delays In the MDA tables. it is hashed with an MD5 hash for that login and environment settings (such as isolation level). 191 .or just for the maintenance users.such as rs_usedb. The reason was that fully prepared SQL statements create a dynamic procedure that is executed repeatedly by simply sending the parameter values with each call vs. optimized and then executed. caveats regarding future release functionality apply) that would enable RS to send dynamic SQL vs. As of ASE 15. In the above list. then it is most likely is the second cause. ASE 15. this suggests that in a high insert environment you need more cache partitions or the clustered index is forcing parallel DSI’s to insert into the same page .0 servers may not understand this command.2. In the future. If modifying just for the maintenance users.5.0.0. If no real appreciable time is being spent in batching. language statements.consequently updates or deletes . Early tests with this have reported substantial improvements.5.but due to the differences in causes. If the maintenance users appear to be waiting on the log semaphore and the replicate system is not being used by production users.0.0. If the hash matches an already executed query. However. compiling. a language command. there were two transaction log delay wait events .1. optimizing language commands as typical DML statements are sent by RS The first one can be double checked by looking at the DSIEXEC time related counters. • In ASE 15. RS environments are strongly encouraged to enable this if the environment sustains a lot of update or delete activity. compiled.2 statement cache did not benefit Replication Server environments due to the following reasons: • The literal values were included in the hash key . Statement caching was not used for atomic insert/values statements. The second cause is a bit nasty. In reality.1 However. the issue is that every statement sent to the replicate DBMS needs to parsed. updates and deletes based on primary keys).especially those caused by a single statement at the source .could not use the statement cache as the literal values for the primary keys differed. that query’s optimization plan is used instead. execution (less any contention or other causes) is by far the least of these times. as each SQL command is received. However.and later Multi-Standby Architecture (MSA) . This can be caused by two primary issues: • • A low/default configuration for the server configuration “user log cache spinlock ratio” ULC flushing to the transaction log The first one is a setting that is often not changed by DBA’s. The specific table involved can be diagnosed via monOpenObjectActivity.2 or RS 15. Latch contention is likely caused by inserts into the same index pages by parallel threads and typically are not a major concern as latch duration is extremely short. While monLocks may seem the most apparent. RS is slow at processing the results. ASE CPU contention is preventing a task to be scheduled quick enough to tell if the network send was acknowledged. 
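On ASE 15.0.2 and later, the statement cache and literal autoparameterization described above are enabled with two configuration parameters, plus a session-level override that could be issued from a connection-scoped function string such as rs_usedb. Values are illustrative.

sp_configure 'statement cache size', 10000
go
sp_configure 'enable literal autoparam', 1
go
-- per-session alternative (for example, only for the maintenance user connection)
set literal_autoparam on
go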
The third cause can be alleviated by changing the proc/trigger code by bracketing print statements as well as the set flushmessage setting with a check for either the replication_role or the maintenance user by name . MSA and the Need for RepDefs When Sybase implemented Warm Standby Replication . Since this is a dynamic parameter. but needs to perform network I/O on a different network engine that it is connect to. this may point to the need to either increase the process priority for the maintenance user or use engine grouping to deconflict with other production users. The first is a most likely cause on larger systems. the most likely cause will be the second cause. Warm Standby. the logical lock event (150) corresponds directly to a lock contention issue either at a page or row level. The goal was to extremely simplify replication installation and setup for simple systems.or by ensuring that triggers are disabled at the replicate if the print statements are within triggers. Unfortunately. Of the two listed. The replicated procedure or trigger contains a number of print statements . then this is the likely cause. significant improvements in RS throughput can be achieved by using stored procedures and changing the function strings to call the stored procedures instead of the default language commands. it is unlikely that this will be a significant cause. Contention/Blocking Related Delays With parallel DSI’s or other production users on the replicate system. If the replicate DBMS has a large number of SMP engines. task to engine affinity is not explicitly supported within Sybase ASE.1 Until either ASE 15.EngineNumber column for the same SPID/KPID pairs. replication definitions are strongly recommended in high volume systems and in most cases due to the following reasons: 192 . If the above doesn’t help.Final v2. the ULC is locked from the user to prevent overwriting of the log pages in the ULC. you will need to monitor this closely.0 are enhanced to resolve the ASE optimization issue. this means that a single spinlock is used for every 20 ASE processes. Network Send Delays Network send delays can be caused by several factors within a replicate database • • • • The maintenance user task was running on one engine.0. Internal/Spinlock Delays Another common wait event for maintenance users is the waiting for a lock on their own ULC cache. an engine group may be desired. the only real alternative is to use engine groups to try to constrain the maintenance users to a subset of cpu’s . One way that it can be verified is by reducing the sample interval significantly and then monitoring the monProcess. the result is that likely only a single spinlock is used for all the parallel threads. it would be difficult to spot transient blocks. However. while engine to CPU affinity can be performed via dbcc tune(). If task migration is occurring a lot. By default. A second cause is that when a user’s ULC is flushed to the transaction log. this should be done with extreme caution and only after verifying that task migration is occurring. because the lock hash table changes so rapidly. there likely is not a lot that can be done about alleviating this problem. On smaller systems or non parallel DSI environments.thereby reducing the task migration.0. Again.particularly if the setting ‘set flushmessage on’ is enabled. For most replicate/standby databases attempting to use parallel DSI threads. However.the need for individual replication definitions for each table was made optional. 
Unless the ULC is full for the maintenance user. you may wish to reduce this to a low single digit (1-3) to see if it alleviates any delays. This can result in either database inconsistencies or errors that stop replication.although this is enabled for the standby database in a WS or MSA setup by default without a repdef. Even with repdefs. When minimal column replication is enabled. When database inconsistencies are reported to Sybase with a Warm Standby system.and if a table contains a float datatype. with the exception of the discussion on triggers which is covered in a later section. For example. inserting a value of 12. the RS has to assume all non-text/image/rawobject columns are part of the primary key. However. different operating system versions or different cpu hardware within the same family. assuming that the original system stored a “12.999999999999998 at the destination.999999999998. however. database inconsistencies could result. Similarly a delete hits zero rows. Primary keys are identified. for example. 193 . Consider the impact of the following type of query for a subsequent update: Update data_table Set column = new_value Where obj_id=12345 and float_column = 12. However. it is likely a duplicate key error will result on the insert. however. Because of the approximate nature of the float datatype.00000001 and stored at the replicate as 12. it ended up as 11. possibly everything is fine. all non-BLOB columns are included in the primary key/where clause generation for updates and deletes. Even worse. • • While the first two have either been explained before or are self-evident. Again. What happens is that the update simply affects 0 rows. real. it is extremely important to note that unless the replication definition contains the ‘send standby’ clause. this slight difference in the stored value may not be an issue for the application. As a result. ansinull is enforced or other similar conditions (such as data modifications due to a trigger if dsi_keep_triggers is “on”).0 on the primary may result in a translated value of 11. The subsequent insert will then fail as any unique index or constraint will flag the duplicate and raise the error (unless ignore_dupe_key is set). if different character sets/sort orders are used. the presence of approximate numeric datatypes/lack of repdefs leads the causes by a wide margin when materialization errors are excluded. the new value may not match the stored value resulting in not finding the row.especially when the table contains a float. not having a repdef can lead to database inconsistencies .000000002. In some cases. Before we do this. or if the primary key is not specified and all columns are used to define the where clause for update or delete DML operations. it will not be used by Warm Standby or MSA for primary key or other determination. each part of the where clause has to be compared vs. replicate database performance can be improved for updates as the number of unsafe indexes is reduced and a direct in-place update may be doable instead of a more expensive implementation. at the replicate. replicated as 12. When applied to the destination system. If basic scientific principals such as rounding to a specified number of significant digits were implemented in the application. strictly the primary key values. or double datatype. the time it takes to generate the SQL statement within RS is shorter and the execution at the replicate is also shorter. 
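For illustration only (server, database, table, and column names are placeholders), a replication definition that names an explicit primary key, requests minimal column replication, and is sent to the standby might look like the following; note that the float column is deliberately kept out of the primary key for the reasons discussed above:

create replication definition trades_repdef
with primary at PDS.pdb
with all tables named 'trades'
(trade_id int, trade_dt datetime, symbol char(6), price float, qty int)
primary key (trade_id, trade_dt)
send standby replication definition columns
replicate minimal columns
go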
a common implementation today includes reporting/historical database feeds from the standby system. minimal column replication is allowed with replication definitions . Let’s take a look at each of these. The reason is that the delete will likely miss the desired row due to the float datatype. By having a repdef and defining the primary key.000000000001 at the primary. the last bullet may catch some by surprise.0” perfectly. float or any approximate numeric should not be used as a primary key or a searchable column . Approximate Numerics & RepDefs Without a replication definition. an insert of 12. Without a primary key. it can happen in work tables as well as older GUI’s that translated primary key updates into delete and insert statements. when the row was sent to the destination. the destination database language handler translates the string literal ASCII number to the binary numerical representation – typically by calling the C library routine atof(). Most data movement systems encode data values as ASCII text values for transport between the systems. Replication Server does not support significant digit rounding. a replication definition must be used. but during execution. If a different host platform is involved. The result is that at the primary.Final v2. the translation on the destination machine may be slightly different that at the origin. Consider what happens if an application deletes a row and then later the same row is reinserted.1 • As mentioned earlier.0 Note that the result is not an error. The problem becomes especially acute when the float column is a member of the primary key. While this does not appear to be common.0. The result not only is that the where clause that is generated a lot longer.0 at the primary may get stored as 12. database consistency is not guaranteed. Consider the case in which last name may be part of the primary key and two records are inserted with the only distinction in the key values being that in one case the name is “McDonald” and the other “Mcdonald” .Final v2.or if repdefs are not being used. Now. if the table has a repdef. the result is instantaneously disastrous as the RS latency stretches for hours. if any part of the actual key includes character data. not specifying a primary key . However.4.5.5. a close second . it is safe to say that any warm standby or MSA implementation between different character sets or sort-orders is risky and could result in data inconsistencies. as you can guess by now. each row becomes a single delete at the replicate . but even then. replication definitions may not be recommended. One million table scans to be precise. In other cases. both of these settings can now be exported from a login trigger . Those that are alert may point out that this requires the connection to issue the ‘set ansinull on’ statement whereas the default is ‘set ansinull off’ (or fipsflagger). the generated update or delete could resemble: Delete data_table Where first_name = ‘Fred’ and last_name= ‘McDonald’ With a repdef and primary key. Query Related Causes While the language command optimization issue (see Network Receive Delays above) is likely the biggest cause of throughput issues for high-insert intensive environments. The main tables that will help with this are illustrated here: 194 . character data . procedure replication is being used . Whether or not the table has a replication definition.2) forcing non table owners to a table truncation in this fashion. 
Equally problematic was that this table easily contains ~1 million rows or more. The most common example of this is when the original system uses binary sort order and the standby uses case-insensitive sort order. the other attributes may differ and prevent the problem. In a typical lazy standby implementation that does not have a repdef defined. the replicated delete may affect more than one row at the replicate. as of this writing. By definition then.0. As a result.1 ANSINULL enforcement If ANSINULL is enabled. Without a replication definition. in 12. when using different character sets. you will need to carefully monitor the query performance at the replicate.especially for update/delete intensive transactions are standard query related problems. The problem is that while the delete is a single statement at the primary.it promptly becomes a table scan for each delete. when triggers are enabled. the biggest problem was that the table did not have any defined primary key constraint nor any unique indices (although an identity column existed and had a nonunique index defined solely on that column).especially if a localized system only uses numeric keys vs.while other non-key attributes may differ. a common financial trading application includes a delete statement without a where clause. if a warm standby is created and ansinull is enforced. Consequently. As an example. then without a primary key. Different Character Sets/Sort Orders If replicating between different character sets and sort orders. if the primary uses a case sensitive sort order and the replicate uses a case insensitive sort order. database comparisons using a syntax such as column=null are always treated to be false.and lacking any index information based on the where clause . While it is likely that this was done prior to truncate table being a grantable option (ASE 12.could result in database inconsistencies. While this may be an extreme example.consequently care must be taken to ensure that the login trigger doesn’t set these automatically for the maintenance user. it is likely that nearly every update and delete will fail to work correctly as any column containing a null value will result in 0 rows affected. database inconsistencies can happen. a primary key may help reduce database inconsistencies caused by character conversion/sort comparison. Final v2.0.1 Figure 40 - MDA Tables Useful for Query Analysis Note that the table monSysPlanText was excluded from the above - this is due to the fact that while the query plan could confirm what is happening - due to the need to configure an appreciable pipe size and the impact the configuration value has on execution speed, we have avoided it. However, for particularly perplexing issues, it still maybe required. To begin with, you will want to make sure that the monProcessActivity.TableAccesses, IndexAccesses and LogicalReads/PagesWritten have the correct relative ratios for the maintenance users. For example, if the number of TableAccesses are high, it could be an indication of a table scan - which should also be evident as the number of LogicalReads may be orders of magnitude higher than expected. The obvious question is ‘What are the expected orders of magnitude?’ The answer is that it depends on the operation, minimal column replication setting and volatility of the indexed columns. 
Consider the following table of rough per-row costs:

Insert: one index traversal to locate the insert point (reads), a write for the data row, plus index traversals to locate the index key insert points and writes for each index key. Typical cost: 50-75.

Update: PK index traversal to locate the row, a write for the data row, plus index traversals for each unsafe index and index key overwrites. Typical cost: 10-50.

Delete: PK index traversal to locate the row, a write to delete the row, plus index traversals for all indexes and index key deletions. Typical cost: 50-75.

As a result, if the delta between two samples shows that the maintenance user did 100,000 logical I/Os but only 60 page writes, this points to a likely indexing issue. To find the issue, the next step is to try to isolate which object it is occurring for. There are several possibilities for this. The first is monProcessObject, but it is unlikely to help as it only records the object statistics for the currently executing statement in the batch. Consequently, unless the server just happened to be still executing the bad statement, it is unlikely that this will provide any useful information. monProcessStatement has the same issue. The second likely answer is to use monOpenObjectActivity. If no other production users are on the system, the task is a simple comparison of the LogicalReads/PagesWritten ratio, and in addition you can look for a table in which the IndexID=0 and the LastUsedDate is non-null (indicative of a table scan). Failing that, you can use monSysStatement and again compare the LogicalReads/PagesModified (and in ASE 15.0.1 the new RowsAffected column) for the maintenance user SPID/KPID pairs. While this can prove beyond a shadow of a doubt that an ineffective index was being used (or, if proc replication or triggers are enabled, bad logic within them), the actual table involved cannot be identified without monSysSQLText. Regardless, if triggers are still enabled or procedure replication is occurring, you will need to watch monSysStatement closely for the maintenance user and attempt to keep the total I/O cost of any triggers/procedures to the absolute minimum, which may mean that triggers have to be rewritten to avoid joins with the inserted/deleted tables and be optimized for single-row DML statements.

Triggers & Stored Procedures

In this discussion, we are not focusing on stored procedure replication, but rather on what can happen when triggers are enabled and in particular when the trigger calls stored procedures at the replicate database.

Triggers & Database Inconsistencies

Other than float/approximate datatype issues, the second (and a distant second) most common cause of inconsistencies as a result of not having replication definitions is when triggers are enabled. For a standard warm standby, triggers are disabled by default via "dsi_keep_triggers". However, if replicating stored procedures, DBAs may have changed this setting as they have been instructed to do so to ensure the integrity of actions with replicated procedures. Or, some DBAs have simply enabled triggers out of fear that without them database inconsistencies could result. Additionally, for MSA implementations, the default setting is that triggers are enabled. Some of the most common fields modified by triggers include auditing data (such as last update time), aggregate values, derived values, etc. Typically, these columns are not part of the primary key.
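Assuming the MDA tables are enabled, queries along the following lines (the login rdb_maint and database rdb are placeholders) can narrow down where the maintenance user's I/O is going:

-- relative ratios for the maintenance user; sample twice and compare the deltas
select SPID, TableAccesses, IndexAccesses, LogicalReads, PagesWritten, ULCFlushes
from master..monProcessActivity
where suser_name(ServerUserID) = "rdb_maint"
go

-- objects in the replicate database being hit without an index
-- (IndexID = 0 plus a recent LastUsedDate usually means table scans)
select object_name(ObjectID, DBID) as table_name, IndexID,
       LogicalReads, PagesWritten, LastUsedDate
from master..monOpenObjectActivity
where DBID = db_id("rdb")
  and IndexID = 0
  and LastUsedDate is not null
order by LogicalReads desc
go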
As a result, if no replication definition is found, the update or deletes may fail as the actual values for these columns may differ. There is a common fallacy that triggers should be enabled for all replication except Warm Standby – and that this is the only way to guarantee database consistency. Actually this is only true for the following situations: 1. Not all the tables in the database are being replicated, and one of the replicated tables has a trigger that maintains another table (i.e. a history table) that is not replicated, but a similar table maintenance is desired at the replicate A stored procedure that is replicated has DML statements that affect tables with triggers that update other tables (replicated or not) in the same database. 2. The latter reason is likely the most common – however, leaving dsi_keep_triggers to ‘on’ just for this cause is grossly inefficient as a more optimal solution would be to have the proc check @@options and manually issue ‘set triggers on/off’ as necessary. To balance the above, there are cases where leaving the triggers enabled would result in database inconsistencies as well. Consider the following: 1. 2. All tables in the database are replicated. The trigger calls a stored procedure that does a rollback transaction or returns a negative return code between -1 and -99 The first case is fairly obvious. Any trigger that causes an insert (i.e. maintains a history table) or does an update to an aggregate value will cause problems at the replicate – either throwing duplicate key errors – or the triggered DML statements from the primary will clobber the triggered changes at the replicate – and the values may be different. 196 Final v2.0.1 The second case is really interesting and requires a bit of knowledge of ASE internals. Returning a negative number from a stored procedure return code is something that is fairly common among SQL developers. Now, we all know that just because something is documented as something developers shouldn’t do doesn’t mean that we all obey it. Case in point is that the ASE Reference Manual clearly states that: One aspect for the customer to consider is that return values 0 through -99 are reserved by Sybase. For example: 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 Procedure executed without error Missing object Datatype error Process was chosen as deadlock victim Permission error Syntax error Miscellaneous user error Resource error, such as out of space Non-fatal internal problem System limit was reached Fatal internal inconsistency Fatal internal inconsistency Table or index is corrupt Database is corrupt Hardware error Now then, consider the following schema: use pubs2 go create table trigger_test ( rownum int identity not null, some_chars varchar(40) not null, primary key (rownum) ) lock datarows go create table hist_table_1 ( rownum int not null, ins_date datetime not null, primary key (rownum, ins_date) ) lock datarows go create table hist_table_2 ( rownum int not null, ins_date datetime not null, primary key (rownum, ins_date) ) lock datarows go create procedure bad_example @rownum int as begin declare @curdate datetime select @curdate=getdate() insert into hist_table_2 values (@rownum, @curdate) return -4 end go create trigger trigger_test_trg on trigger_test for insert as begin declare @currow int select @currow=rownum from inserted insert into hist_table_1 values (@currow, getdate()) exec bad_example @currow end go Note the highlighted line – the proc returns -4 – no error raised…..just a negative return code. 
We would expect that by inserting a row into trigger_test that the trigger would fire, inserting a row in hist_table_1, then calling the proc which would insert a row in hist_table_2….let’s try it: 197 Final v2.0.1 ---------- isql ---------1> use pubs2 1> truncate table trigger_test 1> begin tran 1> insert into trigger_test (some_chars) values ("Testing 1 2 3...") 2> select @@error (1 row affected) 1> commit tran 1> select * from trigger_test 2> select * from hist_table_1 3> select * from hist_table_2 rownum some_chars ----------- ---------------------------------------(0 rows affected) rownum ins_date ----------- -------------------------(0 rows affected) rownum ins_date ----------- -------------------------(0 rows affected) Output completed (0 sec consumed) - Normal Termination What happened???? It looks like the insert happened – we did get back the standard “(1 row affected)” message after all – and no error was raised….but curiously, neither did we get the results of @@error….hmmmmmm…and all the tables are empty. Let’s change the trigger slightly to: create trigger trigger_test_trg on trigger_test for insert as begin declare @currow int select @currow=rownum from inserted insert into hist_table_1 values (@currow, getdate()) exec bad_example @currow select @@error select * from hist_table_1 end go And add an extra insert to the execution: ---------- isql ---------1> use pubs2 1> 2> begin tran 1> insert into hist_table_1 values (0, getdate()) 2> insert into trigger_test (some_chars) values ("Testing 1 2 3.....") 3> select @@error (1 row affected) ----------0 (1 row affected) rownum ins_date ----------- -------------------------0 Jan 4 2006 1:21AM 401 Jan 4 2006 1:21AM (2 rows affected) 1> commit tran 1> select * from trigger_test 2> select * from hist_table_1 3> select * from hist_table_2 rownum some_chars ----------- ---------------------------------------(0 rows affected) 198 Final v2.0.1 rownum ins_date ----------- -------------------------(0 rows affected) rownum ins_date ----------- -------------------------(0 rows affected) Output completed (0 sec consumed) - Normal Termination Whoa! Still no error inside the trigger immediately after the proc call with -4 returned, and the rows were being inserted….but…no data. The reason is that if a nested procedure inside a trigger (or another procedure) returns a negative return code, ASE assumes that the system actually did raise the corresponding error (i.e. -4 is a permission problem) and that it is supposed to rollback the transaction. All of course, without errors….which means if this happened at the replicate database, the replicate would get out of synch with the primary and no errors would get thrown. Ouch!!! Trigger/Procedure Execution Time Besides data inconsistency problems when triggers exist, the biggest problem with triggers is that the typical coding style for triggers is not optimized for single row executions. It is not uncommon to see throughout a trigger multiple joins to the inserted/deleted tables or joins where if a single row was all that was affected could be eliminated using variables. This results in a lot of unnecessary extra I/O that lengthens the trigger execution time needlessly. Trigger and procedure execution time are extremely, extremely critical. One metric of interest may be to know that trigger based referential integrity is 20 times slower than declarative integrity (via constraints). 
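As a sketch of the kind of rewrite meant here (all object names are hypothetical), a trigger that always joins against the inserted table can special-case the single-row statements a replicate DSI normally produces:

create trigger orders_ins_trg on orders for insert
as
begin
    declare @numrows int, @cust_id int, @order_dt datetime
    select @numrows = @@rowcount          -- rows affected by the triggering statement
    if @numrows = 1
    begin
        -- single-row path: no join against inserted needed
        select @cust_id = cust_id, @order_dt = order_dt from inserted
        update customers set last_order_dt = @order_dt where cust_id = @cust_id
    end
    else
    begin
        -- multi-row path: fall back to the set-based join
        update customers
        set last_order_dt = i.order_dt
        from customers, inserted i
        where customers.cust_id = i.cust_id
    end
end
go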
Remember, in order to maintain commit order, the Replication Server basically applies the transactions in sequence; even in parallel DSI scenarios, the threads block and wait for the commit order. As a result, while procedure execution is great for Replication Server performance from a thread-processing perspective, the net effect is that as soon as a long procedure begins execution, the following transactions in the queue effectively are delayed. Note that this is not unique to stored procedures; long running transactions have the same effect (i.e. replicating 50,000 row modifications in a single transaction vs. a procedure that modifies them has the same effect at the replicate system; however, the procedure is much less work for the Replication Server processing). As a result, particular attention should be paid to stored procedure and trigger execution times (if you for some odd reason opt not to turn triggers off for that connection). Any stored procedure or trigger that employs cursors, logged I/O in tempdb, joins with the inserted/deleted tables, etc. should be a candidate for rewriting for performance. Ideally, triggers should be disabled for replication at the replicate via the DSI configuration 'dsi_keep_triggers'.

Key Concept #19: Besides possibly causing database consistency issues, trigger execution overhead is so high and the probable coding style so inefficient that triggers may be the primary cause of replication throughput problems; as a consequence triggers should be disabled via 'dsi_keep_triggers' until proven necessary and then enabled individually if possible. To see how to individually enable triggers, refer back to the trick on replicating SQL statements via a procedure call and using @@options to detect the trigger status.

Concurrency Issues

In replicate-only databases, concurrency is mainly an issue between the parallel DSI threads or when long running procedures execute and lock entire tables. However, in shared primary configurations (workflow systems or other systems in which the data in the replicate is updated frequently), concurrency could become a major issue. In this case, user transactions and Rep Server maintenance user transactions could block/deadlock each other. This may require decreasing the dsi_max_xacts_in_group parameter to reduce the lock holding times at the replicate, as well as ensuring that long running procedures replicated to that replicate database are designed for concurrent environments.

Key Concept #20: In addition to concurrency issues between maintenance user transactions when using Parallel DSI's, if the replicate database is also updated by normal users, considerable contention between the maintenance user and application users may exist. Reducing transaction group sizes as well as designing long running procedures to not cause contention are crucial to ensuring that the contention does not degrade business performance at the replicate or Replication Server throughput.

Similar to any concurrency issue, depending on what resources are the source of contention, it may be necessary to use different locking schemes, etc. at the replicate than at the primary (or the same if Warm Standby). Consider the following activities:

Additional Indexes: Additional indexes, particularly if replicating to a denormalized schema or data warehouse, could increase contention. While not necessarily avoidable, this may require a careful "pruning" of OLTP-specific indexes.

DOL Locking: Eliminate index contention and data row contention by implementing DOL locking at the replicate system.

Table Partitioning: Provide parallel DSI's multiple last pages to avoid contention without implementing DOL locking.

Triggers Off: Have the RS DSI disable triggers, especially data validation triggers.

Obviously, the above list is not complete, but it may provide ideas for resolving contention issues when the contention is not due to locks being held longer because of transaction grouping.

it may be advisable to create table replication definitions and subscriptions for the same tables that the replicated stored procedures will affect. do not attempt to replicate affected data using table replication definitions and subscriptions. they would not receive anything. Which brings us to the point the second reference was making.
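The two connection-level settings discussed above (dsi_keep_triggers and dsi_max_xacts_in_group) are changed with RCL along these lines; RDS.rdb and the group size of 5 are placeholders, and the connection is typically suspended and resumed so the new values take effect:

suspend connection to RDS.rdb
go
alter connection to RDS.rdb set dsi_keep_triggers to 'off'
go
alter connection to RDS.rdb set dsi_max_xacts_in_group to '5'
go
resume connection to RDS.rdb
go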
as a DML statement is logged. Table Replication The above question is a common misconception that you cannot replicate both procedures and tables modified by replicated procedures. If the stored procedures are identical. For a stored procedure. San Francisco and Chicago all sharing trade data. Now. DML inside stored procedures marked as replicated is not replicated. This is illustrated with the following sample fragment of a transaction log: XREC_BEGINXACT XREC_EXECBEGIN proc1 XREC_INSERT Table1 XREC_INSERT Table2 XREC_DELETE Table3 XREC_EXECEND XREC_ENDXACT (implicit transaction) (proc execution begins) (insert DML inside proc) (insert DML inside proc) (delete DML inside proc) (end proc execution) (end implicit tran) Only the highlighted records will have the LSTAT_REPLICATED flag set. you must subscribe to the stored procedure even if you also subscribe to the table. Consider the scenario of New York.5 Confused?? A lot of people are. they will make identical changes to each database. This would allow both to be replicated as two independent log threads would be involved. However.e. only the insert to Table A occurs. At the primary. If replicating procedures and the dsi_keep_triggers setting is ‘off’ database inconsistencies might develop.0. The answer – as in all performance questions – is: “It 202 . 2. First the obvious – set dsi_keep_triggers to ‘on’. @paramn datatype] as begin if proc_role(“replication_role”)=1 set triggers on … dml statements … if proc_role(“replication_role”)=1 set triggers off return 0 end By ensuring user has replication role. 3. an insert occurs on Table A. the question is how does this affect performance. Procedure Replication & Performance Now that we have cleared that matter up and we understand that we can replicate procedures and tables they affect simultaneously. any replicate that subscribes to one should also subscribe to the other to avoid data inconsistency. Preventing this can be done in one of two ways. The other – and possibly better approach – is to consider how the triggers got disabled in the first place – via a function string executing the command “set triggers off”. A notable exception to that is that if replicating to a data warehouse.Final v2. However. This brings up another key concept about procedure replication: Key Concept #22: If replicating procedures. a replicated procedure is executed. When applied. The reason is evident in the below scenario: 1. In the procedure. there is a gotcha when replicating procedures and tables. However. the data warehouse may not want to subscribe to a purge or archive procedure executed on the OLTP system. This then can be included in the procedure logic via a sequence similar to: create procedure proc_a @param1 datatype [. Table A’s trigger modifies Table B Procedure is replicated as normal via Rep Agent to Replication Server. other users executing the same procedure would not get permission violations. Because triggers are off. this could significantly affect throughput. the procedure is executed. 
special care must be taken to ensure that DML triggered operations within the procedure are also handled or otherwise you risk an inconsistent database at the replicate.1 Exec proc1 Chicago exec proc1 OBQ Chicago (Nothing) BT X proc1 I Table1 I Table2 D Table3 D Table4 CT London OBQ London (Nothing) Tokyo OBQ Tokyo New York IBQ New York BT exec proc1 CT Exec proc1 San Francisco OBQ San Francisco Figure 41 – Replicated Procedure & Subscriptions Which brings us to the following concept: Key Concept #21: If replicating a procedure as well as the tables modified by the procedure. However. using our earlier scenario of our savings account interest procedure. 5. Secondly. 2. Replication Agent forwards procedure execution to RS nearly immediately. Procedure begins execution at 8:00pm and implicitly begins a transaction. the CPU resources required for tens to hundreds of thousands of comparisons are enormous.0. At midnight the procedure completes execution. We would see the following behavior: 1. the time it would take to process that many individual updates would probably exceed the required window. the bank updates all of the savings accounts with interest calculated on the average daily balance during that month. The Replication Agent would have to process and send to the Replication Server every individual account record. At a certain part of the month. If this procedure were replicated. we have a total of 8 hours from when the process begins until it completes at the replicate. Each account record would have to update as individual updates at each of the replicates The impact would be enormous. 3. 4. 2. The former is often referenced in replication design documents. RS SQT thread caches execution record until the procedure completes execution and the completion record is received via the implicit commit. the Replication Agent would lag significantly. let’s assume that the procedure takes 4 hours to execute. and 4 hours from when it completes at the primary until it completes at the replicate. 4. this would mean the following: 1. Replication Server only replicates committed transactions. how can replicating a stored procedure negatively affect replication? The answer is two reasons: 1) the latency between begin at the primary and commit at the replicate. Reduced Rep Agent & RS Workload Consider a normal retail bank. 6. Replicating procedures can both improve replication performance as well as degrade replication performance. and consequently. First. which in turn would only have to save/process that single record. the procedure will complete at the replicate at 4:00am Consequently. a stored procedure containing the update would be executed instead. 3. This timeframe might be acceptable to some businesses. 5. Remember. the space requirements and the disk I/O processing time would be nearly insurmountable. Within seconds. the Replication Agent has forwarded the commit record to RS and RS has moved the replicated procedure to the Data Server Interface (DSI). For sake of the example. failover sites. then the Replication Agent would only have to read/transfer a single log record to the Replication Server. Key Concept #23: Any business transaction that impacts a large number of rows is a good candidate for procedure replication. How would replicating stored procedures help?? That’s easy to see. An example of this happening can be illustrated with the following scenario. And lastly. The account records would have to be saved to the stable device. Now. 
If replicating the savings account table to regional offices. or elsewhere. what if the procedure took 8 hours to execute? Basically. 7. The account records would have to be saved again to the stable device – once for each destination. The difference could be hours of processing saved – and the difference between a successful replication implementation or one that fails due to the fact the replicate can never catch up due to latency caused by excessive replication processing requirements. Let’s discuss #1.000tph and that the interest calculation procedure takes 8 hours to run. Third. beyond a doubt. Let’s assume that we have a bank that has a sustained 24x7 transaction rate of 20. and 2) extreme difficulty in achieving concurrency in delivering replicated transactions to the replicate once the replicated procedure begins to be applied. the replicate would not be caught up for several hours after the business day began – which may not be acceptable for some systems such as stock trading systems with more real time requirements. Each and every account record would be compared to subscriptions for possible distribution. along with very frequent transactions that affect a small set of rows. Rather than updating the records via a static SQL statement at the primary. The DSI begins executing the procedure at the replicate shortly after midnight Assuming all things being equal. if stored procedures are can reduce the disk I/O and Replication Server processing. will be discussed first. Increased Latency & Contention at Replicate So. This literally can be tens of thousands to hundreds of thousands of records. let’s assume that we have Replication Server 203 .Final v2.1 depends”. e.0.000 xactn in 17 hours (plus interest calculation) =20. Consequently.000 transactions arrive! Another way to look at it is that the RS has a net gain of 10.000tph. transactions must be delivered in commit order. More than 7 hours in fact. Consequently. we are significantly behind. 70. concurrent transactions by customers (i. it must complete before any other transactions can be sent/committed (discussed in next paragraph). Whatever the cause.000tph. we are now 70. but consider this: while the procedure is executing at the primary. Why? Remember.000 transactions behind represents 7 hours before we are caught up.000tph)): 20K 40K 60K 80K Interest Calculation Procedure 100K 120K 140K 160K 180K 200K 220K 240K 260K 280K 300K 320K 340K 340.Final v2.000 xactns behind T17 T16 T15 T14 T13 T12 T11 T10 T09 T08 T07 T06 T05 T04 T03 T02 T01 T00 Figure 43 – Procedure & Transaction Execution At The Replicate Even at 30. another 27. That explains the latency issue – what of the concurrency? Why can’t the normal transactions continue to execute at the replicate simultaneous with the procedure execution the same way it did at the primary? This requires a bit of thinking. This delays the procedure from starting for 4 hours after it completes at the primary.000 transactions must be delivered by the RS before it can send the proc for execution.000 transactions behind – which sounds not that bad – a mere two hours or so at 30. This is illustrated in the following diagram (each of the lines represents one hours worth of transactions (20K=20.000tph. 
During those 140 minutes.000tph T17 T16 T15 T14 T13 T12 T11 T10 T09 T08 T07 T06 T05 T04 T03 T02 T01 T00 Figure 42 – Procedure & Transaction Execution At The Primary Normally we would be happy as it would appear that we have a 50% surge capacity built into our system and we can go home and sleep through the night.000tph rate (2h:20min to be exact).1 tuned to the point that it is delivering 500tpm or 30. But…. 204 . a full 240. Now that we are executing the procedure. Except that we would probably get woken up at about 4am by the operations staff due to the following problem: 30K 60K 90K 120K 150K 180K 210K 240K Interest Calculation Procedure 270K 270.000 xactn in 17 hours (plus interest calculation) = 70. While it may take RS several hours to catch up. As a result. Since RS guarantees commit order at the replicate. Since they would commit far ahead of the interest calculation procedure. If only using a single DSI. 205 . as illustrated in the second timeline. Long Running Procs . it may be advisable to actually replicate the row modifications. In addition to the fact that transactions committed shortly after the interest procedure suddenly have a 8 hour latency attached.Purge procedures when one of the targets for replication is a reporting system which is used for historical trend analysis. checks clearing from business retailers). there is no reason why Annie Aunt’s interest calculation needs to be part of the same transaction as Wally the Walrus – but whether or not that is how it is done at the primary. Purge Procs . If multiple DSI’s and no contention. the Replication Server would be able to keep up with the ongoing stream of transactions. 3. entirely on the replicate – it just might be less than the latency incurred due to replicating the procedure. the following would happen: 1. Due to contention. Is there a way around this problem without replicating the individual row updates? Possibly. However.Procedures containing mass updates in a single statement. in which the replicated procedure could use it’s own dedicated connection to the replicates. It is followed by a steady stream of other transactions – possibly even a batch job requiring 3 hours to run. as illustrated in the first timeline above. each a separate transaction (after all. This would be the same as atomic updates. they would show up at the replicate within a reasonable amount of time. the DSI would have to ensure that the follow-up transactions did not commit first and would do so by not sending the commit record for the follow-up transactions until the procedure had finished. this would only work in such places where having a transaction that committed at the primary after the interest calculation but commits before it at the replicate does not cause a disparity in the balance. You should consider not replicating the procedure and allowing the affected rows to replicate when: Cursor Procs . at the replicate they would be all part of the same transaction due to the fact the entire procedure would be replicated and applied within the scope of a single transaction. a multiple DSI approach could be used to the replicate system. the replicated batch process may not even begin execution via a parallel DSI until the replicated interest procedure committed.Procedures that process a large set using cursor processing and applying the changes as atomic transactions. 
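Using the example's own rates (20,000 tph arriving, a 30,000 tph delivery capability, and an 8-hour procedure), the backlog arithmetic behind the figures is:

    arrivals at the primary over the 17-hour window:        17 x 20,000 = 340,000 transactions
    applied at the replicate (8 hours lost to the proc):    (17 - 8) x 30,000 = 270,000 transactions
    backlog when the procedure finally commits:              340,000 - 270,000 = 70,000 transactions
    net drain rate afterwards:                                30,000 - 20,000 = 10,000 tph
    time to catch up:                                         70,000 / 10,000 = 7 hours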
You probably should consider replicating stored procedures when: OLTP Procs .Procedures that are processing sequential lists such as job queues (replicating these could result in inconsistent databases). The CPU and disk I/O savings with RS need to be balanced against this before deciding to replicate any particular procedure. Assuming this pattern continues even after the procedure completes (i. This could be done by not replicating the procedure but have the procedure cursor through each account. assuming the average daily balance is stored on a daily basis (or other form so that changes committed out of order do not affect the final result). RS processes the transactions in commit order and internally forwards them to the DSI thread for execution at the replicate. … Key Concept #24: Replicated procedures with long execution times may increase latency by delaying transactions from being applied at the replicate. The following guidance is provided to determine whether or not to replicate the procedure or allow the affected tables to replicate as normal.0.Final v2. More will be discussed about this approach in a later section after the discussion about Parallel DSI’s. Queue Procedures . Procedure completes at primary.e. The answer is doubtfully prior to the start of the business day. Consequently. 4. the question that should come up is “Can the Replication Server catch up?”.Procedures that either perform a lot of I/O (selects or updates) that causes it to have a long runtime (more than a few seconds). So. 2.1 ATM withdrawals) may also be executing in parallel.Frequently executed stored procedures with more than 5 DML operations with fast execution times. the follow-up transactions would not even begin until the interest procedure had committed – some 8 hours later. In this particular example. which when the individual rows affected when replicated will exceed any reasonable setting for number of locks. Large Update Procs .). while concurrently executing the procedure. Triggers Executed by Proc .and that it is only sent to the server one time.1 System Functions in Proc .and for the same reason in both cases: query optimization. If you think about it. however.Procedure does not implement proper transaction management (discussed earlier) unless it can be corrected to behave properly. In either case. As we all know. a stored procedure can either be executed as a language call or an RPC call. you should test your transactions to determine which is best for your environment. Subsequent calls simply set the parameter values and execute the procedure. we have heard that stored procedures are faster than language batches. user_name() or other system functions. Obviously. the ct_dynamic() statement is used along with ct_param() and a slightly different form of ct_send(). Improper Transaction Management in Proc . As mentioned. The first is quite easily understood .executed either as language calls or RPC’s Fully prepared SQL Statements/Dynamic SQL . RPC calls can not be batched 206 .setParam(2. 2. As with all guidance. This is not always true for reads .<value[n. a stored procedure containing a simple insert executes anywhere from 2-3x faster for C code and up to 10x faster for JDBC applications than the individual insert/values statement.<value[n.we are referring to the usual database objects created using the “create procedure” T-SQL command. While this can be a problem for reports and other complex queries that have a lot of flexibility in query search arguments. 
stored procedures are optimized at the initial execution and then subsequent executions re-use this preoptimized plan. if a vendor package.1]>) stmtID. 4. Note.0 is considering an implementation using dynamic SQL). it is offered as a starting point.Final v2. A pseudo-code representation of application logic for this might resemble: stmtID=PrepareSQLStatement(‘insert into table values (?. so this approach is not open to us (although a future enhancement to Replication Server 15. each DML statement sent by RS to the replicate database goes through the same sequence: 1. Most scripts that call stored procedures via isql are using a language call.0. Fully prepared statements or dynamic SQL are used in very high speed systems with a large number of repeating transactions. Procedures & RPC’s vs. Command parsing Query compilation/object resolution Query optimization Query execution It turns out that step 2 and especially step 3 take significantly more time than one would think. since Replication Server is a compiled application. Language (DML) From the very earliest times. however.these create a dynamic procedure on the server which is invoked via the RPC interface Queries using the ASE Statement Cache .DropSQLStatement() Note the ‘language’ portion of the command is parameterized as it is sent to the server . what happens is that the ASE server creates a dynamic procedure that is executed repeatedly via the RPC interface.?)’) while n <= num_rows stmtID.a query contained in the statement cache is compiled as a dynamic procedure. For JDBC. it can significantly help DML operations. 3.2]>) stmtID. While the difference varies by platform and cpu speed. The obvious question is how can this be exploited for a DML-centric process such as Replication Server? The answer is understanding what all constitutes a stored procedure in ASE: • • • Traditional stored procedure database objects . we can’t change the CT-Lib source code to invoke fully prepared statements. while inter-server invocations such as ‘SYB_BACKUP…sp_who’ gets executed using the RPC interface (particularly in that case as it is running against the Sybase Backup Server which doesn’t support a language interface).setParam(1.Procedures that contain calls to getdate(). this involves setting the connection property DYNAMIC_PREPARE=true. you may not have the ability to change the source.execute() end loop stmtID. suser_name().but it is certainly true for write operations . that unlike language statements. which when executed at the replicate by the maintenance user will result in different data values than at the primary.Procedures that contain DML operations that in turn invoke normal database triggers – particularly if the connection’s dsi_keep_triggers is set to ‘off’ – disabling trigger execution (this can be corrected by using “set triggers on/off” within the procedure. while for CT-Lib applications. delete statement and select statements. if not for the server. update. the only means to exploit this today (through RS 15. consequently.0. it demonstrates improper transaction control.0 ESD #1) is to use custom function strings that call stored procedures and create stored procedures for each operation (insert.query #2 select * from authors where au_lname=’Greene’ The problem with this approach was that statements executed identically with only a change in the literal values would still incur the expense of optimization. 
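A sketch of the server and session settings involved; the cache size is illustrative, and the server-wide option name for literal parameterization is an assumption that should be verified against the ASE 15.0.2 documentation for your release:

-- give the statement cache some memory (value in 2K pages)
exec sp_configure "statement cache size", 10000
go
-- server-wide literal parameterization (ASE 15.0.2); option name assumed, verify on your release
exec sp_configure "enable literal autoparam", 1
go
-- or per session, e.g. exported from a login trigger as described earlier
set statement_cache on
set literal_autoparam on
go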
If your system experiences a large number of update or delete operations.0.query #1 select * from authors where au_lname=’Ringer’ -. in reality. The ASE statement cache introduced in ASE 12.2 is a combination of the two. The second method using the ASE statement cache likely will not help either. create procedure proc_name <parameter list> as begin <variable declarations> begin transaction tran_name insert into table1 if @@error>0 rollback transaction 207 . keep in mind that both ASE 15. converting to function strings and procedure calls. For example.2 is supposed to lift the restriction on caching insert statements. • However. Since ASE 12. As a result. If the statement cache is enabled.0. Procedure Transaction Control One of the least understood aspects of stored procedure coding is proper transaction control. If found. the query is optimized and converted into a dynamic procedure much like the preceding paragraph. as each SQL language command is parsed & compiled.5. However. there likely will be restrictions on using custom function strings for obvious reasons. Additionally.0. In ASE 15. early tests suggest that the performance advantage of a dynamic SQL implementation within RS even without command batching is much faster than language statements with command batching.5. the query literals are used much like parameters to a prepared statement and the preoptimized version of the query is executed. this should be considered. a new configuration option ‘statement cache literal parameterization’ was introduced along with the associate session level setting ‘set literal_autoparam [off | on]’. this hash is compared to existing query hashes already in the statement cache. certainly for the maintenance user by altering the rs_usedb function to include the session setting syntax. the improved statement caching will only help applications with significant update or delete operations. Additionally the restrictions on statement cache fairly much limited it to update statements.x does not benefit Replication Server unless there is an extremely high incidence of the same row being updated. RS users may see a considerable gain in throughput. Insert table () values () statements were not cacheable nor were SQL batches such as if/else constructs. a query with a where clause specifying where date_col between <date 1> and <date 2>.1 . It is commonly thought that the following procedure template has proper transaction control and represents good coding style. If no match is found.x and 15. For example. ASE 15. the restriction on insert/values is still in effect. However. the statement MD5 hash value included the command literals. When enabled. However. Techniques to finding/resolving bad queries may not be able to strictly join on the hashkey as the hashkey may be representing multiple different query arguments. While this may result in a performance gain for some applications. the statement cache in ASE 12. consequently. a MD5 hash value is created. You may wish to test using a function string output style of RPC as well as language to determine whether language based procedures with command batching give you performance gains over RPC style or vice versa.2 and a future release of RS that will implement dynamic SQL/fully prepared statements may eliminate the need for this.0 ASE’s.5. the reason it is stated that this may not help is that in early 12. queries using a range of values may get a bad query plan. the constant literals in query texts will be replaced with variables prior to hashing. 
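A sketch of the custom function string approach; all object names are placeholders, and it assumes a user-defined function-string class (here my_derived_class) derived from rs_sqlserver_function_class has already been created and assigned to the connection, as suggested above:

-- replicate-side procedure that performs the insert
create procedure trades_ins
    @trade_id int, @trade_dt datetime, @symbol char(6), @price float, @qty int
as
    insert into trades (trade_id, trade_dt, symbol, price, qty)
    values (@trade_id, @trade_dt, @symbol, @price, @qty)
go

-- RCL: route rs_insert for the table's replication definition through the procedure
create function string trades_repdef.rs_insert
for my_derived_class
output rpc
'execute trades_ins
    @trade_id = ?trade_id!new?,
    @trade_dt = ?trade_dt!new?,
    @symbol = ?symbol!new?,
    @price = ?price!new?,
    @qty = ?qty!new?'
go

Testing the same string with output language (which allows command batching) against output rpc, as suggested here, will show which form works better for a given replicate.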
when it is released. delete).5. it may be more practical to upgrade to those releases when available vs.Final v2.2.however. the following queries would result in two different hashkeys being created: -. there are a few considerations to keep in mind: • Just like stored procedure parameters. Consequently for insert intensive applications. As a result of these restrictions.1. should check to see if in a transaction and rollback the transaction as appropriate during exception processing. As a result. the implicit transaction (started by any atomic statement) was not rolled back. a procedure that attempts to implement transaction management can have undesired behavior during a rollback if itself was called from within a transaction. <variables> return –1 end return 0 end It often surprises people that if the procedure is marked for replication and an error occurs. particularly regarding rollback transaction statements: • • • Rollback transaction without a transaction_name or savepoint_name rolls back a user-defined transaction to the beginning of the outermost transaction. This is crucial as Replication Server always delivers stored procedures within an outer transaction as part of the normal transactional deliver. The underlined sentence sums it up quite simply – unless you use transaction savepoints (explicit use of “save transaction” commands) – you can only rollback the outermost transaction. The second common problem with procedures is the fact that if transaction management is not implemented at all. this leads to the following points: • Stored procedures that are replicated should always be called from within a transaction. Consequently. application developers need to keep the following rules in mind. Consider the following code: Begin tran tran_1 <statements> begin tran tran_2 <statements> begin tran tran_3 <statements> if @@error>0 rollback tran_3 commit tran tran_3 if @@error>0 rollback tran tran_2 commit tran tran_2 if @@error rollback tran tran_1 commit tran tran_1 While nested commits do only commit the innermost transaction. as in: Begin transaction tran_1 Exec proc_name <parameters> Commit transaction The reason the problem occurs is the mistaken belief that if nested “commit transactions” only commit the current nested transaction. Even though an error was raised. The reason is simple.Final v2. you can roll back only the outermost transaction.1 insert into table2 if @@error>0 rollback transaction insert into table3 if @@error>0 rollback transaction commit transaction end The problem arises when the procedure is called from within another transaction. it still gets replicated and fails at the replicate resulting in the DSI thread suspending. Rollback transaction savepoint_name rolls a user-defined transaction back to the matching save transaction savepoint_name. The above bullets are word for word from the Adaptive Server Enterprise Reference Manual.0. then a nested rollback only rolls back to the proper transaction nesting level. Consider the following common code template: create procedure my_proc <parameter list> as begin insert into table_1 if @@error > 0 begin raiserror 30000 <error message>. Though you can nest transactions. any rollback transaction encountered automatically rolls back to the outermost transaction unless a savepoint name is specified (it also points to the fact that only outer transactions and savepoints can have transaction names). simply raising an error and returning an non-zero return code does not represent a failed execution. Consequently. 
Rollback transaction transaction_name rolls back a user-defined transaction to the beginning of the named transaction. 208 . implementing proper transaction control for a stored procedure actually resembles something similar to the following: create procedure my_proc <parameter list> as begin declare @began_tran int if @@trancount=0 begin select @began_tran=1 begin tran my_tran_or_savepoint end else begin select @began_tran=0 save tran my_tran_or_savepoint end <statements> if @@error>0 begin rollback tran my_tran_or_savepoint raiserror 30000 “something bad happened message” return -1 end if @began_tran=1 commit tran return 0 209 . • The first point is illustrated with the following template: create procedure my_proc <parameter list> as begin if @@trancount<1 begin raiserror 30000 “This procedure can only be called from within an explicit transaction” return -1 end insert into table_1 if @@error > 0 begin raiserror 30000 <error message>.1 • Alternatively. stored procedures that are replicated should be implemented as sub-procedures that are called by a parent procedure after local changes have completed successfully AND then the sub-procedure should be called from within a transaction managed by the parent procedure. <variables> rollback transaction return –1 end return 0 end Notice the highlighted sections that are modifications to the previous code. Finally. A sample code fragment would be similar to: create procedure my_proc <parameter list> as begin insert into table_1 if @@error > 0 begin raiserror 30000 <error message>.0. Stored procedures that implement transaction management should ensure a well-behaved model is implemented using appropriate save transaction commands (see below). The second point is probably the best implementation for replicated procedures as it allows minimally logged functions for row determination (exact details how are beyond the scope of this discussion) and ensures the local changes are fully committed before the “call” to the replicated procedure is even attempted.Final v2. <variables> return –1 end begin tran my_tran @retcode=exec replicated_proc <parameters> if @retcode!=0 begin raiserror 30000 “Call to procedure replicated_proc failed” rollback transaction return –1 end else commit tran return 0 end Note that this would rollback an outer transaction as well if called from within a transaction. if succeeded rs_update_lastcommit commit tran -. let’s suppose that the second call to replicated_proc (exec replicated_proc_2) fails and a “normal” transaction management model was implemented as discussed earlier vs. a proper implementation. disconnect to force a rollback -. it begins a transaction. A more useful mechanism as demonstrated is to implement savepoints at strategic locations that can be rolled back as appropriate.rollback tran Now. On retry (after the error was fixed for the replicated proc). simply needs to determine if it has been called from within a transaction or not. both would raise “duplicate key errors”.Final v2. using nested transaction is a fruitless exercise.” from developers who are quick to state that replicating procedures with “select/into. Consider the following SQL batch as if sent from isql: insert insert insert insert insert go statement_1 statement_2 statement_3 statement_4 statement_5 If statement 3 fails with an error. Procedures with “Select/Into” The latter example probably raised a quick “but.. Each procedure. Since only the outermost transactions actually commit the changes. 
the error raised would cause RS to attempt to rollback and retry the entire transaction group individually. in a way. However. note the highlighted sections. While this may seem to be more appropriately discussed in the primary database section earlier. “seemingly spurious duplicate key errors”. Now. A summary table is included first for ease of reference between the scenarios. it simply implements savepoints to rollback the changes it initiated. If it was called within a transaction. but leave a very confused DBA wondering what happened.0. the transaction “wrapping” effect of Replication Server has often caused application developers to change the procedure logic at the primary. The effect would be that the entire transaction batch would get rolled back to where the transaction began. upon reaching inserts #4 & #5. it does make sense to discuss it here. statements 4 & 5 still execute as members of the batch. a rollback does not suspend execution. Fortunately.” is not possible due to “DDL in transaction” errors at the replicate system. and simply resuming the DSI connection and skipping the transaction would have keep the database consistent. in one sense. If not. procedures with select/into execute fine at the primary. The best way to decide what to do with procedures containing “select/into” is by assessing the number of physical changes actually made to the real tables the procedure modifies and the role of the worktable created in tempdb. however.but.. Many developers then are quick to re-write both to eliminate the select/into – not only affecting the performance at the replicate. you need to consider the impact of transaction batching and error handling. Procedures & Grouped Transactions To understand why this can lead to inconsistencies at the replicate – and more to the point. when called. it would still be the responsibility of the parent procedure to rollback the transaction (by checking the return or error code as appropriate). Several scenarios are discussed in the following sections.1 end Again. put this in context of replication transaction grouping – which if issued via isql would resemble the following: begin transaction rs_update_threads 2. since inserts #4 & #5 were executed outside the scope of a transaction. Checking the database would reveal the rows already existing. So. they would not get rolled back by the RS. However.if it didn’t succeed. 210 . but also endangering performance at the primary. <value> insert statement_1 insert statement_2 exec replicated_proc_1 insert statement_3 exec replicated_proc_2 insert statement_4 insert statement_5 rs_get_thread_seq 1 --end of batch -. Very true if procedure replication is only at the basic level – which typically is not the optimal strategy for procedure replication.. it merely undoes changes). however the subsequent inserts (#4 & #5) would succeed (remember. fail at the replicate due to DDL in tran errors. Case in point. 1 Solution replicate tables vs.5 update businesses set tax_rate = tax_rate . the final number of rows affected in replicated tables is actually fairly small. the number of final rows affected will probably be only a couple dozen. For example. minority_owner_ship. Consider the following example: Update all of the tax rates for minority owned business within the tax-free empowerment zone to include the new tax structures. In this case. 
think of the logic in the original procedure at the primary: Step 1 – Identify the boundaries of the area Step 2 – Develop list of businesses within the boundaries Step 3 – Update the businesses tax rates 211 . The pseudo code would look something like: select business_id. it still may be better to replicate the rows vs. the average execution of the procedure is 20 seconds modifying 72 rows. Take the above example again. a certain linear distance from a epicenter) and may require “culling” down the list of prospects using successive temp tables until only the desired rows are left. the first worktable may be a table simply to get a list of businesses and their range to the epicenter – possibly using the zip code to reduce the initial list evaluated. the stored procedure may use a large number of temporary tables to identify which rows to modify or add to the real database in a “list paring” concept.business_id=t2. If it takes 10 seconds to move the 72 rows through Replication Server and another 13 seconds to apply the rows via the DSI. lets take a look at what if this was in a procedure. only this time.business_id Now. (range formula) into #temptable_1 from businesses where zip_code in (12345. However. businesses b where b. The net effect would be a procedure that requires (just for sake of discussion) possibly 20 seconds for execution – 19 of which are the two temp table creations.e. changing the procedure to use logged I/O and permanent worktables as that might slow down the procedure execution to 35 seconds. minority_owner_ship.0.10 from #temptable_2 t2. in many cases.. The second list would be constrained to only those within the desired range that are minority owned.12346) select business_id. Procedures In this case. Since these empowerment zones typically encompass only an area several blocks in size. the logic to identify the rows may be fairly complicated (i. For instance. The decision to replicate the rows or the procedure then becomes on of determining whether the average number of rows modified by the procedure take longer to replicate than the time to execute the procedure at the replicate. In some cases.Final v2. distance into #temptable_2 from #temptable_1 where distance < 1 and minority_owner_ship > 0. The first temporary table creation might take several seconds simply due to the amount of data being processed and the second may also take several seconds due to the table scan that would be required for the filtering of data from the first temp table. However. lets assume that the target area contains thousands of businesses. Replicating that many rows would take too long. it is a classic case of replicating the wrong object. let’s say that when executed. Worktable & Subprocedure Replication However. it is simply too much to replicate the actual rows modified. procedure Work table & subprocedure Applicability • complex (long run time) row identification • small number of real rows modified • complex (long run time) row identification • small number of rows in work table • large number or rows in real tables procedure rewrite without select/into • row identification easy • work tables contain large row counts • large number of rows modified in real table Replicate Affected Tables vs. This is due to the simple fact that if the procedure hits an error and aborts. it must be passed to the subprocedure so that the maintenance user knows which rows to use. the number of rows are fairly small. @street_n varchar(50). @target_demographic.high_streetnum and swt.streetname and b. 
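If the "replicate the affected tables" option is the one chosen, the businesses table from this example simply needs an ordinary replication definition and subscription rather than any procedure marking. The sketch below is illustrative only: the server and database names and the column datatypes are assumptions based on the example code.

create replication definition businesses_repdef
with primary at PDS.pdb
with all tables named 'businesses'
(business_id int, tax_rate decimal(3,3), zip_code int, minority_owner_ship decimal(3,2))
primary key (business_id)
go

create subscription businesses_sub
for businesses_repdef
with replicate at RDS.rdb
without materialization
go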
As a result it would still be replicated and attempted at the replicate. @target_demographic varbinary(255).logic to identify N-S streets in boundary using select/into -. you simply need to replicate the worktable containing the street number ranges and the inner procedure. @new_tax_rate decimal(3. while reducing the number of rows actually replicated to the replicate system. As a result. The inner procedure call and inserts into the worktable are enclosed in a transaction at the primary. streetnum_w. @target_demographic varbinary(255). street_work_table swt where swt. @streetnum_e int. @new_tax_rate commit tran return 0 end create procedure set_tax_rate_sub @proc_id int. the difficult logic to identify the streets between the others is not performed at the replicate. allowing the use of select/into at the primary database for this logic. In this example. Consequently the logic for a stored procedure could be similar to: (Outer procedure – outer boundaries as parameters) Insert list of streets and address range into temp table (Inner procedure) Update business tax rate where address between range and on street.3) as begin update businesses set tax_rate= @new_tax_rate from businesses b. streetnum_n. Step 1 really needs a bit more logic.1 Now think about it.low_streetnum and swt.streetnum between swt. @streetnum_s int. streetnum_e. identifying the boundaries as the outer cross streets does not help you identify whether an address is within the boundary unless employing some form of grid system ala Spatial Query Server (SQS).logic to identify E-W streets in boundary using select/into begin tran insert into street_work_table select @@spid. Note the following considerations: • • Inner procedure performs cleanup on the worktable. streetnum_s. streetname from #EW_streets exec set_tax_rate_sub @@spid. the procedure execution was successful according to the primary ASE.demographics & @target_demographics > 0 delete street_work_table where process_id=@proc_id return 0 end By replicating the worktable (street_work_table) and the inner procedure (set_tax_rate_sub) instead of the outer procedure. This reduces the number of rows replicated as only the inserts into the worktable get replicated from the primary. @@spid is parameter to the inner procedure and column in the worktable. Since the spid at the replicate will be the spid of the maintenance user and not the same as at the primary.3) as begin -. @streetnum_w int. @new_tax_rate decimal(3. streetname from #NS_streets union all select @@spid. The procedure at the primary then might look like: create procedure set_tax_rate @streetnum_n int. @street_w varchar(50). @street_s varchar(50). The reason for this is that in multi-user situations. By • 212 .process_id = @proc_id and b. The real logic would probably be more likely: Step Step Step Step Step 1 2 3 4 5 – – – – – Identify the outer boundaries of the area Identify the streets within the boundaries Identify the address range within each street Develop list of businesses with address between range on each street Update the businesses tax rates Up through step 3. you may need to identify which rows in the worktable are for which user’s transactions.0.Final v2.streetname=b. @street_e varchar(50). This will allow user-defined exits (raiserror.1 enclosing the inserts and proc call in a transaction. An example is if a procedure is given a range of N-Z as parameters.00 50. unfortunately.050. the update via join can use the index even if no other SARG (true in the above case) is possible.000. 
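Following the clustered index suggestion above, a minimal sketch for the street worktable might look like the following; the index name and trailing column order are illustrative, while the leading process_id column keeps each session's rows clustered together so the maintenance user's join stays cheap. Note that the worktable and the inner procedure each also need their own replication definition and subscription so that the replicate copy of street_work_table is populated before the replicated procedure call arrives.

create clustered index street_work_clidx
    on street_work_table (process_id, streetname, low_streetnum)
go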
area code + phone exchange.275.050.125. Then if the real data table (businesses) has an index on streetname. However.00 1. rather than using a range.00 75. in the above example. A crucial performance suggestion for the above is to have the clustered index on the worktable have the spid and one or more of the other main columns as indexed columns. top 10 lists.050. Since most credit cards operate on an average daily balance to calculate the finance charges.000. the whole unit could be rolled back at the primary.00 1.00 1. Specified List Criteria – In certain situations.000. a list of personnel names being replicated from a field office to the headquarters. this usually requires permanent working tables in which the procedure makes logged inserts/updates/deletes.00 1. sometimes it is necessary. the end result is the same – thousands of rows will be changed.00 1. In such a situation – even if the “load” was distributed across every day of the month by using different “closing dates” – tens of thousands to millions of rows would be updated each execution of the procedure. The list of blood donations would be huge. The last point is fairly important.000. it still is a successful procedure execution according to ASE. A classic example is the “mark all blood collections from regions with E-Coli outbreak as potentially hazardous” example often used in replication design examples as good procedure replication candidates. etc.00 150.00 1. manufacturers.00 1. a specified list is necessary to prevent unnecessarily updating data inclusive in the range at the replicate (a consolidated system) but not in the primary. For example.00 1. the rows).Final v2. This is a bit more difficult than simply taking the average and dividing by the number of days.00 1. regions. For example. Consider the following table: Day Begin 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Charge 1. Procedure Rewrite without Select/Into This. subtract any payments (as these always apply to “old” balances first). Situations it is notably applicable for include: Area Bounded Criteria – DML involving area boundaries identified via zip codes.00 1. This could include dates. As well as any other situation in which a fairly small list of criteria exists compared to the rows actually modified.00 213 . is the most frequent fallback for developers suddenly faced with the select/into at replicate problem – and agreeably. While this technique may appear to have limited applicability. and streetname. countries. the first step would be to get the previous month’s balance (hopefully stored in the account table). resulting in a mini-abort in the RS that would purge the rows from the inbound queue. Despite the fact an error is raised and a negative return status returned from the procedure.125.00 1. return –1) to be handled correctly provided that the error handling does a rollback of the transaction.275. etc. stores.00 1.125. account numbers. but the list of collection centers located in those regions is probably very small.00 1. This should only be used when the identifying criteria is the entire set of rows or a range criteria that is huge in itself. consequently replicated to all subscribing databases where the same raiserror would occur resulting in a suspended DSI.000.0. Any procedure that is replicated should be enclosed in a transaction at the primary.050.050.00 Balance 1. the clustered index might include spid.050. A classic case would be calculating the finance charges for a credit card system.00 1.00 1. in actuality.275. 
it probably resolves most of the cases in which a select/into is used at the primary database and not all the rows are modified in the target table (establishing the fact some criteria must exist – replicate the criteria vs. While it is possible to create a list of 13 characters and attempt the above. The method to achieve this is based on multiple (not parallel – multiple) DSI’s which is covered later in this section.0. One of the other advantages to this approach. it would several hours before any other transactions could begin. let’s assume this is done via a series of select/into’s (possible with about 3-4 – an exercise left for the reader). is that statement and transaction batching could both be turned off. the net result would be a slow lockdown of the table.900. as described before.900.400. create procedure proc_w_select @parm1 int as begin declare @numtrans int select @numtrans=@@trancount while @@trancount > 0 commit tran -.900. in order to guarantee that the transactions are delivered in commit order. Additionally.00 1.400.1 Day 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Avg Bal Charge 125. This.00 Balance 1.366. Replicating the procedure is a must as replicating all the row changes at the end of every day (assuming every day is a “closing date” for 1/30th of the accounts). Separate Execution Connection This last example (finance charges on average daily balance) clearly illustrates a problem though in replicating stored procedures.00 1. One way around this is to employ a separate connection strictly for executing this and other business maintenance. At the primary system – assuming no contention at the primary – the finance charges procedure could happily run at the same time as user transactions (assuming the finance charge procedure used a cursor to avoid locking the entire table). is extremely unsatisfactory. One of those considerations is the impact on subsequent transactions that used/modified data modified by the maintenance procedure.400. a series of real worktables would have to be used.00 1. at the replicate.400.67 500.400.400.00 1. this should only be used when other methods have failed and procedure replication is really necessary.00 1. the Replication Server applies the transactions serially. it will run for several hours on a very large row count. In doing so.00 As you can see.updates to table 214 .00 1.400.900.900. could be impractical.00 1. it is fully possible that the update makes it to the replicate first – only to be clobbered by later execution within the maintenance record.00 1.00 1.00 1. 21% for department store cards) for the finance charge.00 1. instead of using select/into’s to generate the average daily balances. no matter what time the procedure runs. However.00 1.00 1. Consequently.e. Obviously.900. once the procedure started running at the replicate.00 1. normal replicated transactions could continue to be applied while the maintenance procedure executed on it’s own. the entire update would be within a transaction – if it didn’t fail due to exhausting the locks. many considerations to implementing this which are covered later. Needless to say. the system needs to first calculate the daily balance for each account and then insert the average daily balance multiplied by some exorbitant interest rate (i.00 1. of course.Final v2. there are many.select into logic begin tran -. the following procedure would work. consequently. For sake of argument. there is no way to simply take the sum of the new charges ($900) and get the final answer. 
Due to timing issues with a separate execution connection. Consequently.400. This would allow the procedure at the replication to contain the select/into provide that system administrators were willing for a manual recovery (similar to system transactions). With both statement and transaction batching off. As a result.900. only three conditions could exist: rs_lastcommit OQID < first rs_commit OQID – In this case. resume connection DS. However. it is critical that the procedure be fully recoverable – possibly even to a point where it could recover from a previous incomplete run. recovery is fairly simple as the empty transaction prior to the DDL has not yet been applied. the OQID is not updated when it finishes. During recovery.1 commit tran while @@trancount < @numtrans begin tran return 0 end This is similar to the mechanism used for system transactions such as DDL or truncate table. recovery is simple as this implies that the DDL was successful since the empty transaction that followed it was successful. by committing and re-beginning the transactions at the procedure boundaries.0. the administrator is telling RS to re-apply the system transaction as it never really was applied. or 2) both were applied. for example. If instead it had run but the second rs_commit had not.DB skip transaction provides similar functionality to leaving of “execute transaction” for system transactions. Since the DDL operation is not within an rs_commit. By specifying execute transaction. Accordingly. If the actual data modifications were made outside a transaction. So. In the case of system transactions.Final v2. Consequently. the RS can simply begin with the transaction prior to the DDL. 215 . Replication Server submits the following: rs_begin rs_commit -. then when a failure occurs during the execution. Reason is that one of two possible situations exists. rs_lastcommit OQID = first rs_commit OQID – Here all bets are off. As a result. then simply leaving it off the resume connection is sufficient. reapplying the procedure after recovery would result in duplicate data. you are not sure if the proc finished if the OQID is equal to the OQID prior to the proc execution. rs_lastcommit OQID >= second rs_commit OQID – Similar to the above. Either 1) the empty transaction succeeded but the DDL was not applied (replicate ASE crashed in middle).DDL operation rs_begin rs_commit The way this works is that the rs_commit statements update the OQID in the target database. If it was successful. Rep Server can begin with the transaction following the one for which the OQID was recorded. the finance charge procedure would only develop the list of average monthly balances from accounts that did not already have a finance charge for that month. Hence the added “execute transaction” option to resume connection command. Consequently the administrator has to check the replicate database and make a conscious decision whether or not to apply the system transaction. . 1 Replication Routes To Route or Not to Route. and should not be viewed as limitation. Understanding routing architectures requires an understanding of the basic route types and then the different topologies and the types of problems they were designed to solve. The below diagram illustrates two one-directional routes between the primary and replicate servers: LOG PRS RSSD LOG RRS RSSD RSM PRS RSSD DS PDB LOG RA PDS PRS RRS RSSD DS RRS RDS LOG RDB Figure 44 .Final v2. however. That’s a topic for later. 
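In RCL terms, the only difference between creating the direct and indirect routes described here is whether an intermediate site is named. A hedged sketch using the PRS/IRS/RRS names from the diagrams follows; the login names are illustrative and the exact clause syntax should be confirmed in the Replication Server Reference Manual for your version.

-- Direct route from the primary Replication Server to the replicate Replication Server
create route to RRS
set username RRS_rsi_user
set password RRS_rsi_passwd
go

-- Alternatively, an indirect route: messages destined for RRS are handed to the intermediate site IRS
create route to RRS
set next site IRS
go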
An intermediate route was illustrated at the beginning of this paper with the following diagram: 217 . Routing Architectures Replication routing architectures is not a topic for the uninitiated as it has significant similarities with messaging/EAI technologies. This has more to do with how routes work from an internals perspective. In fact.Two One-Direction Direct Routes between Primary & Replicate Indirect Routes An indirect route infers that the Primary Replication Server (PRS) and Replicate Replication Server (RRS) are separated by one or more Intermediate Replication Servers. The goal of this section is to provide the reader with a fundamental understanding of this feature. it would have posed a problem with indirect routes and the flexibility of having different intermediate sites between two endpoints. Direct Routes A direct route implies that the Primary Replication Server (PRS) and Replicate Replication Server (RRS) have a direct logical path between them (logically adjacent). That is the Question… One of the key differences between Sybase’s Replication Server and competing products is the routing capabilities. how it works. Routing was developed for Sybase Replication Server from the onset to support long-haul network environments while providing performance advantages in that environment over non-routed solutions. it is the only replication product on the market that supports intermediate routes. considerations and performance aspects. it is common to have two connections since routes are unidirectional in Sybase Replication Server. Sybase very easily could have used a single command to construct a bi-directional route.0. In fact. Route Types Anyone who has been around Sybase RS for more than a few months knows that there are two different types of routes that Rep Server provides: Direct and Indirect. however. bidirectional replication.one local and one remote. Since WS only uses inbound queue. The reason is that with the longer WAN’s.0. certain things are understood (i.Final v2. 218 . not only is the bandwidth lower. At first glance. Consequently sometimes it may be advisable to set up a “replicated copy” in which all the tables are published and subscribed to using standard replication definitions and subscriptions and use two replication servers . Remote Standby In a typical Warm Standby system. the connectivity between the RepAgent and the RS become a bit of a problem. Classic implementations include Remote Standby and Shared Primary (Peer-to-Peer). but many of the topologies (as we will see) fairly much require them.e. Each of the base topologies are discussed in the following sections.e. some may question the reason for even using intermediate routes. Point-to-Point A point-to-point topology is characterized by every RS having a direct connection to every other RS. etc. In some environments where the standby system is extremely remote (i. it doesn’t take long before the term topology starts being discussed.e.An Example of an Intermediate Route Each of the Replication Servers above first has a direct route to its neighbor and then an indirect route to the replicate. however. This restriction is mainly due to the fact that routing is implemented as a connection and hence an outbound connection. but also the line quality and other factors become an issue. 100’s of miles) away.). large implementations may find that they combine different topologies within their data distribution architecture. With each topology. 
a single Replication Server is used.1 LOG PRS RSSD LOG RRS RSSD RSM PRS RSSD DS PDB LOG RA PDS PRS RRS RSSD DS RRS RDS LOG RDB IRS IRS RSSD DS LOG IRS RSSD Figure 45 . it has been restricted to a single RS. Topology is nothing more than a description of the connections between the different sources & targets. a hierarchical topology implies a rollup factor) and certain aspects are also immediately known (i. There are only a limited number of base topologies. Route Topologies Once routing gets implemented. Instead. this is not a problem.The Standby as a Target Puzzler Now.Example of a Remote Standby This has some distinct performance advantages: • • • • • • • Empty begin/commit pairs and other types of non-replicated data gets filtered out immediately at the primary The transaction log continues to drain as normal and is not impacted by WAN outages Other destination systems are not impeded by having transactions first go to remote site as a normal WS would indicate. The issue is that NY users can modify the 219 . Having transactions applied to SF directly could cause database out-of-sync issues. the two would be different connections in the same domain . comes one of those times in this paper where you have to engage your thinking cap… • How’s the switch affected from Chicago’s viewpoint (question marks above)? Remember. they can subscribe at the local node. it is equally true that if the system participates in replication to/from other sites. it can be a real bear to deal with. but in reality. The reason is that some of the nuances of a logical connection are not well known. Consider the following scenario’s: Chicago (Different app) ??? ??? San Francisco (Standby) New York (Primary) Figure 47 . it gets real sticky.1 San Francisco (Standby) New York (Primary) Figure 46 .0. especially with large transactions The first point may appear to be fairly minor. While it is true that if the system is isolated.Final v2.duplicate subscriptions are not the answer. Tends to be more resilient from network issues It also has some very acute disadvantages: Doesn’t support a logical connection internal to RS Doesn’t support automated failover Has increased latency in respect to RS processing. But.Final v2. rs_lastcommit is replicated. Additionally. if we just replicate to NY from Chicago. Simply trying to switch Chicago to SF could result in missing transactions since the currently active segment in the queue is past those transactions and routing does not forward transactions intact (later discussion in internals). • Using NY RS as an intermediate route for SF RS from Chicago (CH NY SF as a RS route) would not be the answer either. but to use a multiple-DSI approach and a different maintenance user (and connection name due to the domain). If performance is the issue. an rs_marker is used to signal when to begin sending all transactions to avoid transactions applied by the other node). A simpler solution to the problem above is to have Chicago not use a route to NY & SF. two of the important aspects of a WS connection is that the transactions sent to the logical pair are routed correctly in the event of a failover and that transactions are applied to the primary which in turn re-replicates them to the standby (‘send warm standby xacts’ effectively encapsulates ‘send maint xacts to replicate’. we are digressing deep into a topic that deserves its own discussion. Regardless. however.either on different sets of tables. 
Now it is important to note that request functions have been used by some customers to modify another sites data by sending the change request to that site . the Chicago replicated updates get to SF first. 220 . but infers another concept called “application partitioning”. columnwise within tables or row-wise within tables. In a shared primary implementation. due to latency and timing.1 source data. Result is that Chicago’s transactions would appear to have been lost. which will not drain since NY ASE is dead. Shared Primary (Peer-to-Peer) The other classic implementation for point-to-point topologies is a shared primary or peer-to-peer implementation. The driver for this sort of implementation should be network resilience. they see the last transaction that made it to the pair (hence the ‘strict’ save interval as well). While this is a different topic altogether (Warm Standby Replication). application partitioning is done implicitly at each site by restricting the users from modifying other sites data. However. Again. consider the problem posed at the end of the last bullet. what happens when NY fails? Some of the Chicago transactions will be stranded in the transaction log while others will be in the queue . consequently once replicate systems reconnect to the logical pair. So.Typical Shared Primary/Peer-to-Peer Implementation This technique is often referred to as “data ownership” from a replication standpoint.the outbound queue in NY RS. be absolutely 200% positive that it is the best approach. • By now you are beginning to see the real purpose behind the logical connection for a WS. In a Peer-to-Peer implementation a distinct model of data ownership is defined . Potentially others are still stranded in the Chicago RS outbound queue for the route. then the replicated NY changes. later updated by Chicago replicated transactions.or by having the change request implement an ownership change.0. the point of this entire discussion is that while it may be tempting to set up replicated standby’s for more local systems. The Chicago transaction still has a distinct probability of getting to SF first if the transactions are executed close together. This type of implementation is often illustrated as: NY CH SF Chicago San Francisco NY CH SF New York NY CH SF Figure 48 . it probably is solvable via other means than this implementation as it is doubtful that this implementation really will improve performance over a properly tuned WS implementation. Final v2. aggregates) to prevent blocking on contentious columns (think balance for a bank account . is that each node is looking at a point in time (historical) copy of the data from other nodes. which slows a cross node query (hence their own benchmarks do not allow users to read a block they didn’t write).0. The downside. But what if no larger machines are available? Additionally. Cross node writes can be handled as request functions or via function strings (i. Queries involving remote data execute substantially quicker as the data is local. Microsoft) and a MPP shared-nothing approach (IBM. A typical implementation looked like: Transaction routers A-G H-Q R-Z A-G H-Q R-Z A-G H-Q R-Z Figure 49 . For ASE 15.now consider cross account transfers).MPP via Load Balancing with RS This implementation is more or less a cross between a MPP share-disk approach (Oracle. the reads (selects) grossly outnumber the writes (DML) and consequently is the driving force when a machine is at capacity. 
RAC/MS (and each node of a share-nothing) has a single copy of the database . 221 . If it can not support the full write load under any query load.1 MP Implementations Another successful implementation of the shared primary implementation that really drives home this point is when the system is divided for load balancing purposes.the blocks are copies from when the transaction began. Shared-nothing approaches essentially union data. IBM and Sybase (old Navigation Server/Sybase MPP) split the data among different nodes and used result set merging. which may not be current.e. unioning the results).0. Some customers started using RS from the earliest days to maintain a loose model of a massively parallel system by using peer-to-peer replication. although it would be easy to implement in an application server as well. a single large machine is a single point of failure and leaves customers exposed. of course. is that the same is true of Oracle RAC . A little known fact. a fine example of a transaction router is OpenSwitch. Shared-disk architectures in particular have problems with this as Distributed Lock Managers are necessary to coordinate and cache coherency resolution is necessary. 2. Sybase). Incidentally. Shared-nothing architectures have severe problems as well as this often reverts to a 2PC. An additional downside is that each node must be able to support the full write load while handling a fraction of the query load. The above implementation has a couple of advantages over RAC/MS (shared-disk) as well as result set merging (shared-nothing). In a typical environment. Sybase is planning on implementing MPP via a federated database using unioned views. a larger machine often is the answer. In some cases an aggregate function across the datasets then becomes an application implementation (i.e. then a shared-primary implementation and pure application partitioning will be necessary in which only data truly needed at the other nodes is replicated.read the manuals) and implements a block ownership and block transfer. 1. it has some advantages over both models. 4. 3. Oracle 9i Real Application Clusters (RAC) enforces application partitioning (forget the marketing hype . As weird as the above may look. Microsoft quite explicitly uses a transaction router to enforce application partitioning.and consequently a single point of failure. Probably a little closer in time than with the above. but still a problem. In such a case. Interestingly enough. The problem of course is that the block transfers are on demand. count(*) or sum(amount) across nodes involves summing the individual results vs. First. of course. however. Chicago. Consequently a “hub & spoke” implementation could be used with a common “arbitrator/re-director” in the middle.0. the number of connections gets to be entertaining. Circular Rings A circular ring is a topology in which each Replication Server has direct routes only to those “adjacent” to it. This is largely due to the fact that most communications flow sequentially about the ring. remember that the number of connections from each site is one less than the totals sites. Figure 50 . as the numbers grow beyond 5. but these represent the “main” support centers for English speaking customers. As you can tell. However. etc.). Dublin. Massachusetts. In fact. Hong Kong. and Maidenhead. Every replicated row goes to the same outbound queue where it is passed to another replication server (“hub”) which determines the distination(s). A true statement .so for M sites. California. 
For 3 sites. it would be twice that number due to the unidirectional nature of routes . Illinois.Hub & Spoke Implementation Note that the site in the center “lacks” a database. Sydney.5 would require 40. An astute observer may be quick to point out that logically you still need to create the individual routes as if it were point-to-point with the only difference in the above being that the “hub” is specified as the intermediate node. It is fine as long as the number of sites is in the 3-4 range and possibly could be extended to 5. The reason for this is that it’s sole purpose is to facilitate the connections. England. Australia. Such systems typically use globally dispersed corporate centers to avoid having 24-hour shifts locally. A classic example was illustrated earlier in “follow-the-sun” technical support systems. Additional support staff are distributed to other locations as well (Brazil. this can be represented by: 222 .Final v2. the total number of connection is M*(M-1)*2. Globally. Sybase has support centers in Concord. a total of 12 would be needed…. Netherlands. For example. it does not take into consideration the processing and possibly the disk space that is saved at each site.1 Hub & Spoke Hub & Spoke implementations are common implementations where the point-to-point implementations are no longer practical due to scalability and management. typically in a single direction. Consider the common point-to-point scenario described in the last section. China. it will cause it to replicate and consequently the next site will have the info. there are ~35 sites. The amount of processing saved is enormous.1 Figure 51 . We will use a mythical chain of Syb-Mart. they must perform direct routes from/to every site. A hierarchical topology is very similar to an index tree for databases in that there is a root node and several “levels” until the bottom is reached. Some of these items bear the Syb-Mart label while others are national brands. the biggest benefit from this then becomes Replication Server performance as efficiencies are realized by implementing a system as such.Possible Geographic Distribution Topology for a Global Corporation This is where IBM. Consider the following topology: Figure 52 . It is different in an aspect that the intermediate levels also represent functional nodes. it is sent to the next site as a precaution. Because of their lack of indirect routes. furniture. it is very similar to probably one of the most common routing implementations (along with remote standby) . a ownership change for that case is effected. As a result. Oracle and Microsoft lose it. Hierarchical Trees The above topology is considered a basic one even though it combines elements of others. etc. While Sybase’s implementation is different. automotive goods. In the above illustration. The primary reason for this topology used to be the limited bandwidth between the continents. One of the clearest examples of a hierarchical implementation can be witnessed in a large retail department store chain. As that has largely been resolved in recent years. you could picture as a support case is opened. Each store reports its 223 . If a handoff is necessary. Geographic Distribution This logically leads to the next and one of the more common topologies . sites that need to communicate to other local sites have direct routes to those sites. yet the most that any one site has a direct route to is 5. In it.“Geographic Distribution”.Final v2. 
As soon as the support person at the next site makes any modification.Sybase’s Follow-The-Sun TS Implementation Just by looking at it. you can discern the “ring” between the centers.0. Each Syb-Mart store sells the usual clothing. A change to a lookup table is easily distributed to all of the sites. Looking at in a slightly different view and you get the illustration of cascading nodes. A system that does not have indirect routing would have to create 35x34 or 1190 connections in order to support replication to/from every site.hierarchical. tools. Some HR information. 224 . On of the difficult concepts to grasp is that each of the tiers need not be simply a “roll-up” of all the information below.) could be replicated down the tiers. price increases. etc. As soon as a re-organization is announced. while the higher level tiers would only retain daily/monthly/yearly aggregates. It is often viewed that each of the tiers are consolidations of each of the tiers below. In fact. and perhaps on-site inventories may be present in all the tiers. etc. etc. they need to review what the original and final physical topologies would resemble and then determine the actions necessary to carry it out. prices. product SKU’s. store. the personnel would still be under the same “area” or “national” aspect. simply dropping the subscriptions to the previous regional center and adding them to the new regional center (without materialization of course) may be all that is necessary from the stores perspective. the field sites may have record of each individual transactions (business “events”).Final v2. would not change. each of the field sites would be similar and somewhat independent of the regional site. in the above illustration. employees would no longer be (possibly) accountable to the original region and it more than likely would be a security risk to have employee data still resident in a system to which no one has need to know that information anymore. but rather the subscription de-materialization/rematerialization and supporting data elements. each record may be present in detail for payroll purposes. which is trivial to modify. employee id. For example. but at the top level. HR information is a little different. etc. hirings. There may be minor additional rows needed at the regional center to handle the new field site (or some removed). firings. perhaps with the addition of some aggregate values. This last example is one that is sometimes missed . This hierarchy can be illustrated as follows: Corporate National Area Regional Field Figure 53 . such as individual employee timesheets might also only record aggregates at each level.) would move up the tiers (perhaps using function strings to only apply aggregates at each higher level). The stores current database status regarding past sales. of course. It is true that many of the “business objects” . along with individual employee records (such as name. The biggest problem with hierarchical tiers is a re-organization in which field sites migrate from on regional center to another. but all-in-all fairly simple. The new regional center would need to know the employee data.1 receipts to a regional office. it is arguable. In the case of HR data. but database administrators need to plan for the capability to re-organize quickly. However. The problem is not the routing.Syb-Mart Mythical Hierarchical Topology Both sales and HR information (such as timesheet data. This is kind of an interesting paradox in that at some point in the tiers. 
while intermediate locations only receive aggregates.detail records “going” to the top. address. and finally to corporate headquarters. Hierarchical implementations still remain one of the most common. However.).0. either bulk or atomic de-materialization and re-materialization would be required. At whatever levels in between. that all detail records should rollup to the top. while pricing information (sale prices. current inventory. if for no other reason than to feed the corporate data warehouse. promotions. which in turn feeds to an area office. In this case. which in turn feeds to a national headquarters.products. brownouts in San Francisco.usually because a corporate bandwidth strategy is allocated from corporate to main regional centers (more than likely larger metropolitan areas with the infrastructure in place). Assuming a very wide distribution of stores (one in every friendly neighborhood) consider the following hypothetical map of high-bandwidth networks (maintained by that great monopoly phone system). company systems personnel could easily re-route replication along alternate routes. stores in Charleston SC technically report to the Eastern Regional HQ in Boston. let’s discuss the internals of how it works. the network routers from the phone company would take care of physically routing the traffic most effectively. Major Metropolitan City Syb-Mart Regional HQ High Bandwidth Network Figure 54 . a direct connection would be created between them.similar to the hub-and-spoke earlier. Let’s consider our Syb-Mart hierarchical example above. consequently. and more often resembles the geographic distribution in topology . Routing Internals Now that we understand logically how routing can be put to use. train crashes in tunnels in Baltimore.0. For example.1 Logical Network For large systems. backhoes in Reston. However. while ensuring that data flows along the quickest route possible. 225 . However. VA.some for days. each of the major metropolitan centers could function as “collectors” for all of the stores in their region. In a pure hierarchical model. it typically is a mix of geographic distribution as well as. in moving the data through the system.Hypothetical High-Bandwidth Connections It would make sense to put a Replication Server at each of the metropolitan areas above to implement the “back-bone”. have disrupted communications . routes exploit some neat features. However. it may be best to borrow an analogy from the hardware domain and implement a logical network. in the past years. reducing network traffic for price changes.Final v2. the route is the same as any other destination. A logical network essentially is a “back-bone” of Replication Servers whose sole purpose is to provide efficient routing and ease connection management . etc. RS Implementation Support for routing within the Replication Server is fairly unique. Additionally. By using a “back-bone” with multiple paths. Consider the following diagram. MA. it may be possible to do so. Certainly. From a source system’s perspective. there will be 3 RSI outbound threads and associated SQM’s and outbound queue’s. 7. The outbound SQM for the route writes the data to the outbound queue as normal The Replication Server Interface (RSI) thread reads the data from the outbound queue via the SQM The RSI forwards the rows to the RS via the RSI User thread in the remote RS.0. a single copy of the message with the bitmask is placed into the RS outbound queue. Consequently. A route does not have an inbound queue. 
The DSI-S performs transaction grouping and submits each group to the DSI-Execs as usual The DSI-Exec’s generate the appropriate SQL and apply to the replicate database. we stated that a bitmask was used to reflect the destinations.1 Figure 55 . The SQT thread performs transaction sorting as usual The SQT thread passes the sorted transactions to the DIST thread The DIST thread passes each transaction to the subscribing sites SQM. However. The RSI User thread sends the data to the DIST thread which only needs to call the MD module to read the bitmask of destinations and determine the appropriate outbound queues to use. For local databases. then it sends the data to that database’s SQM thread. 4. 14. Hence only a single copy of the message is necessary for each direct route. If the subscriber is a local database.Final v2. it finds the next RS on the route and sends the data to the SQM for that RS. 11. The DIST send the rows to the SQM of the destination database The SQM writes the data to the outbound queue The DSI-S reads the data from the outbound queue (via SQM) and then sorts the transactions into commit order. this bitmask translates to an outbound queue. if a RS has 3 direct routes to 3 other RS’s. 8. 13. 9. if the subscriber is a remote database. The RSI User thread (a type of EXEC thread similar to the RepAgent User thread) merely serves as a connection point. 10. • Consider the following points about the above: There will be a SQM and RSI thread for each direct route created from any RS.either an outbound queue for a local database. The MD is the only module of a Distributor thread necessary. The Rep Agent sends the LTL stream to the Rep Agent User thread as normal The Rep Agent user thread performs normalization and then passes the information to the SQM for storage as usual The SQM writes the data to the inbound queue. 15. 3.Replication Server Routing Internal Threading The path for routing is as follows: 1. 6. The “inbound” processing (if you would call it that) is to simply determine which queues to place the data in . If you remember. • • 226 . 12. All of the subscription resolution (SRE) and transactional organization (TD) have already been completed at the primary RS. For remote databases. 2. 5. the same as “admin quiesce_check”. The number of seconds between RSI synchronization inquiry messages. Packet size. In return. For example. 600 = 10 minutes) to reduce connection processing in the replicate Replication Server. the message acknowledgements are sent only on a periodic or as requested basis. In addition. The Replication Server uses these messages to synchronize the RSI outbound queue with destination Replication Servers. This parameter is applicable only to direct routes where the site version at the replicate site is 12. See the Replication Server Administration Guide Volume 2 for details.” The number of minutes that the Replication Server saves messages after they have been successfully passed to the destination Replication Server. it does not make SQT calls and does not base delivery on completed transactions. the RSI interface is non-transactional in nature.144. Enter the logical name of the partition to which the next segment should be allocated when the current partition is full. Values must be greater than 0. The reason this command is used is that similar to the RepAgent RepAgent User thread communications. The range is 1024 to 262. However. When the number of bytes in the buffer will exceed the packet size. 
The default (-1) specifies that Replication Server will not close the connection. the send buffer is flushed to the replicate RS. it is the only mechanism in Replication Server in which orphan transactions can happen – due to a data loss in the outbound queue mainly).0 ESD #1. 227 . you may want to boost this to 8192. RSI Monitor Counters Replication Server 12. but is measured in seconds instead of rows. there are very few adjustments needed to the defaults for routing. This works similar to the Replication Agent’s scan_batch_size configuration setting. the RSI thread batches messages to send to remote RS’s.1 • Unlike the DSI interface. for communications with other Replication Servers. Instead.144 Recommended: 4MB if on RS 12. This normally should not be adjusted downwards unless in a fairly unstable network environment and want the RSI outbound queue to be kept trimmed. The range is 1024 to 8192. The RSI uses an 8K send buffer to hold pending messages to be sent.6 ESD #7 and RS 15.e. RSI Configuration Parameters The following configuration parameters are available for tuning replication routing. in bytes.1 or earlier.6 ESD #7 or RS 15. in really only applies to RSI connections as DSI threads are in a perpetual state of attempting to quiesce. Values are “skip” and “shutdown. “admin quiesce_force_rsi” forces all of the RSI threads to send any outstanding messages and then prompt for a acknowledgements. In low volume routing configurations this may be set higher (i.Final v2. In RS 12. rsi_packet_size Default: 2048 Recommended: 8192 rsi_sync_interval Default: 60 rsi_xact_with_large_msg Default: shutdown save_interval Default: 0 minutes As you can see.6 extended the basic counters from 12.0. Specifies route behavior if a large message is encountered. This is analogous to the scan_batch_size parameter of a Replication Agent. this was increased to a max of 128MB The number of seconds of idle time before Replication Server closes a connection with a destination Replication Server. A common misconception is that the “admin quiesce_force_rsi” is used to quiesce all RS connections .1 to the following counters to monitor RSI activity. rsi_fadeout_time Default: -1 Description Specifies an allocation hint for assigning the next partition. Parameter (Default) disk_affinity Default: off rsi_batch_size Default: 262.DSI and RSI. In high-speed networks. it operates much on the same principals of a Replication Agent – it simply passes the row modifications as individual messages to the replicate Replication Servers and tracks recovery on a message id basis (and consequently. The “admin quiesce_force_rsi” checks to see if the RS is quiescent.0 ESD #1. where-as “admin quiesce_check” merely checks to see if RSI acknowledgements have been received. The number of bytes sent to another Replication Server before a truncation point is requested. spent in sending packets of data to the RRS. an idea of different performance metrics could be established. Total packets sent by an RSI sender thread. Total RSI get truncation messages sent by a RSI thread. One thing to note is that the RSI does not have an SQT library function . For example. by comparing with other threads.0. in 100ths of a second. the SQMR counters for BlocksRead and BlocksReadCached may be helpful in determining why a route may be lagging. Number of times that a RSI thread has been faded out due to inactivity. These messages contain the distribute command. Total RSI messages sent by an RSI thread. 
This count is affected by the rsi_batch_size and rsi_sync_interval configuration parameters. other than adding the new counter RSIReadSQMTime. This count is influenced by the configuration parameter rsi_fadeout_time.0 and the route seems slow. in 100ths of a second. if comparing PacketsRead and BytesSent. Replication Server 15. spent in sending packets of data to the RRS.Final v2. in 100ths of a second. This count is affected by the rsi_batch_size and rsi_sync_interval configuration parameters. spent in sending packets of data to the RRS. these messages contain the distribute command. SendPTTimeMax Maximum time.messages are simply sent in the order they appear in the outbound queue.e. This count is influenced by the configuration parameter rsi_fadeout_time. The time taken by an RSI thread to read messages from SQM. The problem with this is that the RSI lacks the SQT cache that can help buffer activity when the downstream system is lagging slightly . Total number of blocking (SQM_WAIT_C) reads performed by a RSI thread against SQM thread that manages a RSI queue. Additionally. SQM:CmdsWritten and RSI:MsgsSent). in 100ths of a second. since the RSI includes an SQMR logic. RSI get truncation messages sent by a RSI thread.0 changed these slightly to: Counter BytesSent PacketsSent MsgsSent MsgsGetTrunc FadeOuts BlockReads SendPTTime RSIReadSQMTime Explanation Bytes delivered by an RSI sender thread. If using RS 15. RSI messages sent by an RSI thread. SendPTTimeAvg Average time. As a consequence.1 Counter BytesSent PacketsSent MsgsSent MsgsGetTrunc FadeOuts BlockReads SendPTTimeLast Explanation Total bytes delivered by an RSI sender thread. Time. the ability of the RSI to keep up can be determined (i. spent in sending the packet of data to the RRS. the last two can be of use to determine if it is the network (or downstream RRS) or the outbound queue reading speed that is the largest source of time. Again. Number of times that a RSI thread has been faded out due to inactivity. the only other change is inline with the others in than the SendPTTimeLast/Max/Avg is collapsed into a single counter SendPTTime. an idea of the usefulness of changing the rsi_packet_size parameter can be determined. Time. Number of blocking (SQM_WAIT_C) reads performed by a RSI thread against SQM thread that manages a RSI queue.which may translate into more blocks being read physically than desired. 228 . by looking at some of these in comparison with each other. Essentially. Packets sent by an RSI sender thread. there have been a number of incidents that have illustrated how easy it is to disrupt wide-area networks. little cpu is available for the DSI connection to generate and apply the SQL. a route still may be optimal to ensure network resilience. a train crash and resulting fire in a tunnel in Baltimore. Additionally. the configuration settings are fairly optimal for most environments. SQL Delivery In some cases. should a physical network outage occur. however. this has some tremendous puzzlers to solve the minute the standby pair is a target of replication from another system. This is very difficult to substantiate as some of the highest throughputs measured with Replication Server at customer sites has all been with traditional Warm-Standby configurations. a sudden drop in 229 . the inbound processing could fully utilize a cpu. Consequently. In recent years. a single RS could easily be swamped trying to maintain a large number of high volume connections. 
Maryland USA disrupted network communications for MCI for several days.5/SMP. it may make sense to offload the DSI processing to another cpu via replication routing. although some recommendations as above are appropriate. For most cases. while another cpu (perhaps on the same box) would concentrate on SQL delivery. Consequently. If the route is over a very low bandwidth network or is sharing the bandwidth with extremely high bandwidth applications such as video teleconferencing. In this case. it might be said that the most appropriate place for a “SQL Delivery” based performance improvement using routing is when the system is a normal replicate database and not a standby.x. This capability is directly attributable to the concept of indirect routes. it made sense to split the replication processing in half by using a route. Not too many years ago. this advantage is totally eliminated for local nodes. An intermediate node in the route really experiences minimal loading outside of the outbound queue for the outgoing route. For remote nodes. in these cases. While obvious for remote nodes. It is important to note that routes may not out-perform in “all” circumstances .in fact a common fallacy is that a route will outperform a normal Warm Standby setup even if the sites are located fairly close. the DSI connection ends up getting the same amount of cpu time as the other threads. In such implementations. However.5/SMP. By using an indirect route.and those that routed services through it equally disadvantaged.a primary and a replicate. In some cases. Distributed Processing One of the more common implementations in routing environments is using multiple RS’s to distribute the processing load when a single RS needs to communicate with a large number of connections. prior to RS 12. in some extremely high volume situations. However. Consequently. the World Trade Center disaster on 9/11 left many business in Manhattan electronically stranded . since the RS threads are executed at the same priority. nearly all of the cpu is consumed with processing the inbound stream. you can expect very low performance from the route. this was even implemented between only two nodes . Out of the box. customers were implementing multiple replication servers using routing as a way of getting multi-processor performance. Prior to RS 12. With RS 12. when not using the SMP version of RS. often the symptom is a fully caught up outbound queue.Final v2. you still shouldn’t have an intermediate node attempting to service dozens of direct routes when a more conservative approach would be much more efficient. replication system administrators can simply re-direct the route over an alternate direct route.5/SMP. Network Resilience One of the biggest advantages to replication routes is its ability to provide network resilience. the amount of resources necessary on a single machine would be tremendous. generally a single RS was implemented at each “source” with multiple Replication Server’s serving the destinations as necessary. As a result. Under these circumstances. Similarly. one cpu could concentrate on the inbound connection. As noted earlier. This is particularly true in the case of corporate rollup scenarios in which the DSI’s SQT library may be exercised more fully since transactions from different sources may be intermingled.1 Routing Performance Advantages In certain circumstances. This is frequently the excuse why some set up their standby systems as remote standby’s even when close together. Consequently. 
a routed connection will perform better than a non-routed connection.0. even from the earliest days of version 10. the amount of cpu “gained” over a normal “WS” must exceed the cost of additional cpu used for the DIST thread (typically suspended in WS only configurations) as well as the extra I/O cost to write to the outbound queue. Additionally. it would not appear to be as necessary when both nodes are local. but a lagging inbound queue (due to DIST thread having to wait for access to the outbound queue SQM) or a lagging RepAgent. However. route performance becomes more of a network tuning exercise. While a single Replication Server can handle dozens of connections. Some of these are described below. Routing Performance Tuning There really is not much to tune for a route. there is only one RSI for each route. all 12 databases use the same route. there are 4 DIST threads . as soon as we created 4 routes to London. there are 4 routes . In the bottom example.A More Optimal Multi-DB Routing Implementation Why is this more optimal? In the first example. DNS errors.1 routing throughput will be due to an unexpected network issue such as an outage. This may be fine for low volume systems. Consequently. the outbound queue for the RSI connection is likely going to be a source of contention and may become an IO bottleneck as well. This can lead to IO saturation in some instances. however. or other network related problems. There is one aspect to consider.Final v2. While this may be true simply from a loading perspective. but for high volume systems. the route will have a unique DIST thread at the RRS that will be writing directly into the outbound queue for the destination connection. 230 . it may not help routing performance considering the direction London New York.one for each route .0. if multiple databases are involved .and the load is split between the 4 routes using 4 outbound queues (one for each route) and 4 RSI’s send the messages.A Common Multi-DB Routing Implementation Figure 57 .in the NY_RS to handle the traffic in reverse. reducing the chances for an IO bottleneck on a single device. Additionally each of the routes could have disk affinity enabled. This means that 12 DIST threads in one RS are all trying to write to the same outbound queue and a single RSI is trying to send the messages for 12 connections. Remember. Consider the differences between the following two scenarios: Figure 56 . It might be tempting to thing then that New York should have 4 RS’s as well. Final v2. In fact.Feels Much Better 231 . better. though.1 As mentioned. considering workload distribution and using multiple RS’s.Not Much Better . best architectures for a large multidatabase source system: Figure 58 . better-yet.Ahhh….0. the New York RS may be overloaded with the 12 connections.Not a Good Plan Figure 59 . the following depict the bad. All Too Common Figure 60 .Bad .But Unfortunately. 0.1 Figure 61 .The Best Yet!!! The rationale for the above stems from multiple factors: • Currently with RS 15. the division of routes allows load balancing of IO processing for the route messages.Final v2. RS can best deal with about 2 high volume connections and a total of 10 connections before latency is impacted due to task switching. While more connections may be doable in low volume situations.0. • 232 . this is optimal As mentioned above. a solid foundation in Replication Server internal processing is necessary. However. 
organizations with 24x7 processing requirements or those with OLTP during the day and batch loading at night quickly realized that this “flattening” required a lull time of little or no activity during which replication would catch up. Parallel DSI’s were introduced to improve the replication system delivery rates. The obvious solution was to somehow introduce concurrency into the replication delivery. Replication Server was limited to a single process.1 Parallel DSI Performance I turned on Parallel DSI’s and didn’t get much improvement – what happened? The answer is that if using the default settings. 200 tpm max 100 tpm each = 500 tpm total High sleep time 1 cpu busy RS queue growing steadily Outbound queue steady OLTP 1 OLTP 2 OLTP 3 OLTP 4 OLTP 5 High Volume OLTP Balanced work/load in run/sleep queue Figure 62 – Aggregate Primary Transaction Rate vs. In the following sections. The challenge was to do so without breaking the guarantee of transactional consistency. Due to normal information flow. we will discuss the need for parallel DSI. particularly the DSI. the latency would increase dramatically. 4. This goes beyond just understanding the functions of the internal threads – it also means understanding how the various tuning parameters as well as types of transactions affect replication behavior. the organizations did not have this time to provide. if the aggregate processing at the primary exceeded the processing capability of a single process. tuning parameters. Single DSI Delivery Rate It should be noted that in the above figure. special transaction processing and considerations for replicate database tuning Need for Parallel DSI There are five main bottlenecks in the Replication Server: 1. The reason was very simple. Consider the following diagram. it does illustrate the point how a single threaded delivery process can quickly become saturated.0. but rather the “sleep” time waiting for the I/O to complete.Final v2. Early responses to this issue “talked” around it by attributing this to Replication Server’s ability to “flatten” out peak processing to a more “manageable” steady-state transaction rate.x versions of Replication Server. The result was that in version 11. Consequently. internal threads. the numbers are fictitious. 233 . Much of this time was actually not spent on processing as most replication systems were typically handling simple insert/update/delete statements. 5. not a whole lot of parallelism is experienced. Replication Agent transaction scan/delivery rate Inbound SQT transaction sorting Distributor thread subscription resolution DSI transaction delivery rate Stable Queue/Device I/O rate In early 10.0. While this may be appealing to some. at the replicate database. On the other hand. At the primary database. serialization methods. performance was achieved by concurrent processes running on multiple engines using a task efficient threading model. it was noticed that the largest bottleneck in high volume systems was #4 – DSI transaction delivery rate. In order to understand parallel DSI’s. 2. 3. very little is different for Parallel DSI’s. as well as tuning Parallel DSI’s. index rs_threads_idx on rs_threads(id) 234 . int. create table rs_threads ( id seq pad1 pad2 pad3 pad4 ) go create unique clustered go int. after a certain number of threads. we discussed the internal processing of the Replication Server. Up to 255 Parallel DSI threads can be configured per connection. rs_threads processing As mentioned earlier (and repeatedly). 
considerable skill and knowledge is necessary to understand how these little differences are best used to bring about peak throughput from Replication Server. Replication Server 12.. later sections will focus on the serialization methods as they are key to throughput.5 and earlier implemented a synchronization point at the end of every transaction by way of the rs_threads table.padding for rowsize. adding more will not increase throughput.Final v2. Parallel DSI Internals Earlier in one of the first sections of this paper.0. From this aspect.and DSI 2 might get ahead.. char(255). -. etc. this would seem an impossible task where Parallel DSI’s are employed – a long running procedure on DSI 1 .1 Key Concept #25 – Replication/DSI throughput is directly proportionate to the degree of concurrency within the parallel DSI threads. char(255). However. While this section discusses the internals and configuration/tuning parameters. however. apply the transactions and perform error recovery. char(255). char(255).one up used for detecting rollbacks -.thread id -. At first glance. To prevent this. it is the responsibility of the DSI Executor threads to perform the function string translation. the Replication Server guarantees transactions are applied in the same order at the replicate as at the primary. Parallel DSI Threads The earlier diagram discussing basic Replication Server internal processing included in the illustration Parallel DSI’s (step 11 in the below) RSSD STS Memory Pool Outbound 12 DSI-Exec 11 DSI SQT DSI-Exec DSI-Exec Replicate DB 10 Stable Device SQM Primary DB SRE TD MD 7 6 9 8 Outbound (0) Inbound (1) dAIO Distributor SQT 5 1 RepAgent Rep Agent User Outbound (0) Inbound (1) 2 SQM 4 3 Inbound Figure 63 – Replication Server Internals with Parallel DSI’s While the DSI thread is still responsible for transaction grouping. alternative implementation used on servers with >2KB page size -. pad2 char(1). the first statement a DSI issues immediately following the begin transaction for the group is similar to the following: create procedure rs_update_threads @rs_id int. consider an example in which 5 Parallel DSI threads are used. each subsequent thread is blocked and waiting on the previous thread’s update on rs_threads. seq int. but only used when isolation_level_3 is the serialization method.Final v2. Similar to above. the current thread possibly also can commit. consequently.sql script create table rs_threads ( id int.1 -. Following this. when Parallel DSI’s are in use. this could be illustrated as in the below diagram. Used by a thread to block its row in the rs_threads table to ensure commit order and also to set the sequence number for rollback detection. pad3 char(1). During the initial connection processing during recovery. then the thread is blocked (due to lock contention) by the update lock on the row by the previous thread. pad1 char(1).0. Function rs_initialize_threads rs_update_threads Explanation Used during initial connection to setup rs_threads table.e. then the lock is not held. 235 . Used by a thread to determine when to commit by selecting the previous thread’s row in rs_threads. If the previous thread has not yet committed. and then inserts blank rows for each DSI initializing seq value to 0. Note that in each case. @rs_seq int as update rs_threads set seq = @rs_seq where id = @rs_id go Each DSI simply calls the procedure with its thread id (i.0) an alternative implementation called "DSI Commit Control" is also available and is discussed in the next section. 
During processing. Replication Server will first issue the rs_initialize_threads function immediately after the rs_usedb. ) lock datarows go create unique clustered index rs_threads_idx on rs_threads(id) go While still in later versions of RS (i. Issued shortly after rs_usedb in the sequence. 12. Ignoring the effects of serialization method on transaction timing.6 and 15.contained in rs_install_rll. normal transaction statements within the transaction group are sent as normal. truncate table due to heterogeneous support). Since this update is within the transaction group.e. The rs_threads table is manipulated using the following functions used only when Parallel DSI is implemented. the DSI then attempts to select the previous thread’s row from the rs_threads table using the rs_get_thread_seq function. rs_get_thread_seq rs_get_thread_seq_noholdlock To understand how this works. pad4 char(1). 1-5 in our example) and the seq value plus one from the last transaction group (the initial call uses a value of 1). it has the effect of blocking the thread’s row during the transaction group’s duration. This procedure simply performs a delete of all rows (logged delete vs. If the previous thread has committed. After all the transaction statements have been executed. After each thread commits. 236 . However. Note this happens in commit order. the transaction groups will proceed in sequence through the threads. then the current thread can commit its transaction. the transactions are still committed in order (1-20 in the above). .Final v2. If it is equal to the previous value. The theory of the above is that transactions can acquire locks and execute in parallel – but due to the rs_threads locking mechanism.1 CT 1 TX 05 TX 04 TX 03 TX 02 TX 01 UT 1 BT 1 Blocked CT 2 GT 1 TX 10 TX 09 TX 08 TX 07 TX 06 UT 2 BT 2 Blocked CT 3 GT 2 TX 15 TX 14 TX 13 TX 12 TX 11 UT 3 BT 3 Blocked CT 4 GT 3 TX 20 TX 19 TX 18 TX 17 TX 16 UT 4 BT 4 . it then requests the next transaction group from the DSI-S. This value is simply compared to the previous value. The first question that comes to mind for many is: “What happens if one of the threads hits an error and rollsback its transaction? Wouldn’t the next thread simply commit?” The answer is no. . T17 T16 T15 T14 T13 T12 T11 T10 T09 T08 T07 T06 T05 T04 T03 T02 T01 T00 BT n CT n TX ## rs_begin for transaction for thread n rs_commit for transaction for thread n Replicated DML transaction ## UT n GT n rs_update_threads n rs_get_thread_seq n Figure 64 – Parallel DSI Thread Sequencing Via rs_threads To anyone who has monitored their system and checked object contention. it returns the seq column for the previous thread. if the seq value is higher than the previous seq value for that thread.0. it is actually deliberate. This is where the seq column comes in and the realization why rs_get_thread_seq has seq in the name. then an error must have occurred and subsequent transactions need to rollback as well. As each rs_get_thread_seq function call is made. consequently in an ideal situation. they probably thought all of the blocking on rs_threads was a problem. As illustrated above. Techniques for finding the true causes of deadlocks/contention are discussed below in the section “Resolving Parallel DSI Contention” DSI Commit Control So. For very short transactions with small or no transaction grouping. If so. it is an indicator that the statement it surfaced the deadlock with has contention with out of sequence execution. the current thread can simply go ahead and commit. 
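To make the sequencing concrete, the rs_get_thread_seq call issued by thread n is essentially just a select of the previous thread's row in rs_threads; because thread n-1 updated (and therefore exclusively locked) that row inside its still-open transaction group, the select blocks until that group commits or rolls back. A rough sketch of the equivalent SQL is below. This is illustrative only; the exact text comes from the rs_get_thread_seq function string for the connection's function class, and the thread id value shown is hypothetical.

-- Approximate effect of rs_get_thread_seq for thread 5 checking thread 4:
-- this select blocks on the exclusive row lock held by thread 4's
-- rs_update_threads update until thread 4's transaction group commits
declare @prev_thread_id int
select @prev_thread_id = 4
select seq
from rs_threads
where id = @prev_thread_id
go

Once the select returns, the returned seq value is compared to the previously seen value for that thread to decide whether to commit or roll back, exactly as described above.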
consequently does not lend itself to heterogeneous situations. Blocking on rs_threads is NOT an issue – it is deliberate and precisely used to control the commit order. Deadlocks raised involving rs_threads does not infer that rs_threads is an issue. 3. then why was DSI Commit Control implemented.Final v2. The implementation is as follows: 1. The logic for rs_threads is heavily dependent on the ASE locking scheme. If there is intra-thread contention. Each thread submits its batch of SQL as usual After the batch has completed execution. 3. then. monLocks especially if the replicate database is also used by end-users for reporting or if maintenance activities are being performed. 2.0. 2. it is handled by causing a deadlock. ASE chooses the deadlock victim according it's own algorithm which favors longer running tasks – which in this case probably is the task that should have waited – consequently. 2. rs_threads is NEVER the issue!!! To find out the real cause of concern. The rationale stems from several reasons: 1. 4. To put it simply. 237 . often the wrong task is rolled back as the deadlock victim. only the offending thread and subsequent threads need to be rolled back. if contention does occur under DSI Commit Control. Threads will block until their turn to commit. This adds additional work to the re-submittal of the SQL batches involved. Since RS knows the sequence of commit.1 rs_begin rs_update_threads n (replicated transactions) rs_get_thread_seq n-1 (Blocked) Rollback transaction No seq > previous Yes suspend connection commit transaction Figure 65 – rs_get_thread_seq and seq value comparison logic It should be emphatically stated that: 1. The blocked thread and any other up to the blocking thread can continue. the rs_threads activity adds significantly to the IO processing of replication. the current thread issues rs_dsi_check_thread_lock function to see if thread's SPID is blocking another DSI thread. it checks to see if the previous thread has committed. DSI Commit Control was implemented in RS 12. you can monitor the true contention through monDeadlocks and monOpenObjectActivity as well as watching monProcessWaits. if rs_threads is not the issue.6 as a more internal means of controlling contention detection and resolution between Parallel DSI's. As a result. Instead. If the previous thread has not committed. Final v2. Additionally.Commit Control Logic Flow Note that of course if the thread is blocked. after which the batch is rolled back regardless.the earlier thread is blocked .1 4. 6. it does not get out of the first stage (executing SQL) until the contention is resolved. If rs_dsi_check_thread_lock returns 0. This can best be illustrated by the following flow-chart: Execute SQL Commit Yes Did previous thread commit? No Rollback/Abort >0 Rs_dsi_check_ thread_lock =0 dsi_commit_check _locks_intrvl Yes >dsi_commit_check No _locks_max Figure 66 .hence rs_dsi_check_thread_lock.and the later thread is waiting for it to finish . it sends an acknowledgement to the DSI-S before doing posttransaction clean-up and sending a “thread ready” message. Step 5 is repeated dsi_commit_check_locks_max times. note that if the threads commit quickly. as each thread commits. 238 . If rs_dsi_check_thread_lock returns a non-zero number. Figure 67 – Logical View of DSI & DSIEXEC Intercommunications From the above diagram. you can see how that it would be fairly simple for the DSI-S to withhold the “Commit” message from a subsequent thread until it gets a “Committed” message from the previous thread.0. 
The only issue then is to determine when a later thread is blocking an earlier thread resulting in an application deadlock . the thread rollsback it's transaction. 5. there also is no delay at all. it waits dsi_commit_check_locks_intrvl seconds and then checks again to see if the previous thread has committed and re-issues rs_dsi_check_thread_lock if not. The first question that might be asked is “How would a thread know the previous thread had committed?” Referring back to the earlier diagram. consider what would happen if 5 threads were being used and the first thread had a long running transaction. the default configuration values are likely too high to provide effective throughput as well.sysprocesses virtual table.sysprocesses where blocked = @@spid ' As noted. a slight alteration would achieve the desired affect of only blocking when blocking another maintenance user transaction: alter function string rs_dsi_check_thread_lock for sqlserver_function_class output language ' select count(*) "seq" from master.in a high volume system.... depending on the timing of the rs_dsi_check_thread_lock calls.add to rs_install_primary. it still could be up to 1 second later before thread 2 commits due to waiting dsi_commit_check_locks..install in RS -. Note that thread 3 is waiting on thread 2. The problem is that this method depends on the speed of materializing the master.sysprocesses where blocked = @@spid and suid=suser_id() return 0 end go -. it distinctly focuses in on the exact threads with contention and execution continues as soon as the contention is lifted. threads 2-5 would each execute the rs_dsi_check_thread_lock function and wait for 1 second.1 On the plus side of rs_threads.6 is much less specific – and in fact may lead to excessive false rollbacks just due to contention between the RS and other processes. This definition is: alter function string rs_dsi_check_thread_lock for sqlserver_function_class output language ' select count(*) "seq" from master.sql on NT) create procedure rs_dsi_check_thread_lock as begin select count(*) "seq" from master. On replicate systems used for reporting. several thousand SQL statements could have been executed during this period.added to detect only maintenance user blocks ' As this statement may get executed extremely frequently.0. the max delay at the default settings would be 4 seconds .Final v2. The biggest problem is that the default value for dsi_commit_check_locks_intrvl is set to 1000ms or 1 second. Consequently. Net result is that the maximum delay will be: max_delay=(num_dsi_threads-1) * dsi_commit_check_locks_intrvl So with 5 threads. a better starting value for dsi_commit_check_locks_intrvl is likely 100ms or even less. this would return a non-zero value whenever the DSI thread was blocking any other user . In addition to the modification needed for rs_dsi_check_thread_lock. consequently. As a result.sysprocesses where blocked = @@spid and suid=suser_id() -. As soon as thread 1 commits. The default function string provided for RS 12.for example someone running a report or trying to do table maintenance.function string modification alter function string rs_dsi_check_thread_lock for rs_default_function_class output language ' exec rs_dsi_check_thread_lock ' go The rationale is that this avoids optimizing the above SQL statement every 100 milliseconds or whatever dsi_commit_check_locks_intrvl is set to.procedure modification -. One important note. this could result in 239 . As a result. 
This likely is too long to wait by a full order of magnitude as any contention will result in the thread waiting 1 second as well as causing subsequent threads from committing as well.sql (rsinspri. the recommended approach is to actually use a stored procedure and a modified function string definition that calls it such as: -. thread 3 could be delayed up to 1 second after thread 2 and so forth. To understand the magnitude of the problem. a rollback when none is necessary. If an earlier thread acquires a lock and blocks a later thread.assuming it had received a ‘Batch Ready’ message from thread #2.In this scenario. parallel threads can start based on if the previous thread has reached one of three conditions: Ready to Commit .1 considerable rows that then have to be table scanned for the values (virtual tables such as sysprocesses do not support indexing). Thread #2 checks the commit status of thread #1 and sees that it isn’t ready to commit.which returns a non-zero number since thread #3 is blocked. Thread #2 completes it’s transaction. In perspective of the thread sequencing the thread at the bottom (with no lines to it) could begin executing at the following points: Ready to Commit . it is merely ready to commit. Figure 68 – Logical View of DSI & DSIEXEC Intercommunications Based on the above diagram. it is likely that this could be a deadlock chain . you could see how commit control would work from an internals perspective . this should be expected and not an issue. threads can start at any point as soon as they are ready.0. If you look back at the earlier detailed diagram of the DSI Execution flow. This coordination is done by the DSI-Scheduler. waiting for another thread.such as #2 blocking #3 who is in turn blocking #1. received a successful rs_get_thread_seq function and is ready to send the rs_commit function. in the process. it acquires locks that block thread #3. However. it would send ‘Begin Batch’ message to thread #2 . One might think that this is easily rectified by returning the spid being blocked. Without knowing all the spids for previous threads and traversing the full chain. NOTE: A common misconception is that this implies the previous thread has committed . The key to thread sequencing is to understand that based on the dsi_serialization_method. The result is predictable. However. There is another problem: “false blocking”. thread #2 would have to wait until the ‘Commit Ready’ (step 10) message was received by the DSI-S. Thread Sequencing As mentioned. Now that we understand how they commit in order. This doesn’t change the commit order. it merely allows a thread to start when it is ready vs. the statement above would detect that a blocked user existed.each subsequent thread to be committed would simply not get told to commit (step 11) until the previous thread had successfully committed (step 13). Thread #1 starts processing and is executing a larger than average transaction or one that executes longer than normal due to a replicated procedure or a invoked trigger. each DSIEXEC sends messages back to the DSI-S informing of the current status of it’s processing. 240 .In this scenario.in reality. 2. Consider the following scenario: 1. Started . subsequent threads can start only after the previous thread has already started.In this scenario. so it then issues a rs_dsi_check_thread_lock .In this scenario. there is no way for a thread to know that if the block is a real problem or not. When Ready . When the DSI-S got the ‘Commit Ready’ message from thread #1. 
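As a concrete illustration of the shorter check interval suggested above, the commit control settings are connection-level parameters changed with alter connection while the connection is suspended. A minimal sketch follows; the connection name (RDS.rdb) and the values are illustrative only and should be derived from the formulas discussed above.

suspend connection to RDS.rdb
go
alter connection to RDS.rdb set dsi_commit_check_locks_intrvl to '100'
go
alter connection to RDS.rdb set dsi_commit_check_locks_max to '100'
go
resume connection to RDS.rdb
go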
it might help to understand how the start in order. subsequent threads can start only when the previous thread has submitted all it’s transaction batches successfully.Final v2. the parallel transactions are submitted to each of the threads in order. Net result. The basic premise is this. When Ready . the problem with this theory is that it depends largely on the following factors: Transaction Group Size . Used with parallel DSI.1 Started .Final v2. ASE Execution Scheduling .Essentially. Recommended: 50-100ms 241 . the first batch of SQL statements in each thread logically should follow the last from the previous thread. If insufficient cache or time was spent grouping the transactions. As a result.400. While it is unarguable that it should be ‘on’ for non-parallel DSI and for parallel DSI’s using a wait_for_commit serialization method. the parallel DSI thread needs to get the next batch of transactions from the DSI Scheduler. it is likely that logical and/or physical IO’s will need to be performed. If the first transaction group is allowed to start in its proper order. Problems in any one of these areas could lead to a “bursty” behavior in which blocking or commit sequencing results in apparent thread inactivity.again. how large the transaction group is from a number of statements. When the IO has completed. The goal then is understanding how the configuration parameters . Configuration Parameters There are several configuration parameters that control Parallel DSI’s. resulting in out of order execution. a transaction group may not be available. resulting in an overlap in which the vulnerability of a deadlock is raised. this vulnerability is increased. The larger the transaction groups.along with replicate DBMS tuning can minimize periods of inactivity enabling maximum parallelism for the transaction profile. Maximum: 86. Default: 1000ms (1 second).the likelihood is that subsequent threads will get ahead of the first thread and most likely be ready to commit (waiting on rs_threads or commit control) by the time the first thread completes the long running statement. The purpose for command batch sequencing is to try to control contention by proper execution.In this scenario. However. DSI Transaction Grouping . The number of milliseconds (ms) the DSI executor thread waits between executions of the rs_dsi_check_thread_lock function string. However. it will acquire the locks it needs first.000 ms (24 hours) dsi_commit_check_locks_intrvl Default: 1000ms. it would send ‘Begin Batch’ message to thread #2 . Minimum: 0. delete.After each complete execution. Subsequent command batches are sent until the thread reaches the end and is ready to commit. the thread is woken up and put on the runnable queue for processing. deadlocking.especially the serialization method . it is likely that multiple DSI threads will be waiting for IO concurrently. the SPID for the DSI thread is put to sleep pending the IO and execution moves to the next task on the ASE run queue. threads can start at any point as soon as they are ready. When the DSI-S got the ‘Batch Began’ message from thread #1. However. they are being executed first.If a thread executes a long running statement . Note that ASE doesn’t know the ideal execution order based on the DSI pattern.In this scenario.As each statement is executed. the DSI-S would immediately reply with ‘Begin Batch’. Consequently. when thread #2 would send it’s ‘Batch Ready’ message. assuming that it had received a ‘Batch Ready’ message from thread #2. As a result. 
any other statements left to be executed by the first thread increases the vulnerability of a rollback due to a deadlock issue. so ASE can wake up any one of them in any order. Parameter (Default) batch_begin Default: on. thread #2 would only wait until the ‘Batch Began’ (step 7) message was received by the DSI-S. Subsequent threads will simply block vs. Long Running SQL . there is a disagreement currently whether having this enabled for parallel DSI serialization methods such as wait_for_start delays the begin sequencing. Note that the ‘batch’ we are discussing is only the first batch.such as a stored procedure or if an invoked trigger runs long . If the transaction groups are submitted nearly in parallel. Recommended: (see text) Explanation Indicates whether a begin transaction can be sent in the same batch as other commands (such as insert. and so on).0. so that triggers do not fire when transactions are executed on the connection.which is far. if we use 10 seconds as our max. delete). the reality is that unless you are doing procedure replication . 3 – prevents phantom rows.0. dsi_commit_check_locks_max Default: 400. 2 – prevents nonrepeatable reads and dirty reads. The ANSI standard and Adaptive Server supported values are: 0 – ensures that data written by one transaction represents the actual data. Recommended: 1 dsi_keep_triggers Default: on (except standby databases). This should be set to a setting which puts a log warning out after 3-5 seconds to provide an earlier indication of an issue. Maximum: 1. Recommendation is based on your preference as both mechanisms have positives and negatives as discussed above. Again. to derive the value we would simply divide 10000ms by the dsi_commit_check_locks_intrvl . Used with parallel DSI. Note that at the default settings of 1000ms for dsi_commit_check_locks_intrvl. and ensures that data written by one transaction represents the actual data. far too long. Default: 200.the shorter especially for pure DML (insert.Final v2. this can be safely set to ‘off’. Specifies whether triggers should fire for replicated transactions in the database. While the book suggests to set on for all databases except standby databases.000.000 Specifies whether commit control processing is handled internally by Replication Server using internal tables (on) or externally using the rs_threads system table (off). The default value is the current transaction isolation level for the target data server. Minimum: 1. 1 – prevents dirty reads and ensures that data written by one transaction represents the actual data. Likely this will be a number <100. update. Maximum: 1. nonrepeatable reads.000.000 The maximum number of times a DSI executor thread checks whether it is blocking other transactions in the replicate database before rolling back its transaction and retrying it. Recommended: (see text) dsi_isolation_level Default: DBMS dependent. and ensures that data written by one transaction represents the actual data. The max should terminate in 5-10 seconds or less . and dirty reads.1 Parameter (Default) dsi_commit_check_locks_logs Default: 200.or are not replicating tables that are strictly populated by triggers. Default: on Specifies the isolation level for transactions. Default: 400. Replication Server supports all values for replicate data servers. Used with parallel DSI. To arrive at this value. Minimum: 1. 
Recommended: <100 (see text) Explanation The number of times the DSI executor thread executes the rs_dsi_check_thread_lock function string before logging a warning message. Recommended: (see text) dsi_commit_control Default: on. the answer would be 100. Set off to cause Replication Server to set triggers off in the Adaptive Server database.667 minutes .if 100ms. Recommended: off 242 . the default setting of 400 becomes 400 seconds or 6. Data servers supporting other isolation levels are supported as well through the use of the rs_set_isolation_level function string. simply divide 3000 (3 secs in milliseconds) by dsi_commit_check_locks_intrvl. Larger numbers may improve data latency at the replicate database.prevents conflicts by allowing only one active transaction from a primary data server. In parallel DSI environments .1 Parameter (Default) dsi_large_xact_size Default: 100. likely 1 is the best setting. See sub-section on Large Transaction Processing later in this section for details.Final v2. No_wait .0. Recommended: wait_for_start 243 . The maximum value is one less than the value of dsi_num_threads. If dsi_large_xact_size is set to 2 billion. The minimum value is 4. Recommended: 10. this should be set to 0. if the application does have some poorly designed large transactions. origin_sessid. setting this higher may help throughput. setting this to a much higher number than ordinary might help reduce DSI latency when the DSI is waiting on a commit before it even starts. parallel DSI’s are a better approach than increasing this value significantly. name. single_transaction_per_origin . Recommended (see text) dsi_partitioning_rule Default: none. The reason this is mentioned here at all is because of the impact on parallel DSI’s. user. time.000 or 2.assumes that your application is designed to avoid conflicting updates. More than 2 are probably not effective. and none. time.843. The number of parallel DSI threads to be used. The method used to maintain serial consistency between parallel DSI threads. hence.647 (max) Explanation The number of commands allowed in a transaction before the transaction is considered to be large for using a single parallel DSI thread.specifies that transaction isolation level 3 locking is to be used in the replicate data server. Values are origin. The maximum value is 255. None/wait_for_start . The number of parallel DSI threads to be reserved for use with large transactions. waiting for previous threads to at least start as other settings. Range of values: 1 – 100.maintains transaction serialization by instructing the DSI to wait until one transaction is ready to commit before initiating the next transaction. In non-parallel DSI environments. See sub-section on Serialization Methods dsi_serialization_method Default: wait_for_commit. See the Replication Server Administration Guide Volume 2 for detailed information. Recommended (see text) Specifies the partitioning rules (one or more) the DSI uses to partition transactions among available parallel DSI threads. this may have to set considerably lower (i. when applying transactions to a replicate data server. 5 if parallel_dsi is set to true.e. While 100 may work in some instances. Then try the combination origin_sessid. wait_for_commit . A common mistake is setting this to 100 and using a single DSI instead of attempting parallel DSI’s and a lower value. Specifies the maximum number of transactions in a group. Recommended: (see text) dsi_num_large_xact_threads Default: 2 if parallel_dsi is set to true. 
The recommended setting is to leave this set to none unless using parallel DSI’s and experiencing more than 1 rollback every 10 seconds. dsi_max_xacts_in_group Default: 20. or that lock protection is built into your database system. See section on parallel DSI for appropriate setting . Recommended: 0 or 1 (see text) dsi_num_threads Default: 1 if no parallel DSI’s.but it is likely that the default is too low for high performance situations. all too often grouping rules make it difficult to achieve.especially those involving a lot of updates or deletes. While the initial recommendation would be to raise this to 2 billion and thereby eliminate this configuration from kicking in as it has little real effect. Values are: isolation_level_3 . ignore_origin. If attempting some large transactions. The default is probably far too low for other than strictly OLTP systems.threads begin as soon as they are ready vs. 5-10).147. Recommended: 4-8MB Explanation Maximum SQT (Stable Queue Transaction interface) cache memory for the database connection. due to the serialization method. One of the most difficult concepts to understand is the difference between the serialization methods. However. "0.Final v2. More DSI threads will not necessarily improve performance.1 Parameter (Default) dsi_sqt_max_cache_size Default: (0). This serialization method uses the ‘Ready to Commit’ transaction sequencing as it assumes that there will be considerable contention between the parallel transactions. 244 . transactions at the replicate are always applied in commit order. Note that parallel_dsi sets several configuration values to what would appear to be fairly low numbers.0. However. A setting of "on" configures these values: dsi_num_threads = 5 dsi_num_large_xact_threads = 2 dsi_serialization_method = "wait_for_commit" dsi_sqt_max_cache_size = 1 million bytes. the more dsi_sqt_max_cache_size you may need." means that the current setting of sqt_max_cache_size is used as the maximum cache size for the connection. Provides a shorthand method for configuring parallel DSI threads. the higher the probability of contention as the degree of parallelism increases or the higher the probability of contention with other users on the system. You can set this parameter to "on" and then set individual parallel DSI configuration parameters to fine-tune your configuration. The best way to describe this is that the serialization method you choose depends on the amount of contention that you expect between the parallel threads. it does control the timing of transaction delivery with Parallel DSI’s in order to reduce contention caused by conflicts between the DSI’s. parallel_dsi Default: off. This results in the thread timing in which execution is more staggered than parallel as illustrated below. The more transactions grouped together. Some of this you can directly control via the dsi_max_xacts_in_group tuning parameter. A setting of "off" configures these parallel DSI values to their defaults. in bytes. many of these work together. This parameter controls the use of parallel DSI threads for applying transactions to a replicate data server. The more DSI threads you plan on using. wait_for_commit The default setting for dsi_serialization_method is “wait_for_commit”. Recommended: (see text) As illustrated by the single parameter “parallel_dsi”. The default. This will become more apparent as each of the serialization methods will be described in more detail in the following sections. 
Serialization Methods Key Concept #26 – Serialization Method has nothing to do with transaction commit order. the next thread’s transaction group is not sent until the previous thread’s statements have all completed successfully and it is ready to commit. As a result. these settings are typically the most optimal. No matter which serialization method. the thread transactions are submitted nearly in parallel based on a transaction sequencing on the begin statement . updates.0. isolation level 3 locking includes some addition locks . As a result. . isolation_level_3 is the safer. there is no difference between ‘isolation_level_3’ and ‘none’ from a performance perspective. As the first method being discussed which has a high degree of parallelism. and deletes). in pure DML replication this likely will have minimal if any . However. there is no difference in lock hold times. This is an absolute falsehood for the following reasons: 1. however.namely range and infinity (or next key) locks. T17 T16 T15 T14 T13 T12 T11 T10 T09 T08 T07 T06 T05 T04 T03 T02 T01 T00 Figure 70 – Thread timing with dsi_serialization_method = none 245 . inserts. etc. it assures that contention between the threads does not result in one rolling back – which would cause all those that follow to rollback as well. .when DOL locking is involved (and unfortunately datarows locking is likely needed to support parallel DSI’s. most people would think that isolation_level_3 will be slower than ‘none’ as a serialization method and invoke considerable more blocking with non-replication processes at the replicate. “none” could result in an inconsistent database – see the section on isolation_level_3 for details. . T17 T16 T15 T14 T13 T12 T11 T10 T09 T08 T07 T06 T05 T04 T03 T02 T01 T00 Figure 69 – Thread timing with dsi_serialization_method = wait_for_commit . this is likely). and consequently Sybase recommended setting up through RS 12. . However. in certain situations.5. Since RS delivers all changes inside of a transaction and DML statement locks are held during the duration of the transaction. Normally. the same is true for DRI constraint locks. What it really means is that no (none) contention is expected between the Parallel DSI threads. 2. again. the legacy term ‘none’). When purely replicating DML (i. none/wait_for_start “None” does not infer that no serialization is used.Final v2. This looks like the illustration similar to the below. .e. Since DRI constraints hold their locks until the end of the transaction. this timing sequence would have limited scalability beyond 3-5 parallel DSI threads. isolation_level_3 Let’s first make sure of one thing – a common misconception is dispelled. so.1 As discussed earlier. As a result. let’s take a look at contention and see how it can be reduced to reduce the number of rollbacks. There is one exception to this statement . In this case.(timing then more a factor of the number of transactions in the closed queue in the SQT cache) with each thread waiting until the previous thread has begun (hence the new name ‘wait_for_start’ vs. no one ever mentions having the offending writer invoke isolation level 3. isolation level 3 may not be necessary as alternatives exist. Of course. the impact is the same as setting the serialization method to wait_for_start and setting dsi_isolation_level to 3. etc. Once eliminated.0. update.but also between the different parallel DSI threads. if you think about it. or procedure replication is involved. 246 . 
this could be implemented as a repeatable read triggered by the replicated insert. Consider the classic case of the aggregate. if it is needed to ensure repeatable reads for aggregate or other select-based queries invoked in replicated procedures or triggers – the primary goal should be to see if a function string approach could eliminate the repeatable read condition. Serialization method “isolation_level_3” is identical to “none” with the addition that Replication Server first issues “set transaction isolation level 3” via rs_set_isolation_level_3 function.e. and then the row is re-read. By exploiting function strings – or by encapsulating the set isolation command within procedure or trigger logic. In this case. The same effect could be implemented by setting the dsi_serialization_method to “wait_for_start” and setting dsi_partitioning_rule to “origin”. While it is still available to support legacy configurations. Consider for example the normal “phantom read” problem where a process scanning the table reads a row – the row is moved as a result of an update. delete or whichever DML operation. Even in these cases however. the balance is needed to ensure timely ATM and debit card transactions. When replicating to the central corporate system. a considerable amount of read activity could occur in the following: • • • • Declarative integrity (DRI always holds locks) Select statements inside replicated procedures Trigger code if not turned off for connection Custom function strings Aggregate calculations. However. isolation_level_3 can be set safely without any undue impact on performance. However. However. you may find that you can either avoid using isolation level three or restrict it only to those transactions from the primary that truly need it. in assuring that isolation_level_3 is necessary to ensure repeatable reads from the aspect of the replicated transactions and that the extra lock hold time for selects in the procedure will not increase contention.1 impact. single_transaction_per_origin is outdated in RS 15. Of course. Because Replication Server has access to the complete before and after images of the row. it holds the lock until the read completes – thereby blocking the writer and preventing the problem. as one would expect. this is simply avoided by having the scanning process invoke isolation level 3 via the set command. in addition to the contention increase simply from holding the locks on select statements. if triggers are still enabled.0. Although isolation_level_3 is currently the safest parallel DSI serialization setting.not only causing contention with reporting users . While Replication Server is normally associated with write activity. so. this could increase contention between threads dramatically due to select locks being held throughout the transaction – but ONLY if the replicated operations invoke or contain select statements (i. care should be taken when replicating stored procedures. the lock time for these locks can be extended . a function string similar to the following could be constructed: alter function string <repdef_name>.rs_update for rs_default_function_class output language ‘update bank_account set balance = balance – (?tran_amount!old?-?tran_amount!new?) where <pkeycol> = ?tran_id!new? ‘ • This maintains the aggregate without isolation level three required – and much more importantly – without the expensive scan of the table to derive the delta. In summary. 
Let’s assume that a bank does not keep the “account balance” stored in the primary system (possibly because the primary is a local branch and may not have total account visibility??). the most obvious example of when isolation level 3 is normally thought of is when performing aggregation for data elements that are not aggregated at the primary and consequently the replicate may have to perform a repeatable read as part of the replication process. However.Final v2. In a normal system. most of Replication Server’s transactions will be as the writer. it probably is in the same role as the offending writer in the phantom read – no isolation level three required. replicated procedures). it is totally unnecessary. This could be a scenario similar to replicating to a DSS system or a denormalized design in which only aggregate rollups are maintained. It should also be noted that isolation_level_3 as a dsi_serialization_method is a bit of an anachronism in RS 15. etcConsequently.0. a possibly bigger performance issue when isolation level three is required is the extra i/o costs of performing the scans that the repeatable reads focus on – all within the scope of the DSI transaction group. single_transaction_per_origin Similar to isolation_level_3. The reason for that is that it would be unnecessary as once the read scans the row to be updated. For example. if another transaction is hit from the same origin. By only allowing a single transaction per origin. there should not be any contention between transactions. This is not quite true and is just an illustration.Final v2. it would resemble: DSI-Exec DSI (S) SQT DSI-Exec DSI-Exec Corporate HQ Stable Device Distributor Distributor Distributor SQM dAIO SQT SQT SQT SQM SQM SQM Outbound (0) Inbound (1) Outbound (0) Inbound (1) Outbound (0) Inbound (1) Outbound (0) Inbound (1) New Yo rk Chicago Seattle Rep Agent Rep Agent Rep Agent Figure 72 – Internal threads processing of single_transaction_per_origin Note that the above diagram suggests that each DSI handles a separate origin regardless. From an internal threads perspective. The real impact of single_transaction_per_origin is that if any origin already has a transaction group in progress on one DSIEXEC thread.0. Toronto. stock trades in Chicago. However. each DSI could simply be processing a different sites transaction – consequently. all the available routes between the sites normally present in a shared primary are not illustrated simply due to image clarity. within each site – for example. San Francisco. Chicago San Francisco New York London Tokyo Figure 71 – Corporate Rollup or Shared Primary scenario In the above example. Although clearly applicable for corporate rollups.1 The single_transaction_per_origin serialization method is mainly used for corporate rollup scenarios. Tokyo and London are completely independent of each other – consequently their DML operations would not interfere with each other except in cases of updates of aggregate balances. another implementation for which single_transaction_per_origin works well is the shared primary or any other model in which the target replicated database is receiving data from multiple sources. since the transactions are from different origin databases. that transaction is applied 247 . transactions from Chicago – some significant amount of contention may exist. the transaction timing is similar to none or isolation_level_3 in that the Parallel DSI threads are not staggered waiting for a the previous commit. In this serialization method. 
the outbound queue will get behind quickly. it also means that since a thread can start when ready. it could even start before the previous thread if the previous thread is not ready for any reason (i. Either one of these situations is fairly common and could cause apparent performance throughput to appear much lower than normal. it could be applied in parallel. . While the error handling is easily spotted from the Replication Server error log. Remember. If the source system has a very high transaction volume. This could cause the outbound and inbound queues to rapidly fill – possibly ending up with a primary transaction log suspend Origin Transaction Rate – Again.instead the simply start as soon as they are ready. The result could be something like: Figure 74 – Thread timing with dsi_serialization_method = no_wait 248 . In global situations where normal workday at one location is offset from the other sites. this is not true. no_wait The dsi_serialization_method of no_wait is similar to wait_for_start except that the threads do not wait for the other threads to start . .e. The result is a slightly staggered starting sequence illustrated a few pages ago. each individual site effectively has a single DSI of all the parallel DSI’s to use. if the next transaction was from a different origin. From a performance perspective. Single Origin Error – Consider what happens if one of the replicated transactions from one of the sites fails for any reason. similar to the following: . single_transaction_per_origin may not have as high of a throughput as other methods such as none. T17 T16 T15 T14 T13 T12 T11 T10 T09 T08 T07 T06 T05 T04 T03 T02 T01 T00 Figure 73 – Thread timing with dsi_serialization_method = none No wait no only eliminates this slight stagger.0. the source transaction rate or the balance of transactions is extremely difficult to determine on the fly. all of the transactions for a period of time come from the same origin – and consequently are single threaded. still converting to SQL).Final v2. with wait_for_start or none . All DSI threads are suspended and the queue fills until the one site’s transaction is fixed and connection resumed. However. Instead.each thread waits to begin it’s batch until the previous thread begins. Consider the following: Origin Transaction Balance – single_transaction_per_origin works best in situations where all the sites are applying transactions evenly.1 as if the serialization was “wait_for_commit” instead. they can use none/wait_for_commit. now that it is better understood – which one to use??? Consider the following table: Dsi_serialization_method Wait_for_commit None/wait_for_start When to use • • • • • • • • • • High contention at primary Low to mid volume High volume Insert intensive application Commit consistent transactions Mid-High Volume Ensure database consistency Low cardinality rollup with high volume from each Short cycle update/DML High cardinality rollup with low volume from each When not to use • • • • • • • High volume Short cycle update/DML (unless dsi_isolation_level is set to ‘3’) Not commit consistent transactions High number of selects in procs or function strings Satisfiable via before/after image Commit consistent Low cardinality rollup with high volume from each Isolation_level_three Single_transaction_per_origin As you can tell. Dsi_serialization_method summary So.2. no_wait could increase the number of parallel failures/rollbacks. but they must have dsi_isolation_level set to ‘3’ which has the same effect. no_wait may help. 
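Whichever serialization method is chosen, the related connection parameters are all applied the same way. The sketch below shows one plausible starting point for a high-volume connection using wait_for_start with isolation level 3; the connection name (RDS.rdb) and the values are illustrative only and should be tuned according to the transaction profile as described in the preceding sections.

suspend connection to RDS.rdb
go
alter connection to RDS.rdb set parallel_dsi to 'on'
go
alter connection to RDS.rdb set dsi_num_threads to '10'
go
alter connection to RDS.rdb set dsi_serialization_method to 'wait_for_start'
go
alter connection to RDS.rdb set dsi_isolation_level to '3'
go
alter connection to RDS.rdb set dsi_max_xacts_in_group to '20'
go
resume connection to RDS.rdb
go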
“Disappearing Update” Problem Consider the following scenario of atomic transactions at the primary -. 2. in an update intensive environment. wait_for_start?? In an insert intensive environment.”this is a test string”}. However. there are transaction profiles that must use dsi_serialization_method=isolation_level_3 vs.create table tableX ( -col_1 int not null. If parallel DSI’s are used and the dsi_serialization_method of “none” is selected.”dummy row”} – however. dsi_serialization_method=’none’. -col_2 int not null.Final v2. Alternatively. col_2. -constraint tableX_PK primary key (col_1) -. If the first insert is the last SQL statement in a group and the update the first statement in the next group. col_3) values (1.0. Consider the following picture: 249 . -col_3 varchar(255) null.5. When would you use no_wait vs. col_3=”dummy row” where col1_1=1 One would always expect the resulting row to tuple to be {1. it is possible that the result at the replicate could be {1. because the probability of a conflicting update executing ahead of a previous one is even higher than it was under wait_for_start. This is a big difference. the update will physically occur BEFORE the insert.assume a table similar to: -. Transaction Execution Sequence However.1 Note that the commit order is still maintained. the SQL statements are executed OUT OF ORDER – but committed in SERIALIZED ORDER. The reason for this is the timing of the transactions. as if the update never occurred. “this is a test string”) Update tableX set col_2=5.) Insert into tableX (col_1. it simply depends on the transaction profile from the source system. For example. if the dsi_serialization_method was set to isolation_level_3. Sybase stopped holding the locks unless isolation level 3 is enabled. As of SQL Server 10. It reads: This method assumes that your application is designed to avoid conflicting updates. the update would hold the locks and consequently the insert would block resulting in a deadlock as earlier discussed in the last section. dsi_serialization_method=none Many of you may already see the problem.Final v2. 250 . Sybase ASE engineering introduced several optimizations under isolation levels 1 & 2. SQL Server version 4. transaction serialization cannot be guaranteed if you choose either the "none" (no coordination) or the "isolation_level_3" method. Thus. or that lock protection is built into your database system.1 Insert into tableX () CT 1 TX 05 TX 04 Update tableX TX 03 TX 02 TX 01 UT 1 BT 1 Blocked CT 2 GT 1 TX 10 TX 09 TX 08 TX 07 TX 06 UT 2 BT 2 Blocked CT 3 GT 2 TX 15 TX 14 TX 13 TX 12 TX 11 UT 3 BT 3 Blocked CT 4 GT 3 TX 20 TX 19 TX 18 TX 17 TX 16 UT 4 BT 4 T17 T16 T15 T14 T13 T12 T11 T10 T09 T08 T07 T06 T05 T04 T03 T02 T01 T00 BT n CT n TX ## rs_begin for transaction for thread n rs_commit for transaction for thread n Replicated DML transaction ## UT n GT n rs_update_threads n rs_get_thread_seq n Figure 75 – Statement Execution Sequence vs. which may not be supported by non-Sybase databases. However. The high-lighted section probably makes a lot more sense now to those who read it in the past and wondered. The DOL/RLL Twist An aspect that caught people by surprise was when this started happening even when using wait_for_commit and DOL tables. Consider the following description from the Replication Server Administration Guide located in the section describing Parallel DSI Serialization Methods (located in Performance and Tuning chapter) – and in particular is the description for “none”. 
Figure 75 – Statement Execution Sequence vs. Transaction Execution Sequence

Note that the commit order is still maintained. However, the SQL statements are executed OUT OF ORDER – they are only committed in SERIALIZED ORDER. In the illustration, the update sees no row (0 rows affected), the insert physically occurs later, and the values are never updated – as if the update never happened. But wait… shouldn't the update block the insert?

Locking in ASE

No – not necessarily. Consider the following description from the Replication Server Administration Guide, in the section describing Parallel DSI serialization methods (in the Performance and Tuning chapter) – in particular, the description for "none". It reads:

    This method assumes that your application is designed to avoid conflicting updates, or that lock protection is built into your database system. Conflicting updates between transactions are detected by parallel DSI threads as deadlocks. SQL Server version 4.9.2 holds update locks for the duration of the transaction when an update references a nonexistent row. SQL Server version 10.0 and later does not hold these locks unless transaction isolation level 3 has been set – which may not be supported by non-Sybase databases. For replication to non-Sybase databases, transaction serialization cannot be guaranteed if you choose either the "none" (no coordination) or the "isolation_level_3" method.

The highlighted section probably makes a lot more sense now to those who read it in the past and wondered. As of SQL Server 10.0, Sybase stopped holding the locks unless isolation level 3 is enabled; consequently, the above could happen – the lack of a row doesn't return an error and doesn't hold a lock when the second statement is executed first. If, for example, the dsi_serialization_method was set to isolation_level_3, the update would hold the locks and consequently the insert would block, resulting in a deadlock as discussed in the last section. The result would be the typical rollback and serial application – and all would be fine.

The DOL/RLL Twist

An aspect that caught people by surprise was when this started happening even when using wait_for_commit and DOL tables. In implementing DOL, Sybase ASE engineering introduced several optimizations under isolation levels 1 & 2:

Uncommitted Insert By-Pass – Uncommitted inserts on DOL tables would be bypassed by selects and other DML operations such as update or delete. The update effectively sees 0 rows affected.

Unconflicting Update Return – Select queries could return columns from uncommitted rows being updated if ASE could determine that the columns being selected were not being updated. For example, an update of a particular author's phone number in a DOL table would not block a query returning the same author's address.

Many people state that situations like this are not described in the books – but they are (as well as nearly all the material in this paper).
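A quick way to see the uncommitted insert by-pass is the following two-session sketch against the tableX example above. This is illustrative only – the exact behavior depends on the ASE version, the locking scheme of the table, and whether the optimization has been disabled:

    -- Session A: on a datarows-locked copy of tableX, insert a row but do not commit yet
    begin tran
        insert into tableX (col_1, col_2, col_3) values (1, 2, "this is a test string")

    -- Session B: at the default isolation level 1, the update does not block on
    -- Session A's uncommitted insert -- it simply reports "0 rows affected"
    update tableX set col_2 = 5, col_3 = "dummy row" where col_1 = 1
    select @@rowcount    -- returns 0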
With the uncommitted insert by-pass, any subsequent update or delete simply by-passes the uncommitted insert and returns "0 rows affected" – consequently the insert physically occurs later and the values are never updated. This broadened the vulnerability to the above problem substantially: the exposure was extended to the other serialization methods – including wait_for_commit, because the rows remain exposed until the commit is actually executed – with the exception of dsi_serialization_method = isolation_level_3, although the window of opportunity is much narrower than with "none". Additionally, in the case of the unconflicting update return, there is no proof that the unconflicting update is a cause of concern. It should be noted that this optimization can be monitored with trace flag 694 and disabled with trace flag 693.

At this time, customers are suggested to do one of the following if they find themselves in this situation:

•  Use dsi_isolation_level=3 or dsi_serialization_method=isolation_level_3
•  Boot the server with -T693 to disable the locking optimizations. This may be preferred if isolation level 3 leads to increased contention with parallel DSI's.

Isolation Level 3

The reason isolation level 3 does not experience the problem is intrinsic to the ANSI requirement to prevent phantom rows under isolation level 3. To prevent phantom rows, a "0 rows affected" DML operation must hold a lock where the row should have been, to prevent another user from inserting a row prior to commit (otherwise a re-read of the table would yield different results, violating isolation level 3). This is enforced in ASE via the following methods:

APL (All Page) Locking – The default locking scheme protects isolation level 3 by retaining an "update" lock with an "Index Page" context. An update lock is a special type of read lock that indicates that the reader may modify the data soon; it allows other shared locks on the page, but does not allow other update or exclusive locks. Since the lock is placed on the index page, it prevents inserts by blocking the insert from adding the appropriate index keys.

DOL (Data Only) Locking – To ensure isolation levels 2 & 3, Sybase introduced two new types of locking contexts with ASE 11.9.2 – "Range" and "Infinity" locks. A "Range" lock is placed on the next row of the table beyond the rows that qualify for the query, preventing a user from adding an additional row that would have qualified. The "Infinity" lock is simply a special "Range" lock used at the very beginning or end of a table, since it is not always possible to determine the page location (i.e. the end of the table) for the lock. This prevents a user from adding a row that would have qualified either at the beginning or end of a range.

As a result, if isolation level 3 is set, the premature execution of the update causes a deadlock with rs_threads, and results in Replication Server rolling back the transactions and re-issuing them in serial vs. parallel – which is deliberate. Consequently, instead of the update or delete being executed prior to the insert, the above situation becomes:

Figure 76 – Deadlock instead of "disappearing update" with isolation level 3
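Of the two remediations above, the first is a Replication Server connection-level change and the second is an ASE start-up option. A minimal sketch of each follows – the connection name RDS.rdb and the RUN file name are illustrative, and the first form assumes a Replication Server version that supports dsi_isolation_level:

    -- in the Replication Server
    suspend connection to RDS.rdb
    go
    alter connection to RDS.rdb set dsi_isolation_level to '3'
    go
    resume connection to RDS.rdb
    go

    -- or, at the replicate ASE: add -T693 to the dataserver command line in the
    -- RUN_<servername> file and restart the dataserver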
In any case, when any error occurs during parallel execution, the Replication Server logs the error along with a warning that a transaction failed in parallel and will be re-executed in serial.

Spurious Duplicate Keys

Although the issue above has gained the most attention as "disappearing updates", Sybase has been able to identify other situations (such as insert/delete) that could occur. Specifically, referring to the previous drawings, one situation that does not apply is a delete followed by an insert in which the insert is executed first due to the same parallel execution that causes the disappearing update problem. This situation could occur when an application deletes and re-inserts rows vs. performing an update – somewhat analogous to Replication Server's "autocorrection" feature. Note that a deferred update is NOT logged nor replicated as a separate delete/insert pair; consequently it must be an explicit delete/insert pair submitted by the application. In this case, if the insert ended up where the update was illustrated above, it would execute (or attempt to) prior to the delete. As the row is already present, this would result in a duplicate key error being raised by the unique index on the primary key columns. This situation differs from the disappearing update problem in several critical areas:

•  While the "disappearing update" problem does not raise an error, the duplicate key insert does in fact raise an error – which causes a rollback of the SQL that is causing the problem. Subsequent execution in serial by Replication Server would correctly apply the SQL and the database would not be inconsistent.
•  Because an error is raised and the serial retry succeeds, none of the current proposals for addressing the disappearing update problem would address this issue – other than proper transaction management within the application. It is only related to the disappearing update problem in the aspect of execution order; it might be conceived to be a related problem, but in fact it is not.

Customers witnessing a frequent number of "duplicate key" errors that appear spurious – in that the subsequent execution succeeds – should attempt to resolve the problem by ensuring proper transaction management is in place or by other application controls outside the scope of this issue. One frequent check is to determine if the system is a Warm Standby and if an approximate numeric (float) column exists – if so, the likely cause of the spurious keys is our old friend, the missing repdef/primary key identification.

Estimating Vulnerability

Before eliminating parallel DSI's, you should first assess the vulnerability of your systems. Normally, assessing the window of vulnerability finds it extremely small. The conditions that could cause an issue fall into the following categories:

Non-Existent Row – Basically the scenario addressed in the book and illustrated above, characterized by an insert followed closely by DML. Referring to the previous drawings, the above could happen when an update, delete or procedure (containing conditional logic or aggregation) closely follows a previous transaction such that it is within num_xacts_in_group * num_threads transactions – but in separate transactions. The lack of a row doesn't return an error and doesn't hold a lock when the second statement is executed first. Consequently this scenario is always characterized by an insert followed by an update or delete. Updates or deletes triggered by an insert would not be a case, as any triggered DML is included in the implicit transaction with the triggering DML.

Repeatable Read – The typical isolation level three problem as discussed for the isolation_level_3 dsi_serialization_method: basically any DML operation followed closely by a read (either in a replicated proc or a trigger on a different table) – for example, work table data is replicated and then a procedure that uses the work table data is replicated.

Examples of applications that might be vulnerable include:

•  Applications with poor transaction management, in which atomic SQL statements that are all part of the same logical unit of work are often executed outside the scope of explicit transactions.
•  Applications in which explicit transactions were avoided due to contention.
•  A typical job queue application in which a job is retrieved from the queue very quickly after being created (as is normal – retrieving a job usually entails updating the job status).
•  Middle tier components that perform immediate saving of data as each method is called vs. waiting until the object is fully populated.
•  Common "wizard" based applications, if each screen saves its information to the database individually prior to transitioning to the next screen (assuming the following screen may update the same row of information).

If you suspect exposure, you can determine if DOL has exposed your application by booting the replicate dataserver with trace flag 694. This leads up to:
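To put a rough number on that window (the figures here are purely illustrative): with dsi_num_threads = 5 and dsi_max_xacts_in_group at its default of 20, up to 5 * 20 = 100 transactions may be in flight across the parallel DSI's at any moment, so dependent DML committing within roughly 100 transactions of the insert it depends on falls inside the exposure window; dropping dsi_max_xacts_in_group to 5 shrinks that window to about 25 transactions.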
Key Concept #27: Parallel DSI serialization does NOT guarantee that transactions are executed in the same order – which could lead to database inconsistencies – particularly with dsi_serialization_method='wait_for_start' or 'none' and dsi_isolation_level other than '3'.

Large Transaction Processing

One of the most commonly known and frequently hit problems with Replication Server is processing large transactions. In earlier sections, the impact of large transactions on SQT cache and DIST/SRE processing was discussed. This section takes a close look at how large transactions affect the DSI thread, particularly in relationship to dsi_large_xact_size and the SQT open queue processing. Later (in the Multiple DSI section), we will discuss the concept of "commit consistent".

DSI Tuning Parameters

There really are only two tuning parameters for large transactions. Both are only applicable to Parallel DSI implementations; if Parallel DSI is not used, large transactions are processed normally with no special handling. The tuning parameters are:

Parameter: dsi_large_xact_size
   Default: 100     Recommended: 10,000 or 2,147,483,647 (max)
   Definition: The number of commands allowed in a transaction before the transaction is considered to be large enough to use a dedicated parallel DSI thread. The minimum value is 4. The default is probably far too low for anything other than strictly OLTP systems.

Parameter: dsi_num_large_xact_threads
   Default: 2 if parallel_dsi is set to true     Recommended: 0 or 1 (see text)
   Definition: The number of parallel DSI threads to be reserved for use with large transactions. The maximum value is one less than the value of dsi_num_threads. More than 2 are probably not effective; if dsi_large_xact_size is set to 2 billion, this should be set to 0. See the text in this section for details.

The key tuning parameter of the two is dsi_large_xact_size. It should be noted that it is at the DSI that a transaction is defined as "large" – while a transaction may be "large" enough to be flushed from the SQT cache, it can still be too small to qualify as a large transaction. When a transaction exceeds this limit, the DSI processes it as a large transaction. While the initial recommendation would be to raise this to 2 billion – effectively preventing this configuration from kicking in, as it has little real effect – if the application does have some poorly designed large transactions, setting it to a much higher number than the default might help reduce DSI latency, since the DSI otherwise waits on a commit before it even starts. If allowing some large transactions, 1 is likely the best setting for dsi_num_large_xact_threads. If you think about it, with parallel DSI's we are deliberately executing the transactions somewhat out of order to achieve greater parallelism; if we didn't, we would be executing them in serial fashion (a la wait_for_commit), which does not achieve any real parallelism. If the transactions are not commit consistent, use isolation_level_3.

Parallel DSI Processing

In addition to beginning to process large transactions before the commit record is seen by the DSI/SQT, the Replication Server also processes a large transaction slightly differently during execution. When Parallel DSI's are in use, the DSI does the following:

1.  Allows the transaction to be sent to the replicate without waiting for the commit record to be read.
2.  Uses a dedicated large transaction DSI thread.
3.  Every dsi_large_xact_size rows, attempts to provide early detection of conflicting transactions.

The main differences are:
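As a sketch of the recommendation above (the connection name is illustrative), disabling special large-transaction handling altogether would look something like the following; substitute 10000 and 1 if you prefer to keep a single large-transaction thread:

    suspend connection to RDS.rdb
    go
    alter connection to RDS.rdb set dsi_large_xact_size to '2147483647'
    go
    alter connection to RDS.rdb set dsi_num_large_xact_threads to '0'
    go
    resume connection to RDS.rdb
    go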
•  DSI/SQT open queue processing (the DSI doesn't wait for the commit to be seen)
•  Early conflict detection
•  Use of reserved DSI threads set aside for large transactions

SQT Open Queue Processing

The reference manual states that large transactions begin to be applied by the DSI thread before the DSI sees the commit record. This is accomplished by the DSI processing large transactions from the SQT cache's "Open" queue vs. the more normal "Closed" queue. While some people misinterpret this to mean that the transaction has yet to be committed at the primary, for normal (non-Warm Standby) replication this is not true. Remember, in order for the inbound queue SQT to pass the transaction to the outbound queue, the commit had to have been issued at the primary and processed in the inbound queue – or it would not have even gotten to the outbound queue. So why read from the Open queue at all? Without starting to apply the transaction until the commit is read in the outbound queue, the overall latency of the transaction is extended. Consider the latency impact of waiting until the commit is read in the outbound queue, as illustrated by the following timeline:

Figure 77 – Latency in processing large transactions

Consider the case of a fairly normal bcp of 100,000 rows into a replicated table (a slow bcp, so row changes are logged). As the row changes are logged, they are forwarded to the Replication Server by the Rep Agent long before the commit is even submitted to the primary system. If dsi_large_xact_size is at the default, after 100 rows have been processed to the Replication Server the transaction would be labeled as a large transaction. As a result, the DSI could start delivering the commands before the DIST has processed all of the commands from the inbound queue – in the bottom DSI execution of the transaction in the timeline (labeled DSI -> RDS (large xactn)), it finishes well before it would if it waited until the transaction was moved to the SQT Closed queue. Note that for normal replication the transaction has not only been committed, but fully forwarded to the Replication Server; a Warm Standby system, however, could be delivering SQL commands prior to the command being committed at the primary. How can this happen? Simple – the Rep Agent sends the logged row changes as they occur, well ahead of the commit (in fact, the commit may not even have been submitted to the primary yet).

If each time unit in the figure equaled an hour (although 2 hours for DIST/SRE processing is rather ludicrous) and the transaction began at the primary at 7:00pm, it would finish at the replicate at 7:00am the next morning using large transaction thread processing. Without it, the transaction would not finish at the replicate until 10:00am – 2 hours into business processing. This is definitely an important benefit for batch processing, ensuring that the batch processing finishes at the replicate prior to the next business day beginning.

However, this does have a possible negative effect in Warm Standby systems: a large transaction may be rolled back at the primary – and then needs to be rolled back at the replicate. Consider the above example – should the bcp fail due to row formatting problems, it will need to be rolled back, not only in the primary
but also at the replicate, as the transaction has already been started there. Remember, normal (small) transactions, of course, are not sent to the Standby database until they have committed. With such a negative, why is this done? The answer is simple – transaction rollbacks in production systems are extremely rare (or should be!), so this issue is much more the exception than the norm, and the benefit of this approach far outweighs the very small amount of risk.

The latency savings is really evident in Warm Standby. Remember, the Standby DSI is reading from the inbound queue's SQT cache. As a result, it is fully possible that the Standby system will start applying the transaction within seconds of it starting at the primary and will commit within nearly the same time. Compare the following timeline with the one above:

Figure 78 – Latency in processing large transactions for Warm Standby

However, the above will only happen if large transactions run in isolation (such as serial batch jobs). Since a large transaction reads from the SQT "Open" queue, concurrent large transactions may not experience the desired behavior, as will be discussed in the next section. The problem is that if a large transaction begins to be applied and another smaller transaction commits prior to the large transaction, the large transaction is rolled back and the smaller concurrent transaction is committed in order. After the smaller transaction commits, the large transaction does not restart from the beginning automatically – rather it waits until the commit is actually received before it is reapplied. This is probably due to the expense of large rollbacks and the aspect that if the rollback occurs once, it is likely to occur again. Having said that, large transactions run concurrently (provided they are started in order of commit), such as concurrent purge routines, may be able to execute without the rollback/wait-for-commit behavior. However, attempts to tune for and allocate large transaction threads will be negated if smaller/other transactions are allowed to run concurrently and commit prior to the large transaction(s).

This behavior is easily evident by performing the following in a Warm Standby configuration:

1.  Configure the DSI connections for parallel DSI using the default parallel_dsi='on' setting.
2.  Begin a large transaction at the primary (i.e. a 500 row insert into a table within an explicit transaction). At the end of the transaction, place a waitfor delay "00:03:00" immediately prior to the commit.
3.  Use a dirty read at the replicate to confirm the large transaction has started.
4.  Perform an atomic insert into another table at the primary (allow it to implicitly commit).
5.  Use a dirty read at the replicate to confirm the large transaction rolled back and does not restart until the delay expires and the transaction commits.
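The large transaction used in steps 2 and 3 above might look something like the following sketch; the table names and row counts are purely illustrative:

    -- at the primary (step 2)
    begin tran big_test
        insert into tableY select * from staging_tableY    -- roughly 500 rows
        waitfor delay "00:03:00"                            -- hold the commit for 3 minutes
    commit tran big_test

    -- at the standby, in a separate session (step 3)
    set transaction isolation level 0                       -- dirty read
    select count(*) from tableY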
This behavior – the rollback followed by waiting for the actual commit – coupled with the "early conflict detection" and other logic implemented in large transaction threads to avoid excessive rollbacks, is a very good reason to avoid the temptation – especially in Warm Standby – to reduce dsi_large_xact_size with hopes of improving throughput and reducing latency.

Key Concept #28: Large transaction DSI handling is intended to reduce the double "latency penalty" that waiting for a commit record in the outbound queue introduces in normal replication, as well as the latency and switch-active timing issues associated with Warm Standby.

Early Conflict Detection

Another factor of large transactions that the dsi_large_xact_size parameter controls is the timing of early conflict detection. During processing of large transactions, every dsi_large_xact_size rows the user thread attempts to select the row for the next thread to commit in order to surface conflicting updates. This is stated in the Replication Server Administration manual as: "After a certain number of rows (specified by the dsi_large_xact_size parameter), the DSI thread attempts to select the sequence number of the thread before it." What this really means is the following: for a large transaction of 1,000 statements (i.e. a bcp of 1,000 rows), the Replication Server would insert an rs_get_thread_seq call every 100 rows (assuming dsi_large_xact_size is still the default of 100). By doing this, if there is a situation in which the large transaction is blocking a smaller one, a deadlock is caused, thus "surfacing" the conflict – and by surfacing the offending conflict earlier rather than later, the rollback time of the large transaction is reduced. This is illustrated in the below diagram, in which thread #2 is being blocked by a conflicting insert by thread #3.

Figure 79 – Early Conflict Detection with large transactions

The reason for this is the extreme expense of rollbacks and the size of large transactions. Although performance varies from version to version of ASE as well as with the transaction itself, a transaction may take a full order of magnitude longer to roll back than it took to execute (i.e. a transaction with an execute time of 6 minutes may require an hour to roll back). To put this in perspective, try a large transaction in any database within an explicit transaction and roll it back vs. letting it commit. Consequently, without the periodic check for contention by selecting rs_threads every dsi_large_xact_size rows, a large transaction could carry a significantly large rollback "penalty" (i.e. 900 rows for the bcp example). This is crucial, as no other transaction activity is re-initiated until all the rollbacks have completed. This is illustrated in the diagram below – a slight modification of the above – with the intermediate rs_threads selects grayed out.
Figure 80 – Possible Rollback Penalty without Early Conflict Detection

Now then, getting back to the point discussed in the previous section – the temptation to reduce dsi_large_xact_size until most transactions qualify, with the goal of reducing latency. Let's say we have a novice Replication System Administrator (named Barney) who has diligently read the manuals and taken the class – but didn't test his system with a full transaction load (nothing abnormal here – in fact, it is a rarity – and a shame – these days to note that few, if any, large IT organizations stress test their applications or even have such a capability). Barney decides to capitalize on the large transaction advantage of reading from the SQT Open queue and sets dsi_num_threads to 5, dsi_num_large_xact_threads to 4, and finally sets dsi_large_xact_size to 5 (his average number of SQL statements per transaction from the application – a web order entry system). Let's also assume that, due to triggered updates for shipping costs, inventory tracking, customer profile updates, etc., the 5 SQL statements expand to a total of 12 statements per transaction (not at all hard).

To understand why this is a bad idea, consider the following points:

•  Large transactions are never grouped. We have now lost transaction grouping; this eliminates the benefits of transaction grouping and increases log I/O and rs_lastcommit contention – all for very little gain in latency for smaller transactions.
•  In order to ensure most transactions qualify, dsi_large_xact_size has to be set fairly low (i.e. 10). The more statements a transaction has relative to dsi_large_xact_size, the more rs_threads checks it incurs and the more the performance degradation.
•  The serialization between large transaction threads is essentially "none" up to the point of the first dsi_large_xact_size rows – since we are not waiting for the commits at all (let alone waiting until the threads are ready to be sent). If the transactions have considerable contention between them – to the extent that wait_for_commit would have been a better serialization method – the large transactions could experience considerable rollbacks and retries. After the first dsi_large_xact_size rows, however, the rs_threads blocking changes the remainder of the large transaction to more of a wait_for_commit serialization.

The last bullet takes a bit of thinking before it can be understood. If the average transaction was 20 statements and 5 large transaction threads were used, the first thread would have all 20 statements executing while the other 4 would execute up to the 10th and block. By contrast, a serialization method of "none" would let all 5 threads execute up to the 20th statement before blocking. The problem is that every 10 rows, the large DSI threads would block waiting for the other threads to commit. What Barney assumes he is getting looks similar to the following:
Figure 81 – Wishful Concurrent Large Transaction DSI Threads

The expectation: everything is done at T05. What Barney actually gets is more like:

Figure 82 – Real Life Concurrent Large Transaction DSI Threads

This illustrates how the first dsi_large_xact_size rows behave like a serialization method of "none", while the statements after that transition to more of a wait_for_commit behavior. Any contention between the threads is more than likely going to cause a deadlock – consider, for example, the impact if the last statement in thread 4 conflicts with one of the first rows in thread 5, or a rollback at T12 in the figure. Now, the unbeliever would be quick to say that dsi_large_xact_size could simply be increased to exactly the number of statements in the transaction (i.e. 12), at which point we would really have the execution timings in the earlier figure. Possibly – but that would be real hard, as the number of statements in a transaction is not a constant. While not denying that in some very rare instances of Warm Standby – with a perfectly static transaction size and no contention between threads – this type of implementation might help a small amount, the reality is that it is highly improbable, especially given the concurrent-transaction-induced rollback discussed earlier.

Thread Allocation

A little known and undocumented fact is that dsi_num_large_xact_threads are reserved out of dsi_num_threads exclusively for large transactions. That means only 3 threads are available for processing normal transactions if you set the default connection parameter "parallel_dsi" to 'on' without adjusting any of the other parameters (parallel_dsi 'on' sets dsi_num_threads to 5 and dsi_num_large_xact_threads to 2 – leaving only 3 threads for normal transactions of fewer than 100 commands at the default dsi_large_xact_size).
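If the parallel_dsi shorthand is convenient but the two reserved large-transaction threads are not wanted, the individual parameters can be overridden afterwards. A sketch – the connection name and values are illustrative only:

    suspend connection to RDS.rdb
    go
    alter connection to RDS.rdb set parallel_dsi to 'on'
    go
    alter connection to RDS.rdb set dsi_num_threads to '10'
    go
    alter connection to RDS.rdb set dsi_num_large_xact_threads to '1'
    go
    resume connection to RDS.rdb
    go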
CT 1 CT 2 Upd Table B Ins Table A Upd rs_threads 1 BT 1 BT 2 Deadlock Sel rs_threads 1 Upd Table C Ins Table B Upd rs_threads 2 Blocked CT 3 Sel rs_threads 2 Upd Table C Blocked Ins Table A Upd rs_threads 3 BT 3 Deadlock CT 4 Sel rs_threads 3 Upd Table A Ins Table C Upd rs_threads 4 BT 4 BT # CT # Begin Tran for thread # . you have enough information to understand why the default settings for the parallel_dsi connection parameter are what they are in respect to threading – and why this may not be the most optimal. When using the “wait_for_commit” serialization method. the higher the throughput. the rest do as well). since large transaction threads are “reserved”. this may not be even close to optimal as the assumption is that there will be significant contention between the Parallel DSI’s and the large transactions are significantly higher than dsi_large_xact_size setting. Using more than this number will not bring any additional benefit. Parallel DSI Contention Wait_for_start serialization method provides some of the greatest scalability – the more DSI’s involved. In addition. However. you need to understand the profile of the transactions you are replicating.Final v2. Consider the following illustration. increasing the number of large transaction threads may require increasing the total number of threads to avoid impacting (small) normal transaction delivery rates. • • However. 000* 1. Use datapage locking or reduce dsi_max_xacts_in_group Reduce dsi_max_xacts_in_group until contention reduced Note that nowhere in the above did we suggest changing the serialization method to “wait_for_commit”. the Replication Server encountered 3 rollbacks per minute – ordinarily excessive. Contention Last page contention Index contention Row contention Possible Resolution(s) Change clustered index. at another customer site.Final v2.0.1900 0830 . While the book states that “This method assumes that your application is designed to avoid conflicting updates. one of the more frequent tables “blamed” for deadlocks in replicated environments is the rs_threads table.000 625. if you do not have a lot of contention at the primary. As usual.1630 0500 – 0700 0900 . partition table or use datarow locking. Consider the following fictitious profile: Transaction Execute Trade Place Order Adjust Stock Price 401K Deposit Money Market Deposit Money Market Check Close Market Type OLTP OLTP OLTP Batch OLTP Batch Batch Time Range 0830 . If the problem is system induced as compared to the primary – yes. or that lock protection is built into your database system.000 750 1 Leading Contention Cause get next trade id get next order id place order read mutual fund balance central fund update central fund withdrawal isolation level 3 aggregation 260 . Backing off from that goal too quickly when other options exist could have a large impact on the ability of Replication Server to achieve the desired throughput. If the contention is system induced. then contention at the replicate may be a direct cause of system tuning settings at the replicated DBMS and not due to the transactions. two deadlocks exist – threads 1 & 2 are deadlocked since thread 2 is waiting on thread 1 to commit as normal (rs_threads) yet thread 2 started processing its update on table B prior to thread 1 (assuming the same row hence the contention). you need to develop a sense of the transaction profile being executed at the primary during each part of the business day.1 In the example above. 
In order to prevent the contention from continuing and causing the same problems all over again. Interestingly enough. using the default wait_for_commit resulted in the inbound queue rapidly getting one hour behind the primary bcp transaction.2200 1700 . wait_for_commit will resolve it – however. The biggest problem with this is that once one thread rollsback (typical response for a deadlock). In almost any system.1630 0700 . this illustrates the point that no one-size-fits-all approach to performance tuning works and that each situation brings its own unique problem set. However.1700 1800 .. but in this case. As you can see – this is rather deliberate. Obviously.” it is not as difficult to achieve as you think. the Replication Server will retry the remaining transactions serially (one batch at a time) before resuming parallel operations. Consider the following matrix of contention and possible resolutions. An easy way to find out the offenders is to turn on “print deadlock info” configuration in the dataserver using sp_configure and simply ignore the pages for the object id/table rs_threads.000 750. Consequently deadlocks involving rs_threads should not be viewed as contention issues with rs_threads. Threads 3 & 4 are similarly deadlocked.000 125. Keep in mind that even 2 threads running completely in parallel with a serialization of “none” may be better than 5 or 6 using “wait_for_commit”. but rather an indication of contention between the transactions the DSI’s were applying. During these 30 minutes. the impact on throughput can be severe. you need to first determine the type of contention involved and whether it involves. Basically. Understanding Replicated Transaction Profile In order to determine if the contention at the replicate (if there is any) is due to replication or schema induced contention. #2 is waiting on #1 and #1 is waiting on #2 – a classic deadlock. a serialization method of “none” should be the goal. a small number of occurrences are probably not a problem. However. During a benchmark at a customer site. all the subsequent threads will rollback as well. the serialization method of none was outperforming the default choice. As a result. a rollback followed by a serial transaction delivery will cause performance degradation if it happens frequently enough. Switching to “none” drained the queue in 30 minutes as well as keeping up with new records. a parallel transaction failed every 3-4 seconds – and no performance gain was noted in using “none” over “wait_for_commit”.1930 Volume (tpd) 500. Accordingly. the RowsInserted. Resolving Parallel DSI Contention Figure 84 – Parallel DSI Contention Monitoring via MDA Tables In the above set of tables. therefore. some of the OLTP transactions not only affect individual rows of data representing one type of business object (such as customer account) – but they also affect either an aggregate (central fund balance) or other business object data.Final v2. we would be most interested in what the second leading cause of contention is. It is the activity against the fund data that could be the source of contention and not the individual account data.LockWaits column. In addition. When replicating this transaction. however. if the database is only being used by the maintenance user.0. we may not be able to determine that as the first one may be masking it. 
Transaction          Type    Time Range    Volume (tpd)    Leading Contention Cause
Purge Historical     Batch   2300-2359     1               Index maintenance

* Normalized for surge occurring on a regular periodic basis

Note that in the above list of sample transactions, the first two OLTP transactions have a monotonic key contention issue at the primary; when replicating them, however, the id value will already be known, so if the database is only being used by the maintenance user this will not cause contention at the replicate. Also, some of the OLTP transactions not only affect individual rows of data representing one type of business object (such as a customer account) – they also affect either an aggregate (central fund balance) or other business object data. For example, each individual 401K pay deposit affects the individual investor's account; however, it also adjusts their particular fund's pool of receipts, which the fund manager uses for reinvestment. It is the activity against the fund data that could be the source of contention and not the individual account data. The contention could be on the latter; therefore we would be most interested in what the second leading cause of contention is – and if the transaction profile is not well understood, we may not be able to determine that, as the first cause may be masking it.

Resolving Parallel DSI Contention

Figure 84 – Parallel DSI Contention Monitoring via MDA Tables

In the above set of tables, monOpenObjectActivity can provide a fairly clear indication of which tables are causing contention among the parallel DSI's by monitoring the monOpenObjectActivity.LockWaits column.
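A simple sketch of that LockWaits check follows; it assumes the monitoring tables are installed and enabled at the replicate ASE, and the ordering is arbitrary:

    -- tables/indexes in the replicate with the most lock waits, plus row activity
    select DBID, ObjectID, IndexID, LockWaits, RowsInserted, RowsUpdated, RowsDeleted
      from master..monOpenObjectActivity
     where LockWaits > 0
     order by LockWaits desc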
This is especially true in Warm Standby scenarios in which the replicate system is the only updated by the Replication Server (and attempting a serialization method of “none”). in an earlier session we mentioned that in one system. For example. It is unfortunate. however. it can frequently lead to contention at the replicate that didn’t exist in the primary. 10 DSI’s were in use. The easiest way to resolve this is to simply reduce the dsi_max_xacts_in_group parameter until most of the contention is resolved. Replication Server not only was able to keep up.e. Additionally. Replication Agent was down) or due to long running transactions at the replicate delaying the first transaction until the conflicting transaction was also ready to go (i. • • 262 . This was completely acceptable considering the relative gain in performance. Which brings us back to the point – how can we eliminate the contention at the replicate? The answer is (of course) it all depends on what the source of contention is – is it contention introduced as a result of replication or contention between replication and other users. a serialization method of “none” will achieve the highest throughput. it was able to fully drain the 1GB backlog in less than 30 minutes.50 * 10) could have been applied per minute – or 450 transactions during that time. if the bcp specified a batch size (using –b). By switching to “none”. In one example of the latter.0. Possible areas of contention include: • • Replication to aggregate rollups in which many source transactions are all attempting to update the same aggregate row (i. there is a definite tradeoff in eliminating transaction grouping and the associated increase in log and I/O activity and a limited acceptance of some contention. no errors and 1GB backlog. During that time.1 RowsDeleted. total sales) in the destination database. however. In addition to transaction grouping. in this case. replication itself can induce contention – frequently resulting in the decision to use suboptimal Parallel DSI serialization methods. few have bothered to investigate if and where contention is the cause. However. in the few cases where the administrators have been brave enough to attempt the “none” serialization method. For example. approximately 3 parallel transactions failed per minute and were retried in serial. you then have to look at where contention is at the replicate. the transaction grouping is a form of concurrency that is causing contention. Replication got 1GB behind using “wait_for_commit”. Concurrency Induced Contention In a sense. a (slow) bcp at primary does not have any contention. However. Replication Induced Contention As discussed earlier. the queue got 1GB behind after 1 hour using “wait_for_commit”. during that period. the immediate response is to switch back to wait for commit vs. A possible strategy is to simply halve the dsi_max_xacts_in_group repeatedly until the replication induced contention is nearly eliminated. If you remember.e. eliminating the contention – or even determining if that level of contention is acceptable. 5. but technically they are known as the "monitoring tables". 
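As an aside to the bcp point above, it is the batch size that turns a single serial load at the primary into many potentially parallel transactions at the replicate. For example (server, login and file names are illustrative), each 1,000-row batch below commits as its own transaction at the primary, so the Replication Server can apply the batches across parallel DSI's:

    bcp orders_db..order_history in order_history.dat -Usa -Sprimary_ds -c -b 1000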
Finding Contention Using MDA Monitoring Tables

To determine where intra-parallel DSI contention is originating, you mainly need to look at five of the monitoring tables:

Monitoring Table      Information Recorded
monLocks              Records the current process lock information
monProcess            Records information about currently executing processes
monProcessSQLText     Records the SQL for currently executing processes
monSysStatement       Records previously executed statement statistics
monSysSQLText         Records previously executed SQL statements

These monitoring tables – available in ASE 12.5.0.3 and later via the Monitoring and Diagnostics API (MDA), and often referred to simply as MDA tables, although technically they are known as the "monitoring tables" – are actually proxy tables that interface to the MDA via standard Sybase RPC calls. The relationship between these tables is depicted below:

Figure 85 – MDA-based Monitoring Tables Useful for Identifying Contention

The difficult aspect of using the monitoring tables is remembering which of the tables contain currently executing information and which contain previously executed statements. This is important since, once a SQL statement is done executing, the information about that statement is only available in the monSys* tables rather than the monProcess* tables.
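As a minimal sketch of using the "current" tables above to spot blocking between parallel DSI sessions (it assumes the monitoring proxy tables are installed and enabled; column usage follows the table definitions shown in the figure):

    -- sessions that are currently blocked, the session blocking them,
    -- and the SQL text the blocked session is executing
    select p.SPID, p.BlockingSPID, p.SecondsWaiting, t.SQLText
      from master..monProcess p, master..monProcessSQLText t
     where p.SPID = t.SPID
       and p.KPID = t.KPID
       and p.BlockingSPID > 0
     order by p.SPID, t.LineNumber, t.SequenceInLine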
by the time the DSI Scheduler dispatches the second batch to the second DSI. Key Concept #30: Maximum performance using Parallel DSI’s can only be achieved after replication and concurrency caused contention is eliminated and DSI profiles (based on the transaction profile) are developed to minimize contention between Parallel DSI’s. Consequently. The first time you query these tables.BlockingSPID column you can identify both the blocked and blocking users along with their SQL statements at the time via monProcessSQLText. the next step is to enable statement monitoring. Developing Parallel DSI Profiles Similar to managing named data caches in Adaptive Server Enterprise. If the transactions are fairly fast (i. If so. Executing admin who. Subsequent queries will only return rows that previously have not been returned to your connection. however. Parallel DSI Configuration vs. But if not. you may have to establish DSI profiles to manage replication performance during different periods of activity.1 however. you should start with the monOpenObjectActivity table (specifically the LockWaits column). Consider the following table of example settings: dsi_serialization _method dsi_num_large_ xact_threads dsi_max_xacts_ in_group 5 30 -1 5 dsi_large_xact_ size 1000 100 1000 100 Profile normal daily activity post-daily processing bcp data load bcp of large text data None wait_for_commit None wait_for_commit 10 5 5 3 num_threads 1 2 0 2 Developing a similar profile for your replication environment will enable the Replication Server to avoid potentially inhibitive deadlocks and retries during long transactions such as large bcp and high incidence SQL statements typical of post-daily processing routines. the num_dsi_threads configuration parameter is a “limiter” or maximum number of DSI threads that the Replication Server can use for any connection. Basically. With statement monitoring. the blocked statement will still be executing and in the monProcess* tables. that may be all that is necessary. the proper rows will be returned. Classic examples of this are monSysStatement and monSysSQLText. Another aspect is that some of the monitoring tables concerns the fact that some of the tables are meant to be queried to build historical trend systems – by multiple users simultaneously. it simply sends the next batch of SQL back to thread seq #1 instead of the next thread in sequence. after each batch is sent to a thread. rather than looking at monLocks. it returns all the rows that the pipe contains. if two different users are querying the monSysStatement table at different times. the statement may still be holding locks. Then by using the monProcess. You can also review the past statements from monSysSQLText as well as look for statements with WaitTime in monSysStatements as indicators of where contention might exist. a check of sp_who <maint_user> may as few as two connections or may have show some number of connections but monitoring may show that only a few of them are actually active. 264 . By setting them to higher values – such as20 and 65536.e. the first will be available again – and will be reused.rapidly poll the tables (frequently enough to get an idea of the contention).e. Note that as mentioned earlier. Consequently. The technique is fairly simple . If RS is the only user in the system. the RS checks to see if thread seq #1 is available again.0. 3 and 2048 respectively). of course. 
it may take more time for the larger transaction to commit and the RS may use the full complement of DSI threads configured for. Actual In a sense. will list all of the parallel DSI thread processes within the Replication Server.Final v2. remember to use the –B option to breakup potentially queue filling bulk loads of data. For small and large bcp loads. Time spent by the DSI/S putting free DSI/E threads to sleep. This counter is incremented each time a 'begin' for a grouped transaction is executed. Transaction groups applied successfully to a target database by a DSI thread. the object is to see if the aggregate numbers for these counters is higher than with a single DSI. Time spent by the DSI/S loading SQT cache. A rollback occurred. Grouped transactions failed by a DSI thread. First. Transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_max_xacts_in_group. Transactions contained in transaction groups sent by a DSI thread. Depending on error mapping. In addition. A rollback and retry occurred. Number of rs_dsi_check_thread_lock invocations returning true. Time spent by the DSI/S finding a group to dispatch. This includes transactions that were successfully committed or rolled back according to their final disposition. This counter is incremented each time a Large Parallel Transaction must wait because there are no available parallel DSI threads. Time spent by the DSI/S determining if a transaction is special. Invocations of rs_dsi_check_thread_lock by a DSI thread.Final v2. Overall the counters to watch are listed here (note that for this section we will only be reporting RS 15. The function determined the calling thread holds locks required by other threads. Transactions in groups sent by a DSI thread that rolled back successfully.0. Time spent by the DSI/S dispatching a large transaction group to a DSI/E. This counter is incremented each time a Parallel Transaction must wait because there are no available parallel DSI threads. These DSI/E threads have just completed their transaction. DSITransUngroupedSent DSITranGroupsSucceeded DSITransFailed RollbacksInCmdGroup AllThreadsInUse AllLargeThreadsInUse ExecsCheckThrdLock TrueCheckThrdLock CommitChecksExceeded GroupsClosedTrans DSIFindRGrpTime DSIPrcSpclTime DSIDisptchRegTime DSIDisptchLrgTime DSIPutToSleep DSIPutToSleepTime DSILoadCacheTime DSIThrdRdyMsg 265 .0 counters): Monitor Counter DSITranGroupsSent Description Transaction groups sent to the target by a DSI thread. some transactions may be written into the exceptions log. there are a few counters that relate specifically to Parallel DSI tuning. Number of times transactions exceeded the maximum allowed executions of rs_dsi_check_thread_lock specified by parameter dsi_commit_check_locks_max. ''Thread Ready'' messages received by a DSI/S thread from its assocaited DSI/E threads. A transaction group can contain at most dsi_max_xacts_in_group transactions. Number of DSI/E threads put to sleep by the DSI/S prior to loading SQT cache. let’s start by taking a look at what counters are available that might be of use during tuning parallel DSI’s Parallel DSI Monitor Counters While some of the same counters are used for parallel DSI tuning as with regular DSI tuning. and executing it if it is. This includes time spent finding a large group to dispatch. Time spent by the DSI/S dispatching a regular transaction group to a DSI/E. 
This function checks for locks held by a transaction that may cause a deadlock.1 Tuning Parallel DSI’s with Monitor Counters Tuning parallel DSI’s with monitor counters really boils down to maximizing parallelism while decreasing contention. and this actually may be the case under normal circumstances.in which case some are not being used If the group size is too small. dsi_max_xacts_in_group=2 . However.Final v2. One key counter to look at for this is the DSI counter AllThreadsInUse. if you remember.due to contention discussed later): DML Stmts 333 370 382 DSIThread Cmds Per Sec NgTrans Xact Per Grp 14:56:48 14:56:48 14:56:48 1 2 3 85 93 104 165 183 189 1.1 Monitor Counter DSIThrdCmmtMsgTime DSIThrdSRlbkMsgTime DSIThrdRlbkMsgTime Description Time spent by the DSI/S handling a ''Thread Commit'' message from its associated DSI/E threads.AllThreadsInUse=0. the number of threads you can effectively use will be reduced. With respect to the first comment.but in the following sections we will be taking a closer look to see how they work in parallel DSI environments. The secret is to start with a reasonable number of threads based on the transaction profile and either increase the number of threads and/or adjust the transaction group size to keep all of them busy.0. The key is to look at the value during peak processing. we will only look at the AllThreadsBusy column (highlighted).8 666 739 763 18 20 21 1 0 0 332 370 382 266 Updates Cmds Applied Trans Groups Sample Time Inserts RS Threads 0 7 0 0 0 23 Trans Succeed Cmd Groups Sample Time Checks True Check Locks . then it is unlikely that adding threads will help.9 1. As we can see. Then it becomes a progression of finding the right balance of number of threads. the first step to maximizing parallelism is to use a dsi_serialization_method of “wait_for_start” and disabling large transactions. If we look at the load distribution during this period we would see something similar to the following (dsi_num_threads=7. the DSI-S will re-use a free thread before using a new/idle thread if a thread becomes free. Some of these have been discussed before . If the counter DSI. if this has any value. the group size and the partitioning rules to effectively use the parallel threads. Let’s take a look at a high-end trading stress test: PartnWaits 0 0 0 0 0 0 DML Stmts Sec 9 10 10 AllThreads Busy Trans Fail RollBacks Checks Exceeded 0 0 0 0 0 0 Deletes 0 0 0 Trans In Groups 14:56:11 14:56:48 14:57:29 14:58:10 14:58:57 14:59:38 2 643 1796 1817 3212 2056 2 1241 3447 3508 6391 4066 2 637 1803 1814 3207 2039 0 0 0 0 0 0 0 0 0 0 0 0 0 216 562 596 1907 998 0 0 0 0 0 0 0 0 0 0 0 0 For now.9 1. this was 0 . A couple of key points: • • You can have too many threads . Before the test began. Time spent by the DSI/S handling a ''Thread Rollback'' message from its associated DSI/E threads. Time spent by the DSI/S handling a ''Thread Single Rollback'' message from its associated DSI/E threads. processing started around 14:56 and peaked about 14:58. Maximizing Parallelism Obviously. then looking closer at the load balancing of transaction groups and commands sent on each of the individual threads will give a good idea if adding threads will help. 
Looking at the workload distribution during the peak period, we see that the number of transaction groups/transactions is extremely balanced. This gives us an indication that adding additional threads during this time frame would increase throughput. It is extremely interesting to note that the transaction profile starts as almost exclusively updates and then becomes an even balance of inserts/updates. Let's bump the num_threads to 10 and also increase the dsi_max_xacts_in_group to 3, since the load is so evenly balanced and the AllThreadsBusy is so high. This increase is a bit cautious as we are dealing with updates, which have been experiencing contention.

Sample Time    Cmd Groups    Trans In Groups    Trans Succeed    Trans Fail    AllThreadsBusy
19:30:43       6             7                  6                0             0
19:31:18       306           794                306              0             43
19:32:01       904           2362               900              0             79
19:32:45       2252          6334               2253             9             1248
19:33:32       2154          6306               2128             26            1918
19:34:17       16            16                 16               0             0

(The Check Locks, Checks True and Checks Exceeded columns were 0 throughout.)

We can see that the bulk of the processing was accomplished in ~1.5 minutes (from 19:32 to 19:33:32) and only ~2 minutes overall. This is a bit better than the first run, which took about 3 minutes overall with the processing distributed over all three minutes. Looking at one reason why, we see immediately that at peak we were processing 6,300 original transactions each ~40 second interval, whereas in the first run we were only accomplishing mostly 3,000 transactions with a peak of 6,300. Looking at the load distribution (shown after the configuration sketch below) gives us a better idea why.
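For reference, the thread count and group size used for this second run are connection-scoped Replication Server settings. A minimal sketch of applying them - the NY_RDS.trade_db connection name is purely hypothetical, and the connection must be suspended and resumed for the changes to take effect:

suspend connection to NY_RDS.trade_db
go
alter connection to NY_RDS.trade_db set dsi_num_threads to '10'
go
alter connection to NY_RDS.trade_db set dsi_max_xacts_in_group to '3'
go
alter connection to NY_RDS.trade_db set dsi_serialization_method to 'wait_for_start'
go
alter connection to NY_RDS.trade_db set dsi_num_large_xact_threads to '0'
go
resume connection to NY_RDS.trade_db
go

The last alter connection reflects the earlier suggestion to disable large transaction threads; whether '0' is accepted for dsi_num_large_xact_threads may depend on the RS version, so treat that value as an assumption to verify.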
[Per-thread load distribution for the second run: for each sample interval (19:31:18 through 19:33:32) and each of the 10 DSI threads, the spreadsheet shows Trans Groups, NgTrans, Xact Per Grp, Cmds Applied, Cmds Per Sec, Inserts, Updates, Deletes, DML Stmts, DML Stmts/Sec, RollBacks, DSIYields and TransFail. The group size holds at roughly 2.4-3.0 transactions per group, and the work is again spread evenly across all ten threads, with each thread applying ~2,400-2,600 commands per interval at the peak.]

Again, the load is fairly balanced during peak processing. The question is whether or not this was indeed a better configuration. The easiest way to tell the difference (besides aggregating across sample periods) is that during the peak processing this run is steadily in the high 20's of DML statements per second, while the first run was primarily in the mid-20's. However, the transaction mix is a bit different as well - the number of inserts vs. updates in the latter part is significantly different. The problem was that during the last part of the processing (when a few inserts were occurring), the number of parallel DSI failures was nearly 1 every 2 seconds. For this application, it turns out the optimal mix was 9 threads and a group size of 3. However, the same application also executes transactions (mainly inserts) against another database. Since those transactions were nearly exclusively inserts, that connection's DSI profile was 20 threads and a group size of 20.

Controlling Contention

In the above section we mentioned that the parallel DSI contention amounted to nearly one failure every 2 seconds. The most common way to spot contention is to review the errorlog and look for the familiar message stating that a parallel transaction had failed and is being retried serially. However, if looking back over time using just the monitor counters, you may no longer have access to the historical errorlogs. Additionally, even when it is happening, keeping track of all the error messages to determine the relative frequency can be an inexact science.
This is spotted by looking at one of two possible sets of counters, depending on whether Commit Control or rs_threads is used. If Commit Control is used, the answer is fairly obvious - simply look for TrueCheckThrdLock and CommitChecksExceeded, which are recorded as ChecksTrue and ChecksExceeded in the spreadsheet below. However, in this case we were not using Commit Control. In this case, remembering a bit from our notion of how parallel DSI's communicate (and with some experimentation), we determine that in RS 15.0 the DSI counter DSIThrdRlbkMsgTime (specifically the counter_obs column) will tell us how often the DSI had to roll back transactions due to parallel DSI contention. Repeating the last run's spreadsheet from above:

[Summary spreadsheet repeated from the second run: once the inserts begin (the 19:32:45 and 19:33:32 intervals), the Trans Fail column jumps to 9 and then 26, while the Check Locks and Checks True columns remain 0.]

As we can see, once the inserts start, contention immediately spikes. Possible causes include:

• The inserts may be firing a trigger which can be causing the contention.
• The inserts may be causing conflicting locks (either due to range/infinity locks or similar if isolation level 3 is in effect).
• The updates may have shifted to another table and may be the cause of the contention - possibly even updates to the same rows (such as updates to aggregate values).

Only further analysis using the MDA tables could tell us what tables are involved in the contention. Note that the key activity here is to try to reduce the contention - the suggested order to use is:

1. First, within the DBMS (i.e. change to datarows locking, optimize trigger code, etc.)
2. If this is not possible, decrease the grouping
3. Then try DSI partitioning
4. Finally, reduce the number of parallel DSI's (as a last resort)

DSI Partitioning

In RS 12.6, one of the new features added to help control contention was the concept of DSI partitioning. Currently, the way DSI partitioning works is that the DBA can specify the criteria for partitioning among such elements as time, origin, origin session id (aka spid), user, transaction name, etc. During the grouping process, the DSI scheduler compares each transaction's partition key to the next. If they are the same, they are processed serially - if possible, grouped within the same transaction group. If they are different, the DSI scheduler assumes that there is no application conflict between the two and allows them to be submitted in parallel. If the transaction group needs to be closed due to group size and the next transaction has the same partition key value, then that thread is executed as if the dsi_serialization_method were wait_for_commit (and subsequent threads are also held until it starts). Note that the goal of this feature was specifically aimed at the case in which RS introduces contention by executing transactions on different connections that were originally submitted on the same connection - or that simply had no contention at the primary due to timing. As a result, the recommended starting point for dsi_partitioning_rule is "none".
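If the rule does need to be changed, dsi_partitioning_rule is a connection-level parameter like the others. A minimal sketch (hypothetical connection name again; the connection must be suspended and resumed for the change to take effect):

suspend connection to NY_RDS.trade_db
go
alter connection to NY_RDS.trade_db set dsi_partitioning_rule to 'origin_sessid'
go
resume connection to NY_RDS.trade_db
go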
However, if contention exists and it can't be eliminated and reducing the group size doesn't help, a good starting point is to set the dsi_partitioning_rule to origin_sessid or the compound rule of 'origin_sessid, time'. Once implemented, you will need to carefully monitor the DSI counter PartitioningWaits and in particular the counters for the respective partitioning rule you are using. For example, if using origin_sessid, the counters OSessIDRuleMatchGroup and OSessIDRuleMatchDist will identify how often a transaction was forced to wait (submitted in serial - OSessIDRuleMatchGroup) vs. how often it proceeded in parallel (OSessIDRuleMatchDist). If the parallelism is too low, it might actually be better to reduce the number of parallel threads and try without DSI partitioning. Remember, however, the goal is to reduce the contention. So if by implementing dsi_partitioning_rule = 'origin_sessid' you see AllThreadsBusy drop from 1000 to 500 and PartitionWaits climb to 250, but the failed transactions drop from 1-2 per second to 1 every 10 seconds, this is likely a good thing. The final outcome (as always) is best judged by comparing the aggregate throughput rates for the same transaction mix.

Text/Image Replication

Okay, just exactly how is Replication Server able to replicate non-logged text/image updates??? The fact that Replication Server is able to do this surprises most people. However, if you think about it – the same way that ASE had to provide the capability to insert 2GB of text into a database with a 100MB log – Replication Server had to provide support for it – AND also be able to insert this same 2GB of text into the replicate without logging it, for the same reason. The unfortunate problem is that text/image replication can severely hamper Replication Server performance – degrading throughput by 400% or more in some cases. Unfortunately, other than not replicating text, not a lot can be done to speed this process up.

Text/Image Datatype Support

To understand why not, you need to understand how ASE manages text. This is simply because the current biggest limiter on replicating text is the primary and replicate ASE's themselves. While we are discussing mainly text/image data, remember that this applies to off-row java objects as well, as these are simply implemented as image storage. Throughout this section, any reference to "text" datatypes should be treated as any one of the three Large Object (LOB) types.

Text/Image Storage

From our earliest DBA days, we are taught that text/image data is stored in a series of page chains separate from the main table. This allows an arbitrary length of text to be stored without regard to the data page limitation of 2K (or ~1960 bytes). Each row that has a text value stores a 16-byte value – called the "text pointer" or textptr – that points to where the page chain physically resides on disk. While this is good knowledge, a bit more knowledge is necessary for understanding text replication. Unlike normal data pages with >1900 bytes of storage, each text page can only store 1800 bytes of text. Consequently a 500K chunk of text will require at least 285 pages in a linked page chain for storage. The reason for this is that each text page contains a 64-byte Text Image Page Statistics Area (TIPSA) and a 152-byte Sybase Text Node (st-node) structure located at the bottom of the page.
[Figure 86 – ASE Text Page Storage Format: each text page consists of a 32-byte page header, up to 1800 bytes of text/image data, the head of the st-node (152 bytes) and the TIPSA (64 bytes).]

Typically, a large text block (such as 500K) will be stored in several runs of sequential pages – with the run length depending on concurrent I/O activity to the same segment and available contiguous free space. For example, the 285 pages needed to store 500K of text may be arranged in 30 runs of roughly 10 pages each. Prior to ASE 12.0, updating the end of the text chain – or reading the chain starting at a particular byte offset (as is required in a sense) – meant beginning at the first page and scanning each page of text until the appropriate byte count was reached. As of ASE 12.0, the st-node structure functions similarly to the Unix File System's i-node structure in that it contains a list of the first page in each run and the cumulative byte length of the run. For simplicity's sake, consider the following table for a 64K text chunk spread across 4 runs of sequential pages on disk:

Page Run (page #'s)    st-node page    byte offset
8 (300-307)            300             14400
16 (410-425)           410             43200
8 (430-437)            430             57600
5 (500-504)            500             65536

This allows ASE to rapidly determine which page needs to be read for the required byte offset without having to scan through the chain. Depending on how "fragmented" the text chain is (i.e. how many runs are used) and the size of the text chain itself, the st-node may require more than 152 bytes. Rather than use the 152 bytes on each page and force ASE to read a significant portion of the text chain simply to read the st-node, the first 152 bytes are stored on the first page while the remainder is stored in its own page chain (hence the slight increase in storage requirements for ASE 12.0 for text data vs. 11.9 and prior systems). It goes without saying, then, that Adaptive Server Enterprise 12.0+ should be considerably faster at replicating text/image data than preceding versions. Thanks to the st-node index, the Replication Agent read of the text chain will be faster and the DSI delivery of text will be faster, as neither one will be forced to repeatedly re-read the first pages in the text chain simply to get to the byte offset where it is currently reading/writing text.

The first page in the chain – pointed to by the 16-byte textptr – is called the First Text Page or FTP. It is somewhat unique in that when a text chain is updated, it is never deleted (unless the data row is deleted). This is surprising but true – and setting the text value explicitly to null still leaves this page allocated, simply empty. The textptr is a combination of the page number for the FTP plus a timestamp. The FTP is important to replication because it is on this page that the TIPSA contains a pointer back to the data row it belongs to. So, while the data row contains a textptr to point to the FTP, the FTP contains the Row ID (RID) back to the row. Should the row move (i.e. get a new RID), the FTP TIPSA must be updated. The performance implications of this at the primary server are fairly obvious (consequently, movements of data rows containing text columns should be minimized).
The FTP value and TIPSA pointers can be derived using the following SQL: -- Get the FTP..pretty simple, since it is the first page in the chain and the text pointer in the row -- points to the first page, all we have to do is to retrive the text pointer select [pkey columns], FTP=convert(int,textptr(text_column)) From table Where [conditions] -- Getting the TIPSA and the row from the TIPSA is just a bit harder as straight-forward functions for -- our use are not included in the SQL dialect. Dbcc traceon(3604) Go Dbcc page(dbid, FTP, 2) Go -- look at last 64 bytes, specifically the 6 bytes beginning at offset 1998. The first 4 bytes are -- the page id (depending on platform, the byte order may be reversed) followed by the last 2 bytes -- which are the rowid on the page. For APL tables, you then can do a dbcc page on that page at use -- the row offset table to determine the offset within the page and read the pkey values. As you can see, determining the FTP is fairly easy, while the TIPSA resembles more of an nonclustered lookup operation which the dataserver internally can handle extremely well. Standard DML Operations Text and image data can be directly manipulated using standard SQL DML Insert/Update/Delete commands. As we also were taught, however, this mode of manipulation logs the text values as they are inserted or updated and is extremely slow. The curious might wonder how a 500K text chunk is logged in a transaction log with a fixed log row size. The answer is that the log will contain the log record for the insert and subsequent log records with up to 450 bytes of text data – the final number of log records dependent on the size of the text and the session’s textsize setting (i.e. set textsize 65536). SQL Support for Text/Image In order to speed up text/image updates and retrievals as well as provide the capability to insert text data larger than permissible by the transaction log, Sybase added two other verbs to the Transact SQL dialect – readtext and writetext. Both use the textptr and a byte offset as input parameters to determine where to begin read or writing the text chunk. In addition, the writetext command supports a NOLOG parameter which signals that the text chunk is not to be logged in 274 Final v2.0.1 the transaction log. Large amounts of text simply can be inserted or updated through repetitive calls to writetext specifying the byte offset to be where previous writetext would have terminated. Of special consideration from a replication viewpoint is that the primary key for the row to which the text belongs is never mentioned in the writetext function. The textptr is used to specifically identify which text column value is to be changed instead of the more normal where clause structure with primary key values. Hold this thought until the section on Replication Agent processing below. Programming API Support Anyone familiar with Sybase is also familiar (if only in name) with the Open Client programming interface - which is divided into the simple/legacy DB-Lib (Database Library) API interface and the more advanced CT-Lib (Client Library) interface. Using either, standard SQL queries – including DML operations – can be submitted to the ASE database engine. Of course, this is one way to actually modify the text or image data – but as we have all heard, DML is extremely slow at updating text/image and forces us to log the text as well (which may not be supportable). 
Consequently, both support API calls to read/write text data to ASE very similar to the readtext/writetext functions described above. For example, in CT-Lib, ct_send() is used to issue SQL statements to the dataserver while ct_get_data() and ct_send_data() are used to read/write text respectively. Similar to writetext, ct_send_data supports a parameter specifying whether the text data is to be logged. Note that while we have discussed these functions as if they followed readtext/writetext implementation, in reality, the API functions basically set the stage for the SQL commands instead of the other way around. In any case, similar to write text, the sequence for inserting a text chunk using the CTLIB interface would look similar to: ct_send() –- send the ct_send() –- retrieve ct_send_data() – send ct_send_data() – send ct_send_data() – send … ct_send_data() – send insert statement with dud data for text (init pointer) the row to get the textptr just init’d the first text chunk the next text chunk the next text chunk the last text chunk The number of calls dependent on how large of a temporary buffer the programmer wishes to use to read the text (probably from a file) into memory and pass to the database engine. A somewhat important note is that the smaller the buffer, the more likely the text chain will be fragmented and require multiple series of runs. Of all the methods currently described, the ct_send_data() API interface is the fastest method to insert or update text in a Sybase ASE database. RS Implementation & Internals Now that we now how text is stored and can be manipulated, we can begin applying this knowledge to understand what the issue is with replicating text. sp_setreptable Processing If not the single most common question, the question “Why does sp_setreptable take soooo long when executed against tables containing text or image columns?” certainly ranks in the top ten questions asked to TSE. The answer is truthfully – to fix an oversight that ASE engineering “kinda forgot”. If you remember from our previous discussion, the FTP contains the RID for the data row in its TIPSA. The idea is that simply by knowing what text chain you were altering, you would also know what row it belongs to. This is somewhat important. If a user chose to use writetext or ct_send_data(), a lock should be put on the parent row to avoid data concurrency issues. However, ASE engineering chose instead to control locking via locking the FTP itself. In that way (lazily) they were protected in that updates to the data row also would require a lock on the FTP (and would block if someone was performing a writetext) and concurrent writetexts would block as well. Unfortunately for Replication Server Engineering, this meant that ASE never maintained the TIPSA data row RID if the RID was never initialized – which frequently was the case – especially in databases upgraded from previous releases prior to ASE 12.0. In order to support replication, the TIPSA must be initialized with the RID for each data row. Consequently, sp_setreptable contains an embedded function that scans the table and for each data row that contains a valid textptr, it updates the column’s FTP TIPSA with the RID. Since a single data row may contain more than one text or image column, this may require more than one write operation. To prevent phantom reads and other similar issues, this is done within the scope of a single transaction, effectively locking the entire table until this process completes. 
The code block is easily located in sp_setreptable by the line: if (setrepstatus(@objid, @setrep_flags) != 1) Unfortunately, as you can imagine, this is NOT a quick process. On a system with 500,000 rows of data containing text data (i.e. 500,000 valid text pointers), it took 5 hours to execute sp_setreptable (effectively 100,000 textptrs/hour – usual caveat of your time may vary is applicable). An often used metric is that the time required is the same as that to build a new index (assuming a fairly wide index key so the number of i/o’s are similar). 275 12. while a text chain in the other).@datetimecol='4-15-1988 10:23:23. This is especially true in non-Warm Standby architectures such as shared primary or corporate rollup scenarios in which the page more than likely will be allocated to different purposes (perhaps an OAM page in one.) as well as the textptr. @tran_id=0x000000000000000000000001 begin transaction 'Full LTL Test'distribute @origin_time='Apr 15 1988 10:23:23.@varbinarycol=0x01112233445566778899. Then it simply reads the FTP TIPSA for the RID (itself a combination of a page number and row id) along with table schema information (column names and datatypes as normal) and reads the parent row from the data page. If the text chain was modified with a writetext.@text_col=hastext always_rep. consequently.002PM'. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000004.3. Key Concept #32: The Replication Agent uses the FTP TIPSA RID to locate the parent row and then constructs a replicated function rs_datarow_for_writetext to send with the text data to identify the row at the replicate.56. as those with experience know. the nagging question – “Why on earth is initializing the FTP TIPSA with the RID so critical??” Some may already have guessed.!!?This is the text column value.@identitycol=1. the text functions internal within ASE check to see what the replication status is for the text column any time it is updated.0.1 Key Concept #31: The reason sp_setreptable takes a long time on tables containing text/image columns. @smalldatetimecol='Apr 15 1988 10:23:23. The Replication Agent reads the log record. the Replication Agent tells the Replication Server what the primary keys were by first sending a rs_datarow_for_writetext function with all of the columns and their values.@rsaddresscol=1. there is no such thing as an “unlogged operation” in Sybase. replicated databases have their own independent allocation routines. This executes immediately and supports replication of text data manipulated through standard DML operations (insert/update/delete) as well as new text values created with the writetext and ct_send_data methods and slow bcp operations.@floatcol=3. 276 . object id.Final v2. there is no way to guarantee that because a particular text chain starts at page 23456 at the primary that the identical page will be used at the replicate.@imagecol=hastext rep_if_changed. no other columns in row changed).2. That method is to use the legacy sp_setreplicate procedure which does not support text columns and then call sp_setrepcol as normal to set the appropriate mode (i.@tinyintcol=1. @varcharcol='first insert'.e.56. Replication Agent Processing Now.e.001PM'.1. In addition to logging the space allocations for text data. operations are considered “minimally logged” – which means that while the data itself is not logged. Instead. distribute @origin_time='Apr 15 1988 10:23:23. @binarycol=0xaabbccddeeff.004PM'. 
then it would be impossible for the Replication Server to determine which row the text belonged to at the replicate. If the text column is to be replicated. distribute @origin_time='Apr 15 1988 10:23:23. @tran_id=0x000000000000000000000001 applied 'ltltest'. the space allocations for the data are logged (required for recovery). replicate_if_changed).@smallmoneycol=$0.002PM'. the Replication Agent must break up the text into multiple chunks and send via multiple rs_writetext “append” calls. As a result. the Replication Server MUST be able to determine the primary keys for any text column modified. @bitcol=1 distribute @origin_time='Apr 15 1988 10:23:23. Remember. If a user specifies a non-logged writetext operation and only modifies the text data (i.@smallintcol=1. There is a semi-supported method around this problem provided that pre-existing text values in a database will never be manipulated via writetext or ct_send_data(). this lot falls to the task of the Replication Agent. As you could guess.@realcol=2.rs_insert yielding after @intcol=1.rs_writetext append first last changed with log textlen=30 @text_col=~. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000001.@decimalcol=. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000002. even in Warm Standby. An example of this from a normal logged insert of data is illustrated in the below LTL block (notice the highlighted sections). extracts the textptr and parses the page number for the text chain. In either case – text modified via DML or writetext – similar to transaction logging of text data.@charcol='first insert'. etc. is that it must initialize the First Text Page’s TIPSA structure to contain the parent row’s RID.003PM'. @numericcol=2. in order to send data to the Replication Server. in reality. While we have used the term “NOLOG” previously. ASE inserts a log row in the transaction log containing the normal logging information (transaction id. @moneycol=$1.001PM'. @tran_id=0x000000000000000000000001 applied 'ltltest'. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000003. rs_update yielding before @intcol=1. @identitycol=1. @bitcol=1 after @intcol=1. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000007.2.002PM'. ASE 15. @smallmoneycol=$0.56. do_not_replicate | replicate_if_change | always_replicate] [.0. The text replication is passed through a series of rs_writetext append first.002PM'. Key Concept #33: Similar to the logging of text data.. @varbinarycol=0x01112233445566778899.'ALL' | 'NONE' | 'L1'] [. use_index] -.1 provides the option of creating an index on the text pointer value in the base table.rs_writetext append @imagecol=~/!!7Ufw@4ª"ÌÝîÿðÿ@îO@Ý@y@f distribute @origin_time='Apr 15 1988 10:23:23.Standard table replication marking sp_setreptable <table_name> [. 'use_index'] -. @bitcol=0 A couple of points are illustrated above: • • The base function (insert/update) contains the replication status and also whether or not the column contains data. it can perform an internal query of the table via the text pointer index to find the datarow belonging to the text chain.Final v2. @varbinarycol=0x01112233445566778899.rs_writetext append last @imagecol=~/!!Bîÿðÿ@îO@Ý@y@f9($&8~'ui)*7^Cv18*bh distribute @origin_time='Apr 15 1988 10:23:23. even when not logging the text. @imagecol=notrep rep_if_changed.0. when the Replication Agent is scanning the log and sees a textchain allocation.0. As a result. @datetimecol='Apr 15 1988 10:23:23.1. 
@floatcol=3. In the last example. owner_on | owner_off | null] [. @binarycol=0xaabbccddeeff. @text_col=notrep always_rep. this still is likely to complete in hours vs. Changes in ASE 15. ASE implemented a different method in ASE 15.3.006PM'.rs_writetext append first changed with log textlen=119 @imagecol=~/!"!gx"3DUfw@4ª»ÌÝîÿðÿ@îO@Ý@y@f9($&8~'ui)*7^Cv18*bhP+|p{`"]?>.2.1 implementation).12. the Replication Agent can simply read the text chain (after all. text data is passed to the Replication Server by “chunking” the data and making multiple calls until all the text data has been sent to the Replication Server.007PM'.005PM'.56. @smallintcol=1. @numericcol=2. This implementation has advantages and disadvantages • Advantages o The speed of this index creation obviously depends on the size of the table as well as the settings for ‘number of sort buffers’ and parallel processing.56.Standard text/image column replication marking sp_setrepcol <tab_name> [.1 Because of customer complaints about the impracticality of marking large pre-existing text columns for replication. @moneycol=$1.12. @floatcol=3. @varcharcol='first insert'. @tran_id=0x000000000000000000000001 applied 'ltltest'. @text_col=notrep always_rep. @varcharcol='first insert'. @realcol=2. “notrep” refers to the fact that the text chain is empty. append. o On extremely large tables. @realcol=2. This can be enabled using the following syntax: -. the only difference between these and pre-ASE 15.3. @datetimecol='Apr 15 1988 10:23:23. As you could guess. true | false] [. @numericcol=2. @binarycol=0xaabbccddeeff. …. @rsaddresscol=1. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000005. @identitycol=1. @charcol='updated first insert'.Warm Standby and MSA syntax with DDL replication sp_reptostandby <db_name> [.1 that did not involve updating the TIPSA. @smallmoneycol=$0. column_name] [. @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000006. @rsaddresscol=1. append.1. @smallintcol=1.0.1 @tran_id=0x000000000000000000000001 applied 'ltltest'. Instead.56. @smalldatetimecol='Apr 15 1988 10:23:23. @moneycol=$1. @imagecol=notrep rep_if_changed.0. it already has started to in order to find the RID on the FTP TIPSA). use_index] As you can see.002PM'. @decimalcol=. @charcol='first insert'. @decimalcol=. @tran_id=0x000000000000000000000001 applied 'ltltest'.D *@4ª distribute @origin_time='Apr 15 1988 10:23:23. @tinyintcol=1. @smalldatetimecol='Apr 15 1988 10:23:23.1 systems is the final parameter of ‘use_index’ (or null if using the pre-ASE 15. days o Read only queries can still execute as create index only uses a shared table lock 277 .002PM'. append last functions with each specifying the number of bytes. @tinyintcol=1.0. @tran_id=0x000000000000000000000001 applied 'ltltest'. the DSI thread treats text as a large transaction. o Additional storage space is required to store the text pointer index o Normal DML operations (such as insert.e. as we saw from above. more I/O’s will need to be performed traversing the index to find the data row where as in the TIPSA method. translates this into the appropriate rs_insert and rs_writetext commands. 278 . state in pubs2. In itself. Replicated Text Functions As we discussed earlier. The replication agent. As a result. It then follows this with a call to rs_get_textptr to retrieve the textptr for the text chain allocation just created. the biggest difference is how the DSI handles the text from a replicated function standpoint. Once it receives the textptr. 
the DSI uses the CT-LIB ct_send_data() function to actually perform the text insert. the DSI first sends the rs_insert as normal and then follows it with a call to rs_init_textptr – typically an update statement for the text column setting it to a temporary string constant. the page pointer is located on the first index page.Final v2. You may have wondered previously why the rs_datarow_for_writetext didn’t simply contain only the primary key columns vs. The latter is probably the most important of the two – without all of the columns. The bulk of the special handling for text data within the Replication Server is within the DSI thread. if expecting a large number of text operations and you can take the upfront cost of the TIPSA method.0. At the replicate. the entire row. update. except of course. if a site subscribed to data from the primary based on a searchable column (i. However. In addition to these considerations. text data is handled no differently than any other. the table has precedence and will use the TIPSA method. the primary log will contain the insert and multiple inserttext log records. the text/image marking precedence is column table database. RS & DSI Thread Processing As far as Replication Server. this looks like the below distribute rs_insert rs_insert distribute rs_writetext append first rs_writetext distribute rs_writetext append rs_writetext … distribute rs_writetext append last rs_writetext rs_insert rs_insert rs_init_textptr rs_init_textptr rs_get_textptr rs_get_textptr (textpointer) textpointer) rs_writetext rs_writetext … rs_writetext rs_writetext Figure 87 – Sequence of calls for replicating text modified by normal DML. when a text row is inserted using regular DML statements at the primary. First. However. deletes) may incur extra processing to maintain the index (except updates when the text column is not modified and the text pointer index would be considered a ‘safe’ index). From a timeline perspective. if the database is marked ‘use_index’. the site would probably never receive any text data. we are lacking something fairly crucial – the textptr. this is not necessarily odd as often text write operations result in a considerable number of rows in the replication queues. but a specific table is marked using the TIPSA method..1 • Disadvantages o On really large tables. you may wish to use this instead of the text pointer index. by providing all data. that the DIST thread needs to associate the multitude of rows with the subscription on the DML function (rs_insert) or as designated by the rs_datarow_for_writetext. As a result. and 2) subscriptions on non-primary key searchable columns would be useless.authors). There actually are two reasons: 1) the DBA may have been lazy and not actually identified the primary key (used a unique index instead). the DIST thread can check for searchable columns within the data row to determine the destination for the text values. Consequently. Text Function Strings Consider the pubs2 database. In other words.Final v2. etc. It first sends the rs_datarow_for_writetext to the replicate. this function is empty for the replicate – the rs_get_textptr function is all that will be necessary. 279 . ASE will put a log record in the transaction log. the blurbs table contains biographies for several of the authors in a column named “copy”. the rs_datarow_for_writetext is sent when a writetext operation was executed at the primary.1 For text inserted at the primary using writetext or ct_send_data. 
but you will also need to understand the following: • Text function strings have column scope.rs_datarow_for_writetext. In that database. In the default function string classes. It then is followed by the rs_init_textptr and rs_get_textptr functions as above. rs_update. As noted earlier. For rows inserted with writetext operations. The sequence of calls for replicating text modified by writetext or ct_send_data is illustrated below: distribute rs_datarow_for_writetext rs_ datarow_for_writetext distribute rs_writetext append first rs_writetext distribute rs_writetext append rs_writetext … distribute rs_writetext append last rs_writetext rs_datarow_for_writetext rs_ datarow_for_writetext rs_init_textptr rs_init_textptr rs_get_textptr rs_get_textptr (textpointer) textpointer) rs_writetext rs_writetext … rs_writetext rs_writetext Figure 88 – Sequence of calls for replicating text modified by writetext or ct_send_data(). because the textreq function within the ASE engine is able to determine if the text is to be replicated – even when a non-logged text operation is performed. the normal rs_writetext functions are sent to the Replication Server. Thankfully. in the case of a custom function class. As we discussed before. you will have to create a series of function strings for each text/image column in the table. However. it is also used to provide the column values to the rs_init_textptr and rs_get_textptr function strings so the appropriate row for the text can be identified at the replicate and have the textptr initialized. After that. they might resemble the below: create function string blurbs. rs_init_textptr is crucial to avoid allocating text chains when no text data was present at the primary. The DSI simply does the same thing. The Replication Agent in reading this record. we will be discussing these functions in a little bit more detail. you will not only need to create these four function strings. Earlier. However. If we were to create function strings for this table. the sequence is little different. these are generated for you.0. • In regards to the first bullet. if using the default function classes (rs_sqlserver_function_class or rs_default_function_class). what if you are using your own function class?? If using your own function class. we discussed the fact that it is used to determine the subscription destinations for the text data. The role of rs_datarow_for_writetext is actually two fold. The textstatus modifier available for text/image columns in normal rs_insert.copy for sqlserver2_function_class output language ‘ ‘ Note the name of the column in the function string name definition. retrieves the RID from the TIPSA and then creates an rs_datarow_for_writetext function. the text function strings for each text column is identified by the column name after the function name. If you have 2 text columns. In the following paragraphs. This brings the list of function strings to 4 for handling replicated text. you may want to have this function perform something – for example insert auditing or trace information into an auditing database. you will need two definitions for rs_get_textptr. rs_delete as well as rs_datarow_for_writetext. RS will use ct_send_data() API calls to send the text to the replicate using the same log specification that was used at the primary.1 Typically the next function sent is the rs_init_textptr. it sets the column to a null value. This can be problematic. there is a little known (thankfully) option to sp_changeattribute . remember.dealloc_first_txtpg . 
we need a valid text pointer before we start using writetext operations. The CT-Lib API essentially has it built-in as it is only with the subsequent ct_get_data() or ct_send_data() calls that the actual text is manipulated.x). @last_chunk = ?rs_last_text_chunk!sys?. we simply use a normal update command to insert some temporary text into the column knowing that the real text will begin at an offset of 0 and therefore will write over top of it. Anytime you get an invalid textpointer error or zero rows error for the textpointer.rs_writetext.copy for gw_function_class output rpc 'execute update_blurbs_copy @copy_chunk = ?copy!new?. check the queue for the proper text functions as well as check the RSSD for fully defined function string class. While this is the simplest form of the rs_writetext functions. The first is the more normal writetext equivalent as in: create function string blurbs. If this should happen. it seemed that ~19 bytes of text was the guidelines for System 10. it is a good idea to check the RS commands being sent (using trace “on”.copy for sqlserver2_function_class output language 'update blurbs set copy = “Temporary text to be replaced” where au_id = ?au_id!new?' This. @writetext_log = ?rs_writetext_log!sys?' This also could be used to replicate text from a source database to a target in which the text has been split into multiple varchar chunks. Note that in the examples in the book. Since Replication Server uses CT-Lib API calls to manipulate text.”DSI_BUF_DUMP”) and validating the text row should exist and that the table attribute for dealloc_first_txtpg is not set. Finally. the next function Replication Server sends is the rs_get_textptr function. Consequently.Final v2. at first appears to be a little strange.which asynchronously deallocates text pages with null values. the text itself is sent using multiple calls to rs_writetext. Of special note. you may want to use an update where textcol=”<some arbitrary string>”. Note that in this case. As a result.copy for rs_sqlserver2_function_class output writetext use primary log In this example. text replication may fail as the text pointer may get deallocated before the RS retrieves it .rs_textptr_init. it was no guarantee that setting the text column to null would do so (in fact.”DSI”.rs_get_textptr. After initializing the textptr. The textptr() function solely exists so that those using the SQL writetext and readtext commands can pass it a valid textptr. in earlier versions of SQL Server. two system variables are used to flag whether this is the last text chunk and 280 . Those familiar with CT-Lib programming know that when a normal select statement without the textptr function is used. The rs_writetext function can perform the text insert in three different ways. but it also could point to database inconsistencies where the parent row is actually missing. The error could be transient. @au_id = ?au_id!new?. which can be used to replicate text through an Open Server: create function string blurbs. it is probably the most often used as it allows straightforward text/image replication between two systems that provide ct_send_data() for text manipulation (and therefore one of the biggest problems in replicating through gateways).rs_writetext. create function string blurbs. it is often the lack of a valid textptr – or more than one – that frequently will cause a Replication Server DSI thread to suspend. An alternative is the RPC mechanism. which might look like the below: create function string blurbs. In addition. 
This is deliberate. rather than initializing the textpointer using and update textcol=null. Although setting a text column to null is supposed to allocate a text chain. to ensure that the text chain is indeed allocated when needed. the textptr() function is then unnecessary. it is the pointer itself that is bound using ct_bind() and ct_fetch() calls. But since we haven’t sent any text yet….or may get deallocated between the time RS allocates it and the first text characters are sent to the ASE.0.copy for sqlserver2_function_class output language 'select copy from blurbs where au_id = ?au_id!new?' Those who have worked with SQL text functions may be surprised at the lack of a textptr() function call in the output mask as in “select textptr(copy) from …”.kind of a catch-22 situation. Consequently. However. rs_update for my_custom_function_class with overwrite output language ‘ if ?copy!text_status? < 2 -. @au_id = ?au_id!new?. that text is replicated in chunks. these status values allow you to customize behavior at the replicate – for example. create function string blurbs. For example. In the previous paragraph. @last_chunk = ?rs_last_text_chunk!sys?.1 whether it was logged at the primary. The main reason for this. While it is true that the text does get replicated. remember. text does not support the notion of a before and after image. if left at “always_replicate”. the Open Server could simply close the file – or if a dataserver. 72 characters) and to handle white space properly. The new text modifier instead refers to the current chunk of text contents without referring to whether it is the old or new values. the answer is sort of. if using custom function strings. we were referring to the old and new as it applies to the before and after images captured from the transaction log.rs_writetext.Final v2. is that while the text rows may be logged. Before you jump and say “wait a minute. @writetext_log = ?rs_writetext_log!sys?' For non-RPC/stored procedure mechanisms. didn’t you just say…”. While other columns support the usual old and new modifiers for function strings as in ?au_lname!new?. Text pointer is initialized. The former could be used if the target is buffering the data to ensure uniform record lengths (i. The values for text_status are: Hex 0x0000 0x0002 0x0004 0x0008 0x0010 Dec 0 2 4 8 16 Meaning Text field contains NULL value. Consider the following: create function string blurbs_rd. the columns text status. is the role of the text variable modifiers. Additionally. the “new” variable modifier was used to designate the text chunk string vs. consequently a single cohesive after image is not available. An example of this can be found near the end of the previous section when discussing the RPC mechanism for replicating text to Open Servers (which could then write it to a file).0. avoiding initialing a text chain when no text exists at the primary. Real text data will follow.do nothing since no text was modified 281 . the before image is not logged when text is updated. No text data will follow because the text data is not replicated. these modifiers are not necessary. the functionality would be extremely limited as the support for text datatypes is extremely reduced. The text data is not replicated but it contains NULL values. the after image isn’t available from the log either. When the last chunk is received. create function string blurbs. and the text pointer has not been initialized. The final method for rs_writetext is in fact to prevent replication via no output. 
In this scenario. The whole purpose of “new” in this sense was to provide an interface into the text chunks as they are provided through the successive rs_writetext commands. the “new” chunks are really the “old” values which are still the same.copy for rs_sqlserver2_function_class output none Which disables text replication no matter what the setting of sp_setrepcol. Even if it were. if a primary transaction updates a column in the table other than the text column and minimal column replication is not on. then the text column will be replicated. Note that the Replication Server handles splitting the chunks of text into 255 byte or less chunks avoiding datatype issues. In that example (repeated below). text columns also support the text_status variable modifier. which specifies whether the text column actually contains text or not. However. During normal text replication.e. it could update the master record with the number of varchar chunks. if the primary application opts not to log the text being updated.copy for gw_function_class output rpc 'execute update_blurbs_copy @copy_chunk = ?copy!new?. unlike normal updates to tables. Text Function Modifiers The second aspect of text replication that takes some thought. so that in a sense an “after image” does exist. However text columns do support two modifiers: new and text_status.rs_writetext. it is almost guaranteed that it will have to be re-read from disk. it begins to forward the text chunks to the Replication Server. but also during the transaction sorting. While this is the fastest way to handle text. if the original function was a writetext or ct_send_data. The main impact within the Replication Server however. As mentioned earlier. the Replication Agent needs to read the entire text chain. text replication is required. Replication Server Processing Within the Replication Server itself. then the Replication Agent has to read it from disk – and more than likely physical reads. As it reads the text chain. it might be better to simply place tables containing text or image data in a separate database and replicate both. The reason for this degradation is somewhat apparent from the above discussions.?au_id!new?) else if ?copy!text_status? = 8 -. replicating text can have performance implications. First. ‘ The above function string – or one similar – could be used as part of an auditing system that would only allocate a text chain when necessary – and also signal when the primary text chain may have been eliminated via being set to null. key_val. While the primary transaction may have only updated several bytes by specifying a single offset in the writetext function. Replicate the text and endure the performance degradation. the throughput for text replication is much. Consider the following points: • • • Text transactions can’t be batched The DSI has to get the textptr before the rest of the text can be processed. However. Use custom function strings to construct a list of changed rows and then asynchronously to replication. text_col) values (?rs_origin_xactn_id!sys?. it first has to read the row’s RID from the FTP TIPSA. not only will the stable queue I/O be higher due to the large number of rs_writetext records required. Performance Implications As mentioned earlier.5GB/hr was sustainable for non-text data. is at the DSI thread. then performance could be severely degraded. only 600MB/hr was sustainable for text data (or 4x worse). Net Impact Replicating text will always be considerably slower than regular data. Consequently. 
In highly concurrent or high volume environments. it will more than likely fill the SQT cache – and also be the most likely victim of a cache flush meaning it will have to be read from disk. Consider the fact that in ASE versions prior to ASE 12. processing a single rs_writetext is slower than an rs_insert or other similar normal DML function. “(text was deleted or set to null at the primary)”). 2.1 else if ?copy!text_status? = 2 or ?copy!text_status? = 4 insert into text_change_tracking (xactn_id. all other Rep Agent activity in the transaction log is effectively paused. However. it is not fast. read the row from the base table and construct the rs_datarow_for_writetext function as well.0. While reading the text chain. If not that much text is crucial to the application. 282 . Consequently. This requires more network interaction than most other types of commands. application developers have really only three choices: 1. this could result in the Replication Agent getting significantly behind. if a lot of text is expected. At this juncture. for high availability architectures involving a Warm Standby.0. the text is irrelevant and therefore simply can be excluded from replication. For most workflow automation systems. the database engine would have to scan the text pages to find the byte offset. Then as it begins to scan the text chain. In fact. key_val) values (?rs_origin_xactn_id!sys?. Replication Agent Processing It goes without saying that if the text or image data isn’t logged. during a customer benchmark in which greater than 2. then replicating text may not have that profound of an impact on the rest of the system.text is not replicated else if ?copy!text_status? = 16 insert into text_change_tracking (xactn_id. 3.Final v2. much lower than for non-text data. Each rs_writetext function is sent via successive calls to ct_send_data(). have an extraction engine move the text/image data Don’t replicate text/image at all Which one is best is determined by the business requirements. ?au_id!new?. However. As lower levels submit their budget requests. whatever is forwarded to the main business system. Parallel DSI’s. bill payment. A variation of this is a sort of change nomination process in which the change nomination is made to the headquarters and due to automated rules. While it is always possible to have the first system simply execute a stored procedure that is empty of code as a crude form of messaging. to make this site work for us and to reduce the number of customer service calls handled by operators. we will be taking a close look at Asynchronous Request Functions and the performance implications of using them. However. Sometimes.??? In some systems. the way most commercial bank web sites work. it might not be the obvious answer as it can get overlooked. you can’t – you have to provide a direct interface to the main business systems. however. with Replication Server. to protect our main business systems from the ever-present hackers and to ensure adequate performance for internal processes. Sounds pretty normal right??? The problem with this is. we discuss several scenarios of real-life situations in which asynchronous request functions make sense. etc.0.Final v2. Purpose During normal replication. As part of our customer service (and to stay competitive). they don’t understand the impact that they could have on replication performance. However. we would like the customer to be able to change their basic account information (name. Since we are discussing it. 
in which the request for a name change. very good idea that is rarely implemented). or asynchronous request functions. you simply implement each of the customer’s actions as “request functions”. mailing address) as well as perform some basic operations (online bill pay. In addition. One example in which this applies is a budget programming system. There are many real-life scenarios in which a business unit needs to submit a request to another system and have the results replicated back. You could easily picture this as being something similar to: Web Database Business Systems Account Transactions App Server Account Requests Figure 89 – Typical Web/Internal Systems Architecture In fact. it is impossible for a replicated data item to be re-replicated back to the sender or sent on to other sites (without the old LTM “–A” mode or the current send_maint_xacts_to_replicate configuration for Replication Agent). but that could lead to the “endless loop” replication problem. some form of corporate controlled data exists which can only be updated at the corporate site.1 Asynchronous Request Functions Just exactly why were Asynchronous Request Functions invented for anyway??? It is an even toss up as to which replication topic is least understood – text replication. in some cases this might be necessary. In this section. the obvious solution is asynchronous request functions. the change is made. we have separated the web-supported database from the database used by internal applications (a very. Corporate Change Request In many large systems. It is also possible to configure the replication agent to not filter out the maintenance user. the corporate budget is reduced and the budgeted items replicated back to 283 . this architecture is extremely viable and reduce the risk to mission critical systems by isolating the main business systems from the load and security risks of web users. how do you handle the name changes. The reason is simple – the procedure would be executed at the target by the maintenance user – whose transactions are filtered out. processed and then the results replicated back. we have created a web site for our customers to view online billing/account statements or whatever. Even for those who understand what they do. the problem with this is that the results are not replicated back to the sender. Key Concept #34: Asynchronous Request Functions were intended for a replicate system to be able to asynchronously request the primary perform some changes and then re-replicate those changes back to the replicate Web – Internal Requests Let’s assume we are working for a large commercial institution such as a bank or a telephone utility company. transfer funds). In the next couple of sections. picture a bad phone bill (or something similar) in which you both call to change the address. you were probably routed to different call centers with (gasp) different data centers. The fledgling Sybase DBA answer is this can’t be done. rules such as whether or not the amount exceeds certain dollar thresholds based on the type of procurement etc.1 subscribing sites. 284 . However. the request function is replicated on to other more senior organizations due to approval authority rules.e. Once the replicated record is received back from headquarters. Consider the fact that you and your spouse are both at work…only you happen to be traveling out of the area. it simply overwrites the existing record. Not. keep in mind.Final v2. In addition. 
that the goal is to have all of the databases consist – which of the two sets of data is the most accurate portrayal of the customer information is somewhat irrelevant. work phone number). Now. Having that in mind. More than likely. Corporate Forwarded Budget Requests Total Expenditures Approved Requests Budgeted Amounts Regional Budget Requests & Expenditures Field Figure 90 – Typical Corporate Change Nomination/Request Architecture Update Anywhere Whoa!!! This isn’t supposed to be able to be done with Sybase Replication Server. This scenario is a bit different than most as the local database would not be strictly executing a request function. For years we have been taught the sanctity of data ownership and woe to the fool who dared to violate those sacred rules as they would be forever cursed with inconsistent databases. could be in place. a request from a field office for a substantial funding item may have to forwarded through intermediates – in affect. due to the hierarchical nature of most companies. a “local change” would be enacted – i. At the headquarters system. look at the following architecture. account names or something – but provide slightly different information (i.0.e. The problem is that by being in two different locations and using the same toll-free number. a record saved in the database with a “proposed” status. 285 . 4. Implementation & Internals Now that we have established some of the reasons why a business might want to do Asynchronous Request Functions. another reason administrators don’t implement request functions is the lack of understanding who to set it up. Since commit order is guaranteed via Replication Server. then it will get response A and request #1 will get response B. we will explore this and how the information gets to the replication server. a good idea might be to review the steps necessary to create an asynchronous request function. etc. 3. has a Rep Agent. set a status column to “pending”). the steps are: 1. it is the commit sequence of the request functions at the “arbitrator”. This could be an “empty” procedure – or could have logic to perform “local” changes (i. the next thing to consider is how they are implemented. the databases will all have the same answer. Implementing Asynchronous Request Functions In general.e. make sure source database is established as a primary database for replication (i. Make sure the login names and passwords are in synch between the servers for users who have permission to execute the procedure locally (including those who can perform DML operations if proc is embedded in a trigger). specifying the primary database as the target (or recipient) desired and not the source database actually containing the procedure. The reason? We are exploiting the commit sequence assurance of Replication Server.Final v2.e. then every site will have the response (A) from request 2 applied ahead of the response (B) from request 1. In this case. Replicate Database & Rep Agent Perhaps before discussing what happens internally.) Create the procedure to function as the asynchronous request function.1 San Francisco Chicago New York Arbitrator Los Angeles Dallas Washington DC Request #1 Request #2 Response “A” Response “B” Figure 91 – Update Anywhere Request Function Architecture No matter what order request 1 or 2 occur in. If not already established. Frequently.0. 5. If request #2 commits first. 2. Mark the procedure for replication in the normal fashion (sp_setrepproc) Create a replication definition for the procedure. In this section. 
In order to send a request function to another system. 286 . Any one of the systems could send a request function to any of the others. One way to think of it is that an asynchronous request function replication definition functions as both a replication definition and subscription.funding ny_req_my_proc_name At PRS Procedure exists here create function replication definition ny_my_proc_name with primary at HQ. the typical process of replicating a procedure from a primary to a replicate involves creating a replication definition and subscription as normal similar to: HQ. the username and encrypted password are packaged into the LTL for the Replication Server. the “with primary at” clause specifies the recipient (HQ in this case) and not the source (NY) and that the replication definition was created at the primary PRS for the recipient. Note that in the above example. While described in general terms in the LTL table located in the Replication Agent section much earlier. it is within the Replication Server that all of the magic for request functions happens. A couple of points that many might not consider in implementing request functions: • A single replicated database can submit request functions to any number of other replicated databases. the full LTL syntax for “begin transaction” is: distribute begin transaction ‘tran name’ for ‘username’/ . the picture changes slightly to: HQ. • • Replication Agent Processing Essentially. Replication Agent processing for request functions is identical to the processing for an applied function. Replication Server Processing Since the source database processing is identical to applied functions. Ensure that the common logins have permission to execute the procedure at the recipient database. NY is sending the request function (dashed line) to HQ and the return is replicated via the solid line.funding my_proc_name At PRS Procedure exists here create function replication definition my_proc_name with primary at HQ. a route must exist between the two replicated systems.funding deliver as ‘hq_my_proc_name’ (…param list…) searchable parameters (…param list…) hq_my_proc_name Procedure exists here create subscription my_proc_name_sub for my_proc_name with replicate at NY. when a request function procedure is executed. an implicit transaction is begun.funding At RRS Figure 92 – Applied (Normal) Procedure Replication Definition Process This illustrates a normal replicated procedure from HQ to NY. Think of a shared primary configuration of 3 or more systems. a single request function can only be sent to a single recipient site. A bit of explanation might be in order for the last three. This restriction is due to the fact a single procedure needs to have a unique replication definition and that definition can only specify a single “with primary at” clause.1 6.encrypted_password Consequently.0.Final v2. The reason for this is as you probably guessed – the fact that the Replication Server executes the request function at the destination as the user who executed it at the primary (more on this in the next section).funding NY. As with any stored procedure execution.funding deliver as ‘ny_req_my_proc_name’ (…param list…) searchable parameters (…param list…) At RRS (no subscription) ny_my_proc_name Procedure exists here Figure 93 – Asynchronous Request Function Replication Definition Process In this illustration. For request functions. As a result. Regarding step #4.funding NY. there is nothing unique about Replication Agent processing for request functions. 
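To make the preceding setup steps concrete, the following is a minimal sketch based on the NY-to-HQ address-change example used above. The procedure names, the table names (request_status, customer), the parameter list and the web_users group are illustrative only; the RCL syntax follows the form shown in the replication definition examples earlier in this section.

-- At NY.funding (the originating site): the request function itself.
-- It makes no real change locally other than flagging the request as pending.
create procedure ny_upd_address
    @cust_id   int,
    @new_addr  varchar(60)
as
begin
    -- optional "local change" so the web application can show a pending status
    update request_status
       set status = 'pending'
     where cust_id = @cust_id
    return (0)
end
go

-- Mark the procedure for replication as a function at the originating site
sp_setrepproc ny_upd_address, 'function'
go

-- At the PRS for the recipient (HQ): note that "with primary at" names the
-- recipient, not the database where ny_upd_address actually lives.
create function replication definition ny_upd_address
with primary at HQ.funding
deliver as 'hq_upd_address'
(@cust_id int, @new_addr varchar(60))
searchable parameters (@cust_id)
go

-- At HQ.funding (the recipient): the procedure that actually applies the
-- change.  Because it is executed by the originating user rather than the
-- maintenance user, its DML is picked up by the Rep Agent and replicated
-- back out as an ordinary applied function.
create procedure hq_upd_address
    @cust_id   int,
    @new_addr  varchar(60)
as
begin
    update customer
       set mail_addr = @new_addr
     where cust_id = @cust_id
    return (0)
end
go

-- The login that executes ny_upd_address at NY must exist at HQ with the same
-- password and must be able to execute the delivered procedure.
grant execute on hq_upd_address to web_users
go

Once these pieces are in place, everything else is handled by the Replication Server.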
Inside the Replication Server, this happens in two specific areas – the inbound processing and the DSI processing.

Inbound Processing

As discussed earlier, the Replication Agent filters log records based on the maintenance user name returned from the LTL "get maintenance user" command, so there is nothing special about how a request function arrives. Within the inbound processing of the Replication Server, not much happens as far as row evaluation until the DIST thread. Following the SQM, the DIST normally matches replicated rows with replication definitions, normalizes the columns and checks for subscriptions. For stored procedure replication definitions, this process also involves determining whether the procedure is an applied or a request function. Remember: the name of a replication definition for a procedure is the same as the procedure name, and due to the unique naming constraint for replication definitions, there will only be one replication definition with that name. Consequently, determining whether the procedure is a request function is easily achieved simply by checking whether the primary database named in the replication definition is the same as the current source connection (i.e. the connection the SQM belongs to). If it is not, the procedure is a request function. In that case the DIST/SRE finds no subscription and simply reads the "with primary at" clause to determine the "primary" database intended to receive the request function, then writes the request function to the outbound queue, marking it as a request function.

DSI Processing

Within the outbound queue processing, the only difference is in the DSI processing. When a request function is processed by a DSI, the following occurs:

• The DSI-S stops batching commands and submits all commands up to the request function.
• The DSI-E disconnects from the replicate dataserver and reconnects using the username and password carried in the request function transaction record.
• The DSI-E executes the request function.
• The DSI-E disconnects from the replicate and reconnects as either the maintenance user or a different user. The latter occurs when back-to-back request functions executed by different users at the primary arrive in sequence.

If more than one request function has been executed in a row by the same user, they are still executed individually. Once the request function(s) have been delivered, the DSI resumes "normal" processing of transactions as the maintenance user until the next request function is encountered.

Recipient Database Processing

The second difference in request function processing takes place at the recipient database. Because the request function is not executed by the maintenance user, any modification performed by its execution is eligible for replication back out of the recipient database. If the procedure listed in the "deliver as" clause of the request function replication definition is itself marked for replication, then the procedure invoked by the request function will be replicated as an applied function. If not, then any individual DML statements on tables marked for replication and/or sub-procedures marked for replication will be replicated as normal. A couple of points for consideration:

• The destination of the modifications replicated back out of the recipient is not limited to the site that originally made the request function call – normal subscription resolution specifies which sites receive the changes.
• The "deliver as" procedure itself (or a sub-procedure) could be a request function, in which case the request is "forwarded up the chain" while the original request function serves as "notification" to the immediate supervisory site that the subordinate is making a request.

Key Concept #35: An Asynchronous Request Function will be executed at the recipient by the same user/password combination as the procedure was executed by at the originating site.

Performance Implications

By now, you have begun to realize some of the power – and possibilities – of request functions. However, they do have a downside – they degrade replication performance.
Consider the following:

• Replication command batching/transaction grouping is effectively terminated when a request function is encountered (largely due to the reconnection issue).
• Replication Server must first disconnect/reconnect as the request function user, establish the database context, execute the procedure, and then disconnect/reconnect as the maintenance user.

Normally the latter is not much of an issue, but some customers have attempted to use request functions as a means of implementing "replication on demand", in which a replicate periodically executes a request function that flips a "replicate_now" bit (or something similar) at the primary. If the number of rows affected is very large, the procedure's execution could be significantly longer than expected. Remember that in the typical implementation the request functions at the originator are often empty, while at the recipient there is a real sequence of code; transactions that follow the request function appear to execute immediately at the originator, but at the recipient they will be delayed until the request function completes execution. Ignoring the procedure execution times, the two disconnect/reconnects alone could consume a considerable portion of time when a large number of request functions are involved.

In summary, request functions will impede replication performance by "interrupting" the efficient delivery of transactions. Consequently, the degree to which performance is degraded will depend on the number and frequency of the request functions. This should not deter Replication System Administrators from using request functions, as they provide a very neat solution to common business problems.

Multiple DSI's

Multiple DSI or Parallel DSI – which is which, or are they the same??? The answer to this question takes a bit of history.
Prior to version 11. 289 . Parallel DSI’s were not available in Replication Server. Government monitored test demonstrated a single Replication Server (version 10. For example. Key Concept #36: If using the Multiple DSI approach. Throughout the rest of this section. however. Commit Consistent – Transactions applied in any order will always yield the same results. a U. Parallel DSI’s guarantee that the transactions at the replicate will be applied in the same order. Concepts & Terminology Okay. etc. Accordingly. This does not mean the two methods are similar as there is one very key difference between the two. if you experience data loss or inconsistency then Sybase Technical Support will not be able to assist in troubleshooting. Serialized Transactions – Transactions that must be applied in the same order to guarantee the same database result and business integrity.000. Apply these in the opposite order may not yield the same database result as the withdrawal will probably be rejected due to a lack of sufficient funds.000 replicated procedures for a total of 120. if you’ve read this far.000.5) replicating 4. in effect implementing more than one DSI thread. exploit this to achieve higher throughput. despite number of threads or serialization method chosen. Transaction commit order is still guaranteed.000 a hour with RS 11. then the above warning didn’t deter you. That’s a total of 12. However. a bit of terminology and concepts need to be discussed to ensure we each understand what is trying to be stated. If you experience product bugs such as stack traces.x is could be believable. If you think this is unrealistic. It cannot be understated – Multiple DSI’s can be a lot of work – you have to do the thinking the Replication Server Engineering has done for you with Parallel DSI’s. in late 1995.000. Performance Benefits Needless to say. then Sybase Technical Support will be able to assist. the following definitions are used in association with the following terms: Parallel DSI – Internal implementation present in the Replication Server product that uses more than one DSI thread to apply replicated transactions.000 write operations total) against SQL Server 10. Before discussing Multiple DSI’s. Such exuberance however needs to be tempered with the cold reality that in order to achieve this performance. many customers were already hitting the limit of Replication Server capabilities due to the single DSI thread. Multiple DSI’s can achieve several orders of magnitude higher throughput than Parallel DSI’s.000 transactions per hour. For example transactions at different Point-Of-Sale (POS) checkout counters or transactions originating from different field locations viewed from the corporate rollup perspective. WARNING: Because the safeguards ensuring commit order are deliberately bypassed.1 Multiple DSI’s Multiple DSI or Parallel DSI – which is which or are they the same??? The answer to this question takes a bit of history.000. Multiple DSI’s do not – in fact. a deposit followed by a withdrawal. a number of design changes had to be made to facilitate the parallelism and extensive application testing to ensure commit consistency had to be done.0. several different methods of implementing multiple DSI’s to the same connection were developed and implemented so widely that it was even taught in Sybase’s “Advanced Application Design Using Replication Server” (MGT-700) course by late 1995 and early 1996. we simple need to alias it five additional times. As a consequence. 
remember the 4 hour procedure execution example) or just simply waiting for their “turn” to commit. The reason for this is the ease and speed of setup and the least impact on existing function definitions (i. you need to look at each of the bottlenecks that exist in Parallel DSI’s and see how Multiple DSI’s overcome them. the only way to accomplish this is to fake out the Replication Server and make it think it is actually connecting to multiple databases instead of one.e. 3. Due to the unique index on the rs_databases table in the RSSD. this is the source of the biggest performance boost as transactions in the outbound queue are not delayed due to long running transactions (i. all we need to do is “alias” the real dataserver in the Replication Server’s interfaces file.1 In order to best understand the performance benefits of Multiple DSI’s over Parallel DSI’s. Implementing multiple physical connections Ensuring recoverability and preventing loss Defining and implementing parallelism controls Implementing multiple physical connections The multiple DSI approach uses independent DSI connections for delivery. activity halts – even if the transactions that follow it have no dependence on the transaction that failed (i. Cross-Domain Replication – Parallel DSI’s are limited to replicating to destinations within the same Replication domain as the primary. the method discussed in this section will focus on that of using multiple maintenance users. Multiple DSI’s prevent large backlogs in the outbound queue reducing recovery time from transaction failures. Given that the first one counts as one. For example. Multiple DSI’s have no such restriction and in fact. CORP_FINANCES master tli query tli CORP_FINANCES_A master tli query tli CORP_FINANCES_B master tli query tli CORP_FINANCES_C master tli query tli CORP_FINANCES_D master tli query tli CORP_FINANCES_E master tli query tli /dev/tcp /dev/tcp \x000224b782f650950000000000000000 \x000224b782f650950000000000000000 /dev/tcp /dev/tcp \x000224b782f650950000000000000000 \x000224b782f650950000000000000000 /dev/tcp /dev/tcp \x000224b782f650950000000000000000 \x000224b782f650950000000000000000 /dev/tcp /dev/tcp \x000224b782f650950000000000000000 \x000224b782f650950000000000000000 /dev/tcp /dev/tcp \x000224b782f650950000000000000000 \x000224b782f650950000000000000000 /dev/tcp /dev/tcp \x000224b782f650950000000000000000 \x000224b782f650950000000000000000 290 . Fortunately. extend easily to support large-scale cross-domain replication architectures (different topic outside scope of this paper). we decide we need a total of 6 Multiple DSI connections. Independent of Failures – If a transaction fails with Parallel DSI. corporate rollups). Implementing Multiple DSI’s is a sequence of steps: 1. 2. including altering the system function strings.Final v2. While the details will be discussed in greater detail later. the performance benefits from Multiple DSI’s stem from the following: No Commit Order Enforcement – by itself. this is easy to do. lets assume we have a interfaces file similar to the following (Solaris): CORP_FINANCES master tli query tli /dev/tcp /dev/tcp \x000224b782f650950000000000000000 \x000224b782f650950000000000000000 Based on our initial design specifications. Not Limited to a Single Replication Server – Multiple DSI’s lends itself extremely well to involving multiple Replication Servers in the process – achieving an MP configuration currently not available within the product itself.0. you don’t end up creating a new function class).e.e. 
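For readability, here is how one of the aliased entries from the flattened listing above lays out; this is simply a re-rendering of the same content, and the hexadecimal address string is the placeholder value from the example, identical for every alias:

CORP_FINANCES_A
	master tli /dev/tcp \x000224b782f650950000000000000000
	query tli /dev/tcp \x000224b782f650950000000000000000

All six names (CORP_FINANCES and CORP_FINANCES_A through CORP_FINANCES_E) resolve to the same host and port; only the server name differs, which is what lets Replication Server treat one physical database as several.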
Since Replication Server doesn't check the name of the server it actually connects to, the Multiple DSI's can then be created simply by creating normal replication connections to CORP_FINANCES, CORP_FINANCES_A, CORP_FINANCES_B, etc. However, before we do this, there is some additional work we will need to do to ensure recoverability (discussed in the next section).

As mentioned, Replication Server now thinks it is replicating to n different replicate databases instead of one. Because of this, it creates a separate outbound queue and DSI thread to process each connection. The difference between this and Parallel DSI's is illustrated in the following diagrams.

Figure 94 – Normal Parallel DSI with Single Outbound Queue & DSI threads

Figure 95 – Multiple DSI with Independent Outbound Queues & DSI threads

In the above drawings, only a single Replication Server was demonstrated. However, with Multiple DSI's each of the connections could come from a different Replication Server. To get a clearer picture of what this accomplishes, consider the following – the first being the more normal multiple Replication Server implementation using routing to a single replicate Replication Server, while the second demonstrates Multiple DSI's – one from each Replication Server.

Figure 96 – Multiple Replication Server Implementation without Multiple DSI's

While the RRS could use Parallel DSI's, this has an inherent fault: if any one of the transactions from any of the source sites fails, all of the sites stop replicating until the transaction is fixed and the DSI is resumed. In addition, only a single RSI thread is available between the two Replication Servers involved in the routing. While this is normally sufficient, if a large number of large transactions or text replication is involved, it may also become a bottleneck, and long transactions or other issues in any one stream degrade performance for all of them. Now consider a possible Multiple DSI implementation:

Figure 97 – Multiple Replication Server Implementation Using Multiple DSI's

In this case, each RS could still use Parallel DSI's to overcome performance issues within its own connection, and in addition, since the connections are independent, a failure of one does not cause the others to backlog.
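Whichever topology is chosen, each alias is then added with the ordinary create connection command, giving every DSI its own maintenance user. The following is a minimal sketch; the database name finance_db and the maintenance user names and passwords are placeholders, and only the connection that will also act as a primary would be created with log transfer on, as covered in the detailed instructions later in this section.

create connection to CORP_FINANCES_A.finance_db
    set error class to rs_sqlserver_error_class
    set function string class to rs_sqlserver_function_class
    set username to maint_user_a
    set password to maint_user_a_pwd
go

create connection to CORP_FINANCES_B.finance_db
    set error class to rs_sqlserver_error_class
    set function string class to rs_sqlserver_function_class
    set username to maint_user_b
    set password to maint_user_b_pwd
go

-- ...and so on for the remaining aliases; the connection to the unaliased
-- CORP_FINANCES entry is the one given "with log transfer on" if the
-- database is also a primary.

Each of these connections receives its own outbound queue and DSI thread, exactly as pictured above.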
However. Consider the following picture: rs_lastcommit rs_lastcommit tran oqid 41 … tran oqid 41 … . they do present a problem – recoverability. The problem is simply this: with a single rs_lastcommit table and commit order guaranteed. Simply because the last record in the rs_lastcommit table refers to transaction id 101 does not mean the transaction 100 was applied successfully – or that 102 has not been already applied.a.. d suspended first 3 ..my_db DS2_c. .. Use a separate function class for each DSI. a DSI connection does not identify itself. or call a variant of the procedure such as rs_update_lastcommit_A.. rs_thread tables as well as associated procedures (rs_update_lastcommit). extensive function string utilization or other requirements demonstrating a bottleneck in the outbound processing. 293 ..my_db tran oqid 33 … tran oqid 33 … tran oqid 37 … tran oqid 37 … tran oqid 41 … tran oqid 41 … tran oqid 45 … tran oqid 45 … .. .e.my_db DS2_a. the same is not true. For example.1 Trading System Investments Figure 98 – MP Replication Achieved via Multiple DSI’s Note that the above architecture really only helps the outbound processing performance. the MP approach may be viable. All subscription resolution.. Then create separate rs_lastcommit. replication definition normalization.. if using Multiple DSI’s. As a result. d rolled back due to deadlocks Figure 99 – Multiple DSI’s with Single rs_lastcommit Table Consider the three scenarios proposed above.. DS2_b. DS2_c.my_db tran oqid 32 … tran oqid 32 … tran oqid 36 … tran oqid 36 … tran oqid 40 … tran oqid 40 … tran oqid 44 … tran oqid 44 … . etc... . DS2_d.... Plausible Scenarios: 1 . b. . 2. etc. Ensuring Recoverability and Preventing Loss While the multiple independent connections do provide a lot more flexibility and performance. “A”).my_db tran oqid 34 … tran oqid 34 … tran oqid 38 … tran oqid 38 … tran oqid 42 … tran oqid 42 … tran oqid 46 … tran oqid 46 … .my_db DS2_d. add a parameter that is hardcoded to the DSI connection (i. & d (long xactn) xactn) 2 . it is critical that each Multiple DSI has it’s own independent set of rs_lastcommit. b... .a. Within the class. DS2_a. owned by each specific maintenance user. consequently there are only two choices available: 1..Final v2. Parallel DSI’s are assured at restarting from that point and not incurring any lost or duplicate transactions. systems with high queue writes. In each of the three... Exploit the ASE permission chain and use separate maintenance users for each DSI. a column would have to be added to rs_threads as it has no distinguishable value either – along with changes to the procedures which manipulate these tables (rs_update_lastcommit. pad4 binary(255). pad7. @secondary_qid binary(36).Final v2. pad7 binary(4). origin_qid binary(36). For example. ** We pad each row to be greater than a half page but less than one page ** to avoid lock contention. pad6. */ if exists (select name from sysobjects where name = 'rs_update_lastcommit' and type = 'P') begin drop procedure rs_update_lastcommit end go /* Create the procedure to update the table. */ if exists (select name from sysobjects where name = 'rs_lastcommit' and type = 'U') begin drop table rs_lastcommit end go /* ** Create the table. dest_commit_time datetime. pad8 binary(4) ) go create unique clustered index rs_lastcommit_idx on rs_lastcommit(origin) go /* Drop the procedure to update the table. @origin_time datetime as update rs_lastcommit set origin_qid = @origin_qid. secondary_qid binary(36). 
each are aliased as dbo within the database. However. secondary_qid = @secondary_qid. The third one is definitely an option and is perhaps the easiest to implement.) would be to add a login name column. if it exists. rs_get_lastcommit. */ create table rs_lastcommit ( origin int. rs_get_lastcommit and rs_update_lastcommit are as follows: /* Drop the table. the second takes a bit of explanation. pad8) 294 . pad1 binary(255). pad1. pad4. While the first one is obvious – and obviously a lot of work as maintaining function strings for individual objects could then become a burden. pad2. @origin_qid binary(36). origin_time. The problem is that with high volume replication.). etc. origin_time = @origin_time. etc. In addition to rs_lastcommit. The modifications to the rs_lastcommit and rs_threads tables (and their corresponding procedures such as rs_update_lastcommit. */ create procedure rs_update_lastcommit @origin int. rs_get_thread_seq.0. pad5 binary(4).1 3. pad2 binary(255). the original rs_lastcommit table. the single rs_lastcommit table could easily become a source of contention. dest_commit_time = getdate() where origin = @origin if (@@rowcount = 0) begin insert rs_lastcommit (origin. dest_commit_time. it does have the advantage of being able to handle identity columns and other maintenance user actions requiring “dbo” permissions. pad3 binary(255). While separate maintenance user logins are in fact used. pad6 binary(4). origin_qid. pad3. secondary_qid. origin_time datetime. pad5. Since this is system information available through suser_name() function. Multiple maintenance users with changes to the rs_lastcommit table to accommodate connection information and corresponding logic added to rs_update_lastcommit to set column value based on username. the procedure modifications would simply be adding the suser_name() function to the where clause. Final v2. 0x00. pad8 binary(4) ) go -. getdate(). ** We pad each row to be greater than a half page but less than one page ** to avoid lock contention. 0x00. 0x00. pad7 binary(4). If using the multiple login/altered rs_lastcommit approach. pad6 binary(4). Consequently.modify the table to add the maintenance user column. For rs_lastcommit.0. origin_qid binary(36). */ create procedure rs_update_lastcommit 295 . then you simply need to add a where clause to each of the above procedures and the primary key/index constraints. rs_get_lastcommit. */ -. pad1 binary(255). pad5 binary(4). origin) go /* Drop the procedure to update the table. dest_commit_time datetime. 0x00. The reason for this is that the oqid is unique to the source system – but if there are multiple sources as can occur in a corporate rollup scenario – there may be duplicate OQID’s. pad2 binary(255). pad3 binary(255). normally retrieves all of the rows in the rs_lastcommit table.. origin_time datetime.modify the unique index to include the maintenance user create unique clustered index rs_lastcommit_idx on rs_lastcommit(maint_user. create table rs_lastcommit ( maint_user varchar(30).1 values (@origin. if it exists. @origin_qid. pad4 binary(255). */ if exists (select name from sysobjects where name = 'rs_get_lastcommit' and type = 'P') begin drop procedure rs_get_lastcommit end go /* Create the procedure to get the last commit for all origins. @secondary_qid. */ if exists (select name from sysobjects where name = 'rs_lastcommit' and type = 'U') begin drop table rs_lastcommit end go /* ** Create the table. 0x00. origin int. this becomes (modifications highlighted): /* Drop the table. 
the oqid and database origin id (from RSSD. During recovery. */ create procedure rs_get_lastcommit as select origin. origin_qid. 0x00) end go /* Drop the procedure to get the last commit. @origin_time. secondary_qid binary(36).rs_databases) is stored together. as each transaction is played back. the oqid and origin are used to determine if the row is a duplicate. 0x00. 0x00. */ if exists (select name from sysobjects where name = 'rs_update_lastcommit' and type = 'P') begin drop procedure rs_update_lastcommit end go /* Create the procedure to update the table. secondary_qid from rs_lastcommit go Note that the last procedure. Final v2. pad3. origin_time. update rs_lastcommit set origin_qid = @origin_qid.authors.1 or less on ASE 12. @origin_qid. So if “fred” is a user in the database and there is two tables: 1) fred. If one is not found. authors will be resolved to fred. 2KB).. by using separate maintenance users and individually owned rs_lastcommit. On the other hand. It is a little known fact (but still documented). Fortunately. 0x00. secondary_qid. other than the reduction in transaction log activity. pad7. that when you execute a SQL statement in which the object’s ownership is not qualified.0.5 you may need to modify these tables anyhow. It is a useful technique to remember. Why this is necessary at all is discussed under section describing the Multiple DSI/Multiple User implementation. binary(36). if Mary issues “select * from pubs2. this will gain little in the way of performance .authors”. we could alter the table definition to accommodate max_rows_per_page or datarow locking and eliminate the row padding (thereby reducing the amount of data logged in the transaction log for rs_lastcommit updates). By not changing the procedure parameters and due to the fact that all operations occur through the procedures. tables. origin_qid. we do not need to make any changes to the function strings (reducing maintenance considerably). Note that at the same time.authors exists. 0x00. dest_commit_time. So if implementing RS 12. pad5.add the maint_user to the (previously nonexistent) where clause select origin. which invalidates the normal rs_lastcommit padding.1 @origin @origin_qid @secondary_qid @origin_time as int. 0x00. 0x00. all retrieval and write operations against the rs_lastcommit table are performed through stored procedure call (similar to an API of sorts).e. 16KB vs.authors and fred issues “select * from pubs2.5 will support larger page sizes (i. you can exploit the way ASE does object resolution and permission checking.authors. While useful for handling identity and simple to implement. origin. binary(36). datetime -. pad2. pad6. However. we have the following: 296 . */ create procedure rs_get_lastcommit as -. @secondary_qid. etc. as ASE 12. though.authors. */ if exists (select name from sysobjects where name = 'rs_get_lastcommit' and type = 'P') begin drop procedure rs_get_lastcommit end go /* Create the procedure to get the last commit for all origins. 0x00. origin_time = @origin_time. It is important to avoid changing the procedure parameters. pad1. 0x00. dest_commit_time = getdate() where origin = @origin and maint_user=suser_name() if (@@rowcount = 0) begin -. @origin_time. ASE will first look for an object of that name owned by the user (as defined in sysusers). @origin. secondary_qid from rs_lastcommit where maint_user = suser_name() go Similar changes will need to be done to the rs_threads table and associated procedure calls as well. origin_qid. pad8) values (suser_name(). 
pad4.add the maintenance user login to insert statement insert rs_lastcommit (maint_user. then it searches for one owned by the database owner – dbo. Consequently. 0x00) end go /* Drop the procedure to get the last commit. the third alternative above may provide slightly greater performance by eliminating any contention on the rs_lastcommit table. authors will be resolved to dbo. secondary_qid = @secondary_qid. By using separate maintenance users.. and 2) dbo.authors”. 0x00.add maint_user qualification to the where clause. since no mary. getdate(). . DS2_c.rs_lastcommit MaintUser5 Figure 100 – Multiple Maintenance Users with Individual rs_lastcommits This then addresses the problems in the scenario we discussed earlier and changes the situation to the following: DS2_a. DS2_d.database.. Plausible Scenarios: 1 .my_db tran oqid 34 … tran oqid 34 … tran oqid 38 … tran oqid 38 … tran oqid 42 … tran oqid 42 … tran oqid 46 … tran oqid 46 … . DS2_c..... However.. . DS2_b. this leads to a potential recoverability issue with RS system tables that must be handled to prevent data loss or duplicate transactions. .0.c committed after a.my_db DS2_c. DS2_d. Key Concept #37: The Multiple DSI approach uses independent DSI connections set up via aliasing the target dataserver..rs_lastcommit MaintUser1 MaintUser2. .rs_lastcommit MaintUser3 MaintUser4.my_db tran oqid 32 … tran oqid 32 … tran oqid 36 … tran oqid 36 … tran oqid 40 … tran oqid 40 … tran oqid 44 … tran oqid 44 … .a.rs_lastcommit DS2_b. . . . d rolled back due to deadlocks Figure 101 – Multiple DSI’s with Multiple rs_lastcommit tables Now. DS2_a.my_db DS2_b..rs_lastcommit MaintUser4 MaintUser5. no matter what the problem.. b. .rs_lastcommit tran oqid 34 … tran oqid 34 … ..a..rs_lastcommit tran oqid 41 … tran oqid 41 … ... ..my_db DS2_a....my_db tran oqid 33 … tran oqid 33 … tran oqid 37 … tran oqid 37 … tran oqid 41 … tran oqid 41 … tran oqid 45 … tran oqid 45 … .. Detailed Instructions for Creating Connections Now that we now what we need to do to implement the multiple DSI’s and how to ensure recoverability.....my_db DS2_d.rs_lastcommit DS2_d. the next stage is to determine exactly how to achieve it..1 MaintUser1. & d (long xactn) xactn) 2 ...rs_lastcommit DS2_a. b. b..rs_lastcommit MaintUser2 MaintUser3.rs_lastcommit tran oqid 39 … tran oqid 39 … ..my_db tran oqid 31 … tran oqid 31 … tran oqid 35 … tran oqid 35 … tran oqid 39 … tran oqid 39 … tran oqid 43 … tran oqid 43 … .. Basically.. each of the DSI’s recovers to the point where it left off.Final v2. DS2_b. d suspended first 3 ..rs_lastcommit DS2_c. it comes down to a modified rs_init approach or performing the 297 .rs_lastcommit tran oqid 44 … tran oqid 44 … .. 6. Do not give them sa_role. the manual method is fairly easy.logical_db [use dump marker]] 3. (Same as above). 6. While you could grant permissions to individual maintenance users.logical_db | as standby for logical_ds. one may have to be aliased to “dbo” (drop the user and add an alias). If replicate is also a primary. Grant all permissions on tables/procedures to replication_role. Create connections from Replication Server to the replicate database. dsi_suspended}] [as active for logical_ds. 5. Repeat steps 1-2 until all maintenance users created. Not that all maintenance users have probably been created as Replication Server users. 2. For all other maintenance users. Grant all permissions on tables/procedures to replication_role. The steps are: 1. Configure the Replication Agent as desired. 
Alter the copy to include the first maintenance user as owner of all the objects. pick one of the maintenance users to be the “maintenance user” and specify the log transfer option create connection to data_server. Use isql to load the script into the replicate database. 3. add the maintenance user to Replication Server (create user) grant the specified maintenance user connect source permission in the Replication Server. 4. but does require a bit more knowledge about Replication Server. If the database will also be a primary database and data is being replicated back out.database set error class [to] rs_sqlserver_error_class set function string class [to] rs_sqlserver_function_classset username [to] maint_user_name [set password [to] maint_user_password ] [set database_param [to] 'value'] [set security_param [to] 'value' ] [with {log transfer on. 298 . If identity values are used.1 steps manually (as may be required for heterogeneous or OpenServer replication support). by granting permissions to the role. Rename the rs_install_primary script to a name such as rs_install_primary_mdsi. but can be cleaned up if desired. 2. If following the first implementation (modifying rs_lastcommit). This will prevent problems for future replication installations not involving multiple DSI’s. It is very similar to the above in results. Run rs_init for replicate database.0. you reduce the work necessary to add additional DSI connections later. 5. 1. Modified rs_init Method The modified rs_init method is the easiest and ensures that all steps are completed (none are accidentally forgotten). all may be aliased to dbo. Use sp_config_rep_agent to specify the desired maintenance user name and password for the Replication Agent. when in any database. you reduce the work necessary to add additional DSI connections later. 4. Add the maintenance user logins (sp_addlogin). Specify the first maintenance user. Rename the original back to rs_install_primary. If you do. Make a copy of $SYBASE/$SYBASE_RS/scripts/rs_install_primary. Each of the below requires the developer to first create the aliases in the interfaces file. by granting permissions to the role. Manual Multiple DSI Creation Despite what it sounds. Add the maintenance users to the replicated database. Repeat for each maintenance user. the maintenance user will map to “dbo” user vs. 7. 8. While you could grant permissions to individual maintenance users. one may have to be aliased to “dbo”. Create as many as you expect to have Multiple DSI’s plus a few extra. This is not a problem. Alter the rs_install_primary to include the first maintenance user as owner of all the objects. If using the modified rs_lastcommit approach. you can simply repeat step 2 until done. If identity values are used. Make a copy of $SYBASE/$SYBASE_RS/scripts/rs_install_primary (save it as rs_install_primary_orig). Grant maintenance user logins replication_role.Final v2. alter the connection and set replication off (if desired). but less manual steps. the maintenance user desired – consequently incurring the problem with rs_lastcommit. Load script using rs_init as normal. you must implement your own synchronization point to enforce serialization. Make a copy of rs_install_primary. etc. is to make sure that the where clause operations for any one connection are mutually exclusive from every other connection. b. This will prevent problems for future replication installations not involving multiple DSI’s. 2. 
Manually adjust function strings if inheritance does not provide appropriate support. this is not as difficult to achieve as you would think. 7. Rename the original back to rs_install_primary. In many cases. In this situation. rs_lastcommit and their associated procedures. Specify the appropriate function string class for each. and in particular the subscription where clause. 5. Modify the system functions for rs_get_thread_seq. The main mechanism for implementing parallelism is through the use of subscriptions. The steps are basically: 1. Defining and Implementing Parallelism Controls The biggest challenge to Multiple DSI’s is to design and implement the parallelism controls in such a way that database consistency is not compromised. This includes adding column to tables such as rs_threads without anything. Add column for DSI to each table as well as parameter to each procedure. As a result. it is the transactions from this source database that must be processed in parallel using the Multiple DSI’s. 3. a single primary source database provides the bulk of the transactions to the replicate. each of the Multiple DSI’s subscribes to different transactions or different data through one of the following mechanisms: 299 .0. The following rules MUST be followed to ensure database consistency: 1. Serial transactions must use the same DSI connection. b. Add column for maintenance user suid() or suser_name() to all tables and procedure logic. you will have to do the following (note this is a variance to either of the above. 4. If not 1 & 2. edit the appropriate file and make the following changes: a. This can be done via a variety of mechanisms. two transactions executed at the primary might be subscribed to by different connections and therefore have a different order of execution at the replicate than they had at the primary. Create a function string class for the first DSI (inherit from default). Create multiple connections from Replication Server to replicate database for remaining DSI’s using the create connection command. Depending on manual or rs_init method. you opt not to have multiple rs_lastcommit tables and instead wish to use a single table. Modify the original as follows: a. to specify the DSI. but is usually determined by two aspects: 1) the number of source systems involved. Rename the rs_install_primary script to a name such as rs_install_primary_mdsi. 3. This includes tables such as rs_threads. 2. Make a copy of rs_install_primary and save it as rs_install_primary_orig. Adjust all unique indexes to include suid() or suser_name() column. so replace the above instructions as appropriate): 1. 6. 2. Single rs_lastcommit with Single Maintenance User This method employs the use of function string modifications and really is only necessary if the developers really want job security due to maintaining function strings. rs_update_lastcommit. however. Monitor replication definition changes during lifecycle.1 Single rs_lastcommit with Multiple Maintenance Users If for maintenance reasons or other.Final v2. and 2) the business transaction model. Parallel Subscription Mechanism. Each aliased database connection (Multiple DSI) subscribes to a different data – either at the object level or through the where clause. Repeat for each DSI. Adjust all unique indexes to include DSI column. Single Primary Source In some cases. Load script according to applicable manual or rs_init instructions above. Procedure logic should select suid() or suser_name() for use as column values. 
Parallel transactions must be commit consistent. The key. This will create the first connection. Alter the first connection to use the first DSI’s function string class. As a result. Handling Serialized Transactions In single source systems. @@spid%10) – remembering that the result of mod(n) could be zero through n-1 (i. it can help with large transactions (due to large transaction threads) and medium volume situations through tuning the serialization method (none vs. this has one very distinct advantage over normal replication in that an erroneous transaction from one does not stop replication from all the others by suspending the DSI connection. if a typical customer transfers funds from a savings to a checking account. If you tried to divide the range of users evenly by spid. the column could store the mod() of the spid (i. global variables are no longer allowed as input parameter defaults to stored procedures. if an accurate picture of fund balances is necessary. However. The first two and last are fairly easy to implement and typically do not require modification to existing tables. Since in each case an independent Rep Agent.Transactions from one source system are guaranteed commit consistent from all others. etc. they will be applied in parallel at the replicate. frequently users are coming through a middleware tier (such as a web or app server) and are using a common login. An example of a discrete list might be similar to a bank in which one DSI subscribes to checking accounts. a hospital’s outpatient system may have a separate appointment scheduling/check-in desk. While this may not affect some business rules.e. in today’s architectures. you would end up with some DSI’s not doing any work for a considerable period (4 hours) of the workday. etc.e. In many cases. An example of this might be a consolidated database in which multiple stations in a business flow all access the same database. this includes situations such as retail POS terminals. However. interest calculations) to execute independent of other batch processes without either “blocking” the other through the rs_threads issue. Guaranteed commit consistency . mod(2) yields 0 & 1 as remainders). An example of the former may be that a DSI may subscribe to A-E or account numbers 10000-20000. typically via a range or discrete list. the other credit card transactions. As mentioned earlier. inbound queue processing and OQID’s are used for the individual components of a 2PC transaction. it is frequent that a small number of transactions still need to be serialized no matter what the parallelism strategy you choose. Typically implemented in situations involving a lot of procedure-based replication. the user/process partition might. Data Partitioning – In this scenario. then such a column could readily be used as well. This is most useful when a single database is used to process several different types of transactions. the replicate system may be inconsistent for a period of time.Since each source database has it’s own dedicated DSI. When multiple primary source systems are present. On the other hand. User/Process Partitioning – In this scenario. For example. pharmacy.0. from a replication standpoint. different DSI’s subscribe to different transactions. Multiple Primary Sources Multiple primary source system situations are extremely common to distributed businesses needing a corporate rollup model. a normal call center may start with only a few users at 7:00am. lab tests and results. if a bank opts for using the account number. 
different DSI’s subscribe to a different subset of tables. this allows long batch processes (i. Parallel DSI support – While this doesn’t appear to add benefit if the multiple DSI’s are from a single source. probably 80-90% of the transactions are fine. etc. banking applications. it would be impossible for even a single Replication Server to reconstruct the transaction into a single transaction for application at the replicate. in the case of multiple sources. different DSI’s subscribe to data modified by different users. If the database design incorporates an audit function to record the last user to modify a record and user logins are enforced.1 Data Grouping – In this scenario. establishing parallel transactions are fairly easy due to the following: No code/table modifications . the spid itself could be hard to develop a range on as load imbalance and range division may be difficult to achieve. wait_for_commit). Transaction Partitioning – In this scenario. etc. it resembles a 1:1 straightforward replication. if the transaction is split due to the account numbers. triage treatment. Probably one of the more frequently implemented. For example. For example. Each of the regional offices would have it’s own dedicated DSI thread to apply transactions to the corporate database.e. This is most useful in situations where individual user transactions need to be serialized.Final v2. in the remaining 10-20% are transactions such as account transfers that need to be serialized. If each “group” of tables that support these functions are subscribed to by different DSI’s.9. This is true even in cases of two-phased commit distributed transactions affecting several of the sources. The transactions affect a certain small number of tables unique to that data. Note that as of ASE 11. different DSI’s subscribe to different sets of data from the same tables. but are independent of each other’s. As a result. For example. a column may have to be added to the main transaction tables to hold the process id (spid) or similar value. However. build to 700 concurrent users by 09:00am and then degrade slowly to a trickle from 4:00pm to 06:00pm. this could cause a problem similar to the 300 . order_queue. the Multiple DSI’s that subscribe to those accounts do not receive the change. Unlike rs_threads where the sequence is predictable. If you remember. sending a SQL statement. the handling of serialized transactions is pretty simple – simply call a replicated procedure with the parameters. Another example. At that stage. The above is a true serialized transaction example. The last transaction should destroy the latch by deleting the row.is when the transaction involves a worktable in one database and then a transaction in another database (pending/approved workflow). item_inventory. because it is a replicated procedure. read by two different Rep Agents. If such is the case. the primary database can simply assign an arbitrary transaction number (up to 2 billion before rollover) and store it in a column added similar to the user/spid mod() column described earlier.1 typical isolation level 3/phantom read problems in normal databases. Once determined. So. For the most part. the normal transactional integrity of the transaction is inescapably lost. Instead. rs_threads imposes a modified “dead man’s latch” to control commit order. However. However. Latch Set – As each successive transaction begins execution. 
a store procedure at the primary call may generate a work table in one database using a select/into and then call a sub-procedure to further process and insert the rows. order_items. The answer is we would need a latch table and procedures similar to the following at the replicate: -. it is not. As long as the bill invoice number is part of the subscription and the itemization. even when user/process id is used for the parallelism strategy. The core logic would be: Latch Create – Basically some way to ensure that the latch was clear to begin with. a hospital bill containing billable items for Anesthesia and X-ray. not null. In addition. the individual row modifications are not replicated – consequently. the best approach would be to simply invoke the parallelism based on the spid of the person entering the order. Consequently. but what if 3 or more are involved? Even more complicated. there may not be a single or easily distinguishable set of attributes that can be easily subscribed to for ensuring transaction serialization within the same transaction. what if several had a specific sequence for commit? For example. During processing. By using bitmask subscription. the technique is a bit reminiscent of rs_threads. Serialization Synchronization Point There may be times when it is impossible to use a single procedure call to replicate a transaction that requires serialization and the normal parallel DSI serialization is counter to the transactions requirements. This normally occurs when a logical unit of work is split into multiple physical transactions – possibly even executed by several different users. Latch Block – Once the previous transactions have begun. lets consider the classic order entry system in which the following tables need to be updated in order: order_main.0. the transaction needs to set and lock the latch. This is fairly simple for two connections. Multiple DSI connections will wreak havoc on transactional integrity and serialization – simply because there is no way to guarantee that the transaction from once connection will always arrive after the other. of course. Of course. consequently a new latch should be created for each serialized transaction Latch Wait – In this case. A classic case – without even parallel DSI . The answer is “Yes”. we would expect 4 DSI’s to be involved – one for each of the tables. For example. the normal Replication Server commit order guarantee ensures that the transactions are serialized within respect one another. for some obscure reason. The question “Is there a way to ensure transactions are serialized?”. the transaction is guaranteed to arrive at the replicate as a complete bill – and within a single transaction.Final v2. after defining the parallelism strategy. 301 . in this case. The most common example is to have transactions executed by the same user serialized – or impacting the same account serialized. another DSI reserved for serialized transactions (it may be more than one DSI – depending on design) subscribes to the procedure replication and delivers the proc to the replicate. then by subscribing by invoice. this site can’t do that – and want to divide the parallelism along table lines. the second and successive transactions if occurring ahead of the first transaction need to sense that the first transaction has not taken place and wait. A similar mechanism could be constructed to the same thing through the use of stored procedures or function string coding. and delivered by two different DSI connections. 
then the rs_id column becomes very useful.latch table create table order_latch_table ( order_number latch_sequence int int not null. serializing the transactions simply means ensuring that all the ones related are forced to use the same DSI. each successive transaction needs to clear its lock on the latch. Latch Release – When completed. they can begin immediately. Similarly. While this may necessitate an application change to call the procedure vs. a careful review of business transactions needs to be conducted to determine which ones need to be serialized. Normally. However. the following transactions need to block on the latch so that as soon as the previous transactions commit. since both transactions originate from two different databases. the load could be evenly balanced across the available Multiple DSI’s. the benefits in performance at the primary are well worth it. otherwise.procedure to clear order latch create procedure destroy_order_latch @order_number int.procedure to wait block and set latch create procedure set_order_latch @order_number int.the only way we got to here is if the latch update worked -. we’d still be blocked on previous update -. the procedure will more than likely have no code in the procedure body as there is no need to perform serialization at the primary (transaction is already doing that). The way this works is very simple .0. that means we can exit this procedure and allow -.In any case. 0) return (0) end go -.block on latch so follow-on execution begins immediately -.1 constraint order_latch_PK primary key (order_number) ) lock datarows go -. @thread_num rs_id as begin delete order_latch_table where order_number = @order_number return (0) end go It is important to note that the procedure body above is for the replicate database. For example. At the primary. In addition.once previous commits update order_latch_table set latch_sequence = @thread_seq where order_number = @order_number -. consider the following pseudo-code example: Begin transaction Insert into tableA Update tableB Insert into tableC Insert into tableC Update table B 302 .make sure we are in a transaction so block holds if @@trancount = 0 begin rollback transaction raiserror 30000 “Procedure must be called from within a transaction return(1) end -.Final v2.the application to perform the serialized update return (0) end go -. it is possible to combine the “create” and “set” procedures into a single procedure that would first create the latch if it did not already exist.procedure to set/initialize order latch create procedure create_order_latch @order_number int. @thread_num rs_id as begin insert into order_latch_table values (@order_number. @thread_seq int.wait until time to set latch while @cntrow=0 begin waitfor delay “00:00:02” select @cntrow=count(*) from order_latch_table where order_number = @order_number and latch_sequence = @thread_seq –1 at isolation read uncommitted end -.but does require the knowledge of which threads will be applying the transactions. @thread_num rs_id as begin declare @cntrow int select @cntrow=0 -. 2 Update into tableB Select @seq_num=sequence_num from order_latch_table where order_number = @order_num Exec SRV_set_order_latch @order_num.Final v2. latch_sequence int not null. The “deliver as” name would not be prefaced with the server extension. 1) return (0) end go -. 3 Commit transaction In which the indented calls are initiated by the triggers on the previous operation. 2 Update into tableB Exec SRV_destroy_order_latch @order_num. @seq_num. @seq_num. 
Now, assuming tables A-C will use DSI connections 1-3 and need to be applied in a particular order (i.e. A inserts a new financial transaction, while B updates the balance and C is the history table), the transaction at the primary could be changed to:

    Begin transaction
        Exec SRV_create_order_latch @order_num, 1
        Insert into tableA
        Exec SRV_set_order_latch @order_num, 2
        Update tableB
        Exec SRV_set_order_latch @order_num, 3
        Insert into tableC
        Insert into tableC
        Exec SRV_set_order_latch @order_num, 2
        Update tableB
        Exec SRV_destroy_order_latch @order_num, 1
    Commit transaction

Note that the SRV prefix on the procedures above is simply to allow the procedure replication definition to be unique vs. the other connections; the "deliver as" name would not be prefaced with the server extension.

In addition, to reduce the modifications to application logic, the procedure execution calls above could be placed in triggers – although this would require the trigger to set the latch for the next statement. Because a trigger is generic and can't tell how many operations preceded it, the local version of the latch procedures would have to have some logic added to track the sequence number for the current order number, and each "set latch" would simply add one to the number. Using variables to pass the sequence, the transaction changes to:

    Begin transaction
        Exec SRV_create_order_latch @order_num, 1
        Insert into tableA
            Select @seq_num = latch_sequence from order_latch_table
                where order_number = @order_num
            Exec SRV_set_order_latch @order_num, @seq_num, 2
        Update tableB
            Select @seq_num = latch_sequence from order_latch_table
                where order_number = @order_num
            Exec SRV_set_order_latch @order_num, @seq_num, 3
        Insert into tableC
            Select @seq_num = latch_sequence from order_latch_table
                where order_number = @order_num
            Exec SRV_set_order_latch @order_num, @seq_num, 3
        Insert into tableC
            Select @seq_num = latch_sequence from order_latch_table
                where order_number = @order_num
            Exec SRV_set_order_latch @order_num, @seq_num, 2
        Update tableB
    Commit transaction

in which the indented calls are initiated by the triggers on the previous operation. Also note that the first "set latch" is sent using the second DSI. If you think about it, this makes sense: the first statement doesn't have to wait for any order – it should proceed immediately – and the latch it sets is for the statement that follows it.
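As an illustration of the trigger-based approach, a minimal sketch of one such trigger follows. It is not from the original paper: the table and column names are assumptions, it assumes single-row inserts, and the literal thread number (2) stands in for whichever connection the next statement in the business transaction will use.

    -- hypothetical trigger sketch: sets the latch for the *next* statement
    -- in the business transaction (assumes tableA carries order_number and
    -- that inserts are single-row)
    create trigger tableA_ins_latch on tableA for insert
    as
    begin
        declare @order_num int, @seq_num int

        select @order_num = order_number from inserted

        -- the generic trigger cannot know how many operations preceded it,
        -- so it simply reads the current sequence for this order
        select @seq_num = latch_sequence
          from order_latch_table
         where order_number = @order_num

        exec SRV_set_order_latch @order_num, @seq_num, 2
    end
    go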
At the primary, the latch table is the same, and the latch procedures are essentially no-ops whose only real work is to be replicated (and, in the case of the "set" procedure, to advance the sequence number that the triggers read):

    -- latch table
    create table order_latch_table (
        order_number    int not null,
        latch_sequence  int not null,
        constraint order_latch_PK primary key (order_number)
    ) lock datarows
    go

    -- procedure to set/initialize order latch
    create procedure SRV_create_order_latch
        @order_number  int,
        @thread_num    rs_id
    as
    begin
        insert into order_latch_table values (@order_number, 1)
        return (0)
    end
    go

    -- procedure to wait, block and set latch
    create procedure SRV_set_order_latch
        @order_number  int,
        @thread_seq    int,
        @thread_num    rs_id
    as
    begin
        update order_latch_table
           set latch_sequence = latch_sequence + 1
         where order_number = @order_number
    end
    go

    -- procedure to clear order latch
    create procedure SRV_destroy_order_latch
        @order_number  int,
        @thread_num    rs_id
    as
    begin
        delete order_latch_table
         where order_number = @order_number
        return (0)
    end
    go

However, you should also note that with the trigger-based approach the destroy procedure never gets called – it would be impossible from a trigger to know when the transaction has ended. A modification to the replicate version of the rs_lastcommit procedure could perform the clean-up at the end of each batch of transactions.

Design/Implementation Issues

In addition to requiring manual implementation for synchronization points, implementing Multiple DSI's has other design challenges.

Multiple DSI's & Contention

Because Multiple DSI's mimic the Parallel DSI serialization method "none", it is even more critical to tune the connections similar to the Parallel DSI / dsi_serialization_method=none techniques discussed earlier. Namely:

• Set dsi_max_xacts_in_group to a low number (3 or 5)
• Use datapage or datarow locking on the replicate tables
• Change clustered indexes or partition the table to avoid last page contention

If not, the connections could experience considerable contention between one another. However, unlike Parallel DSI's, the retry from deadlocking is not the "kinder, gentler" approach of applying the offending transactions in serial and printing a warning. The transaction that was rolled back (i.e. thread 2 vs. thread 1) is not known, so the wrong victim may be rolled back and the transaction attempted again and again until the DSI suspends due to exceeding the retries. As mentioned before, in a 1995 case study using 5 Multiple DSI connections for a combined 200 tps rate, 30% of the transactions deadlocked at the replicate. In the final implementation, transaction grouping was simply disabled and the additional I/O cost of rs_lastcommit endured – in those days the number of transactions per group was not controllable, and attempts to use the byte size were rather cumbersome.

Identity Columns & Multiple DSI

As partially discussed before, think about it: an "identity" does not have any valid context in any distributed system. The real solution is to simply define the column at the replicate as a "numeric" vs. an "identity". In addition, if identities are generated at multiple sites, the value would have to be combined with the site identifier (the source server name from rs_source_ds) to ensure that problems with "duplicate" rows do not happen.

Multiple DSI's & Shared Primary

Again, you need to consider the problem associated with Multiple DSI's if the replicate is also a primary database – for instance, at a corporate rollup the replicate may be an intermediate in the hierarchical tree. Since the DSI connections use aliased user names, the normal Replication Agent processing for filtering transactions based on the maintenance user name will fail – consequently re-replicating data distributed from the Multiple DSI's. Normally – with the exception of Warm Standby – it is extremely simple to disable this by configuring the connection parameter "dsi_replication" to "off". If the parallelism strategy chosen is one based on the table/table-subset strategy, then simply aliasing one of the DSI connections to "dbo" and ensuring that all transactions for that table use that DSI connection is a simple strategy; if not – for example with the more classic user/process strategy – Parallel DSI's may also have to be implemented for that DSI connection as well. Of course, in some cases – field sites, for example – the re-replication of data modifications may be desirable, in which case it could be viewed as a slight twist on the asynchronous request functions described earlier.
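Where the re-replication is not wanted, the standard RCL for turning it off on one of the aliased connections looks like the following; the connection name is illustrative, and dsi_replication is a static option, so the connection must be suspended and resumed for it to take effect.

    /* sketch: stop transactions applied by this aliased DSI connection
       from being marked for re-replication */
    suspend connection to RDS_ALIAS2.rdb
    go
    alter connection to RDS_ALIAS2.rdb
        set dsi_replication to 'off'
    go
    resume connection to RDS_ALIAS2.rdb
    go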
Business Cases

Despite their early implementation as a mechanism to provide parallelism prior to Parallel DSI's, Multiple DSI's still have applicability in most of today's business environments. In this section we will take a look at ways that Multiple DSI's can be exploited to get around normal performance bottlenecks, as well as at some interesting business solutions they enable.

Long Transaction Delay

In several of the previous discussions we illustrated how a long running transaction – whether a replicated procedure or several thousand individual statements within a single transaction – can cause severe delays in applying the transactions that immediately followed it at the primary. For example, if a replicated procedure requires 4 hours to run, then during the 4 hours that procedure is executing, the outbound queue will be filling with transactions. Because commit consistency is a prerequisite, this could lead to an unrecoverable state if the transaction volume is high enough that the remaining time in the day is not enough for the Replication Server to catch up.

Multiple DSI's can deftly avoid this problem: while one DSI connection is busy executing the long transaction, other transactions can continue to be applied through the other DSI connections. Normal daily activity could use a single DSI connection (it could still use Parallel DSI's on that connection though!), while the nightly purge or store close-out procedure would use a separate DSI connection. This is particularly useful in handling overnight batch jobs. Consider the following illustration:

[Figure 102 - Multiple DSI Solution for Batch Processing (diagram: an OLTP System replicating Batch Interest Payments, Closing Trade Position, Customer Trades, and Mutual Fund Trades to a DataWarehouse over separate DSI connections)]

The approach is especially useful for those sites where the Replication Server is normally able to maintain the transaction volume even during peak processing, but gets behind rapidly due to close-of-business processing and overnight batch jobs.
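As an illustration of routing the batch work down its own connection, the sketch below assumes the nightly close-out has been wrapped in a replicated procedure and that RDS_BATCH.rdb is a second (aliased) connection to the same replicate database. All object and connection names are hypothetical, and the procedure is assumed to have been marked for replication at the primary with sp_setrepproc.

    /* hypothetical function replication definition for the close-out proc */
    create function replication definition store_closeout_repdef
    with primary at PDS.pdb
    with all functions named store_closeout
    (@store_id int, @close_date datetime)
    go

    /* deliver it through the dedicated batch connection rather than the
       connection used for normal daily activity */
    create subscription store_closeout_sub
    for store_closeout_repdef
    with replicate at RDS_BATCH.rdb
    go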
Commit Order Delay

Very similarly, large volumes of transactions that are independent of each other can end up delaying one another simply due to commit order – even though the transactions are completely independent and commit consistent. In many businesses there are several different business processes involved in the same database; these could use separate DSI connections to avoid being delayed by a high volume of activity belonging to another business process. Consider the following:

[Figure 103 - Multiple DSI Solution for Separate Business Processes (diagram: an Airport replicating Flight Departures, Airfreight Shipments, Passenger Ticketing, and Aircraft Servicing Costs to Airline Headquarters over separate connections)]

Flight departures are an extremely time sensitive piece of information, yet very low volume compared to passenger check-in and ticketing activities. During peak travel times, a flight departure could have to wait for several hundred passenger-related data records to commit at the replicate prior to being received. A delay of 30 minutes would not be tolerable, as this is the required reporting interval for flight "following" (tracking) that may be required from a business sense (i.e. delaying the next connecting flight because this one left 45 minutes late) – or simply for timely notification back at headquarters that a delayed flight has finally taken off.

Consider also the average Wal-Mart on a Friday night, with 20+ lanes of checkout counters. If the transactions are being replicated, transactions from the express lane would have to wait for the others to execute at the replicate and commit in order. Multiple DSI's allow this problem to be overcome through techniques such as dedicating a single DSI connection to each checkout counter.

Similarly, an order entry database could insert a row into a "message queue" table for shipping. At the shipping database, the replicated insert triggers inserts into the "pick" queue, and the status is replicated back to the order entry system. Whether this is done in a messaging system or accomplished otherwise (replication), no clearer picture of commit consistency can be found. By now, you may be getting the very correct idea that Multiple DSI's can contribute much more to your replication architecture than just speed.

Corporate Rollups

One of the most logical places for a Multiple DSI implementation is a corporate rollup. While transactions may be routed from several different sources, the problem is that Parallel DSI's are not well equipped to handle corporate rollups. Consider the following:
• If one DSI suspends, they all do – which means they all begin to back up, not just the one with the problem. As a result, the aggregate of transactions in the backlog may well exceed possible delivery rates.
• Large transaction issues. At the primary, a system becomes essentially single threaded with a large transaction due to commit order requirements. Granted, the work is dispersed between different transactions over (generally) several hours; but given several sites executing large transactions, the end result is that corporate rollups have extreme difficulty completing large transactions in time for normal daily processing.
• Limited parallelism. At a maximum, Parallel DSI only supports 20 threads. While this has proven conclusively to be sufficient for extremely high volume at even half of that, in extremely large implementations (such as nation-wide/global retailers) it can still be too few. Additionally, with a single Replication Server for delivery, the full load for function string generation and SQL execution falls on a single process.
• Mixed transaction modes. "Follow-the-sun" type operations limit the benefits of "single_transaction_per_source", as the number of sources concurrently performing POS activity may be fairly low while others are performing batch operations. Consequently, establishing Parallel DSI profiles is next to impossible as the transaction mixes are constantly changing.

Multiple DSI's can overcome this by involving multiple Replication Servers – and therefore multiple queuing engines – limiting connection issues to only the site with the problem and allowing large transaction concurrency (within the limits of contention at the replicate, of course).

Contention Control

Another reason for Multiple DSI's is to allow better control of the parallelism and consequently to reduce contention by managing transactions explicitly. In normal Parallel DSI, considerable contention may occur at the replicate as transactions are indiscriminately split among the different threads; even if only a small percentage of them experience timing-related contention, it can translate into a large contention issue during replication. By using Multiple DSI's, all of the transactions for one user can be directed down the same connection – minimizing the contention between the threads. For example, a typical online daemon process (such as a workflow engine or a queuing engine) will log in using a specific user id; since it is a single thread of execution at the primary, there would be no contention within its transactions at the replicate either.

While this example comes from OLTP workflow processing, the same situation arises in retail banking from a different perspective. Consider the following:

[Figure 104 - Multiple DSI Approach to Managing Contention (diagram: a Branch Bank replicating to Headquarters over connections partitioned by acct_num mod 0, acct_num mod 1, acct_num mod 2, etc., with a separate connection for cross-account transfers)]

In the example above, every transaction that affects a particular account uses the same connection and as a result is serialized vs. concurrent – and is therefore much less likely to experience contention. Granted, any single account probably does not get much activity; but given the magnitude of the account base, even a small collision rate matters – 1% of 1,000,000 is 10,000, which is still a large number of transactions to retry when an alternative exists. Another example is present in high volume OLTP situations such as investment banking, in which a few accounts (investment funds) incur a large number of transactions during trading and compete with small transactions from a large user base investing in those funds. One of the advantages to this approach is that, where warranted, Parallel DSI's can still be used; it also takes on a different aspect in that different connections can use different serialization methods – a connection in which considerable contention might exist could use "wait_for_commit" serialization, while the others use "none".
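A minimal sketch of how such account-based routing could be expressed as subscriptions follows. It is not from the original paper: since a subscription where clause cannot compute acct_num mod 3 itself, the sketch assumes the primary application (or a trigger) maintains a dsi_bucket column equal to the account number modulo 3, and the replication definition and aliased connection names are hypothetical.

    /* hypothetical: one subscription per aliased connection, each taking
       one account bucket so that all activity for an account serializes
       on a single DSI */
    create subscription acct_bucket0_sub
    for acct_tran_repdef
    with replicate at HQ_DSI0.hqdb
    where dsi_bucket = 0
    go

    create subscription acct_bucket1_sub
    for acct_tran_repdef
    with replicate at HQ_DSI1.hqdb
    where dsi_bucket = 1
    go

A separate connection (and subscription without the bucket predicate on a cross-account repdef) would carry the cross-account transfers, as in the figure.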
However, in large implementations, consider the following:

[Figure 105 - Large Corporate Rollup Implementation with Multiple DSI's (diagram: Field Offices replicating both to a Regional Rollup and directly to the Corporate Rollup)]

In the above example, each source maintains its own independent connection to the corporate rollup as well as to the intermediate (regional) rollup. Consequently, extremely large-scale implementations can be developed. This also allows a field office to easily "disconnect" from one reporting chain and "connect" to another simply by changing the route to the corporate rollup as well as the regional rollup and changing the aliased destination to the new reporting chain (note: while this may not require dropping subscriptions, it still may require some form of initialization or materialization at the new intermediate site). While re-organizations do not (hopefully) occur on a regular basis, this reduces the IT workload significantly when they do.

Asynchronous Requests

In addition to parallel performance, another benefit of Multiple DSI's is as a substitute for asynchronous request functions. As stated earlier, request functions have the following characteristics:

• Designed to allow changes to be re-replicated back to the originator or other destinations.
• Require synchronization of accounts and passwords.
• Can incur significant performance degradation in any quantity, due to reconnection and transaction grouping rules.

Multiple DSI's natively allow the first point but by-pass the last two quite easily – because, in a sense, a request function is simply a twist on Multiple DSI's. The replicated request functions could simply be implemented as normal procedure replication, with the subscription held by an independent connection to the same database. In this way, transaction grouping for the primary connection is not impeded, and the individual maintenance user eliminates the administrative headache of keeping the accounts synchronized.

Cross Domain Replication

Although a topic better addressed by itself, perhaps one of the more useful applications of Multiple DSI's is as a mechanism to support cross-domain replication. Normally, once a replication system is installed and the replication domain established, merging it with other domains is a difficult task of re-implementing replication for one of the domains. This may be extremely impractical, as it disables replication for one of the domains during the process – and it is a considerable headache for system developers, as well as for those on the business end of corporate mergers who need to consider such costs as part of the overall merger costs. The key is that a database can participate in multiple domains simply by being "aliased" in the other domain, the same way as in the Multiple DSI approach – each domain has its own separate connection. Consider the following:
[Figure 106 - Multiple DSI Approach to Cross-Domain Replication (diagram: databases such as DS1.db1 and DS3.db1 participating in two replication domains by being aliased - e.g. DS1a.db1, DS3a.db2 - with one connection per domain)]

Once the concept of Multiple DSI's is understood, cross-domain replication becomes extremely easy. It is not, however, without additional issues that need to be understood and handled appropriately: it increases I/O in both systems and may require modifications to existing application procedure logic. In particular, any form of workflow automation involves data distribution concepts that are foreign to – and in direct conflict with – academic teachings. While some of this may be new to those who have never had to deal with it, and since cross-domain replication is a very plausible means of beginning to implement workflow, the following need to be understood:

Transaction Transformation – To the Sales system, it was a blue shirt for $39.95 to Mr. Ima Customer. To Shipping, it is a package 2x8x16 weighing 21 ounces going to 111 Main Street, Anytown, USA. To credit authorization, it is a single debit for $120.00 charged to a specific credit card account.
Transaction Division – While an order (say, an order for Mrs. Smith containing 10 items) may be viewed as a single logical unit of work by the Sales organization, the Shipping department may have several different transactions on record for the same order – for example, due to backorders or product origination.
Transaction Consolidation – Typically the two domains are involved in different business processes. If integrating Sales and HR, the integration may involve considerable function string or stored procedure coding to accommodate the fact that a $5,000 sale in one translates to a $500 commission for a particular employee in the other.
Number of Access Points – If the domains intersect at multiple points, replication of aggregates could cause data inconsistencies, as the same change may be replicated twice. This is especially true in hierarchical implementations.
Data Metamorphism – Replicating between domains may require adding tables simply to form the intersection between the two. Cross-domain replication may also involve an order of magnitude more difficult data transformation rules – not supportable by function strings alone – particularly given the amount of data transformation that may need to take place.
Messaging Support – For example, if Sales and Shipping were in two different domains, replicating the order directly – spanning multiple records – may be impractical. Instead, "queue" or "message" tables may have to be implemented, in which a "new order received" message is enqueued in a more desirable format for replication to the other domain (a sketch of such a table appears below).

Those familiar with Replication Server's function string capabilities know that a lot of different requirements can be met with them, and such "message tables" can be constructed to handle the simpler cases. However, it is crucial to establish that cross-domain replication should not be used as a substitute for a real message/event broker system where the need for one is clearly established – hence the advent and forte of Sybase Real Time Data Services and Unwired Orchestrator. As this topic is much better addressed on its own, not a lot of detail will be provided here.
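The following is a minimal sketch of what such a message/queue table might look like; it is not from the original paper, the table and column names are hypothetical, and a real implementation would carry whatever attributes the receiving domain actually needs.

    -- hypothetical cross-domain message/queue table (sketch only)
    create table order_msg_queue (
        order_number  int           not null,
        msg_type      varchar(30)   not null,   -- e.g. "new order received"
        msg_body      varchar(1900) not null,   -- payload pre-formatted for the other domain
        queued_at     datetime      not null,
        constraint order_msg_PK primary key (order_number, msg_type, queued_at)
    )
    go

    -- the order entry application (or a trigger) enqueues the message;
    -- the other domain subscribes only to this table
    insert into order_msg_queue (order_number, msg_type, msg_body, queued_at)
    values (12345, "new order received", "cust=Smith;items=10;total=120.00", getdate())
    go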
Integration with EAI

One if by Land, Two if by Sea…

Sybase's RS is a natural extension to messaging architectures – to the extent that any corporation with an EAI strategy that already owns RS should take a long and serious look at how to integrate RS into its messaging infrastructure (i.e. build an adapter for it). For good reason: remove the guaranteed commit order processing and provide transaction-level transformations and subscriptions, and Sybase's RS becomes a messaging system. Several years ago, Sybase produced the "Sybase Enterprise Event Broker" (SEEB), which did just that – it used Replication Server as a means to integrate older applications with messaging systems. Today, SEEB has been replaced by RepConnector (a component of Real Time Data Services), which is consequently the second-generation product for replication/messaging integration. The assumption for this section is that the reader is familiar with basic EAI implementations and architectures.

Replication vs. Messaging

Messaging is billed as "application-to-application" integration, while replication is often viewed as "database-to-database" integration. Often, system developers confuse replication and messaging – assuming they are mutually exclusive, or that messaging is some higher form of replication that has replaced it. Both views are equally wrong. The confusion usually arises because different people proselytize one solution over the other, ignoring the fact that they are entirely different solutions targeted at different needs. To straighten this out, let's take a closer look at the characteristics of each solution.

    Characteristic             Replication Server                           EAI Messaging
    Focus                      Enterprise/corporate data sharing at the     Enterprise/Internet B2B integration at the
                               data element level                           message / logical unit of work level
    Unit of Delivery           Transaction composed of individual row       Complete message - essentially an intact
                               modifications                                logical transaction
    Serialization              Guaranteed serialization to ensure           Optional - usually not serialized; the desire
                               database consistency                         is to ensure workflow
    Subscription Granularity   Row/column value                             Message type, content, addressees, time
                                                                            expiration, etc.
    Event Triggers             DML operation / procedure execution          Message transmission
    Schema Transparency        Row level with limited denormalization -     Complete transparency (requires an
                               similar data structures                      integration server)
    Speed/Throughput           High volume / low-NRT latency                Medium throughput / hours-to-minutes latency
    Message Format             LTL, SQL                                     XML, EDI, proprietary
    Implementation Complexity  Low to medium, with singular corporate       Medium to complex, with coordinated
                               administration and support                   specifications / disjoint administration and support
    Application Transparency   Transparent with isolated issues; primary    Requires rewrite to form messages; primary
                               transaction unaltered (direct to database)   transaction is asynchronous and may be
                                                                            extensively delayed

So which is "better"? The real answer is that it depends on your requirements. If you want a simpler implementation with NRT latency and high-volume replication to an internal system, Replication Server is probably the better solution. However, if flexibility is key – or if the target system is not strictly under internal control (i.e. a packaged application or a partner system) – EAI is the only choice. In general, EAI extends basic messaging with business-level functionality; the following summary illustrates how EAI extends basic messaging to include business-level drivers.
Replication Server natively provides guaranteed delivery; transmission encryption and system authentication via SSL; ANSI SQL/LTL as the message format; SQL transactions, DML operations (inserts/updates/deletes) and procedure executions as the message structures; individual database connections as the distribution mechanism; row/column value subscriptions for event detection; and definable exception actions (stop, retry, log). EAI messaging extends these basics with business-level capabilities such as:

• Guaranteed delivery with time limits, non-repudiation (return receipt), and delivery failure handling
• Message prioritization (relative priority, time constraints)
• Perishable messages (time expiration, subsequent message)
• Message security (sender/user authenticity, privacy)
• Protocol translation and custom protocol definition
• Interface standards (EDI, XML) and flexible message formats
• Distribution via hierarchical channels, broadcast, and addressee groups
• Flexible event detection: failure events (non-events), threshold events, state change events, user-requested events, and conditions on events
• Message filters
• Exception processing rules (expiration, retry, event) for corrupted or incomplete messages

Now then, let's consider the classic architectures and when each of these solutions might be the better fit.

    Scenario                                   RS   MSG   Rationale
    Standby system                             X          Transaction serialization
    Internal system to packaged application    ?    ?     Schema transparency, interface specification - possibly use both
    (such as PeopleSoft)                                   if an internal system - use RS to signal the EAI solution
    Two packaged applications                       X     Schema transparency, interface specification
    Corporate roll-ups / fan-out               X          Little if any translation required (ease of implementation);
                                                           transaction serialization from individual nodes
    Shared primary / load balancing            X          Little if any translation required (ease of implementation);
                                                           transaction serialization from individual nodes
    Internal to external (customer/partner)         X     Mutually untrusted system access; complicated by different
                                                           protocols, structures, control restrictions
    Enterprise workflow                        ?    ?     Possibly use RepConnector to integrate RS with EAI - business
                                                           viewpoint differences drive large schema differences, plus the
                                                           use of packaged applications (i.e. PeopleSoft Financials)

The real difference between the two – and the need for EAI – is apparent in a workflow environment. While RS supports some basic workflow concepts (request functions, data distribution, etc.), it is hampered by the need for similar data structures or for extensive stored procedure interfaces to map the data at each target location. To see how complex workflow situations can get, let's take the simple online or catalog retail example – the basic premise of a customer ordering a new PC.

Different Databases/Visualization

Within the different business units in the workflow, the "data" is visualized quite differently:

Order Processing Database – It's an HP Vectra PC costing $$$ for Mr. Jones, along with a fancy new printer.
HR Database – $$$ in sales at 10% commission for Jane Employee.
Shipping Database – It's 3 boxes weighing 70 lbs going to Mulberry St.

Obviously, you could conceive of more – Financials, Marketing, etc. HR really only cares about the dollar figure and the transaction date for payroll purposes, while Shipping cares nothing about the customer or the financial aspects of the transaction – in fact, the single record becomes three in its systems. The point is that a single transaction – which may be represented as a single record in the Order Processing database (and a single SKU) – has different elements of interest to different systems. Those familiar with replication know it would be a simple task to use function strings and procedure calls to perform this integration from a Replication Server perspective; however, that would require – in a sense – modifying the application (although this is highly arguable, as adding a few stored procedures that are strictly used as an RS API is no different than message processing).

Different Transactions

Additionally, a single business transaction in a workflow environment may be represented by different transactions at different stages of the workflow.
The list of transaction operations below is not couched in the terms of any one EAI product, but it is useful when considering the metamorphosis a single business transaction can undergo in a workflow system:

Transaction spawning – One order spawns a shipping request, which spawns a stock order. For example, if the purchase depletes the stock of an item below a threshold, that spawns an automatic re-ordering of the product from the supplier.
Transaction decomposition/division – One order becomes multiple shipments (due to backorders or multiple/independent suppliers). In this sense the order is not complete until each item is complete.
Transaction multiplication – One order touches Accounting, Marketing, Shipping, and so on: one transaction from the order entry system spawns a transaction to the financial system as well as to order fulfillment. In a sense this is multiplication, in that for each business transaction, N other messages/transactions result in the various workflow systems.
Transaction state – One order moves from booked to recognized revenue. In the financial system, the revenue is treated as "booked" but not credited yet. In the order fulfillment department, once the order has been shipped they in effect issue a response message to the order entry system stating that the order is complete; the shipping department's response also updates the state of the financial system – causing the credit card to actually be debited as well as changing the state of the revenue to "recognized".

The important aspect to keep in mind is that through each of these systems a transaction identifier is needed to associate the appropriate responses – for retail, this is the order number/item number combination. Some message content may also be derived over time (such as the daily high for a stock price), and workflow messaging may require challenge/response messaging as well as message merging (merging an airline reservation request, rental car request, and hotel reservation request into a single trip ticket for the traveler) over an extended period of time. Consequently, the life span of a message within a messaging system can be appreciable – unlike database replication, in which the message has an extremely short duration (barring recovery configuration settings).

Different Companies

Additionally, the workflow often requires interaction with external parties – such as credit card clearing houses and suppliers (hint: buy.com and amazon.com – neither one REALLY has that "Pentagon"-sized inventory). Interactions with external parties have their own set of special issues: you still want guaranteed transaction delivery (but the transaction may be changed); the systems have mutually untrusted access; and the exchange is complicated by different protocols, structures, control restrictions, and interface standards (EDI 820 messages, fpML messages, etc.). In addition to the external party complexities that Replication Server really can't address, the other aspect of external party interaction is that it often requires a "challenge/response" message before the workflow can continue – for example, the store needs to debit the credit card and receive an acknowledgement prior to the original message continuing along the path to HR and Shipping. In other words, some stages of the workflow become synchronous (i.e. the credit card debit) before the workflow can continue.

Finally, today we expect an email from any online retailer worthy of the name when our order is shipped. This becomes a simple task for RS, RepConnector and EAServer: a single-column update of the shipment status, picked up via a subscription on the shipment status field, could invoke a component in EAServer to extract the rest of the order details, construct an email message, and pass it to the email system for delivery. Alternatively, RS could use an RPC to add a job to an OpenServer or EAServer based queuing mechanism vs. having the systems constantly polling a database queue.
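As an illustration of the RPC approach, the sketch below replaces the default update function string for a (hypothetical) shipment-status replication definition with an RPC into a queuing procedure. The definition, procedure, and parameter names are all assumptions, and the function-string class shown would depend on the class actually assigned to the connection.

    /* hypothetical sketch - ship_status_repdef, enqueue_ship_event and the
       parameter names are illustrative */
    alter function string ship_status_repdef.rs_update
    for rs_sqlserver_function_class
    output rpc
    'execute enqueue_ship_event
        @order_num   = ?order_num!new?,
        @ship_status = ?ship_status!new?'
    go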
Integrating Replication & Messaging

Having seen that the two are distinctly different solutions, the next question is whether they are complementary – does it make sense to use both solutions simultaneously in an integrated system? The answer is a resounding "YES". The single largest benefit of integrating replication and messaging when both are needed (i.e. a Warm Standby within a workflow environment) is that legacy applications can be included in the EAI strategy without the cost of re-writing existing 3-tier applications, and without the response-time impact on front-end systems of adding messaging onto the transaction time. Existing systems can have extended functionality added without a major re-write.

Performance Benefits of Integration

The chief performance benefit of integrating the two solutions comes from eliminating the CPU- and process-intensive polling mechanism that is commonly used to integrate existing database systems into a new messaging architecture. Any polling mechanism that attempts to detect database changes outside of scanning the transaction log involves one of several techniques: timestamp tracking or shadow tables.

Timestamp tracking involves adding a datetime field to every row in the database. This field is modified with each DML operation, and the polling mechanism simply selects the rows that have been modified since the last poll period. This technique has a multitude of problems:

1. Deleted rows are missed entirely (they aren't there anymore, so there is no way to detect the modification via the date).
2. Multiple updates to the same row between polling cycles are lost. This could mean the loss of important business data.
3. An isolation level 3 read is required, which could significantly impact contention on the data as the shared/read locks are held pending completion of the read. Isolation level 3 is required to prevent row movement (deferred updates, primary key changes, etc.) from causing a row to be read twice.
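A minimal sketch of such a polling query is shown below; it is not from the original paper, and the table, column, and variable names are hypothetical. Note the holdlock (isolation level 3) semantics the text warns about, and that deletes and intermediate updates are still invisible to it.

    -- hypothetical timestamp-polling sketch run by the integration daemon
    declare @last_poll datetime, @poll_cutoff datetime
    select @last_poll   = dateadd(mi, -5, getdate()),  -- previous poll time (illustrative)
           @poll_cutoff = getdate()

    select order_num, status, last_mod_time
      from orders holdlock            -- isolation level 3 read; shared locks held
     where last_mod_time >  @last_poll
       and last_mod_time <= @poll_cutoff
     order by last_mod_time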
The second implementation – shadow tables – is a favorite of many integration techniques, including heterogeneous Replication Agents where log scanning is not supported. Here, triggers (or the application) record each modification in shadow tables, and the distribution mechanism reads and removes records from them. This implementation has a number of considerations (not necessarily problems, but they can have system impact):

1. Extensive I/O for distribution. A single insert becomes: a. insert into the real table(s); b. insert into the shadow table(s); c. insert into the transaction tracking table; d. distribution mechanism reads the transaction tracking table; e. distribution mechanism reads the shadow table(s); f. distribution mechanism deletes rows from the shadow table(s); g. distribution mechanism deletes rows from the transaction tracking table. Additionally, as the distribution mechanism reads or removes records from the shadow tables, it can result in contention with source transactions that are attempting to insert rows.
2. Lack of before/after images. If all that is recorded is the after image, there are issues with deletes, and critical information for updates is lost; otherwise the shadow table has to track before and after values for each column.
3. Lack of transactional integrity. Each table is treated independently of the parent transaction. Consequently, a transaction tracking table is necessary to tie the individual row modifications together into the concept of a transaction, and each operation (i.e. inserts into different tables) has to be tracked ordinally to ensure that RI is maintained, as well as serialization within the transaction.

This may not be much of a concern on a lightly or moderately loaded system; however, if the system is nearing capacity, this activity could bring it to its knees – particularly if the system is already involved in replication (i.e. a Warm Standby). As a consequence – even ignoring the cost/development benefits of an integrated solution – integrating Replication Server with a messaging system can achieve greater overall performance and throughput than simply forcing a messaging solution. The key areas of improved performance are:

• Reduced latency for event detection – Replication Agents work in near-real time, whereas a polling agent has a polling cycle and may take several minutes to detect a change.
• Reduced I/O load on the primary system – by scanning directly from the transaction log, the I/O load (and associated CPU load) of timestamp scanning or of maintaining shadow tables is eliminated for ASE systems.
• Reduced contention. (Shadow tables may still be necessary for heterogeneous systems.)

The conclusion is fairly straightforward: integrating replication with messaging may improve performance and throughput over using both individually – and over suffering the impacts that a database adapter can inflict.

Messaging Conclusion

This section may have appeared out of context with the rest of this paper. However, it was included to illustrate the classic point that better performance and throughput is sometimes a system-wide consideration, and a shift in architecture may achieve more for overall system performance than merely tweaking RS configuration parameters.

Key Concept #38: A corollary to "You can't tune a bad design" is "A limited architecture may be limiting your business".

Sybase Incorporated, Worldwide Headquarters, One Sybase Drive, Dublin, CA 94568, USA. Tel: 1-800-8-Sybase. www.sybase.com. Copyright © 2000 Sybase, Inc. All rights reserved. Unpublished rights reserved under U.S. copyright laws. Sybase and the Sybase logo are trademarks of Sybase, Inc. ® indicates registration in the United States. Specifications are subject to change without notice. All other trademarks are property of their respective owners. Printed in the U.S.A.