Exome Sequence Analysis and Interpretation
Handbook for Clinicians
1st Edition

Vinod Scaria
Sridhar Sivasubbu

Scaria and Sivasubbu (2015). Exome Sequence Analysis and Interpretation

Like us on Facebook https://www.facebook.com/clinicalexome

1st Edition (2015) Version 1.01
Scaria V and Sivasubbu S
Exome sequence analysis and interpretation

The entire surplus from the sale of this book will go to support advancing research in genomics.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Cover Image: Artist’s impression of nucleotides in a DNA strand. Oil on canvas by Pradha (2015)

Acknowledgements

A number of individuals have contributed to this book in personal as well as professional capacities. These include graduate students from our groups, especially Mr. Shamsudheen Karuthedath Vellarikkal, Mr. Rijith Jayarajan, Mr. Ankit Verma, Ms. Saakshi Jalali, Ms. Heena Dhiman and Mr. Kandarp Joshi, who helped in collating the content and figures that enrich the manuscript. The authors also thank and acknowledge the critical comments, editorial help and support of our colleagues, Dr. Vamsi Krishna, Dr. Adita Joshi, Dr. Srinivasan Ramachandran, Dr. Jameel Ahmad Khan and Dr. Abhay Sharma. The authors thank the Genomics for Understanding Rare Diseases- India Alliance network (GUaRDiAN) and collaborators for critical insights, which significantly enriched the outlook and content of this book. The authors thank the innumerable patients and families who have interacted with us through the network, without whom our insights and knowledge would have been limited.
The authors acknowledge the financial support from the Council of Scientific and Industrial Research (CSIR), India through grant BSC0212 (Wellness Genomics Project). The funding agencies had no role in the preparation of the content or the decision to publish this book. The authors declare no competing financial interests.

Dedication

Dedicated to the innumerable patients and families who enriched our knowledge and insight through their close interactions, shared their distress like a family member, and contributed samples to research selflessly, without whom we would not have been what we are, would not be doing what we do, and would not be writing what we wrote.

Contents

Contents
Foreword
Case of the Bhai
The human genome project and how it changed everything
Genome variations and how they make us different?
A brief introduction to next generation sequencing
When you could sequence your own genomes
So what if we could sequence just the protein coding genome?
When should you do exome sequencing?
When should you probably not do exome sequencing?
First things first: putting insights before data
Educating the patient and getting an informed consent
Points to note when you outsource exome sequencing
Understanding the steps in analysis of exome sequence data
How good is the exome sequencing data?
Prioritizing, annotating and interpreting variants
Don't forget the validation
Ethical considerations in whole exome sequencing
The last word
Index

Foreword

I would easily pick ‘Next Generation Sequencing’ as one of the techniques that had an immediate and immense application in research and healthcare. Within a span of five years, it has reached the point where no scientist or physician can afford to be ignorant of ‘exome sequencing’. With newspapers and the internet screaming ‘genome’ everywhere, this handbook by Dr. Scaria and Dr. Sivasubbu is timely. The introductory chapter on ‘Bhai’ is a story of ‘exome sequencing’ that is told lucidly, even for the general public. It is important for everybody, not just clinicians, to know what sequencing is and how the human genome project has improved our understanding of the role of genetic variants in health and disease. The authors then introduce readers to the exome, the clinical importance of sequencing it and the situations where this is helpful in patient care. At the same time they warn physicians not to get carried away. In the next chapter they explain the basics of medical evaluation and how they remain evergreen even in the current era.
It is important that the patient is not taken for a ride by new diagnostic companies which did not exist the previous year. Both the clinician and the patient must be aware of what they are doing with the new test and what they can expect in the form of results. Probably both need to be involved thoroughly in the consenting process. For the researcher, the authors explain why outsourcing is not easy despite the availability of several service providers, and detail in simple terms how the large data can be analyzed. The chapters on quality control and interpretation of variants help readers understand the intricacies of this technique. Independent validation of the results is vital to apply this technique in clinical practice, especially prenatal diagnosis. To conclude, the authors elegantly touch upon the ethical issues that cry for attention. The ‘Did you know’ text boxes spread throughout the book are simply highlights of genetic milestones or common terms that should be ‘general knowledge’. With their excellent medical and scientific background, and having pioneered this technique in our country both scientifically and socially, Dr. Scaria and Dr. Sivasubbu have done an incredible job of cracking the hard nut of ‘exome sequencing’, and the book is a ‘must read’ for all clinicians and students of genetics.

Girisha KM
Professor and Head
Department of Medical Genetics
Kasturba Medical College, Manipal
Manipal University

Chapter 1
Case of the Bhai

The day after the Indian genome sequence was announced[1], we received a phone call from an individual who introduced himself as Bhai[2]. A phone call from Bhai comes with a lot of connotations, with the popular imagination being that of the underworld calling for extortion.
Fortunately for us, Bhai was neither part of the underworld nor interested in extorting money, for he must have been aware that we were not millionaires to extort money from. Bhai nevertheless had a bigger problem at hand. He said that he had a skin problem and wanted us to talk to his doctor. On close discussion, it was evident that he suffered from an inherited genetic disease, which had affected multiple members of his family. Days later, we understood from his physician that his family suffered from a rare genetic skin disease called Epidermolysis Bullosa (EB). EB encompasses distinct disease subtypes with a variable severity ranging from localized lesions to a more extensive or generalized form. The disease is caused by defects in a number of genes, mostly involved in maintaining the integrity of the skin layers. Mutations in one of the genes would result in fragility of the skin, thereby causing eruptions on the skin resembling those that occur after burns. These eruptions or 'bullae' would sometimes break open, get infected and result in scarring and sometimes extensive pigmentation.

[1] The sequencing of the first genome of an Indian was announced on 8th December 2009. Source: http://www.pib.nic.in/newsite/erelease.aspx?relid=55470 Also published in: Patowary, Ashok, et al. "Systematic analysis and functional annotation of variations in the genome of an Indian individual." Human mutation 33.7 (2012): 1133-1140.

[2] Bhai in Hindi and Gujarati means brother. It is a popular surname attached to most Gujarati names. In colloquial terms, it is also sometimes used for an underworld don.

Bhai wanted the genetic lesion to be identified. This was a complicated task to begin with.
We had two options. The first was to systematically characterize the mutation by sequencing every single exon one by one using conventional sequencing approaches, which might have cost us a lot in terms of time, money and effort. The second was to use a genome scale approach, without a prior hypothesis, to sequence multiple genes in one go and then try to mine the variation from the haystack. A paradigm shift in the approach was on the anvil. We had worked extensively on setting up sequencing on a new technology that allowed us to sequence whole genomes, or specifically the parts of the genome that consist of protein coding genes[3]. We had also gained experience in systematically analyzing genome data for variants.

Did you know?
Epidermolysis bullosa is a rare genetic disease of the skin presenting with blisters on the skin. The disease runs in families and has an incidence of approximately 1 in 50,000 individuals.

[3] One technology to sequence the part of the genome that encodes proteins is called exome sequencing. The concept of this methodology forms the basis of this book and is detailed in the later chapters.

What is a pedigree chart?
A pedigree chart is a graphical document that details the ancestry of an individual. It is a very important tool to study the inheritance of diseases in a family over generations.

There was another technical issue. A close study of the pedigree revealed that almost half of the family members, spanning almost three generations, were affected with the disease, suggesting an autosomal dominant inheritance. That would essentially mean the variant in question would potentially be heterozygous[4]. Now, potentially heterozygous variations can be difficult to identify. On one hand, you would require enough coverage[5] to accurately call a heterozygous variation.
On the other hand, differentiating a potential causative variation from a number of other changes is a tedious and challenging task. There were also well-established workarounds for these problems. One approach was to sequence two affected individuals, see which set of variations overlap between the datasets, and prioritize the variations that could potentially change amino acids or functional regions. The other approach was to sequence one affected individual, use computational approaches to prioritize variations that change amino acid sequences, and then check for those variations in other family members, both affected and unaffected.

[4] The human genome is diploid. That means we have two copies of each chromosome, and therefore two nucleotides correspond to each position in the genome. If the two nucleotides are not the same, only one copy carries the variation. Such a variation is called heterozygous.

[5] Coverage here denotes the number of times a position in the genome has been read, or covered, during sequencing.

Did you know?
Keratin 5 (KRT5) is a cytoskeletal protein important for the integrity of skin. Mutations in the KRT5 gene can cause Epidermolysis Bullosa Simplex.

We decided to pursue the first path. We sequenced the protein-coding regions of the genes (exomes) in two affected individuals from two generations. Systematic overlap of the single nucleotide changes, and filtering for potential alterations that could have caused the disease, identified a variation in the KRT5 gene[6]. Fortunately, the gene KRT5 was previously associated with the disease. The variation was further investigated in a number of affected and unaffected individuals using conventional Sanger sequencing[7] of the region around the variation.
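The overlap strategy described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the study: the Variant record, the list of effects treated as protein-altering and the toy variants are all assumptions made for the example.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A simplified variant record (hypothetical, for illustration only)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    effect: str  # predicted consequence, e.g. "missense" or "synonymous"


# Effects treated here as potentially changing the protein (an assumed list).
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "splice_site"}


def shared_candidates(first, second):
    """Return protein-altering variants present in both affected individuals."""
    shared = set(first) & set(second)  # variants seen in both datasets
    return sorted(v for v in shared if v.effect in PROTEIN_ALTERING)


# Toy data: the positions and alleles below are made up.
affected_1 = [
    Variant("12", 1000, "A", "T", "missense"),
    Variant("12", 2000, "G", "C", "synonymous"),
    Variant("3", 500, "C", "T", "missense"),
]
affected_2 = [
    Variant("12", 1000, "A", "T", "missense"),
    Variant("12", 2000, "G", "C", "synonymous"),
    Variant("7", 900, "T", "G", "nonsense"),
]

for variant in shared_candidates(affected_1, affected_2):
    print(variant)
```

In this toy example only the shared missense change survives both steps. With real exome data the shared list would typically still contain many variants, which is why follow-up in other family members remains necessary.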
Interestingly, the same variation was present in all affected individuals but absent in all unaffected individuals tested, supporting our observation and conclusion that the variant is causative of the disease in the family.

[6] Vellarikkal, Shamsudheen K., et al. "Exome sequencing reveals a novel mutation, p. L325H, in the KRT5 gene associated with autosomal dominant Epidermolysis Bullosa Simplex Koebner type in a large family from western India." Human Genome Variation 1 (2014).

[7] Sanger sequencing is a molecular technology for sequencing nucleic acids, developed by and named after Fred Sanger. The conceptual methodology is detailed in the next chapter.

Chapter 2
The human genome project and how it changed everything

The quest to understand the sequence of DNA was pioneered by Frederick Sanger, who received the Nobel Prize in 1980 for the technique to determine it. This technique, popularly known as Sanger chemistry, is practiced to this day and is based on the concept that modified nucleotide bases can irreversibly terminate a DNA synthesis reaction wherever they get incorporated. The principle is simple. One could clonally amplify short stretches of DNA and use the single strands as templates for DNA synthesis. Apart from pure nucleotides, the synthesis mixture could be spiked with abnormal nucleotides, which are modified and labeled. These abnormal modified nucleotides, called di-deoxy nucleotides, terminate a synthesis reaction wherever they get incorporated by virtue of the complementary sequence in the template strand. This chain termination produces truncated products of different sizes. Each product differs from the next by one nucleotide, so the products can be separated using gel electrophoresis. Earlier, radioactively labeled bases were used, which enabled their detection using radiography, but later non-radioactive modifications were developed that allowed bases to be labeled either with specific fluorophores or light emitting molecules. The overview of the technology is summarized in Figure 1. This methodology was perfected in the 1970s, and it was not until a decade later that the technology matured and was automated, fuelling the quest to sequence genomes.

Did you know?
Frederick Sanger (1918-2013) received two Nobel prizes in Chemistry, one for the discovery of the amino acid sequence of Insulin in 1958, and the second in 1980 for the sequencing technology, which eventually was named after him. The Sanger Center, now the Sanger Institute at Hinxton, which took a lead role in the International human genome project, was named in his honour. The Institute is now one of the world’s largest genome centers.

The Sanger sequencing technology saw a number of improvements. The major improvement was the automation and miniaturization of the technique. This saw the birth of automated capillary sequencers. In capillary sequencers, electrophoresis happens inside capillaries and the electrophoresis bands are detected using lasers. The automation significantly increased the throughput of Sanger technology, enabling the sequencing of larger genomes, and it is popularly dubbed first generation sequencing.

Figure 1. Conceptual overview of the Sanger sequencing technology. The technology relies on the incorporation of labeled di-deoxy nucleotides and chain termination.

Figure 2. One of the earliest sequencers that used the Sanger sequencing methodology.
The readout was obtained from the vertical gel electrophoresis. Courtesy: The genomics museum at CSIR- Institute of Genomics and Integrative Biology, Delhi, India

Figure 3. Automated capillary sequencer. Courtesy: The genomics museum at CSIR- Institute of Genomics and Integrative Biology, Delhi, India

The planning for the ambitious Human genome project started as early as the year 1984, but the project was formally initiated, with appropriate funding, in the year 1990. The United States Department of Energy (DOE) and the National Institutes of Health (NIH) jointly funded the project. The project was planned to be completed in 15 years with a total outlay of approximately 3 billion US Dollars. Apart from the United States of America, the project also encompassed an International consortium, which included researchers from other countries including the United Kingdom, France, Australia, Japan and China.

Did you know?
Apart from the NIH led Human genome initiative, a parallel effort to sequence the human genome was initiated by a private company, Celera Genomics, led by Craig Venter. This was initiated in 1998 and was estimated to cost approximately 300 million US Dollars, far cheaper than the NIH led effort. The draft assembly was released and published in the year 2001.

The sequencing of the human genome involved quite a cumbersome procedure. Initially, the genome of 3.3 billion bases was broken down into small fragments, each of approximately 150,000 bases, and cloned into bacterial vectors. These were further maintained and replicated by the bacterial machinery for DNA replication. Each of these vectors was then sequenced and assembled independently, before the pieces were put together to assemble the chromosomes. This methodology came to be known as the hierarchical shotgun approach.
Meanwhile, almost halfway through the publicly funded human genome project, a company, Celera Genomics, was formed in the year 1998. The company used a radically different approach that involved sequencing both ends of short DNA fragments in a paired-end way, which had previously been used successfully to sequence small bacterial genomes. The company promised to complete the genome sequence at a much smaller cost of approximately 300 million US dollars and to compete with the International consortium.

Did you know?
The draft human genome was jointly announced by Bill Clinton, President of the United States of America, and Tony Blair, the British Prime Minister, on 26th of June 2000. The complete assembly of the genome was later announced on April 14th, 2003.

The first chromosome to be sequenced was chromosome 22, one of the smallest chromosomes in the human genome. The chromosome sequence was published in the year 1999.

Did you know?
Another noteworthy event that happened in this period was the Bermuda declaration of 1996, also known as the Bermuda principles for early access to DNA information. The declaration set out principles for the early release into the public domain of data generated by the International Human Genome Project. This was a significant shift from the well-practiced principle of releasing data only after publication in a peer-reviewed journal. This declaration formed the basis of pre-publication release of genomic data, which is widely practiced even today.

In June 2000, the draft human genome was announced by the then US President Bill Clinton jointly with the British Prime Minister Tony Blair. The papers corresponding to the publicly funded genome and the Celera assembly were published in 2001 in the journals Nature and Science respectively. Further improvements of the drafts were announced in the year 2003.

The Human Genome Project was unique in many ways.
In one way, it was a mega-project that involved a large number of researchers, not only from the United States of America, which led the project, but also from other countries across the globe, chiefly Britain, France, China and Japan. The major aim of the project was to provide researchers with a working template for the human genome and provide them with tools and resources to start understanding the basis of genetic diseases in humans. The computational tools and methods developed as part of the human genome project also significantly helped in the completion of the genomes of other organisms, including many model organisms like mouse, rat, zebrafish, worm and fly, which have been extensively used to understand human diseases.

Chapter 3
Genome variations and how they make us different?

The completion of human genome sequencing led to two parallel large endeavors to understand the human genome. One effort spearheaded the functional characterization of the genome in terms of identifying transcribed[8] and regulatory elements[9], whereas the second initiative focused on understanding genomic variability.

The human genome is quite large: over three billion letters, drawn from four nucleotides, Adenine (A), Thymine (T), Guanine (G) and Cytosine (C), placed on a string. Though the genome is quite similar between individuals, every one of us has changes, and this variability in the human genome sequence is what largely makes us different. The number of variations between individuals is quite large, approximately 3,000,000 or 3 million. Given the large size of the human genome, this is approximately one variation in every thousand bases.
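As a quick check of the arithmetic above, the density of variations follows directly from the two rounded figures quoted in the text. The constants below are those rounded figures, not measured values.

```python
# Rounded figures from the text (not measured values).
GENOME_SIZE_BASES = 3_000_000_000          # "over three billion" letters
VARIANTS_BETWEEN_INDIVIDUALS = 3_000_000   # approximately 3 million

# Average spacing between variations, in bases.
bases_per_variant = GENOME_SIZE_BASES / VARIANTS_BETWEEN_INDIVIDUALS
print(f"About one variation every {bases_per_variant:.0f} bases")
```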
Many of these variations do not have any impact on the functioning of the organism. Some of them are quite common in the human population[10]. We should also note that half of the genome is inherited from each parent[11]. The variations we inherit collectively make us look, behave and sometimes act like our parents. Therefore, many of the variations could be surrogates of features that we inherit. Geneticists call these features traits[12]. As you would also have guessed, there are innumerable human traits. Many times, cataloging these human traits is a complex and tough task.

[8] Protein coding genes are transcribed to messenger RNA and further translated to proteins.

[9] Regulatory elements are regions in the genome that regulate the expression of genes. They include promoters of genes and enhancers, among others.

[10] Variations that are common in the population, i.e., have a frequency of more than 1%, are popularly called polymorphisms. Single nucleotide variations that are polymorphic are therefore called single nucleotide polymorphisms or SNPs.

[11] The human genome is diploid and one copy of each chromosome is inherited from each parent.

[12] A trait is defined as a quality or feature, especially of an individual. This could be, for example, hair color, color of the eye, height etc.

Did you know?
The Celera project included DNA from 5 donors selected from a pool of 21 individuals. The founder, Craig Venter, was also part of the pool.

Understanding genomic variations and their association with human traits is by itself quite complicated. On one hand, we need to know the extent of genomic variability, whereas on the other hand, we need to know which variation or set of variations is associated with a particular trait. Sequencing a large number of individuals to understand genomic variability would be a herculean task due to the costs involved and the complexity of executing such a large project. But without a grasp of genomic variability and an understanding of how it could affect human traits, the fruits of genomics cannot be tasted.

Now there were shortcuts available. The first shortcut was that one could create a crude map of genomic variations by putting together information from multiple sources. The first source that scientists had laid their hands upon was the sequence data itself. The Craig Venter led genome assembly, popularly called the Celera assembly, was one large resource. Apart from that, scientists had also put sequences of smaller regions, sometimes genes and parts of genes, in the public domain, and this created the next resource. So there was something to start with.

The genome is not randomly inherited from a parent by the child. Genes are inherited as blocks of the genome, one from each parent. Hence, the variations too are inherited in blocks. So if someone could study common variations inherited in blocks, one could identify the blocks that are associated with a trait. Thus, we would be able to map the trait to the genomic region encompassing the block. So if one had a family in which a particular trait is inherited, say lack of the pigment melanin in skin, hair and eyes (leading to a condition called Albinism), one could theoretically study the blocks of the genome inherited by each child from each parent and observe whether the people who had Albinism all inherited the same block of the genome. This is a somewhat complex approach, which geneticists call linkage mapping. Since children inherit a large number of traits from their parents, differentiating each from one another becomes a humongous task.
However, the task becomes easier if one is lucky enough to identify large families with numerous affected individuals spanning multiple generations.

Now, as we mentioned above, you could just study common variations and the blocks of the genome that harbor them. These are called tag variations[13]. You could then study a small number of common variations to understand associations with common diseases. Well, before single nucleotide changes were employed, scientists used something simpler to tag genomic blocks. These were based on typing repeats in the genome. The locations of many of these repeats were common in the population, and one could use simple techniques such as polymerase chain reaction (PCR) to type these repeats and their lengths.

Did you know?
Polymerase chain reaction is a molecular technique developed in 1983 by Kary Mullis to amplify a piece of DNA. This technique bagged him the Nobel Prize for Chemistry in 1993.

[13] Tag single nucleotide polymorphisms (SNPs) are representative variations which mark a stretch of the genome.

Did you know?
The DNA samples in the HapMap project came from individuals of the Yoruba tribe in Ibadan, Nigeria, Chinese from Beijing, Japanese from Tokyo and people with European ancestry maintained at the Centre d’Etude du Polymorphisme Humain (CEPH) in France.

The technological advancements in optics and the miniaturization of components, including microprocessors and microelectronics, that followed the genome era also saw many of these being applied to study genomics. The earliest advance was the creation of microarrays, which in the last decade revolutionized genomics. Scientists learned that one could immobilize small fragments of DNA onto glass slides[14]. Now these small fragments of DNA could be used to identify single nucleotide variations, by the mere fact that a complementary nucleotide, if present, could hybridize effectively.
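The idea of typing a variation by hybridization can be illustrated with a small Python sketch. It is a deliberately idealized model, assuming that a probe gives a signal only when its exact complementary sequence is present in the sample; real arrays rely on differences in hybridization strength. The fragment sequences, probe set and function names are made up for illustration.

```python
# Idealized model: a probe "lights up" only on a perfect complementary match.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the strand that would pair with `sequence`."""
    return sequence.translate(COMPLEMENT)[::-1]


def hybridizes(probe: str, fragment: str) -> bool:
    """True if the fragment contains the exact complement of the probe."""
    return reverse_complement(probe) in fragment


# Two made-up versions of the same region, differing at one position (C or G).
REF_FRAGMENT = "TTGACCATGAC"
ALT_FRAGMENT = "TTGACGATGAC"

# One immobilized probe per allele, each complementary to its fragment.
PROBES = {
    "C": reverse_complement(REF_FRAGMENT),
    "G": reverse_complement(ALT_FRAGMENT),
}


def called_alleles(probes, fragments):
    """Return the alleles whose probe hybridizes to any sample fragment."""
    return {
        allele
        for allele, probe in probes.items()
        if any(hybridizes(probe, fragment) for fragment in fragments)
    }


print(called_alleles(PROBES, [REF_FRAGMENT]))                # one allele
print(called_alleles(PROBES, [REF_FRAGMENT, ALT_FRAGMENT]))  # both alleles
```

A sample in which both probes give a signal would be read as heterozygous at that position.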
This became a quick and popular assay for typing variations in the genome. Further advancements in miniaturization allowed such fragments of DNA to be packed onto slides at higher densities, thereby enabling a larger number of variations to be typed.

[14] This is popularly known as a microarray.

The ready availability of microarrays to study variations provided a huge impetus towards the understanding of genomic variations and their associations with human traits. These studies extensively used genome wide approaches to mark blocks in the genome and are popularly known today as genome wide association studies (GWAS). The later years saw the discovery of a large number of variations and their associations with human traits and diseases. This approach continues to yield fruitful dividends in gathering genomic variations and their associations with traits and diseases. A number of global initiatives to map genomic blocks and their associations have provided us with a map of regions in the human genome associated with distinct human traits and diseases in various populations[15]. These efforts were notably the first popular approaches to collect genomic variations associated with human diseases.

[15] Welter, Danielle, et al. "The NHGRI GWAS Catalog, a curated resource of SNP-trait associations." Nucleic acids research 42.D1 (2014): D1001-D1006. A visual representation of this map is available at URL: http://www.genome.gov/gwastudies/

Now coming back to the case of Bhai. While genome wide association studies were moderately effective in mapping genomic blocks associated with common diseases and traits, these approaches were futile in the case of rare genetic diseases. This was primarily because genome wide association studies relied on common variants and common traits, whereas rare genetic diseases are caused by rare variants. In the earlier sections, we had mentioned an approach using repeats, called microsatellites[16]. Microsatellite based studies were the mainstay in mapping genes associated with such rare diseases, but they were often cumbersome, time consuming and costly, and their success was heavily dependent on identifying large families. A typical microsatellite study in a standard molecular biology laboratory would take months for data generation and analysis, and the expertise and infrastructure required precluded its widespread application in clinical settings.

[16] Microsatellites are also called Simple Sequence Repeats (SSRs) or Short Tandem Repeats (STRs). They encompass small stretches of 2-5 nucleotides which occur in tandem.

Chapter 4
A brief introduction to next generation sequencing

After the announcement of the human genome sequence, a silent revolution was taking shape on the technology front. A number of researchers were working hard to enable quick and cheap sequencing of nucleotides. Traditional Sanger sequencing lacked the speed and cost effectiveness needed to sequence genomes routinely. Research labs around the globe were approaching the problem in a variety of ways. The field also saw the convergence of technologies from multiple areas including nanotechnology, microelectronics and computing. These efforts led to the emergence of a spectrum of approaches, each different in its principle and with its own set of limitations and advantages, but similar in the goal of providing cheap, fast and high throughput sequencing of nucleotides.
These technologies came to be popularly known as next generation sequencing (NGS), differentiating them from the first generation sequencing technology, which comprised automated Sanger chemistry. Briefly, next generation sequencing refers to a gamut of sequencing technologies that differ from conventional Sanger sequencing in the technology employed, in their significantly higher throughput of sequence generation, in the quality of the sequencing and in the reduced per-base sequencing cost.

One of the earliest NGS technologies was massively parallel signature sequencing (MPSS), developed by a company called Lynx Therapeutics as early as the year 2000. The MPSS technology is no longer in commercial use and is now mainly of historical importance. One of the first commercial offerings in the NGS space came from 454 Life Sciences. The commercial 454 sequencers were launched in the year 2004. These systems used the pyrosequencing approach to sequence nucleotides. Short fragments of DNA were captured on beads and clonally amplified in an emulsion covering the beads. The beads were then deposited onto picotitre plates. Nucleotides were added in cycles, and each incorporation released a pyrophosphate that was detected by imaging the wells of the plate, thus enabling scalability to sequence millions of short stretches of nucleotides. The technology became quite popular due to its longer read lengths and high quality data.

Did you know?
The pyrosequencing methodology relied on the release of a pyrophosphate with each nucleotide addition. ATP sulfurylase acts upon this pyrophosphate and produces ATP in the presence of adenosine 5´ phosphosulfate. This ATP drives the luciferase-mediated conversion of luciferin to oxyluciferin, which generates light that is captured by the camera.
The 454 sequencing technology was eventually acquired and marketed by Roche Diagnostics. The other two technologies that came to the commercial space were the SOLiD technology marketed by Life Technologies and the reversible termination sequencing technology developed by Solexa, which was acquired by Illumina in the year 2007 and improved upon thereafter. The SOLiD technology, which stands for sequencing by oligonucleotide ligation and detection, employed emulsion PCR to amplify short stretches of DNA and a ligation-based chemistry to sequence them. The first commercial SOLiD sequencers were launched in the year 2007.

Though methods like massively parallel signature sequencing and colony sequencing were the forerunners of the modern and more popular NGS approaches, many of these technologies are no longer in vogue and are now primarily of historical interest or have very specialized applications. Nevertheless, the associated tools and methods, including miniaturization, massive parallelization and methods for assembling short sequences, still form the conceptual mainstay of the field. These methodologies are detailed in a later section of this book.

One of the most popular and field tested technologies in practice to date is the one developed by Solexa. As legend goes, a couple of British scientists met at a bar in Cambridge over a pint of beer to chalk out a better chemistry to sequence nucleotides in high throughput. This informal meeting at the Panton Arms, dubbed by many the Beer Summit, was where the most popular next generation sequencing technology was chalked out. Shankar Balasubramanian and David Klenerman put together their expertise in chemistry and laser detection to develop the reversible terminator based sequencing technology. The startup Solexa gave flesh to their concepts, and the Genome Analyser, a commercial bench top next generation sequencer, was born.
The basic technology could be summarized as follows. Short pieces of DNA are captured on a solid glass surface using small adapters, and these stretches are amplified on the slide to produce clonal clusters of DNA. These clonal clusters of single stranded DNA are then used as templates for DNA synthesis, cycle by cycle. In each cycle, a nucleotide carrying a fluorophore is added, and the addition is recorded by imaging the slide. The fluorophore is then removed, and the cycle is repeated along the entire stretch of the DNA template. The series of recorded images is then analyzed computationally to reconstruct the sequence of each stretch of DNA: the computer goes through the images systematically, cycle by cycle, and reconstructs the order of nucleotides from the fluorophore that lit up at each cycle.

Figure 1. Overview of the Illumina NGS sequencing methodology

A number of other technologies and conceptual methodologies also emerged in the later years. Of note are the technologies developed by Helicos Biosciences17, Pacific Biosciences and Ion Torrent. The Helicos sequencer was released in the year 2009 but did not become very popular. The company later filed for bankruptcy, consigning the technology to oblivion. Ion Torrent used a conceptually different technology, based on detecting changes in pH on semiconductor chips. The sequencer was released in the year 2011, and the technology and product were acquired by Life Technologies. Pacific Biosciences also released a commercial sequencer in the year 2011, based on single molecule sequencing chemistry without amplification.
The technology has many advantages compared to others: the single molecule chemistry obviates the PCR bias incurred in other sequencing methodologies and, in addition, provides very long reads, sometimes extending to kilobases. Such long reads have important applications, such as the detection of structural variations. Nevertheless, the technology has not found widespread application in regular clinical settings, though it is quite popular among the research community, especially in laboratories working on genomes that are difficult to assemble. A number of newer technologies are presently on the anvil and not yet available in the commercial space, including nanopore sequencing, which uses protein nanopores to detect nucleotide bases.

17 Helicos Biosciences was co-founded in the year 2003 and imaged individual DNA molecules. It also featured a chemistry, dubbed 'Virtual terminator', which prevented the incorporation of multiple nucleotides in each cycle.

Figure 2. The Illumina HiSeq 2500 Next Generation Sequencer. Courtesy: CSIR Institute of Genomics and Integrative Biology, Delhi.

Figure 3. The Ion Torrent Proton Sequencer based on semiconductor chips. Courtesy: CSIR Institute of Genomics and Integrative Biology, Delhi.

Chapter 5
When you could sequence your own genomes

Next generation sequencing was like a tsunami. Though the early adopters of the technology saw its huge potential, many of the traditionalists were quite slow to realize that potential and what it meant for the future. Sanger sequencing was entrenched in many clinical laboratories and was widely acclaimed for its reliability, quality and ease of use, with automation being standard.
During the early years, commercial next generation sequencing platforms were fraught with frequent machine downtimes and short read lengths, which practically limited their applications, and they usually produced reads of lower quality than traditional Sanger sequencing. Nevertheless, these technologies provided the much-needed throughput to enable whole genome sequencing in a foreseeable trajectory.

Did you know?
Gordon E Moore, one of the co-founders of Intel, predicted that the density of transistors in an integrated circuit would double every two years. This is commonly known as Moore’s law.

The revolution in technology and the resultant scale and throughput was phenomenal, so much so that at one point the pace at which sequencing technology improved in terms of throughput and cost reduction was comparable to Moore’s law in the case of microprocessors. The phenomenal increase in throughput and the accompanying fall in cost are depicted in Figure 1.

Figure 1. The dwindling cost of whole genome sequencing over the years. The X-axis denotes the timeline, and the Y-axis denotes the costs in US$ on a logarithmic scale. Data from http://www.genome.gov/sequencingcosts/ Retrieved Feb 04, 2015

What came next was the race to sequence human genomes. The first, of course, were the stalwarts themselves, Watson and Venter, who sequenced their personal genomes and made them available. What came out of the sequencing was an astounding number of novel variations that had not been reported before. The years that followed saw large genome centers shift drastically to next generation sequencers and rapidly adapt themselves to the avalanche of data. There were a few new players as well, notably the Beijing Genomics Institute, which at one point in time was the largest genome facility, with over a hundred next generation sequencers.
The rapid technological advancements during this period led to a major paradigm shift, making genome sequencing accessible to small research labs. For the first time, the power of genomics was being shared and tasted by less well endowed laboratories, which did not have the wherewithal to own and operate a large inventory of sequencers and computing infrastructure, let alone trained technicians and analysts. Countries like India, which were not at the forefront of technology during the initial human genome sequencing initiative, were quick to adopt next generation sequencing.

What followed was a flurry of human genome sequencing announcements from across the world. The Chinese announced the Han Chinese genome sequenced by the Beijing Genomics Institute, while the Japanese announced the Japanese genome and the Koreans announced the Korean genomes. India was not far behind. The team from the CSIR funded Institute of Genomics and Integrative Biology (CSIR-IGIB), Delhi announced the first Indian genome. The flurry of genome announcements continued: the African genomes, Sri Lankan, Malaysian, Russian, and so on. Those were exciting times! We would pore over the online announcements of sequenced genomes, which were appearing almost every month, and see if we could put them together to derive scientific insights.

Being associated with the Indian genome sequencing activity was a humbling experience. While it taught us much of the nuts and bolts of genome sequencing and analysis, it also provided immense insights into how genome sequencing could be applied in clinical practice. The costs of genome sequencing were also dwindling drastically, and the thousand dollar genome and its promises were widely discussed.
While sequencing the whole genome of an individual would reveal approximately three million variations, the computational pipelines and datasets available for analysis can functionally annotate only a small portion of these variations. This is primarily because the functional annotation of variations depends on computational methods that predict whether a variation can change the protein sequence and structure, and thereby its function. This essentially means that the bulk of functional annotation can be done only for variations that fall in the protein coding regions of the genome. This is detailed in the next chapter. Having said this, it should also be emphasized that methodologies to functionally annotate and prioritize variations in regions of the genome that do not code for proteins also exist, though they have not been as popular. Some of the early methodologies for prioritizing variations in non-protein coding genes have come out of our own laboratories. In addition, a number of newer methodologies to annotate functional variations in regulatory regions of the genome also exist and have been widely used in the literature.

Figure 2. The CIRCOS representation of the first Indian genome announced in 2009 and the title page of the publication. (Patowary et al. "Systematic analysis and functional annotation of variations in the genome of an Indian individual." Human Mutation 33.7 (2012): 1133-1140.)

Chapter 6
So what if we could sequence just the protein coding genome?

The previous chapter discussed the limitations in the analysis of whole genome data. So the natural question is this: if the present methods of functional annotation are largely limited to just the protein coding regions of the genome, then why not sequence just this part?
Such an approach has the potential to significantly reduce the cost of sequencing, to ease the handling and analysis of data, and possibly to be implemented in clinical practice to aid diagnosis. This is popularly called exome sequencing.

An exome is defined as the protein-coding region of a genome. In the human genome, the exome is estimated to be approximately 1% of the genome, or roughly 30 million bases. Since proteins are the major workhorses in the cell that modulate biological functions and outcomes, sequencing just the protein coding region of the genome offers a quick and cost effective solution to screen for genetic mutations. A number of approaches have been on the anvil to extract and sequence just the protein-coding regions in the genome.

Three major approaches are popularly employed to extract specific regions of the genome (also known as targets) for sequencing. One approach is to amplify the specific regions in question using the standard polymerase chain reaction (PCR). Usually, the reactions are multiplexed and involve pools of primers that amplify selected regions of the genome. The PCR products can then be pooled together and sequenced. This approach is widely used to amplify smaller regions of the genome, but it does not scale accurately to larger targets such as whole exomes, because identifying optimum sets of PCR primers with comparable efficiencies and high specificity is challenging given the complexity of the human genome.

Figure 1. Conceptual outline of the gene structure with exons, introns and the untranslated regions. The blue regions denote the protein-coding regions, and the yellow regions denote the untranslated regions. The transcript is spliced to form the messenger RNA and then translated to functional protein.
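The exome size quoted in this chapter can be checked with simple arithmetic. The short Python sketch below is only an illustration: the genome size is rounded to three billion bases, and the sequencing depth is a hypothetical value chosen for the example, not a recommendation from this book.

```python
# Back-of-the-envelope exome arithmetic (illustrative values only).
GENOME_SIZE_BP = 3_000_000_000  # assumption: human genome rounded to ~3 billion bases
EXOME_FRACTION = 0.01           # from the text: the exome is roughly 1% of the genome


def exome_size_bp(genome_size_bp=GENOME_SIZE_BP, fraction=EXOME_FRACTION):
    """Approximate number of bases in the protein-coding part of the genome."""
    return round(genome_size_bp * fraction)


def bases_to_sequence(target_bp, mean_depth):
    """Total bases that must be read to cover a target at a given mean depth."""
    return target_bp * mean_depth


exome_bp = exome_size_bp()
hypothetical_depth = 100  # assumption: each target base read about 100 times
print(f"Exome size: about {exome_bp:,} bases")
print(f"Bases to read at {hypothetical_depth}x: {bases_to_sequence(exome_bp, hypothetical_depth):,}")
```

With these rounded inputs the exome works out to about 30 million bases, matching the estimate in the text.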
Another popular approach has been the specific capture of DNA corresponding to the regions in question. This technique exploits the principle of base pair complementarity to isolate specific regions in the genome. The capture reaction involves pools of single stranded oligonucleotides attached to a solid support, either beads or a glass surface. These oligonucleotides are designed to be complementary to the regions, or targets, that need to be captured. Briefly, the genome is fragmented using ultrasound or specific enzymes, known as restriction enzymes, that cut the DNA at specific recognition sites. This produces DNA fragments of approximately comparable sizes. The strands are then denatured, and only the fragments complementary to the capture oligonucleotides are isolated from the pool, thus enriching the regions that fall in protein coding regions relative to the rest of the genome. The captured targets are then processed for whole exome sequencing following standard protocols. An overview of the two popular approaches to enrich for protein coding regions in the genome is given in Figure 2.

Though the approach seems simple and logical, exome sequencing also has its share of limitations. The first limitation is that, by design, it misses genomic variations falling outside the protein coding regions, many of which are functional. The best examples are promoter variations, which change the expression of specific genes, and regulatory variations in the untranslated regions, which are known to modulate the expression of genes and the stability of transcripts.

Figure 2. Conceptual outline of the two popular methodologies for capturing specific regions in the genome. The first methodology involves capture probes immobilized on solid surfaces, while the second approach involves probes immobilized on beads.
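The principle of capture by base pair complementarity described in this chapter can be illustrated with a toy example. The Python sketch below is a deliberate simplification: the probe and fragment sequences are hypothetical, and a fragment is treated as captured only if it contains an exact reverse complement of a probe, whereas real hybridization tolerates some mismatches and uses much longer probes.

```python
# Toy model of hybridization capture: keep fragments complementary to a probe.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.translate(COMPLEMENT)[::-1]


def capture(fragments, probes):
    """Return the fragments that contain a site complementary to any probe."""
    captured = []
    for fragment in fragments:
        for probe in probes:
            # A probe binds the strand that carries its reverse complement.
            if reverse_complement(probe) in fragment:
                captured.append(fragment)
                break
    return captured


probes = ["ATGGCC"]  # hypothetical capture probe
fragments = ["TTGGCCATAA", "CCCCCCCCCC", "AAGGCCATTT"]  # hypothetical denatured fragments
print(capture(fragments, probes))
```

In this example the first and third fragments are retained because both carry the site GGCCAT, the reverse complement of the probe, while the second fragment is washed away.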
Figure 3. Conceptual overview of the major steps in the primary analysis pipeline, which involves sequence quality checks and the alignment of high quality reads to the reference genome.

The second major caveat of exome sequencing is that specific types of variations cannot be accurately typed. The best examples are chromosomal abnormalities with no net change in copy number, such as translocations and inversions. Since the capture methodology enriches specific stretches of the genome without retaining the genomic context they came from, it would be impossible to decipher such events unless the breakpoint occurs within a protein coding region, as in the case of the well-studied PML-RARA translocation in leukemia. Though new computational tools enable the characterization of copy numbers from exome sequencing data, it should be emphasized that exome sequencing is still not the most accurate methodology for detecting chromosomal abnormalities that involve a copy number change.

These limitations aside, sequencing just the protein coding part of the genome has its advantages. The first is the cost, which is significantly lower than that of whole genome sequencing. The second is the relatively small amount of data, which is easier to handle and less complex to analyze, without reliance on the huge computing infrastructure required to analyze whole human genomes. The third is the ready availability of methods and tools, including online resources, to analyze the data systematically, which makes analysis and interpretation a bit easier for clinicians.

Chapter 7
When should you do exome sequencing?

So the obvious next question would be: when should I do exome sequencing?
Let us go back to the case of Bhai. The molecular diagnosis and confirmation of the disease would require sequencing approximately 20 amplicons using the Sanger sequencing approach in a traditional diagnostic setup. Standardizing the PCR amplicons and performing the sequencing is a tedious, time-consuming and sometimes expensive proposition, which makes the accurate molecular characterization of many diseases a challenge. Diseases of this kind, where confirmation requires sequencing many amplicons, form the first scenario.

The second is a scenario where there are a number of differential diagnoses. There are many examples of such cases in regular clinical settings. In such situations, the accurate molecular characterization and diagnosis of the disease would require sequencing multiple loci and genes, which, as in the previous situation, might become tedious, time-consuming and expensive.

Advantages of whole exome sequencing in clinical settings
1. Fast: 1-4 weeks turnaround time
2. Holistic, as it covers the majority of known disease causing gene loci
3. Cheaper in specific situations

Exome sequencing is a new alternative approach in such scenarios for a number of reasons. Exome sequencing is quite fast, with commercial turnaround times in the range of weeks rather than months. The approach is holistic, in the sense that it covers a majority of the genes involved in Mendelian diseases. In addition, in many cases where a number of genes or exons must be examined for a confirmatory diagnosis, it might be cheaper than traditional approaches.

The third scenario is where there is no diagnosis and the presentation is quite rare, or there are multiple affected family members, or the situation involves consanguinity. After exclusion of chromosomal abnormalities and structural variations, exome sequencing might be an interesting approach to follow in such situations.
The fourth, and probably the commonest, case where exome sequencing is warranted is when a definitive clinical diagnosis has been made, but the specific variant or variants known to be associated with the disease are found to be unaltered on testing. This would hint towards the involvement of a novel variant or a new gene locus, which would benefit significantly from a holistic approach like exome sequencing.

The fifth situation where exome sequencing would be extremely beneficial is where a specific molecular diagnosis is expensive or not available in the local setting or country, or where the timelines for diagnosis would not be met by a conventional approach. Exome sequencing in such cases would be useful on the economic front as well as on grounds of speed and efficiency.

The sixth scenario is the case of undiagnosed diseases with a clear or suggestive genetic cause. A number of international studies have suggested that whole exome sequencing is a useful means of arriving at a definitive diagnosis in cases of undiagnosed diseases. Specific programs and studies have used exome sequencing extensively to identify undiagnosed or rare diseases. These have provided insights and a diagnosis for a significant number of cases in their cohorts.

The GUaRDIAN Consortium
GUaRDIAN stands for Genomics for Understanding Rare Diseases-India Alliance Network. It is a consortium and network of clinicians, clinical geneticists and genomics researchers formed with the aim of using the power of genomics to understand the molecular basis of rare genetic diseases. More information on the consortium and how it could help you is available online at URL: http://guardian.meragenome.com

There are a number of research settings that would benefit significantly from exome sequencing.
These are especially cases of genetic diseases with atypical presentations, or with features additional to those of an otherwise clinically diagnosed condition, where the possibility of finding novel variants and novel loci exists.

The other research application of exome sequencing in clinical settings is in understanding the genetic basis of rare genetic diseases. A number of recent studies have shown that exome sequencing and whole genome sequencing are appropriate genomic tools for the molecular dissection of rare genetic diseases and the discovery of novel mechanisms and gene loci involved in them.

Figure 1. The quadrant where the optimum use of whole exome and genome sequencing is recommended.

In addition, as depicted in Figure 1, exome sequencing has found its place in the discovery of rare mutations with large effect sizes and of genetic loci associated with common diseases. Exome sequencing has recently also been used extensively to discover rare variants associated with common diseases. This has largely been made possible by sequencing individuals at the extreme ends of the spectrum. A number of recent reports have shown that this approach is powerful and could provide a new opportunity to understand genetic variants with large effect sizes.

Chapter 8
When should you probably not do exome sequencing?

It should be noted that exome sequencing is not a magic bullet that can enable the diagnosis of all genetic diseases; nevertheless, it should be considered a new technological advancement that can provide valuable insights to aid the diagnosis of a majority of genetic diseases. Exome sequencing is not without caveats.
These limitations should be clearly understood so that the expectations from whole exome sequencing remain realistic.

The major caveat is that the approach can only identify variations in the protein coding regions of genes. A number of genetic diseases are known to be caused by mutations in non-protein coding regions, including non-coding RNAs. Most of the newer exome sequencing panels also include untranslated regions, promoters and, in some cases, non-coding RNA genes. It should also be noted that many diseases are caused by variations in introns and splice junctions. Variations lying deep within introns, in particular, might not be captured by a typical exome capture panel. So a clear distinction and an informed decision are warranted before selecting exome sequencing as a method of diagnosis for such diseases.

Contrary to expectations, not all genes are captured in typical exome sequencing. A number of exons that encompass repeats, that are rich in Gs and Cs, or that lie in repeat-rich regions cannot be accurately captured and resolved by exome sequencing.

Exome sequencing is not useful in diseases associated with chromosomal abnormalities and structural variations in the chromosomes (with very few exceptions). A large number of syndromes involve large chromosomal abnormalities, including copy number and structural abnormalities. The capture methodology precludes the identification of such chromosomal abnormalities, especially those that are not associated with a net change in copy number. The exceptions are rare and mainly involve breakpoints that fall within a protein-coding gene. Though standard pipelines for exome analysis are built to analyze single nucleotide variations and insertion-deletion events, newer and specialized pipelines are presently available to detect copy number changes in chromosomes and breakpoints.
It should be noted that such analyses are still in the research domain and have not been extensively applied in clinical settings.

A number of diseases are caused by repeat expansions. The best-studied examples include Huntington's disease and some spinocerebellar ataxias. The exome sequencing approach is not very effective in diagnosing such diseases. This limitation arises primarily from the fact that most next generation sequencers are not able to accurately resolve repeats, especially simple repeats.

A number of diseases are caused by mutations in mitochondrial genes, which show a unique feature called heteroplasmy: mitochondrial genomes carrying different variations are present in the same cell. Standard exome capture and analysis methodologies largely ignore the mitochondrial genome, though some capture methodologies do systematically capture mitochondrial variations. In addition, pipelines for the analysis of mitochondrial variations are also available. If you suspect a mitochondrial disease with a maternal pattern of inheritance, it would be worthwhile to start with mitochondrial sequencing. It is also worth mentioning that not all mitochondrial abnormalities are caused by variations in the mitochondrial genome. The products of a number of nuclear genes are imported into the mitochondria, and mutations in these genes could also manifest as mitochondrial abnormalities, albeit with a Mendelian pattern of inheritance.

A handful of rare diseases are caused by uniparental disomy. Usually two copies of the genome are inherited, one from each parent. In some situations, both copies of a chromosome or chromosomal region are inherited from the same parent. Typical exome sequencing of the affected individual alone would not be able to identify whether a mutation came from one parent or both.
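To make the idea of heteroplasmy mentioned in this chapter concrete, the level of heteroplasmy at a mitochondrial position is commonly summarized as the fraction of sequencing reads that carry the variant. The Python sketch below shows only this arithmetic; the read counts are hypothetical, and a real analysis would also need to account for sequencing errors and for nuclear copies of mitochondrial sequences.

```python
def heteroplasmy_fraction(ref_reads, alt_reads):
    """Fraction of reads at a mitochondrial position that carry the variant."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / total


# Hypothetical read counts at a single mitochondrial position.
fraction = heteroplasmy_fraction(ref_reads=850, alt_reads=150)
print(f"Estimated heteroplasmy: {fraction:.0%}")
```

A fraction that is neither close to 0 nor close to 1 indicates that mitochondrial genomes with and without the variant coexist in the sample.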
Chapter 9
First things first: putting insights before data

"Chance favors a prepared mind" -Louis Pasteur

The diagnosis of a disease is only as good as the clinical work up you have done on the patient. Before prescribing exome sequencing, you should have your options set and know exactly what your expectations are. Exome sequencing is not a panacea for all the limitations of genetic diagnosis.

Before you decide on exome sequencing, collect the following information
Complete detailed family history and pedigree
Complete list of clinical phenotypes and results of clinical investigations
A complete list of differential diagnoses

A complete family history and pedigree

Let’s come back again to the case of Bhai. In the initial conversations with the primary physician and Bhai himself, the only information that could be gleaned was that only members of his immediate family and close relatives were affected. Multiple encounters and a close study of his distant family tree over several visits and trips revealed that the disease was running in a much larger family, scattered over cities. Multiple coordinated attempts put together a comprehensive family tree, which revealed that the disease had been running in the family for generations and involved more than a dozen affected members. The family still remains the largest reported family affected with Epidermolysis Bullosa in India.

There is nothing better than a detailed family history and pedigree to help clinch a clue and assist to a great extent in arriving at the right diagnosis. The index case or parents might not be quite forthcoming on the family history, or in many cases might not be aware of the family history of the disease.
It would be worthwhile to spend some time closely with the patient or other members of the family and collect detailed information on all their relatives and their health status, including diseases, medications, deaths and causes of death, miscarriages, abortions, stillbirths and deaths in the early neonatal period and childhood. Consanguinity18 is another key question. In many cases the family might not be quite forthcoming about consanguinity, as it is sometimes a norm in many communities. Often all the relevant information cannot be gathered in a single sitting, as the patients or parents might not be aware of the facts or might not recollect them. So it would be useful to gather the details over multiple interactions. If the patient or parents are not educated, it would sometimes also be necessary to ask pointed, but not suggestive, questions regarding the diseases, deaths and causes thereof in the family.

18 Consanguinity means shared kinship or blood relation.

A good detailed pedigree permits one to hypothesize the mode of inheritance of the disease, which is extremely useful for prioritizing variations in the exome data. For example, the family tree of Bhai helped us clinch a diagnosis of autosomal dominant Epidermolysis Bullosa. The genetic variant is thus expected in the heterozygous state in the exome, which meant that candidate variants could easily be prioritized by sequencing two affected members of the family. A detailed chapter on prioritizing variants is available in the later part of this book. Similarly, a consanguineous marriage would suggest the possibility of a recessive19 disease and would also suggest mapping regions of homozygosity (this is described in the later chapters as a methodology to prioritize variations after exome sequencing).
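The way a hypothesized mode of inheritance narrows down the variants can be sketched in a few lines of Python. This is a minimal illustration, not the prioritization method described later in this book: the function name, the person labels and the variant identifiers are all hypothetical, and the sketch assumes full penetrance and error-free genotype calls.

```python
def shared_candidates(genotypes_by_person, affected, unaffected, model):
    """Variants consistent with a simple inheritance model.

    genotypes_by_person maps each person to {variant_id: "het" or "hom"};
    a variant that is absent from a person's dictionary is not carried.
    """
    wanted = {"dominant": "het", "recessive": "hom"}[model]
    all_variants = set()
    for calls in genotypes_by_person.values():
        all_variants.update(calls)

    candidates = []
    for variant in sorted(all_variants):
        # Every affected member must carry the variant in the expected state.
        in_all_affected = all(
            genotypes_by_person[p].get(variant) == wanted for p in affected
        )
        if model == "dominant":
            # Unaffected members should not carry the variant at all.
            clear = all(variant not in genotypes_by_person[p] for p in unaffected)
        else:
            # Unaffected members may be carriers but not homozygous.
            clear = all(genotypes_by_person[p].get(variant) != "hom" for p in unaffected)
        if in_all_affected and clear:
            candidates.append(variant)
    return candidates


# Hypothetical genotype calls for three family members.
calls = {
    "affected_1": {"varA": "het", "varB": "het", "varC": "hom"},
    "affected_2": {"varA": "het", "varC": "hom"},
    "unaffected_1": {"varB": "het", "varC": "het"},
}
print(shared_candidates(calls, ["affected_1", "affected_2"], ["unaffected_1"], "dominant"))
print(shared_candidates(calls, ["affected_1", "affected_2"], ["unaffected_1"], "recessive"))
```

Under the dominant model only the variant that is heterozygous in both affected members and absent from the unaffected member survives; under the recessive model only the variant that is homozygous in both affected members survives.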
The concurrence of disease in multiple individuals in an outbred family suggests a possibility of an autosomal dominant presentation, while a disease passed on through the maternal lineage through generations would suggest a mitochondrial mode of inheritance.

[19] Both copies of the gene need to be mutated for an autosomal recessive disease to manifest.

A complete list of clinical phenotypes and clinical investigations

Apart from the detailed pedigree, a thorough clinical examination and enumeration of the clinical findings is an important aspect that should not be overlooked. In cases of clinical presentations like facial dysmorphology[20] or skin abnormalities, a detailed description of the findings is necessary. It would also be worthwhile to have clinical photographs of the features to avoid ambiguity and to enable other clinicians or clinical geneticists to arrive at an independent conclusion. In the case of patients with diseases manifesting with abnormalities in the levels of metabolites in the blood, a detailed investigation towards this end is also an essential clinical activity.

[20] Dysmorphology is the study of birth defects, especially those involving the morphology of the body.

A complete list of differential diagnoses

The clinical findings and investigation reports, together with the detailed pedigree, form the basic set of clues enabling one to arrive at a set of differential diagnoses. It would be worthwhile to draw up a detailed set of differential diagnoses before one prescribes exome sequencing in clinical settings. This would enable the prioritization of genes to be closely examined. Apart from the list of differential diagnoses, a list of genes that are involved in the disease also becomes handy while analyzing the exome data. A list of potential genes involved could be garnered from the Online Mendelian Inheritance in Man (OMIM) database. Furthermore, a number of locus-specific variation databases listing variants in these genes and their pathogenic effects could be consulted.

Did you know?
The Online Mendelian Inheritance in Man (OMIM) database is a comprehensive online database of human genes and disease phenotypes. The work on collecting Mendelian diseases and traits was originally initiated by Dr. Victor A. McKusick in the 1960s and was available initially as a book. The electronic version of the compendium was made available online in the present form from 1995 through the National Center for Biotechnology Information. The present OMIM is curated and maintained by the McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, USA.

Chapter 10 Educating the patient and getting an informed consent

Before prescribing exome sequencing, it is imperative to explain the entire method, its benefits and pitfalls. It is also imperative to inform the patient about the potential risk of uncovering unanticipated facts from the exome sequence, including risks of late-onset diseases, cancers and sometimes paternity. It would therefore be essential to take both parents into confidence before the exome sequencing is prescribed. A detailed information sheet that explains a non-exhaustive set of circumstances and/or scenarios is appended at the end of the book. The following major points need to be specifically discussed with the patient before exome sequencing.

1) Samples collected: The patient needs to be informed how the samples would be collected (saliva, blood) and what amount of sample would be collected.
2) The analysis performed on the samples also needs to be explained to the patient.
If any additional genetic/epigenetic/biochemical tests are required to be performed on the sample, this needs to be mentioned, along with how such a test would help in reaching the diagnosis.
3) Use of data and release: The patient needs to be informed whether the data would also be used for research and whether it would be released in a public database at any time. The benefits and risks of public release also need to be discussed.

Did you know?
Manuel Corpas, a researcher, made available the genomes of himself and his family in a freely available and re-usable format on the internet, with the hope that people could download the data, analyze it and obtain new insights on the genome. This was popularly called the ‘Corpasome’. Such an approach could potentially make the genome analysis and derivative information up-to-date and comprehensive at any point in time, with enormous benefits in understanding the disease predispositions and/or prognosis. The paper describing the dataset was published with the following citation: Source Code for Biology and Medicine 2013, 8:13 doi:10.1186/1751-0473-8-13 http://www.scfbm.org/content/8/1/13

4) Risks and discomforts: The risks and discomfort due to the methodology of sample collection, or due to having the exome sequence available, should also be explained in detail. A few scenarios are explained below.
a. The availability of the sequence could put one in precarious situations including identification of an individual, inference of paternity, inference of specific features of the genealogy and possible prediction of risks to self and children, and sometimes to other close relatives in the family.
b. The information on the exome could potentially be leaked from multiple sources, electronic or otherwise, which might have implications for the person and the family.
c.
If a previous genetic screen has been performed for research or diagnosis, the exome sequencing would make you identifiable in such a situation.
5) Anonymity and privacy: The patient should be educated about the benefits and risks of being anonymous, and the potential advantages of being non-anonymous. Specific case scenarios of data being publicly released, as in the case of the ‘Corpasome’, could be discussed. If the patient wishes to be anonymous, the methodologies and measures whereby the anonymity would be maintained in a specific clinical setting need to be detailed to the patient. The patient should also be educated that privacy and anonymity are not inter-dependent entities, and that modern technologies could maintain anonymity and privacy while benefiting from public release of the data. A recent paper from our laboratory details this concept[21].
6) Masking results: The patient could be asked for a list of conditions or types of conditions which need not be screened for in the genetic data generated, and which might cause discomfort. Nevertheless, the patient also needs to be asked whether any of the diseases that would benefit from reporting and are part of the ACMG guidelines (detailed in the later chapter) should also be excluded from the analysis or reporting.

[21] "Personal genomes, participatory genomics and the anonymity-privacy conundrum." Journal of Genetics (in press) available at URL: http://link.springer.com/article/10.1007/s12041-014-0451-3

The detailed consent provided to the patient and other participants as part of the GUaRDiAN consortium is enclosed below and would serve as a ready reference guide.

RESEARCH CONSENT FORM

Reference Code:
Son/daughter/wife of…………………………………………….aged…………..
Residing at …………………………………………………… …………………………………………………………………
Hereby consent to freely participate in the genetic study aimed at understanding the human genome.
I have been informed about the implications of my personal genome data being made publicly available through public databases as well as scientific communications.
I have been advised to discuss my participation in this study with my family members.
I have been provided written information that may be circulated to them, if necessary.
I have been further informed that personal and medical data collected during this study will be associated with my publicly available genome and may be used for scientific analysis.
My participation in this study is entirely voluntary and I am free to withdraw from this study as and when I feel so inclined.
1. I choose to disclose / not to disclose my identity (select one option).
2. I choose to be / not to be informed of the results of the analysis that may impact my health (applicable only to those who have chosen to disclose their identity; select one option).
3. I choose to exclude the information attached on the "Exclusion Form" from analysis / public disclosure (applicable only to those who have chosen to disclose their identity).
(Signature/ Thumb impression of volunteer) (Date)
Certified that the above consent has been signed in my presence. The purpose for which the sample will be used has been explained to the above volunteer. The individual is free to withdraw from the study as and when he/she feels so inclined.
(Signature of the investigator) (Date)

Exclusion Form
I choose to exclude the following information from the questionnaire with respect to analysis or public disclosure (please indicate the relevant question numbers from the attached questionnaire)
1. Analysis:
2.
Public disclosure:

INFORMATION FOR THE VOLUNTEERS

1. Purpose of study
The principal scientific goal of this study is to explore avenues to study genetic variability between individuals and to correlate the variability to the phenotypes. The data generated (i.e., human DNA sequence, medical information and physical traits) may be used for scientific and clinical research such as development of computational tools and interfaces for scientists, clinicians and individuals, in addition to developing general public awareness of the potential benefits and risks of having whole genome level information available to the public.

2. Enrolment procedures
A. Collection of baseline trait data: You are required to provide baseline trait data about yourself, including: date of birth, medications, allergies, vaccines, personal and family medical history, race/ethnicity/ancestry and vital signs (e.g. height, weight, blood pressure etc.) in the attached questionnaire.
B. Monozygotic twin: If you have any identical twin(s), such sibling(s) will need to provide consent for your participation in this research.

3. Tissue (Blood/Saliva) collection
A. Blood sample will be collected from the upper arm by venipuncture. Twenty-five ml of blood will be drawn by an authorized medical doctor, or by an authorized technician under the supervision of an authorized medical doctor, in the presence of the principal investigator. Fresh blood sample will be collected in designated containers (which will be provided by CSIR/IGIB). Serum would be isolated from the collected blood sample for biochemical analysis.
B. Saliva sample will be collected by voluntary spitting. Two to four ml of saliva will be collected in designated containers (which will be provided by CSIR/IGIB).
4. Genomic analysis
Analysis of DNA and RNA, including but not limited to whole genome sequencing and other biochemical analysis, will be performed on tissue samples collected from the individual. The nature and extent of analysis will be determined by CSIR/IGIB at its sole discretion.

5. Public release of research data
Upon completion of genomics analysis, your DNA sequence data will be made available through the CSIR/IGIB website and other scientific communications (including but not limited to publication in scientific journals). This information is for research purposes only and may not be used by you for any medical or clinical purpose unless the relevant research data (DNA sequence) is first confirmed and discussed in consultation with a health care professional. By signing this consent form, you hereby agree and authorize CSIR/IGIB to proceed with the full public release of your DNA/RNA sequence data and other information (date of birth, medications, allergies, vaccines, personal and family medical history, race/ethnicity/ancestry and vital signs) voluntarily made available by you, without any legal restriction and without your further consent, through the CSIR/IGIB website and database or other formats of standard scientific communications (including but not limited to publication in scientific journals), and you hereby acknowledge the risk associated with the public release of such data and information. Your identity will be held confidential if you so choose, even though the identity-stripped information would be publicly available.

6. Risks and discomforts
A. Venipuncture: This procedure is associated with minimal discomfort and is free of significant adverse effects.
B. Data analysis: You are strongly advised to discuss this study and the potential risks, as outlined below, with your parents, siblings and descendants (hereinafter family members), as well as your health care provider(s).
You are also advised to directly discuss any additional concerns with the Principal Investigator. The following is a non-comprehensive list of hypothetical scenarios that could pose risks for you and your family members:
i) The data provided by you (such as traits and vital signs or DNA sequence data) may be used to identify you, resulting in higher than normal levels of contact from the press and other members of the public. This could result in a loss of privacy and personal time.
ii) Anyone with sufficient knowledge and resources could take your DNA sequence data and/or your personal trait information and utilize the data, with or without modification, to (1) infer paternity or other features of your genealogy, (2) reveal the possibility of a disease or risk for a disease. Such information could lead to social and financial consequences including but not limited to employment and insurance.
iii) Your family members could also be subject to discrimination for employment, insurance or financial services on the basis of the public disclosure of your genetic and trait information.
iv) If you have previously made or plan to make available genetic information in a confidential setting, the data provided by you as part of this study may reveal your identity.
v) Any conclusions derived from the publicly available information may be speculative with respect to you and even less predictive with respect to your family members.
The complete set of risks posed to you and your family members due to the public release of the DNA sequence and trait data is not known at this time. We encourage you to discuss this aspect with your family members.

7. Benefits
(i). At present there are no proven benefits to you for your participation in this study.
(ii).
This study may benefit the medical and research community in particular, and humanity in general, and may help in establishing genetic causes and predisposition for common diseases.
(iii). You may experience satisfaction from participating in research that may benefit medical science.

8. Intellectual property rights and benefit sharing
You will not be financially compensated for your participation in this study. Neither you nor your heirs shall claim from CSIR/IGIB any financial benefits or rights for any information, data or discoveries, whether or not of a commercial nature, made using the information generated in this study. However, as per international (HUGO, UNESCO) and National Guidelines (National Bioethical Committee, Ethical Guidelines for Biomedical Research on Human Participants), it is necessary for national/international entities deriving economic benefit out of the knowledge resulting from the use of the human genetic material to dedicate a percentage (e.g. 1%-3%) of their annual profit for the benefit of the community/public health.

9. Confidentiality
The results of this study may be published in a medical book, journal, website or webpage or used for teaching purposes. Your name and other identifiers will be disclosed only if you have consented to disclosure of your identity. You may not be notified by CSIR/IGIB prior to such use.

10. Withdrawal of participation
Participation in this study is voluntary. You may withdraw your participation and/or your data from this study at any time, as described in the consent form. However, once the DNA sequence and associated information is in the public domain, it is likely to get disseminated widely and rapidly. Therefore it may not be possible to retract the data in response to a withdrawal request.
Chapter 11 Points to note when you outsource exome sequencing

A large number of commercial enterprises now provide whole exome sequencing as a service. As stated before, there are a large number of competing capture methodologies and sequencing technologies, which makes the decision on the appropriate technology a bit cumbersome and sometimes extremely challenging. The challenges aside, there are a few questions that need to be kept in mind before outsourcing exome sequencing in clinical settings. This section is designed to provide a basic guideline on specific points that are to be considered, and not as a guide to select a particular methodology or technology.

The capture methodology and capture efficiencies

As mentioned before, it is a good idea to note the target genes and exons captured, as there are a number of capture methodologies that differ in the number of bases captured in the genome and in the efficiency of capture. This is important in the context of patients with known genetic diseases, where you are keen to look for a known variant or variants to confirm the diagnosis. It is important to make sure that the genes and specific exons of interest are covered efficiently by the capture methodology in question. The capture efficiency of the target region should also be checked after the sequencing is done. Details of how to go about this are mentioned in the later chapter on data analysis.

Sequencing technology, quality of reads and data throughput

A number of sequencing technologies are available in the commercial space. Therefore, it is important to note the sequencing technology employed before you finalize the methodology. A rule of thumb is to go with a methodology that would provide an ample number of high quality reads at an affordable cost.
More on how to evaluate this after the sequencing is performed is detailed in the later chapter.

Depth of coverage of the target regions

In a regular clinical setting, for diagnosis of rare genetic diseases, it would be worthwhile to have at least 100x coverage of the exome. This is because capture efficiencies are variable across the genome, and an average coverage of 100x would, in practical situations, leave almost all target regions adequately covered to enable variant calling. It is also imperative to check what percentage of the target region has coverage good enough to enable accurate variant calling.

Availability of raw data and alignments

While outsourcing exome sequencing, one should also insist that the raw data with qualities (preferably in FASTQ format) and the alignments be made available. This is an important consideration for a number of reasons. The first and prime reason is that the field is still young, and so are the methodologies for analysis. Apart from the information on the particular variant in question, the exome also contains a number of other variants, many of which could give insights and have additional clinical implications. Secondly, in many cases it is necessary to go back to the data and reanalyze it at a later point in time to arrive at an appropriate diagnosis in light of disease progression and new clinical findings.

Variant calls, formats and interoperability

A number of service organizations offer the variants in custom formats, usually tab-delimited files or even Excel sheets. It is necessary to insist that all variant calls be available in standard interoperable formats. The commonly employed standard format for variant calls has been the VCF format.
The VCF[22] format includes all the information necessary to reanalyze the variants for prioritization, especially the read coverage around the variant, the variant quality and, in the case of trios, the samples that carry the particular variant. Additionally, VCF files are interoperable and are accepted by most online resources and software that aid the analysis of exome datasets.

[22] VCF stands for Variant Call Format. This format grew out of the 1000 Genomes project and is widely used in the community. A number of bioinformatics tools and resources for analyzing variant data take VCF files as input.

Details of the analysis pipeline with parameters

The results of an exome sequencing analysis could vary drastically depending on the analysis pipeline employed, and especially on the parameters used for sequence alignment and variant calling. To ensure that the data is reliable and reproducible, it is imperative that the report carries an accurate description of the analysis pipeline as well as the parameters used in alignment and variant calling.

Datasets used for annotation, versions and updating

Just as the analysis tools and parameters affect the variant calls, the datasets used and their versions also have a large impact on the conclusions derived. Many of the datasets of genomes, genes and variants are regularly updated and have non-trivial changes between the versions released. It is thus important to note the versions of the databases used, so that the results can be appropriately interpreted and the analysis appropriately reproduced.

Chapter 12 Understanding the steps in analysis of exome sequence data

The major steps in analysis of the exome sequence data could be summarized as follows. The first step involves quality check of the data.
The second step involves alignment of the sequence reads to the reference genome. The third step would be the analysis of the alignment to call variants, and the fourth step would be to annotate and analyze the variants. The steps involved in the entire process are summarized in Figure 1.

Figure 1. Summary of steps involved in the analysis of the exome.

The nucleotide data generated by the sequencer is usually available in a file format known as FASTQ (which stands for FASTA with Qualities). As you would have imagined, the file contains sequences with their base qualities. The base quality part is important to note here, because it tells how good the sequence read is, and only a good quality read would provide you a good quality variant for further analysis. The FASTQ files are quite large, and in most cases cannot be opened in your word processor or text editor. Nevertheless, it would be worthwhile understanding what the file contains and what it means.

The FASTQ file essentially has 4 lines corresponding to each read, and there could be millions of such reads in the file, arranged one after another. Briefly, the first line starts with an ‘@’ followed by the information on the read. This usually has information on the sequencer, the run name and the date, and might not be of use to you in a regular case. The second line contains a string of ATGCs, which is the nucleotide sequence of the read. The third line starts with a ‘+’ and in some cases repeats the information in the first line, while sometimes it is otherwise empty, to avoid redundancy. The fourth line in many cases contains characters that read like gibberish, and this is the representation of the quality of each base in the read. So the number of characters is exactly the same in the second and the fourth lines, as there is a quality representation for each base.
The gibberish is nothing but the ASCII character[23] equivalent of the quality score, typically shifted by a fixed offset (33 in the widely used Sanger encoding) so that every score maps to a printable character.

[23] ASCII stands for American Standard Code for Information Interchange and it comprises numerical codes, each corresponding to a character.

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCCTTGGCAGGCCAAGGCCGATGGATCA
+
;;3;;;;;;;;;;;;7;;;;;;;88;;;;;;;;;;;9;7;;.7;393333
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA GTTGCTTCTGGCGTGGGTGGGGGG
+
;;;;;;;;;;;7;;;;;-;;;3;83;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGGCCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333;;;;;;;;;;;7;;;;;-;;;3;83

Figure 2. The FASTQ file format with sequences of the reads and qualities of bases in the sequence read.

The quality of bases across the read length is usually expressed as a Phred score. The Phred score Q is ten times the negative base-10 logarithm of the probability P that the base was called incorrectly: Q = -10 x log10(P). So if the base had a one in hundred chance of error, which means an error probability of 0.01, the Phred score would be -10 x log10(0.01) = -10 x (-2) = 20. Likewise, a Phred score of 30 would mean a base error probability of one in a thousand, and so on.

There are a number of ways you could evaluate the quality of data. One approach is to plot the distribution of qualities at every base position, and this plot serves as a ready reference to see whether the sequencing was good or not. The quality of sequences could be quite variable because of issues in the library preparation or sequencing. If the reads have quality issues in the first place, or contain sequences of the adapters used for sequencing, they are usually processed to remove the low quality bases and adapter sequences, and this step is known as trimming. How to verify the quality is detailed in the later chapter.
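The four-line record layout and the Phred arithmetic described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production parser: it assumes the Sanger encoding (quality character code minus 33), the read `read1` is a made-up example rather than real data, and the helper names `parse_fastq`, `phred_scores` and `error_probability` are hypothetical names introduced here.

```python
import io

PHRED_OFFSET = 33  # assumption: Sanger-style encoding (character code = score + 33)


def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-read FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("one quality character is needed per base: " + header)
        yield header[1:], sequence, quality


def phred_scores(quality, offset=PHRED_OFFSET):
    """Convert the quality string into a list of integer Phred scores."""
    return [ord(character) - offset for character in quality]


def error_probability(phred):
    """Invert Q = -10 x log10(P) to recover the error probability P."""
    return 10 ** (-phred / 10)


# Hypothetical four-base read used purely for illustration.
example = io.StringIO("@read1\nACGT\n+\nI5+!\n")
records = list(parse_fastq(example))
for read_id, sequence, quality in records:
    scores = phred_scores(quality)
    print(read_id, sequence, scores)  # read1 ACGT [40, 20, 10, 0]
    print([error_probability(q) for q in scores])
```

In this toy read the first base (score 40) has a one in ten thousand chance of being wrong, while the last base (score 0) carries no information at all. Other encodings use a different offset, so the offset should be confirmed with the sequencing provider before interpreting the scores.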
The next step would be to align the good quality reads to the reference human genome. The selection of the genome version or build is very important, as there are non-trivial differences in the positions of nucleotides and annotations of genes between the builds. A number of computational algorithms have been used extensively in the literature to align the reads. The purpose of alignment is to find the cognate position of the read in the genome, and this offers a way to compare whether each nucleotide across the read is the same as or different from the reference. As you would have rightly imagined, each genomic position corresponding to protein coding exons would be covered by a number of reads. This is denoted as coverage, or how many times the nucleotide is covered by reads.

Once you have aligned the reads to the chromosome, you would find some positions that are different in the reads compared to the reference genome template. This information could be analyzed using computers to derive which positions have a variant. As you have rightly guessed, a higher coverage would provide you with a better accuracy of the variants called. So if you imagine a homozygous variant, all the reads, or rather the majority of the reads, would have the particular variant, while in the case of heterozygous variations approximately half the reads would have the particular change with respect to the reference template. This entire process is called variant calling. As mentioned before, a number of computational algorithms have been extensively used to accurately call variations in the genome.

Figure 3. The alignment of reads to the reference genome. The positions where the bases in the reads are different from those of the reference genome are highlighted.

The variations in the genome are usually available in a standard format known as the VCF.
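The read-fraction reasoning in the variant calling paragraph above (nearly all reads carrying the change for a homozygous variant, about half for a heterozygous one) can be illustrated with a toy classifier. This is a deliberately simplified sketch and not how any particular variant caller works; real callers use statistical models that also weigh base and mapping qualities. The function names and the thresholds (`min_depth=10`, `het_low=0.2`, `hom_low=0.8`) are assumptions chosen only for illustration.

```python
def allele_fraction(ref_reads, alt_reads):
    """Fraction of reads at a position that carry the non-reference base."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / depth


def classify_zygosity(ref_reads, alt_reads, min_depth=10, het_low=0.2, hom_low=0.8):
    """Toy zygosity call from read counts; all thresholds are illustrative assumptions."""
    depth = ref_reads + alt_reads
    if depth < min_depth:
        # Too few reads to say anything reliable about this position.
        return "low coverage"
    fraction = allele_fraction(ref_reads, alt_reads)
    if fraction >= hom_low:
        return "homozygous variant"
    if fraction >= het_low:
        return "heterozygous variant"
    return "reference"


# Hypothetical read counts (reference-supporting reads, variant-supporting reads).
print(classify_zygosity(1, 49))   # homozygous variant
print(classify_zygosity(26, 24))  # heterozygous variant
print(classify_zygosity(50, 0))   # reference
print(classify_zygosity(2, 3))    # low coverage
```

The `low coverage` branch reflects the point made in the text that higher coverage gives more accurate calls: with only a handful of reads, a half-and-half split cannot be told apart from chance.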
VCF stands for Variant Call Format. A number of analysis software packages are able to recognize the variant format and provide annotations to the variants in terms of information that would help clinch the diagnosis. The structure of the VCF file is summarized in Figure 4.

Figure 4. The VCF file format representation of variations in the exome.

Chapter 13 How good is the exome sequencing data?

There are three major parameters which decide whether the exome sequencing data is good or not. The first one is, of course, the quality of the sequencing reads, the second one is the coverage depth across the target regions and the third is the alignment percentage across the genome.

The first parameter, the quality of reads, is possibly the easiest to check. A number of tools, both online and offline, are available to check the quality of bases. It should be noted that the distribution of the quality of bases is as important as the mean quality of the bases. The scheme below shows the base quality plot for a good set of sequencing reads. The scheme also shows the advantage of looking at the distribution of the qualities compared to the mean quality at each base position.

Figure 1. Quality plots for good quality reads.

Figure 2. Quality plots for bad quality reads. Note the low quality of reads towards the end.

The second important parameter to check would be the coverage depth across the target region. On average, for identification of rare disease variants in clinical settings, it is recommended to have at least 100x coverage worth of high quality data. The calculation depends on the read length and the total length of the exome capture (in the case of whole exome it is approximately 50 Mb).
For 100-base reads, reaching 100x over a target of approximately 50 Mb works out to 100 x 50,000,000 / 100 = 50 million reads (assuming every read falls on target), and so on.

The third important parameter is the alignment percentage. It denotes the percentage of the total reads which aligned to the reference genome. On average, in a well-set experiment, more than 95 percent of the reads generated after capture should align to the human genome. At times, with good quality data, the percentage alignment could even cross 99 percent. A low percentage alignment could mean that any of a number of things has gone wrong. One of the major possibilities is contamination of the reagents. Other possibilities could include inefficient capture or sequencing. Adapter contamination is one of the first things to look for in case the reads show an abnormal percentage alignment. Adapter contamination could also be identified in the FastQC report, which would show over-represented sequences. Over-representation of particular sequences, especially repeat sequences, would mean an improper capture or library preparation.

Apart from the total alignment percentage, the coverage of the target sites is also an important consideration. For accurate variant calling, it is advised to have good coverage across the majority of the target sites. Skewed target coverage would mean an inefficient capture procedure.

Chapter 14 Prioritizing, annotating and interpreting variants

As described in the previous chapter, the real analysis and interpretation starts after you lay your hands upon the compendium of variations called from the exome. Ideally we expect all variants to be in the standard VCF format, which makes them compatible and interoperable with most tools and resources available online for exome analysis.
But before we go right into the thick of exome analysis, it is important to understand conceptually how to prioritize variations. There are broadly six approaches to prioritize variations from exome or whole genome sequencing data, and these are summarized in Figure 1. The highlighted region denotes the exome sequenced, and the panel below suggests the approach to filter or prioritize variations. Such prioritization strategies could be employed at any step, and the choice of approach depends on the specific case.

If there are multiple affected family members, as in the case of the Bhai, a linkage-based strategy is useful. One could potentially sequence multiple affected members of the same family and, if possible, unaffected members too. A segregation-based strategy could then be used to retain the variations shared by the affected individuals and to exclude those present in the unaffected individuals. Such an approach is extremely useful in autosomal dominant diseases with multiple affected family members.

Figure 1. Summary of popular approaches or strategies to prioritize variants from exome sequencing.

The second scenario, as you would see, involves a consanguineous marriage, where you would expect an autosomal recessive pattern of inheritance of the disease causing mutation. Here a homozygosity-based strategy, taking into consideration all homozygous variants and prioritizing them through standard pipelines, would be the best approach to follow.

The third scenario involves a non-consanguineous marriage with a probable autosomal recessive pattern of inheritance, where filtering the exome, by exclusion, for heterozygous variations could be the approach to follow. In some cases where the affected child is not available for testing, as in the case of abortions, sequencing both the parents for heterozygous variations associated with Mendelian diseases would be the alternative to follow.
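The segregation-based and homozygosity-based strategies described above reduce to simple logic over genotypes. The following is a minimal sketch, not a validated pipeline: the genotype encoding (0, 1 or 2 copies of the variant allele), the sample names and the function names are all assumptions made for illustration, and a missing genotype is treated as 0.

```python
from typing import Dict, List

# sample name -> number of copies of the variant allele (0, 1 or 2)
Genotypes = Dict[str, int]


def segregates_with_disease(gt: Genotypes, affected: List[str],
                            unaffected: List[str]) -> bool:
    """Dominant model: carried by every affected member, by no unaffected member."""
    return (all(gt.get(s, 0) > 0 for s in affected)
            and all(gt.get(s, 0) == 0 for s in unaffected))


def homozygous_in_affected(gt: Genotypes, affected: List[str],
                           unaffected: List[str]) -> bool:
    """Recessive model: homozygous in every affected member, in no unaffected member."""
    return (all(gt.get(s, 0) == 2 for s in affected)
            and all(gt.get(s, 0) < 2 for s in unaffected))


# Hypothetical family: affected father and child, unaffected mother
affected, unaffected = ["father", "child"], ["mother"]
variant_a = {"father": 1, "child": 1, "mother": 0}
variant_b = {"father": 1, "child": 1, "mother": 1}
print(segregates_with_disease(variant_a, affected, unaffected))  # True
print(segregates_with_disease(variant_b, affected, unaffected))  # False
```

Strict rules of this kind assume complete penetrance and error-free genotype calls; in a real family it may be safer to rank variants by how well they segregate than to discard them outright.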
The further analysis of the exome data involves three major steps. The first step involves identifying variations that cause a change in the amino acid sequence of proteins and are predicted to be deleterious. The second step involves annotating the genes with respect to the disease candidates, and the third step involves prioritizing variations using different strategies suited to the specific case.

The first step is, obviously, to find variations that could change the amino acid sequence of the protein and are predicted to be deleterious. As you would have imagined, not all variations in the exome are important or have a functional consequence. The variations that cause a change in the amino acid sequence are called non-synonymous variations, while those that do not change the amino acid sequence are called synonymous variations. Not all non-synonymous variations are important. Only a small proportion of the non-synonymous variations in the exome change the protein in a way that produces a functional effect. These are variations that cause an amino acid change in regions of the protein that are extremely important for its function or structure. These variations are generally called deleterious variations.

Whether a variation could potentially be deleterious or not is largely derived from computational predictions based on the amino acid change caused by the specific variation in question. Two computational tools are popularly used to prioritize deleterious variations: SIFT and PolyPhen2. The algorithms use similar, but distinct, approaches to annotate variations as deleterious or not.
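The distinction between synonymous and non-synonymous variations can be illustrated by translating the reference and the altered codon with the standard genetic code. The sketch below is for illustration only; the function name `classify_codon_change` is ours, and real annotation tools additionally account for transcripts, strand and splice sites.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Return 'synonymous' if both codons encode the same amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"


print(classify_codon_change("GAG", "GAA"))  # synonymous (Glu to Glu)
print(classify_codon_change("GAG", "GTG"))  # non-synonymous (Glu to Val)
```

A change that introduces a stop codon is also reported as non-synonymous here, although such variants are usually handled as a separate and more severe class.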
SIFT stands for Sorting Intolerant from Tolerant, and uses the evolutionary conservation of the amino acid at the particular position in the protein to predict whether the variation is deleterious or not. The basic assumption is that if an amino acid is highly conserved at a particular position in the protein, a change to a less frequent amino acid at that position could be functionally deleterious and would therefore have been discarded during evolution.

PolyPhen2 is another algorithm to prioritize variations. The algorithm is somewhat more complicated: apart from the conservation of the position, it also uses the structural context of the amino acid, and additionally uses artificial intelligence (machine learning) methodologies to predict whether the change is deleterious in nature or not.

Neither approach on its own might be fully effective in prioritizing the variations. One approach popularly employed by researchers is therefore to use a consensus of both tools to prioritize variations that are deleterious in nature. You should note, however, that while a consensus approach might be highly specific, such a stringent approach might exclude some variations that are functionally relevant. Whether to use the tools in consensus or alone has to be decided on a case-to-case basis. The online applications that integrate these predictions are discussed later in this chapter.

The second step is to annotate the variations and genes associated with the disease phenotypes in question. As mentioned in the earlier chapter, the complete clinical details come in handy here. A number of tools discussed later in this chapter can take in additional annotation of the patient phenotypes to prioritize variations. There are two web-based resources that have been extensively used to clinically annotate exomes.
These are Exomiser, maintained by the Sanger Institute, and PhenIX, both of which have been extensively used by clinicians worldwide to prioritize variations and possibly arrive at a diagnosis.

Exomiser has a web-based interface, where you can upload the VCF variant file corresponding to the exome. The web interface also provides an option to upload exome variants from multiple samples in a family, with associated pedigree information in a specified format. Briefly, you upload the VCF file and, optionally, the pedigree annotation if you have multiple individuals sequenced from a family. The resource also features additional options where you can input either the diagnosis of the patient or, if the diagnosis is not certain, a set of phenotypes. There are additional optional parameters that you can specify. These include:

1. Minimum variant call quality: You could specify a Phred score, say 30.
2. Maximum minor allele frequency (%): This option allows you to exclude common variations by allele frequency. You could set a maximum allele frequency of 1%.
3. Remove off-target, intronic and synonymous variants, dbSNP variants and non-pathogenic variants: These options allow you to exclude such variations from the report.
4. Inheritance model: You could select the specific model if you are sure about the inheritance of the disease; this option is used to prioritize the variations. Otherwise, you could select none to display all variants in the report.

Figure 2. Screenshot of Exomiser with the different options.

Another similar resource that allows you to prioritize variations is PhenIX, maintained by the Charité in Berlin.

Figure 3. Screenshot of PhenIX with the different options.
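The thresholds offered by such tools (minimum call quality, maximum allele frequency) and the SIFT and PolyPhen2 consensus discussed earlier in this chapter can be expressed as simple checks on each variant. The sketch below is a hedged illustration and does not reproduce how Exomiser or PhenIX work internally. The `Variant` fields, the function names and the example values are hypothetical. The SIFT cutoff of 0.05 is the commonly quoted default, the PolyPhen2 cutoff of 0.85 is an assumption, and keeping variants with no frequency data is a design choice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    gene: str
    qual: float                   # Phred-scaled variant call quality
    allele_freq: Optional[float]  # population frequency (0 to 1); None if unknown
    sift: Optional[float]         # lower score = more likely deleterious
    polyphen2: Optional[float]    # higher score = more likely damaging


def passes_basic_filters(v: Variant, min_qual: float = 30.0,
                         max_af: float = 0.01) -> bool:
    """Keep well-called variants that are rare or have no frequency data."""
    if v.qual < min_qual:
        return False
    return v.allele_freq is None or v.allele_freq <= max_af


def consensus_deleterious(v: Variant, sift_cutoff: float = 0.05,
                          polyphen_cutoff: float = 0.85) -> bool:
    """True only when both predictors agree that the change is damaging."""
    if v.sift is None or v.polyphen2 is None:
        return False
    return v.sift <= sift_cutoff and v.polyphen2 >= polyphen_cutoff


# Hypothetical variants, for illustration only
variants = [
    Variant("GENE1", 99.0, None, 0.01, 0.98),   # novel, predicted damaging
    Variant("GENE2", 60.0, 0.12, 0.02, 0.95),   # too common
    Variant("GENE3", 12.0, 0.001, 0.00, 0.99),  # poorly called
]
shortlist = [v.gene for v in variants
             if passes_basic_filters(v) and consensus_deleterious(v)]
print(shortlist)  # ['GENE1']
```

Requiring agreement between both predictors favors specificity; relaxing `consensus_deleterious` to accept either prediction would recover more candidates at the cost of a longer list.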
PhenIX has an interface quite similar to that of Exomiser, with an option for the user to input the phenotypes or traits using the autofill option, upload the VCF file, and specify the inheritance model and the maximum allele frequency.

Both tools prioritize the variations by the pathogenicity or deleterious effect of the variation(s), and by the similarity of the genes harboring these variations to the genes associated with the phenotypes provided by the user.

Apart from the deleteriousness of the variation, another parameter that helps clinch a diagnosis is the allele frequency of the variation in populations. Most deleterious variations are expected to be quite rare in the population, so an allele frequency of less than 1 per 100 is a legitimate threshold for prioritizing variations. In many cases, the variation may be novel, with no allele frequency information available.

Chapter 15 Don't forget the validation

The validation of the findings from whole exome sequencing is as important as the exome sequencing itself. Many researchers and clinicians are not aware that exome sequencing and analysis have their own limitations. It is therefore necessary to independently validate the variation before confirming the diagnosis.

There are three scenarios where validation is to be considered. In the first scenario, the variant is known and has previously been implicated in the disease. Here the validation is quite simple, in the sense that the finding needs to be verified independently in the sample or samples. The traditional Sanger sequencing approach is commonly used in the field, especially for single nucleotide variations.
Polymerase chain reaction primers can be designed around the variant in question, and the region amplified and sequenced to confirm the diagnosis.

The second scenario is where you have identified a new variant in a known gene. The first line of evidence that would clinch the variant is its segregation among the affected members, together with a predicted deleterious effect. Wholehearted participation of members of the family is required in such cases, and consent must be obtained (detailed in the ethical considerations section of this book). In some cases, participation of other family members is impossible to obtain, due to privacy and anonymity concerns. Another approach to validate a genetic variant is to examine its segregation in a trio. In some cases, especially sporadic cases, and in specific social circumstances, it might not be possible to approach other family members, or sometimes even the parents, but the pathogenicity nevertheless needs to be proven unequivocally. In such circumstances, a number of advanced methods have been adopted in the literature. These include validation of the finding using specific assays at the protein level or at a cellular level, using advanced gene cloning, expression and sometimes genetic engineering approaches. These technologies are specialized applications, mostly in the research domain, and clearly outside the purview of this book.

The third scenario is where you stumble upon a new gene and variant that causes a disease. While segregation and/or homozygosity mapping in cases of consanguinity, and filtering based on allele frequencies, could clinch a conclusive diagnosis, many cases leave a margin of error or doubt in the diagnosis and in the implication of the genes involved.
Functional validation of such new genes is presently the realm of research laboratories, as no clear-cut and comprehensive methodologies exist to systematically validate the functional effects. Apart from the popular cell culture systems, a number of research laboratories employ model organisms to functionally validate the gene and model the disease process. Model organisms are useful to validate physiological processes, especially in cases of developmental defects or structural abnormalities, which would be difficult to validate in cell culture systems. Nevertheless, cell culture systems are useful to validate specific processes, including metabolic pathways and genes involved in specific processes at a cellular level. The popular model organisms used to validate disease genes include vertebrate and invertebrate systems such as mouse, rat, zebrafish, fly and worm. Our group employs zebrafish, a popular vertebrate model organism, for functionally validating novel genes.

Chapter 16 Ethical considerations in whole exome sequencing

There are a number of ethical considerations that have to be accounted for while performing and analyzing exome sequencing in clinical settings. This is primarily because exome sequencing is unique in many ways compared to traditional diagnostic approaches. For example, the line between diagnostics and research is quite blurred in the case of exome sequencing, because, unlike other diagnostic approaches, methodologies for exome testing and validation are still not well established.
In addition, since most clinicians would use exome sequencing for understanding rare diseases, the diagnostic accuracy in many cases cannot be established due to the paucity of numbers and the unique nature of each patient. It should also be kept in mind that the basic tenets of investigations in genetics have to be based on the strong principles of beneficence, reciprocity, justice and professional responsibility.

Three major areas are covered in the following sections of this chapter: educating and informing the patient; informed consent and handling of incidental findings; and anonymity and privacy of the patient and family members.

Information and education

Educating the patient on the technology, analysis process and interpretation is an important component. The patients need to be educated about the possible pitfalls, fallacies and limitations of exome sequencing. In addition, the patient needs to be informed about incidental findings, which could have clinical, social and emotional implications, and should be equipped to make an informed decision on the same. The patient also needs to be informed that genetic testing of this sort could reveal information not just about the patient or immediate family, but also information that might be critical and relevant to other relatives and possibly the next generation. The pros and cons of such information being available, and its implications, also need to be addressed.

Incidental findings and reporting

Exome sequencing is unique compared to traditional research or diagnostic tests, where the data generated is commensurate with the questions asked, or rather, where the chances of finding something incidental while performing a test are meager.
The first set of diagnostics that started changing the paradigm was radiology, where whole body scans started churning out more information than was actually required to answer the clinical questions. The more data generated in a generic form, the more incidental findings start to appear.

Exome sequencing is unique in that it allows a comprehensive scan of all variants in protein coding regions. Apart from the variant or variants that help in the diagnosis, this would include other variants, many of which would have clinical implications or relevance. Many of the resulting findings may not have direct implications for the condition at hand, but might have long-term implications. One example is variants associated with drug metabolism or adverse drug reactions. In some situations, the information might have implications for early diagnosis or prognosis, as in the case of inherited cancers. In many cases, there is no clear distinction between an incidental finding and the target mutation under study.

The traditional approach to such incidental findings in clinical settings has been one of 'didn't look, didn't find, don't report', where the onus was on the doctor to decide what needs to be looked for in the results and to report what he or she felt was good or relevant for the patient. This paradigm might not always be the right approach, because the incidental findings by themselves could be of immense value to the patient, and possibly to another doctor treating the patient, as in the case of pharmacogenetic variants, which might help in modulating the dosage of specific drugs.

In addition, exome sequencing is unique compared to computed tomography (CT) scans in another way.
While computed tomography scans could reveal incidental findings in addition to the intended evidence, the relevance of the findings rarely changes with time. In the case of whole genome or exome sequencing, the field itself is young, and researchers are discovering new variants and their clinical relevance almost every day. Reanalyzing the exome sequencing data at a later point in time could therefore reveal new findings of clinical relevance. This unique situation poses another interesting paradigm, where reporting of the exome is going to be a dynamic process, not an end point or static process, in contrast to many traditional clinical diagnostic approaches.

The American College of Medical Genetics and Genomics (ACMG) formed a working group to deliberate on guidelines for reporting incidental findings in exome and genome sequencing, which were published recently. The working group recommended the reporting of incidental findings for a set of specified disorders, variants and classes of variants, by evidence. This reporting is done irrespective of the primary indication for exome sequencing.

American College of Medical Genetics and Genomics Recommendations for Reporting Incidental Findings in Clinical Exome and Genome Sequencing

A comprehensive description of the methodology, recommendations, list of genes, variants and phenotypes is available in the document entitled ‘ACMG Recommendations for Reporting of Incidental Findings in Clinical Exome and Genome Sequencing’, accessible at: https://www.acmg.net/docs/ACMG_Releases_HighlyAnticipated_Recommendations_on_Incidental_Findings_in_Clinical_Exome_and_Genome_Sequencing.pdf

Apart from the incidental findings, the patient or family members may decide to mask reporting on
specific regions, loci or variations that might have nontrivial implications. The consent should include a section where the patient or family members can explicitly state this.

Anonymity and privacy

Utmost care for anonymity and privacy is another important component of ethical conduct towards the patient and family. It should be emphasized that anonymity and privacy are not two sides of the same coin, but are separate entities. A detailed discussion with the patient and family members on this aspect is essential. In many cases, the impact of genetic testing is not limited to the index case or family, but might have implications for genetic predisposition and disease manifestation in other family members too. Similarly, the identification of a mutation might not be relevant to the specific individual or family, but could be relevant for screening and carrier detection in other members of the family. As in the case of the Bhai, the identification of a novel mutation in the KRT5 gene would have implications for genetic screening, and in some cases prenatal screening, for the other members of the family. In some cases, the validation of the genetic variant would require participation of other members of the family, including people who might not be affected by the disease.

With the advent of Internet support groups and patient groups, in many cases the patient or the family members prefer not to be anonymous, since this might benefit the larger community and society. In other cases, the patient and family would like to remain anonymous, given the social stigma associated with the disease and the social implications for other members of the family. It is therefore the educated decision of the patient or family that needs to be given utmost importance.
Questions in this direction need to be non-suggestive, and should take into consideration the social and emotional attachments and the long-term implications.

The last word

Exome sequencing is only a means, not an end. It seemingly has a limited lifetime, being popular and widely adopted largely due to its cost advantage and the ease of analysis and interpretation. With dwindling costs and improved throughput of sequencing, it is inevitable, not just plausible, that whole genome sequencing will become the mainstay in the diagnosis of genetic diseases.
About the authors

Sridhar Sivasubbu
Scientist, CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB)
Web: http://sridhar.rnabiology.org
Email:
[email protected]

Sridhar Sivasubbu’s laboratory is interested in exploiting the advantages of zebrafish to dissect molecular mechanisms of gene function, regulation and genome organization in vertebrates. Research activities in his lab include deciphering non-coding RNA mediated regulation of blood and blood vessel development, and development of zebrafish models for application in personalized and precision medicine in humans. His group is actively involved in mapping the genome and transcriptome of the wild zebrafish. His group was also responsible for the whole genome sequencing of human samples from India and other Southeast Asian countries. Sridhar obtained his PhD from M.S University, Tirunelveli, India and did his postdoctoral research at the Center for Cellular and Molecular Biology, India and the University of Minnesota, USA. He has been a faculty member at the CSIR-Institute of Genomics & Integrative Biology (CSIR-IGIB) since 2006. Sridhar also served as the CEO of The Center for Genomic Application, a public-private partnership company established by CSIR-IGIB for enabling research in the field of genomics and proteomics, where he spearheaded the application of next generation sequencing technology for commercial projects.

About the authors

Vinod Scaria
Scientist, CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB)
Web: http://vinodscaria.rnabiology.org
Email:
[email protected]

Vinod Scaria is a clinician turned computational biologist. His laboratory is interested in understanding the function, organization and regulation of vertebrate genomes, and how genomic variations could potentially impact them. He is also involved in creating novel methods and resources for analysis and annotation of genomes and understanding the functional impact of genomic variations. He has been part of collaborative genomics projects aimed at understanding Asian genome diversity. He has also been part of whole genome sequencing and analysis projects, including the Indian, Sri Lankan and Malaysian genome projects, and is a member of the HUGO Pan-Asian Population Genomics Initiative task force. He has adopted novel and creative strategies, such as the use of social media and the participation of a large number of undergraduate students in collaborative projects, to accelerate genome annotation and the co-creation of resources for genome annotation. Vinod did his undergraduate medical education at Calicut Medical College, University of Calicut, and his PhD in Computational Biology at the University of Pune. Vinod has over 80 publications in international peer-reviewed journals and two book chapters to his credit. He is also on the editorial boards of PLoS ONE, PeerJ, Journal of Translational Medicine and Journal of Orthopaedics (Elsevier). He is a recipient of the CSIR Young Scientist Award for Biological Sciences in 2012. He was a member of the senate of the Academy of Scientific and Innovative Research (AcSIR).

Reaching the authors

This book was written keeping in mind how genomic technologies could translate to patient care. The authors would be happy to extend their expertise and resources to help the diagnosis of patients with rare genetic diseases.
Interested clinicians and patient groups may kindly contact us for further discussion. You could reach us at: Email:
[email protected] OR
[email protected]

Register with the Clinical Exome Group

We have set up a unique Reader’s club to keep you updated about new versions of this book and recent developments in the field. It is also an opportunity to share your questions about exome sequencing and analysis, find answers to them, and discuss interesting cases with experts in the field. To register, follow this link: http://goo.gl/o9aAfC

You could also leave your comments on our Facebook page: https://www.facebook.com/clinicalexome

What readers have to say...

"The book is very well written, concise and provides an excellent collection of data capturing the transition of one era into another. Due emphasis was given towards the limitations of NGS along with its widely acknowledged benefits. It helps one to understand the basics of whole exome sequencing from a realistic viewpoint. Each chapter is well constructed and systematically elucidates situations where WES would be useful. Moreover, it provides an impetus for the clinicians to understand their contributions towards accurate phenotyping for better understanding of the genetic variations in a diagnostic set-up"
Yenamandra Vamsi Krishna, Department of Dermatology, All India Institute of Medical Sciences, Delhi

Let us know what you have to say about this book on our Facebook page: https://www.facebook.com/clinicalexome

Scaria V and Sivasubbu S (2015). Exome Sequence Analysis and Interpretation.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Cover Image: Artist’s impression of Nucleotides in a DNA strand. Oil on canvas by Pradha (2015)