EMBOSS.pdf - Free Download PDF Ebook

 Fall  08  Biochemistry 711 – Book  3 –    EMBOSS Software for sequence analysis  Professor Ann Palmenberg, Institute for Molecular Virology & Department of Biochemistry [email protected] Dr. Jean-Yves Sgro Biotechnology Center & Institute for Molecular Virology [email protected] University of Wisconsin‐Madison      version 10/2008    Biochemistry 711 - 2008 This labbook is Copyright © 1997-2008 A.C. Palmenberg & J.-Y. Sgro, University of Wisconsin-Madison. All Rights Reserved (October 2008)                          [ @                           k? \   Biochem 711 – 2008 i Foreword and Acknowledgements The original laboratory exercises resulted from a long-term commitment to promote and foster genetic computing on the Madison campus by the Genetics Computing Group Inc., (GCG) and its standing collaborative teaching efforts with Ann Palmenberg. John Devereux and Maggie Smith provided, through GCG, the original UNIX-based hardware and software licenses necessary to create the first such curriculum for UW students. We are thankful for their largess in providing the funding for purchase and yearly upgrades the original UW UNIX-based teaching computer. The GCG exercises of this lab book were inspired by the original educational tutorials developed by Barbara Butler to teach this complex family of software programs. She has generously shared her materials and her knowledge for the benefit of UW students and staff. GCG has now been replaced by an open source software and the exercises adapted to this new package: EMBOSS, the European Molecular Biology Open Software Suite. We want to express special thanks to Ms. Marchel Hill, a course instructor, who has helped translate the GCG exercises to an EMBOSS equivalent and has unselfishly volunteered many hundreds of hours of her time and also her teaching skills towards tutoring UW students, both inside and outside of the scheduled classes. Ann and Jean-Yves would also like to acknowledge Joshua Harder at the Digital Media Center (DMC) for the maintenance of the desktop computing classroom and John Koger for installing EMBOSS both on Macintosh and Windows partitions. The goal of these exercises, is to provide an introduction to sequence analysis that will help students acquire the expertise beneficial to his or her research program. Two key lessons are (1) that computers are nothing to be afraid of, and (2) they will only do what they are told. In this modern age of genomics, “what can I DO with my sequence, now that I have it?” and ”how can I put my sequence into biological perspective?” are very important questions for the learned biologist. If by taking this lab course you simply increase your confidence when using a computer, it will be time well spent!  Foreword and Acknowledgements ‐ i Biochem 711 – 2008 ii The BLOSUM62 matrix BLOSUM (BLOcks of Amino Acid SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins. BLOSUM are used to score alignments between evolutionarily divergent protein sequences. BLOSUM is based on local alignments. BLOSUM was first introduced in a paper by Henikoff and Henikoff [1]. They scanned the BLOCKS database for very conserved regions of protein families (that do not have gaps in the sequence alignment) and then counted the relative frequencies of amino acids and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitutions of the 20 standard amino acids. All BLOSUM are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM Matrices. [1] Henikoff, S., Henikoff, JG. (1992). "Amino Acid Substitution Matrices from Protein Blocks". Proc Natl Acad Sci 89 (22): 10915–10919. doi:10.1073/pnas.89.22.10915. PMID 1438297 Source: http://en.wikipedia.org/wiki/BLOSUM Introduction to EMBOSS ‐ ii Biochem 711 – 2008 1 Introduction to EMBOSS Table of Contents Introduction: The EMBOSS Package ....................................................... 2 1. 2. 3. 4. History ......................................................................................................... Overview....................................................................................................... License......................................................................................................... The EMBOSS software organization .............................................................. 4.1. Applications ............................................................................................ 4.2. Platforms & Interface ................................................................................ 4.3. Accessing the line-command..................................................................... 5. Download and installation............................................................................. 5.1. Windows.................................................................................................. 5.2. Macintosh ............................................................................................... 6. Manual, documentation and help .................................................................. 7. Tutorial ........................................................................................................ 2 2 2 3 3 3 4 4 5 5 6 6 EMBOSS Graphical Output ...................................................................... 7 EMBOSS Commands Organized by Functional Group ............................... 8 GCG to EMBOSS Commands Equivalence .............................................. 14 Introduction to EMBOSS ‐ 1 EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole.I. 16(6) pp.embnet. For developers who have their own licensing conditions already in effect […] the EMBASSY collection can include packages that use the EMBOSS core libraries and interfaces but under their own licensing conditions. originated in Madison1. A comprehensive set of sequence analysis programs for the VAX. Because of changes in the source rcode distribution rules of GCG and other factors the former EGCG developers created a totally new generation of academic sequence analysis software: the present EMBOSS project. 1984 Jan 11. No one individual or institute 'owns' the code.A. was a pioneering software for sequence analysis that became commercial in 1992. and Bleasby. Smithies O. Longden. Haeberli P. EMBOSS breaks the historical trend towards commercial software packages3. Citation: EMBOSS: The European Molecular Biology Open Software Suite (2000) Rice. A. Nucleic Acids Res. Trends in Genetics 16. Overview EMBOSS is "The European Molecular Biology Open Software Suite". I. They will be bound by the Library GPL […]. 2. 2 EMBnet (http://www.org/) is the only organisation world-wide bringing bioinformatics professionals to work together to serve the expanding fields of genetics and molecular biology.12(1 Pt 1):387-95. History The Genetics Computer Group (GCG or Wisconsin package). EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology community […].P.net/licence/ 1 Devereux J.Biochem 711 – 2008 2 Introduction: The EMBOSS Package 1. (6) pp276-277 3. and Bleasby. Longden. EGCG developed by a group within EMBnet2 from 1988 provided extensions to the GCG package.276-277 Introduction to EMBOSS ‐ 2 . For more information see http://emboss. but not necessarily by the full GPL. License EMBOSS is licensed for use by everyone under the GNU General Public Licence (GPL) and GNU Library General Public Licence (LGPL) licences. "EMBOSS: The European Molecular Biology Open Software Suite" Trends in Genetics June 2000.sourceforge. 3 Rice. P. Jemboss assumes a client-server set-up but in some cases can be available as a stand-alone application. The EMBOSS applications are organized into 45 logical groups according to their function. For example the group ALIGNMENT GLOBAL contains 4 applications: Table .sourceforge. including in Microsoft Windows cmd DOS interface.html). (http://emboss. Platforms & Interface EMBOSS exists for multiple computer platforms. However. Applications EMBOSS is a set of a few hundred programs (applications) that handle specific functions.2. Introduction to EMBOSS ‐ 3 .net/apps/groups.Local sequence alignment Program name matcher seqmatchall supermatcher water wordmatch Description Finds the best local alignments between two sequences All-against-all comparison of a set of sequences Match large sequences against one or more other sequences Smith-Waterman local alignment Finds all exact matches of a given size between 2 sequences 4. The groups cover the EMBOSS and EMBASSY (see above) sets of applications. The EMBOSS software organization 4. These commands can be called from multiple graphical interface (GUI) variations that can be added over EMBOSS (some GUIsand not available for all platforms.Biochem 711 – 2008 3 4. All platforms can support the basic line-command version of EMBOSS.1.Global sequence alignment Program name est2genome needle stretcher esim4 Description Align EST and genomic DNA sequences Needleman-Wunsch global alignment Finds the best global alignment between two sequences Align an mRNA to a genomic DNA sequence while the group ALIGNMENT LOCAL contains 5 applications: Table .) The most common GUI is the Java-based Jemboss that is part of the EMBOSS development. The line-command applications are the core engine of EMBOSS. 2. Essentially EMBOSS can be viewed as a layer over the operating system (OS). Macintosh On a Macintosh it is available on the Terminal or X11 terminal found within Applications > Utilities 4. 5.1. Similarly the GUI can be viewed as another layer between EMBOSS and the user: 4 User GUI EMBOSS applications OS Therefore the GUI is useful but not essential to running EMBOSS. There also exists various web interfaces options.sourceforge.3. 4. A list of all available GUI is at http://emboss.net/interfaces/ 4.3. Download and installation Introduction to EMBOSS ‐ 4 . Note: you may need Administrator privilege to install.3. Accessing the line-command The line-command is the most basic way to interact with the operating system. Windows On a Windows system it is available within the DOS command window started by the menu cascade: Start > Run and enter cmd within the resulting window: This will open a new DOS command-line text window.Biochem 711 – 2008 Some GUIs are specific to an operating system. such as EMBOSSrunner for MacOSX. an FTP site: ftp://emboss.sourceforge.org/ The simplest method to using fink is via the fink GUI called FinkCommander (part of the download package.open-bio.finkproject. an use the top left button ( binary” to install in your system: ) “install from Introduction to EMBOSS ‐ 5 . Macintosh Macintosh users install EMBOSS form fink http://www. However.org/pub/EMBOSS/windows/ The Windows version is called mEMBOSS and developers insist that any emails sent their way specify this fact and not EMBOSSWin or any other name.org/pub/EMBOSS/ Biologists should only consider the “stable release” and not bother with any developer release. Windows Windows users will be pleased to find a Windows-only version of EMBOSS that installs together with Jemboss (the Java GUI interface) configured as a standalone application: ftp://emboss. 5.1.net/download/ is the official download information page.) Seach for “emboss” on the top right. 5.Biochem 711 – 2008 http://emboss. this will point to the actual download site. 5 It is somewhat assumed that the end-user will actually configure and compile the software from the source code. which should be practical on a Linux system.open-bio.2. The Fine Manual (tfm) is the online documentation for applications called by the command line tfm followed by the application name.html -e- Introduction to EMBOSS ‐ 6 . Tutorial A short online tutorial is available on the EMBOSS home page or by going directly to: http://emboss. Manual. documentation and help The documentation page http://emboss.sourceforge. (Note: in line-command $ and % are typical prompts waiting for user’s input.net/docs/ has limited information but provides other links.Biochem 711 – 2008 6 6.) For example: $ wossname global Finds programs by keywords in their short description SEARCH FOR 'GLOBAL' est2genome Align EST sequences to genomic DNA sequence needle Needleman-Wunsch global alignment of two sequences stretcher Needleman-Wunsch rapid global alignment of two sequences Therefore to obtain information on the application needle for global alignment the command would be: $ tfm needle Help can also simply be requested by adding –help after the name of the application.sourceforge. An online search might reveal manuals at various institutions.net/docs/emboss_tutorial/emboss_tutorial. the user can ask to be prompted for optional parameters by adding –opt after the name of the application: $ needle -opt 7. To find relevant applications the command wossname is very useful: it will echo back a list of applications based on a single search word. for example: $ needle -help Finally. For example X11 graphics if connecting by line command or PNG is using Jemboss. The graphical format can be altered by the –graph qualifier The allowed values are better explained in the following table: Example: $ dotmatcher calm_drome.fasta calm_drome.fasta Draw a threshold dotplot of two sequences Created dotmatcher.embnet.png -graph png Introduction to EMBOSS ‐ 7 .1.ch.html) EMBOSS applications that create a graphical output (interactive or redirected to a file) will send the graphics to the default current set-up.org/EMBOSS/introduction.Biochem 711 – 2008 7 EMBOSS Graphical Output (From http://www. with colouring and boxing Output sequence with translated ranges Display sequence with restriction sites.interface to ClustalW program Information on a multiple sequence alignment Plot quality of conservation of a sequence alignment Displays aligned sequences.Biochem 711 – 2008 8 EMBOSS Commands Organized by Functional Group Group Acd acdc acdpretty acdtable acdtrace acdvalid Alignment consensus Cons megamerger merger Alignment differences diffseq Alignment dot plots dotmatcher dotpath dottup polydot Alignment global est2genome needle stretcher esim4 Alignment local matcher seqmatchall supermatcher water wordmatch Alignment multiple emma infoalign plotcon prettyplot showalign tranalign mse Display abiview cirdna lindna pepnet pepwheel prettyplot prettyseq remap seealso showalign showdb showfeat showseq sixpack textsearch Description Acd file utilities ACD compiler ACD pretty printing utility Creates an HTML table from an ACD file ACD compiler on-screen trace ACD file validation Merging sequences to make a consensus Creates a consensus from multiple alignments Merge two large overlapping nucleic acid sequences Merge two overlapping nucleic acid sequences Finding differences between sequences Find differences between nearly identical sequences Dot plot sequence comparisons Displays a thresholded dotplot of two sequences Non-overlapping wordmatch dotplot of two sequences Displays a wordmatch dotplot of two sequences Displays all-against-all dotplots of a set of sequences Global sequence alignment Align EST and genomic DNA sequences Needleman-Wunsch global alignment Finds the best global alignment between two sequences Align an mRNA to a genomic DNA sequence Local sequence alignment Finds the best local alignments between two sequences All-against-all comparison of a set of sequences Match large sequences against one or more other sequences Smith-Waterman local alignment Finds all exact matches of a given size between 2 sequences Multiple sequence alignment Multiple alignment program . translation etc Display a DNA sequence with 6-frame translation and ORFs Search sequence documentation. use SRS and Entrez! Introduction to EMBOSS ‐ 8 . Slow. with colouring and boxing Displays a multiple sequence alignment Align nucleic coding regions given the aligned proteins Multiple Sequence Editor Publication-quality display Reads ABI file and display the trace Draws circular maps of DNA constructs Draws linear maps of DNA constructs Displays proteins as a helical net Shows protein sequences as helices Displays aligned sequences. translation etc Finds programs sharing group names Displays a multiple sequence alignment Displays information on the currently available databases Show features of a sequence Display a sequence with features. skipping first few Split a sequence into (overlapping) smaller sequences Trim poly-A tails off EST sequences Trim ambiguous bits off the ends of sequences Reads sequence fragments and builds one sequence Strips out DNA between a pair of vector sequences Reads a sequence range. appends the full USA to a list file Enzyme kinetics calculations Find Km and Vmax for an enzyme reaction Manipulation and display of sequence annotation Extract CDS. use SRS and Entrez! Displays a program's help documentation manual Search all databases for an entry Finds programs by keywords in their one-line documentation Menu interface(s) Simple menu of EMBOSS applications Nucleic acid secondary structure Finds DNA inverted repeats 9 Introduction to EMBOSS ‐ 9 .Biochem 711 – 2008 Edit biosed codcopy cutseq degapseq descseq entret extractfeat extractseq listor maskfeat maskseq newseq noreturn notseq nthseq pasteseq revseq seqret seqretsplit skipseq splitter trimest trimseq union vectorstrip yank Enzyme kinetics findkm Feature tables coderet extractfeat maskfeat showfeat twofeat HMM ealistat ehmmalign ehmmbuild ehmmcalibrate ehmmconvert ehmmemit ehmmfetch ehmmindex ehmmpfam ehmmsearch Information infoalign infoseq seealso showdb textsearch tfm whichdb wossname Menus emnu Nucleic 2d structure einverted Sequence editing Replace or delete sequence sections Reads and writes a codon usage table Removes a specified section from a sequence Removes gap characters from sequences Alter the name or description of a sequence Reads and writes (returns) flatfile entries Extract features from a sequence Extract regions from a sequence Write a list file of the logical OR of two sets of sequences Mask off features of a sequence Mask off regions of a sequence Type in a short new sequence Removes carriage return from ASCII files Exclude a set of sequences and write out the remaining ones Writes one sequence from a multiple set of sequences Insert one sequence into another Reverse and complement a sequence Reads and writes (returns) sequences Reads and writes (returns) sequences in individual files Reads and writes (returns) sequences. mRNA and translations from feature tables Extract features from a sequence Mask off features of a sequence Show features of a sequence Finds neighbouring pairs of features in sequences Hidden markov model analysis Statistics for multiple alignment files Align sequences with an HMM Build HMM Calibrate a hidden Markov model Convert between HMM formats Extract HMM sequences Extract HMM from a database Index an HMM database Align single sequence with an HMM Search sequence database with an HMM Information and general help for users Information on a multiple sequence alignment Displays some simple information about sequences Finds programs sharing group names Displays information on the currently available databases Search sequence documentation. Slow. references. suppliers etc Display sequence with restriction sites. translation etc Find restriction enzymes producing specific overhang Finds restriction enzyme cleavage sites 10 Introduction to EMBOSS ‐ 10 .Biochem 711 – 2008 Nucleic codon usage cai chips codcmp cusp syco Nucleic composition banana btwisted chaos compseq dan freak isochore sirna wordcount Nucleic CpG islands cpgplot cpgreport geecee newcpgreport newcpgseek Nucleic gene finding getorf marscan plotorf showorf sixpack syco tcode wobble Nucleic motifs dreg fuzznuc fuzztran marscan Nucleic mutation msbar shuffleseq Nucleic primers eprimer3 primersearch stssearch Nucleic profiles profit prophecy prophet Nucleic repeats einverted equicktandem etandem palindrome Nucleic restriction recoder redata remap restover restrict Codon usage analysis CAI codon adaptation index Codon usage statistics Codon usage table comparison Create a codon usage table Synonymous codon usage Gribskov statistic plot Composition of nucleotide sequences Bending and curvature plot in B-DNA Calculates the twisting in a B-DNA sequence Create a chaos game representation plot for a sequence Count composition of dimer/trimer/etc words in a sequence Calculates DNA RNA/DNA melting temperature Residue/base frequency table or plot Plots isochores in large DNA sequences Finds siRNA duplexes in mRNA Counts words of a specified size in a DNA sequence CpG island detection and analysis Plot CpG rich areas Reports all CpG rich regions Calculates fractional GC content of nucleic acid sequences Report CpG rich areas Reports CpG rich regions Predictions of genes and other genomic features Finds and extracts open reading frames (ORFs) Finds MAR/SAR sites in nucleic sequences Plot potential open reading frames Pretty output of DNA translations Display a DNA sequence with 6-frame translation and ORFs Synonymous codon usage Gribskov statistic plot Fickett TESTCODE statistic to identify protein-coding DNA Wobble base plot Nucleic acid motif searches Regular expression search of a nucleotide sequence Nucleic acid pattern search Protein pattern search after translation Finds MAR/SAR sites in nucleic sequences Nucleic acid sequence mutation Mutate sequence beyond all recognition Shuffles a set of sequences maintaining composition Primer prediction Picks PCR primers and hybridization oligos Searches DNA sequences for matches with primer pairs Search a DNA database for matches with a set of STS primers Nucleic acid profile generation and searching Scan a sequence or database with a matrix or profile Creates matrices/profiles from multiple alignments Gapped alignment for profiles Nucleic acid repeat detection Finds DNA inverted repeats Finds tandem repeats Looks for tandem repeats in a nucleotide sequence Looks for inverted repeats in a nucleotide sequence Restriction enzyme sites in nucleotide sequences Remove restriction sites but maintain same translation Search REBASE for enzyme name. translation etc Pretty output of DNA translations Display a sequence with features. branch-and-bound Phylogenetic distance matrix methods Creates a distance matrix from multiple alignments Fitch-Margoliash and Least-Squares Distance Methods Fitch-Margoliash method with contemporary tips 11 Introduction to EMBOSS ‐ 11 . mRNA and translations from feature tables Plot potential open reading frames Output sequence with translated ranges Display sequence with restriction sites. promoters and terminator prediction Scans DNA sequences for transcription factors Translation of nucleotide sequence to protein sequence Back translate a protein sequence to ambiguous codons Back translate a protein sequence Extract CDS. translation etc Silent mutation restriction enzyme scan RNA folding methods and analysis RNA alignment folding RNA alignment folding with partition RNA cofolding RNA cofolding with concentrations RNA cofolding with partitioning RNA distances RNA duplex calculation RNA eval RNA eval with cofold Calculate secondary structures of RNAs Secondary structures of RNAs with partition RNA melting RNA sequences matching a structure Calculate locally stable secondary structures of RNAs Plot vrnafold output Calculate RNA suboptimals Transcription factors. branch-and-bound Largest clique program Dollo and polymorphism parsimony algorithm Penny algorithm Dollo or polymorphism Multistate to binary recoding program Mixed parsimony algorithm Interactive mixed method parsimony Discrete character parsimony Penny algorithm. translation etc Display a DNA sequence with 6-frame translation and ORFs Translate nucleic acid sequences Phylogenetic consensus methods Majority-rule and strict consensus tree Majority-rule and strict consensus tree Distances between trees Distances between two sets of trees Phylogenetic continuous character methods Continuous character Maximum Likelihood method Continuous character Contrasts Continuous character Contrasts Phylogenetic discrete character methods Largest clique program Dollo and polymorphism parsimony algorithm Penny algorithm Dollo or polymorphism Multistate to binary recoding program Mixed parsimony algorithm Penny algorithm.Biochem 711 – 2008 showseq silent Nucleic RNA folding vrnaalifold vrnaalifoldpf vrnacofold vrnacofoldconc vrnacofoldpf vrnadistance vrnaduplex vrnaeval vrnaevalpair vrnafold vrnafoldpf vrnaheat vrnainverse vrnalfold vrnaplot vrnasubopt Nucleic transcription tfscan Nucleic translation backtranambig backtranseq coderet plotorf prettyseq remap showorf showseq sixpack transeq Phylogeny consensus econsense fconsense ftreedist ftreedistpair Phylogeny continuous characters econtml econtrast fcontrast Phylogeny discrete characters eclique edollop edolpenny efactor emix epenny fclique fdollop fdolpenny ffactor fmix fmove fpars fpenny Phylogeny distance matrix distmat efitch ekitsch Display a sequence with features. Biochem 711 – 2008 eneighbor ffitch fkitsch fneighbor Phylogeny gene frequencies egendist fcontml fgendist Phylogeny molecular sequence ednacomp ednadist ednainvar ednaml ednamlk ednapars ednapenny eprotdist eprotpars erestml eseqboot fdiscboot fdnacomp fdnadist fdnainvar fdnaml fdnamlk fdnamove fdnapars fdnapenny fdolmove ffreqboot fproml fpromlk fprotdist fprotpars frestboot frestdist frestml fseqboot fseqbootall Phylogeny tree drawing fdrawgram fdrawtree fretree Protein 2d structure garnier helixturnhelix hmoment pepcoil pepnet pepwheel tmap topo Protein 3d structure psiphi domainreso domainalign domainrep seqalign seqfraggle seqsearch seqsort seqwords Phylogenies from distance matrix by N-J or UPGMA method Fitch-Margoliash and Least-Squares Distance Methods Fitch-Margoliash method with contemporary tips Phylogenies from distance matrix by N-J or UPGMA method Phylogenetic gene frequency methods Genetic Distance Matrix program Gene frequency and continuous character Maximum Likelihood Compute genetic distances from gene frequencies Phylogenetic tree drawing methods DNA compatibility algorithm Nucleic acid sequence Distance Matrix program Nucleic acid sequence Invariants method Phylogenies from nucleic acid Maximum Likelihood Phylogenies from nucleic acid Maximum Likelihood with clock DNA parsimony algorithm Penny algorithm for DNA Protein distance algorithm Protein parsimony algorithm Restriction site Maximum Likelihood method Bootstrapped sequences algorithm Bootstrapped discrete sites algorithm DNA compatibility algorithm Nucleic acid sequence Distance Matrix program Nucleic acid sequence Invariants method Estimates nucleotide phylogeny by maximum likelihood Estimates nucleotide phylogeny by maximum likelihood Interactive DNA parsimony DNA parsimony algorithm Penny algorithm for DNA Interactive Dollo or Polymorphism Parsimony Bootstrapped genetic frequencies algorithm Protein phylogeny by maximum likelihood Protein phylogeny by maximum likelihood Protein distance algorithm Protein pasimony algorithm Bootstrapped restriction sites algorithm Distance matrix from restriction sites or fragments Restriction site maximum Likelihood method Bootstrapped sequences algorithm Bootstrapped sequences algorithm Phylogenetic molecular sequence Methods Plots a cladogram.or phenogram-like rooted tree diagram Plots an unrooted tree diagram Interactive tree rearrangement Protein secondary structure Predicts protein secondary structure Report nucleic acid binding motifs Hydrophobic moment calculation Predicts coiled coil regions Displays proteins as a helical net Shows protein sequences as helices Displays membrane spanning regions Draws an image of a transmembrane protein Protein tertiary structure Phi and psi torsion angles from protein coordinates Remove low resolution domains from a DCF file Generate alignments (DAF file) for nodes in a DCF file Reorder DCF file to identify representative structures Extend alignments (DAF file) with sequences (DHF file) Removes fragment sequences from DHF files Generate PSI-BLAST hits (DHF file) from a DAF file Remove ambiguous classified sequences from DHF files Generates DHF files from keyword search of UniProt 12 Introduction to EMBOSS ‐ 12 . not for general use.Biochem 711 – 2008 libgen matgen3d rocon rocplot siggen siggenlig sigscan sigscanlig contacts interface Protein composition Backtranambig backtranseq charge checktrans compseq emowse freak iep mwcontam mwfilter octanol pepinfo pepstats pepwindow pepwindowall Protein motifs ntigenic digest epestfind fuzzpro fuzztran helixturnhelix oddcomp patmatdb patmatmotifs pepcoil preg pscan sigcleave meme Protein mutation msbar shuffleseq Protein profiles profit prophecy prophet Test crystalball Utils database creation aaindexextract cutgextract printsextract prosextract rebaseextract tfextract cathparse domainnr domainseqs domainsse scopparse Generate discriminating elements from alignments Generate a 3D-1D scoring matrix from CCF files Generates a hits file from comparing two DHF files Performs ROC analysis on hits files Generates a sparse protein signature from an alignment Generate ligand-binding signatures from a CON file Generate hits (DHF file) from a signature search Search ligand-signature library & write hits (LHF file) Generate intra-chain CON files from CCF files Generate inter-chain CON files from CCF files Composition of protein sequences Back translate a protein sequence to ambiguous codons Back translate a protein sequence Protein charge plot Reports STOP codons and ORF statistics of a protein Count composition of dimer/trimer/etc words in a sequence Protein identification by mass spectrometry Residue/base frequency table or plot Calculates the isoelectric point of a protein Shows molwts that match across a set of files Filter noisy molwts from mass spec output Displays protein hydropathy Plots simple amino acid properties in parallel Protein statistics Displays protein hydropathy Displays protein hydropathy of a set of sequences Protein motif searches Finds antigenic sites in proteins Protein proteolytic enzyme or reagent cleavage digest Finds PEST motifs as potential proteolytic cleavage sites Protein pattern search Protein pattern search after translation Report nucleic acid binding motifs Find protein sequence regions with a biased composition Search a protein sequence with a motif Search a PROSITE motif database with a protein sequence Predicts coiled coil regions Regular expression search of a protein sequence Scans proteins using PRINTS Reports protein signal cleavage sites Motif detection Protein sequence mutation Mutate sequence beyond all recognition Shuffles a set of sequences maintaining composition Protein profile generation and searching Scan a sequence or database with a matrix or profile Creates matrices/profiles from multiple alignments Gapped alignment for profiles Testing tools. Answers every drug discovery question about a sequence Database installation Extract data from AAINDEX Extract data from CUTG Extract data from PRINTS Build the PROSITE motif database for use by patmatmotifs Extract data from REBASE Extract data from TRANSFAC Generates DCF file from raw CATH files Removes redundant domains from a DCF file Adds sequence records to a DCF file Add secondary structure records to a DCF file Generate DCF file from raw SCOP files 13 Introduction to EMBOSS ‐ 13 . compseq -. Sequence composition Removes extra whitespace in text files. backtranambig backtranslates to ambiguous codons.fr/faq/outils/gcg-vs-emboss Former GCG users will find this extremly useful.counts composition of dimer/trimer in sequence.calculates codon usage stats cusp -. matcher uses Pearson's lalign algorithm. Creates a scoring matrix Creates a consensus sequence or matrices/profiles from multiple alignments Codon usage table comparison CodonPreference CoilScan Compare + DotPlot Composition compresstext comptable consensus correspond Introduction to EMBOSS ‐ 14 .inra. chips -. dotpath does a non-overlapping wordmatch dotplot.jouy.nih.Biochem 711 – 2008 ssematch allversusall seqnr domainer hetparse pdbparse pdbplus pdbtosp sites Search a DCF file for secondary structure matches Sequence similarity data from all-versus-all comparison Removes redundancy from DHF files Generates domain CCF files from protein CCF files Converts heterogen group dictionary to EMBL-like format Parses PDB files and writes protein CCF files Add accessibility & secondary structure to a CCF file Convert swissprot:PDB codes file to EMBL-like format Generate residue-ligand CON files from CCF files 14 Utils database indexing dbiblast dbifasta dbiflat dbigcg dbxfasta dbxflat dbxgcg Utils misc embossdata embossversion Database indexing Index a BLAST database Database indexing for fasta file databases Index a flat file database Index a GCG formatted database Database b+tree indexing for fasta file databases Database b+tree indexing for flat file databases Database b+tree indexing for GCG formatted databases Utility tools Finds or fetches data files read by EMBOSS programs Writes the current EMBOSS version number GCG to EMBOSS Commands Equivalence Edited from http://helix. Backtranslate protein -> nucleotide sequence.creates a codon usage table. water uses Smith-Waterman. Bestfit uses the Smith-Waterman algorithm to find the best local alignment between 2 sequences. Recognize protein coding sequences Predicts coiled-coil regions 2-sequence comparison. GCG program Assemble BackTranslate BestFit Blast Psiblast Breakup Chopup CodonFrequency EMBOSS program merger union backtranseq backtranambig water matcher dbiBlast splitter chips compseq cusp syco wobble pepcoil dottup + dotmatcher dotpath compseq pepstats prophecy codcmp Description/Comments Construct new sequences from pieces of existing sequences. merger only accepts 2 sequences while assemble and union accept several.gov/Applications/ And / or http://migale. NCBI homology search between query and database Splits a sequence into (overlapping) smaller sequences Helps to convert a non-GCG sequence format Not needed in EMBOSS because it reads most sequence formats without conversion CodonFrequency --tabulates codon usage. Can be done via Unix shell script. html 2-sequence comparison ExtractPeptide takes the output of Map and can write one or more of the reading-frame translations. Makes a Blast database. Generates plots from other GCG programs. available as a standalone. The equivalent EMBOSS programs usually generate plots (e. The Phylip package can do this.genetics. http://evolution. whereas dbiflat. Graphical representation of similarity of 2 sequences. If one of your sequence is genomic and you are trying to align an est sequence to it. plotorf does this graphically Homology searches including frameshifts between protein and nucleotide sequences Converts from various formats to GCG sequence format.g. stretcher uses the Myers-Miller algorithm which is more memory-efficient. you may want to consider the 'est2genome' program. Can use Phylip or Clustal instead. http://biowiki. and large sequences. but seqret can convert between formats if desired. Pearson's homology-search program.org/HmmerPackage Introduction to EMBOSS ‐ 15 . Finds best local alignment including frame shifts between a protein and nucleotide sequence.html Estimates pairwise substitutions per site between 2 or more coding sequences. Replaces tabs with spaces in sequence files. Can be performed by Unix shell command.washington. Mostly replaced by Blast detab distances diverge dotplot extractpeptide FastA FastX Tfasta TfastX fetch figure findpatterns fingerprint fitconsensus framealign frames framesearch fromembl fromfasta fromgenbank fromig frompir fromstaden fromtrace Gap dottup dotmatcher transeq - seqret seqretsplit fuzznuc fuzzpro plotorf showorf - Pull one or more sequences out of the databases. http://evolution. dbigcg will take most formats between them. Parts of GCG's gel assembly suite. transeq translates one or more of the frames or specific regions directly from an input nucleotide sequence. For sequences larger than 10kb. GCG's Dataset requires sequences in GCG format. Use NCBI's 'formatdb' instead. Plots peptide sequence as helical wheel to help recognize amphiphilic regions. I would suggest you to use 'stretcher' program in EMBOSS which is also a global alignment program.genetics.Biochem 711 – 2008 corrupt dataset msbar dbiflat dbiblast dbigcg - 15 Randomly mutate sequence Creates searchable sequence database. Show open reading frames. medium. Use after Consensus to find the best fits. searches for patterns in a sequence or database Finds the products of T1 ribonuclease digestion.edu/phylip. Sean Eddy's HMMER package. Calculates pairwise evolutionary distances between aligned sequences.edu/phylip. water->matcher>supermatcher are local alignment programs for small. dbiblast. On the other hand. plotorf). needle stretcher Gapshow GCGtoBlast GelAssemble GelDisassemble GelEnter GelMerge GelStart GelView GetSeq GrowTree HelicalWheel HmmerAlign HmmerBuild HmmerCalibrate HmmerEmit HmmerFetch plotcon megamerger merger union Needleman-Wunsch algorithm to compare 2 sequences. seqret/seqretsplit can save output in various sequence formats. seqret pepwheel - Type in a new sequence Creates phylogenetic tree. The Phylip package can do this.washington. respectively. Unnecessary in EMBOSS because it can accept most sequence formats. http://www. If you need this data. patmatmotifs can accept file containing multiple sequences or patterns.edu/ Predicts nucleotide secondary structure. provides some info about sequence specifications. Garnier does not include Jameson-Wolf antigenic indexing. GCG & EMBOSS may display different isoschizomers of the same enzyme. Calculates isoelectric pt of protein.cgi?form=msdigest Secondary structure prediction. and HPLC retention. Info on Zuker’s site: http://mfold.ucsf. CAMP Phosphorylation Site).edu/ Makes a contour plot of the helical hydrophobic moment of a peptide sequence hmoment prints the text output of the calculation. but the results are equivalent. whichdb in emboss can search for accession numbers. (http://www. GCG peptidesort sorts fragments from an enzyme/reagent cleavage of one or more proteins according to position. Can use Unix pcprint command instead. Finds common Prosite motifs in a sequence. and not Prosite 'Matrices' (e.. GCG's version is an old version of Zuker's MFOLD.edu/cgi-bin/msform. Use web version: http://www. Use '-full' tag to display abstract information when using EMBOSS patmatmotifs. Compares 2 sets of sequences using Wilbur-Lipman algorithm. octanol hmoment patmatmotifs Meme + Motifsearch Names NetBlast Netfetch NoOverlap OldDistances onecase Overlap Paupdisplay + Paupsearch Pepdata Pepplot Peptidemap Peptidesort prophecy + profit infoseq diffseq getorf sixpack pepinfo digest digest pepstats Finds HTH motifs in protein sequences.nih. Note that both these programs will only find Prosite 'Patterns' (e. mol. remote access to NCBI's Blast. try the UCSF MS-Digest program which has an option for HPLC Indices. Helix-turn-Helix).ncbi.sdsc. Search a sequence or database with a matrix or profile.nlm.ebi. antigenic predicts potentially antigenic regions of a protein sequence. Multiple sequence alignment.g. and garnier does protein 2ndary structure prediction.bioinfo. pepwindowall produces a set of superimposed Kyte & Doolittle hydropathy plots from an aligned set of protein sequences. Pepplot plots protein 2ndary structure and hydrophobicity. pepwindow displays Kyte-Doolittle protein hydropathy.g. EMBOSS digest only processes one reagent cleavage at a time. Enzyme/reagent cleavage map of a protein. Makes a table of the pairwise similarities within a group of sequenes. The EMBOSS remap program may not display a few of the available isoschizomers.rpi.gov/BLAST/ Finds differences between 2 sequences. EMBOSS pepstats can be used to determine the composition of the fragments afterwards. converts sequence into lower or upper case. The EMBOSS programs do not provide the elution times from HPLC. wt.Biochem 711 – 2008 HmmerIndex HmmerPfam HmmerSearch HTHScan IsoElectric Lineup ListFile Lookup 16 helixturnhelix iep - Map Mapplot Mapsort MeltTemp MEME MFold Moment Motifs restrict remap restover dan pepnet. PAUP Phylogenetic Analysis.nih. http://meme.gov/Entrez/ finds restriction enzyme cleavage sites. sixpack displays the DNA sequence with 6-frame translations and orfs. for printing. Translates in all 6 reading frames.ncbi.ac. Edits multiple sequence alignments – SEE SeqEd below. There exist a standalone Meme/Mast software. using the method of Kolaskar and Tongaonkar. NoOverlap can work with a group of sequences. emma is an interface to ClustalW. pepinfo plots hydrophobicity. but GCG's lookup is much more sophisticated. Computes melting temperature of oligos Finds conserved motifs in a group of unaligned sequences.uk/Tools/InterProScan/). Versatile program for finding sequences in a database. Use Interproscan to find all known domains and functional sites. Can also use the standalone Clustal (command clustalw for linge-command or clustalx for GUI) or web ClustalW online: Peptidestructure Plotstructure garnier antigenic pepwindow pepwindowall Pileup emma Introduction to EMBOSS ‐ 16 . http://prospector.nlm. Can be performed by Unix shell command. Use NCBI Entrez instead. Try the Jemboss alignment editor for editing multiple sequence alignments: http://emboss. union.uk/Tools/clustalw2/ Plot DNA constructs. Redefines keyboard keys. hence other formats need to be converted with 'reformat'.gov/Entrez/ Introduction to EMBOSS ‐ 17 . degapseq. listor. Helix) to your desktop. Scans a sequence or database with a matrix or profile.ebi. Finds inverted repeats. Selects oligonucleotide primers. SecureFX for Windows. Predicts signal peptides in protein sequences. Reduce the number of symbols in a sequence. Sequence editor. Windows. See MFOLD. Evaluates individual primers to determine their compatibility for use as PCR primer pairs.ac. newseq. maskfeat. 17 PlasmidMap PlotFold PlotSimilarity Pretty prettybox Prime Profilegap Profilemake PrimePair Profilescan Profilesearch Profilesegments Publish Reformat cirdna lindna plotcon cons prettyplot showalign eprimer3 prophecy prophet distmat primersearch patmatdb profit seqret showseq seqret Plots MFold output. Rarely used.ncsu. Use the nedit editor instead. Plotting program. vectorstrip. yank Replaces characters in a text file. trimseq. Masks off low-complexity regions from a sequence. Degapseq is specific for replacing gap characters. notseq.g. but 'seqret' can be used to convert between formats if desired. splitter. GCG requires input sequences to be in GCG format. or line-command sftp on Mac/Unix. extractfeat. Creates matrices/profiles from multiple alignments. Alignments for results of Profilesearch Makes publication-quality displays of sequences. revseq. The equivalent group of Emboss programs will also look for inverted or palindromic repeats. Use NCBI's Entrez instead: http://www. Reverse/complement a sequence. http://pbil. trimest. Finds tandem repeats in sequences. Profilescan uses Gribskov method. mainly used for GCG's gel assembly programs.univ-lyon1.mbio. maskseq. Finds text phrases in sequence or database. Graphical representation of the similarity along a set of aligned sequences. skipseq.html ) and Seaview (Mac. available as a standalone program on Helix. seqretsplit. pasteseq. Sends a sequence from a remote computer (e. descseq.sourceforge.Biochem 711 – 2008 http://www. awk or tr. entret. and displays them prettily. Shuffles a sequence. noreturn. Extract regions from a sequence.net/Jemboss/ Other alternatives are BioEdit (Windows only. cutseq. Calculates consensus sequence from a multiple sequence alignment. Can be performed with Unix shell utilities like sed.edu/BioEdit/bioedit. Or use a text editor (not word processor!). Use FTP instead. Emboss programs accept most sequence formats. Repeat Replace Reverse Sample Seg Seqed equicktandem etande einverted palindrome biosed degapseq revseq extractseq maskseq biosed. EMBOSS has several tools for specific editing tasks. http://www.ncbi. extractseq. Unix. Searches sequences or db for protein motifs. so conversion is rarely required.nih. seqret. Moves text by column. nthseq.fr/software/seaview) SeqLab Setkeys Shiftover Shuffle Simplify Spew SPScan Ssearch StatPlot StemLoop Stringsearch shuffleseq sigcleave palindrome etandem textsearch X-windows interface to GCG. Gapped alignment for profiles and sequences.nlm. Part of Pearson's Fasta package. Regular expression search of a sequence. Residue/base frequency table or plot. Findpatterns is an approximate equivalent. mRNA and translations from feature tables Plots and reports CpG-rich regions. Finds PEST motifs as potential proteolytic cleavage sites Align EST and genomic DNA sequences. Remove restriction sites but maintain the same translation all-against-all comparison of a set of sequences. Plots 3rd-position variability as an indicator of potential coding regions. Translates nucleotide -> Protein sequences predicts transmembrane helices. Reports STOP codons and ORF statistics of a protein Extract CDS. Shows features of a sequence Silent mutation restriction enzyme scan Finds siRNA duplexes in mRNA Searches a DNA database for matches with a set of STS primers Introduction to EMBOSS ‐ 18 . Homology search using Wilbur/Lipman algorithm. Reads ABI file and displays trace Finds antigenic sites in proteins Bending and curvature plot in B-DNA Calculates the twisting in a B-DNA sequence CAI codon adaptation index. Segments displays the result. Pulls one sequence out of a multiple set. therefore format conversion is rarely required. Can be performed by Unix utilities like 'tr'. Protein identification by Mass spectrometry. Shows molwts that match across a set of files Filter noisy molwts from mass spec output remove carriage return from a ASCII files. Reformat will pull a sequence out of an MSF or RSF file. Find Km and Vmax for an enzyme reaction by a Hanes/Woolf plot Protein pattern search after translation Calculates the fractional GC content of nucleic acid sequences Plots isochores in large DNA sequences Writes a list file of the logical OR of two sets of sequences Create random nucleotide and protein sequences Finds MAR/SAR sites in nucleic sequences Mask off features of a sequence. to measure synonymous codon usage bias. cutseq is command-line. Emboss accepts most sequence formats. Shows info about currently available databases. Create a chaos game representation plot for a sequence Protein charge plot. Extract features from a sequence. transeq freak abiview antigenic banana btwisted cai chaos charge checktrans coderet cpgplot cpgreport newcpgreport newcpgseek cutseq degapseq dreg emma emowse epestfind est2genome extractfeat findkm fuzztran geecee isochore listor makenucseq makeprotseq marscan maskfeat mwcontam mwfilter noreturn nthseq oddcomp polydot printsextract pscan rebaseextract redata recoder seqmatchall showdb showfeat silent sirna stssearch seqed seqed Findpatterns Reformat - Removes a specified section from a sequence. Finds protein sequence regions with a biased composition Displays all-against-all dotplots of a set of sequences Extract data from PRINTS Scans proteins using PRINTS Search and extract from REBASE. Masks tandem repeats for future Blast search.Biochem 711 – 2008 Terminator Testcode ToFastA ToIG ToPIR ToStaden Translate Transmem Window + Statplot Wordsearch Segments Xnu wobble seqret 18 searches for prokaryotic factor-independent RNA polymerase terminators according to the method of Brendel and Trifonov. seqed is interactive. seqret can be used to convert between formats if desired. Alter name/description of sequence. interface to ClustalW program. Scans DNA sequences for transcription factors Displays membrane spanning regions Align nucleic coding regions given the aligned proteins Trim bits off ends of sequences. shows documentation for a program. inds neighbouring pairs of features in sequences Strips out DNA between a pair of vector sequences Counts words of a specified size in a DNA sequence Finds all exact matches of a given size between 2 sequences 19  Introduction to EMBOSS ‐ 19 . Can be done interactively with GCG's seqed.Biochem 711 – 2008 gcghelp supermatcher tfextract tfm tfscan tmap tranalign trimest trimseq twofeat vectorstrip wordcount wordmatch Finds a match of a large sequence against one or more sequences Extract data from TRANSFAC database. Biochem 711 – 2008 Class notes 20 Introduction to EMBOSS ‐ 20 . ............................................................................................ Changing directory and present working directory: cd......................... 2...................... Multiple sequence formats: seqret..... 30 32 33 35 35 36 36 37 38 38 39 39 40 41 42 L09 Exercise D: Pairwise comparisons with dotplots ................................. 28 2.............................................................................................3... 2..................................................................................................................... Relative path and current directory: ..................................................... Nucleotide sequence comparison......................................................... 23 1.........1. Word size............. Begin an Xterm session ............ List files: @ symbol .............................................. 3.................................Biochem 711 – 2008 21 L09: Pairwise alignment with EMBOSS Table of Contents L09 Exercise A: Xterm and Unix line commands ... 2...................... 28 L09 Exercise C: Sequence format and changing format: seqret.... Redirect of standard text output: >............... 2................. 2..................... Inverted repeats ............................... % .... 2........................5.................. The prompt: $............................................. Documentation and Help: tfm...................................................................................... 2........................................................11..............7....................... Full path location of a file....... Local alignment: water .................... tfm.....................10................ 2............................................ Fasta format. tail ..............8.. 2...................... 2...........2.. 2.......................................... 2................. 36 L09 Exercise E: Pairwise comparisons with optimal alignments ......................9... 6. Creating a new directory: mkdir... 2...................... Comparison tables: BLOSUM62................................................. 5.. Dotmatcher ........... dottup ............................... 28 1............................................................................1........... Seqret reads and writes (reformats) sequences ....... 4................ seqretsplit ................................................................................................................................. 29 1...............................6....... help and manual pages: man ................................ Window size...................................... head.......... 2................................ 2....... 43 Pairwise comparison with EMBOSS ‐ 21 ........ Documentation.......... Find relevant programs: wossname ..... nano........... Text file content: cat..................1............. Line commands................ pwd......................................................................................... 1....................... 43 1.. Simple text editing: pico........ 3............ -option .............. Summary tables........ 4.......... -help..................... -option ......................... ............. 3... 2....................1......................................................................................................................... 23 23 23 24 24 24 25 25 26 26 26 27 27 L09 Exercise B: Help and relevant EMBOSS applications: wossname.........................................3............. Directory listing: ls..................... Changing the format: format codes....4..........................~ .............................................................................. 2.................... Working directory ................ Threshold ............... more. Defaults run ......2.................... 2........................... .... 3......................... 49 Pairwise comparison with EMBOSS ‐ 22 ............... Defaults run .... Comparison tables: PAM250 ................................................................................................................... 22 45 45 45 46 47 48 L09 Exercise F: End of laboratory ................... Alternative alignments ....................... Change the gaps ............................................................................. Global alignment: needle ..1................................. 2................................................................ 4...................... Global and local alignment comparison................. 2..2......Biochem 711 – 2008 2.............................................. 5................................................. and with the most recent Mac OS Terminal will transfer the graphical output to the X11 system. Under Public Data click on Class Resources. Therefore you can also use Terminal if you prefer. 1. 2. All of the commands can be transcribed to any of the multiple GUI interfaces that exist for EMBOSS. Copy/Paste is easier from Terminal.edu/acp web site. However. then under Files for ACP Labs use the Files for our Labs pull-down menu to select Seq Files for Lab. Terminal is found in the directory Applications/Utilties 2. In this manner. % Pairwise comparison with EMBOSS ‐ 23 . Note: X11 is mandatory for any EMBOSS application that has graphical output.1.Biochem 711 – 2008 23 The EMBOSS package will be used here as a line-command tool. including the java-based Jemboss interface. The prompt: $. Begin an Xterm session ✔ TASK Click on the X11 logo within the Dock (bottom of screen) This will launch X11 (Xwindows) and open an xterm VT100 terminal emulation At the % or $ prompt type (DO NOT TYPE the prompt!) $ cd Desktop/LabFiles LabFiles is on the DMC dektops Alternatively find X11 within the Applications > Utilities directory. They serve to navigate along the directory tree on the hard drive. the various options for each particular GUI do not become an encumbrance in the learning process and the users can concentrate on the algorithms and the effect of changing parameters from default. L09 Exercise A: Xterm and Unix line commands The directory LabFiles on the desktop contains the files necessary for these exercises.wisc. Line commands ✔ READ A few sets of line-commands are useful to know. However launch X11 as well. available on all installations and therefore common to all platforms. If you are practicing at home you can download the files from the http://virology. 3. all directories that need to be traversed from the root directory to reach to the file to be accessed need to be listed and separated by a forward slash without space. C: and the slashes are backward slashes. the root is the hard drive letter e.Biochem 711 – 2008 24 The line command prompt means that the computer is ready for input.g. assuming that the software we are using is now “looking” within the current directory where the file resides.g.~ We already used the cd command above to change directory. not covered here. Full path location of a file Under Unix the top directory is called “root” and is symbolized by a forward slash. Changing directory and present working directory: cd. Note: On a Windows system it is exactly the same..4. Combined with the path. The relative path is a method to access the file without going through all the hierarchy of the directories from root and relative to the current location. For example: C:\Documents and Settings\Administrator\Desktop\myfile. To access a file e.) For example: /Users/dmc/Desktop/myfile.txt . The prompt could also be > and depending on the computer setting reflect the name of the computer and even the current directory name.. Relative path and current directory: . spaces are allowed. while the parent directory immediately above the current directory is represented by a double dot ./Desktop/myfile. Therefore. Pairwise comparison with EMBOSS ‐ 24 ..2.txt 2. the following relative paths are correct depending on the location of the file and the location where the software is “looking”: myfile. pwd. 2.txt. except that it is usually caseinsensitive.txt is the full path to the file myfile. which is the first /. myfile. Typically the prompt is either $ or % for non-administrative users. The simplest relative path is simply the name of the file alone..txt .txt 2./myfile. The special symbol for the current directory is a dot: .txt since it starts with root. (spaces can be allowed but require special care. . one can access any directory within the accessible hard drives. most remarkable are . 25 A special case is a very useful shorthand that always takes you back “home” (to your home directory as computers may have multiple users. Note: on Windows. For example.pep blue./ and .vec. Extremely useful if one gets “lost” even with help of pwd! 2. Directory listing: ls To obtain a list of the files present in the current working directory we use the command ls./ . Since we used –F the directories are shown with / and since we used –a we can see the hidden files. Compatible modifiers can be combined: $ ls -lFa total 168 drwx-----..5. 2./ .17 dmc drwxr-x--7 dmc -rw-r----1 dmc -rw-r--r-1 dmc -rw-r--r-1 dmc staff staff staff staff staff staff staff staff 748 2414 6148 578 238 1014 5743 163 Oct Oct Sep Apr Nov Sep Sep Oct 23 23 11 11 5 14 11 23 19:16 19:17 15:42 2002 2003 1999 15:08 19:14 . Creating a new directory: mkdir Pairwise comparison with EMBOSS ‐ 25 .Biochem 711 – 2008 To know in which directory into which we are currently looking we can use the command pwd that will echo the present working directory. a group to which the user belongs to and the rest of the world.) The tilde symbol (~) replaces all that would be required as a full path from root to the home directory. It can then be used as well for going down the directory path.22 dmc drwxr--r--+ 71 dmc -rw-------@ 1 dmc drwxr-x--.seq calm_drome.DS_Store EVOL/ FOLD/ ant. (On Windows the command is DIR) The ls command can be modified with –l (letter L) for a long list.6. the date of last change and the file name. the command DIR /b (forward slash!) shows the directory content as one column. the commands cd cd ~ ~/Desktop would return to the home directory and to the desktop respectively from ANY other location. then the file size. –1 (number one) for a one-column list./ the present and parent directories.. The owner of the file (dmc) and group (staff) are shown. –F to show file type (files marked with / are directories) and –a to show hidden files.fasta The first column shows if the file is a directory (d) followed by 3 sets of file permission levels (read write execute for user/group/other): the user. Therefore it is easiest to use TextEdit (make sure to change the format to plain format with the menu cascade Format > Make Plain Text).Biochem 711 – 2008 26 Since the EMBOSS software is on the local computer it may be easier to create new directories with the mouse menu File > New Folder. 2. more.9. Simple text editing: pico. Some commands will scroll the complete file all at once (cat).7. head. However it is possible to edit a small text file within the terminal with the full-screen text program pico.8. Text file content: cat. Navigation is simple with the up/down/ right/left arrows of the keyboard. and the next page when hitting the space bar. Note: recently pico has be replaced by nano: “ANOther editor. The command cd can then be used to go down the directory path into the new directory.txt Note: in Windows the command would be (forward slash b as /b and \b have different meanings. Note: in Windows the command is type The commands head and tail display the first 10 lines at the top or the last 10 lines at the bottom of a file. Commands are summarized at bottom of the screen. paste that line: control-u. It is possible to redirect the standard text output to a file by adding > and a file name after a command that would create a text output such as cat or ls. However. nano All exercises are done with files that are local. tail The content of a text file (binary files are special cases) can easily be appraised by having the content of the file scrolled onto the terminal. Type control-X to exit and write the file. while others will pause (more) with the next line shown when hitting the return key. Redirect of standard text output: > The standard input is the keyboard and the standard output is the terminal screen. the command mkdir will create a new directory within the current directory. an enhanced free Pico clone. we can obtain a one-column list of file names within the current directory with the dash-one –1 option of ls and redirecting the standard text output into a file: ls -1 > mylist. 2. Cut one line: control-k. The number of desired lines to view can be specified.txt Pairwise comparison with EMBOSS ‐ 26 .” 2. For example.) DIR /b > mylist. Biochem 711 – 2008 2. Example: man pico 2.txt See next line: press <return> See next page: press <space bar> Return to prompt (quit): q Example: more myfile.11.10.txt > top10. head tail pico man > Shows absolute path Example: mkdir Test Modifiers can be added: long list (letter L): –l 1 column (# one): –1 mark file types : –F show hidden files: –a example: ls –laF cat myfile. Simple text editor displays doc with more Redirects standard screen text output into a text file. you will appear as a “Unix Guru” to most people! And indeed you will be able to interact with ease with any Unix/Linux system! Learn the Windows notes embedded above for an even stronger effect. Displays top 10 lines of file by default or specify # of lines Same as head for end of file... 27 ✔ READ Symbol $% > / .txt head myfile. help and manual pages: man The command man displays the documentation of commands within a more screen display. If you learn this table. Documentation./Desktop/LabFiles cat more Types complete file to screen Types file one screen-page at a time.txt Cut one line: Control-K Save and exit: Control-X man cat Examples: ls > mylist.txt head myfile.txt Pairwise comparison with EMBOSS ‐ 27 .txt tail -5 myfile. % of file viewed displayed at bottom left. Summary tables Here are summarized the commands and symbols reviewed here. .txt head -2 myfile. Name Prompt Root Current directory Parent directory “Home” directory Name Change directory Present working directory Create a new directory List files Can specify another directory Function / examples Shows ready for input See cd and pwd below ~ Command cd pwd mkdir ls Function / examples cd Desktop cd . Documentation and Help: tfm.Biochem 711 – 2008 28 L09 Exercise B: Help and relevant EMBOSS applications: wossname. 1. tfm. A list organized by logical group is provided in the EMBOSS introduction. -option The command tfm (the fine manual) contains all the details about a specified application. The dotplot is an intuitive graphical representation of the regions of similarity Pairwise comparison with EMBOSS ‐ 28 . Find relevant programs: wossname In a following exercise we will use dotplotting as a means to compare 2 sequences. tfm. wossname will let us know what applications could be used for the exercise: ✔ TASK $ wossname dotplot Finds programs by keywords in their short description SEARCH FOR 'DOTPLOT' dotmatcher Draw a threshold dotplot of two sequences dotpath Draw a non-overlapping wordmatch dotplot of two sequences dottup Displays a wordmatch dotplot of two sequences polydot Draw dotplots for all-against-all comparison of a sequence set We now have a list of relevant EMBOSS applications that we can use. or even a sequence against itself. dotmatcher EMBOSS contains a very large number of applications (programs). -option EMBOSS programs used in this exercise: wossname. More succinct information can be obtained as well by the following methods: ✔ TASK Type the bold commands after the % or $ prompt and observe the output: $ tfm dotmatcher dotmatcher Function Draw a threshold dotplot of two sequences Description dotmatcher generates a dotplot from two input sequences. -help. Online it is possible to identify relevant applications to what we want to do with the wossname application. 2. Sequence files are plain text files containing only printable characters from the keyboard. or reference (input USA) [. -help. mkdir. more. Most sequence formats include at least one form of ID name.net/docs/themes/SequenceFormats. The “format” part pertains to the conventions of arrangement of the text within the file.an ID name and an Accession number. each with their own story and history to review here. or press q to quit.Biochem 711 – 2008 tfm uses more to display text: press the space bar to see the next page. annotations and features. pwd ✔ READ There are too many file types.. L09 Exercise C: Sequence format and changing format: seqret EMBOSS programs used in this exercise: seqret EMBOSS qualifiers used: -option.html On that web page they rightfully state: “Before reading the rest of this document.sourceforge. as well as the order and organization of specific characters that serve as flags to tell what parts of the file contain the actual sequence data. RTF. Pairwise comparison with EMBOSS ‐ 29 . italics or underlined it is NOT a plain text file! Formats were designed to hold the sequence data and other information about the sequence. or reference (input USA) [-bsequence] sequence Sequence filename and optional format.] 29 The qualifier –option (or –opt) will be used within a following exercise. If anything is in bold. A good summary is presented at http://emboss. Most sequence databases have two identifiers for each sequence . PostScript® are NOT sequence file formats either. usually placed somewhere at the top of the sequence format.” Programspecific file types such as PDF. head. headers. please note: Microsoft WORD format is not a sequence format. HTML. $ dotmatcher –help Standard (Mandatory) qualifiers (* if not always prompted): [-asequence] sequence Sequence filename and optional format.. -osf UNIX commands used: cat. cd. Note that all the information is within one line after the > symbol.gov/sites/ Click on Protein: sequence database Within the search box type: calm_drome 1 http://emboss.ncbi. Later w will use the EMBOSS seqret program for this purpose. The ID name is CALM_DROME.sourceforge. This naming scheme started to be a problem when the number of entries added each day was so vast that people could not make up the ID names fast enough.nlm. it would be a very long word for many programs and should be rewritten. If two sequences are merged into one. 1 1. Fasta format The simplest file format is the fasta file format.Biochem 711 – 2008 30 The ID name was originally intended to be a human-readable name that had some indication of the function of its sequence.nih. Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the rest of the life of the database. Note that > means something for Unix and something else for the fasta format. For example the fasta format version of a file of a calmodulin protein with ID name calm_drome on Entrez is: >gi|49037468|sp|P62152.net/docs/themes/SequenceFormats. The first character is the greater-than sign (>) followed by a name with no blank space either before or within the name. Names are not guaranteed to remain the same between different versions of a database (although in practice they usually do). used by default for output by EMBOSS. Since the name of the file is that of the first word without space touching the > sign.html Pairwise comparison with EMBOSS ‐ 30 . Short=CaM MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFL TMMARKMKDTDSEEEIREAFRVFDKDGNGFISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYE EFVTMMTSK The accession number is P62152 version 2. After the name can be some comments but only on that same 1st line. ✔ TASK Open a browser to point to NCBI: http://www. then the new sequence will get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers.2|CALM_DROME RecName: Full=Calmodulin. the mouse menu Edit > Paste is NOT available...Biochem 711 – 2008 31 When the entry is shown switch from Summary to FASTA With the mouse select (highlight) and Edit > Copy the text of the file: title with > and sequence.txt with help of cat and redirect (>): cat > testfile. ✔ TASK Switch to the Terminal or X11 xterm. On Terminal. The method to paste is to click the middle mouse button. ✔ CAUTION: on X11 Xterm.txt <return> Paste the contents of the clipboard immediately after that. We will create a test directory within the LabFiles directory (review line commands in previous section if necessary): Type the bold commands after the % or $ prompt on the terminal: cd ~/Desktop/LabFiles mkdir TEST cd TEST pwd <return> <return> <return> <return> Then we will now create a new text file from the clipboard contents called testfile. We will paste the content of the clipboard shortly. use the mouse menu or the paste shortcut ⌘v At this point your screen should look like this: Pairwise comparison with EMBOSS ‐ 31 . or reference (input USA) [-outseq] seqoutall [<sequence>. and we can also verify its contents with cat.. The ls command will now list the file within our directory. q or the space bar to return to the prompt.Biochem 711 – 2008 cat > testfile.fasta]: <return> $ $ head -1 calm_drome.] There are many more options explained within the tfm manual (tfm seqret) In our case we already have a fasta-formated file.fasta >CALM_DROME P62152. <return> <control> D The file is now written on the local hard drive and contains the pasted text. For example: Ls more testfile. Seqret reads and writes (reformats) sequences By default the EMBOSS application seqret reformats sequence files to the fasta format.. Short=CaM MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFL TMMARKMKDTDSEEEIREAFRVFDKDGNGFISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYE EFVTMMTSK Now do the following: Press return and then press together control and D to close the file.txt <rtn> <rtn> Press either return.2|CALM_DROME RecName: Full=Calmodulin.txt Reads and writes (returns) sequences output sequence(s) [calm_drome. Short=CaM Pairwise comparison with EMBOSS ‐ 32 . $ seqret testfile.2 RecName: Full=Calmodulin. but the name within is very long and seqret can rewrite the file to update the name in a more useful format. ✔ TASK $ Type the bold commands after the % or $ prompt and press return: seqret –help Standard (Mandatory) qualifiers: [-sequence] seqall (Gapped) sequence(s) filename and optional format.txt 32 >gi|49037468|sp|P62152. 2.<format>] Sequence set(s) filename and optional format (output USA) [. more or head. Only the name after > has changed. swiss.gcg -osf gcg $ cat test.x heading up to first ". em fasta.sourceforge.gcg Pairwise comparison with EMBOSS ‐ 33 . Sequence data after first double dot “. (NOT for output of protein sequences) IntelliGenetics format. GENBANK entry format.net/docs/themes/SequenceFormats. including the feature table.Biochem 711 – 2008 The command head -1 shows only the first line. Note: the file is still a fasta-formated file. Mega format GCG's MSF multiple sequence format... EMBL entry format FASTA format with optional accession number. ClustalW ALN (multiple alignment) format.x and 10. Complete list and description available online at http://emboss. Changing the format: format codes It is possible to specify the format for the output file if we know the format code: (reduced list. DNA Strider format SWISSPROT entry format. GCG 9.x format with !NA and !AA sequence type identified on the first line. pir nexus. By simply pressing return we accepted the default of fasta format and the suggested file name.” GCG 8. paup pearson phylip strider swissprot. ddbj ig mega msf nbrf. ✔ TASK Type the bold commands after the % or $ prompt and press return: $ seqret testfile. NBRF (PIR) format. Nexus/PAUP format FASTA with no further processing of the "ID" eg: >name description PHYLIP interleaved multiple alignment format. as used in the PIR database sequence files. or at least a minimal subset of the fields. aln embl. sw Format short description ABI trace file format. Specifying the output format is done either with either the qualifier –osf (output sequence format) or with the double colon nomenclature of the requested format followed by the desired file name: formatcode::filename In addition it is possible to specify the input and output file names on the same line as seqret rather than pressing return." Remainder is sequence data. gb. 2. which we can observe has been rewritten with the sequence ID as its name.html 33 Format codes (use one) abi clustal.txt test. ncbi gcg gcg8 genbank.1. boolean: a switch. Short=CaM CALM_DROME Length: 149 Type: P Check: 5504 .Biochem 711 – 2008 !!AA_SEQUENCE 1.0 RecName: Full=Calmodulin.txt gcg::test. string: text) Input sequence command-line qualifiers that change the behaviour of the sequence input. 34 1 MADQLTEEQI AEFKEAFSLF DKDGDGTITT KELGTVMRSL GQNPTEAELQ 51 DMINEVDADG NGTIDFPEFL TMMARKMKDT DSEEEIREAF RVFDKDGNGF 101 ISAAELRHVM TNLGEKLTDE EVDEMIREAD IDGDGQVNYE EFVTMMTSK The command can also be written with the exactly equivalent alternative: $ seqret testfile..gcg In that case we specify the format code and the desired file name output between 2 colons. Qualifier -sbegin -send -sreverse -sask -snucleotide -sprotein -slower -supper -sformat -sopenfile -sdbname -sid -ufo -fformat -fopenfile Type integer integer boolean boolean boolean boolean boolean boolean string string string string string string string description first base used last base used. Qualifier -osformat -osextension -osname -osdirectory -osdbname -ossingle -oufo -offormat -ofname -ofdirectory Type string string string boolean string boolean string string string features string description output sequence file format file name extension base file name output sequence file directory database name to add create a separate output file for each entry feature file to create features format file name features output directory Pairwise comparison with EMBOSS ‐ 34 . The complete line-qualifiers are shown in the following tables: (integer = a numeric value. default=seq length reverse (if DNA) ask for begin/end/reverse sequence is nucleotide sequence is protein make lower case make upper case input sequence format input filename database name entryname UFO features features format features file name Output sequence command-line qualifiers that change the behaviour of the sequence output. g.txt One easy way to create a list is to use the ls command with the dash-one (ls -1) and redirect the output into a file.ig @another_list. List files: @ symbol ✔ INFO When the number of files becomes large. seqret will return a multiple fasta-formated sequence file by default if it is supplied with multiple files as input either as a list or as a wild card command: seqret *. Example: ls -1 > mylist.Biochem 711 – 2008 35 3. Here is an example of a valid list: File1. msf and aln formats.txt To tell the EMBOSS application that we are supplying a list rather than an actual sequence file.gcg File2. Pairwise comparison with EMBOSS ‐ 35 . msf or aln in the same manner as it was done for a single file: either with the –osf option or the double colon :: method.fasta The output format can be altered by specifying the output format code e. Some minor editing may be needed to remove names that do not belong to the list. Other formats mesh the files together which become interlaced. 4.fasta File3. it may be easiest to enter the sequence file names into a list and supply the list to the EMBOSS application. with each file name on one line. A list file contains a single column. Multiple sequence files can be split back into single files with the EMBOSS application seqretsplit . seqretsplit ✔ INFO Multiple sequences can fit together one after the other in any order into a single fasta-formated file. Multiple sequence formats: seqret. the list file name is preceded by the @ symbol. Lists can be embedded within another list if preceded by @. as is the case for some alignment formats. The multiple file formats that are useful to us are the fasta. Using a longer word (tuple) size displays less random noise. Identity and similarity is defined by the chosen comparison table (substitution matrix. but is less sensitive 1. Dot-plotting is the best method for comparing two sequences visually when it is suspected that there could be more than one segment of similarity between them. for example to create a png file.) dotmatcher compares two protein or nucleic acid sequences at all positions between the first sequence and all positions of the second sequence and displays the points of similarity between them shown as a graphical 2-dimentional dotplot. –windowsize. The “word” method is faster but not as sensitive and requires that the sequences actually contain short perfect matches for any similarity to be found. The –graph option allows to change the graphical format output. dottup looks for places where “words” (tuples) of a specified length have an exact match in both sequences and draws a diagonal line over the position of these words. cat ✔ READ In this exercise we will explore two EMBOSS programs for pair-wise sequence comparison dotmatcher and dottup. Working directory ✔ TASK pwd Make sure you are in the LabFiles directory with: If you just completed the previous exercise you need to go up one level with: cd . Dotmatcher The dotplot created by dotmatcher is a graphical output. Pairwise comparison with EMBOSS ‐ 36 .Biochem 711 – 2008 36 L09 Exercise D: Pairwise comparisons with dotplots EMBOSS programs used: dottup. –wordsize. dotmatcher. ls. –fetch. embossdata EMBOSS qualifiers used: –option. -file. If you are unsure : cd ~/Desktop/LabFiles 2. See “EMBOSS Graphical Output” tables within the introduction section for more details. Since we are using the line-command on an X11 system the default graphical output is the X11 interactive display. pwd. –sask. runs extremely quickly.. –threshold UNIX commands used in this exercise: cd. The process of displaying the graphical output needs to be terminated to return to the line-command prompt ($ or %): Do either of the following: a) Close the graphical window (click the red “x” button at top left: b) on the keyboard press the “control” and “C “ key (together) Note: On line-command Windows the default graphical output is called “win3” ) and the graphical window is closed by clicking the red “x” square on the top right: Pairwise comparison with EMBOSS ‐ 37 . If the –option qualifier is omitted you would only be prompted for three things: input sequence. second sequence and graph type.Biochem 711 – 2008 2.pep” with itself. Defaults run 37 ✔ TASK Use domatcher to compare the protein sequence in “dcalm.pep <rtn> Matrix file [EBLOSUM62]: <rtn> Window size over which to test threshhold [10]: <rtn> Threshold [23]: <rtn> Graph type [x11]: <rtn> (to display this graphic) While the interactive X11 graphic window is being displayed it is not possible to type any more commands and the prompt is not visible. This time use all of the default parameters. % dotmatcher -option Draw a threshold dotplot of two sequences Input sequence: dcalm. Typing <rtn> would only create useless blank lines.1.pep <rtn> Second sequence: dcalm. We will be able to see what the command line choices are with the –option qualifier. 3. <control C> together to return the prompt Note: in the output of this example there is at least one long region of similarity in addition to the diagonal that bisects the figure. The long bisecting diagonal represents the identity that is found when a sequence is compared to itself. Pairwise comparison with EMBOSS ‐ 38 . $ dotmatcher dcalm. This time put all the commands on the first line. Window size 38 ✔ TASK Repeat the dotmatcher command adding the name of the files to the command line (short cut) along with –option and when prompted change the windowsize to 20 and threshold to 44. Threshold ✔ TASK Rerun dotmatcher using a different window and threshold. $ dotmatcher dcalm.pep -windowsize 10 -threshold 44 Draw a threshold dotplot of two sequences Graph type [x11]: <rtn> notice the change in the size of the diagonals when changing only the threshold (stringency).pep dcalm.2.pep -option Draw a threshold dotplot of two sequences Matrix file [EBLOSUM62]: <rtn> Window size over which to test threshhold [10]: 20<rtn> Threshold [23]: 44<rtn> Graph type [x11]: <rtn> <control C> together to return the prompt 2.Biochem 711 – 2008 2.pep dcalm. dottup dottup displays a wordmatch dotplot of two sequences. You should now find a few dots. $ dottup dcalm. Word size Run dottup with the –word qualifier to identify perfect matches (in this case. Type tfm dottup at the prompt for all the details. The default word size is 10. Repeat exercise 4 as above. $ dottup dcalm.pep -wordsize 8 Displays a wordmatch dotplot of two sequences Graph type [x11]: <rtn> (to display the graphic) <control C> together to return the prompt Ah HA! Gotcha! You probably didn’t find any dots on this plot.pep -wordsize 4 <control C> together to return the prompt Pairwise comparison with EMBOSS ‐ 39 . 3. did you? This means that there are no repeats of 8 identical (or very similar) amino acids in a row in the peptide sequence “dcalm. but this time specify –wordsize 4. for example: dotmatcher –help 3. repeat sequences) that are at least 8 residues long (wordsize of 8).pep dcalm.1.Biochem 711 – 2008 39 Reminder: If you want to know more about any of the programs in EMBOSS add the –help qualifer after the program name.pep”.pep dcalm. pep. dottup or dottmatcher just add the all the match Pairwise comparison with EMBOSS ‐ 40 . but are always positive (AA=4. where we were compared the protein to itself with different windows and threshols (stringency). For protein comparison dottup uses a table called “Eblosum62. $ embossdata -fetch $ ls $ cat EBLOSUM62 # # # # # # A R N D C Q E G H I L K M F P S T W Y V B Z X * -file EBLOSUM62 Matrix made by matblas from blosum62. according to their (average) known frequencies in proteins.dat Cluster Percentage: >= 62 Entropy = 0. and the observed likelihood of this particular pair.0/blocks.Biochem 711 – 2008 40 4. Some similarity matches are also positive (EQ=2.iij * column uses minimum score BLOSUM Clustered Scoring Matrix in 1/2 Bit Units Blocks Database = /data/blocks_5. GG=6). FY=3. CC=9.5209 A R N D C Q E G H I L K M F P S T 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4 * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1 Note how the values for identical aa matches this table have scores that vary. we found several long diagonals that suggested possible internal repeats within dcalm. Comparison tables: BLOSUM62 ✔ INFO Special Note: but how can this be? In the first 3 exercises with dotmatcher. When using this table. substituting for each other during natural evolution. the last 2 exercises with dottup show that there are very few regions with short perfect matches! The key here is to look at the scoring table that tells dottup whether there IS a match or a similarity between residues in the window. and “yes” that match should (or should not) be counted towards the stringency score. We can retrieve the comparison table used by the 2 dot plot programs and look at its content: ✔ TASK Type the bold commands after the % or $ prompt and press return.” This table assigns a relative significance score to every possible pairing of amino acids (aa). LI=2). Yet. or at least regions of similarity.6979. Expected = -0. 0. This means.seq –option <rtn> raw a threshold dotplot of two sequences Matrix file [EDNAFULL]: Window size over which to test threshhold [10]: Threshold [23]: Graph type [x11]: (use all other defaults) Repeat with altered threshold: ✔ TASK $ dotmatcher h1prom.seq h4prom. Some applications can use an identity matrix that assigns all identical aa matches a value of 1. and all mismatches a value of 0. There are MANY other tables you can use. Note: the default scoring table for DNA sequences (“EDNAFULL”) scores all identities and ambiguities with values of 5. a single CC match (9). We will compare the H1 and H4 histone promoter sequences to see if they share any regions of sequence similarity. Nucleotide sequence comparison Use dotmatcher to find regions of similarity between two different nucleotide sequences. first use defaults. then repeat changing the threshold levels. but all mismatches = -4. supplemented by an EQ match (2). if the rest of the aa’s averaged zeros).e. to meet or exceed the specified stringency (threshold). would meet that criteria (i. that even a very few exact matches within the window may give a high enough score to register as a dot. 5. when using this default table.seq h4prom. This type of comparison (between 2 different sequences) is perhaps the most common use of dotplots. and give a dot on the dotplot. dottup doesn’t have to find any exact matches between the sequences. with –windowsize=15 –threshold=11 settings.Biochem 711 – 2008 41 values within the requested window size and evaluates whether that sum meets or exceeds the stringency. Also note. if it can find enough similarities with high enough scores. ✔ TASK $ dotmatcher h1prom. containing different values for aa comparisons. For example.seq –option <rtn> Draw a threshold dotplot of two sequences Matrix file [EDNAFULL]: <rtn> Pairwise comparison with EMBOSS ‐ 41 . Analyze the first 300 bases of the dau. ✔ TASK $ dotmatcher dau.seq dau. Within the previous seqret and file format exercises the line-command qualifier -sask is shown within the tables of qualifiers and mean “ask for begin/ end/ reverse.seq -sask1 -sask2 -option Draw a threshold dotplot of two sequences Begin at position [start]: 1 End at position [end]: 300 Reverse strand [N]: <rtn> Begin at position [start]: 1 End at position [end]: 300 Reverse strand [N]: Y Matrix file [EDNAFULL]: <rtn> Window size over which to test threshhold [10]: 50 Threshold [23]: 50 Graph type [x11]: <rtn> (to display the graphic) Note how the top and bottom halves of this plot are symmetrical.” Here we will use this feature with -sask1 and -sask2 to specify that we want to answer optional questions about sequence input files 1 and 2. Note: for clarity return <rtn> is only shown for lines that keep default values and is implied for other lines. Pairwise comparison with EMBOSS ‐ 42 . This method can be a very valuable tool for identifying inverted repeats.seq sequence against its reverse-complement (“Reverse strand: Y”). Inverted repeats Use dotmatcher to find regions of inverted repeats within a single nucleotide sequence. or when used in conjunction with RNA structural prediction programs.Biochem 711 – 2008 Window size over which to test threshhold [10]: 12 <rtn> Threshold [23]: 30 <rtn> Graph type [x11]: <rtn> (to display the graphic) 42 6. –gapopen. After the path matrix is complete. –alt UNIX commands used in this exercise: more ✔ INFO needle2 and water3 are pair-wise alignment programs based on published algorithms that find the optimal mathematical “fit” between two sequences through the judicious insertion of gaps (spacers designated with “. The quality score for the best alignment to any point is equal to the sum of the scoring matrix values of the matches in that alignment. Both programs read a scoring matrix (comparison table) that contains values for every possible symbol match. it is the best end-to-end alignment for the two sequences. doi:10.1016/0022-2836(70)90057-4. needle. As in the previous exercise we will use the “-saskn” command to have the program prompt for start and end positions. The gap open penalty and gap extension penalties are set by the user. "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Note that either program will find an alignment for any pair of sequences you compare.pair”.Biochem 711 – 2008 43 L09 Exercise E: Pairwise comparisons with optimal alignments EMBOSS commands used: water.pep that the dottup and dotplot programs show to be similar. Call the output file “dcalm. and matcher EMBOSS qualifiers used: –option. Local alignment: water Use water to align two regions of dcalm. With “n” being the sequence number 2 Needleman SB. "Identification of Common Molecular Subsequences". with a score at every position for the best possible alignment to that point. For needle. The second is between amino acids 80 and 100. use needle. Journal of Molecular Biology 48 (3): 443-53. When you are trying to find only the best segment of similarity between two sequences (local).” symbols to show where one sequence might have an insertion or a deletion relative to the other). PMID 5420325 3 Smith TF. the highest value on the surface (water) or at the edge of the comparison (needle) represents the end of the best region of similarity between the sequences. less the gap creation penalty times the number of gaps in that alignment. –sask. Journal of Molecular Biology 147: 195–197. Waterman MS (1981). When you want an alignment that covers the whole length of both sequences (global). These values are used to construct a path matrix that represents the entire surface of comparison. doi:10. For water.1016/0022-2836(81)90087-5 Pairwise comparison with EMBOSS ‐ 43 . –gapextend. use water. less the gap extension penalty times the total length of all gaps in that alignment. The best path from this highest value backwards to the point where the values revert to zero (water) or back to the origin of the matrix (needle) is the alignment shown by in the output file. (1970). –fetch. The first region is between amino acids 10 and 30. Wunsch CD. this alignment is the best segment of similarity between the two sequences. 1. even if there is no significant similarity between them! YOU must evaluate the results critically to decide if the segment shown is not just a random region of relative similarity. Pairwise comparison with EMBOSS ‐ 44 .0%) # Score: 60.water]: dcalm. in this case the default: Eblosum62.5 # # Length: 17 # Identity: 11/17 (64.7%) # Similarity: 14/17 (82. ✔ TASK $ water dcalm.:|||. the % similarity and the % identity that were calculated for this alignment.pair The output first restates the commands given and then provides the result: #======================================= # # Aligned_sequences: 2 # 1: CALM_DROME # 2: CALM_DROME # Matrix: EBLOSUM62 # Gap_penalty: 10.5]: <rtn> Output alignment [calm_drome.pair $ more dcalm.0]: <rtn> Gap extension penalty [0.4%) # Gaps: 0/17 ( 0.0 # # #======================================= CALM_DROME CALM_DROME 11 EFKEAFSLFDKDGDGTI |.pep dcalm.Biochem 711 – 2008 44 in the order listed in the line command.| 84 EIREAFRVFDKDGNGFI 27 100 #--------------------------------------#--------------------------------------- Note: both water and needle output will summarize the input parameters chosen for the alignment and show the quality score (score of the optimal matrix path). according to the scoring table that was selected.pep -sask1 -sask2 –option Smith-Waterman local alignment of sequences Begin at position [start]: 10 End at position [end]: 30 Begin at position [start]: 80 End at position [end]: 100 Matrix file [EBLOSUM62]: <rtn> Gap opening penalty [10.:|||||:|.0 # Extend_penalty: 0. For clarity return <rtn> is only shown for lines that keep default values and is implied for other lines. t.. e.|. this time) Gap extension penalty [0.0]: <rtn> (accept all defaults.|:|:. t. # 2: t.pair (for the output file name) $ more emcgap1..|||:..pep Needleman-Wunsch global alignment of two sequences Gap opening penalty [10.|. Pairwise comparison with EMBOSS ‐ 45 .||.Biochem 711 – 2008 45 2.0 # # #======================================= e.::|.0 # Extend_penalty: 0..2%) # Similarity: 170/258 (65.: |.1.|. this time) Output alignment [e.. e. 2..||.|. Defaults run ✔ TASK $ needle emc.] etc.pep”)..| 1 -------MACKHGYP-DVCPICTAVDATPGFEYLLMADGEWYPTDLLCVD 49 GEDDVF---------------DPELDMEVVFELQGNSTSSDKNNSSSEGN . :.:|| . Global alignment: needle Use needle to analyze the viral protein leader sequences from encephalomyocarditis virus (“emc.|:|. 1 MATTMEQETCAHSLTFEECPKCSALQYRNGF-YLLKYDEEWYPEELL-TD .5 # # Length: 258 # Identity: 145/258 (56..0%) # Score: 721. Notice how the optimal path changes depending upon how “easy” it is for the program to insert gaps.pair”. # Matrix: EBLOSUM62 # Gap_penalty: 10.|.||.9%) # Gaps: 36/258 (14.|| |||..pair #======================================= # # Aligned_sequences: 2 # 1: e.pep”) and from Theiler’s virus (“tme.:|||| |.| 93 EGVIINNFYSNQYQNSIDLSASGGNAGDAPQTNGQLSNILGGAANAFATM 48 42 83 92 132 142 2.||||:||||:||... Call the output file “emcgap1.. [..||||..|| 43 LDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSGN 84 EGVIINNFYSNQYQNSIDLSANAAGS-DPPRTYGQFSNLFSGAVNAFSNM |||||||||||||||||||||:.5]: <rtn> (accept all defaults.pep tme.2.||:.needle]: emcgap1.|. Change the gaps Now align the same sequences. but with lower gap creation penalties and gap extension penalties. t. %Identity: 56.1%. 1 MA-------CKHG--YPDVCPICTAVDATPGFEYLLMADGEWYPTDLLCV 48 DGEDDVF---D----PE-LD-----M--EVVFELQGNSTSSDKNNSSSEG |.pep tme. |... t.9.|. Length: 261. 1 MATTMEQETCAHSLTFEE-CPKCSALQYRNGF-YLLKYDEEWYPEELL-T |.||||. Realign the viral leader sequences using this table. e.2) and come at the expense of adding more gaps. e. Comparison tables: PAM250 Use the –fetch qualifier to copy an alternative symbol comparison table to your local directory.|:|:.pair (output file name) $ more emcgap3. How does using this alternative table change the gap analysis? ✔ TASK $ embossdata -fetch -file EPAM250 (note: case sensitive) $ needle emc.|.|.: :| : ::|. t. Use the same gap weight and length weight penalties as in exercise 3.|.. etc 47 41 82 91 Pairwise comparison with EMBOSS ‐ 46 .|. Length: 258.needle]: emcgap3.8.5) differ from the previous values (Score: 721.||||:||||:||. t.| 42 DLDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSG .|. emcgap2. the original PAM250 matrix.|: :. etc 47 41 82 91 Note also how the increased scores of this alignment (Score: 765.:|||| ::: :| : ::|... :.:|||| | .pair (accept all other defaults) (for the output file name) $ more e. %Similarity: 67.:|| .|| |||. t.. 1 ----MA---CKHG--YPDVCPICTAVDATPGFEYLLMADGEWYPTDLLCV 48 DGEDDVF-------DPE-LD-----M--EVVFELQGNSTSSDKNNSSSEG |.|.: ||. %Similarity: 65.Biochem 711 – 2008 46 ✔ TASK $ needle emc.|| |||.|:|::. Gaps: 16.:|| . %Identity: 57. Gaps: 14.| 42 DLDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSG . 3.pair 1 MATTMEQETCAHSLTFEE-CPKCSALQYRNGF-YLLKYDEEWYPEELL-T || |.pair #======================================= e..: ||.||||:||||:||.pep tme.pep –gapopen=3 –gapextend=1 <rtn> emcgap2.pep –gapopen=3 –gapextend=1 –datafile=Epam250 Needleman-Wunsch global alignment of two sequences Output alignment [e.||||. But it is always wise to rerun alignment programs with a variety of different input values.needle (no header) h1prom.||| 36 --------------------------------CCGGCGG--GACTTCCCG 293 ----CTCCTTTGGAGCTTCAAAGT-------GCCAAATTCTGTACCATTG ||.seq h1prom.4%) # Score: 103. can we recognize a good alignment when we see one? What values or tables should we use? These programs and others will typically offer default values that have been chosen for their general ability to give “relevant” results.water h1prom.seq h4prom.|.needle Global: h1prom.seq h1prom.water ######################################## #======================================= # # Aligned_sequences: 2 # 1: h1prom.seq (3 times) h1prom. How do the alignments from these two programs differ? ✔ TASK $ water $ needle h1prom.seq <rtn> h4prom..Biochem 711 – 2008 47 Note: the last 3 exercises should emphasize that the specific output for any optimal alignment is very sensitive to the values in the scoring table.||| ||||| ||| |. the shuffleseq program can help you evaluate the significance of your alignment.||| |||||.seq h4prom.seq # [-bsequence] h4prom.seq h4prom.seq h4prom.||.seq <rtn> h4prom.||...water ######################################## # Program: water # Rundate: Fri 24 Oct 2008 17:40:42 # Commandline: water # [-asequence] h1prom.| |||| ..seq 315 53 h1prom. Global and local alignment comparison Compare the two histone promoter sequences using the water and needle programs.seq h4prom.seq # Matrix: EDNAFULL # Gap_penalty: 10.||.||| 11 TTTGGCCCTTTA----GATTTCCCCTCCACCGGCGGGACTTC---CCGCC Pairwise comparison with EMBOSS ‐ 47 .|||.seq h1prom.seq # 2: h4prom.| |.0 # Extend_penalty: 0. and prefer a mathematical answer. using a simple statistical method by randomizing (shuffling) the sequence.seq h4prom. Simply create one or many randomized sequences with shuffleseq and compare them with the biological sequence keeping track of the resulting scores.| |.seq h4prom..seq h1prom.||| ||||.seq h1prom. For many proteins or nucleic acid comparisons.seq 271 TGTGAACCTGGAGGCTGTTTT--CCTCCTTTGGAG---CTTCAAAGTGCC |.|| |. and then LOOK carefully at the different results.seq # Align_format: srspair # Report_file: h1prom. How then.|||| ..5 # # #======================================= h1prom.seq h4prom.| ||||. the regions of “good” similarity will tend to be part of the optimal path over a wide range of penalties or table scores. 4.0%) # Gaps: 80/180 (44..||| 93 TATAAAG------------------------------------------50 0 100 0 150 0 200 8 245 35 292 51 331 92 381 99 Local: h1prom.|| 9 -------------GGTTTGGCCCTTTAGA-----TTTCCCC---TCCA-246 AGATTCTTGAAAACACAAACAAGTATGTGAACCTGGAGGCTGTTTTC--|.| 52 CCGACTTCTTT-CAGGTTCTCAGTTCGGTCCGCCAA---CTG-----TCG 332 TTTTAAGCATTTAATCAAATTTTGAGGACTAACAAACACAATTTGGGAGT |.seq h1prom.seq (9 times) Accepting all defaults includes accepting the default names for the output files that can then be viewed with: $ more $ more h1prom.. the gap weight (-gapopen) and the gap length (-gapextend) weight. Mostly however.0%) # Similarity: 72/180 (40. we still need to use good biological judgment and evaluate each output for whether or not it makes any sense! If you don’t trust your judgment.|| ||| 1 -----------------------------------CCTAT---TTC---201 GAAAAGCTGAAGGGATTT-----TTTAAAATATCTTTCATCAATTGCACA |.seq h4prom.5 # # Length: 180 # Identity: 72/180 (40.seq 1 GTCCTGTGCCTGTGTTACTTGCTACAGTTAGAAACAAACTTCATGCCCAA 0 -------------------------------------------------51 ACCAAGGAACCCAGTGTCTTTTCTCTTGCAAAAATCAAAGCATGAACTCA 0 -------------------------------------------------101 TGGGCAAATTTTTAAAAATAACTTTCACTGGATACTTAGTAGAAATTTAT 0 -------------------------------------------------151 CGCGACACGCTACTAACTAACATGATGCCCTCAGCCCAATGGATTCTTAT ||.||.|||. seq h4prom.||| |||.CAT 5 MisMatch = -9 Length Weight = 0 HighRoad: 1 GACCAT 6 ||| || 1 GAC.|| 71 -----CAGTTCGGTCCGCCAACTGTCGTATAAAGGCGCTGCCTCAGGTCA 403 GAGGGCGGTGGATTGGACGCTCCACCAATC |||| ||||.pair ✔ READ Note: just when you thought you had the variables in the optimal alignment programs figured out.||||| .|| ||.seq h1prom.|| 54 GACTTC------------------------------TTTCAGGTTCT--365 AAACACAATTTGGGAGTCCAAC--GCG------AGCGCGGC----GGCCA ||.pair h4prom..seq <rtn> h1prom4..seq h4prom.Biochem 711 – 2008 h1prom.seq h4prom. This sets the number of alternative matches output. the specific location of gap insertion is arbitrary in many cases.||. you now have to face the concept of “highroad” and “low road”! Actually. For: LowRoad: Match = 10 Gap weight = 10 1 GACCAT 6 || ||| 1 GA. By default only the highest scoring alignment is shown. Alternative alignments Repeat the alignment of the histone promoters using the matcher program and the –alt (alternative) command line qualifiers.seq h1prom.seq h1prom.seq –alt=4 (one time) (for output file name) $ matcher h1prom.|| .|| 100 ---------GCGCTGCCTCAGGTCAGAGGCCACAAAGCGTTG-------421 GCTCCACCAATCACAGGGCAGCGCCGGCTTATATAAGCCCGGGCCCGAGC 132 -------------------------------------------------471 ATAGCAGCAACGCAAAACCTGCTCTTTAGATTTCGAGCTTATTCTCTTCT 132 -------------------------------------------------521 AGCAGTTTCTTGCCACCATG 132 -------------------540 132 420 132 470 132 520 132 48 #--------------------------------------#--------------------------------------- 5.seq <rtn> h1prom10.||||.seq h4prom.||.seq h1prom.. How do the alignments differ? ✔ TASK $ matcher h1prom.seq h4prom.|| ||.seq h1prom.AT 5 Quality = 40 For: Match = 10 Gap weight = 30 MisMatch = 0 Length Weight = 0 Pairwise comparison with EMBOSS ‐ 48 . and equally optimal alignments can be generated by inserting the gaps differently.||. Here are two examples for the alignment of GACCAT with GACAT with these different parameters.seq h4prom.pair $ more h1prom10.seq #--------------------------------------#--------------------------------------h4prom.seq –alt=10 (one time) (for output file name) $ more h1prom4. matcher will insert the gaps differently if you select for the alternative parameter. A value of 2 gives you other reasonable alignments. When equally optimal alignments are possible.seq 316 AAATTCTGTACCATTGTTTTAAGCATTTAATCAAATTTTGAGG-ACTAAC .|||||| |||.| 116 GAGG-----------------CCACAAAGC 432 128 364 70 402 115 h1prom.seq 382 CCAACGCGAGCGCGGC----GGCCAGAGG-------GCGGTGGATTGGAC ||||.|.pair h4prom.||| . GACAT 5 Essentially the lowroad shifts all of the arbitrary gaps in sequence two to the left and all of the arbitrary gaps in sequence one to the right. Applications will try NOT to insert a gap whenever that is possible. 5 LowRoad: Quality = 30 1 GACCAT 6 ||| 1 . The highroad does exactly the opposite. may use the highroad alternative as a default.Biochem 711 – 2008 49 HighRoad: 1 GACCAT 6 ||| 1 GACAT. -e- Pairwise comparison with EMBOSS ‐ 49 . 3) Close all Macintosh windows. L09 Exercise F: End of laboratory 1) Tell the server you are done: type exit at the $ prompt 2) quit from X11: File > Quit. but when forced to choose. Biochem 711 – 2008 Class notes 50 Pairwise comparison with EMBOSS ‐ 50 .