Festival TTS Training Material

TTS Group
Indian Institute of Technology Madras
Chennai - 600036, India

June 5, 2012

Contents

1 Introduction
  1.1 Nature of scripts of Indian languages
  1.2 Convergence and divergence
2 What is Text to Speech Synthesis?
  2.1 Components of a text-to-speech system
  2.2 Normalization of non-standard words
  2.3 Grapheme-to-phoneme conversion
  2.4 Prosodic analysis
  2.5 Methods of speech generation
    2.5.1 Parametric synthesis
    2.5.2 Concatenative synthesis
  2.6 Primary components of the TTS framework
  2.7 Screen readers for the visually challenged
3 Overall Picture
4 Labeling Tool
  4.1 How to Install LabelingTool
  4.2 Troubleshooting of LabelingTool
5 Labeling Tool User Manual
  5.1 How To Use Labeling Tool
  5.2 How to do label correction using Labeling tool
  5.3 Viewing the labelled file
  5.4 Control file
  5.5 Performance results for 6 Indian Languages
  5.6 Limitations of the tool
6 Unit Selection Synthesis Using Festival
  6.1 Cluster unit selection
  6.2 Choosing the right unit type
  6.3 Collecting databases for unit selection
  6.4 Preliminaries
  6.5 Building utterance structures for unit selection
  6.6 Making cepstrum parameter files
  6.7 Building the clusters
7 Building Festival Voice
8 Customizing festival for Indian Languages
  8.1 Parameters that were customized to deal with Indian languages in the festival framework
  8.2 Modifications in source code
9 Trouble Shooting in festival
  9.1 Troubleshooting (Issues related with festival)
  9.2 Troubleshooting (Issues that might occur while synthesizing)
10 ORCA Screen Reader
11 NVDA Windows Screen Reader
  11.1 Compiling Festival in Windows
12 SAPI compatibility for festival voice
13 Sphere Converter Tool
  13.1 Extraction of details from header of the input file
    13.1.1 Calculate sample minimum and maximum
    13.1.2 RAW Files
    13.1.3 MULAW Files
    13.1.4 Output in encoded format
  13.2 Config file
14 Sphere Converter User Manual
  14.1 How to Install the Sphere converter tool
  14.2 How to use the tool
  14.3 Fields in Properties
  14.4 Screenshot
  14.5 Example of data in the Config file (default values for properties)
  14.6 Limitations of the tool

1 Introduction

This training is conducted for new members who joined the TTS consortium. The main aim of the TTS consortium is to develop text-to-speech (TTS) systems in all of the 22 official languages of India, in order to build screen readers, that is, spoken interfaces for information access which will help visually challenged people use a computer with ease and make computing ubiquitous and inclusive. Earlier efforts made by the consortium members, IIIT Hyderabad and IIT Madras, do indicate that natural sounding synthesisers for Indian languages can be built using the syllable as a basic unit.

1.1 Nature of scripts of Indian languages

The scripts in Indian languages have originated from the ancient Brahmi script. The basic units of the writing system are referred to as Aksharas. The properties of Aksharas are as follows:
1. An Akshara is an orthographic representation of a speech sound in an Indian language.
2. Aksharas are syllabic in nature.
3. The typical forms of an Akshara are V, CV, CCV and CCCV, thus having a generalized form of C*V, where C denotes a consonant and V denotes a vowel.
Further, some of the languages, such as Hindi, Marathi and Nepali, share a common script known as Devanagari, but languages such as Telugu, Kannada and Tamil have their own scripts.
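To make the C*V form concrete, the following is a small sketch in Festival's Scheme that groups a phone sequence into such C*V units. It is illustrative only, not taken from the consortium's code, and the phone and vowel names in the example are hypothetical.

(define (cv-groups phones vowels)
"Split PHONES into C*V groups; a group is closed by the first vowel."
(let ((groups nil) (current nil))
  (mapcar
   (lambda (p)
     (set! current (append current (list p)))
     (if (member p vowels)             ; a vowel ends the current C*V group
         (begin
           (set! groups (append groups (list current)))
           (set! current nil))))
   phones)
  ;; trailing consonants are kept as a final group in this sketch; a real
  ;; syllabifier would attach them to the preceding unit
  (if current (set! groups (append groups (list current))))
  groups))

;; Example with invented phone names:
;; (cv-groups '(k aa r th i k) '(a aa i ii u uu e ee o oo))
;;   => ((k aa) (r th i) (k))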
1.2 Convergence and divergence

The official languages of India, except English and Urdu, share a common phonetic base, i.e. they share a common set of speech sounds. This common phonetic base consists of around 50 phones, including 15 vowels and 35 consonants. While all of these languages share a common phonetic base, the property that makes each language unique can be attributed to the phonotactics of that language rather than to its script or speech sounds. Phonotactics is the set of permissible combinations of phones that can co-occur in a language. This implies that the distribution of syllables encountered in each language is different. Another dimension in which the Indian languages significantly differ is prosody, which includes the duration, intonation and prominence associated with each syllable in a word or a sentence. As Indian languages are Akshara based, and an akshara is a subset of a syllable, the syllable in particular corresponds to a basic unit of production, as opposed to the diphone or the phone. Hence a syllable based unit selection synthesis system has been built for Indian languages.

2 What is Text to Speech Synthesis?

A Text to Speech Synthesis system converts text input to speech output. The conversion of text into spoken form is deceptively nontrivial. A naive approach is to store and concatenate the basic sounds (also referred to as phones) of a language to produce a speech waveform. But natural speech consists of co-articulation, i.e. the effect of coupling two sounds together, and of prosody at the syllable, word, sentence and discourse level, which cannot be synthesised by simple concatenation of phones. Thus a text-to-speech approach using phones provides flexibility but cannot produce intelligible and natural speech; it also cannot deal with generating appropriate intonation and duration for words in different contexts. Another method often employed is to store a huge dictionary of the most common words. However, such a method cannot synthesise the millions of names and acronyms which are not in the dictionary, while word level concatenation produces intelligible and natural speech but is not flexible. In order to balance flexibility and intelligibility/naturalness, sub-word units such as diphones, which capture the essential coarticulation between adjacent phones, are used as suitable units in a text-to-speech system.

2.1 Components of a text-to-speech system

A typical architecture of a Text-to-Speech (TTS) system is shown in the figure below. The components of a text-to-speech system can be broadly categorized as text processing and methods of speech generation. The typical input to a text-to-speech system is text as available in electronic documents, newspapers, blogs, emails etc. The text available in the real world is anything but a sequence of words from a standard dictionary. The goal of the text processing module is to process the input text, normalize the non-standard words, predict the prosodic pauses and generate the appropriate phone sequence for each of the words.

2.2 Normalization of non-standard words

Text processing in the real world has to deal with words whose pronunciation is typically not found in dictionaries or lexicons, such as "IBM", "CMU" and "MSN", acronyms such as ABC and US, contractions such as Ctrl-C, and/or and +/-, abbreviations such as approx. and lb., homographs, and symbols built using punctuation characters such as the exclamation mark ! and smileys :-). Such words are referred to as non-standard words (NSW). The various categories of NSW are:
1. Numbers, whose pronunciation changes depending on whether they refer to currency, time, telephone numbers, zip code etc.
2. Abbreviations, contractions, acronyms and homographs.
3. Punctuation-based tokens such as 3-4, +/- and and/or.
4. Dates, time, units and URLs.

2.3 Grapheme-to-phoneme conversion

Given the sequence of words, the next step is to generate a sequence of phones. For languages such as Spanish, Telugu and Kannada, where there is a good correspondence between what is written and what is spoken, a set of simple rules may often suffice. For languages such as English, where the relationship between the orthography and pronunciation is complex, a standard pronunciation dictionary such as CMU-DICT is used. To handle unseen words, a grapheme-to-phoneme generator is built using machine learning techniques.
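As an illustration of the rule-based case, Festival expresses hand-written letter-to-sound rules as an lts.ruleset. The fragment below is a minimal sketch in that style; the letters, sets and phone names are invented for illustration and are not taken from any released Indian-language voice.

(lts.ruleset
 example_lang
 ;; sets that can be used in rule contexts
 (
  (V a e i o u)
 )
 ;; rules of the form ( leftcontext [ letters ] rightcontext = phones )
 (
  ( [ c h ] = ch )    ; "ch" always maps to the phone ch
  ( [ c ] = k )       ; any other "c" maps to k
  ( [ a ] V = aa )    ; "a" before another vowel is lengthened (invented rule)
  ( [ a ] = a )
 ))

Rules such as these are adequate when, as stated above, the orthography closely follows the pronunciation; otherwise a lexicon plus a trained letter-to-sound model is used.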
2.4 Prosodic analysis

Prosodic analysis deals with the modeling and generation of appropriate duration and intonation contours for the given text. This is inherently difficult since prosody is absent in text. For example, the sentences "where are you going?", "where are you GOING?" and "where are YOU going?" have the same text content but can be uttered with different intonation and duration to convey different meanings. To predict appropriate duration and intonation, the input text needs to be analyzed. This can be performed by a variety of algorithms, including simple rules, example-based techniques and machine learning algorithms. The generated duration and intonation contour can then be used to manipulate the context-insensitive diphones in diphone based synthesis, or to select an appropriate unit in unit selection voices.

2.5 Methods of speech generation

The methods for converting a phone sequence to a speech waveform can be categorized into parametric, concatenative and statistical parametric synthesis.

2.5.1 Parametric synthesis

Parameters such as formants and linear prediction coefficients are extracted from the speech signal of each phone unit. These parameters are modified at synthesis time to incorporate the co-articulation and prosody of a natural speech signal. The required modifications are specified in terms of rules which are derived manually from observations of speech data. These rules cover duration, intonation, co-articulation and the excitation function. Examples of early parametric synthesis systems are Klatt's formant synthesis and MITalk. However, the quality of speech synthesized using traditional parametric synthesis is found to be robotic. This has led to the development of concatenative synthesis, where examples of speech units are stored and used during synthesis.
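For reference, the linear prediction coefficients mentioned above come from the standard all-pole model of speech production, in which each sample is approximated as a weighted sum of the previous p samples plus an excitation term. This is general background, not a formula specific to any system described in this document:

s[n] \approx \sum_{k=1}^{p} a_k \, s[n-k] + G \, e[n]

where a_k are the prediction coefficients, p is the model order, e[n] is the excitation signal and G its gain.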
2.5.2 Concatenative synthesis

The derivation of rules in parametric synthesis is a laborious task. Concatenative synthesis is instead based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. The speech units used in concatenative synthesis are typically at the diphone level, so that the natural co-articulation is retained. Duration and intonation are derived either manually or automatically from the data and are incorporated at synthesis time. There are three main sub-types of concatenative synthesis.

1. Unit selection synthesis - Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode, with some manual correction afterward using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and on acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree (a common formalisation of this search is shown after this list). Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech; DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database. Recently, researchers have proposed various automated methods to detect such unnatural segments in unit-selection speech synthesis systems. It is the possibility of storing more than one example of each unit, due to the increase in storage and computation capabilities, that has led to the development of unit selection synthesis: multiple examples of a unit, along with the relevant linguistic and phonetic context, are stored and used at synthesis time. The quality of unit selection synthesis is found to be more natural than that of diphone and parametric synthesis, although it lacks consistency, i.e. there are variations in the quality of the output.

2. Diphone synthesis - Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At run time, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations. Examples of diphone synthesizers are Festival diphone synthesis and MBROLA.

3. Domain-specific synthesis - Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. The technology is very simple to implement and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings. Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language can still cause problems unless the many variations are taken into account. For example, in non-rhotic dialects of English, the "r" in words like "clear" is usually only pronounced when the following word has a vowel as its first letter (e.g. in "clear out"). Likewise in French, many final consonants become no longer silent if followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.
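The search for the "best chain of candidate units" mentioned in item 1 is commonly formalised, in the Hunt and Black style of unit selection, as minimising the sum of a target cost (how well a candidate unit matches the desired context) and a concatenation or join cost (how smoothly adjacent units join). This is a general formulation for orientation; the exact costs and weights used by any particular system will differ:

\hat{u}_{1:N} = \arg\min_{u_{1:N}} \left[ \sum_{i=1}^{N} C^{t}(t_i, u_i) + \sum_{i=2}^{N} C^{c}(u_{i-1}, u_i) \right]

where t_i is the i-th target specification, u_i the candidate unit chosen for it, C^t the target cost and C^c the concatenation cost. The minimisation is typically done with a Viterbi-style dynamic programming search over the candidate lattice.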
2.6 Primary components of the TTS framework

1. Speech Engine - One of the most widely used speech engines is eSpeak. eSpeak uses the "formant synthesis" method, which allows many languages to be provided with a small footprint. The speech synthesized is intelligible and the responses are quick, but it lacks naturalness. The demand, however, is for a high quality, natural sounding TTS system. We have therefore used the Festival speech synthesis system developed at The Centre for Speech Technology Research, University of Edinburgh, which provides a framework for building speech synthesis systems and offers full text to speech support through a number of APIs. A large corpus based unit selection paradigm has been employed. This paradigm is known to produce intelligible, natural sounding speech output, but has a larger footprint. (A small sketch of driving Festival from its Scheme interpreter is given after this list.)

2. Screen Readers - The role of a screen reader is to identify and interpret what is being displayed on the screen and transfer it to the speech engine for synthesis. JAWS is the most popular screen reader used worldwide for Microsoft Windows based systems, but the main drawback of this software is its high cost, approximately 1300 USD, whereas the average per capita income in India is 1045 USD. Different open source screen readers are freely available. We chose ORCA for Linux based systems and NVDA for Windows based systems. ORCA is a flexible screen reader that provides access to the graphical desktop via user-customizable combinations of speech, braille and magnification. ORCA supports the Festival GNOME speech synthesizer and comes bundled with popular Linux distributions like Ubuntu and Fedora. NVDA is a free screen reader which enables vision impaired people to access computers running Windows. NVDA has already been integrated with the Festival speech engine by Olga Yakovleva. NVDA is popular among the members of the AccessIndia community. AccessIndia is a mailing list which provides an opportunity for visually impaired computer users in India to exchange information as well as conduct discussions related to assistive technology and other accessibility issues.

3. Typing tool for Indian Languages - The typing tools map the qwerty keyboard to Indian language characters. Widely used tools to input data in Indian languages are the Smart Common Input Method (SCIM) and the inbuilt InScript keyboard. The same have been used for our TTS systems.
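As a usage sketch, once a voice is installed Festival can be driven directly from its Scheme command interpreter. The voice name below is a placeholder, not an actual released voice:

festival> (voice_hindi_example)                        ; select the (hypothetical) voice
festival> (SayText "namaste")                          ; synthesize the text and play it
festival> (utt.save.wave (SayText "namaste") "out.wav" 'riff)  ; save the waveform to a wav file

The same calls are what a screen reader driver or SAPI/speech-dispatcher bridge ultimately issues on behalf of the user.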
2.7 Screen readers for the visually challenged

India is home to the world's largest visually challenged (VC) population. In today's digital world, VC persons need to depend on others to access common information that others take for granted, such as newspapers, bank statements and scholastic transcripts. Low attention is paid to people with disabilities, and social inclusion and acceptance is always a threat/challenge; owing to the perceived inability of people with disability, disability is often equated to inability. Assistive technologies (AT) are necessary to enable physically challenged persons to become part of the mainstream of society. A screen reader is an assistive technology potentially useful to people who are visually challenged, visually impaired, illiterate or learning disabled, enabling them to use/access standard computer software such as word processors, spreadsheets, email and the Internet. Education is THE means of developing the capabilities of people with disability, to enable them to develop their potential, become self sufficient, escape poverty and gain a means of entry to fields previously denied to them. However, the perceived cost of special education and attitudes towards inclusive education are major constraints for its effective delivery.

Before the start of this project, the Indian Institute of Technology, Madras (IIT Madras) had been conducting a training programme for visually challenged people, to enable them to use the computer with the screen reader JAWS, with English as the language. Although the VC persons have benefited from this programme, most of them felt that:
• The English accent was difficult to understand.
• They would prefer English spoken in an Indian accent.
• Most students would have preferred a reader in their native language.
• The price for the individual purchase of JAWS was very high.
Against this backdrop, it was felt imperative to build assistive technologies in the vernacular. An initiative was taken by DIT, Ministry of Information Technology, to sponsor the development of:
1. Natural sounding text-to-speech synthesis systems in different Indian languages.
2. Integration of these TTSes with open source screen readers (ORCA and NVDA, for Linux and Windows systems respectively), to ensure that the TTSes are also usable through freely available assistive technology.
The aim of this project is to make a difference in the lives of VC persons.

3 Overall Picture

1. Data Collection - Text was crawled from a news site and from a site with stories for children.
2. Cleaning up of Data - From the crawled data, sentences were picked so as to maximize syllable coverage (an illustrative sketch of such a greedy selection is given after this list).
3. Recording - The sentences that were picked were then recorded in a studio, a completely noise-free environment.
4. Labeling - The wavefiles were then manually labeled using the semi-automatic labeling tool to get accurate syllable boundaries.
5. Training - Using the wavefiles and their transcriptions, the Indian language unit selection voice was built.
6. Testing - Using the voice built, a MOS test was conducted with visually challenged end users as the evaluators.
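The syllable-coverage step can be pictured with the following greedy selection sketch in Festival's Scheme. It is illustrative only, not the consortium's actual selection script, and it assumes the crawled sentences have already been syllabified into lists of syllable names.

(define (count-new syls covered)
"Number of syllables in SYLS that are not already in COVERED."
(apply + (mapcar (lambda (s) (if (member s covered) 0 1)) syls)))

(define (remove-item x lst)
"LST without the first occurrence of X."
(cond ((not lst) nil)
      ((equal? x (car lst)) (cdr lst))
      (t (cons (car lst) (remove-item x (cdr lst))))))

(define (best-sentence sentences covered)
"The sentence contributing the largest number of new syllables."
(let ((best (car sentences)))
  (mapcar
   (lambda (s)
     (if (> (count-new s covered) (count-new best covered))
         (set! best s)))
   sentences)
  best))

(define (select-sentences sentences n covered)
"Greedily pick up to N sentences so as to maximise syllable coverage."
(if (or (not sentences) (< n 1))
    nil
    (let ((b (best-sentence sentences covered)))
      (cons b (select-sentences (remove-item b sentences)
                                (- n 1)
                                (append covered b))))))

;; e.g. (select-sentences corpus 500 nil) picks 500 sentences, where each
;; element of corpus is a list of syllable names for one crawled sentence.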
4 Labeling Tool

It is a widely accepted fact that the accuracy of labeling of speech files has a great bearing on the quality of unit selection synthesis. The process of manual labeling is a time consuming and daunting task, and it is not trivial to label waveforms manually. The DONLabel Labeling tool provides an automatic way of performing labeling, at the syllable level, given an input waveform and the corresponding text in utf8 format. The tool makes use of group delay based segmentation to provide the segment boundaries. The size of the segment labels generated can vary from monosyllables to polysyllables as the Window Scale Factor (WSF) parameter is varied from small to large values.

Our labeling process makes use of:
• the Ergodic HMM (EHMM) labeling procedure provided by Festival
• The group delay based algorithm (GD)
• The Vowel Onset Point (VOP) detection algorithm

The Labeling tool displays a panel which shows the segment boundaries estimated by the Group Delay algorithm, another panel which shows the segment boundaries as estimated by the EHMM process, and a panel for VOP, which shows how many vowel onset points are present between each pair of segments provided by the group delay algorithm. This helps greatly in adjusting the labels provided by the group delay algorithm, and it also improves the accuracy of the labels generated by the labeling tool. By using VOP as an additional cue, and by comparing the labeling outputs of both the EHMM process and the VOP algorithm, manual intervention during the labeling process can be eliminated.

The tool works for 6 different Indian languages, namely
• Hindi
• Tamil
• Malayalam
• Marathi
• Telugu
• Bengali
The tool also displays the text (utf8) in segmented format along with the speech file.

4.1 How to Install LabelingTool

1. Copy the html folder to the /var/www folder. If the www folder is not there in /var, create a folder named www and extract the html folder into it. We then have the labelingTool code in /var/www/html/labelingTool/
2. Install the java compiler using the following command
sudo apt-get install sun-java6-jdk
The following error may come
==> Reading package lists... Done
Building dependency tree
Reading state information... Done
Package sun-java6-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or is only available from another source
Enable java script in the properties of the browser used Use Google chrome or Mozilla firefox 8. give full permissions to html folder sudo chmod −R 777 html/ 10.sh and then run: sudo apt-get install sun-java6-jdk sudo apt-get install sun-java6-jre Source : https://github. Install apache2 using the following command sudo apt−get install speech-tools 6.so sudo ln −s /etc/alternatives/mozilla−javaplugin. Install java plugin for browser sudo apt−get install sun−java6−plugin Create a symbolic link to the Java Plugin libnpjp2.policy 14 .so using the following commands sudo ln −s /usr/lib/jvm/java-6-sun/jre/plugin/i386/libnpjp2.sh chmod +x oab-java6. Install php using the following command sudo apt−get install php5 4.com/flexiondotorg/oab-java6/blob/a04949f242777eb040150e53f4dbcd4a3ccb7568/ README.so /etc/alternatives/mozilla−javaplugin.sh -O oab-java6.2./oab-java6.com/flexiondotorg/oab-java6/raw/0. Install apache2 using the following command sudo apt−get install apache2 update the paths in the following file /etc/apache2/sites−available/default Set all path of cgi−bin to ”/var/www/html/cgi-bin”.so 9.sh sudo . Add the following code to /etc/java−6−sun/security/java. Install tcsh using the following command sudo apt−get install tcsh 7.rst 3.1/oab-java6.so /usr/lib/mozilla/plugins/libnpjp2. In /var/www/html/labelingTool/jsrc/install file. make sure that correct path of javac is provided as per your installation. Open the browser and go to the following link http://localhost/main.security.6. Note: LabelingTool. When Labelingtool is working fine the following files will be generated in labelingTool/results folder boundary segments spec low vop wav sig gd spectrum low segments indicator tmp.26/bin/javac Version of java is 1. Replace the Pronunciation Rules. in the LIVE box.6.php NOTE : VOP algoirthm is not used in the current version of the labelingTool. it should display This web browser can indeed run Java applets wait for some time for the display to come. 12.grant { permission java.d/apache2 restart 14.26 here. Restart apache using the following command sudo /etc/init. Install the tool using the following command Go to /var/www/html/labelingTool/jsrc and run the below command sudo . 13. 11. Note: Recompile with −Xlint:deprecation for details. please ignore on the below sections 4.java uses or overrides a deprecated API.0.2 Troubleshooting of LabelingTool 1.html In that webpage.org/enabled.seg vopsegments 15 ./install It might give the following output which is not an error.pl ) 16. For example : /usr/lib/jvm/java−6−sun-1. there is some issue with the java applets. check if java applet is enabled in the browser by using the following link http://javatester. Check the path and give correct values. So anything related to vop.0. In case it had displayed This web browser can NOT run Java applets. Please browse for how to enable java in your version of browser and fix the issue. it might be different in your installation.AllPermission. 15. }.pl in the /var/www/html/labelingTool folder with your language specific code (The name should be same − Pronunciation Rules. If the above files are not getting created..wav .results/segments .base /home/text 0001. (a) cd /var/www/html/labelingTool/Vop−Lab (b) make −f MakeEse clean (c) make −f MakeEse (d) cd bin (e) cp Multiple Vopd ./Multiple Vopd fe-ctrl. try recompiling. if the vopUpdate button is clicked./results/wav .ese .. 
added or moved) and saved 2 more files gets created in the results folder././bin/ 5./results/vop The file ’wav’ in results folder is already the sphere format of your input wavefile. ind upd segments updated 3. 16 .. cd /var/www/html/labelingTool/Segmentation make −f MakeWordsWithSilenceRemoval clean make −f MakeWordsWithSilenceRemoval cp bin/WordsWithSilenceRemoval /var/www/html/labelingTool/bin/ • The command line usage of the Multiple Vopd program is as follows Multiple Vopd ctrlFile waveFile(in sphere format) segments file vopfile example : . another new file gets created in the results folder vop updated 4../WordsWithSilenceRemoval fe−words2..results/spec . Follow the below steps.php page is getting stuck. • The command line usage of the WordsWithSilenceRemoval program is as follows WordsWithSilenceRemoval ctrlFile waveFile sigFile spectraFile boundaryFile thresSilence(ms) thresVoiced(ms) example : .results/boun 100 100 Two files named spec and boun has to be generated in the results folder. when the boundaries are manually updated. we can try running through command line as follows Execute them from/var/www/html/labelingTool/bin folder. If a file named ’vop’ is not getting generated in labelingTool/results folder and the labelit. (deleted.2... you need to compile the vop module. When after manually updating and saving. if not created. a file ’vop’ has to be generated in the results folder. following command can be used in the labelingTool folder chmod −R 777 * chown −R root:root * ) 9. How to check if tcsh is installed. The java. enable character−encoding to utf8 for the browser Tools->options->content->fonts and colors(Advanced menu)->default character encoding(utf8) Restart browser. 8.policy file should be updated as specified in the installation steps. When the lab file is viewed in the browser. 17 .On running Multiple Vopd binary. If speech tools was installed along with festival and there is no link to it in /usr/bin. 6. please make a link to point to ch wave binary file in /usr/bin folder. If the file ’wav’ is not produced in results folder. Provide full permissions to the labelingTool folder and its sub folder so that the new files can be created and updated without any permission issues.. ’speech tools’ are not installed How to check if speech tools are installed : Once installing speech tools check if the following ch wave −info <wave file name> This command should give the information about that wave file. otherwise it may result in error ”Error writing Lab File” 10. (if required. type command ’tcsh’ and a new prompt will come. 7. if utf8 is not displaying. 5 Labeling Tool User Manual 5.1 How To Use Labeling Tool The front page of the tool can be taken using the URL http://localhost/main.php A screen shot of the front page is as shown below. The front page has the following fields • The speech file in wav format should be provided. It can be browsed using the browse button • The corresponding utf8 text has to be provided in the text file. It can be browsed using the browse button the text file that is uploaded should not have any special characters. • The ehmm lab file generated by festival while building voice can be provided as input. This is an optional field. • The gd lab file generated by labelingtool in a previous attempt to label the same file. This is an optional field. If the user had once labelled a file half way and saved the lab file, it can be provided as input here so as to label the rest of it or to correct the labels. 
• The threshold for voiced segment has to be provided in the text box. It varies for each wav file. The value is in milli seconds. (e.g. 100, 200, 50..) • The threshold for unvoiced segment has to be provided in the text box. It varies for each wav file. The value is in milli seconds. (e.g. 100, 200, 50..) If the speech file has very long silences a high value can be provided as threshold value. • WSF (window scale factor) can be selected from the drop down list. The default value given is 5. Depending on the output user will be required to change WSF values and find the most appropriate value that provides the best segmentation possible for the speech file. • The corresponding language can be selected using the radio button 18 • Submit the details to the tool using submit button. A screen shot of the filled up front page is given below. Loading Page On clicking ’submit’ button on the front page the following page will be displayed. Validation for data entered • If the loading of all files were successful and proper values were given for the thresholds in the front page the message Click View to see the results... will be displayed as shown above. • If the wave file was not provided in the front page the following error will come in the loading page Error uploading wav file. Wav file must be entered • If the text file was not provided in the front page the following error will come in the loading page Error uploading text file. Text file must be entered • If the threshold for voiced segments was not provided in the front page the following error will come in the loading page Threshold for voiced segments must be entered • If the threshold for unvoiced segments was not provided in the front page the following error will come in the loading page Threshold for unvoiced segments must be entered 19 • If numeric value is not entered for thresholds of unvoiced or voiced segments, in the front page the following error will come in the loading page Numeric value must be entered for thresholds • The wav file loaded will be copied to the /var/www/html/UploadDir folder as text.wav • The lab file (ehmm label file) will be copied to /var/www/html/labelingTool/lab folder as temp.lab If error occurred while moving to the lab folder the following error will be displayed Error moving lab file. • The gd lab file (group delay label file) will be copied to /var/www/html/labelingTool/results folder with the name gd lab. If error occurred while moving to the lab folder the following error will be displayed Error moving gdlab file. • The Labelit Page On clicking view button on the loading page the labelit page will be loaded. A screenshot of this page along with the markings for each panel is given below. Note: If error message Error reading file ’http://localhost/labelingTool/tmp/temp.wav appears, it means in place of wav file some other file(eg text file) was uploaded. • Panels on the Labelit Page It has 6 main Panels – EHMM Panel displays the lab files generated by festival using EHMM algorithm while building voices – Slider Panel using this panel we can slide, delete or add segments/labels 20 This means that. is considered to be a segment boundary. That means the segment boundary found by group delay algorithm is correct. Here green colour corresponds to one vowel onset point. select the WSF from the drop down list and click ’RESEGMENT’. Wherever the peaks appear. Below figure shows the same waveform with a lesser wsf (wsf =3). Yellow colour corresponds to more than one vowel onset points. 
To experiment with different wsf values. 21 . Lesser the wsf. – VOP Panel shows the number of vowel onset points found between each segments provided by Group delay. greater the number of boundaries and vice versa. This is because of the limitations in java) – Text Panel displays the segmented text (in utf8 format) with syllable as the basic units. as seen in wavesurfur. A different wsf will provide a different set of boundaries.– Wave Panel displays the speech waveform in segmented format (Note: The speech wave file is not appearing. A screen shot for the same text (as in the above figure) with a greater wsf selected is shown below The above figure shows the segmentation using wsf = 12. – GD Panel draws the group delay curve. That means the segment boundary found by group delay algorithm is wrong and that boundary needs to be deleted. It gives less number of boundaries. • Resegment The WSF selected for this example is ’5’. between 2 boundaries found by group delay algorithm there will be one or more boundaries. It gives more number of boundaries. Red colour corresponds to zero vowel onset point. This is the result of group delay algorithm. if required save button has to be clicked. Click the mouse on the waveform and a yellow line will appear to show the selection. 22 . with a heading ’Waveform’ The Menu Bar contains following buttons in that order from left to right – Save button The lab file can be saved using the save button.So the ideal wsf for the waveform has to be found out. On clicking this button. (Not missing any text segments nor having many segments without texts). the VOP algorithm is recalculated on the new set of segments on clicking this button. This button can be used to verify each segment. – Play from selection Play the waveform starting from the current selection to the end. adding or deleting) . addition or dragging). After making the changes. Easier way is to check the text segments are reaching approximately near the end of the waveform. • Menu Bar The menu Bar is just above the EHMM Panel. After making any changes to the segments (deletion. – Play the waveform The entire wave file will be played on pressing this button – Play the selection Select some portion of the waveform (say a segment) and play just that part using this button. the save button must be pressed before updating the VOP panel. from that selected point to end of the file will be played – Play to selection Plays the waveform from the beginning to the end of the current selection – Stop the playbackStops the playing of wave file – Zoom to fit Display the selected portion of the wave zoomed in – Zoom 1 Display the entire wave – Zoom in Zoom in on the wave – Zoom out Zoom out on the wave – Update VOP Panel After changing the segments (dragging. The selected point appears as a yellow line 3cm 23 .Some screen shots are given below to demonstrate the use of menu bar. Below figure shows how to select a portion (drag using mouse on wavepanel) of the waveform and play that part. The selected portion appears shaded in yellow as shown The below figure shows how to select a point (click using mouse on wavepanel) and play from selection to end of file. The Next figure shows how the portion of the wave file selected in the above figure is zoomed. 2cm 24 .Next figure shows how to select a portion of the wave and zoom to fit. • Deletion of a Segment All the segments appear as red lines in the labeling tool output. A segment can be deleted by right clicking on that particular segment on the slider panel.5. 
whether it is matching the text segmentation. with the help of VOP and EHMM Panels. User can decide whether to delete right or left of red segment after listening. The VOP has given a red colour (indication to delete one) for that segment.2 How to do label correction using Labeling tool Each segment given by the group delay can be listened to and decided whether the segment is correct or not. Ideally we delete the fourth one. the text segments get shifted and fits after silence segment as shown in the below figure. The figure below shows the original output of labelling tool for the Hindi wave file The third and fourth segments are very close to each other and one has to be deleted. On deletion (right click on slider panel on that segment head) of the fourth segment. 25 . But sometimes the algorithm computes it as a boundary. It can be wrong in some cases. Hence it need not be deleted. Threshold value in GD panel is the middle line in magenta colour. The figure below shows the corrected segments (after deletion) 26 .On listening each segments it is seen that the segment between and is wrong. The second last red column in VOP is incorrect and GD gives the correct segment. The last one is correct and we have to delete a segment. The yellow colour on VOP usually says to add a new segment. The VOP gives red colour for the segment and the corresponding peak in the group delay panel is below the threshold. Peaks below the threshold in group delay curve usually wont be a segment boundary. is always used as a reference for GD algorithm. It has to be deleted. There are 2 more red columns in VOP. but here the yellow colour is appearing in the Silence region and we ignore it. On clicking Save button a dialog box appears with the message Lab File Saved Click Next to Continue A silence segment gets deleted on clicking the right boundary of the silence segment. the save button have to be pressed. 27 . The updated output is shown in below figure. • Update VOP Panel After saving the changes made to the labels the VOP update button has to be clicked to recalculate the VOP algorithm on the new segments.On completion of correcting the labels. /labelingTool/labfiles directory in the gd lab file option in the main page. when Save button is pressed same labfile is updated but before updating backup copy of lab file is created. The VOP shows three yellow columns here of which the second yellow column is true. In the above figure it can be seen that the mouse is placed on the slider panel at the location to add the new segment. But backup copy is by default hidden. The figure below shows the corresponding corrected wave file and after VOP updation done. The GD plot shows a small peak in that segment and we can be sure that the segment has to be added at the peak only. system creates the backup copy of that file. The below figure shows a case in which a segment needs to be added. 28 . upload it from . After modification. But if we use resegmentation the already present labels will be gone and it will be regenerated based on the new wsf value present. • Modification of labfile If a half corrected lab file is already present (gd lab file present).• Adding A Segment A segment can be added by right clicking with mouse on the slider panel at the point where a segment needs to be added. Note: If system creates a lab file with same name that already exists in labfiles directory. to view it just press CTRL + h. Irrespective of the wsf value. Sliding can be used if required while correcting the labels. 
• Sliding a Segment A segment can be moved to left or right by clicking on the head of the segment boundary on the slider panel and dragging left or right. the earlier lab file will be loaded. • Logfiles Tool generates a seprate log file for each lab file(eg.4 Control file A control file is placed at the location /var/www/html/labelingTool/bin/fewords. These parameters can be adjusted by the user to get better segmentation results.base The parameters in the control file are given below./labelingTool/logfiles directory.3 Viewing the labelled file Once the corrections are made and the save button is clicked the lab file is generated in/var/www/html/labelingT directory and it can be viewed by clicking on the ’next’ link. text0001. • windowSize size of frame for energy computation • waveType type of the waveform: 0 Sphere PCM 1 Sphere Ulaw 2 plain sample one short integer per line 3 RAW sequence of bytes each sample 8bit 4 RAW16 two bytes/sample Big Endian 5 RAW16 two bytes/sample Little Endian 6 Microsoft RIFF standard wav format • winScaleFactor should be chosen based on syllable rate choose by trial and error • gamma reduces the dynamic range of energy • fftOrder and fftSize MUST be set to ZERO!! • frameAdvanceSamples frameshift for energy computation • medianOrder order of median smoothing for group delay function 1==> no smoothing 29 . The lab file will appear on the browser window as below 5. 5. The following message comes on clicking the ’next’.log) in . Please keep cleaning this directory after certain interval. Download the labfile: labfile Click on the link ’labfile’. 5 Performance results for 6 Indian Languages Testing was conducted for a set of test sentences for all 6 Indianlanguages and the percentage of correctness was calculated based on the following formulae. it is NOT used . Examples tested with ENERGY only • Sampling rate of the signal required for giving boundary information in seconds. The calculations were done after the segmentation was done using the tool with the best wsf and threshold values.24% 77.6 (N oof insertions+noof deletions) T otalnoof segments Percentage of Correctness 86. thresZero. thresSpectralFlatness are thresholds used for voiced unvoiced detection When a parameter is set to zero. [1− Language Hindi Malayalam Telugu Marathi Bengali Tamil 5.68% 85.83% 78.38% Limitations of the tool • Zooming is not enabled for VOP and EHMM panels • Wave form is not displayed properly as in wavesurfur 30 ] × 100 .• ThresEnergy.84% 77.40% 80. 5. phone plus word etc. Technically diphone selection is a simple case of this. positional and whatever) for each unit type. 2. The cluster method has actual restrictions on the unit size. and accents. Build distances tables. very natural sounding synthesis. such as phonetic context. The actual features used may easily be changed and experimented with as can the definition of the definition of acoustic distance between the units in a cluster. phone plus stress. The code here and the related files basically assume unit size is phone. Building coefficients for acoustic distances. word position. 5. diphone. First is size. 6. Build cluster trees using wagon with the features and acoustic distances dumped by the previous two stages 7. However techniques like this often produce very high quality. 4. The basic processes involved in building a waveform synthesizer for the clustering algorithm are as follows. prosodic. 
However because you may also include a percentage of the previous unit in the acoustic distance measure this unit size is more effectively phone plus previous phone. weighting the costs of relative candidates is a non-trivial problem. 6. A high level walkthrough of the scripts to run is given after these lower level details. demi-syllable. The second type itself which may be simple phone. Dump selection features (phone context. typically some form of cepstrum plus F0. LPC). such as phone.2 Choosing the right unit type Before you start you must make a decision about what unit type you are going to use. Building the voice description itself 6.g. However they also can produce some very bad synthesis too. precalculating the acoustic distance between each unit of the same phone type. prosodic features (F0 and duration) and higher level features such as stressing. it simply clusters the given acoustic units with the given feature. The theory is obvious but the design of such systems and finding the appropriate selection criteria. 1. when the database has unexpected holes and/or the selection costs fail. thus it is somewhat diphone like. By ”unit selection” we actually mean the selection of some unit of speech which may be anything from whole phrase down to diphone (or even smaller). in unit selection there is more than one example of the unit and some mechanism is used to select between them at run-time. Building the utterance Structure 3. but the basic synthesis code 31 . Collect the database of general speech. However typically what we mean is unlike diphone selection.1 Cluster unit selection The idea is to take a database of general speech and try to cluster each phone type into groups of acoustically similar units based on the (non-acoustic) information available at synthesis time. or some pitch synchronous analysis (e. Note there are two dimensions here.6 Unit Selection Synthesis Using Festival This chapter discusses some of the options for building waveform synthesizers using unit selection techniques in Festival. one of the advantages of unit selection is that a much more general database is desired. Note that the first part of a unit name is assumed to be the phone name in various parts of the code thus although you make think it would be neater to return previousphone phone that would mess up some other parts of the code. However.name i) ”” (item. The more distinctions you make less you depend on the clustering acoustic distance. The second dimension. In the limited domain case the word is attached to the phone. If you wanted to make a diphone unit selection voice this function could simply be (define (INST LANG NAME::clunit name i) (string append (item. Also synthesis using these techniques seems to retain aspects of the original database. as it is those variations that can make synthesis from unit selection sound more natural.3 Collecting databases for unit selection Unlike diphone database which are carefully constructed to ensure specific coverage.is currently assuming phone sized units. The parameter clunit name feat may be used define the unit type. If the database is broadcast news stories. 6. is very open and we expect that controlling this will be a good method to attain high quality general unit selection synthesis. The important thing to remember is that at synthesis time the same function is called to identify the unittype which is used to select the appropriate cluster tree to select from. There we distinguish each phone with the word it comes from. 
In the given setup scripts this feature function will be called lisp INST LANG NAME::clunit name. at least phone coverage if not complete diphone coverage. The mechanism to define the unit type is through a (typically) user defined feature function. but the more you depend on your labels (and the speech) being (absolutely) correct. prosodic variation is probably a good thing. Good phonetic coverage is also useful. although voices may be built from existing data not specifically gathered for synthesis there are still factors about the data that will help make better synthesis. However unlike diphones database. The decision of how to carve up that space depends largely on the intended use of the database. The simplest conceptual example is the one used in the limited domain synthesis. You can also consider some demi−syllable information or more to differentiate between different instances of the same phone. type. As we are going to be selecting units from different parts of the database the more similar the recordings are. the less likely bad joins will occur. Thus the voice simply defines the function INST LANG NAME::clunit name to return the unit type for the given segment. thus a d from the word limited is distinct from the d in the word domain. Such distinctions can hard partition up the space of phones into types that can be more manageable. the synthesis from it will typically sound like read news stories (or more importantly will sound best when it is reading 32 .feat i ”p.name”))) Thus the unittype would be the phone plus its previous phone. Like diphone databases the more cleanly and carefully the speech is recorded the better the synthesized voice will be. Thus you need to ensure that if you use say diphones that the your database really does not have all diphones in it. It is highly recommended that you follow this format otherwise scripts. They should have the extension . There are many ways to organize databases and many of such choices are arbitrary. marking closures in stops) mapping that information back to the phone labels 33 . pm/ − Pitchmark files as generated from the lar files or from the signal directly.5 Building utterance structures for unit selection In order to make access well defined you need to construct Festival utterance structures for each of the utterances in your database. Ideally these should all be carefully hand labeled but in most cases that’s impractical. words. and examples will fail. This is usually the master label files.wav and the fileid consistent with all other files through the database (labels. There are ways to automatically obtain most of these labels but you should be aware of the inherit errors in the labeling system you use (including labeling systems that involve human labelers). one utterances per file with a standard name convention. For the unit selection algorithm described below the segmental labels should be using the same phoneset as used in the actual synthesis voice. Again the notes about recording the database apply. syllables. and intonation events. Note that when a unit selection method is to be used that fundamentally uses segment boundaries its quality is going to be ultimately determined by the quality of the segmental labels in the databases. festival/relations/ − The processed labeled files for building Festival utterances. here is our ”arbitrary” layout. F0 Targets. This (in its basic form) requires labels for segments. festival/ − Festival specific label files. lar/ − The EGG files (larynograph files) if collected. 
Other directories will be created for various processing reasons. Syllable/ etc. Word/. held in directories whose name reflects the relation they represent: Segment/.news stories). phrases.g. Typically this first contains a copy of standard scripts that are then customized when necessary to the particular database wav/ − The waveform files. festival/utts/ − The utterances files as generated from the festival/relations/ label files. 6. utterances. these may contain more information that the labels used by festival which will be in festival/relations/Segment/. though it will sometimes be the case that the database is already recorded and beyond your control. These should be headered. 6. lab/ − The segmental labels. pitch marks etc). However a more detailed phonetic labeling may be more useful (e. The basic database directory should contain the following directories bin/ − Any database specific scripts for processing. in that case you will always have something legitimate to blame for poor quality synthesis.4 Preliminaries Throughout our discussion we will assume the following database layout. 6. Thus have a parametric spectral representation of each pitch period. Once we have a better handle on selection techniques themselves it will then be possible to start experimenting with noisy labeling. but the result should always be checked by hand.pm −window type hamming done The above builds coefficients at fixed frames. using such an aligner may be a useful first pass.mcep −pm pm/$fname. durations and an F0 target. Most autoaligners are built using speech recognition technology where actual phone boundaries are not the primary measure of success. This is is also still a research issue but in the example here we will use Mel cepstrum. Thus it. though it does require that pitchmarks are reasonably identified. We have also experimented with building parameters pitch synchronously and have found a slight improvement in the usefulness of the measure 34 . we have successfully used the aligner described in the diphone chapter above to label general utterance where we knew which phone string we were looking for. See the Section called Utterance building in the Chapter called A Practical Speech Synthesis System for a more general discussion of how to build utterance structures for a database. It has been suggested that aligning techniques and unit selection training techniques can be used to judge the accuracy of the labels and basically exclude any segments that appear to fall outside the typical range for the segment type. This is the desire for researchers in the field. Autoaligned databases typically aren’t accurate enough for use in unit selection. for i in $* do fname=‘basename $i .before actual use. General speech recognition systems primarily measure words correct (or more usefully semantically correct) and do not require phone boundaries to be accurate. is believed that unit selection algorithms should be able to deal with a certain amount of noise in the labeling. Here is an example script which will generate these parameters for a database. Having said this though. If the database is to be used for unit selection it is very important that the phone boundaries are accurate.6 Making cepstrum parameter files In order to cluster similar units in a database we build an acoustic representation of them. but at pitch marks. Interestingly we do not generate these at fixed intervals. 
This technique is inherently robust to at least a few tens of millisecond boundary labeling errors.wav‘ echo $fname MCEP $SIG2FV $SIG2FVPARAMS −otype est binary $i −o mcep/$fname. intonation events and phrasing allow a much richer set of features to be used for clusters. it is included in festvox/src/unitsel/make mcep. but we are some way from that and the easiest way at present to improve the quality of unit selection algorithms at present is to ensure that segmental labeling is as accurate as possible. However it should be added that this unit selection technique (and many others) support what is termed ”optimal coupling” where the acoustically most appropriate join point is found automatically at run time when two units are selected for concatenation. We have found this a better method. For the cluster method defined here it is best to construct more than simply segments. A whole syllabic structure plus word boundaries. we will go through each step and at that time explain which parameters affect the substep. 35 . To check if an installation already has support for clunits check the value of the variable *modules*. We have not yet tried pitch synchronous MEL frequency cepstrum coefficients but that should be tried. Note the secondary advantage of using LPC coefficients is that they are required any way for LPC resynthesis thus this allows less information about the database to be required at run time. We do not pretend that this part is particularly neat in the system but it does work. add ALSO INCLUDE += clunits to the end of your festival/config/config file. The function build clunits will build the distance tables.3. sort them into segment type and name them with individual names (as TYPE NUM. and recompile.based on this. assuming you have already generated pitch marks.scm when a new voice is setup. The file festvox/src/unitsel/build clunits.scm contains the basic parameters to build a cluster model for a databases that has utterance structures and acoustic parameters. Of course you need the clunits modules compiled into your version of Festival. This first stage is required for all other stages so that if you are not running build clunits you still need to run this stage first. This happens to be appropriate from LPC coefficients.3. The function build clunits runs through all the steps but in order to better explain what is going on. the version of clunits in 1. The first stage is to load in all the utterances in the database. There are many parameters are set for the particular database (and instance of cluster building) through the Lisp variable clunits params. When pitch synchronous parameters are build the clunits module will automatically put the local F0 value in coefficient 0 at load time. 6. An reasonable set of defaults is given in that file. and reasonable run−time parameters will be copied into festvox/INST LANG VOX clunits. dump the features and build the cluster trees.1 or later is required. This is done by the calls (format t ”Loading utterances and sorting types\n”) (set! utterances (acost:db utts load dt params)) (set! unittypes (acost:find same types utterances)) (acost:name units unittypes) Though the function build clunits init will do the same thing. Version 1. Also a more general duration/number of pitch periods match algorithm is worth defining. The script in festvox/src/general/make lpc can be used to generate the parameters. To compile in clunits.7 Building the clusters Cluster building is mostly automatic. 
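For reference, once clunits is compiled in, the stages above are usually driven from a Festival prompt roughly as follows. This is only a sketch: it assumes the voice's festvox/build_clunits.scm has been loaded, and it assumes the argument to build_clunits_init is the same prompt file (etc/txt.done.data) used by the other build functions in this document.

    ;; confirm the clunits module was compiled into this Festival (it should appear in the list)
    (print *modules*)
    ;; first stage only: load the utterances, sort them into unit types and name the units
    (build_clunits_init "etc/txt.done.data")
    ;; ...or run the whole build (distance tables, feature dump, trees, catalogue) in one go
    (build_clunits "etc/txt.done.data")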
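The unit types being sorted and named at this stage are whatever the voice's clunit_name feature function returns, as described earlier. As a minimal sketch (INST_LANG_NAME is the usual festvox placeholder and should be replaced by your own voice prefix), a function that types each unit by the current phone plus its previous phone could be written as:

    (define (INST_LANG_NAME::clunit_name i)
      "(INST_LANG_NAME::clunit_name i)
    Return the unit type for segment item i: here the phone name plus
    the name of the previous phone."
      (string-append
       (item.name i)
       "_"
       (item.feat i "p.name")))

Returning just (item.name i) gives plain phone-sized types; richer names (adding stress, syllable position or demi-syllable context, as discussed above) partition the space into smaller, more homogeneous types.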
This uses the following parameters name STRING A name for this database.0 is buggy and incomplete and will not work. This is done by the following two function calls (format t ”Loading coefficients\n”) (acost:utts load coeffs utterances) (format t ”Building distance tables\n”) (acost:build disttabs unittypes clunits params) The following parameters influence the behaviour.db dir FILENAME This pathname of the database.0 means all.. )) In the examples below the list of fileids is extracted from the given prompt file at call time. as in the current directory. utts ext FILENAME The file extention for the utterance files files The list of file ids in the database. ac left context FLOAT The amount of the previous unit to be included in the the distance. 0. Thus a mean mahalanobis euclidean distance is found between units rather than simply a euclidean distance. coeffs dir FILENAME The directory (from db dir) that contains the acoustic coefficients as generated by the script make mcep.0 means none.utt”) (files (”kdt 001” ”kdt 002” ”kdt 003” . This parameter may be used to make the acoustic distance sensitive to the previous acoustic 36 . typically .. Precalculating this saves a lot of time as the cluster will require this number many times. 1. The recommended value is t. coeffs ext FILENAME The file extention for the coefficient files get std per unit Takes the value t or nil. If t the parameters for the type of segment are normalized by finding the means and standard deviations for the class are used. For example for the KED example these parameters are (name ’ked timit) (db dir ”/usr/awb/data/timit/ked/”) (utts dir ”festival/utts/”) (utts ext ”. utts dir FILENAME The directory contain the utterances. The acoustic distance between each segment of the same type is calculated and saved in the distance table. The next stage is to load the acoustic parameters and build the distance tables. The first parameter is (in normal operations) F0 . An example is (coeffs dir ”mcep/”) (coeffs ext ”.5 0.5 0..8. These are standard festival feature names with respect to the Segment relation.5 0.5 0.5 0.5)) The next stage is to dump the features that will be used to index the clusters.5 0.mcep”) (dur pen weight 0.) The weights for each parameter in the coefficeint files used while finding the acoustic distance between segments. feats LIST The list of features to be dumped. F0 pen weight FLOAT The penalty factor for F0 mismatch between units. dur pen weight FLOAT The penalty factor for duration mismatch between units. but they are indexed by these features. Remember the clusters are defined with respect to the acoustic distance between each unit in the cluster. Finding the right parameters and weightings is one the key goals in unit selection synthesis so its not easy to give concrete recommendations. The remaining parameters are typically MFCCs (and possibly delta MFCCs).8) (ac weights (0. There must be the same number of weights as there are parameters in the coefficient files. but there may be better ones too though we suspect that real human listening tests are probably the best way to find better values. For our KED example these values are (feats dir ”festival/feats/”) (feats 37 . ac weights (FLOAT FLOAT .5 0.5 0. Its is common to give proportionally more weight to F0 that to each individual other parameter. The following aren’t bad.5 0.. These features are those which will be available at text-to-speech time when no acoustic information is available. The recommended value is 0.5 0. 
The name features may (and probably should) be over general allowing the decision tree building program wagon to decide which of theses feature actual does have an acoustic distinction in the units. The function to dump the features is (format t ”Dumping features for clustering\n”) (acost:dump features unittypes utterances clunits params) The parameters which affect this function are fests dir FILENAME The directory when the features will be saved (by segment type).1) (get stds per unit t) (ac left context 0.5 0.context. Thus they include things like phonetic and prosodic context rather than spectral information. name p.ph cplace p.p.name pp.seg onsetcoda R:SylStructure.ph cvox segment duration seg pitch p.ph vrnd p.ph ctype pp.ph vheight p.seg onsetcoda p.stress seg onsetcoda n.ph vc pp.seg pitch n.R:Syllable.ph cvox)) Now that we have the acoustic distances and the feature descriptions of each unit the next stage is to find a relationship between those features and the acoustic distances. This is a string and may also include any extra parameters you wish to give to wagon.ph cvox n.parent.ph vlng n.parent.parent.ph vfront p.seg pitch R:SylStructure.ph vrnd n.ph cplace n. wagon progname FILENAME The pathname for the wagon CART building program.ph vfront n.ph vlng pp.ph ctype n. though if you change the feature list (or the values those features can take you may need to change this file. The clusters are built by the following function (format t ”Building cluster trees\n”) (acost:find clusters (mapcar car unittypes) clunits params) The parameters that affect the tree building process are tree dir FILENAME the directory where the decision tree for each segment type will be saved wagon field desc LIST A filename of a wagon field descriptor file.(occurid p.name n.accented pos in syl syl initial syl final R:SylStructure.ph vrnd pp.syl break R:SylStructure.ph cplace pp. This we do using the CART tree builder wagon.parent. However in synthesis there will be desired units whose feature vector didn’t exist in the training set. An example is given in festival/clunits/all.ph vheight pp. It will find out questions about which features best minimize the acoustic distance between the units in that class. That is we are trying to classify all the units in the database.desc which should be sufficient for the default feature list.ph vlng p. 38 .ph ctype p.ph vc p. wagon has many options many of which are apposite to this task though it is interesting that this learning task is interestingly closed.syl break pp. This is a standard field description (field name plus field type) that is require for wagon.ph vc n. there is no test set as such.ph vfront pp.ph vheight n. The wagon cluster size the minimum size. This is down within the wagon training. Another usage of this is to cause only the center example units to be used.desc”) (wagon progname ”/usr/awb/projects/speech tools/bin/wagon”) (wagon cluster size 10) (prune reduce 0) The final stage in building a cluster model is collect the generated trees into a single file and dumping the unit catalogue. prune reduce INT This number of elements in each cluster to remove in pruning. In our KED example these have the values (trees dir ”festival/trees/”) (wagon field desc ”festival/clunits/all. Note that as the distance tables can be large there is an alternative function that does both the distance table and clustering in one. This is usefully when there are some large numbers of some particular unit type which cannot be differentiated. 
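To make the wagon_field_desc format concrete, here is a sketch of what the first few entries of a file like festival/clunits/all.desc look like: each dumped feature is paired either with float or with the set of values it may take. The value sets below are illustrative only and must be replaced by your own phoneset and feature values.

    ((occurid float)
     (p.name 0 pau a aa i ii k t n m)
     (n.name 0 pau a aa i ii k t n m)
     (segment_duration float)
     (seg_pitch float)
     (pos_in_syl float)
     (syl_initial 0 1)
     (syl_final 0 1)
     (R:SylStructure.parent.stress 0 1))

As noted above, if the feature list (or the set of values a feature can take) is changed, this descriptor file has to be updated to match.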
Format example silence segments without context of nothing other silence. deleting the distance table immediately after use. the list of unit names and their files and position in them. This removes the units in the cluster that are furthest from the center. This defines the maximum number of units that will be in a cluster at a tree leaf. When doing cascaded unit selection synthesizers its often not worth excluding large stages if there is say only one example of a particular demi−syllable. unittype prune threshold INT When making complex unit types this defines the minimal number of units of that type required before building a tree.e. To do this (acost:disttabs and clusters unittypes clunits params) Removing the calls to acost:build disttabs and acost:find clusters. thus you only need enough disk space for the largest number of phones in any type.wagon cluster size INT The minimum cluster size (the wagon −stop value). This is done by the lisp function (acost:collect trees (mapcar car unittypes) clunits params) (format t ”Saving unit catalogue\n”) (acost:save catalogue utterances clunits params) The only parameter that affect this is catalogue dir FILENAME 39 . i. cluster prune limit INT This is a post wagon build operation on the generated trees (and perhaps a more reliably method of pruning). We have used this in building diphones databases from general databases but making the selection features only include phonetic context features and then restrict the number of diphones we take by making this number 5 or so. For databases that have a tendency to contain non−optimal joins (probably any non−limited domain databases). limited domain). You could alternatively change the continuity weight to a number less that 1 which would also partially help. This helps reduce this. PCM waveforms is PSOLA is being used) sig ext FILENAME 40 . This option encourages far longer units. However such overflows are often a pointer to some other problem (poor distribution of phones in the db). This is experimental but has shown its worth and hence is recommended. in the same format as ac weights that are used in optimal coupling to find the best join point between two candidate units. These are join weights FLOATLIST This are a set of weights. extend selections INT If 1 then the selected cluster will be extended to include any unit from the cluster of the previous segments candidate units that has correct phone type (and isn’t already included in the current cluster). If the value is 2 this only checks the coupling distance at the given boundary (and doesn’t move it). This means that instead of selecting just units selection is effectively selecting the beginnings of multiple segment units. optimal coupling INT If 1 this uses optimal coupling and searches the cepstrum vectors at each join point to find the best possible join point. log scores 1 If specified the joins scores are converted to logs. Be default this is (catalogue dir ”festival/clunits/”) There are a number of parameters that are specified with a cluster voice.the directory where the catalogue will be save (the name parameter is used to name the file). The problem is that the sum of very large number can lead to overflow. but does give better results. and is certainly faster. this may be useful to stop failed synthesis of longer sentences. particularly increasing the F0 value (column 0). continuity weight FLOAT The factor to multiply the join cost over the target cost. 
This is different from ac weights as it is likely different values are desired. These are related to the run time aspects of the cluster model. this is often adequate in good databases (e. This is computationally expensive (as well as having to load in lots of cepstrum files). This is probably not very relevant given the the target cost is merely the position from the cluster center. so this is probably just a hack. sig dir FILENAME Directory containing waveforms of the units (or residuals if Residual LPC is being used. pm coeffs dir FILENAME The directory (from db dir where the pitchmarks are pm coeffs ext FILENAME The file extension for the pitchmark files.g. Also where each phone is coming from is printed. and modified lpc which uses the standard UniSyn module to modify the selected units to match the targets. Currently it supports simple. clunits debug 1/2 With a value of 1 some debugging information is printed during synthesis. particularly how many candidate phones are available at each stage (and any extended ones). With a value of 2 more debugging information is given include the above plus joining costs (which are very readable by humans). The other two possible values for this feature are none which does nothing. and windowed. where the ends of the units are windowed using a hamming window then overlapped (no prosodic modification takes place though). 41 . a very naive joining mechanism.File extension for waveforms/residuals join method METHOD Specify the method used for joining the selected units. Generate Utterance Structure festival −b festvox/build clunits.7 Building Festival Voice In the context of Indian languages. Label Automatically $FESTVOXDIR/src/ehmm/bin/do ehmm help run following steps individually: setup. VC. $mkdir iiit tel syllable $cd iiit tel syllable 2. Generate Pitch markers bin/make pm wave wav/*. (d) Call your pronunciation directory module from festvox/iiit tel syllable lexicon.data”)’ 4. and CCVC. (c) Remove special symbols from tokenizer. Unlike most other foreign languages in which the basic unit of writing system is an alphabet. Record prompts .name field.pm (b) After modigying pitch markers convert lable format to pitchmarkers . Creat voice setup $FESTVOXDIR/src/unitsel/setup clunits iiit tel syllable $FESTVOXDIR/src/prosody/ Before going to run build prompts do following steps.scm ’(build prompts ”etc/txt. Indian language scripts use syllable as the basic linguistic unit. A syllable could be represented as C*VC*. bw.wav 9.pm Tuning pitch markars (a) Convert pitch marks into label format . 3.lab 8./bin/prompt them etc/time. containing at least one vowel and zero./bin/make pmlab pm pm/*./bin/make pm pmlab pm lab/*. one or more consonants. 1.scm ’(build utts ”etc/txt. Correct the pitch markers bin/make pm fix pm/*.desc file under p. align 6.wav 7. The syllabic writing in Indic scripts is based on the phonetics of linguistic sounds and the syllabic model is generic to all Indian languages A syllable is typically of the following form: V.done. feats. CV. CCCV. Create a directory and enter into the directory.done. Following steps explains how to build a syllable based synthesis using FestVox. CCV. diphone.data”)’ 42 . syllable units are found to be a much better choice than units like phone. Generate Mel Cepstral coefficients bin/make mcep wav/*. Generate Prompts: festival −b festvox/build clunits. (a) modify your phoneset according to syllables as phonemes in the phoneset file (b) modify phoneme lable files as syllable lables. phseq. 
where C is consonant and V is Vowel.scm (e) The last modification is change the default phonemeset to your language unique syllables in festival/clunits/all. and half-phone.data 5. scm ’(build clunits ”etc/txt./bin/do build do dur 12.desc and add all the syllables in p.wav”) If you want to see selected units.save.wave (utt. festival festvox/iiit tel syllable clunits.name field. rum following command (set! utt (SayText ”your text”)) (clunits::units selected utt ”filename”) (utt.data”)’ 11.done.save.10. open bin/make dur model and remove −stepwise . Cluster the units festival −b festvox/build clunits.synth (Utterance Text ”your text”)) ”test. Open festival/clunits/all. Test the voice.wave utt ”filename” ”wav”) 43 .scm ’(voice iiit tel syllable clunits)’ To synthesize sentence: If you are building voice on local machine: (SayText ”your text”) If you are running voice on remote machine: (utt. • Phrase Markers − It is very hard to make sense out of something that is said without a pause. Usually a unit is picked based on what the succeeding unit is. A high value of duration pen weight means a unit very similar in duration to the required unit is picked. There is a chance that a long silence is inserted at the end of a phrase or an extremely short silence is inserted at the end of a phrase which sounds very inappropriate. The silence units were therefore quantified into 2 types. these phrase markers were identified and a silence was inserted each time one of these was encountered. When the tree is built. the way a particular unit is spoken depends a lot on the preceeding and succeeding unit i. it takes more time to synthesize speech as the time required to search for the appropriate unit is more. which is denoted by cluster size. Else. It is therefore important to have pauses at the end of phrases to make what is spoken. i. Hindi has certain units called phrase markers which usually mark the end of a phrase. The silence at the end of a phrase will be of a short duration while the silence at the end of a sentence will be of a long duration. • Duration Penalty Weight (duration pen weight) − While synthesizing speech. intelligible.8 Customizing festival for Indian Languages 8. We therefore limit the size of the branch of the tree by specifying the maximum number of nodes. The F0 is calculated by calculating the F0 at the center of the unit which would be approximately where the vowel lies. the chances of a silence of a wrong duration in the wrong place is a common problem that is faced. the silence at the end of a sentence. but there are units called morpheme tags which are found at the end of words which can be used to predict silences. • Inserting commas − Just picking phrase markers was not sufficient to make the speech prosodically rich. Commas were inserted in the text wherever a pause might have been there and the 44 . • Fundamental pitch penalty weight (F0 pen weight) − While listening to synthesized speech an abrupt change in pitch between units is not very pleasing to the ear. If the number of nodes for each branch of a tree is very large. SSIL. the silence at the end of a phrase and LSIL. the context in which a particular unit is spoken. the duration of each unit being picked is also of importance as units of different durations being clustered together would make very unpleasant listening. the cluster size is limited by putting the clustered set of units through a larger set of questions to limit the number of units being clustered as one type. 
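Putting the test commands above together, a typical session at the Festival prompt, after starting festival with the voice's scm file as shown in the last step, might look like the following (the voice and file names are the ones used in this example and should be replaced by your own):

    (voice_iiit_tel_syllable_clunits)                   ; select the newly built voice
    (SayText "your text")                               ; synthesize and play locally
    (set! utt (SayText "your text"))
    (clunits::units_selected utt "selected_units.txt")  ; dump the units chosen for this utterance
    (utt.save.wave (utt.synth (Utterance Text "your text"))
                   "test.wav")                          ; synthesize without playing and save a wav file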
which plays a major role in the F0 contour of the unit. • Morpheme tags − There are no phrase markers in tamil. We therefore try to select units which have similar values of F0 to avoid fluctuations in the F0 contour of the synthesized speech. For the purpose of inserting silences at the end of phrases. The voice was built using these tags to predict phrase end silences while synthesizing speech. • ac left context − In speech. This ac left context specifies the importance given to picking a unit based on what the preceeding unit was.1 Some of the parameters that were customized to deal with Indian languages in festival framework are : • Cluster Size − It is one of the parameters to be adjusted while building a tree. The duration pen weight parameter specifies how much importance should be given to the duration of the unit when the synthesizer is trying to pick units for synthesis.e. not much importance is given to duration and importance is given to other features of the unit.e. The F0 pen weight parameter specifies how much importance is given to F0 while selecting a unit for synthesis. • Handling silences − Since there are a large number of silences in the database. data and also add the corresponding wav and lab files in the respective folder ( text 0998 ”LSIL” ) ( text 0999 ”SSIL” ) (text 0000 mono) 2.012 −def 0.scm file =>GoTo Line No:69 (i.e) ’(wagon cluster size 20) change the value 20 to 7 =>GoTo Line No:89 (i.01 −wave end −lx lf 140 −lx lo 111 −lx hf 80 −lx ho 51 −med o 0’ COMMENT THE ABOVE LINE AND ADD THE FOLLOWING LINE IN THE FILE PM ARGS=’−min 0. Do the following Modification in ”make pm wave” file PM ARGS=’−min 0.003 −max 0.8 to 0.done.2 Modifications in source code 1. Open /festvox/voicefoldername clunits. Handling SIL − For small system this issue is not need to be handled but system with large database multiple occurrence of SIL creates problem. • Geminates − In Indian languages it is very important to preserve the intra-word pause while speaking.7 −def 0. and care has been taken to preserve these intra−word pauses during synthesis.0057 −max 0. • Duration Modeling − Was done so as to include the duration of the unit to be used as a feature while building the tree and also as a feature to narrow down the the size of the number of units selected while picking units for synthesis.scm file =>GoTo Line No:136 ’(optimal coupling 1) change the value 1 to 2 5. Open /festvox/build clunits.e) ’(cluster prune limit 40) change the value 40 to 10 4.1 =>GoTo Line No:87 (i. as the word spoken without the intra-word pause would have a completely different meaning.tree was built using these commas so that the location of these commas could be predicted as pauses while synthesizing speech. Prosody modeling was done to make the synthesized speech more expressive so that it will more usable for the visually challenged persons.01 −wave end −lx lf 340 −lx lo 91 −lx hf 140 −lx ho 51 −med o 0’ 3. 8. • Prosody Modeling − This was achieved by phrase markers and by inserting commas in the text. Add the below 3 lines in the txt. These intra−word pauses are called geminates.e) ’(ac left context 0. To Solve the issue do the following step =>GoTo line No:161 the line starts with (define (VOICE FOLDER NAME::clunit name i) Replace the entire function with the following code 45 .8) change the value 0. Inside the bin folder. stress”) .”” . (item. ((string-equal name ”SIL”) . (string-append . It changes the basic classification of unit for clustering.feat i ”p..feat i ”p. 
The can be modified.name”) (item.”” . Comment out this if you want a more interesting unit name ((null nil) name) . By default we just use the phone name.”” 46 ..name”) )) .name”)) (string−equal ”h#” (item.”” .” (let ((name (item. but we may want to make this present phone plus previous phone (or something else).name”))) (string−equal ”pau” (item.feat i ”ignore”)) (and (string−equal ”pau” name) (or (string−equal ”pau” (item.feat i ”n..feat i ”R:SylStructure. (string-append .name”))))) ”ignore”) ((string−equal name ”SIL”) . name . (iiit tel lenina::nextvoicing i))) .feat i ”ph vc”)) . name .(t . (VOICE FOLDER NAME::nextvoicing i))) . Comment out the above if you want to use these rules . name) .p. Comment out this if you want a more interesting unit name .feat i ”p. (set! pau count (+ pau count 1)) (string−append name ”” (item. (string-append .(define (VOICE FOLDER NAME::clunit name i) ”(VOICE FOLDER NAME::clunit name i) Defines the unit name for unit selection for tam.parent.name i))) (cond ((and (not iitm tam aarthi::clunits loaded) (or (string−equal ”h#” name) (string−equal ”1” (item.((string−equal ”+” (item.feat i ”p.((null nil) .name . R:Token.silences ’(SSIL)) 10.. The create phoneset.feat i ”seg onsetcoda”) . After that the starting and ending consonants of the syllable are checked and and depending on the place of articulation of the consonants a particular value is assigned to that field.feat word ”p.”” . In VoiceFolderName Phoneset. Depending on the type of vowel and the type of beginning and end consonants we can now assign a value to the type of vowel field as well.prev word)))) .name field 8.pl The Phoneset.parent. place of articulation of c2 labial alveolar palatal labio−dental dental velar 47 .” (item.parent.scm file.prev word)))))) 7.feat word ”p. GoTo festival/clunits/ folder ===>Replace the all.. For every syllable the create phoneset.. . Generate phoneset units along with features to include in phoneset. (item.scm file contains a list of all units along with their phonetic features.. The fields for manner of articulation are kept as zero.R:Token.pl script checks first if the vowel present in the syllable is a short vowel or a long vowel. In the VoiceFolderName phoneset.pl script first takes every syllable and breaks it down into smaller units and dumps their phonetic features into the Phoneset. 9.. we have to change the phoneset definitions.silences ’(SIL)) Uncomment the following line during TESTING (PhoneSet. Depending on whether it is a short vowel or a long vowel a particular value is assigned to that field.scm file.scm file Uncomment the following line during TRAINING (PhoneSet. syllable type − v vc/vcc cv/ccv cvc/cvcc (syll type 1 2 3 4 0) .. end of a sentence ((string−equal ”...punc”)) (+ 1 (phrase snumber (item.end of a phrase (t (+ 0 (phrase number (item. vowel or consonant (vlng 1 0) . then go to line number 309 and add the following code (define (phrase number word) ”(phrase number word) phrase number of a given word in a sentence.desc file and copy the syllables and phones to both p. Replace the defPhoneSet function with the following code. (iiit tel lenina::nextvoicing i))) ))) 6. beginning or utterance ((string−equal ”. place of articulation of c1 (poa c1 1 2 3 4 5 6 7 0) .punc”)) 0) .scm file by running create phoneset languageName.name and n. (.” (item. full vowel (fv 1 0) ..” (cond ((null word) 0) . manner of articulation of c1 (moa c1 + − 0) . 
(format t ”%l n” word) (load (path-append VoiceFolderName::dir ”wordpronunciation”)) (list word a wordstruct))) ) During Training process uncomment il parser−train.(poa c2 1 2 3 4 5 6 7 0) . remove (text 0000 ”mono”) & (text 0000-2 ”phone”) from txt.pl (a) file containing unique clusters eg my $file=”./parser.sh”) ”w”)) . Creating pronunciation dictionary perl test. (b) Create pronunciation dictionary: my $oF = ”pronunciationdict artistName”.sh” ”w”))(format myfilepointer ”%s” ”mono”)(fclose myfilepointer)) ((string-equal word ”phone”)(set! myfilepointer (fopen ”unit size. the final step.. (print ”called”) (system ”chmod +x parser.sh”) . during testing uncomment il parser−test.. manner of articulation of c2 (moa c2 + − 0) ) 11.data (if exists) 12.sh”) (system ”. 48 .pl..done./unique clusters artistName”.e. (format myfilepointer ”perl %s %s %s” (path−append VoiceFolderName::dir ”bin/il parsertrain.pl 13..sh” ”w”))(format myfilepointer ”%s” ”phone”)(fclose myfilepointer)) (t (set! myfilepointer (fopen (path−append VoiceFolderName::dir ”parser.scm file(Calling parser in lexicon file) Goto line number 137 and add the following code in Hand written letter to sound rules section (define (iitm tam lts function word features) ”(iitm hin lts function WORD FEATURES) Return pronunciation of word not in lexicon.pl <inputfile in utf−8 format> Files name to be edited in il parser pronun dict.pl”) word VoiceFolderName::dir) (fclose myfilepointer) . Go to VoiceFolderName lexicon.pl”) word VoiceFolderName::dir) (format myfilepointer ”perl %s %s %s” (path-append VoiceFolderName::dir ”bin/il parser-test. When running clunits i.” (cond ((string−equal ”LSIL” word )(set! wordstruct ’( ((”LSIL”) 0) ))(list word nil wordstruct)) ((string−equal ”SSIL” word )(set! wordstruct ’( ((”SSIL”) 0) ))(list word nil wordstruct)) ((string−equal ”mono” word )(set! myfilepointer (fopen ”unit size. out and put it into festvox directory 14. we have include seperate scm file in the festvox folder called tokentowords. Eg : for i/p word Ram ===> R A M will be the output in wordpronunciation 15. Date and Time. we use the preprocessing3. if it is not in the Pronunciation dictionary. it will be sent to the parser and the parser will send the word to preprocessing3.(c) rename the created pronunciation dictionary to instituteName language lex. Abbreviations. Handling Numbers. When an english word is encountered. To handle English words.pl which will generate wordpronunciation by splitting the word into individual alphabets.scm 49 .out Add ” MNCL ”(without quote) at the first line of instituteName language lex.pl perl script. 5 /lib/libcurses.6 /lib/libstdc++. 
9 Trouble Shooting in festival

9.1 Troubleshooting (Issues related with festival)

Some errors, with their solutions, that may come up during the installation/building process:

• Error: /usr/bin/ld: cannot find -lcurses
  Solution: sudo ln -s /lib/libncurses.so.5 /lib/libcurses.so

• Error: /usr/bin/ld: cannot find -lncurses
  Solution: apt-get install libncurses5-dev

• Error: /usr/bin/ld: cannot find -lstdc++
  Solution: sudo ln -s /usr/lib/libstdc++.so.6 /lib/libstdc++.so

• Error: gcc: error trying to exec 'cc1plus': execvp: No such file or directory
  Solution: sudo apt-get install g++

• Error: ln -s festival/bin/festival /usr/bin/festival
  ln: accessing '/usr/bin/festival': Too many levels of symbolic links
  Solution: sudo mv /usr/bin/festival /usr/bin/festival.orig
  ln -s /home/boss/festival/festival/src/main/festival /usr/bin/festival

9.2 Troubleshooting (Issues that might occur while synthesizing)

Error: Linux: can't open /dev/dsp
Solution: Go to your home directory and open the .festivalrc (if it is not there, just create it):
  $ cd
  $ sudo gedit .festivalrc
Add the following lines to this file and save:
  (Parameter.set 'Audio_Command "aplay -q -c 1 -t raw -f s16 -r $SR $FILE")
  (Parameter.set 'Audio_Method 'Audio_Command)
Cursor should be placed in front of the sentence to be read using keyboard arrow keys. 52 .py in any orca related folders. Start festival and orca again 12. Usually there are more than one. 13. Search for the phrase timeoutTime and change its value to 30. type locate settings. 8.mak (if error comes file not found) Note: copy the new module (il parser) to your festival/src/modules/ folder before compiling speech tools and festival. speech tools/stats/EST DProbDist.h must have the following changes #include <iostream> should be added at line 38 before using namespace std.mak to \speech tools\config\systems\ix86 unknown. speech tools/include/EST TrackMap. must be replaced by if (((tdir=getenv(”TMPDIR”)) == NULL) && ((tdir=getenv(”TEMP”)) == NULL) && ((tdir=getenv(”TMP”)) == NULL)) tdir = ”/tmp”.h” 11. Rename \speech tools\config\systems\ix86 CYGWIN1. Copy the Makefile provided to festival/src/modules/ folder Follow the steps mentioned in http://www. and l = (long long)c.5. on line 62 on line must be changed to long l. 6. speech tools/include/EST. 9.cc must have the following changes #include ”EST Math. speech tools/stats/wagon/wagon aux.net/doc build win festival.11 11.cc must have the following changes long long l.h must have the following changes #include <iostream> should be added at line 44 before using namespace std.h must have the following changes #include ¡iostream¿ should be added at line 54 after #include <cfloat> 7. 12. IMPORTANT: The following changes needs to be made only if errors are thrown for these files while compiling festival following the steps in the link given above. Install cygwin 4. The service pack for the visual studio 2008 must be installed 3. speech tools/include/EST Token.h must have the following changes #include <iostream> should be added at line 45 before using namespace std. 10.c must have the following changes if (((tdir=getenv(”TMPDIR”)) == NULL) k ((tdir=getenv(”TEMP”)) == NULL) k ((tdir=getenv(”TMP”)) == NULL)) tdir = ”/tmp”.php There are more changes we made apart from that mentioned in the web page. 5. speech tools/utils/EST cutils. which are mentioned below.h”should be added at line 47 after #include EST Wagon.1 NVDA Windows Screen Reader Compiling Festival in Windows : 1. speech tools/include/EST math. on line 66 must be changed to l = (long )c.0) standard edition must be successfully installed 2. 53 . speech tools/include/EST TKVL. Visual Studio 2008 (vc 9.h must have the following changes #include <iostream> should be added at line 43 before using namespace std.eguidedog. TCHI LAST }. festival/src/main/festival client. SYL STRESS. Change in EST FlatTargetCost. NBAD DUR. WORD.cc and in festival/src/modules/MultiSyn/EST FlatT all references to the variable WORD and PWORD must be replaced by WORD1 and PWORD1. NPOS. POS. POS. SIL. BAD DUR.cc” Instantiate KVL T(EST String. NSYL. SYL. SYL. NWORD. BAD F0 . SYLPOS. BAD OOL. SIL./base class/EST TSortable. EST String ST entry) #if defined(INSTANTIATE TEMPLATES) #include ”.h enum tcdata t { VOWEL. PBREAK. SYL STRESS. NSYL. PSYL. NWORD.. BAD F0 . PBAD DUR. WORD and PWORD are keywords in VC++. 16. PSYL. NPUNC. N VOWEL. NPUNC. N SIL.13. 15.cc must have the following changes #include <iostream> should be added at line 42 before using namespace std. PUNC. NNSYL. WORDPOS. NNWORD.. LC.cc must have the following changes The following code must be moved from the end of the file to line 52 after #include ”EST ServiceTable. N VOWEL. PUNC.cc In function TCData *EST FlatTargetCost::flatpack(EST Item *seg) const 54 . 
NNBAD DUR. SYL STRESS. WORD1. NBAD OOL.cc must have the following changes #include <iostream> should be added at line 42 before using namespace std. festival/src/main/festival main. RC. NNWORD. NSYL STRESS. PWORD. PWORD1. Must be changed to enum tcdata t { VOWEL. NBAD OOL. BAD DUR. RC. NNSYL. speech tools/utils/EST ServiceTable. So the variable names have to be changed. Change in EST FlatTargetCost.cc” #include ”. NNBAD DUR. PBAD DUR..cc” #include ”. PBREAK. EST String ST entry) #endif 14. SYLPOS. festival/src/modules/MultiSyn/EST FlatTargetCost. EST ServiceTable::Entry. LC. EST ServiceTable::Entry. N SIL./base class/EST TKVL. WORDPOS. BAD OOL.h” Declare KVL T(EST String. NBAD DUR./base class/EST TList. TCHI LAST }. NPOS. // inter else if( f->a no check(WORD1)!= f->a no check(PWORD1) ) 55 . else (*f)[WORD]=0. // final Must be replaced by // segs wordpos (*f)[WORDPOS]=0. // medial if( f->a no check(WORD1)!= f->a no check(NWORD) ) (*f)[WORDPOS]=1. Must be replaced by // Prev seg word feature if(seg->prev() && (word=tc get word(seg->prev()))) (*f)[PWORD1]=simple id(word->S(”id”)). // inter else if( f->a no check(WORD)!= f->a no check(PWORD) ) (*f)[WORDPOS]=2. must be replaced by // seg word feature if(word=tc get word(seg)) (*f)[WORD1]=simple id(word->S(”id”)). // medial if( f->a no check(WORD)!= f->a no check(NWORD) ) (*f)[WORDPOS]=1. In function TCData *EST FlatTargetCost::flatpack(EST Item *seg) const // Prev seg word feature if(seg->prev() && (word=tc get word(seg->prev()))) (*f)[PWORD]=simple id(word->S(”id”)). // initial else if( f->a no check(NWORD) != f->a no check(NNWORD) ) (*f)[WORDPOS]=3. else (*f)[WORD1]=0. else (*f)[PWORD1]=0.// seg word feature if(word=tc get word(seg)) (*f)[WORD]=simple id(word->S(”id”)). else (*f)[PWORD]=0. In function TCData *EST FlatTargetCost::flatpack(EST Item *seg) const // segs wordpos (*f)[WORDPOS]=0. // final In function float EST FlatTargetCost::position in phrase cost() const if ( !t->a no check(WORD) && !c->a no check(WORD) ) return 0. if(!t->a no check(WORD) k !c->a no check(WORD)) return 1. if ( !t->a no check(WORD1) k !c->a no check(WORD1) ) return 1. else if (t->a no check(WORD) && c->a no check(WORD)) must be replaced by if ( (t->a no check(WORD1) && !c->a no check(WORD1)) k (!t->a no check(WORD1) && c->a no check(WORD1)) ) score += 0.5.5. else if (t->a no check(WORD1) && c->a no check(WORD1)) In function float EST FlatTargetCost::partofspeech cost() const // Compare left phone half of diphone if(!t->a no check(WORD) && !c->a no check(WORD)) return 0. must be replaced by 56 . if ( !t->a no check(WORD) k !c->a no check(WORD) ) return 1. must be replaced by if ( !t->a no check(WORD1) && !c->a no check(WORD1) ) return 0.(*f)[WORDPOS]=2. In function float EST FlatTargetCost::punctuation cost() const if ( (t->a no check(WORD) && !c->a no check(WORD)) k (!t->a no check(WORD) && c->a no check(WORD)) ) score += 0. // initial else if( f->a no check(NWORD) != f->a no check(NNWORD) ) (*f)[WORDPOS]=3. return 0xff. static const unsigned char maxVal = 0xff. static const unsigned char defVal = 0xff. else if( b > a ) return cache[(b*(b-1)>>1)+a]. festival/src/modules/MultiSyn/EST JoinCostCache. Must be replaced by if( a == b ) return 0x0.h must have the following changes . else if( cost <= llimit ) qcost = minVal. all maxVal by 0xff and all defVal by 0xff Following are the changes comment out // #include <iostream> if( a == b ) return minVal. unsigned int qleveln = maxVal-minVal. 
festival/src/modules/MultiSyn/EST JoinCostCache.// Compare left phone half of diphone if(!t->a no check(WORD1) && !c->a no check(WORD1)) return 0. replace all minVal by 0x0. else if( b > a ) return cache[(b*(b-1)>>1)+a]. comment out the following portion of the code static const unsigned char minVal = 0x0. else return cache[(a*(a-1)>>1)+b]. 18. else return cache[(a*(a-1)>>1)+b]. if(!t->a no check(WORD1) k !c->a no check(WORD1)) return 1. if( cost >= ulimit ) qcost = maxVal. 17. 57 . return defVal. must be replaced by unsigned int qleveln = 0xff-0x0.cc. val(”name”).String() ) k ph is approximant( cand->next()->features().String() ) k ph is liquid( cand->next()->features().val(”name”).cc must have the following changes . else if( cost <= llimit ) qcost = 0x0.val(”name”).String() ).val(”name”).val(”name”).String() ) k ph is nasal( cand left->features(). 21. #include <iostream> should be added at line 48 before using namespace std. replace by if( ph is vowel( cand->next()->features(). separately in the following functions and remove the declarations from for loops 58 . 20.val(”name”).String() ) k ph is nasal( cand->next()->features(). 19.replace by if( cost >= ulimit ) qcost = 0xff.cc must have the following changes .val(”name”). if( ph is vowel( left phone ) k ph is approximant( left phone ) k ph is liquid( left phone ) k ph is nasal( left phone ) ) Replace by if( ph is vowel( cand left->features().String() ) ) if( ph is vowel( right phone ) k ph k ph k ph fv = is approximant( right phone ) is liquid( right phone ) is nasal( right phone ) ) fvector( cand right->f(”midcoef”) ). festival/src/modules/UniSyn/us mapping.String() ) ) fv = fvector( cand->next()>f(”midcoef”) ). comment out the following code const EST String &left phone( cand left->features(). const EST String &right phone( cand right->features().String() ) k ph is liquid( cand left->features().val(”name”).String() ) k ph is approximant( cand left->features().String() ). declare int i. festival/src/modules/Text/token.val(”name”).cc must have the following changes . festival/src/modules/MultiSyn/EST TargetCost.val(”name”). EST Track &pm. int wav srate ) (c) void make join interpolate mapping( const EST Track &source pm.(a) void make linear mapping(EST Track &pm. const EST Relation &units.EST Track &target pm. float default F0 . i<fz len. festival/src/modules/UniSyn/us unit. EST IVector *spaces. float s last time. float target end) remove the declaration of i in the for loop for( int i=0. EST IVector &map ) void make join interpolate mapping2( const EST Track &source pm. int num channels. declare int i. float stretch. int start pm.cc must have the following changes .cc must have the following changes . int end pm. festival/src/modules/UniSyn/us prosody. float t last time) declare int i. i++ ) In function void stretch F0 time(EST Track &F0 . EST IVector &map) (b) static void pitchmarksToSpaces( const EST Track &pm. EST IVector &map ) 22. separately and remove the declarations from for loops 23. EST Track &target pm. In function void F0 to pitchmarks(EST Track &fz. separately in the following functions and remove the declarations from for loops (a) static EST Track* us pitch period energy contour( const EST WaveVector &pp.const EST Track &pm ) (b) void us linear smooth amplitude( EST Utterance *utt ) 59 . const EST Relation &units. 12 SAPI compatibility for festival voice 1. Install festival and speech tools in windows Please check the chapter for compiling festival in windows. 
Replace the SampleTTSEngine folder { C:\ProgramFiles\MicrosoftSDKs\Windows\v6. Suppose we are installing festival in say D:\fest install\festival and speech tools in D:\fest install\speech tools This new module has to be kept in D :\fest install\festival\src\ modules\il parser 2. Two environmental variables have to be created FESTLIBDIR D:\festival\festival\lib This should point to where your voice is kept. Install Microsoft SDK from the link http://www.dll. The parser module has to be written in C . The name and the language code should be changed for respective languages 7. and the voice should be kept in this lib folder under voices\hindi folder voice path D:\festival\festival\lib\voices\hindi\iitm hin anjana clunits\ This will point to the voice folder 8. 4. D:\fest install\festival\lib\voices\hi 3. compile the SampleTTSEngine solution in release mode. It will generate SampleTtsEngine. These files are to be replaced at the place where we have festival and speech tools installed.aspx?id=11310 5.microsoft. Say suppose the voice is kept in the following folder.cpp has the details of our voice. include path of festival and speech tools must point to the correct path.cc -> D:\fest install\speech tools\speech class config. Install the voice .cc -> D:\fest install\festival\src\arch\festival festival main. lib folder should be there with all the scm files.ini file will be accessed by sapi code. In this SampleTTSEngine solution a file called register−vox. (voice iitm hin anjana clunits) This file has the command to set a voice Now you need to compile festival as per the steps given in chapter Compile festival in windows. Check the properties of SampleTTSEngine solution.cc -> D:\fest install\festival\src\main clunits. this module has to be kept in the src/modules folder.cc -> D:\fest install\festival\src\modules\clunits EST wave utils.1\Samples\winui\ } with the code provided by IITM 6. festival. The libraries. Some festival files has been changed . These libraries will be build when we compile festival and speech tools (point 1) 9. The new module (il parser) has to be plugged into festival.ini copy it to voice folder (hindi\iitm hin anjana clunits) config. 60 .com/download/en/details. Before compiling festival. Test with sample TTS application.10. 11. 61 . If it works in these applications now try in NVDA or JAWS. 12. Check if an entry is there in registry (HKEY LOCAL MACHINE -> software -> Microsoft -> speech -> voices -> Tokens -> ) An entry for our voice will be there. (Control Panel -> Speech -> Text to speech ) or with TTSAPP.exe that comes with the SDK. The header file of a wav file is shown in a table at the end of the document 62 . The sphere files can either be encoded in wavpack or shorten encoding or kept in the same format as of the input speech file.1 Extraction of details from header of the input file The input file can be wav file. either pcm or mu law encoded.13 Sphere Converter Tool The tool was developed to convert all the speech files in different format to a standard sphere format. raw or mulaw format. NIST 1A 1024 location id s13 TTS IITMadras database id s22 Sujatha 20 RadioJockey utterance id s9 Suj trial sample sig bits i 16 channel count i 1 sample n bytes i 2 sample rate i 16000 sample count i 46563 sample coding s3 pcm sample byte format s2 01 sample min i 16387 sample max i 23904 end head 13. The header is an object oriented. ASCII structure which is prepended to the waveform data. there will be a header which will have all the details of the speech file. 
wave or raw)is to be converted to a sphere file (either encoded in wavpack. SPHERE files contain a strictly defined header portion followed by the file body (waveform). 1024byte blocked. The speech files can be of wav . The header is composed of a fixed format portion followed by an object oriented variable portion. In the sphere format. First 4 fields are user defined fields taken from config file. The fixed portion is as follows: NIST 1A<newline> 1024<newline> The remaining object oriented variable portion is composed of <object> <type> <value> Below is a sample sphere header that this module is generating. The input file(either mu law. shorten or no encoding) with a sphere header. RAW files are headerless audio files.If the input file is a mulaw encoded file the AudioFormat field in the format chunk of the header will have value = 7and the FACT chunk will be present in the header. The output sphere files can be played in the utility wavesurfur.The objective of this module is to find the maximum sample value and minimum sample value among the sample data present in the input file.1. The sample count is calculated by counting the number of samples read while calculating the sample minimum and maximum values.sph extension and the sphere header can be verified by opening the file.2 RAW Files . Bytes per sample =( bits per sample ) /8 The sample count = (No.1. The sample rate. The total number of data bytes is obtained from the cksize (second field) in data chunk.g ghex2) to verify the header fields and size of file. If the sample data is in little endian format.1. of data bytes) / (bytes per sample) In the sphere package the byte format of data is stored in field SAMPLE BYTE FORMAT. 13. for the program to read the file successfully. Each sample is read from the data part of the file and calculated which is the maximum value and minimum value.3 MULAW Files .The necessary information from the header of the input file is extracted. If the fact chunk is present in the header the sample count is obtained from the header otherwise it is calculated as follows. The bits per sample is obtained from the field in format chunk.1 Calculate sample minimum and maximum values .1. The sphere files have .The data in the output sphere file can be Shorten compressed byte stream or Wavpack compressed byte stream or the data as is present in the input file. channel count and data encoding must be given by the user in the config file. 13. The file can be opened in a hex editor (e. 13. 13.4 Output in encoded format . 13.The user defined fields to be added to header can be kept in this file and it is to be placed at the location were the executables are placed. this field is given the value = 01 . 63 . sample size.2 Configfile . if the data is in bigendian the value is 10 and if the samples are single byte the value is 1. go to folder ’nist’ ( cd nist ) and install nist as follows sh src/scripts/install.[12] (b) : Sun Solaris (c) : Next OS (d) : Dec OSF/1 (with gcc) (e) : Dec OSF/1 (with cc) (f) : SGI IRIX (g) : HP Unix (with gcc) (h) : HP Unix (with cc) (i) : IBM AIX (j) : Custom Please Choose one: 10 What is/are the Compiler Command ? [cc] cc OK. by following #ifdef NARCH linux #include <errno.tar.sh (a) : Sun OS4. Is this OK? [yes] yes What is/are the Compiler Flags ? [−g] −g −c OK. The Architecture command is ’linux’. untar sphere 2.c ( nist/src/lib/sp) replace extern char *sys errlist[].Z (use tar −xvzf or zcat sphere 2.1. change the file exit. 
14 Sphere Converter User Manual

14.1 How to Install the Sphere converter tool

1. Untar sphere_2.6a.tar.Z (use tar -xvzf, or zcat sphere_2.6a.tar.Z | tar -xvf -):
   tar -xvzf sphere_2.6a.tar.Z
2. A folder by name 'nist' will be created.
3. Change the file exit.c (nist/src/lib/sp): replace
   extern char *sys_errlist[];
   by the following
   #ifdef NARCH_linux
   #include <errno.h>
   #else
   extern char *sys_errlist[];
   #endif
4. Go to folder 'nist' (cd nist) and install nist as follows:
   sh src/scripts/install.sh
   (a) : Sun OS4.[12]
   (b) : Sun Solaris
   (c) : Next OS
   (d) : Dec OSF/1 (with gcc)
   (e) : Dec OSF/1 (with cc)
   (f) : SGI IRIX
   (g) : HP Unix (with gcc)
   (h) : HP Unix (with cc)
   (i) : IBM AIX
   (j) : Custom
   Please Choose one: 10
   What is/are the Compiler Command ? [cc] cc
   OK. The Compiler Command command is 'cc'. Is this OK? [yes] yes
   What is/are the Compiler Flags ? [-g] -g -c
   OK. The Compiler Flags command is '-g -c'. Is this OK? [yes] yes
   What is/are the Install Command ? [install -s -m 755] install -s -m 755
   What is/are the Archive Sorting Command ? [ranlib]
   What is/are the Archive Update Command ? [ar ru]
   What is/are the Architecture ? [SUN] linux
   OK. The Architecture command is 'linux'. Is this OK? [yes] yes
• Click on ’Edit properties’ to enter the details that would be stored in the sphere header. 14. • If properties are not edited the default properties stored in the configfile(present in nist/bin folder) will be used. • language : Its a Mandatory field. Preferably this field can hold the name of the location/institution at which the conversion is taking place. It can be browsed using the ’Open’ Button 2. In the sphere header the value of this field will be appended with the name of the file.1. For wav files this value will be retrieved from the wav header. If the user is satisfield with the header ’Ok’ button can be clicked. User can enter any value of string format. The user can enter the number of bytes in a sample in the file. else ’Cancel’ button can be clicked and user can go back and make changes in the properties. • On clicking the help button on the right top corner the user manual will be opened which can be referred for any issues while using the tool. Maximum allowed length of the entered text is 100 characters. • After successful convertion the sphere header created by the tool will be displayed. • sample n bytes : Its a Mandatory field for raw files only. Maximum allowed length of the entered text is 100 characters. • database id : Its a Mandatory field. Preferably this field can hold the details of project/database/speaker. • Select the type of encoding for the output sphere file. • If any field entered in the properties were wrong or if the file was not successfully converted the appropriate message will be displayed. If a non integer value is entered the tool will point out to the user to enter integer value. Maximum allowed length of the entered text is 50 characters.3 Fields in Properties • location id : Its a Mandatory field. User can enter any value of string format. Preferably this field can hold the name of the speaker. • If the file is successfully converted the message File was succesfully converted will be displayed. • Click ’Convert’ button to convert the a single file or ’Bulk Convert’ button to convert a set of files. If no encoding is selected for the output file. (meaning duration of file is less than or equal to zero) an error will be thrown informing the user that the value entered for this field is wrong • sample byte format : Its a Mandatory field for raw files only. this field can take only the value 1. For wav files this value will be retrieved from the wav header. Or cancel button can be clicked. • If you click ’Cancel’ button. The field name. • sample coding : The value for this field is calculated by the tool. User can enter the number of significant bits in a sample. If a value other than this is entered. mulaw or raw. • User entered fields can be deleted. The input file encodings can be pcm. User can check off the check box on the right of each fields entered by the user and delete button can be pressed. • channel count : This tool deals with only channel count = 1. It is the total number of samples in the file. 14. Stereo = 2 • sample count : The value for this field is calculated by the tool. seperated by a comma. It is minimum sample value ( amplitude of the sample with minimum value) present in the file • User can add more fields using the ’Add’ button. Else the user will be informed that the value entered for this field is wrong. It can be either 01 for little endian. 10 for big endian and 1 for single byte. For wav files this value will be retrieved from the wav header. Click Ok button after entering details of the new field. 
14.4 Screenshot

Here is a screen shot of the sphere converter tool.

14.5 Example of data in the Config file (default properties)

WAV:
location_id STRING IIT Madras, chennai
database_id STRING Sujatha_20_RadioJockey
utterance_id STRING Suj
language STRING tamil
sample_sig_bits INTEGER 16
$$

RAW:
location_id STRING IIT Madras, chennai
database_id STRING Sujatha_20_RadioJockey
utterance_id STRING Suj
language STRING hindi
sample_n_bytes INTEGER 2
sample_sig_bits INTEGER 16
channel_count INTEGER 1
sample_rate INTEGER 16000
sample_byte_format STRING 01
$$

(A small sketch showing how a block of this configfile can be read is given after the limitations below.)

14.6 Limitations of the tool

• The tool allows only files of a single channel.
• sample_n_bytes can take only the values 1 and 2.
• The maximum size of the sphere header is 1024 bytes; if the user enters more data, this error is thrown.
• For raw files the correct values for sample_n_bytes, sample_rate and sample_byte_format should be given to get the correct values for sample_max and sample_min.
• For bulk conversion, if the input folder has sub folders or different types (pcm, mulaw or raw) of input files, the tool will report that all files were not successfully converted, although conversion will still happen successfully for the type of file selected in the gui. It is advised to have only files of one type in the input folder, without any subfolders.
• For bulk conversion, if both pcm and mulaw files are present in the input folder, they will use the same properties; all files in the folder may not be of the same format.
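As a rough illustration of the configfile format shown in Section 14.5 ('name TYPE value' lines closed by a '$$' line, with the 'WAV:' and 'RAW:' labels assumed to appear literally in the file), here is a small C sketch that reads one block; the function name and behaviour are illustrative and are not the tool's actual implementation.

    /* Illustrative sketch: print every default property of the block
     * labelled `section` ("WAV:" or "RAW:") in the given configfile. */
    #include <stdio.h>
    #include <string.h>

    void print_default_properties(const char *path, const char *section)
    {
        FILE *fp = fopen(path, "r");
        char line[256];
        int in_section = 0;

        if (fp == NULL) {
            perror(path);
            return;
        }
        while (fgets(line, sizeof line, fp) != NULL) {
            line[strcspn(line, "\r\n")] = '\0';   /* strip the newline   */
            if (strcmp(line, section) == 0) {     /* e.g. "RAW:"         */
                in_section = 1;
                continue;
            }
            if (!in_section)
                continue;
            if (strcmp(line, "$$") == 0)          /* end of the block    */
                break;
            {
                char name[64], type[16], value[128];
                /* each line: field name, data type, value */
                if (sscanf(line, "%63s %15s %127[^\n]", name, type, value) == 3)
                    printf("%s (%s) = %s\n", name, type, value);
            }
        }
        fclose(fp);
    }

For example, print_default_properties("configfile", "RAW:") would list the default raw-file properties shown above.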
For reference, the fields of a WAV file header are summarised below (Field, Length in bytes, Content).

RIFF/RIFX chunk
• ChunkID (4): Contains the letters "RIFF" in ASCII form for little endian files; for big endian files it is "RIFX".
• ChunkSize (4): The size of the rest of the chunk following this number, i.e. the size of the entire file in bytes minus 8 bytes for the two fields not included in this count: ChunkID and ChunkSize.
• WaveID (4): Contains the letters "WAVE".

FORMAT chunk
• Subchunk1ID (4): Contains the letters "fmt ".
• Subchunk1Size (4): The size of the rest of the Subchunk which follows this number: 16 for PCM. The value can be 16, 18 or 40.
• AudioFormat (2): PCM = 1 (i.e. linear quantization); values other than 1 indicate some form of compression. A mu-law file has the value 7.
• NumChannels (2): Number of interleaved channels. Mono = 1, Stereo = 2.
• SampleRate (4): Sampling rate (blocks per second): 8000, 16000, 44100, etc.
• ByteRate (4): Data rate. AvgBytesPerSec = SampleRate * NumChannels * BitsPerSample/8.
• BlockAlign (2): Data block size in bytes, the number of bytes for one sample including all channels: BlockAlign = NumChannels * BitsPerSample/8.
• BitsPerSample (2): 8 bits = 8, 16 bits = 16, etc.

Optional portion of the FORMAT chunk
• cbSize (2): Size of the extension (0 or 22). This field is present only if Subchunk1Size is 18 or 40.
• ValidBitsPerSample (2): Number of valid bits.
• ChannelMask (4): Speaker position mask.
• SubFormat (16): GUID, including the data format code.

FACT chunk (all compressed, non-PCM formats must have a fact chunk)
• ckID (4): Chunk ID: "fact".
• cksize (4): Chunk size: minimum 4.
• SampleLength (4): Number of samples (per channel).

DATA chunk
• ckID (4): Contains the letters "data".
• cksize (4): The number of bytes in the data.
• sampled data (n): The actual sound data.
• pad byte (0 or 1): Padding byte, present if n is odd.
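The fixed 44-byte header of a canonical PCM WAV file (RIFF chunk, 16-byte fmt chunk, and the data chunk header) can be written down as a C struct. The sketch below is illustrative only and is not taken from the converter's sources; real files may also carry the optional fmt extension and a fact chunk, so robust code should walk the chunks rather than assume this exact layout.

    /* Minimal sketch of a canonical PCM WAV header, matching the
     * RIFF/fmt/data layout in the table above. */
    #include <stdint.h>

    #pragma pack(push, 1)
    struct wav_pcm_header {
        /* RIFF chunk */
        char     chunk_id[4];      /* "RIFF" (little endian) or "RIFX"     */
        uint32_t chunk_size;       /* file size in bytes minus 8           */
        char     wave_id[4];       /* "WAVE"                               */
        /* FORMAT chunk */
        char     subchunk1_id[4];  /* "fmt "                               */
        uint32_t subchunk1_size;   /* 16 for plain PCM (18/40 if extended) */
        uint16_t audio_format;     /* 1 = PCM, 7 = mu-law                  */
        uint16_t num_channels;     /* 1 = mono, 2 = stereo                 */
        uint32_t sample_rate;      /* blocks per second, e.g. 16000        */
        uint32_t byte_rate;        /* rate * channels * bits_per_sample/8  */
        uint16_t block_align;      /* channels * bits_per_sample / 8       */
        uint16_t bits_per_sample;  /* 8, 16, ...                           */
        /* DATA chunk */
        char     subchunk2_id[4];  /* "data"                               */
        uint32_t subchunk2_size;   /* number of bytes of sound data        */
        /* the samples themselves follow here */
    };
    #pragma pack(pop)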