Waveform Substitution Techniques for Recovering Missing Speech Segments in Packet Voice Communications




IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-34, NO. 6, DECEMBER 1986, p. 1440

DAVID J. GOODMAN, MEMBER, IEEE, GORDON B. LOCKHART, ONDRIA J. WASEM, STUDENT MEMBER, IEEE, AND WAI-CHOONG WONG

Abstract-Packet communication systems cannot, in general, guarantee accurate and prompt delivery of every packet. The effect of network congestion and transmission impairments on data packets is extended delay; in voice communications these problems lead to lost packets. When some speech packets are not available, the simplest response of a receiving terminal is to substitute silence for the missing speech. Here, we explore techniques for replacing missing speech with waveform segments from correctly received packets in order to increase the maximum tolerable missing packet rate. After presenting a simple formula for predicting the probability of waveform substitution failure as a function of packet duration and packet loss rate, we introduce two techniques for selecting substitution waveforms. One method is based on pattern matching and the other technique explicitly estimates voicing and pitch. Both approaches achieve substantial improvements in speech quality relative to silence substitution. After waveform substitution, a significant component of the perceived distortion is due to discontinuities at packet boundaries. To reduce this distortion, we introduce a simple smoothing procedure.

I. INTRODUCTION

Packet speech communication may play an important role in the evolution of combined voice and data services. Although the advantages of communicating computer data in packets are well documented [1], the editors of a recent collection of papers report that "the jury is still out" on the merits of packet speech [2]. In contrast to packet data transmission where delays are allowed to build up as traffic increases, speech communication requires prompt packet delivery. Beyond some time limit, delayed speech packets are useless at the receiving terminal and are discarded by the system. Packet loss, therefore, has a major effect on speech quality and the consequent constraints on packet dropping rates affect system costs. In formal listening tests conducted to assess the effects of missing packets on speech quality [3], [4], silent gaps replaced the missing speech packets, and it was determined that packet loss rates up to about 1 percent were tolerable. There are reports of other techniques for dealing with lost packets, such as repeating previous packets or, if the speech has been processed by a vocoder, synthesizing new speech from previously received analysis data [5].

Another approach is to construct speech packets at the transmitter in a manner that facilitates the recovery of lost packets. Jayant and Christensen have experimented with an interleaving technique that places odd-numbered samples and even-numbered samples in different packets [6]. When an isolated packet is missing at the receiver, the samples in a neighboring packet are used to estimate the missing samples.

By contrast, the packet reconstruction techniques presented in this paper operate on conventional PCM packets containing consecutive speech samples. They apply only at the receiver and incur negligible processing delay. We assume that the receiver learns the positions of missing packets from time stamps and/or sequence numbers in the headers of correctly received packets [5]. Not only does this header information guide the replacement process, but it also allows the receiving terminal to establish correct timing. Timing uncertainties arise from variable transmission delay and from speech activity detection which suppresses all packet transmission during silent intervals.

The simplest reconstruction technique is merely to set all samples to zero when packets are missing and to accept the distortion caused by the resulting gaps in received speech. This "zero substitution" may be acceptable for very small probabilities of packet loss, but for rates greater than about 1 percent there is much to be gained by attempting to reconstruct the waveform of the missing packet. The term "waveform substitution" refers to reconstruction of missing packets by substitution of past waveform segments. Because it is likely that the contents of a missing packet will resemble immediately preceding speech, this approach is attractive in that waveform synthesis is not required. Instead, the substitute waveform is selected from speech already available at the receiver. Packet reconstruction thus amounts to selecting a speech segment and placing it in the missing packet time slot.

Manuscript received November 30, 1985; revised March 22, 1986.
D. J. Goodman is with the AT&T Bell Laboratories, Crawford Hill Laboratory, Holmdel, NJ 07733.
G. B. Lockhart is with the University of Leeds, Leeds LS2 9TS, England.
O. J. Wasem is with the Massachusetts Institute of Technology, Cambridge, MA 02139.
W.-C. Wong is with the National University of Singapore, Singapore, 0511.
IEEE Log Number 8610377.
0096-3518/86/1200-1440$01.00 © 1986 IEEE

II. PERFORMANCE OF WAVEFORM SUBSTITUTION SCHEMES

Waveform substitution will be effective provided the character of the speech signal does not change significantly during a missing packet. Speech waveforms display quasi-stationary intervals which for the most part fall into one of three distinct categories: high-energy voiced speech with strong pitch periodicity, low-level unvoiced speech, and silence [7]. Significant changes in the character of the speech, and therefore incorrect waveform substitution, can occur for either of the following two reasons: a) the missing segment is so long that the speech signal is nonstationary, or b) there is a transition from one category to another within the missing segment.

In the Appendix, we use a simple probability model to describe the effects of these two failure conditions on the performance of an idealized waveform substitution scheme. Our simplifying assumption is that waveform substitution fails when either condition a) or condition b) occurs, and that it is perfect when neither condition occurs. We then analyze the probabilities of these conditions as functions of packet duration $T_p$ ms, and missing packet probability $p$. We introduce the variables

\[ C = 32 / T_p, \tag{1} \]

the maximum number of packet intervals over which the speech signal can be considered stationary, and

\[ t = \exp(-0.0052\,T_p) \approx 1 - 0.0052\,T_p, \tag{2} \]

the probability that there is no category transition during any particular packet. In terms of these variables, the probability of unsuccessful waveform substitution is

\[ P_f = 1 - (1 - p)\,\frac{1 - (pt)^{C+1}}{1 - pt}. \tag{3} \]

Fig. 1 displays this failure probability as a function of missing packet probability, for packet durations $T_p$ = 8, 16, or 32 ms. Observing that $P_f = p$ for zero substitution and assuming that, for example, $P_f = 0.01$ is a packet communication performance objective [3], we see in Fig. 1 that relative to zero substitution, which can tolerate a missing packet rate of 1 percent, waveform substitution can potentially increase the acceptable missing packet probability to 19 percent, 10 percent, or 5 percent when the packet duration is 8, 16, and 32 ms, respectively.

Fig. 1. Substitution failure probability as a function of missing packet rate. Within a 1 percent failure-rate criterion, the maximum missing packet rate is approximately 5, 10, or 20 percent when the packet length is 32 ms, 16 ms, or 8 ms, respectively.
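As a numerical illustration of (1)-(3), the following minimal Python sketch evaluates the failure probability and cross-checks it against a direct sum over runs of missing packets, as in the Appendix model. The function names are hypothetical, and rounding $C$ to an integer is an assumption that is exact for the packet durations of 8, 16, and 32 ms considered here.

```python
import math

def failure_probability(p, tp_ms):
    """Failure probability of idealized waveform substitution, eq. (3)."""
    c = round(32 / tp_ms)            # eq. (1): stationary span, in packets
    t = math.exp(-0.0052 * tp_ms)    # eq. (2): P(no category transition in a packet)
    pt = p * t
    return 1.0 - (1.0 - p) * (1.0 - pt ** (c + 1)) / (1.0 - pt)

def failure_probability_series(p, tp_ms, terms=200):
    """Same quantity as a truncated sum over runs of r missing packets."""
    c = round(32 / tp_ms)
    t = math.exp(-0.0052 * tp_ms)
    total = 0.0
    for r in range(terms):
        pf_r = 1.0 if r > c else 1.0 - t ** r   # a run fails if too long or if a transition occurs
        total += pf_r * p ** r * (1.0 - p)      # weight by the probability of a run of length r
    return total

# Loss rates quoted in the text for a 1 percent failure criterion.
for tp_ms, p in [(8, 0.19), (16, 0.10), (32, 0.05)]:
    print(tp_ms, p, failure_probability(p, tp_ms))
```

At the three quoted operating points the computed failure probability comes out close to the 0.01 objective, which is consistent with the 19, 10, and 5 percent figures read from Fig. 1.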
III. PERCEPTUAL ASPECTS, PACKET MERGING

While this probability model provides a guide to the quality of waveform substitution, it is also important to consider the perceptual effects of substitution failures. Listening experience suggests that the transition failure b) is perceptually the less damaging of the two mechanisms because it merely truncates or extends the current speech category for a short interval. This leads to a perceptible discontinuity when correct packets reappear. On the other hand, failure due to mechanism a) is likely to be more disturbing because it sustains for a long interval (greater than 32 ms) a sound selected from previous speech packets, and if the sustained sound is voiced, it is heard as a disturbing buzz or whistle.

In addition to the two failure mechanisms that we analyzed, we observed perceptible distortions due to small discontinuities at some of the boundaries between correct packets and substitution packets. To reduce the audibility of this distortion, we found it helpful to increase slightly the duration of substitution packets so that packet ends can be "merged" with the corresponding correct packets. Our merging procedure consists of raised-cosine weighting and addition of overlapping packet segments as indicated in Fig. 2. The merge duration $T_m$ = 1 ms produces acceptable results and, in fact, this merging also makes zero substitution less objectionable even though it alters the first or last millisecond of some correct packets.

Fig. 2. Raised cosine weight profiles for merging a substitution packet into received speech. (Labels: packet duration $T_p$; merge duration $T_m$; weight profile for adjacent packets.)

IV. WAVEFORM SUBSTITUTION BASED ON PATTERN MATCHING

A. Overview

The simplest waveform substitution scheme replaces a missing packet with the previous packet. This scheme, however, introduces distortion due to discontinuities at packet edges, an impairment that is reduced by more intelligent waveform selection techniques which exploit the periodicities in voiced speech. Figs. 3-5 illustrate this approach. In Fig. 3 we see a speech waveform divided into packets each containing L samples. One of the packets is missing and the algorithm searches previous packets to find L samples that resemble the missing packet. To do so, it uses as a template the M speech samples that came just before the missing packet. Fig. 4 indicates that the algorithm scans a search window of duration N samples to find the M samples that best match the template. It then uses as a replacement packet the L samples that follow the best match. Fig. 5 shows the reconstructed waveform.

To apply the merging technique of Fig. 2, we move the search window (in Fig. 4) $T_m$ ms to the left and append $T_m$ ms of received speech to each end of the replacement packet. The raised-cosine smoothing is then applied to the replacement packet and to the packets that precede and follow it.

Fig. 3. Speech waveform divided into five packets, each with L samples. Packet 4 is missing, and the final M samples of packet 3 comprise a template to be used in a search for a substitution packet.

Fig. 4. The template slides along a search window containing N samples to find M samples that best match the template. (Labels: search window; sliding gate.)

Fig. 5. The substitution packet contains the L samples immediately following the best match to the template.
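The raised-cosine merging of Section III can be sketched as a cross-fade over the overlap region. This is only an illustration under stated assumptions: the exact weight profile of Fig. 2 is not recoverable here, so the half-sample offset in the weights, the complementary fade-in weight, and the function names `raised_cosine_weights`, `merge`, and `splice` are all hypothetical choices. At an 8 kHz sampling rate, a 1 ms merge corresponds to 8 samples.

```python
import math

def raised_cosine_weights(n):
    """Fade-out weights that fall smoothly from near 1 to near 0 over n samples."""
    return [0.5 * (1.0 + math.cos(math.pi * (k + 0.5) / n)) for k in range(n)]

def merge(outgoing, incoming):
    """Cross-fade two equal-length overlapping segments.

    The fade-in weight is one minus the fade-out weight, so the two
    weights sum to one at every sample of the overlap.
    """
    n = len(outgoing)
    if len(incoming) != n:
        raise ValueError("overlapping segments must have equal length")
    w = raised_cosine_weights(n)
    return [w[k] * outgoing[k] + (1.0 - w[k]) * incoming[k] for k in range(n)]

def splice(before, substitute, merge_len):
    """Join `before` and `substitute`, cross-fading over merge_len samples.

    `substitute` is assumed to begin merge_len samples early, so that its
    head overlaps the tail of `before`.
    """
    head = merge(before[-merge_len:], substitute[:merge_len])
    return before[:-merge_len] + head + substitute[merge_len:]

# Example: a 1 ms merge (8 samples at 8 kHz) between a packet and its substitute.
joined = splice([1.0] * 64, [0.5] * 72, 8)
```

Because the two weights are complementary, a segment merged with an identical copy of itself is unchanged, and the joined signal contains no step larger than the cross-fade allows.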
B. Pattern Matching

There are several methods of pattern matching which yield almost equivalent results. One method is cross correlation of samples in the template and samples of the search window. In order that the result be sensitive to waveform shapes rather than level changes, the speech segments are normalized first. If the samples of the template are $x(i)$, $i = 1, 2, \ldots, M$, and the samples of the search window are $y(i)$, $i = 1, 2, \ldots, N$, then the cross-correlation formula is

\[ C(n) = \frac{\sum_{m=1}^{M} x(m)\,y(n+m)}{\left[\sum_{m=1}^{M} x^2(m)\right]^{1/2} \left[\sum_{m=1}^{M} y^2(n+m)\right]^{1/2}} \tag{4} \]

where M is the number of samples in the template and n identifies the position of the template as it slides along the search window. The result of the search is the value of n corresponding to the maximum $C(n)$. Note that the first denominator is independent of n for a given missing packet, because the template never changes. However, the second denominator changes each time the template advances one sample. A simplified version of $C(n)$, which is attractive in some practical implementations, is the sign correlation

\[ \sum_{m=1}^{M} \operatorname{sgn}[x(m)]\,\operatorname{sgn}[y(n+m)] \tag{5} \]

where $\operatorname{sgn}(x)$ is 1 when $x > 0$ and $-1$ when $x < 0$.

Another approach to pattern matching is based on waveform differences. As the template slides along the search window, the algorithm seeks the minimum sum of absolute differences. We have considered three methods of normalization. One is to divide the samples of each segment by the square root of the energy of that segment. This leads to the difference measure

\[ \sum_{m=1}^{M} \left| \frac{x(m)}{\left[\sum_{k=1}^{M} x^2(k)\right]^{1/2}} - \frac{y(n+m)}{\left[\sum_{k=1}^{M} y^2(n+k)\right]^{1/2}} \right|. \tag{6} \]

Another way to normalize is by the sum of the absolute magnitudes of the samples. The formula for the resulting difference measure is:

\[ \sum_{m=1}^{M} \left| \frac{x(m)}{\sum_{k=1}^{M} |x(k)|} - \frac{y(n+m)}{\sum_{k=1}^{M} |y(n+k)|} \right|. \tag{7} \]

A third way is to divide by the peak-to-peak amplitude of the segment. In the case of the search window, the peak-to-peak amplitude is not across the whole window, but only across the segment being compared to the template. The corresponding difference measure is

\[ \sum_{m=1}^{M} \left| \frac{x(m)}{x_{\max} - x_{\min}} - \frac{y(n+m)}{y_{\max} - y_{\min}} \right| \tag{8} \]

where

\[ x_{\max} = \max[x(1), x(2), \ldots, x(M)], \qquad y_{\max} = \max[y(n+1), y(n+2), \ldots, y(n+M)], \]

and similarly for $x_{\min}$ and $y_{\min}$.
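The search described in Section IV can be sketched in a few lines of Python using the magnitude-normalized difference measure. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the placement of the search window (ending L samples before the missing packet, so that the L samples after any candidate match are already received) is an assumption chosen so that a window of length N = M reduces to repeating the previous packet.

```python
import math

def normalized_abs_difference(template, segment):
    """Sum of absolute differences after scaling each segment by its sum of magnitudes."""
    sx = sum(abs(v) for v in template) or 1.0   # guard against all-zero (silent) segments
    sy = sum(abs(v) for v in segment) or 1.0
    return sum(abs(a / sx - b / sy) for a, b in zip(template, segment))

def select_substitute(history, L, M, N):
    """Return the L samples that follow the best match to the template.

    history: samples received before the missing packet (at least L + N of them)
    L: packet length, M: template length, N: search window length (N >= M)
    """
    if M > N or len(history) < L + N:
        raise ValueError("need M <= N and at least L + N samples of history")
    template = history[-M:]                     # the M samples just before the gap
    window_start = len(history) - L - N         # assumed window position
    best_pos, best_score = window_start, math.inf
    for n in range(N - M + 1):                  # slide the template along the window
        start = window_start + n
        score = normalized_abs_difference(template, history[start:start + M])
        if score < best_score:
            best_pos, best_score = start, score
    return history[best_pos + M:best_pos + M + L]

# Example with a strictly periodic test signal (period 16 samples).
signal = [math.sin(2 * math.pi * i / 16) + 0.5 * math.sin(2 * math.pi * i / 8 + 0.3)
          for i in range(464)]
substitute = select_substitute(signal[:400], L=64, M=32, N=128)
```

For a perfectly periodic signal the selected packet coincides with the true continuation; real speech is only quasi-periodic, so the match is approximate and the merging and amplitude adjustment steps remain useful.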
In our simulation experiments with the five pattern-matching measures, we judged the difference methods to produce speech that sounds slightly better than speech produced with the correlation measures and that the normalizations of (6) and (7) are superior to (8).

To improve the perceived quality of the reconstructed speech, we found it helpful to adjust the amplitude of the substitution packet to match that of the preceding packet. To achieve this we experimented with three measures of packet amplitude, root-mean-square, mean-absolute, and peak-to-peak, analogous to the normalizations in (6), (7), and (8), respectively. The root-mean-square and mean-absolute-value measures are comparable to each other and superior to the peak-to-peak measure. This amplitude adjustment has a stronger influence on quality than the choice of a pattern-matching criterion.

C. Two-sided Approach

We experimented with an extension of the pattern-matching algorithm that uses speech received after the missing packet in addition to speech that precedes it. This two-sided scheme selects a "future" replacement packet in a manner that is exactly analogous to the past-packet selection of Figs. 3-5. The final replacement packet is a weighted sum of past and future selections. This scheme is clearly more complicated than the one-sided approach and it incurs the performance penalty of added delay. Formal listening tests will be undertaken to assess the extent to which the added complexity and delay of the two-sided scheme are compensated by improved quality.

V. WAVEFORM SUBSTITUTION BASED ON PITCH DETECTION

Our other approach to waveform selection is based on pitch detection. If a reliable estimate of pitch period P ms is available for a voiced segment, then a missing packet can be reconstructed by repetitions of the last P ms of available speech in the missing packet time slot. Such a strategy aims at continuity on the basis of the most recent waveform information but makes no allowance for changes that may occur during missing packets.

A wide variety of methods is available for speech pitch detection. Our method can be interpreted as a variant of a parallel processing method proposed by Gold and Rabiner [8]. Two parallel detectors are employed which continually detect positive and negative peaks, respectively, of the speech signal. Center-clipping [7] with threshold CT provides a crude voiced/unvoiced classification and each peak detector attempts to isolate and identify a single "significant" peak in each pitch period.

The operation of the positive peak detector is specified in Fig. 6. At the beginning of its cycle, the positive peak detector updates the value of MAX with successive local maxima of speech samples until no update has occurred for HLD samples. HLD should be greater than the number of samples corresponding to the smallest expected pitch period. The time position of the last update is then stored as a significant peak and MAX is allowed to decay exponentially until exceeded by a speech sample whereupon the cycle restarts. The decay should be fast enough to accommodate peak detection in a succession of pitch periods with decreasing amplitude but slow enough to reject spurious low-level peaks in pitch periods of long duration. The negative peak detector operates in a similar way for negative peaks and both detectors store the time positions of the latest 3 significant peaks.

If a missing packet occurs, then a reconstructed packet is generated according to one of the following strategies.

1) If the last significant peak to be detected by both detectors occurred more than a constant T ms before the beginning of the missing packet, then it is assumed that the previous packet is unvoiced and a copy of the previous packet is taken as the reconstructed packet.

2) If 1) above is not satisfied, then two pitch period estimates are calculated for each pitch detector from the stored time positions of the latest three positive and negative significant peaks. Any estimate which falls outside an expected range from PL to PH is discarded. A "confident" estimate P is made only if agreement is secured within 8 percent between i) two pitch estimates from the same detector, and/or ii) the latest estimates from both detectors. If i) is true for the estimates from only one detector, then P is taken as the average of the two estimates. However, if both detectors provide apparently reliable but contradictory estimates, then the higher estimate is accepted on the assumption that the other detector has erroneously detected more than one significant peak per pitch period. The missing packet is then reconstructed by repetitions of the last P ms of available speech.

3) If conditions 1) or 2) above are not met, it is assumed that pitch detection has failed in the presence of voiced speech and a copy of the previous packet is taken as the reconstructed packet.

In all cases the final reconstructed packet is merged with adjacent packets as described in Section III.
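The pitch-based reconstruction of Section V can be illustrated with a short Python sketch covering two pieces: forming a period estimate from the latest three peak positions of one detector, and filling a missing packet by repeating the last pitch period. The function names are hypothetical, periods are expressed in samples rather than milliseconds, and reading "within 8 percent" as a difference of at most 8 percent of the larger estimate is an assumption. The combination of the two detectors is not shown.

```python
import math

def pitch_estimates(peaks):
    """Two period estimates (in samples) from the latest three peak positions."""
    a, b, c = peaks[-3:]
    return (b - a, c - b)

def agree(p1, p2, tol=0.08):
    """True when two estimates differ by at most tol times the larger one (assumed reading)."""
    return abs(p1 - p2) <= tol * max(p1, p2)

def single_detector_estimate(peaks, p_low, p_high):
    """Average of the two estimates if both lie in [p_low, p_high] and agree, else None."""
    if len(peaks) < 3:
        return None
    e1, e2 = pitch_estimates(peaks)
    if not (p_low <= e1 <= p_high and p_low <= e2 <= p_high):
        return None                      # out-of-range estimates are discarded
    return 0.5 * (e1 + e2) if agree(e1, e2) else None

def repeat_last_period(history, period, L):
    """Fill a missing packet of L samples by repeating the last `period` samples."""
    cycle = history[-period:]
    return [cycle[k % period] for k in range(L)]

# Example: 8 kHz sampling, 5 ms pitch period (40 samples), 16 ms packet (128 samples).
speech = [math.sin(2 * math.pi * i / 40) for i in range(328)]
period = single_detector_estimate([80, 120, 160], p_low=20, p_high=100)
packet = repeat_last_period(speech[:200], int(period), 128)
```

Repetition preserves phase continuity at the start of the gap, but the end of the substitute generally does not line up with the next received packet, which is where the merging of Section III helps.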
VI. EXPERIMENTAL RESULTS

In order to evaluate the effects of substitution parameters and the missing packet rate on speech quality, we have simulated the transmission of 11.2 s of speech sampled at 8 kHz. The speech material consists of one sentence from each of four speakers, two men and two women. Although, in practical networks, packet statistics are likely to be bursty, the missing packets in our experiments were randomly distributed in time. We believe that our conclusions about the relative effects of various parameter settings are generally applicable. Nevertheless, the overall effect of waveform substitution has to be assessed in the presence of packet loss patterns that are representative of each potential application.

A. Pattern Matching

In Section IV-B, we defined several matching criteria and discussed their influence on the quality of the reconstruction scheme. We now present the effects of packet length (L), template length (M), and search window length (N) on signal-to-noise ratio and on our perceptions of speech quality. In performing these evaluations, we have used the magnitude difference measure (7) for pattern matching and the root-mean-square amplitude adjustment of replacement packets.

1) Packet Size: For a given fraction of packets missing, the packet size has a strong effect on the perceived nature of the reconstructed speech. With very small packets (1 or 2 ms, L = 8 or 16), there is a constant, annoying crackle. For very large packets (32 ms or more), the replacement algorithm produces speech interspersed with beeps and chirps very similar to the voice of the robot R2D2 in the movie "Star Wars." This is because when many packets in a row are lost, the one-sided scheme repeats the same segment, creating a highly periodic signal. For sizes in between, the crackles become pops, and occur infrequently, rather than constantly like the crackle. To our ears, with 10 percent of the packets missing, the packet size most tolerant to packet loss is 8 ms (64 samples), although 16 ms is also good. This observation conforms approximately to the reports of earlier researchers who found that with silence substitution 16-32 ms packets were most tolerant of packet loss [6]. Signal-to-noise ratio of reconstructed speech is not a good indication of how the quality changes with packet size: it improves as the packets get smaller and smaller.

2) Search Window: There is an optimum search window duration. If the search window is too short, it omits the best reconstruction waveform. If it is too long, it contains speech that is unrelated to the missing packet, and there is a chance that a small segment of this speech is well matched to the M samples in the template, a situation which can result in the selection of a suboptimum reconstruction segment. In our experiments, we found that regardless of packet size, a 16 ms search window was best for the basic (one-sided) pattern-matching scheme. With the two-sided technique of Section IV-C, 8 ms was the best search window duration. Speech quality does not deteriorate appreciably when the search window is longer than optimum.

We observed that signal-to-noise ratio is a reasonably good indicator of the relationship of perceived quality to search window duration. Fig. 7 shows the total SNR (measured across 11 s of speech) as a function of the number of samples in the search window. Fig. 8 displays the average SNR (in dB) per missing packet, a measure found in a previous study to be a good indicator of the relative quality of reconstruction methods [6]. Note in Figs. 7 and 8 that the shortest search window is equal in duration to the template. In this case, the algorithm replaces the missing packet with the previous packet. The measurements in Figs. 7 and 8 were obtained with L = 64 samples per packet and a missing packet rate of 9.2 percent. For the basic method, the template contains M = 32 samples; for the two-sided scheme, M = 16.

3) Template: There is a minimum acceptable template duration. If M is too small, there is simply insufficient speech information. The quality also goes down if the template is too long. As in the case of the search window, the best sizes, 4 ms (M = 32) for the basic scheme and 2 ms (M = 16) for the two-sided scheme, appear to be independent of the packet size. Again, SNR correlates well with the effect of template duration on perceived quality. Fig. 9 shows the average SNR per missing packet as a function of template size.

4) Missing Packet Rate: As expected, the SNR and the perceived quality decline as the fraction of packets missing increases. Fig. 10 shows the dependence of total SNR on missing packet rate for the two pattern-matching methods and for missing packets replaced by silent gaps. Our listening experience suggests that communication breaks down when more than 30 percent of the speech is lost. When only half of the packets arrive, the speech sounds as though the person is trying to gargle while speaking.
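The two objective measures used in these experiments, total SNR and average SNR per missing packet, can be computed along the following lines. The paper does not spell out the formulas, so this sketch assumes the conventional definition (signal energy divided by error energy, in dB) and assumes that the per-packet figure is the mean of the dB values over missing packets, skipping packets whose signal or error energy is zero. The function names are hypothetical.

```python
import math

def snr_db(reference, test):
    """10 log10 of reference energy over error energy (assumed definition)."""
    signal = sum(s * s for s in reference)
    noise = sum((s - r) ** 2 for s, r in zip(reference, test))
    if noise == 0:
        return math.inf
    if signal == 0:
        return -math.inf
    return 10.0 * math.log10(signal / noise)

def average_snr_per_missing_packet(reference, test, missing, L):
    """Mean of per-packet SNR values (dB) over the missing packet indices.

    Packets with zero signal or zero error energy are skipped (assumption).
    """
    values = [snr_db(reference[k * L:(k + 1) * L], test[k * L:(k + 1) * L])
              for k in missing]
    finite = [v for v in values if math.isfinite(v)]
    return sum(finite) / len(finite) if finite else math.nan

# Toy example: two packets of two samples, the second one replaced by silence.
reference = [1.0, 1.0, 1.0, 1.0]
received = [1.0, 1.0, 0.0, 0.0]
total = snr_db(reference, received)
per_packet = average_snr_per_missing_packet(reference, received, missing=[1], L=2)
```

The total figure is dominated by how many packets are missing, while the per-packet figure isolates how well each gap is filled, which is why the two can rank parameter settings differently.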
Fig. 7. Total signal-to-noise ratio as a function of search window size. There are 64 samples per packet and 9.2 percent of the packets are missing. With one-sided estimation, M = 32 samples per correlation segment; with two-sided reconstruction, M = 16.

Fig. 8. Average signal-to-noise ratio per missing packet for the same condition as Fig. 7.

Fig. 9. Average signal-to-noise ratio per missing packet as a function of correlation window size (template size, in samples). There are 64 samples per packet and 9.2 percent of the packets are missing. In the basic (one-sided) version, there are N = 128 samples in the search window; N = 64 in the two-sided case.

Fig. 10. Total SNR as a function of missing packet rate for both versions of the pattern-matching scheme and for silence substitution. (Curves: two-sided, M = 16; one-sided, M = 32; silent gaps.)

B. Pitch Detection Approach

Table I lists the parameters used in this simulation. We found that in order to accommodate pitch periods ranging from 2.5 to 12.5 ms, some degree of parameter adaptation was necessary in the pitch detector. The algorithm illustrated in Fig. 6 was therefore modified to allow the hold parameter HLD to increase, from an initial value of 20 samples, by one-quarter of the number of samples taken during the decay phase of the previous detection cycle. The decay factor DK was also made partially adaptive by setting DK = 1 - 0.8/HLD at the beginning of a cycle. This adaptation achieved some lengthening of the decay time constant for longer values of pitch period, and provided a rapid response at the beginning of a voiced segment.

The relative frequencies of the three decision strategies (Section V) invoked for missing packet reconstruction are listed in Table II for missing packet probabilities of 0.1, 0.2, 0.3, 0.4, and 0.5. In order to reduce dependence on specific patterns of missing packets, the entries in Table II are averages over 3 runs using different random number seeds to select missing packets.

The entries for strategy 2) are subdivided according to how a confident estimate P of pitch period was derived. Within this category, the estimates are usually made with the confidence of both detectors and examination of the speech records confirms that the measure of periodicity is invariably correct in these cases. Decisions for strategy 2) taken with the confidence of only one detector were usually considered satisfactory on visual inspection of the waveform records. Such cases occurred typically when high-level spurious peaks of either positive or negative polarity disabled the operation of one detector but not both. A small proportion of decisions was made with only the agreement of the latest estimates from both detectors. These decisions usually occurred in the vicinity of a transition. Cases where reliable estimates from both detectors disagreed significantly are not shown as they occurred with extremely small frequency.

Strategy 3), invoked when neither voiced nor unvoiced decisions can be made, was used for about 8 percent of missing packets, usually in the vicinity of transitions where voiced/unvoiced distinctions become difficult to make.

TABLE I
PARAMETER VALUES USED IN SIMULATION OF PITCH DETECTION ALGORITHM

PCM sampling rate                                       8 kHz
Packet duration $T_p$                                   16 ms
Merge duration $T_m$                                    1 ms
Voiced/unvoiced decision threshold CT                   10 percent of peak speech
Maximum allowance for acceptance of signal peaks T      16 ms
Minimum acceptable pitch period PL                      2.5 ms
Maximum acceptable pitch period PH                      12.5 ms

Fig. 6. Flowchart of the pitch estimation procedure, showing the procedure for determining the number of samples between positive waveform peaks. (Flowchart elements: input speech sample; centre clip at CT; counter CNT; store peak positions; MAX = DK x MAX.)
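A rough Python sketch of the positive peak detector with the adaptive hold and decay parameters is given below. It follows the verbal description only, since the flowchart of Fig. 6 is not reproduced here, and several details are assumptions: MAX is updated by any larger sample rather than by local maxima, center clipping is omitted, HLD is recomputed each cycle as 20 plus one-quarter of the previous decay length (not accumulated), and the constant in DK = 1 - 0.8/HLD is taken as 0.8. The function name is hypothetical.

```python
def detect_positive_peaks(x, hld0=20, adapt=True):
    """Return sample positions of significant positive peaks in x."""
    peaks = []
    hld = hld0
    dk = 1.0 - 0.8 / hld          # per-sample decay factor (assumed constant 0.8)
    max_val = 0.0
    decay_len = 0                 # samples spent in the latest decay phase
    i = 0
    while i < len(x):
        if x[i] > max_val:
            # A sample exceeded the decayed MAX: a new detection cycle starts.
            if adapt and peaks:
                hld = hld0 + decay_len // 4
                dk = 1.0 - 0.8 / hld
            max_val, peak_pos, since_update = x[i], i, 0
            i += 1
            # Hold phase: track larger samples until HLD samples pass without an update.
            while i < len(x) and since_update < hld:
                if x[i] > max_val:
                    max_val, peak_pos, since_update = x[i], i, 0
                else:
                    since_update += 1
                i += 1
            if since_update >= hld:
                peaks.append(peak_pos)    # store only peaks from completed hold phases
            decay_len = 0
        else:
            # Decay phase: MAX decays exponentially until a sample exceeds it.
            max_val *= dk
            decay_len += 1
            i += 1
    return peaks

# Example: unit pulses every 40 samples (5 ms at 8 kHz) with smaller spurious pulses.
pulses = [1.0 if i % 40 == 10 else (0.3 if i % 40 == 30 else 0.0) for i in range(200)]
positions = detect_positive_peaks(pulses)
```

In this example the spurious pulses fall inside the hold interval and are ignored, so successive peak positions differ by one pitch period; a negative peak detector would apply the same procedure to the negated signal.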
Table II also displays SNR measured over the entire 11 s speech sample and the SNR per missing packet. Both measures tend to decrease with increasing missing packet probability. The distortion per missing packet increases because the success of packet reconstruction becomes limited by the frequent occurrence of long missing packet chains [failure condition a) in Section III]. While the total SNR's in Table II are comparable to those displayed in Fig. 10 for the pattern-matching schemes, it is impossible, owing to the loose connection between SNR and subjective quality, to deduce the relative merits of the two approaches from these data.

TABLE II
PITCH DETECTION ALGORITHM PERFORMANCE

Columns: missing packet probability 0.1, 0.2, 0.3, 0.4, 0.5.
Rows: number of missing packets; Strategy 1), unvoiced (percent); Strategy 2), voiced (percent), subdivided into both detectors agree, only positive detector confident, only negative detector confident, and both detectors confident only on latest estimates; Strategy 3), voicing ambiguous (percent); total signal/distortion (dB); average signal/distortion per missing packet (dB).

VII. IMPLEMENTATION COMPLEXITY

Essentially two computational processes are required for the implementation of a waveform substitution scheme. The first involves extraction of the necessary speech parameters such as pitch period or pattern-matching location while the second involves waveform placement in the missing packet time slot on the basis of available parameters. Although the latter may be a logically complicated decision process in the pitch estimation algorithm involving strategies such as 1), 2), and 3) in Section V, it is required no more than once per packet duration, and therefore, the computational demand averaged over this period is likely to be dominated by the parameter extraction process.

We programmed the WE DSP 20 Digital Signal Processor [9] to implement a simplified version of the pitch detection scheme operating with 16 ms packets of μ-encoded PCM at an 8 kHz sampling rate. On the basis of the experience gained, it is possible to give an indication of the number of DSP 20 instructions required for some waveform substitution schemes. The entries in Table III refer to executed DSP 20 instructions per speech sample averaged over one packet and for purposes of comparison include the trivial case of direct substitution (of zeros or the last packet). "Pure" autocorrelation refers to pitch detection by computation of the complete discrete autocorrelation function for 2.5-12.5 ms delays over a packet-length window. "Fast autocorrelation" refers to the same process using computational shortcuts [7] which might reasonably reduce computational effort by a factor of two. Although the pattern-matching method appears more complex as a software simulation, it requires about the same computational effort as the pitch detection method since the DSP 20 favors pipelined arithmetic rather than the conditional branching required for the pitch detection illustrated in Fig. 6.

TABLE III
ESTIMATED DSP INSTRUCTIONS PER SPEECH SAMPLE FOR VARIOUS RECONSTRUCTION SCHEMES

Pure Autocorrelation      100
Fast Autocorrelation       50
Pattern Matching           10
Pitch Detection            10
Direct Substitution         3
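For comparison with the peak-picking detector, the "pure" autocorrelation alternative mentioned in Section VII can be sketched as an exhaustive search over lags between 2.5 and 12.5 ms. This is an illustrative sketch only: the function name and defaults are hypothetical, the autocorrelation is left unnormalized, and nothing here reflects the DSP 20 implementation or its instruction counts.

```python
def autocorrelation_pitch(x, fs_hz=8000, min_ms=2.5, max_ms=12.5):
    """Return the lag (in samples) that maximizes the autocorrelation of x."""
    lo = int(round(min_ms * fs_hz / 1000))    # 20 samples at 8 kHz
    hi = int(round(max_ms * fs_hz / 1000))    # 100 samples at 8 kHz
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, hi + 1):
        # One multiply-accumulate per sample pair for each candidate lag.
        val = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

# Example: decaying pulses repeating every 40 samples (5 ms at 8 kHz).
voiced = [0.9 ** (i % 40) for i in range(256)]
lag = autocorrelation_pitch(voiced)
period_ms = 1000.0 * lag / 8000
```

The nested loop makes the cost per input sample grow with the number of candidate lags (81 here), which illustrates why this approach is more expensive than peak picking or template matching.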
R. we have also learnedthat subjective quality is very sensitive to specific patterns of missing packets.” 1.” IEEE Trans. 1982.pp. Packet Switching Tomorrow’s Communications Today. Digital Processing of Speech Signals. [5] C. Because of this phenomenon. including mobile radio and indoor wireless communications. In this case. 2. Examining several speech waveforms. [7] L. COM-29. Selected Areas Commun. The second failure condition b) occurs when there is a category transition within theduration of the missing packet. the Master’s degree from New York University.t‘.PI.3. terpolationprocedure.. Since speech parameters can be assumed stationary for no more than about 30 ms [6]. electrical engineering. (11) which. He has done research in several aspects of source coding. . McLane. 442-448. Dec. Selected A r e a Commun. all available substitution waveforms (from previous packets) will be incorrect a€ter the transition. S. Since 1993 he has been a Visiting Professor in the Electrical Engineering Department of Imperial Colle& University of London. [6] N . Acoust. Gruber and L. Gruber and N.. and communications. “Parallel processing techniques for estimating pitch periods of speech in the time domain. the probability of a sequence of r missing packets is Prn(r) = P‘U . If there is a sequence of Y consecutive missing packets. in 1962. we have t . New York. [2] M. Because packet durations of 8. .. Assuming that transitions occur independently from packet to packet.1. These durations correspond to C = 4. “An LSI digital signal processor.: WAVEFORM SUBSTITUTION TECHNIQUES 1447 telligible speech can be obtained in the face of missing packet probabilities up to about 0. [3] J. Selected Areas Commun. digital signal processing. NJ. Le. pp.2 category transitions per second. 1985. 981-1005. Strawczynski. in 1960. vol. 
The perceptual effect of failed reconstruction within the resulting 48 ms segment will therefore depend critically on the time position of the segment and distortion may vary from negligible during silent periods to-gross during high-level voiced speech. (9) David J. Rabiner and R. REFERENCES [ l ] R. (10) Thus. R. J. W.1980. Christensen. Belmont. D ~ c 1983. 1981. pp.(r). 16.Speech. pp. 963-980. Rosner. Decina and D.. For example. W. Troy. where Tp ms is the packet duration. Int. only about onemissing packet chain of 3 ormore packets can be expected with a missing packet probability of 0. Feb. D.GOODMAN et a[. Since 1967 he has been at AT&T Bell Laboratories. “Effects of packet losses in waveform coded speech and improvements due to an odd-even sample inIEEE Trans. Goodman (”67) received the Bachelor’s. 1983. SOC. Thompson and J . W. Weinstein and J. in (2). Aug. CA: Lifetime Learning Publications. vol. and 1 consecutive missing packets. thetotal probability of failure of the waveform substitution method is Pf = ‘F0 m Pf@) P. vol. Aug. Jayant and S . where he is currently Head of theCommunicationsMethodsResearchDepartment. the substitution will fail if there is a transitionsomewhere in the sequence of r missing packets. EnglewoodCliffs. the number of contiguous missing packets that can be tolerated before condition a) occurs in approximately 30/Tp. and 32 ms are of practical interest.” in Proc. 1969. P.” IEEE J. Assuming their occurrences conform’to a Poisson probability model. the waveform substitution fails if r > C . Rabiner.1978. “Performance requirements for integrated voice/ data networks. We begin by observing that condition a) occurs only when the total duration of a sequence of missing packets becomes significant relative to expected changes in long-term speech parameters such as envelope and pitch. pt. NJ: Prentice-Hall. “Voice by the packet?” IEEE J. - - .. “Experience with speech communication in packet networks. Commun.. [8] B. 
the probability of no transition in a sequence of r packets is t ‘ and the probability of failure is Pf(r) = 1 = We are grateful for the advice andencouragement of J. vol. SAC-1. If the probability of packet loss is p . 961-962. Valenzuela.England. V‘lack. Boddie. Apr. Recently he and his colleagues have been studying short range communications networks.” IEEE J. themaximum number of consecutive missing packets consistent with successful waveform substitution. ACKNOWLEDGMENT APPENDIX WAVEFORM SUBSTITUTION FAILURE PROBABILITY We derive (3) which is the probability of occurrence of failure condition a) or condition b) (or both simultaneously) defined in Section 11. Amer.2. 6. in June 1986.S. Wasem (S’84) was born on March 26. SPI!ECH. Ondria J. There she completed her Master’s thesis on reconstructing missing packets of PCM and ADPCM . Wasem chaired the M. During the Summer of 1984 and from June 1985 through January 1986. She is currently working on the Ph. Student Chapter of the IEEE and the M. AND SIGNAL PROCESSING.1448 IEEE TRANSACTIONS ON ACOUSTICS. this time in the Communication Methods Research Department.I. She is also a member of Eta Kappa Nu.NO.T. she worked again as anIntern at AT&T Bell Laboratories. During the Summer of 1983 she worked as an Intern at AT&T Bell Laboratories.I.T. NJ. Mrs. Tau Beta Pi.I.D. in the Robotics Principles Research Department. VOL. Cambridge.S. degree in electrical eagineering and computer science under a National Science Foundation Fellowship at M. 1964. Holmdel. She received the B.T. ASSP-34. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology. and M. DECEMBER 1986 encoded speech. Department of Electrical Engineering and Computer Science Student Faculty Committee from September 1984 through May 1985. and Sigma Xi.
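The failure probability derived in the Appendix can be checked numerically. The following is a minimal sketch, not the authors' code: the function names, the `MAX_CHAIN` lookup, and the truncation limit `r_max` are assumptions made for illustration. It takes t from the Poisson model with 5.2 transitions per second, uses the values C = 4, 2, 1 quoted for 8, 16, and 32 ms packets, and assumes the geometric chain probability p^r (1 - p). The closed form is our own summation of the geometric series and is not quoted from the paper's equation (3).

```python
import math

# Figures quoted in the Appendix.
TRANSITIONS_PER_SECOND = 5.2
# C for the three packet durations the Appendix lists (ms -> packets).
MAX_CHAIN = {8: 4, 16: 2, 32: 1}


def no_transition_prob(tp_ms):
    """Poisson probability t of no category transition in one packet of tp_ms."""
    return math.exp(-TRANSITIONS_PER_SECOND * tp_ms / 1000.0)


def chain_failure_prob(r, t, c):
    """P_f(r): failure probability for a chain of r missing packets."""
    return 1.0 - t ** r if r <= c else 1.0


def failure_prob_sum(p, tp_ms, r_max=500):
    """Truncated sum of P_f(r) * p**r * (1 - p) over r = 0..r_max."""
    t = no_transition_prob(tp_ms)
    c = MAX_CHAIN[tp_ms]
    return sum(chain_failure_prob(r, t, c) * p ** r * (1.0 - p)
               for r in range(r_max + 1))


def failure_prob_closed(p, tp_ms):
    """Same quantity after summing the geometric series by hand (assumed form)."""
    t = no_transition_prob(tp_ms)
    c = MAX_CHAIN[tp_ms]
    return p - (1.0 - p) * sum((t * p) ** r for r in range(1, c + 1))


if __name__ == "__main__":
    for tp_ms in (8, 16, 32):
        for p in (0.01, 0.05, 0.1):
            print(tp_ms, p, round(failure_prob_sum(p, tp_ms), 5))
```

The truncated sum and the closed form should agree to rounding error for any loss probability well below 1, because the neglected tail of the sum is p^(r_max + 1). Passing C through a lookup rather than computing it keeps the sketch independent of the exact form of (1).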