Forensic tape enhancement: aural identification of recorded voices (familiar)
Posted 10/01/1993
In Forensic Science
Enhancement of tape recorded voices to facilitate transcription & aural identification
Bruce E. Koenig
Federal Bureau of Investigation
Law enforcement operations throughout the world continually capture the voices of suspects with miniature transmitter/receiver systems, analog and digital on-the-body recorders, telephone intercept devices, and concealed room microphones. Since these recordings are normally utilized for investigative leads and/or legal proceedings, specific speakers must be accurately identified. Voice identifications that occur through self-recognition of one’s voice, eyewitness information, surveillance logs, and the use of a person’s name in the conversation are usually readily accepted. However, voice identifications that involve listening only and/or laboratory tests are often more difficult to evaluate accurately. To provide a better understanding of these voice comparison topics, two types of aural-only comparisons will be discussed, and an update on the spectrographic technique is included.
Aural Identification of Familiar Voices
Recognition of familiar voices is a daily occurrence for most people, as they identify spouses, children, coworkers, friends, and business associates after only a few words spoken over the telephone or heard from an adjacent room. This process involves long-term memory, where recognition occurs through prior knowledge of speech characteristics, including such attributes as accent, speech rate, pronunciation, pitch, vocabulary, and vocal variance (intraspeaker variability).
Some of the relevant scientific research and opinions that address the accuracy of identifying familiar voices include the following:
- Researchers used 7 listeners who were familiar with the 16 chosen speakers through daily contact. The speakers had no pronounced speech defects or accents. Groups of two to eight speech samples of varying lengths were played back to the listeners, which resulted in an identification accuracy of better than 95% for samples lasting from about 1 to 2 seconds. Voice samples were also frequency restricted, but the results reflected only a limited loss of accuracy under conditions normally encountered in law enforcement investigations. In tests involving whispered speech, the samples had to be somewhat more than three times as long as normal speech samples to obtain equivalent levels of identification (Pollack et al. 1954).
- Sixteen listeners with no hearing losses, who had known the 10 recorded male coworkers for at least 2 years, were chosen. None of the 10 recorded individuals had either pronounced regional accents or speech abnormalities. When the listeners heard sentences lasting less than 3 seconds from the 10 coworkers, their median accuracy rate of identification was 98% (range of 92% to 100%). When only a disyllable (e.g., mama) was spoken, the median accuracy rate dropped to 88% (range of 73% to 98%) (Bricker and Pruzansky 1966).
- In a study of coworkers, recordings were made on different telephone lines of four women and seven men, each talking for 30 seconds to 1 minute on a neutral topic such as the weather. An additional recording was prepared of another male, who was relatively unfamiliar to most of the listeners. The recordings were arranged in a random order and played to 10 of the other coworkers, who were asked to identify the speakers. “All the listeners except one correctly identified all the 11 [coworkers]… The one listener who made an error… confused two speakers who were not well known to him. Three of the 10 listeners knew [the eighth male, who was not a coworker], and correctly identified him. Of the remaining seven listeners, only two said that they could not recognize this speaker. Five listeners wrongly identified this speaker as…” another one of their coworkers. “It is worth noting that four of the five listeners who made the wrong identification were highly skilled, experienced phoneticians…” with doctoral degrees in the field (Ladefoged 1978). This experiment reflects a 100% identification rate for the coworkers’ voices that were well known to the listeners and an overall average accuracy rate of 96% when the relatively unfamiliar voice was added.
- Twenty-four individuals were asked to listen to speech samples of 24 coworkers (15 males and 9 females) whom they had known for several years and 4 speakers unknown to the listeners. The speech samples averaged about 30 seconds in length and contained at least 12 utterances of 2 to 4 words each. Listeners rated each coworker on a scale of very familiar to totally unfamiliar prior to the testing. They listened to the samples for as long as they wished and then rated their decisions as follows: (1) guessing, (2) fairly sure, or (3) very sure. Excluding the results for any voice rated totally unfamiliar to the listener, the results showed a 90.4% correct identification rate and a 4.3% incorrect identification rate, with 5.3% of responses stating that the listener did not know the speaker. If the 5.3% are excluded, the correct identification rate is 95.4%. “This rate is probably fairly representative of situations where a limited vocabulary is required and can be expected to be even higher in informal conversations where more of the individual speaker’s speech habits are present as cues for identification” (Schmidt-Nielsen and Stern 1985).
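The adjustment in the last study above, in which the 5.3% of "did not know" responses are set aside and the correct rate is recomputed over the remaining responses, can be checked with a short calculation. The following is only a minimal sketch: the function name `renormalized_rate` is hypothetical, and the inputs are the rounded percentages quoted above, so the last digit may differ slightly from the published figure, which was presumably computed from the unrounded data.

```python
def renormalized_rate(correct_pct: float, incorrect_pct: float) -> float:
    """Percent correct among definite answers (correct + incorrect only)."""
    definite = correct_pct + incorrect_pct
    if definite <= 0:
        raise ValueError("no definite identifications to renormalize")
    return 100.0 * correct_pct / definite


# Rounded percentages quoted in the study summary above (not raw counts).
correct, incorrect, did_not_know = 90.4, 4.3, 5.3

rate = renormalized_rate(correct, incorrect)
print(f"Correct rate excluding 'did not know' responses: {rate:.1f}%")
```

With these rounded inputs the result is roughly 95.5%, which agrees with the reported 95.4% to within rounding.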
This research indicates that the identification accuracy rate for familiar voice samples lasting 1 second or longer ranged from 92% to 100% and averaged 95% to 100%. Recording the samples through the telephone or other limited-bandwidth systems had little effect on accuracy. The effects of noise and loss of high-frequency information were studied in another experiment (Clarke et al. 1966), which found that aural speaker identification was only slightly degraded when progressing from high-quality voice samples to typical investigative recordings. Everyday experience and the cited research both show that identifying familiar voices can be an accurate method for identifying voices recorded in forensic applications, even with the limiting factors of noise and attenuated high frequencies.





