From Forensic Examiner magazine, interview with Tom Owen, CHS-III, member of ACFEI's Executive Advisory Board for the American Board for Certification in Homeland Security and Chair of the American Board of Recorded Evidence
Does a person's voice change?
Yes, after puberty. Once you reach maturity your voice stays the same, and then if things change it's because you did something--you had an operation, got sick, had a lung removed. For instance, take Winston Churchill, who was a heavy cigar-smoker and drinker for at least 50 years of his life. If you compare his spectrograms from his 20s to those from his 70s saying the same words, there's a huge difference. It didn't make his voice unidentifiable, but the structure, format and pitch changed.
Please walk us through the basic steps you would follow in analyzing an audio recording for voice identification.
Normally you receive the tape from the prosecutor or a defense team and check it to make sure it's not physically damaged, take pictures of it, make sure it hasn't been spliced and that somebody didn't spill a Coke on it. Then you play it into a computer from a specialized recorder playback device that keeps constant tension on the tape. The computer lets you know that the tape is running at the speed at which it was initially recorded. Many times tapes are made on recorders that are running off-speed, and you compensate for that.
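The speed compensation described above comes down to simple arithmetic. The sketch below assumes a reference tone of known frequency is present on the tape. The interview does not say how the computer determines the original speed, so the tone, the function name and the figures are all hypothetical.

```python
def speed_correction(nominal_hz: float, measured_hz: float) -> float:
    """Return the factor by which playback speed must be multiplied
    so that a tone measured at measured_hz plays at nominal_hz."""
    if nominal_hz <= 0 or measured_hz <= 0:
        raise ValueError("frequencies must be positive")
    return nominal_hz / measured_hz


# Hypothetical case: a 1000 Hz reference tone plays back at 1040 Hz,
# so the playback is running about 4% fast.
factor = speed_correction(1000.0, 1040.0)

# A passage that took 60 s in the fast playback lasts longer once corrected.
corrected_duration = 60.0 / factor
```

A factor below 1 means the playback must be slowed down; a factor above 1 means it must be sped up.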
Next you critically listen to the tape, listening for things that are indigenous to the voice. We call them aural cues and visual cues. Then there are other unique factors that might be peculiar to a particular voice (e.g., someone who sniffs all the time or has a sinus problem--we call it "snorking"--or someone who's always clearing his or her throat).
Then we look at a spectrogram for the visual cues. A spectrogram is just a visual representation of time, energy and frequency. Time is fixed at 2.2 seconds, energy is the amplitude and the frequency is set by the machine--we can go up to 8 kilohertz or 8000 cycles per second. A hertz is one cycle, or vibration, per second. For example, cycles per second refers to the vibrations of our vocal cords. Our vocal cords make sounds by vibrating as air is pushed through them from the diaphragm and the esophagus, and our articulators form the words. By recording vibrations, a spectrogram can create a visual representation of a voice.
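To make the time, energy and frequency description concrete, here is a minimal sketch of how a spectrogram can be computed from digital samples with a short-time Fourier transform. It is not the examiner's equipment or software. The 16,000 Hz sample rate (chosen so the top frequency is 8000 Hz), the frame sizes and the synthetic test tone are all assumptions made for illustration.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def spectrogram(samples, frame_len=256, hop=128):
    """Return one list of magnitudes (energy per frequency bin) per time frame."""
    # A Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [samples[start + i] * window[i] for i in range(frame_len)]
        spectrum = fft(chunk)
        # Keep bins from 0 Hz up to half the sample rate.
        frames.append([abs(c) for c in spectrum[: frame_len // 2 + 1]])
    return frames


SAMPLE_RATE = 16000  # assumed; half of it is the 8000 Hz upper limit

# Synthetic 0.1 s test tone at 1000 Hz standing in for a recording.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(1600)]

spec = spectrogram(tone)
bin_hz = SAMPLE_RATE / 256  # frequency spacing between bins
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * bin_hz
```

Each inner list is one vertical slice of the picture: position in the outer list is time, position in the inner list is frequency, and the value is the energy. For the test tone, the strongest bin sits at 1000 Hz.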
When we've loaded the tape, we're looking on a computer at spectrograms. To do a comparison of an unknown voice and the voice of the person who you suspect is on the tape, you have to make a short-term memory tape. Up to this point we've been looking at the unknown tape. Now we're going to take an exemplar of a voice, the voice you're comparing to the unknown voice.
For example, I had a recent case where "Annie" got a job that was promised to another woman. That woman was mad and called Annie up and said "You dirty ... son of a so and so. I should have gotten that job. You didn't deserve it; you were sleeping with the boss." This would be your example of the unknown voice, and all the steps I've told you about have been taken with the unknown voice.
Now the company plays the tape around the office and people say, "That sounds like 'Mary Jo.'" So they go to Mary Jo and she says, "Well, it's not me, and I'm willing to take a voice exemplar." I get Mary Jo on the phone and ask her to say the same thing, in exactly the same manner that is on the unknown tape. Then I go through this whole process with her tape (the known sample) with the aural and visual cues. Next I'll make a short-term memory tape, which allows us to compare short phrases from the two tapes back to back. The unknown tape will say "I should have gotten the job." Then Mary Jo will say "I should have gotten the job," and then the next phrases and so on for about 10 phrases. At the end you have a large sample of verbatim speech, the unknown next to the known. In listening to it, you can tell if the voices sound the same or different. Then you do the spectrograms of the known and unknown, of just the phrases you're going to use for your sample. Now you compare the two--do they sound the same, do they look the same? If it looks the same and sounds the same, it is the same. For example, if it sounds the same but doesn't look the same, that's an indication that it may be a sound-alike voice, but it may not be the same person.
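The back-to-back ordering of a short-term memory tape can be sketched as a simple interleaving of matched phrases. This is only an illustration of the sequence described above. The function name is hypothetical, and plain text labels stand in for the audio clips.

```python
def short_term_memory_sequence(unknown_clips, known_clips):
    """Alternate unknown and known clips, phrase by phrase."""
    if len(unknown_clips) != len(known_clips):
        raise ValueError("each unknown phrase needs a matching known phrase")
    sequence = []
    for unknown, known in zip(unknown_clips, known_clips):
        sequence.append(("unknown", unknown))
        sequence.append(("known", known))
    return sequence


# Text labels stand in for audio clips of the same words from each tape.
phrases = ["I should have gotten the job", "You didn't deserve it"]
sequence = short_term_memory_sequence(phrases, phrases)
```

The result plays the unknown phrase first, then the known speaker saying the same words, and repeats that pattern for every phrase.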
Are you saying that where the human ear might be fooled, the spectrogram will reveal the truth?
Right. For example, when it looks like the same person but it doesn't sound at all like the same person, then you have to be wary of the exemplar recording and wonder if you actually used the exact same words spoken in the same manner. The experience of the examiner comes into play, because if you get on the phone with someone and tell him or her to say, "There's a bomb in the building," and he or she says, "There's a bomb in the building," that's not good enough, because that's going to skew your result. The individual must say the words exactly the way you direct him or her to say them.
Do you have to authenticate audio recordings for voice identification?
That's something different. Usually that's when someone says, "Yes, that's my voice on the tape, but that's not the conversation we had. You edited the tape." To verify that, there's a process in which you have to do an authenticity examination. These are the steps: critical listening, waveform analysis, magnetic development, tape enhancement, spectrum analysis, phase continuity and speed fluctuation analysis, and voice identification, should it be necessary.
The way most people splice together multiple conversations is that they take two tape recorders and go back and forth to get the conversation the way they want it, and then they make a copy of that tape. We would be able to determine and discover all of this.
With computers, this process can be more difficult because people create the tape they want, load it into a computer, cut out all the stopping and starting signatures and try to paste it together that way. That's more difficult to analyze, and it is more difficult to arrive at a conclusive determination of whether the tape has been edited or not. However, creating such a tape requires special knowledge, equipment and software.
Is voice identification more of an art or a science?
I think it's both. The art part of it is getting a good exemplar and making a good short-term memory tape. The science part is that the machine makes a spectrogram from what you give it. When deciding, "Yes it is this person," or "No it's not," that's when your expertise comes in. The human analyst and the computer technology are equally important.
How reliable is voice identification?
It is extremely reliable if you have good samples, meaning good recordings and a good number of words, at least 20. It is very reliable with the best samples. While it's always going to be reliable, your opinion is going to determine how strong your reliability is. In other words, there are several possible results: positive, probable, possible, etc. Each delineation has its own standards.
I understand that the government used a computer program to analyze the same bin Laden tape. Should we trust voice identification performed solely by a computer?
No, in my opinion you can't eliminate the expert. There are many studies done with just a computer spitting out a spectrogram--this is what biometrics is all about. Biometrics is a science where you identify people from a large pool, looking at voices, eyes, retinas, whatever. Biometrics is used a lot in government, and you see it in movies all the time. The problem with the voice is it's too dynamic. Because of this, biometrics doesn't work very well, especially with forensic tapes, meaning bad tapes.
When I was on the news with the Osama bin Laden tapes, there were people who called in and said, "I thought the government has this program where they can throw one voice in among 150 and it spits out with 99% accuracy who the voice is." Well, those programs work fine if you're dealing with perfect tapes. But in the forensic community, very seldom do you get perfect tapes. That's where you have to rely on your skills and experience as an examiner, because there is no "throw it in the hopper and it comes out right" answer. Biometrics doesn't work with bad forensic tapes, and the bin Laden recording was a poor recording. That's why the government was hedging on coming out and saying it was or was not bin Laden. They knew that the biometrics could fool them. I came out on Thursday with my opinion, and they came out the following Monday. We came to the same opinion, but they kept stalling the press, and that's why the press came to me.
Would you tell me about a few of your most interesting and challenging cases?
For most people that usually means celebrity cases. I did the Woody Allen and Mia Farrow trial, which was a video authenticity case. Julio Iglesias was an audio authenticity case; Mariah Carey was an audio authenticity case. I did the Osama bin Laden case and a couple of Mob cases. There have been others that have involved huge amounts of money, and some of the most interesting ones I can't talk about because I had to sign non-disclosure agreements.
VOICE IDENTIFICATION: THE AURAL SPECTROGRAPHIC METHOD
Steps for Voice ID Case Procedure
1. Receive, mark and photograph evidence tapes, recorders and containers.
2. Physical inspection, tape inspection, lot number, condition.
3. Track configuration: Mono or Stereo, 1 or 2, control track, etc.
4. Azimuth and Zenith alignment on lab recorder.
5. Playback speed analysis and adjustment.
6. Load into computer for electronic enhancement.
7. Critical listening and notes.
8. Create "unknown" word and phrase list.
9. Take verbatim exemplar and create known "best" word list and phrases.
10. Create an audio unknown/known short-term memory tape for aural comparison.
11. Do the visual comparison of the spectrograms of the unknown/known ST phrases.
12. Analyze the results and form conclusions, offer an opinion. Write report.
13. Write to an archive file, make copies and send report to client with original materials (FedEx or Certified Mail). Include all Rule 26 requirements.
Voice ID Criteria
With special thanks to Sgt. Lonnie Smrkovski, Michigan State Police (Ret.)
Aural Cues
1. Perceived pitch (e.g., voice sounds high or low).
2. Quality (e.g., street talk vs. educated speech).
3. Rate (how fast or slow a person speaks).
4. Mannerisms (e.g., someone who speaks fast and then slows down at the end of a sentence, or "Sopranos" guys who end every sentence with "forget-about-it").
5. Amplitude (how loud someone speaks).
6. Pathologies (e.g., a harelip, a lisp or a stutter).
7. Breath patterns.
8. Dialect/accent.
9. Syllable coupling (the way we put the words together when we speak).
Visual Cues
1. Bandwidth.
2. Mean frequency (vibrations of the vocal cords per second; average male has a mean frequency of 130 cycles per second and average female is 150-160).
3. Trajectory of formants (on a spectrogram the formants are shapes that represent the vocal energy of the words that we're speaking, and our voices).
4. Inter-formant information/intra-formant.
5. Fricatives ("ch" sounds).
6. Plosives ("P" sounds).
7. Gaps (refers to syllable couplings, how we put words together when we speak).
8. Consonants (have a distinctive look and shape on a spectrogram).
9. Transitions between consonants and vowels.
10. Transition between words.
11. Rate (average number of words spoken per minute).
12. Pitch.
13. Distribution.
14. Nasal patterns distribution.
15. Evidence of pathology (e.g., nasality, lisp, etc.).
16. Relative intensity.
17. Other spectral data.
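Of the visual cues above, mean frequency is the easiest to illustrate numerically. The sketch below estimates the fundamental frequency of a short voiced frame from its autocorrelation peak. Autocorrelation is a common textbook technique swapped in for illustration, not a method attributed to the examiner. The 8000 Hz sample rate, the 60 to 400 Hz search range and the synthetic 130 Hz signal (the figure the list gives for an average male) are assumptions.

```python
import math


def estimate_f0(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)  # shortest period searched
    max_lag = int(sample_rate / fmin)  # longest period searched
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # A signal shifted by exactly one period lines up with itself,
        # so the sum of products peaks at that lag.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


SAMPLE_RATE = 8000  # assumed sample rate

# Synthetic voiced frame: a 130 Hz fundamental plus a weaker second harmonic.
voiced = [
    math.sin(2 * math.pi * 130 * n / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 260 * n / SAMPLE_RATE)
    for n in range(800)
]

f0 = estimate_f0(voiced, SAMPLE_RATE)
```

Because the lag is a whole number of samples, the estimate lands near 130 Hz rather than exactly on it. Averaging such estimates over many voiced frames would give a mean frequency for the speaker.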
TENETS OF AUDIO AUTHENTICITY
1. Recording device was capable.
2. Operator was competent to operate the device.
3. The recording is authentic and correct.
4. Changes, additions or deletions have not been made in the recording.
5. The recording has been preserved in a manner shown to the court.
6. The speakers are identified.
7. The conversation elicited was made voluntarily and in good faith without any kind of inducement.
Basic Methodology
1. Receive evidence, photograph tapes, recorders and containers, punch tabs.
2. Physical inspection, tape inspection, run lot number, condition.
3. Track configuration, develop for tracks.
4. Azimuth and Zenith alignment on lab/evidence recorder, speed adjustment.
5. Critical listening (for things like ticks and pops and what appear to be stops and over-recordings, slurred words, speed changes, etc.).
6. Waveform analysis (way to verify the things you've heard in the listening process), load into computer, from lab and original recorder (16-bit/44.1 kHz sample rate minimum).
7. Spectrum analysis, FFT (looking in the time and frequency domain to learn certain things about the tape, about the noise on the tape, about the signatures on the tape, electronic information about everything that happens on that tape).
8. Magnetic development (put a Freon-based ionized-particle solution on the tape to develop the signatures). When you start a tape recorder the heads go up and touch the tape and leave a signature, like a fingerprint, that's indigenous to that tape recorder. When you stop it the tape heads pull away and demagnetize, leaving a signature.
9. Record test recordings for comparison.
10. Note all signal anomalies, print waveforms together with the magnetic prints of transients and signatures, spectrograms if applicable--evidence tape and recorder. Compare.
11. Write report of your findings. Authentic or not. Conform to Rule 26 of the Federal Code for Expert Witness Opinion.
12. Make an archive copy of all your findings and send the report and the original materials, including all the Rule 26 requirements, to the attorney through FedEx or Certified Mail.
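As a rough digital analogue of the waveform analysis step, the sketch below flags places where the signal jumps abruptly from one sample to the next, which is how a crude edit or a "tick" might appear once a tape is loaded into a computer. It does not reproduce the magnetic development or signature analysis in the list. The function name, the threshold and the synthetic splice are hypothetical.

```python
import math
import statistics


def find_discontinuities(samples, threshold=8.0):
    """Return indices where the sample-to-sample jump is unusually large."""
    diffs = [abs(samples[i] - samples[i - 1]) for i in range(1, len(samples))]
    typical = statistics.median(diffs)
    if typical == 0:
        return []
    # diffs[j] is the jump between samples j and j + 1.
    return [j + 1 for j, d in enumerate(diffs) if d > threshold * typical]


SAMPLE_RATE = 8000  # assumed sample rate
SPLICE_AT = 210     # hypothetical edit point, in samples

# A 200 Hz tone whose polarity flips at the splice, imitating a crude edit.
recording = []
for n in range(400):
    value = 0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
    recording.append(value if n < SPLICE_AT else -value)

suspect_points = find_discontinuities(recording)
```

On this synthetic signal the only flagged index is the splice point. Real recordings carry noise and genuine transients, so any flagged point would only be a candidate to check against what was heard in critical listening.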