Reverberation is a phenomenon familiar to mankind since the time of prehistoric cave dwellers. Nowadays, it is commonly experienced whenever a sound is emitted in an enclosed space such as an office, a concert hall or a church. Perceptually, its effects are twofold: the sound becomes more diffuse and more “echoey” as the distance between the source and the listener increases.
The physical explanation behind these observations is illustrated in Figure 1: the sound emitted by a localised source propagates as waves in different directions. While some rays reach the microphone directly, many bounce off walls or other surfaces within the room before reaching it. At any given instant, the recorded sound is therefore the sum of the “direct” sound, which has just been emitted, and reflections of sounds emitted earlier.
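The summation described above can be sketched in a few lines of Python. This is a toy model, not a room simulator: each propagation path is reduced to a delay and a gain, and the function name `record` and all numerical values are made up for illustration.

```python
def record(source, paths):
    """Return the signal picked up by the microphone.

    source: list of samples emitted by the source.
    paths:  list of (delay_in_samples, gain) pairs, one per propagation
            path; (0, 1.0) stands for the direct path. All values used
            below are hypothetical.
    """
    longest_delay = max(delay for delay, _ in paths)
    recorded = [0.0] * (len(source) + longest_delay)
    for delay, gain in paths:
        for n, sample in enumerate(source):
            # Each path contributes a delayed, attenuated copy of the source.
            recorded[n + delay] += gain * sample
    return recorded


# A single click, heard directly and via two wall reflections.
click = [1.0, 0.0, 0.0]
paths = [(0, 1.0), (2, 0.5), (3, 0.25)]
recorded = record(click, paths)
print(recorded)  # [1.0, 0.0, 0.5, 0.25, 0.0, 0.0]
```

The click appears once at full strength (direct sound) and then again, weaker and later, for each reflection. In a real room there are thousands of such paths, which is what produces the continuous “ringing”.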
As the audio samples associated with Figures 2 and 3 show, the “ringing” introduced by the room acoustics becomes stronger as the distance between the recording device and the source increases.
Whilst reverberation is desirable in music to produce naturalness or even to create imaginary spaces, its uncontrolled effects on speech in everyday applications such as hands-free phone calls or teleconferencing are problematic. Speech intelligibility is indeed reduced in reverberant environments. A simple yet convincing experiment is to hold a conversation with one person standing at the entrance of a large church and the other at the far end of the building.
Human intelligibility of reverberant speech
Reverberation and background noise are the two main elements that have an impact on the quality and intelligibility of speech.
The sound arriving at the receiver (human listener, microphone, etc.) is made up of the direct sound (coming straight from the source, without any reflection) and the reverberant sound. The total energy of the reverberant sound can be decomposed into two parts: the early and the late reflections. The early reflections reach the receiver shortly after the direct sound and are partially integrated with it, creating a coloration effect on the sound and possibly increasing its intelligibility. The late reflections, consisting of all the reflections arriving after the early ones (typically after 50-80 ms), mainly have a detrimental effect on the perception of speech, as they create a diffuse sound field and make the speaker sound far away.
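To make the early/late decomposition concrete, the sketch below splits the energy of a toy room impulse response at a 50 ms boundary and reports the ratio in decibels. The boundary is taken from the lower end of the range quoted above; the function names, the toy response and the choice to count the direct sound together with the early reflections are assumptions made for illustration.

```python
import math


def split_energy(impulse_response, sample_rate, boundary_ms=50.0):
    """Split the energy of a room impulse response into early and late parts.

    The strongest tap is taken as the direct sound (an assumption that
    holds when the direct path is unobstructed). Energy from the direct
    sound up to boundary_ms after it counts as early, the rest as late.
    """
    direct = max(range(len(impulse_response)),
                 key=lambda n: abs(impulse_response[n]))
    boundary = direct + int(round(boundary_ms * sample_rate / 1000.0))
    early = sum(h * h for h in impulse_response[direct:boundary])
    late = sum(h * h for h in impulse_response[boundary:])
    return early, late


def early_to_late_db(early, late):
    """Early-to-late energy ratio in decibels."""
    return 10.0 * math.log10(early / late)


# Toy response sampled at 1 kHz, so one sample equals one millisecond.
sample_rate = 1000
h = [0.0] * 200
h[0] = 1.0    # direct sound
h[20] = 0.5   # reflection after 20 ms: early
h[120] = 0.5  # reflection after 120 ms: late

early, late = split_energy(h, sample_rate)
ratio_db = early_to_late_db(early, late)
print(early, late, round(ratio_db, 2))  # 1.25 0.25 6.99
```

A larger ratio means that more of the energy arrives while it can still support the direct sound; moving the receiver away from the source typically lowers it.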
Removing the reverberation is therefore of great importance, as improved intelligibility finds applications in hearing-aid processing, law enforcement, speech recognition, source localisation, and many more voice-controlled communication systems.
Reverberant speech recognition
In the most common case, a speech recognition system can be considered as a black box with one input (speech) and one output (transcription). The system first processes the input speech and passes it to an internal decoder. The decoder then converts the processed speech into a sequence of words using two main resources:
- An acoustic model, which contains knowledge about the small acoustic units (e.g. phones) used to articulate words.
- A language model, which provides information about how likely a sequence of words is to occur.
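How a decoder might combine these two resources can be sketched as follows. The probabilities are invented, the candidate list is enumerated by hand, and `decode` is a hypothetical helper; a real decoder searches a far larger space of word sequences.

```python
import math

# Hypothetical scores for two candidate transcriptions of one utterance.
# Acoustic model: how well does the audio match the sounds of these words?
acoustic_log_prob = {
    "recognise speech": math.log(0.30),
    "wreck a nice beach": math.log(0.35),
}
# Language model: how likely is this sequence of words to occur?
language_log_prob = {
    "recognise speech": math.log(0.020),
    "wreck a nice beach": math.log(0.001),
}


def decode(candidates, lm_weight=1.0):
    """Pick the candidate with the best combined log-probability."""
    def score(words):
        return acoustic_log_prob[words] + lm_weight * language_log_prob[words]
    return max(candidates, key=score)


candidates = list(acoustic_log_prob)
best = decode(candidates)
acoustic_only = decode(candidates, lm_weight=0.0)
print(best)           # recognise speech
print(acoustic_only)  # wreck a nice beach
```

In this toy case the acoustic evidence alone slightly favours the wrong sentence, and the language model corrects it. When reverberation degrades the acoustic scores much more severely, the language model can no longer compensate.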
Figure 4 shows an example of the recognition of non-reverberant speech, where the transcription is nearly perfect. There is a single error, an inserted word (AND); the original transcription is MS. AMSTERDAM DECLINED TO COMMENT.
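Errors such as this insertion are usually counted by aligning the recognised words with the reference transcription. The sketch below does so with a standard minimum-edit-distance alignment and derives the word error rate (errors divided by the number of reference words). The position of AND in the hypothesis is assumed, since only the figure shows where it was inserted, and `count_errors` is an illustrative name.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, insertions, deletions) of a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (total, subs, ins, dels) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)   # only deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)   # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, n, d = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                match = (t, s, n, d)
            else:
                match = (t + 1, s + 1, n, d)
            t, s, n, d = cost[i][j - 1]
            insertion = (t + 1, s, n + 1, d)
            t, s, n, d = cost[i - 1][j]
            deletion = (t + 1, s, n, d + 1)
            # Tuples compare on the total error count first.
            cost[i][j] = min(match, insertion, deletion)
    _, subs, ins, dels = cost[len(ref)][len(hyp)]
    return subs, ins, dels


reference = "MS. AMSTERDAM DECLINED TO COMMENT"
hypothesis = "MS. AMSTERDAM AND DECLINED TO COMMENT"  # position of AND assumed
subs, ins, dels = count_errors(reference, hypothesis)
wer = (subs + ins + dels) / len(reference.split())
print(subs, ins, dels, wer)  # 0 1 0 0.2
```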
Figure 5 displays the transcription obtained when the input is the reverberant version of the recording presented in Figure 4. In this case, the transcription is totally incorrect. This poor performance is mainly due to the mismatch between the acoustic models (obtained with non-reverberant data) and the acoustic observations (the input reverberant speech). Figure 5 clearly illustrates the problem of recognizing reverberant speech.
The aim of dereverberation methods is to suppress or cancel the reverberation in reverberant speech, in order to increase intelligibility for human listeners and to improve speech recognition performance.
Spectral enhancement is one of the most popular techniques used to perform dereverberation on speech. It can be seen as a technique that removes unwanted ringing notes from a music score (see Figure 7). First, the speech sample is transformed into a time-frequency representation. Using local rules for each time-frequency element (or music note, in the analogy presented in Figure 7), the reverberant elements are identified and suppressed. Once the process has been applied, the signal is converted back to the time domain.
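The three steps (transform, suppress, transform back) can be sketched with NumPy as below. The text does not state which local rule is used, so the one here is an assumption chosen for simplicity: the late reverberant power in each time-frequency element is guessed to be a scaled copy of the power observed a few frames earlier, and a gain between a floor and 1 is applied accordingly. The frame size, `delay_frames`, `decay`, `floor` and the synthetic test signal are all hypothetical.

```python
import numpy as np

FRAME, HOP = 256, 128  # frame length and hop size (50 % overlap)


def sqrt_hann():
    # Periodic square-root Hann window: applied at analysis and synthesis,
    # overlapping frames add up to the original signal.
    return np.sqrt(np.hanning(FRAME + 1)[:-1])


def stft(x):
    """Time domain -> time-frequency representation (frames x bins)."""
    window = sqrt_hann()
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec):
    """Time-frequency representation -> time domain (overlap-add)."""
    window = sqrt_hann()
    out = np.zeros(HOP * (spec.shape[0] - 1) + FRAME)
    for i, frame in enumerate(np.fft.irfft(spec, n=FRAME, axis=1)):
        out[i * HOP:i * HOP + FRAME] += frame * window
    return out


def suppress_late_reverb(spec, delay_frames=4, decay=0.5, floor=0.1):
    """Assumed local rule: attenuate elements dominated by older energy."""
    power = np.abs(spec) ** 2
    late = np.zeros_like(power)
    late[delay_frames:] = decay * power[:-delay_frames]
    gain = np.maximum(1.0 - late / np.maximum(power, 1e-12), floor)
    return gain * spec


# Synthetic example: a noise burst followed by a decaying reverberant tail.
rng = np.random.default_rng(0)
dry = np.zeros(4096)
dry[:512] = rng.standard_normal(512)
rir = rng.standard_normal(2048) * np.exp(-np.arange(2048) / 400.0)
rir[0] = 1.0  # direct path
reverberant = np.convolve(dry, rir)[:4096]

spec = stft(reverberant)
enhanced = istft(suppress_late_reverb(spec))

tail = slice(1024, None)  # region after the burst, dominated by reverberation
print(np.sum(reverberant[tail] ** 2), np.sum(enhanced[tail] ** 2))
```

With the suppression step left out, `istft(stft(x))` returns the input unchanged except for the first and last half-frame, so any difference between `reverberant` and `enhanced` comes from the gain rule alone. Practical systems use more careful estimates of the reverberant power than this fixed-delay guess.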
A possible approach to improve the performance of speech recognition systems in reverberant environments is to train the system (specifically the acoustic models) employing reverberant recordings. Figure 6 shows the improvement achieved with this approach. In this case only one word (FORM) was incorrectly recognized.
Marie Curie Actions – DREAMS project
WHAT IS MARIE CURIE ACTIONS?
Marie Curie Actions is a European research grant programme open to researchers regardless of their nationality. Its main purpose is to train scientists by providing them with an adequate research environment (institution and equipment) as well as opportunities in both the academic and industrial worlds. Thanks to the international collaborations between the partners within a project and the work carried out in the private sector, the programme offers more than research skills alone.
THE DREAMS PROJECT
In 2013, the Dereverberation and REverberation of Audio, Music, and Speech (DREAMS) project was launched. It gathers 12 early stage researchers (working as researchers enrolled in a PhD programme) and 4 experienced researchers (or doctors) under the supervision of specialists from both the private and academic sectors spread across Europe (United Kingdom, Belgium, Germany and Denmark). The project's aim is to investigate the problem of modeling, controlling, removing, and synthesising acoustic reverberation in order to enhance the quality and intelligibility of audio, music, and speech signals.