Which signal processing technique is suitable for your device?
Acoustic signal processing techniques such as beamforming and blind source separation improve the intelligibility of captured speech, but which technique is best for which application?
In an increasingly noisy world, it can be hard to hear clearly. And that’s as true for electronic devices as it is for humans, which is a problem if they’re designed to pick up or respond to our voices. The signals reaching their microphones are a mix of voices, background noise, and other interference such as room reverberations. This means that the quality and intelligibility of captured speech can be severely affected, resulting in poor performance.
Intelligible speech is crucial for technology ranging from telephones, computers and conferencing systems to transcription services, car infotainment, home assistants and hearing aid devices. Signal processing techniques such as beamforming and blind source separation (BSS) can help, but they have different advantages and disadvantages. So which technique is best for which application?
Audio beamforming is one of the most versatile multi-microphone methods for emphasizing a particular source in an acoustic scene. Beamformers can be divided into two types, depending on how they work: data independent or adaptive. One of the simplest forms of data-independent beamformers is a delay-and-sum beamformer, in which the microphone signals are delayed to compensate for different path lengths between a target source and the different microphones. This means that when the signals are summed, the target source from a certain direction will experience coherent combination and the signals from other directions are expected to suffer from destructive combination to some extent.
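As a rough illustration of the delay-and-sum idea, the sketch below aligns and averages the microphone signals for one look direction. It assumes a linear array, a far-field (plane-wave) source and a speed of sound of 343 m/s. It applies the compensating delays as phase shifts in the frequency domain, which treats each block of samples as circular. The function names and the geometry are hypothetical, chosen only for this example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def arrival_delays(mic_positions, angle_rad):
    """Relative arrival time (s) of a far-field plane wave at each microphone.

    mic_positions: microphone coordinates along a linear array, in metres.
    angle_rad: source angle from broadside (0 = perpendicular to the array).
    """
    x = np.asarray(mic_positions, dtype=float)
    return -x * np.sin(angle_rad) / SPEED_OF_SOUND


def delay_and_sum(signals, mic_positions, angle_rad, fs):
    """Steer a linear array towards angle_rad and average the aligned signals.

    signals: array of shape (n_mics, n_samples); fs: sample rate in Hz.
    """
    signals = np.asarray(signals, dtype=float)
    n_samples = signals.shape[1]
    tau = arrival_delays(mic_positions, angle_rad)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # Undo each microphone's arrival delay with a linear phase shift.
    aligned = spectra * np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
    # Coherent average: the steered direction adds in phase, others partly cancel.
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)
```

A source in the steered direction passes through at full amplitude, while sources from other directions are only partly cancelled, by an amount that depends on frequency and array size.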
However, in many consumer audio applications, these types of beamformers are of little use because they need the signal wavelength to be small relative to the size of the microphone array. They work well in high-end conferencing systems with 1 m diameter microphone arrays containing hundreds of microphones to cover the wide range of wavelengths. But these systems are expensive to produce and are therefore only suitable for the corporate conferencing market.
Consumer devices, on the other hand, typically have only a few microphones in a small array, so delay-and-sum beamformers struggle because the long wavelengths of speech arrive across a small microphone array. A delay-and-sum beamformer the size of a normal hearing aid, for example, gives no directional discrimination at low frequencies, and at high frequencies its directivity is limited to some front/back discrimination.
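The effect of array size can be checked with a short calculation. The sketch below evaluates the narrowband gain of a delay-and-sum beamformer steered to the front for a plane wave arriving from behind. The two-microphone array with 1 cm spacing, the 343 m/s speed of sound and the two test frequencies are assumptions picked to resemble a hearing-aid-sized device, not measurements of any real product.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed


def delay_and_sum_response_db(freq_hz, mic_positions, steer_deg, source_deg):
    """Gain (dB) of a delay-and-sum beamformer for a plane wave from source_deg.

    Angles are measured from the array axis (0 = front, 180 = back).
    """
    x = np.asarray(mic_positions, dtype=float)
    # Residual path difference left after steering towards steer_deg.
    path_diff = x * (np.cos(np.deg2rad(steer_deg)) - np.cos(np.deg2rad(source_deg)))
    phases = 2.0 * np.pi * freq_hz * path_diff / SPEED_OF_SOUND
    return 20.0 * np.log10(np.abs(np.mean(np.exp(1j * phases))))


hearing_aid_mics = [0.0, 0.01]  # two microphones 1 cm apart (assumed spacing)
for freq in (250.0, 4000.0):
    rear_gain = delay_and_sum_response_db(freq, hearing_aid_mics, 0.0, 180.0)
    print(f"{freq:6.0f} Hz: rear source gain {rear_gain:6.2f} dB relative to front")
```

Under these assumptions the rear source is attenuated by well under 0.1 dB at 250 Hz and by only about 2.6 dB at 4 kHz, which is consistent with the limited front/back discrimination described above.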
Another problem is that sound does not travel in straight lines: a given source has several different paths to the microphones, each with different amounts of reflection and diffraction. This means that simple delay-and-sum beamformers are not very effective at extracting a source of interest from an acoustic scene. But they are very easy to implement and do give a small benefit, so they were often used in older devices.
One adaptive beamformer is the Minimum Variance Distortionless Response (MVDR) beamformer. It tries to pass the signal from the target direction without distortion while minimizing the power at the beamformer output. The effect is to preserve the target source while attenuating noise and interference.
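For a single frequency, the textbook MVDR solution can be written in a few lines. The sketch below computes the weights w = R⁻¹d / (dᴴR⁻¹d), where R is the covariance of the microphone signals and d is the steering vector for the target direction. The linear array, the far-field steering model and the synthetic covariance (weak sensor noise plus one strong interferer) are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed


def steering_vector(freq_hz, mic_positions, angle_rad):
    """Far-field steering vector of a linear array (angle from broadside)."""
    x = np.asarray(mic_positions, dtype=float)
    return np.exp(2j * np.pi * freq_hz * x * np.sin(angle_rad) / SPEED_OF_SOUND)


def mvdr_weights(cov, steering):
    """MVDR weights for one frequency bin; the output is y = w^H x."""
    r_inv_d = np.linalg.solve(cov, steering)
    # Normalise so the target direction passes with unit gain (no distortion).
    return r_inv_d / (steering.conj() @ r_inv_d)


mics = [0.0, 0.05, 0.10, 0.15]            # 4 mics, 5 cm spacing (assumed)
target = steering_vector(1000.0, mics, 0.0)
interferer = steering_vector(1000.0, mics, np.deg2rad(60.0))
# Synthetic covariance: weak uncorrelated noise plus one strong interferer.
cov = 0.01 * np.eye(4) + np.outer(interferer, interferer.conj())

w = mvdr_weights(cov, target)
print("gain towards target:    ", abs(np.vdot(w, target)))
print("gain towards interferer:", abs(np.vdot(w, interferer)))
```

The target gain stays at one by construction, while the gain towards the interferer is driven close to zero. This idealised result relies on the steering vector being exact, which is the assumption that tends to fail in practice.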
This technique may work well under ideal laboratory conditions, but in the real world, microphone mismatch and reverberation lead to inaccuracies in modeling the effect of source location relative to the array. As a result, these beamformers often break down because they start to cancel parts of the target source. A voice activity detector can be added to address this target cancellation problem, so that beamformer adaptation is disabled whenever the target source is active. This can work well when there is only one target source, but with multiple competing speakers the technique has limited effectiveness.
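One common way to implement the voice-activity gating described above is to freeze the running noise estimate whenever the detector reports target speech. The sketch below shows that idea for a single frequency bin. The recursive averaging, the smoothing constant and the function name are assumptions for illustration, and the voice activity detector itself is assumed to exist elsewhere.

```python
import numpy as np


def update_noise_covariance(noise_cov, frame, target_active, smoothing=0.95):
    """Recursively average the noise covariance, but only while the target is silent.

    noise_cov: current (n_mics, n_mics) covariance estimate for one frequency bin.
    frame: complex vector of microphone values for that bin in the current frame.
    target_active: flag from a voice activity detector (assumed to be available).
    """
    if target_active:
        # Freeze adaptation so the beamformer does not learn to cancel the target.
        return noise_cov
    frame = np.asarray(frame)
    instantaneous = np.outer(frame, frame.conj())
    return smoothing * noise_cov + (1.0 - smoothing) * instantaneous
```

With several talkers active at once there are few frames in which the detector can safely allow adaptation, which is one way to see why this approach loses effectiveness in that case.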
Additionally, MVDR beamforming – just like delay-and-sum beamforming and most other types of beamforming – requires calibrated microphones, as well as knowledge of the geometry of the microphone array and the direction of the target source. Some beamformers are very sensitive to the accuracy of this information and may reject the target source because it is not coming from the indicated direction.
Many modern devices use another beamforming technique called adaptive sidelobe cancellation, which attempts to cancel sources that do not come from the direction of interest. These are state of the art in modern hearing aids and allow the user to focus on the sources directly in front of them. But the major downside is that you have to look at whatever you are listening to, which can be inconvenient if your visual attention is needed elsewhere, for example when you are looking at a computer screen and trying to discuss what you see with a coworker.
An alternative approach to improving speech intelligibility in noisy environments is to use BSS. Time-frequency masking BSS estimates the time-frequency envelope of each source, then attenuates the time-frequency points that are dominated by interference and noise. Another type of BSS uses linear multi-channel filters. The acoustic scene is separated into its component parts using statistical models of the general behavior of the sources. BSS then calculates a multi-channel filter whose output best matches these statistical models. In doing so, it inherently extracts all sources from the scene, not just one.
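The masking step can be sketched as follows. The example assumes that envelope estimates for the target and for the interference are already available on a common time-frequency grid. Producing those estimates is the hard, "blind" part of the problem and is not shown. The Wiener-style soft mask, the gain floor and the function names are assumptions chosen for illustration.

```python
import numpy as np


def soft_mask(target_envelope, interference_envelope, eps=1e-12):
    """Per-point gain in [0, 1]: near 1 where the target dominates, near 0 otherwise."""
    target_power = np.square(target_envelope)
    interference_power = np.square(interference_envelope)
    return target_power / (target_power + interference_power + eps)


def apply_mask(mixture_tf, mask, min_gain=0.1):
    """Attenuate interference-dominated points of a complex time-frequency mixture.

    min_gain limits the attenuation, which tends to reduce audible gating artifacts.
    """
    return mixture_tf * np.maximum(mask, min_gain)
```

Time-frequency points where the interference envelope is much larger than the target envelope are scaled down to the gain floor, while target-dominated points pass almost unchanged.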
The multi-channel filter method can handle microphone mismatch, and copes well with reverberation and multiple competing speakers. It does not require any prior knowledge of the sources, the microphone array or the acoustic scene, since all of these variables are absorbed into the design of the multi-channel filter. A change of microphone, or a calibration error, simply changes the optimal multi-channel filter.
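A toy example can show how a separating filter is found from the data alone. The sketch below is a deliberately simplified stand-in for real acoustic BSS: it assumes two sources, two microphones and instantaneous mixing with no reverberation, whereas practical systems must handle convolutive mixtures. It whitens the microphone signals and then searches for the rotation that makes the outputs as non-Gaussian as possible, using kurtosis as the statistical model. The method and all names are assumptions for illustration, not a description of any particular product.

```python
import numpy as np


def whiten(x):
    """Remove the mean and decorrelate the channels to unit variance."""
    centered = x - x.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / centered.shape[1]
    eigval, eigvec = np.linalg.eigh(cov)
    whitener = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return whitener @ centered, whitener


def excess_kurtosis(y):
    """Excess kurtosis per row; zero for Gaussian signals, positive for speech-like ones."""
    return np.mean(y ** 4, axis=1) / np.mean(y ** 2, axis=1) ** 2 - 3.0


def separate_two_sources(x, n_angles=180):
    """Estimate a 2x2 unmixing matrix from the data alone (no geometry needed)."""
    z, whitener = whiten(x)
    best_score, best_rotation = -np.inf, np.eye(2)
    # After whitening, only a rotation remains unknown; scan it on a grid.
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        rotation = np.array([[c, s], [-s, c]])
        score = np.sum(np.abs(excess_kurtosis(rotation @ z)))
        if score > best_score:
            best_score, best_rotation = score, rotation
    unmixing = best_rotation @ whitener
    return unmixing @ (x - x.mean(axis=1, keepdims=True)), unmixing


rng = np.random.default_rng(0)
sources = rng.laplace(size=(2, 20000))       # two speech-like (super-Gaussian) sources
mixing = np.array([[1.0, 0.6], [0.4, 1.3]])  # includes a gain mismatch between the mics
estimates, unmixing = separate_two_sources(mixing @ sources)
```

Changing the entries of `mixing`, for instance to mimic a different microphone gain, changes the unmixing matrix that the search finds but not the procedure, which reflects the point that mismatch is absorbed into the filter. The outputs are recovered only up to ordering, sign and scale.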
Because BSS works from the audio data rather than the microphone geometry, it is a very robust approach that is insensitive to calibration issues and can generally achieve much better source separation in real-world situations than any beamformer. And, because it separates all sources regardless of direction, it can be used to automatically follow a multi-way conversation. This is particularly useful for hearing aid applications where the user wants to follow a conversation without having to manually interact with the device. BSS can also be very effective in VoIP calls, smart home devices and in-car infotainment applications.
But BSS is not without its problems. For most BSS algorithms, the number of sources that can be separated depends on the number of microphones in the array. And, because it works from data, BSS needs a consistent frame of reference, which currently limits the technique to devices with a stationary microphone array: for example, a tabletop hearing aid, a microphone array for fixed conferencing systems, or video calls from a phone or tablet that is held steady in your hands or resting on a table.
When there is chatter in the background, BSS will usually separate the most dominant sources in the mix, which can include the annoyingly loud person at the next table. So, to work effectively, BSS must be combined with an auxiliary algorithm that determines which sources are the sources of interest.
BSS alone separates sources very well, but does not reduce background noise by more than about 9 dB. To achieve really good performance, it has to be paired with a noise reduction technique. Many noise reduction solutions use artificial intelligence (AI), as Zoom and other conferencing systems do, to analyze the signal in the time-frequency domain and then try to identify which components are due to the signal and which are due to the noise. This can work well with a single microphone. But the big problem with this technique is that it extracts the signal by dynamically gating time-frequency content, which can lead to unpleasant artifacts at poor signal-to-noise ratios (SNR), and it can introduce considerable latency.
A low-latency noise cancellation algorithm combined with BSS, on the other hand, provides up to 26 dB of noise cancellation and makes products suitable for real-time use, with a latency of just 5 ms and a more natural sound with less distortion than AI solutions. Hearing aids, in particular, need ultra-low latency to maintain lip sync, as it is extremely off-putting for users if the sound they hear lags behind the mouth movements of the person they are talking to.
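To put the quoted figures in perspective, the short calculation below converts the 9 dB and 26 dB noise reduction values into power ratios and expresses the 5 ms latency as a number of samples. The 16 kHz sample rate is an assumption for illustration, since the text does not state one.

```python
def db_to_power_ratio(reduction_db):
    """Factor by which noise power is reduced for a given reduction in dB."""
    return 10.0 ** (reduction_db / 10.0)


def latency_in_samples(latency_ms, sample_rate_hz):
    """Number of samples that fit into the given latency."""
    return round(latency_ms * 1e-3 * sample_rate_hz)


print(f"9 dB  -> noise power reduced about {db_to_power_ratio(9.0):.1f}x")
print(f"26 dB -> noise power reduced about {db_to_power_ratio(26.0):.0f}x")
print(f"5 ms at 16 kHz (assumed rate) -> {latency_in_samples(5.0, 16000)} samples")
```

In power terms, 9 dB is roughly an 8-fold reduction while 26 dB is roughly a 400-fold reduction, and a 5 ms budget at 16 kHz leaves only 80 samples for all processing and buffering.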
With an increasing number of signal processing techniques to choose from, choosing the right one for your application is more important than ever. The choice requires considering not only the performance you need, but also the situation in which you need the application to operate and the physical constraints of the product you have in mind.
David Betts is the scientific director of the audio software specialist AudioTelligence. He has been solving complex audio problems for over 30 years, with experience ranging from audio restoration and audio forensics to designing innovative audio algorithms used in blockbuster movies. At AudioTelligence, Dave leads a team of researchers delivering innovative commercial audio solutions for the consumer electronics, hearing aid and automotive markets.