
Patently Gaudio #1 - Phase-matched Binaural Rendering

2023.12.07 by Henney Oh

The Importance of Uncompromised Sound Quality:
Gaudio Lab’s Solution to Sound Quality Distortion in Spatial Audio

 

 

At Gaudio Lab, we tackle the world’s audio challenges with software products that incorporate innovative audio source technology and make people's lives better. The term ‘source technology’ implies that Gaudio Lab is a hub of originality. One method we use to share our innovations with the world is through patents. Many of our core technologies are protected this way, and we hold a substantial number of patents, especially considering our company’s size and years in operation. However, sometimes we may strategically decide not to patent certain technologies, despite their originality.

 

A patent is a system granting inventors exclusive rights for a certain period (typically 20 years), in exchange for making their technological advancements public. Because the details are published, competitors can study the technology easily, and once the period ends anyone is free to use it. The responsibility to prove any unauthorized use or infringement of the technology mainly falls on the patent holder. In cases where proving infringement is challenging, it might be more advantageous for us not to disclose the technology. This approach may seem at odds with the advancement of humanity, but for a company focused on profit, freely sharing extensive research and development is a tough decision. Nonetheless, we at Gaudio Lab have filed over 100 patents since we started, a noteworthy achievement even for a company driven by technology.

 

You can find a list of our publicly disclosed patents here: https://www.gaudiolab.com/company/patents 

They can also be explored in detail through a simple Google search. However, for those not in the field, understanding these patent documents can be quite difficult. Therefore, we have decided to make our patents more approachable. We plan to explain them in simpler terms, focusing on (1) the problems we aimed to solve, (2) the main ideas of our inventions, and (3) the benefits these inventions bring.

 

After careful consideration, Gaudio Lab decided that the first topic to discuss from our extensive portfolio of over 100 patents would be on Spatial Audio. This is a field where we consider ourselves to be the original experts. More specifically, we focused on the crucial technology of Binaural Rendering in headphones (earbuds). Despite narrowing down our focus, there were still over 50 patents to choose from. It’s important to note that all our patents are significant (if they weren’t, we wouldn’t have invested the time and expense in filing them). Each patent is highly valued, much like a cherished child, filled with the inventors’ hard work and dedication. While this series will eventually cover all of our patents, the first episode is always critical.

 

The first patent we chose to discuss is US 10,609,504 B2 (Audio signal processing method and apparatus for binaural rendering using phase response characteristics). To help make understanding patents easier, let’s start with how to read the unique number of this patent. ‘US’ indicates that it is a United States patent. Other countries use two-letter codes like KR for Korea, CN for China, and JP for Japan. The ‘B2’ shows that this patent has been ‘granted’, meaning it has been reviewed and recognized by the United States Patent and Trademark Office. Patents that are still under review and not yet granted are marked with symbols like A1/A2. The number ‘10,609,504’ is the serial number assigned by the US Patent Office, suggesting it’s approximately the 10,609,504th patent since the patent system was established in the United States. Back in Edison’s era, patents were numbered in the tens to hundreds of thousands. The text in the brackets is the title of the invention. Sometimes this title clearly describes the invention, while other times it can be vague or less helpful, and this is often intentional. The title does not limit the rights of the patent, allowing some freedom in its wording.

 

This patent has also been filed and registered in KR, CN, and JP, with the following equivalent patents:

 

  • KR 10-2149214 (Audio signal processing method and apparatus for binaural rendering using phase response characteristics)
  • CN 110035376B (使用相位响应特征来双耳渲染的音频信号处理方法和装置)
  • JP 6790052 B2 (位相応答特性を利用するバイノーラルレンダリングのためのオーディオ信号処理方法及び装置)

 

The full texts of these patents are available at the following link.

For US patents, visit: https://patents.google.com/patent/US10609504B2/en?oq=US10609504B2

 

These patents are collectively known as family patents. As patent laws vary by country, securing global protection for an invention requires separate applications and registrations with each country’s patent office. The recognition of a technology and the extent of rights it receives can differ from one country to another, and even among examiners within a patent office. As a result, the same technology might have a different scope of rights in various countries, and in some instances, it might not get registered at all.

 

Now, let’s look more closely at what the patent entails.

 

 

[The Problem Addressed by the Invention]

 

Binaural rendering, a method for producing spatial sound through headphones, involves applying a filter known as an HRTF (Head-Related Transfer Function) to the audio signal. For a detailed overview of the binaural rendering technology, please see this link. Often, however, it is necessary to apply not just one, but several filters simultaneously. An HRTF is a filter linked to a specific point in space. For example, if the sound of a sparrow corresponds to one point, then the roar of an elephant might be associated with a broader area, requiring multiple HRTFs to accurately depict this sound. Similarly, if a sound reflects off a wall and reaches our ears, it would need different HRTFs for the original and reflected sounds, as they come from different directions. Furthermore, if the impact of the HRTF is too strong (which might mean a significant change in sound quality) and needs to be softened, the process of softening also requires a filter, leading to the use of multiple filters.

 

In real-world applications of binaural rendering, there are many situations where applying multiple filters to a single sound source becomes necessary. This is not exclusive to binaural rendering; in general, adding effects to audio often involves the complex use of multiple filters.

 

What issue arises when multiple filters are superimposed? When the input audio signal passes through filters that each introduce a different time delay and the outputs are summed, the result can unintentionally distort the sound quality. This is known as the comb filter effect. The term comes from the pattern that appears on the frequency spectrum, resembling the teeth of a comb, where certain frequencies are significantly amplified and others are notably reduced, altering the original sound dramatically. It typically occurs when a signal passes through two filters with different delays.
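To make the effect concrete, here is a minimal sketch that is not taken from the patent. It assumes the two filters are nothing more than unit-gain pure delays, and the function name `summed_response_db` and the 1 ms delay gap are illustrative choices. Summing the two paths produces a deep notch wherever the delay gap equals half a period, and no loss at all when the delays are aligned.

```python
import cmath
import math

def summed_response_db(freq_hz, delay_a_s, delay_b_s):
    """Magnitude in dB of two unit-gain pure-delay paths summed together.

    Normalized so that perfectly aligned paths give 0 dB.
    """
    w = 2 * math.pi * freq_hz
    h = cmath.exp(-1j * w * delay_a_s) + cmath.exp(-1j * w * delay_b_s)
    mag = abs(h) / 2
    # Clamp to avoid log10(0) at the exact centre of a notch.
    return 20 * math.log10(max(mag, 1e-12))

delay_gap_s = 0.001                      # assumed 1 ms mismatch between paths
first_notch_hz = 1 / (2 * delay_gap_s)   # notches fall at odd multiples of this

for f in (250, 500, 1000, 1500, 2000):
    print(f, "Hz:", round(summed_response_db(f, 0.0, delay_gap_s), 1), "dB")
```

With a 1 ms gap the notches repeat every 1 kHz starting at 500 Hz, which is the regular "teeth" pattern described above. Real HRTFs have frequency-dependent magnitude and phase, so the pattern they produce is less regular than in this idealized case.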

 

Fig.1: Example of the comb filter effect
(image source: http://www.sengpielaudio.com/calculator-combfilter.htm)

 

Fig.2: An example of the comb filter effect – Periodic distortions that mimic the appearance of a comb’s teeth emerge in the frequency response

(Referenced from Fig. 24 of patent US 10,609,504 B2)

 

It would seem that if we could just align the delays of the multiple filters we plan to use in parallel, there would be no issue. And that is a valid point. However, there’s an added complexity when it comes to HRTFs. An HRTF is a filter that’s obtained by measuring sounds played from various directions using microphones placed in human ears or a mannequin. Despite precise measurements, slight variations in delays between filters are unavoidable due to measurement errors that can arise from many factors. These variations cause slight differences in delay at each frequency of the filter response, leading to the unintentional comb filter effect when multiple HRTFs are used in parallel, which can degrade the audio quality.

 

This patent was developed to overcome this specific problem.

 

 

[The Fundamental Concept of the Invention (Subject Matter)]

 

Identifying the problem clearly can sometimes lead to a surprisingly straightforward solution, and that is the case here. Since the central issue is that the delay varies across frequencies and from one HRTF filter to another, the core idea, or the subject matter, of this invention is to “align them uniformly.” Up to this point, we have used the term delay for clarity, but in signal processing terms a delay appears on the frequency axis as phase: a constant delay corresponds to a phase response that is a straight line over frequency. Making the phase response linear therefore means setting one consistent delay for all frequencies, and by ensuring this fixed delay is the same for every filter, we can eliminate the comb filter distortion. Below is a graphic representation showing the phase response of the original HRTF before and after this linearization process.

 

Fig.3: Fig. 4 from U.S. Patent US 10,609,504 B2 – Demonstrates the phase response of the original HRTF alongside the linearized phase response
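The idea of phase linearization can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation described in the patent: the two filters are made-up unit-magnitude responses with slightly different, frequency-dependent phase, and `linearize_phase`, `worst_dip_db`, and the 1 ms common delay are hypothetical names and values. Each frequency bin keeps its magnitude while its phase is replaced by the straight line -2πf·delay.

```python
import cmath
import math

def linearize_phase(response, freqs_hz, delay_s):
    """Keep each bin's magnitude; replace its phase with -2*pi*f*delay."""
    return [abs(h) * cmath.exp(-2j * math.pi * f * delay_s)
            for h, f in zip(response, freqs_hz)]

def worst_dip_db(a, b):
    """Deepest loss (dB) of the summed response relative to a perfectly in-phase sum."""
    return min(
        20 * math.log10(max(abs(x + y) / (abs(x) + abs(y)), 1e-12))
        for x, y in zip(a, b)
    )

freqs = [100.0 * k for k in range(1, 41)]  # 100 Hz to 4 kHz

# Two hypothetical filters with mismatched, slightly irregular phase.
filt_a = [cmath.exp(-2j * math.pi * f * 0.0010) for f in freqs]
filt_b = [cmath.exp(-2j * math.pi * f * 0.0012 + 0.3j * math.sin(f / 500.0))
          for f in freqs]

common_delay_s = 0.001  # assumed shared delay applied to every filter
lin_a = linearize_phase(filt_a, freqs, common_delay_s)
lin_b = linearize_phase(filt_b, freqs, common_delay_s)

print("worst dip before:", worst_dip_db(filt_a, filt_b))
print("worst dip after: ", worst_dip_db(lin_a, lin_b))
```

Before linearization the summed response has deep dips where the two phases drift apart. Afterwards every bin of both filters shares the same phase, so the magnitudes simply add and the dips disappear.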

 

Integrating the concept of linearization into HRTFs involves a significant issue. HRTFs consist of filter pairs that represent the acoustic paths from the sound’s origin to both ears. The spatial effect of the sound depends on the relative relationship of these filter pairs, which includes their phase responses. If we linearize each filter within a pair without considering this relationship, we would alter their inter-aural phase difference (IPD), a key element in spatial perception. Such an alteration could lead to a loss of the spatial effect, which is central to HRTF functionality. To address this, the invention suggests linearizing only one filter of each pair and adjusting the phase of the other to maintain the IPD.

 

The filters in an HRTF pair are distinguished as ipsilateral HRTF for the nearer ear and contralateral HRTF for the farther ear. Since the ipsilateral HRTF captures more energy (sound is perceived as louder by the nearer ear), the method involves linearizing the phase response of the ipsilateral HRTF.
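The pairwise step described above can be sketched as follows. This is a simplified reading of the idea rather than the patented procedure: the HRTF pair is invented test data, and `linearize_pair` and its parameters are hypothetical names. The ipsilateral phase is replaced with a linear phase, and the contralateral phase is set to that linear phase minus the original inter-aural phase difference, so the IPD in every bin is unchanged.

```python
import cmath
import math

def linearize_pair(ipsi, contra, freqs_hz, delay_s):
    """Linearize the ipsilateral phase and shift the contralateral phase to keep the IPD."""
    new_ipsi, new_contra = [], []
    for h_i, h_c, f in zip(ipsi, contra, freqs_hz):
        ipd = cmath.phase(h_i) - cmath.phase(h_c)   # original inter-aural phase difference
        lin_phase = -2 * math.pi * f * delay_s      # linear phase for the ipsilateral ear
        new_ipsi.append(abs(h_i) * cmath.exp(1j * lin_phase))
        new_contra.append(abs(h_c) * cmath.exp(1j * (lin_phase - ipd)))
    return new_ipsi, new_contra

freqs = [100.0 * k for k in range(1, 41)]  # 100 Hz to 4 kHz

# Hypothetical HRTF pair: the contralateral side is quieter and arrives later.
ipsi = [1.0 * cmath.exp(-2j * math.pi * f * 0.0010 + 0.2j * math.sin(f / 700.0))
        for f in freqs]
contra = [0.5 * cmath.exp(-2j * math.pi * f * 0.0016 + 0.1j * math.cos(f / 900.0))
          for f in freqs]

new_ipsi, new_contra = linearize_pair(ipsi, contra, freqs, delay_s=0.001)

# The complex ratio between the ears carries both the level and phase differences.
max_ratio_change = max(
    abs(a / b - c / d) for a, b, c, d in zip(ipsi, contra, new_ipsi, new_contra)
)
print("largest change in inter-aural ratio:", max_ratio_change)
```

Because the ratio between the two ears is the same before and after, the inter-aural cues that place the sound in space are retained, while every ipsilateral filter processed this way shares the same linear phase.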

 

 

[Impact of the Invention]

 

This method allows any number of HRTFs to be layered without comb filter distortion. It addresses one of the most prevalent challenges in spatial audio production: preserving the integrity of the original sound quality.

 

Spatial audio is a technology that simulates the effect of sounds as if they are occurring in a real space, making it essential for applications like gaming, films, virtual/augmented reality, and Spatial Computing. Imagine creating the illusion that you’re in your room, yet feeling like you’re at a Taylor Swift concert in Carnegie Hall – this is often described as the “Being There” Experience. However, the application of HRTFs inherently alters the original sound, as it involves applying a transformative filter, leading to distortion as we’ve discussed. Thus, the quality of spatial audio technology hinges on its ability to minimize such distortion while delivering a lifelike auditory experience. The technology from this invention is expected to be increasingly important in this respect.

 

We have made an effort to thoroughly explain a core component of Gaudio’s spatial audio technology. We hope that you have gained some understanding of our commitment to crafting high-quality spatial audio experiences. Gaudio Lab is excited to continue showcasing our dedication to exceptional sound experiences through our patents, and we invite you to stay engaged with our future updates.
