Improving Spatial Audio Quality with Motion-to-Sound Latency Measurement
(Writer: James Seo)
[Introduction: Synopsis]
Spatial Audio is an audio rendering technology that reproduces sound according to the user's position and direction when listening through headphones or earphones, creating the illusion of a natural and lifelike sound experience. The quality and performance of Spatial Audio depend not only on the rendering of spatial and directional characteristics but also on the time it takes for the sound to be rendered and played back based on the user's movement (Motion-to-Sound Latency).
- If the Motion-to-Sound Latency is too long, users may lose immersion and even experience motion sickness due to the discrepancy between the visual stimuli and auditory experience. This is the same concept as motion-to-photon latency, which causes motion sickness when using VR devices.
Measuring Motion-to-Sound Latency is essential for evaluating the user’s spatial audio experience. However, this measurement is not a simple task. The overall Motion-to-Sound Latency can be divided into several parts, as shown in Figure 1: (1) motion-to-sensor latency, which detects the user's motion, (2) sensor-to-processor latency, where the detected movement is transmitted from the sensor to the audio processor, (3) rendering latency occurring during the rendering process in the processor, and (4) communication latency that occurs when the rendered signal is transmitted through communication channels such as Bluetooth. Measuring these components independently is not easy, especially for finished products on the market, as it is impossible to break down each module inside for measurement.
In this article, we will explain a method for more accurately measuring Motion-to-Sound Latency. The content is not too difficult, so we believe you can follow along and understand it well.
Figure 1 Breakdown of Motion-to-Sound Latency
[Measurement Hypothesis: Binaural Rendering & Crosstalk Cancellation]
As mentioned earlier, Spatial Audio utilizes Binaural Rendering technology to render sound sources in a space according to the relative position of the sound source and the listener. In other words, it is a technology that aims to reproduce not only the position of the sound source but also the feeling of the space in which it exists.
Generally, Binaural Rendering is performed using Binaural Filters such as BRIRs (Binaural Room Impulse Responses) and HRIRs (Head-Related Impulse Responses).
Figure 2 HRIR (Left) vs. BRIR (Right)
(HRIR: The relationship between the listener and a sound source that doesn’t contain listening room characteristics)
(BRIR: The relationship between the listener and a sound source that contains listening room characteristics)
At its core, a Binaural Filter describes how sound arriving from a specific 'position' is altered on its way to the left and right ears. It can therefore be defined as a function of distance, horizontal angle, and vertical angle. The definition varies depending on whether the filter reflects the spatial characteristics contributed by reflected sound (BRIR) or only the relationship between the sound source and the listener's ears (HRIR). A BRIR represents both the direct and the reflected sound in the form of a Binaural Filter, while an HRIR considers only the direct sound and excludes reflections. Naturally, a BRIR has a much longer response length than an HRIR.
Typically, Spatial Audio uses BRIR rather than HRIR, so in this article, we will explain based on BRIR.
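To make the idea concrete, here is a minimal Python sketch of binaural rendering by convolution. It assumes you already have a mono source signal and a left-ear/right-ear BRIR pair measured for one position; the function name, array names, and the use of NumPy/SciPy are illustrative choices, not part of any particular product.

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_render(mono, brir_left, brir_right):
    # Convolve the mono source with the BRIR pair measured for one position:
    # the result is what each ear would receive for a source at that position.
    left_ear = fftconvolve(mono, brir_left)
    right_ear = fftconvolve(mono, brir_right)
    return np.stack([left_ear, right_ear])
```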
Figure 3 An Example of Impulse Responses of (a) HRIR and (b) BRIR
First, you need to determine two virtual sound source positions. It is better to choose two points that belong to different sides based on the median plane (the plane that divides left and right equally; in this context, it means the plane dividing left and right with the user at the center). The reason is that this measurement method uses the crosstalk cancellation phenomenon.
(You may be wondering what the crosstalk cancellation phenomenon is. Well, you will naturally learn about it as you finish reading this article!)
Figure 4 An example of virtual speaker positions for M2S measurement
Once the positions of the two virtual sound sources, which sit on different sides of the median plane as shown in Figure 4, are determined, you can measure two pairs of BRIRs, one pair from each sound source to both ears. These pairs are denoted [BRIR_LL, BRIR_LR] and [BRIR_RL, BRIR_RR]. In each name, the first letter after '_' indicates the position of the sound source (left or right), and the second letter indicates the ear (left ear or right ear). So, BRIR_LL refers to the impulse response from the left speaker propagating through space and reaching the left ear, right?
With these BRIR sets, you can calculate the magnitude difference and phase difference between the Ipsilateral Ear Input Signal (the signal transmitted from the same side sound source) and the Contralateral Ear Input Signal (the signal transmitted from the opposite side sound source) for any single-frequency signal. In simpler terms, you can calculate the magnitude difference and phase difference between the sound played from the left speaker to the left ear and the sound played from the right speaker to the left ear.
By calculating the magnitude difference and phase difference of these Ipsilateral Ear Input Signals and Contralateral Ear Input Signals for a specific frequency and using them in the form of an inverse function to modify the signal of the right virtual sound source, you can either completely eliminate the sound in the left ear due to crosstalk or play a much smaller sound compared to the right. It creates a barely audible sound. This magnitude difference and phase difference can be calculated from the magnitude response and phase response of BRTF (Binaural Room Transfer Function), which is the frequency domain representation of BRIR, or can be obtained by measuring it using a specific frequency.
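As a rough sketch of how that inverse relationship can be computed, the snippet below assumes you have BRIR_LL and BRIR_RL as sample arrays at sampling rate fs and want cancellation at a single frequency f0. The function name and the FFT-bin approach are just one convenient way to read the magnitude and phase off the BRTF; they are not prescribed by the method.

```python
import numpy as np

def cancellation_params(brir_ll, brir_rl, fs, f0=500.0):
    # Gain and phase offset for the right virtual channel so that, at f0,
    # the contralateral path (right source -> left ear) cancels the
    # ipsilateral path (left source -> left ear).
    n_fft = max(len(brir_ll), len(brir_rl))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    k = int(np.argmin(np.abs(freqs - f0)))   # FFT bin closest to f0
    h_ll = np.fft.rfft(brir_ll, n_fft)[k]    # BRTF value: left source -> left ear
    h_rl = np.fft.rfft(brir_rl, n_fft)[k]    # BRTF value: right source -> left ear
    ratio = -h_ll / h_rl                     # cancellation condition at the left ear
    return np.abs(ratio), np.angle(ratio)    # (magnitude ratio, phase offset)
```

The returned pair is exactly the magnitude difference and phase difference described above, expressed as the correction to apply to the right virtual channel.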
For example, the input signals without the magnitude difference and phase difference are as follows:
Figure 5 Uncontrolled input signal for left and right virtual speakers
In Figure 5 above, the top trace is the input signal of the left virtual channel and the bottom trace is that of the right virtual channel. They are completely identical signals. However, what if you calculate, from the BRIRs, the magnitude difference and phase difference at a specific frequency between the left-channel and right-channel paths to the left ear, and then change the magnitude and phase of the right virtual channel's signal so that the two contributions cancel at the left ear? The input signals would then look like Figure 6.
Figure 6 Controlled input signal for left and right virtual speakers
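For completeness, here is a hedged sketch of how the controlled pair of input signals in Figure 6 could be generated from that gain and phase offset. A pure tone at f0 is assumed, matching the 500 Hz tone used later in the measurement; the duration and function name are arbitrary.

```python
import numpy as np

def controlled_inputs(gain, phase, fs, f0=500.0, duration=5.0):
    # Left virtual channel: a plain f0 tone.
    # Right virtual channel: the same tone, scaled and phase-shifted so that
    # its contribution at the left ear cancels that of the left channel.
    t = np.arange(int(duration * fs)) / fs
    x_left = np.sin(2.0 * np.pi * f0 * t)
    x_right = gain * np.sin(2.0 * np.pi * f0 * t + phase)
    return x_left, x_right
```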
Now, what would the input signal to the left ear be when played with an input like Figure 6?
The results are shown in Figure 7.
Figure 7 An example of left ear input signal for uncontrolled and controlled input signal
In Figure 7, the solid line represents the left ear input signal when rendering the same signal to both virtual channels without adjusting the magnitude/phase difference, while the dotted line represents the left ear input signal when the adjusted magnitude and phase difference are applied to the right virtual channel signal. You can clearly see the reduced magnitude. This method is called "crosstalk cancellation." It involves adjusting the magnitude/phase difference skillfully so that the sounds transmitted from the ipsilateral and contralateral sides cancel each other out. Crosstalk cancellation occurs when the magnitude and phase differences are perfectly aligned, and if either of them does not meet the conditions, the output signal may even become larger.
When an input signal like the one in Figure 6 is rendered while the listener faces straight ahead, the signal entering the left ear will either be inaudible or heard only as a very faint sound. Since the BRIR includes a long tail corresponding to the late reverberation, some residual error may remain even if the exact magnitude/phase difference is calculated. Nevertheless, the sound produced under this condition has the smallest possible magnitude.
[Measurement Method & Results]
Figure 8 Block diagram for M2S measurement
Figure 8 shows the process of measuring M2S (Motion-to-Sound) Latency. The earlier explanation was about how to generate input signals for measuring M2S Latency and corresponds to the [input signal control stage] in the figure above. The generated signal is s_msr. Now, let's actually measure the M2S Latency. The input signal s_msr is fed into the Spatial Audio Renderer, which receives movement information from a device such as a TWS (True Wireless Stereo) earbud or another IMU-equipped device and performs spatial audio rendering accordingly.
First, we assume that the TWS detects the user's movement and sends the information. In the absence of user movement, the input signals to the left or right ears of the binaural output signal are either silent or playing a relatively very quiet sound due to crosstalk cancellation. When you rotate the TWS using a motor or similar at t=t0, the TWS detects the movement (in reality, the user's movement) and sends the corresponding movement information to the Spatial Audio Renderer. The Spatial Audio Renderer then renders and generates the output signal accordingly, which is played back through the TWS. When you capture the played sound, you can see the change in the envelope of the rendered signal as the crosstalk cancellation condition breaks, and from that, you can measure the M2S Latency.
However, there may be situations where you cannot directly acquire the output signal of the Spatial Audio Renderer due to environmental constraints. In such cases, you can use an external microphone to record and acquire the signal. There may be the influence of external noise in this case, but if you use a specific frequency, you can remove the noise using a bandpass filter. I'll explain in more detail through the following figure.
Figure 9 Recorded signal before (upper) and after (lower) bandpass filtering
The top panel in Figure 9 is the original measurement signal. The 'Moving Start' point in Figure 9 corresponds to t0 in Figure 8, so the section before 'Moving Start' represents the stationary state. In the stationary state, the Ear Input Signal for that direction is almost inaudible due to Crosstalk Cancellation. From the 'Moving Start' moment, the microphone records both the noise of the operating motor and the actual rendered signal; in the top panel, however, the rendered signal is small and the noise relatively large, making it impossible to tell when the crosstalk cancellation disappears. In this experiment, we used a 500 Hz pure tone as the input signal. Since we only need to look at the 500 Hz component, passing the top panel's signal through a bandpass filter with fc = 500 Hz cleanly removes the motor noise (this is the bandpass filter mentioned in the previous paragraph). The result is the bottom panel in Figure 9. Some time after the movement starts, the crosstalk cancellation condition breaks and the envelope of the recorded signal increases. The point marked "crosstalk cancellation disappears" therefore corresponds to t1 in Figure 8, and we can calculate the M2S Latency as t1 - t0.
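If it helps, here is a minimal example of the bandpass step just described, assuming the recording is a NumPy array sampled at fs. The 100 Hz bandwidth and 4th-order Butterworth design are arbitrary choices for the sketch, not prescribed values; zero-phase filtering is used so the filter itself adds no delay to the measurement.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def isolate_tone(recording, fs, f0=500.0, bandwidth=100.0):
    # Narrow Butterworth bandpass around the 500 Hz measurement tone,
    # applied forward and backward (zero-phase) to suppress motor noise.
    low, high = f0 - bandwidth / 2, f0 + bandwidth / 2
    sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, recording)
```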
There are various ways to find the point where the envelope increases. You could simply look for the section where the recorded signal's sample values start to grow, but that would be too inaccurate: if the cancellation is not perfect, the sample values keep changing even while cancellation holds. Instead of using individual sample values, you can divide the recorded signal into sections of a certain length and calculate the variance of the samples in each section. The key is choosing the section length. A shorter section gives higher time precision, but each section must still be longer than one period of the input signal. In the example above, we used a 500 Hz input signal, so each section must be at least 2 ms long; in other words, with a 500 Hz input signal the best achievable precision is 2 ms. If you want higher resolution, you can use a higher-frequency input signal, but be cautious: too high a frequency may slightly increase the error in the magnitude/phase difference calculated from the filter. Alternatively, you can measure the M2S Latency by extracting a sparse envelope from the measured signal, calculating the slope of the envelope, and taking the point where the slope changes sharply as the reference. Depending on the measurement environment and the recording results, you can choose whichever method works best.
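A simple sketch of the frame-variance approach described above follows. The 2 ms frame length comes from the 500 Hz tone, while the threshold ratio and the "first quarter is stationary" baseline are illustrative assumptions you would tune to your own recording; t0 is the known moment the motor starts rotating.

```python
import numpy as np

def m2s_latency(filtered, fs, t0, frame_ms=2.0, threshold_ratio=4.0):
    # Split the bandpass-filtered recording into frames at least one tone
    # period long (2 ms for a 500 Hz tone) and take the variance per frame.
    frame_len = int(fs * frame_ms / 1000.0)
    n_frames = len(filtered) // frame_len
    frames = filtered[: n_frames * frame_len].reshape(n_frames, frame_len)
    variances = frames.var(axis=1)
    # Baseline variance from the stationary part of the recording
    # (assumed here to be the first quarter, i.e. before the motor starts).
    baseline = np.median(variances[: max(n_frames // 4, 1)])
    above = np.flatnonzero(variances > threshold_ratio * baseline)
    if len(above) == 0:
        return None                       # cancellation never broke down
    t1 = above[0] * frame_len / fs        # time at which the envelope grows (t1)
    return t1 - t0                        # M2S latency in seconds
```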
Ultimately, as shown in Figure 9, we can measure the M2S Latency based on the originally recorded signal and the signal bandpass-filtered with fc=500 Hz. This latency includes all latencies occurring in the processes where all related information is exchanged, as shown in the earlier diagram. Therefore, this latency will be the actual latency experienced by the user.
[So when we actually measured the Latency…]
These days, newly released TWS (True Wireless Stereo) devices come equipped with spatial audio features that not only add a sense of space to the sound but also provide rendering functions that respond to users' head movements. As you may know, Gaudio Lab is a company boasting both the original and best technology in the field of spatial audio. To determine how quickly each product can respond to user movements and provide good quality, we measured the M2S (Motion-to-Sound) Latency of TWS devices from various manufacturers. The measurement values are based on the average of at least 10 measurements, and we have also recorded the standard deviation.
Using the measurement method described above, we measured the M2S Latency of various TWS devices, and the example values can be found in Table 1 below.
<Table 1 M2S latencies for different TWS [unit: ms]>
| | Apple AirPods Pro[1] | Samsung Galaxy Buds 2 Pro[2] | Gaudio Spatial Audio Mockup[3] |
| --- | --- | --- | --- |
| Average | 124.2 | 203 | 61.1 |
| Standard deviation | 12.18 | 12.61 | 6.17 |
The results were surprising. Gaudio Lab's technology, even though it is still at the mock-up level, recorded a significantly lower Motion-to-Sound Latency! The reason is that Gaudio Lab's Spatial Audio mock-up is based on the world's best Spatial Audio rendering optimization technology and runs directly on the TWS device, which eliminates the Bluetooth communication latency required by the phone-side rendering approach used by other major TWS manufacturers.
At the beginning of this article, we mentioned that the overall quality of Spatial Audio is determined not only by the sound quality of rendering space and direction characteristics but also by the time it takes for the sound to be reproduced from user movements (Motion-to-Sound Latency).
Through this article, which began by explaining how to measure the Motion-to-Sound Latency that enhances the quality of spatial audio, we were able to confirm that Gaudio Lab's technology records overwhelmingly outstanding figures.
You might be curious about the sound quality now that the latency is best-in-class. In the next article, we plan to reveal the sound quality evaluation experiment, which produced equally surprising results, along with sound samples. So, stay tuned~!
Ah! It's no surprise that Gaudio Lab won two innovation awards at CES 2023! Haha.
------
[1] Measurement based on rendering using an iPhone 11 and AirPods Pro. The actual rendering takes place on the iPhone, resulting in significant delay due to communication latency between the phone and TWS.
[2] Measurement based on rendering using a Galaxy Flip4 and Galaxy Buds 2 Pro. The actual rendering takes place on the Galaxy, resulting in significant delay due to communication latency between the phone and TWS.
[3] Measurement of the mock-up produced by Gaudio Lab, which eliminates communication latency by implementing rendering on the TWS chipset. The iPhone 11 is used only as the source device, and Spatial Audio Rendering is performed on the TWS.