
Introducing the first audiobook featuring sound created by AI! Wait, are you telling me AI made this?

2023.09.18 by Bright Kwon

Looking for that perfect sound? AI can now create it for you!

 

 

Hi, I'm Bright, a sound engineer at Gaudio Lab 🌟!

 

For sound engineers who need to produce a range of sounds, creating sound effects is often a tedious job that can eat up nearly a third of the workday. It’s like searching for a needle in a haystack—or in our case, trying to find a “warm iced coffee” or a design that’s both modern and classic. The odds of finding the ideal sound with just one search are pretty low.

 

So, people like me, let’s call us “sound nomads,” used to spend our days digging through extensive sound libraries to find the right effect for each scene. But that was then, and this is now.

 

In this age of advanced AI like ChatGPT, we asked ourselves, “Why can’t AI generate sound?”

Now, we’re thrilled to introduce Generative Sound AI FALL-E! (*applause* 👏🏻)


And guess what? This year, FALL-E’s creations were featured in an audiobook for the first time. Curious about how that happened?

 

 

Introducing the first audiobook enhanced with sound effects created by generative AI!

 

 

Gaudio Lab is providing its technology for a special summer thriller collection called <Incident Reports>. The audiobook is already making waves, especially since it’s directed by Kang Soo-Jin, a famous voice actor known for iconic roles like Detective Conan and Sakuragi Hanamichi from the Slam Dunk series. We’ve used Gaudio Lab’s spatial audio tech to really bring out the thriller vibes, and reviews say it makes the experience super immersive and real. (I can’t help but be proud; I worked on this myself!)

 

But here’s the twist: this year’s <Incident Reports> doesn’t just feature Gaudio Lab’s audio technology. We’ve also thrown in sound effects generated by FALL-E, a state-of-the-art sound AI. That makes it the very first audiobook to use AI-produced sound effects!

 

Even as a seasoned sound engineer, I was amazed by the sound quality that the AI managed to deliver. I found myself wondering, “Can these AI-created sounds really match up to recorded ones?” And you know what? They absolutely can.

 

Curious? Want to hear for yourself?

(Here are some generated sounds of thunder and lightning, for example.)


What did you think? I have to say, I was completely blown away when I first listened to the sample. Honestly, as a sound engineer, the thought of AI creating sounds had me a bit concerned about the quality. But you’ve heard it yourself, right? It delivers such high-quality sound that you can’t tell the difference from real recordings. And what sets FALL-E apart is its ability to create just about any sound, unlike other generative AIs that are limited to certain types of sounds. (And can you believe it was trained on over 100,000 hours of data?)

 

 

 

Don't miss the behind-the-scenes video!

 

 

Let me dive a bit deeper into the <Incident Reports> for you. (And don’t miss the behind-the-scenes details in the above video!)

 

Take the piece titled <Baeksi (白視) - The End of the Snowstorm> as an example from this project: the narrative unfolds on a snowy mountain in the middle of a ferocious blizzard. Within this setting, you’ve got several elements, or let’s call them sound objects, like snow plows, the blizzard, and avalanches making an appearance. Under normal circumstances, I would have been searching libraries and comparing tons of sounds, spending countless hours to match these elements. On top of that, the sounds of a snow plow or an avalanche aren’t something you come across every day, so creating these noises would have been quite difficult. I would likely have been digging through sound effect libraries and blending various sounds from different sources to get it just right.

 

However, when I gave FALL-E a prompt describing each of these sounds, it generated the sound effects for me.
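To give a feel for what that workflow looks like from the engineer’s side, here is a minimal sketch. The FallEClient class and its generate method are hypothetical stand-ins (FALL-E’s actual interface isn’t public); the point is simply that each sound object becomes a short text prompt and comes back as an audio file ready to drop into the session.

```python
# Hypothetical sketch of prompt-driven sound effect generation.
# FallEClient is a made-up stand-in, not Gaudio Lab's real API.
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 48000  # common post-production sample rate

class FallEClient:
    """Stand-in for a text-to-sound model."""
    def generate(self, prompt: str, duration_s: float = 8.0) -> np.ndarray:
        # A real model would synthesize audio conditioned on the prompt.
        # Here we return silence so the script runs end to end.
        return np.zeros(int(SAMPLE_RATE * duration_s), dtype=np.float32)

client = FallEClient()
prompts = [
    "snow plow scraping along a mountain road",
    "howling blizzard wind on an exposed ridge",
    "distant avalanche rumbling down a slope",
]
for prompt in prompts:
    audio = client.generate(prompt)
    filename = prompt.replace(" ", "_") + ".wav"
    wavfile.write(filename, SAMPLE_RATE, audio)  # one take per sound object
```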


Creating Unique Sounds with Generative AI

 

Using FALL-E, a generative AI model, we can craft brand-new sounds just by entering a simple prompt. Not only has it saved me a massive amount of time (no more sifting through libraries for that perfect sound!), but it also generates a unique sound every time, helping me create that one-of-a-kind sound I’ve always wanted.
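That “unique every time” property also changes how I audition sounds: instead of hunting for a single library hit, I can render several takes of the same idea and pick the one that fits the scene. Below is a rough sketch of that loop under the same assumption as before, with generate_sound as a placeholder rather than the real FALL-E API.

```python
# Hypothetical sketch: render several takes of the same prompt, then audition them.
# generate_sound is a placeholder for a real text-to-sound model call.
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 48000

def generate_sound(prompt: str, seed: int, duration_s: float = 10.0) -> np.ndarray:
    """Placeholder text-to-sound call; a real model would condition on the prompt."""
    rng = np.random.default_rng(seed)
    # Low-level noise stands in for generated audio so the script stays runnable.
    return (0.01 * rng.standard_normal(int(SAMPLE_RATE * duration_s))).astype(np.float32)

prompt = "gentle rain on a tin roof with distant thunder"
for take in range(3):  # three different renders of the same prompt
    audio = generate_sound(prompt, seed=take)
    wavfile.write(f"rain_take_{take}.wav", SAMPLE_RATE, audio)
```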

 

I have a soft spot for the sound of rain, and I’ve actually created several different versions of my favorite rain sounds. Want to hear how they change with each new prompt during the generation process? Give this a listen:

 


Wrapping up...

 

While working on sound projects, I now find myself chatting with FALL-E, a remarkable AI that generates sounds, instead of spending tedious hours clicking through sound effect libraries. It’s honestly quite surreal to be living in this day and age; it’s something I could only dream about in my early days as a sound engineer, when I hoped for a tool that would let me instantly create the exact sounds I imagined for each project.

 

Currently, as a sound engineer at Gaudio Lab, I’m bubbling with excitement, looking forward to the moment when many more people get to experience FALL-E. ☺️ And don’t worry, we’re constantly refining FALL-E to make it even better for everyone. Hang tight, we’re excited to share it with you all very soon!

 

 
