[Firsthand Experience] FALL-E that Impressed Nadella: How Far Has Gaudio Lab Evolved?

2024.05.31 ・ by Dewey Yoon

Opening

FALL-E by Gaudio Lab is a generative audio AI that automatically creates sound tailored to various inputs such as images, text, and even videos.

Sound can be broadly categorized into 1) speech, 2) music, and 3) sound effects. FALL-E is specifically designed to focus on creating 3) sound effects.

While it's relatively easy to find AI that can generate or manipulate voices and music, finding AI capable of generating other types of sounds (like sound effects) is quite challenging.

From the sound of keyboard typing to footsteps, and the rustle of leaves in the wind... There are so many sounds around us! And now FALL-E aims to take on the task of generating these diverse sounds.

Gaudio Lab recently launched a closed demo webpage where users can experience FALL-E directly. Anyone can generate the sound they want simply by entering a prompt, as in the screens below:

Text to Audio Generation Screen

Image to Audio Generation Screen

I'd like to share the experience of AI Times reporter Jang Se-min, who tried out this demo page.

Through this experience, I encourage you to imagine the future that Gaudio Lab will bring.

Now, let's take a look at the full article below!

[Firsthand Experience] AI-Generated Sound Effects that Impressed Nadella: How Far Has Gaudio Lab Evolved?

Gaudio Lab (CEO Hyun-oh Oh), a specialist in sound AI, recently announced the release of a closed demo site that allows users to experience their own AI-generated sound effects.

Gaudio Lab's signature generative sound AI, 'FALL-E,' first garnered global attention at CES in Las Vegas this January. This is the same product that stunned Microsoft CEO Satya Nadella, who visited the booth and exclaimed, "Is this really a sound created by AI? Amazing!"

FALL-E is a 'multimodal AI' capable of processing not only text but also images, boasting technology that surpasses that of international competitors. Recently, Gaudio Lab completed its front-end development and is currently testing the solution with a limited number of users through the closed demo.

AI Times participated in the test, accessing the closed demo site to generate sounds based on several criteria.

To begin testing FALL-E’s basic functionality, I started by inputting text. Currently, only English prompts are supported.

The first prompt was "An old pickup truck accelerating on a dirt road." The generated sound accurately depicted the sensation of wheels rolling. Adding a bit more roughness could enhance the effect.

The second prompt was "Ambience of the interior of a crowded, rattling urban train." This one was incredibly realistic, almost indistinguishable from an actual recording.

Next, I tried "A demonic alien creature roaring and screaming," which was chilling enough to send shivers down my spine as soon as the sound played. This technology could be incredibly useful for genres like mystery, thriller, and horror.

Other prompts included "a door closed violently," "stepping on mud after raining," "ghost sound," and "HAHAHA- sound of a murderer chasing someone." All of these produced results that exceeded my expectations.

However, one downside is that it cannot generate dialogue or vocal sounds. For instance, the prompt '"Who is that?" voice with fear' did not produce any output.

A Gaudio Lab representative explained, "FALL-E was not developed to handle voices or music; it focuses on sound effects. While it includes non-verbal sounds like sneezing or coughing, generating verbal sounds requires different technologies, such as TTS (text-to-speech)."

Still, what's impressive is that the generated sound effects are of such high quality that they allow you to imagine “an entire story.”

One of the standout features is its ease of use. Similar to image generation AI, it can produce plausible sounds with just a few everyday words, without needing highly detailed or specific descriptions.

So, can even the “subtlest differences” be expressed through sound?

To verify this, I tested various prompts by slightly varying factors such as age, emotion, texture of objects, distance of sound, and scale. First, I tested how age differences are expressed through “the sound of a child's cry.”

I started with the prompt "A child is crying after ruining the test." However, the result wasn't what I expected. The voice sounded too young for a school test scenario. So, I added a specific age setting.

When I input "A 13-year-old boy student is crying after ruining the test," it generated a much more mature voice than before. It was possible to adjust the age using text alone.

To test the texture of objects, I compared chocolate and honey using the common descriptor 'sticky.' While it seemed easy to create distinct sounds for steel and honey, it appeared challenging to express similar viscosities with different sounds.

However, I was astonished upon hearing the results. FALL-E accurately captured the differences between the materials.

For emotion, I used the sound of a dog barking. One prompt was for an angry, alert bark, and the other for a puppy whining to go for a walk. Once again, the differences were clear, and the emotion was effectively conveyed.

Lastly, to gauge distance and scale, I used the sound of “a zombie growl.” I differentiated between “a single zombie growling nearby,” “multiple zombies growling from a distance,” and “multiple zombies growling nearby.”

When the scale was set to one, the sound expression was much more detailed. What was interesting was the difference in distance. Even with the same group of zombies, the sound was faint when they were far away, “as if a wall was blocking them.”

The final test, and the one I was most curious about, was “image input.” This feature is Gaudio Lab's key differentiator and a starting point for their ultimate goal. If entire videos could be used as input to generate sound, it could drastically cut the time needed for film production.

However, this is also technically challenging. While text input clearly conveys the user's intent, images require the AI to analyze many more aspects. The AI must reanalyze and calculate elements like emotion, distance, scale, texture, and age that were tested previously.

The most fascinating result was that the AI did not produce just one sound. FALL-E provided up to three separate sounds reflecting different objects and situations within the image, and a final “integrated version,” offering a total of four sounds. For example, in a scene of two people fighting, it generated sounds like ▲ clothes rustling, ▲ impact with the floor, and ▲ a window breaking.
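
The article does not say how FALL-E builds the “integrated version.” Purely as an illustration, the sketch below assumes it is a plain sample-by-sample sum of the separate sounds, scaled down when the sum would clip. The function name `mix_tracks` and the tiny sample values are hypothetical and are not taken from Gaudio Lab.

```python
from typing import List, Sequence


def mix_tracks(tracks: Sequence[Sequence[float]]) -> List[float]:
    """Sum tracks sample by sample and scale down if the sum would clip.

    Assumption: samples are floats in [-1.0, 1.0]. Shorter tracks are
    treated as silent after their last sample.
    """
    if not tracks:
        return []
    length = max(len(track) for track in tracks)
    mixed = [0.0] * length
    for track in tracks:
        for i, sample in enumerate(track):
            mixed[i] += sample
    # Peak-normalize only when the summed signal exceeds full scale.
    peak = max((abs(sample) for sample in mixed), default=0.0)
    if peak > 1.0:
        mixed = [sample / peak for sample in mixed]
    return mixed


# Hypothetical stand-ins for the three sounds from the fight scene.
rustle = [0.1, 0.2, 0.1, 0.0]   # clothes rustling
impact = [0.0, 0.9, 0.3]        # impact with the floor
glass = [0.0, 0.0, 0.8, 0.4]    # a window breaking

integrated = mix_tracks([rustle, impact, glass])
outputs = [rustle, impact, glass, integrated]  # three sounds plus one mix
```

A real system would more likely balance the level of each sound before mixing; the plain sum here only shows how three separate results and one combined result add up to four outputs.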

For image inputs, I used both “generated images” and “official movie stills.”

When I input a cartoon-style image generated by Lasco.ai, FALL-E did not recognize all objects accurately. In a scene where a girl and a dog were playing, it generated the sound of the dog barking but not the girl's laughter. This likely stems from the inherent ambiguity in drawings.

So this time, I used live-action images. I chose intense movie scenes from “John Wick,” “Transformers,” “Terminator,” and “Fast & Furious.”

While the AI recognized all the objects in these images, the sounds it generated were not as intense as the actual movie sound effects. It seems challenging to convey the full intensity of a movie from just a single still. If the AI could understand the context of the film, it might produce stronger sound effects.

I also tested images where the sounds were not obvious, such as a person riding a unicorn or a cow working. Even in these cases, the AI generated plausible sounds.

As seen in the video, the overall results of this test exceeded expectations. If CEO Nadella were to see this version, he would undoubtedly be even more astonished.

Gaudio Lab stated that they are striving to make it easy for anyone to create the desired sound. A representative mentioned, "This test is significant because it allows non-experts to experience AI sound generation, aligning with our company vision."

Given their history of developing high-quality, advanced technology, if their multimodal capabilities expand to include video, I believe the prediction that "Gaudio Lab's technology will be in every movie and video" could very well become a reality.

Reporter Jang Se-min semim99@aitimes.com

Original Article: AI Times


Audio AI Researchers' Festival: ICASSP 2024 & Gaudio Night On-Site Highlights

Hello, this is Kaya from Gaudio Lab, where I research audio AI.

ICASSP 2024, the International Conference on Acoustics, Speech, and Signal Processing, was held at the COEX Convention Centre from 14th to 19th April. The event, which celebrated its 49th edition this year and was held in Korea for the first time, is one of the most prestigious conferences in the field of speech and audio signal processing, and Gaudio Lab took advantage of this rare gathering of audio AI researchers to host a networking party. In this post, I will vividly convey the experience of ICASSP and Gaudio Night ✨!

Bring Tension to a Loose Conference Scene - I'M HERE!

ICASSP, the World's Largest Speech and Audio Conference

The International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an international academic conference organized by the IEEE Signal Processing Society. It brings together researchers from around the world to share and discuss the latest research results. The papers presented here significantly influence academic trends in the field. For a researcher, attending ICASSP is an important opportunity to keep up with the latest research trends and to network.

The six-day conference featured oral presentation sessions, poster sessions, and tutorials on various topics such as speech recognition, speech synthesis, sound source separation, and 3D audio. As it was the first in-person event since the COVID-19 pandemic, the venue was bustling with researchers. It was impressive to see people greeting each other warmly and engaging in research discussions. 🤭

Approximately 4,000 scientists from around the world gathered

Our Gaudio Lab AI research team quickly caught up with presentations that could help our ongoing research and with papers containing interesting ideas that sparked our personal curiosity. In particular, impromptu discussions with the authors of studies we were interested in were a dopamine rush🫧. There were so many fascinating studies.

I also presented a poster on my research on the sound generation AI, FALL-E. Through in-depth exchanges of feedback with those interested in my research, I realized that networking with fellow researchers is like a double XP event for my growth as a researcher. Because:

- Knowledge and insights gained from each other's research introductions can lead to better research outcomes.
- Sharing trial and error with other researchers working on similar topics can reduce unnecessary effort.
- Exchanging thoughts on the same subject can lead to new ideas.
- And even more importantly, this can lead to collaboration opportunities.

I promised myself that I will keep presenting and sharing the work I do at Gaudio Lab at conferences 💪🤓

Many people showed interest and asked questions, so I was quite busy. I don’t know if I explained with my mouth or nose.

Gaudio Night, a Networking Event for Speech/Audio AI Researchers

Gaudio Lab already knew the importance of networking that I felt firsthand 😎. During the conference, Gaudio Lab invited audio AI researchers attending the conference to our office for a networking meetup. The event, called 'Gaudio Night,' aimed to contribute to the development of audio AI research by promoting exchange and cooperation between industry and academia.

It's early in the event, so it's still quiet... but...

About 40 researchers joined us at Gaudio Night. We had enjoyable conversations about each other’s research in various subfields of audio AI, accompanied by delicious food and wine. In fact, the scale of research on audio AI is still small compared to other fields, so I think there are still not many gatherings like this in Korea. Just having these valuable researchers gathered in one place made my heart swell… 😌. It also helped us discover potential partners and future Gaudio Lab recruits.

It was a great event, and if we continue to organize events like Gaudio Night on a regular basis, we hope that Gaudio Lab will one day become an indispensable part of the audio AI community. We hope this event is the beginning of that, and we will continue to play a leading role in the development of audio AI. If you're interested in growing with us, please feel free to knock on our door! Gaudio Lab is wide open for you~

-The Final 12-

Gaudio Lab’s Journey Continues 👣

Through participating in ICASSP 2024 and hosting 'Gaudio Night,' both Gaudio Lab and I have grown a step further. We’ve learned the latest research trends, interacted with excellent researchers, and strengthened our position as an industry leader. At Gaudio Lab, we don't plan to stop here. Our vision is to create the best sound experience through audio technology! Please keep an eye on Gaudio Lab’s journey towards the future✨.

That's all from Kaya, thank you very much.

2024.05.17

Behind the Scenes: How We Built the Just Voice Recorder, a Noise-Reducing Recording App

On May 20th, Gaudio Lab released the AI noise-cancelling recording app 'Just Voice Recorder' on the App Store. Perhaps because it features AI noise removal, which is uncommon in recording apps, the app drew strong anticipation from the pre-registration stage and had a successful debut on the App Store. Today, we'd like to share an interview with Jin, the PO of the Just Voice Recorder app, to give you a behind-the-scenes look at the app's development process, as well as tips for getting more out of the app.

Q. Can you start by introducing yourself?

Hello, I’m Jin, the PO of the Service and App (SNA) team at Gaudio Lab. I had about 8 years of experience as a PO/PM in various industries before joining Gaudio Lab.

Q. How do you feel about launching your first mobile app after joining Gaudio Lab?

I’m very proud, of course. Thinking back on all the hard work, it feels very rewarding. While I’m happy, I also have some regrets. Due to various constraints, I feel like we couldn’t fully meet all the needs. We plan to address these through continuous updates.

Q. What kind of app is Just Voice Recorder?

Just Voice Recorder is a recording app with powerful noise-cancelling AI technology from Gaudio Lab. It removes background noise, allowing you to hear voices clearly even in noisy environments. Moreover, the noise-cancelling AI in Just Voice Recorder operates on-device without sending recording data to a server, making it secure for personal recordings.

Q. Who should use Just Voice Recorder?

Anyone who needs to record can use Just Voice Recorder. Regardless of time or place, you can record and then fix noise or volume issues through Just Voice Recorder. To be more specific, I recommend it for students who record a lot of lectures, or for creators and journalists who record a lot for their jobs. If you're a student, you can remove keyboard noise, air conditioner noise, and so on from your lecture recordings, so that even recordings made from the back of the room are clear. If you're a creator, you can capture your voice clearly anytime, anywhere, without the need for specialized equipment.

Q. How did Just Voice Recorder get started? What’s the story from the initial idea to the decision to create it?

The idea began with ‘Let’s create a mobile app using *GSEP-HQ technology.’ GSEP-HQ is a technology used to separate instruments or vocals from audio tracks, widely loved and used in Gaudio Studio. However, Gaudio Studio is a web service that runs on servers. We wanted to implement it in a mobile app to align with the current trend of on-device AI. We chose to develop a recording app, believing it was the most straightforward (though it wasn’t easy) and valuable for users.

*GSEP: Developed by Gaudio Lab, GSEP is a sound separation technology. It includes GSEP-LD, capable of real-time processing, and GSEP-HQ, which offers higher quality sound separation. You can experience GSEP-LD through Just Voice Lite. (Learn more about GSEP)

Q. What is the biggest value that Just Voice Recorder offers?

The main goal of Just Voice Recorder is to solve real user problems. While there are many recording apps, few address issues like noise during recording or inaudible voices. We believed Gaudio Lab’s GSEP-HQ technology was suitable for solving these unmet needs, providing a good fit between the technology and the problem.

Q. If you had to choose the most important feature of Just Voice Recorder, what would it be?

I would say it’s the noise cancellation based on powerful audio separation technology. Although the default iPhone recording app has a noise reduction feature, Just Voice Recorder excels at removing really disruptive noise. However, there’s still room for improvement, especially in separating voices when the background noise is louder or the recorded voice is faint. We’re continually working on enhancements.

Q. You mentioned there were many challenges during app development. Can you share a memorable episode?

The process of deciding on the AI model for noise cancellation stands out. Initially, GSEP-HQ wasn’t ready for mobile use. We used the Just Voice SDK for development and planned to switch to GSEP-HQ once it was ready. However, we faced challenges because the Just Voice SDK could process in real time on the CPU, while GSEP-HQ required the GPU and had longer processing times, making real-time processing impossible. This led to a prolonged decision-making process, requiring us to develop for both simultaneously. We later reflected that quicker decisions could have led to more efficient development.

Q. Why was the app released only for iOS and not Android?

Although over 60% of pre-registrations were from Android users, we faced significant UI performance issues unique to Android due to high GPU usage during the GSEP-HQ model implementation. This structural difference between Android and iOS led us to prioritize the iOS release.

Q. What’s next for the Just Voice Recorder app?

We have many features we’d like to add, such as adjusting background noise volume during export, STT (speech-to-text), and recording file editing. We’re also considering expanding to iPad and Apple Watch. However, I believe it's more important to prioritize the basics of the app, particularly the best noise removal and stable app operation.

The first issue we want to improve is the long wait time for noise removal. Using the Just Voice SDK can solve this, but we chose performance over speed because the value of Just Voice Recorder lies in top-level noise removal. We're monitoring user reactions to find the sweet spot between speed and performance and improving the model to enhance speed accordingly.

Another issue is the decreased separation performance for faint voices. This is an inherent limitation of the GSEP-HQ engine, which was developed for instrument separation. We're working with the R&D team to overcome this limitation and plan to implement solutions in Just Voice Recorder as soon as possible to ensure it can cleanly separate voices of any volume.

Q. It’s been about a month since the app launch. How would you summarize the experience so far?

I would like to take this opportunity to thank all of the Gaudins who stepped up to the plate and helped us finish this project, despite the many challenges we faced. One of the biggest questions we had while developing Just Voice Recorder was, "Is this app solving a real problem that users are experiencing?" It's still a difficult question to answer with certainty, but since we've put the app in front of users, I think we can only find the answer by looking at user reactions and analyzing data. I also think it's a great achievement to see that Gaudio Lab's technology can be fully utilized in mobile and B2C environments.

🎙️ In conclusion…

From Just Voice Lite for Mac to Just Voice Recorder, our iPhone app, Gaudio Lab continues to push the envelope to deliver innovative sound experiences wherever there is sound. We had a candid interview with Jin, our PO, to learn more about the development of the Just Voice Recorder app, from its beginnings to its limitations, and we hope you're curious about the ever-evolving app. If so, head straight to the App Store and download it via the link below. A new world of recording awaits you!

>> https://apps.apple.com/app/just-voice-recorder/id6479693805

2024.06.14