
Audio AI Researchers' Festival: ICASSP 2024 & Gaudio Night On-Site Highlights

2024.05.17 by Kaya Chung

Hello, this is Kaya from Gaudio Lab, where I research audio AI.

 

ICASSP 2024, the International Conference on Acoustics, Speech, and Signal Processing, was held at the COEX Convention Centre in Seoul from 14 to 19 April. The event, which marked its 49th edition this year and was held in Korea for the first time, is one of the most prestigious conferences in the field of speech and audio signal processing, and Gaudio Lab took advantage of this rare gathering of audio AI researchers to host a networking party.

 

In this post, I will vividly convey the experience of ICASSP and Gaudio Night ✨!

 

ICASSP at COEX

Bringing some energy to a quiet conference scene - I'M HERE!

 

 

 

ICASSP, the World's Largest Speech and Audio Conference

 

The International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an international academic conference organized by the IEEE Signal Processing Society. It brings together researchers from around the world to share and discuss the latest research results. The papers presented here significantly influence academic trends in the field. As a researcher, attending ICASSP is an important opportunity to keep up with the latest research trends and network.

 

The six-day conference featured oral presentation sessions, poster sessions, and tutorials on various topics such as speech recognition, speech synthesis, sound source separation, and 3D audio. As it was the first in-person event since the COVID-19 pandemic, the venue was bustling with researchers. It was impressive to see people greeting each other warmly and engaging in research discussions. 🤭

 

Scenes from ICASSP

Approximately 4,000 scientists from around the world gathered

 

 

 

Our Gaudio Lab AI research team quickly caught up on presentations relevant to our ongoing research, as well as papers with interesting ideas that sparked personal curiosity. In particular, impromptu discussions with the authors of studies we were interested in were a dopamine rush🫧.

 

A research presentation at ICASSP

There were so many fascinating studies.

 

 

 

I also presented a poster on my research on sound generation AI, FALL-E. Through deep feedback exchanges with those interested in my research, I realized that networking with fellow researchers is like a double XP event for my growth as a researcher.

 

Because:

  • Knowledge and insights gained from each other’s research introductions can lead to better research outcomes.
  • Sharing trial and error with other researchers working on similar topics can reduce unnecessary efforts.
  • Exchanging thoughts on the same subject can lead to new ideas.
  • And, most importantly, these exchanges can lead to collaboration opportunities.

 

I promise myself that I will continue to present and exchange my hard work at Gaudio Lab through conferences 💪🤓

 

Presenting FALL-E at ICASSP

Many people showed interest and asked questions, so I was kept quite busy.
It was so hectic I barely knew whether I was talking with my mouth or my nose.

 

 

 

Gaudio Night, a Networking Event for Speech/Audio AI Researchers

 

Gaudio Lab already knew the importance of networking that I felt firsthand 😎.

 

During the conference, Gaudio Lab invited audio AI researchers participating in the conference to our office for a networking meetup. The event was called 'Gaudio Night' with the aim of contributing to the development of audio AI research by promoting exchanges and cooperation between industry and academia.

 

ICASSP Gaudio Night

It's early in the event, so it's still quiet... but...

 

 

 

About 40 researchers joined us at Gaudio Night. Over delicious food and wine, we had enjoyable conversations about each other's research across various subfields of audio AI. Audio AI is still a small field compared to others, so there are not many gatherings like this in Korea yet. Just having these valuable researchers together in one place made my heart swell… 😌. It was also a great chance to discover potential partners and future Gaudio Lab recruits.

 

It was a great event, and if we continue to host gatherings like Gaudio Night on a regular basis, we hope Gaudio Lab will one day become an indispensable part of the audio AI community.

 

We hope this event is the beginning of that, and we will continue to play a leading role in the development of audio AI. If you're interested in growing with us, please feel free to knock on our door!

 

Gaudio Lab is wide open for you~ -The Final 12-

 

 

 

Gaudio Lab’s Journey Continues 👣

 

Through participating in ICASSP 2024 and hosting 'Gaudio Night,' both I and Gaudio Lab have grown a step further. We’ve learned the latest research trends, interacted with excellent researchers, and strengthened our position as industry leaders.

 

At Gaudio Lab, we don't plan to stop here. Our vision is to create the best sound experience through audio technology! Please keep an eye on Gaudio Lab’s journey towards the future✨.

 

That's all from Kaya, thank you very much.
