
Exploring ICML 2024: Latest Advances in AI and Audio Research

2024.08.30 by Kaya Chung

Introduction

 

Hello again, it’s Kaya, back with more updates!
As a researcher focused on audio AI at Gaudio Lab, I often get the chance to attend various conferences. You might remember my recent post on ICASSP 2024 & Gaudio Night.

 

This time, I’m excited to share my experience attending ICML 2024 (the International Conference on Machine Learning), held in beautiful Vienna, Austria.

 

ICML is one of the most prominent AI conferences in the world, alongside ICLR and NeurIPS, and Gaudio Lab makes a point of attending every year. It’s an incredible opportunity for researchers like me to connect with others in the field and discover the latest breakthroughs.

 

Unlike my last post, which focused on the event atmosphere, this one will dive into the innovative ideas and insights I gained from ICML 2024!

 


The ICML 2024 venue in Vienna, Austria. Wien~

 

 

 

Highlights from ICML 2024

 

Spotlighted Papers

 

Let’s start with a few standout research papers I encountered. The first one that caught my eye was "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution." This study introduces a novel approach to applying diffusion models to discrete data such as text. Its method, SEDD (Score Entropy Discrete Diffusion), beats existing language diffusion models and is competitive with autoregressive models of comparable size, even outperforming GPT-2. This research significantly broadens the applicability of diffusion models and has made a notable impact in the AI community.

 

 


Lou, Aaron, Chenlin Meng, and Stefano Ermon. "Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution."

 

 

 

Another intriguing study was titled "Debating with More Persuasive LLMs Leads to More Truthful Answers." This paper presents experimental evidence that when two expert LLMs debate opposing answers, a non-expert judge (either a weaker model or a human) picks the correct answer more often, and that making the debaters more persuasive improves the judge’s accuracy further. AI models debating one another once seemed confined to the realm of imagination. This research could play an important role in addressing misinformation from models like ChatGPT.

 

 
 

Khan, Akbir, et al. "Debating with More Persuasive LLMs Leads to More Truthful Answers."

 

 

 

 

The Introduction of “Position” Papers

 

A new category of papers made its debut at this year’s ICML: Position papers. These papers don’t necessarily propose new models or ideas; instead, they challenge current academic norms and provoke deep reflection.

 

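One particularly compelling paper was titled "Position: Measure Dataset Diversity, Don’t Just Claim It." This study argues that simply asserting that a dataset is diverse isn’t enough; diversity has to be clearly defined and measured. The authors reviewed 135 image and text datasets and, drawing on measurement theory from the social sciences, offer recommendations for conceptualizing, operationalizing, and evaluating dataset diversity. It is a reminder for us as AI researchers to think carefully about the fairness and inclusiveness of our datasets.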

 

 

 

Trends in Audio Research

 

ICML 2024 also featured a number of exciting papers on audio AI.

 

The current trend in this field is toward more sophisticated generative models, evident in areas ranging from music generation to general-purpose audio synthesis.

 

One of the most impressive studies was "DITTO: Diffusion Inference-Time T-Optimization for Music Generation." Rather than fine-tuning a model, DITTO optimizes the initial noise latents of a pretrained text-to-music diffusion model at inference time, allowing precise control over the intensity, melody, and structure of the generated music. This marks a significant step forward in music AI, and the future of music generation looks incredibly promising with such advancements.

 

 

 

Novack, Zachary, et al. "DITTO: Diffusion Inference-Time T-Optimization for Music Generation."

 

 

 

Video-to-audio generation has also emerged as a hot topic. With models like OpenAI’s Sora pushing the boundaries of video generation, creating matching audio for these videos has become a critical research area. Google proposed a model called "VideoPoet" that generates both video and audio, while Adobe introduced "Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity," focusing on syncing sound effects with on-screen actions.

 

 

 

 

 

These developments naturally reminded me of Gaudio Lab’s own sound effects generation model, FALL-E, a technology that even caught the attention of Microsoft CEO Satya Nadella at CES. A few months ago, we demonstrated how FALL-E could generate sound effects perfectly synchronized with Sora videos. Seeing these trends at ICML made me proud of how quickly Gaudio Lab has picked up on industry trends and reaffirmed our team’s research direction.😎

 

 

 

Conclusion

 

In this post, I’ve shared some of the diverse trends and fascinating research from ICML 2024.
I was so engaged in exploring all the new studies that I found myself running around the conference hall to keep up with everything! 🏃‍♀️

 
 

One particularly fun experience was a poster session where researchers were given questions about large language models (LLMs) and encouraged to debate them, turning the session into a lively, spontaneous event.

 

 

Attending ICML 2024 and absorbing all the new trends and research has left me inspired and ready to return to Korea and continue developing smarter AI models. I’m excited to apply the knowledge and inspiration I gained at the conference.

 

Hopefully, at the next conference, we’ll see Gaudio Lab’s research featured as a spotlighted paper! 💪 Until then, this wraps up my ICML 2024 recap. 😁

 

 
