
Synchronizing lyrics? It's Directly Handled by AI: An Introduction to GTS by the PO

2023.07.07 by Dewey Yoon


(Writer: John Jo)

 

Before we dive into the details of GTS, let me introduce myself.

 

Greetings! I'm John, the Product Owner (PO) of GTS at Gaudio Lab. By day I shape GTS as its PO; by night I chase my passion as a jazz vocalist, leading and singing for the New Orleans marching band SoWhat NOLA (shameless plug here).

 

Do you agree with the phrase, "We listen to music with our eyes 👀"?

 

What I’m actually referring to is song lyrics. GTS plays a pivotal role in the 'real-time lyric display' feature seen in today's streaming services, where AI automatically syncs lyrics with their corresponding music.

 

In this post, my goal is to shed light on the inner workings of GTS and to explore how this AI product, jointly developed by Gaudio Lab and me, is revolutionizing our world.

 

 

The Untold Story Behind 'Real-time Lyric Display'

 

Currently, most music streaming services offer real-time lyric services.

 

 

[Here's a snapshot of a real-time lyric service - just like this!]

 

Were you aware that, until recently, the lyrics you see synced to music streamed online were aligned with the melody manually, by real people?

 

Unfortunately, the practice of manually synchronizing lyrics comes with its own limitations, some of which may seem familiar:

 

  1. It is time-consuming. Syncing a song requires listening to it in its entirety, and if the lyrics flow at a rapid pace, as in rap, or the language isn't your native tongue, the processing time escalates significantly.

  2. The quality isn't always consistent. As mentioned above, tracks that are harder to sync take longer to process, which makes it nearly impossible for someone handling many songs in a day to process every one of them perfectly.

  3. It is costly. The music market is flooded with thousands of new songs every day, so streaming services have to keep investing in labor just to keep lyric syncing up to date. Factor in management costs, and it quickly becomes a daunting burden.

  4. It concerns artists as well. I personally ran into lyric-syncing trouble when releasing my own album. When my debut album came out two years ago, Spotify, an international streaming service, didn't yet provide real-time lyrics, so I had to painstakingly go through a manual sync service. By contrast, I recall how satisfying the smooth syncing felt on local streaming services that had already incorporated GTS.

 

 

Let's dive deeper into the revolutionary AI solution known as GTS that resolves all these issues at once.

 

[GTS Concept Diagram]

 

 

GTS, short for Gaudio Text Sync, is a tool that uses AI to automatically synchronize lyrics with a song's audio. For instance, given an mp3 file of BTS's new song "Take Two" and a txt file containing its lyrics, GTS automatically generates a timeline for those lyrics.
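To make the input and output concrete, here is a minimal, purely illustrative Python sketch. The names below (sync_lyrics, LyricLine) are hypothetical stand-ins rather than the actual GTS API; the point is only the shape of the task: an audio file plus plain-text lyrics go in, and a timestamped entry per lyric line comes out.

```python
# Illustrative sketch only: sync_lyrics and LyricLine are hypothetical
# stand-ins and do not represent the actual GTS API or its output format.
from dataclasses import dataclass

@dataclass
class LyricLine:
    start_sec: float  # where this lyric line begins in the audio
    end_sec: float    # where this lyric line ends
    text: str         # the lyric line itself

def sync_lyrics(audio_path: str, lyrics_path: str) -> list:
    """Hypothetical wrapper: given a song (e.g. an mp3) and its plain-text
    lyrics, return one timestamped LyricLine per line of the lyrics."""
    raise NotImplementedError("stand-in for the actual lyric-sync engine")

# Conceptual usage: an mp3 plus a txt file in, a per-line timeline out.
# timeline = sync_lyrics("take_two.mp3", "take_two_lyrics.txt")
# for line in timeline:
#     print(f"[{line.start_sec:07.2f} - {line.end_sec:07.2f}] {line.text}")
```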

 

 

If you're curious about how GTS can cut down time and cost, here are your answers:

 

  1. GTS takes only about 5 seconds to process a single song, and it covers a wide range of languages, including English, Korean, Chinese, and Japanese. Whatever the track, GTS creates accurate timelines considerably faster than any human could; even high-speed tracks from artists like Eminem pose no problem. For streaming services facing the challenges described above, adopting GTS could be a game-changing decision (one that will also positively impact your business!).

  2. GTS provides not only the synchronization results but also a detailed 'Sync Result Report' for each song. The report specifies how accurate the synchronization is and flags any lines of lyrics that may have been improperly synced, so users can effortlessly review and correct any discrepancies in the lyrics or the sync. (It could even get you off work sooner!)

  3. Today, most Korean music streaming services provide their customers with real-time lyrics created through GTS. Foreign streaming services have started to take notice, and we're continually holding intensive meetings with them. We're setting our sights on international expansion. Full speed ahead!

  4. The surprises don't end here. GTS, which started by creating a timeline for each line of lyrics, can now generate timelines down to each word, and the R&D team is working on per-character timelines, just like you'd see in a karaoke session. (The sketch below illustrates the difference between line-level and word-level timelines.)
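Item 4 above distinguishes line-level from word-level timelines. This post doesn't specify GTS's actual output format, but the widely used LRC lyrics format is a convenient way to picture the difference: line-level timing puts one [mm:ss.xx] tag in front of each line, while karaoke-style "enhanced" LRC adds a <mm:ss.xx> tag before each word. The sketch below, with placeholder words and made-up timestamps, illustrates only that idea, not GTS's own output.

```python
# Sketch only: LRC-style tags are used as a familiar example of lyric timing.
# The timestamps and words below are placeholders, not real lyrics or output.

def to_lrc_timestamp(seconds: float) -> str:
    """Format seconds as the LRC-style timestamp mm:ss.xx."""
    minutes, secs = divmod(seconds, 60.0)
    return f"{int(minutes):02d}:{secs:05.2f}"

# Line-level timeline: one timestamp per lyric line.
line_level = [
    (12.40, "Take my hand"),             # placeholder text
    (15.85, "We'll make it through"),    # placeholder text
]
for start, text in line_level:
    print(f"[{to_lrc_timestamp(start)}]{text}")
# e.g. [00:12.40]Take my hand

# Word-level timeline: one timestamp per word, as karaoke-style displays need.
word_level = [(12.40, "Take"), (12.95, "my"), (13.30, "hand")]
print(" ".join(f"<{to_lrc_timestamp(s)}>{w}" for s, w in word_level))
# -> <00:12.40>Take <00:12.95>my <00:13.30>hand
```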

 

As the saying goes, seeing is believing, so here's a comparison:


Let's take a look at the lyric syncing capabilities of Company G, a leading Korean music streaming platform, and NAVER VIBE. I've selected a track titled "Still a Friend of Mine" by one of my favorite bands, Incognito.


 

First, let's take a look at Company G's real-time lyrics. You'll notice that the synchronization is off by almost a complete line.

 

Many of you have probably run into this frustrating, inconvenient situation: when the sync is off, it's hard to jump to a specific part of a song just by tapping the corresponding lyric.

 

 

Now, let's take a look at VIBE, which has incorporated GTS technology.

 

With accurate synchronization, the music-listening experience is greatly enhanced. (Plus, VIBE has an exclusive feature allowing you to see native-level translated lyrics!)

 

 

 

 

GTS is revolutionizing the world in numerous ways.

E.UN Education: Facilitating the enjoyment of music for those with hearing impairments.

GTS is instrumental in music education services for people with hearing impairments. By pinpointing where each phrase of a song begins and ends, it lets them experience the music through vibrations that correspond to each syllable and helps them follow each line of the lyrics. E.UN Education in Korea is actively using GTS as it gears up to launch just such an application. This use of AI technology is a true embodiment of Gaudio Lab's mission to leverage AI for the betterment of society.
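As a rough illustration of the idea (not of E.UN Education's actual implementation), per-word or per-syllable timestamps like the ones described above could be mapped directly to haptic events. Everything in the sketch below, including the cue format, is assumed purely for illustration.

```python
# Rough illustration only: E.UN Education's real application is not described
# in detail in this post, and the cue format below is made up.
# The idea: per-word (or per-syllable) timestamps can drive haptic pulses,
# so each unit of a lyric line maps to a vibration the listener can feel.

def to_vibration_cues(word_timeline, pulse_ms=80):
    """word_timeline: list of (start_sec, text) pairs (hypothetical shape).
    Returns (trigger_time_sec, pulse_duration_ms) haptic events."""
    return [(start, pulse_ms) for start, _text in word_timeline]

# Placeholder words and timestamps, reusing the shape from the earlier sketches.
print(to_vibration_cues([(12.40, "Take"), (12.95, "my"), (13.30, "hand")]))
# -> [(12.4, 80), (12.95, 80), (13.3, 80)]
```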

 

 

[Picture taken during a meeting with E.UN Education]

 

CONSALAD: Indie musicians can now effortlessly generate lyric videos!

 

At CONSALAD, lyric videos are created (refer to picture below) and distributed to promote indie artists' new songs.

 

Here, GTS is utilized to synchronize the lyrics in these videos, illustrating how Gaudio Lab's technology assists indie musicians in their promotional efforts.

 

 

 

These cases are a testament to Gaudio Lab's dedication to 'delivering exceptional auditory experiences through innovative technology'.

 

 

However, the journey of exploring GTS's potential doesn't stop here. The applications of GTS are endless…

 

  1. GTS can be applied wherever there is a need for synchronization between text and sound.
    1. In OTT or videos, where synchronization between closed captions and audio is required (see the short sketch after this list for the idea applied to subtitle files).
    2. In audiobooks, where synchronization between audio and text is essential.
    3. In language learning, where synchronization between audio and text is crucial.

  2. Furthermore, GTS can be incorporated into a much broader array of applications than you might think, and I'm always open to further exploration and discussion.
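As a small illustration of the closed-caption case in the first sub-item above, the same (start, end, text) timeline idea carries over directly to subtitle formats. The sketch below emits standard SRT from a hypothetical timeline; the data and the to_srt helper are made up for illustration, and this is not a description of how GTS is actually integrated with any OTT service.

```python
# Illustration only: a made-up (start_sec, end_sec, text) timeline rendered
# as SRT subtitles, showing how text/audio sync results map onto captions.

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(cues):
    """cues: list of (start_sec, end_sec, text) tuples (hypothetical shape)."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{i}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)

# Placeholder caption lines and timings.
print(to_srt([(1.0, 3.5, "Hello there."), (4.0, 6.2, "Welcome back.")]))
```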

 

 

In conclusion,

 

The first project I undertook when I joined Gaudio Lab was the commercialization of GTS. Today, GTS is widely used in most of Korea's music streaming services, and it's gratifying to see how it enhances the music experience for many users.

 

My next goal is to speed up the implementation of GTS in a multitude of music/OTT streaming services around the world.

 

If you are interested in enhancing the auditory experiences of users worldwide, don't hesitate to reach out to us at Gaudio Lab!

 

 

 

 
