Balancing speed and precision: a comparative study of ASR systems in multimodal collaborative environments
dc.contributor.author | Terpstra, Corbyn, author
dc.contributor.author | Blanchard, Nathaniel, advisor
dc.contributor.author | Ghosh, Sudipto, committee member
dc.contributor.author | Cleary, Anne, committee member
dc.date.accessioned | 2025-09-01T10:42:26Z
dc.date.available | 2025-09-01T10:42:26Z
dc.date.issued | 2025
dc.description.abstract | Automatic Speech Recognition (ASR) systems are increasingly critical for analyzing collaborative problem-solving (CPS) tasks, yet their segmentation and transcription accuracy in dynamic, multimodal environments remain underexplored. This study evaluates the performance of OpenAI's Whisper (Large, Medium, Turbo) and Vosk ASR systems in segmenting and transcribing collaborative dialogue, with a focus on implications for CPS annotation workflows. Leveraging a dataset of triads solving a multimodal task—comprising oracle (human-segmented), Google-segmented, and Whisper-segmented audio—we measure transcription accuracy via Word Error Rate (WER) and assess segmentation alignment through start time deviations, segment length ratios, and pause dynamics. Results reveal that while Whisper Turbo achieves the lowest overall WER (52.5%), its semantic segmentation strategy fragments coherent CPS moves, complicating annotation. Conversely, Vosk's pause-based approach under-segments rapid exchanges, obscuring interruptions and cross-talk. The study highlights a fundamental tension: Whisper prioritizes intent preservation at the cost of over-segmentation, while Vosk and Google ASR sacrifice nuance for efficiency. Annotation fidelity is further eroded by ASR-induced errors, including insertions (e.g., hallucinated phrases during silence) and temporal misalignments. These findings underscore the need for hybrid segmentation strategies and adaptive annotation frameworks that explicitly account for ASR limitations. Practical recommendations are proposed, including model-specific post-processing and context-aware annotation tools. By bridging technical evaluation with real-world application, this work advances the design of ASR systems tailored for collaborative environments, ensuring their outputs align with the complexities of human interaction.
dc.format.medium | born digital
dc.format.medium | masters theses
dc.identifier | Terpstra_colostate_0053N_19247.pdf
dc.identifier.uri | https://hdl.handle.net/10217/241841
dc.identifier.uri | https://doi.org/10.25675/3.02161
dc.language | English
dc.language.iso | eng
dc.publisher | Colorado State University. Libraries
dc.relation.ispartof | 2020-
dc.rights | Copyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.
dc.subject | machine learning
dc.subject | automatic speech recognition
dc.title | Balancing speed and precision: a comparative study of ASR systems in multimodal collaborative environments
dc.type | Text
dcterms.rights.dpla | This Item is protected by copyright and/or related rights (https://rightsstatements.org/vocab/InC/1.0/). You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
thesis.degree.discipline | Computer Science
thesis.degree.grantor | Colorado State University
thesis.degree.level | Masters
thesis.degree.name | Master of Science (M.S.) |
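The abstract above reports transcription accuracy as Word Error Rate (WER). As an illustrative aside, WER is conventionally the word-level Levenshtein (edit) distance between a reference transcript and an ASR hypothesis, normalized by the reference length; the sketch below is a minimal standalone implementation, not the thesis's actual evaluation code, which may use a dedicated toolkit and different text normalization.

```python
# Minimal WER sketch: word-level edit distance / reference length.
# Illustrative only; real evaluations typically normalize casing and
# punctuation before scoring.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions turn ref[:i] into empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions turn empty reference into hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One inserted word against a 3-word reference gives WER = 1/3.
print(wer("the cat sat", "the cat sat on"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions (e.g., the hallucinated phrases during silence mentioned in the abstract), which is one reason overall WER figures above 50% remain interpretable.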
Files
Original bundle (1 of 1)
- Name: Terpstra_colostate_0053N_19247.pdf
- Size: 226.31 KB
- Format: Adobe Portable Document Format