End-to-end speech-to-text models aim to convert spoken audio directly into text using a single, unified neural system. Unlike traditional pipelines that separated acoustic modeling, pronunciation ...