Modern approaches to controllable emotional speech synthesis

dc.contributor.author: Ivashchenko, Dmytro (en_US)
dc.contributor.author: Marchenko, Oleksandr (en_US)
dc.date.accessioned: 2026-02-04T08:23:24Z
dc.date.available: 2026-02-04T08:23:24Z
dc.date.issued: 2025
dc.description: The article presents a comprehensive review of modern technologies for controllable emotional speech synthesis systems. It analyzes the evolution of neural architectures and systematizes approaches by technology and by method of emotional control. It identifies the key challenges of the field, including the disentanglement of speech features and the shortage of data for low-resource languages. It outlines promising directions for the development of emotionally controllable speech synthesis systems. (uk_UA)
dc.description.abstract: The generation of emotionally expressive and controllable speech is one of the most dynamic and technically demanding areas at the intersection of artificial intelligence, natural language processing, and speech synthesis. Recent progress in emotional text-to-speech (TTS) systems has enabled increasingly natural and emotionally nuanced voice generation, shifting from early concatenative methods to advanced neural models. This review provides a structured overview of the state of the art in controllable emotional TTS, highlighting key architectural paradigms. A special focus is placed on emotional control mechanisms: discrete emotional tagging with categorical or dimensional labels; reference-based control, which conditions synthesis on prosodic or stylistic exemplars; and prompt-based techniques, which leverage the capabilities of large language models for flexible and intuitive emotional specification. Despite substantial improvements in synthesis quality and emotional expressiveness, several critical challenges remain unresolved. These include the disentanglement of emotional, speaker, and prosodic features, the lack of standardized evaluation metrics for emotional clarity and naturalness, and the significant computational demands of training high-fidelity models. Furthermore, the scarcity of diverse, emotion-labeled speech data, especially for low-resource and morphologically rich languages, continues to limit the generalizability of current approaches. This review summarizes existing methods and their trade-offs and outlines promising research directions, aiming to support the development of more robust, efficient, and emotionally expressive speech generation systems. (en_US)
dc.identifier.citation: 111 (en_US)
dc.identifier.issn: 2617-3808
dc.identifier.issn: 2617-7323
dc.identifier.uri: https://doi.org/10.18523/2617-3808.2025.8.28-37
dc.identifier.uri: https://ekmair.ukma.edu.ua/handle/123456789/38254
dc.language.iso: en (en_US)
dc.relation.source: Наукові записки НаУКМА. Комп'ютерні науки (uk_UA)
dc.status: first published (en_US)
dc.subject: deep learning (en_US)
dc.subject: text-to-speech synthesis (en_US)
dc.subject: natural language processing (en_US)
dc.subject: speech emotion control (en_US)
dc.subject: diffusion models (en_US)
dc.subject: article (en_US)
dc.subject: глибоке навчання (uk_UA)
dc.subject: синтез мовлення з тексту (uk_UA)
dc.subject: обробка природної мови (uk_UA)
dc.subject: емоційний контроль мовлення (uk_UA)
dc.subject: дифузійні моделі (uk_UA)
dc.title: Modern approaches to controllable emotional speech synthesis (en_US)
dc.title.alternative: Сучасні підходи до контрольованого синтезу емоційного мовлення (uk_UA)
dc.type: Article (en_US)
Files
Name: Ivashchenko_Marchenko_Modern_approaches_to_controllable_emotional_speech_synthesis.pdf
Size: 823.85 KB
Format: Adobe Portable Document Format