Speech audio modeling by means of causal moving average equipped gated attention

Ivaniuk, Andrii

dc.contributor.author	Ivaniuk, Andrii
dc.date.accessioned	2023-04-03T07:47:31Z
dc.date.available	2023-04-03T07:47:31Z
dc.date.issued	2022
dc.description	In this paper we compare different attention mechanisms on the task of audio generation, using unsupervised learning approaches and building on previous work in language modeling. This is an important problem, since speech synthesis technology can be used to convert textual information into audio signals. Such a representation can be conveniently integrated into mobile devices and used in applications such as voice messengers or email programs. It is sometimes difficult to read and understand important messages while abroad, where suitable computer systems may be lacking or security problems may arise. With this technology, email messages can be listened to quickly and efficiently on smartphones, increasing productivity. In addition, it can be used to assist people with visual impairments, so that, for example, screen content can be read aloud automatically to a blind user. Today, household appliances such as multicookers can also use such a system to read out culinary recipes, cars can use it for voice navigation to a destination, and language learners can use it for pronunciation training. Speech generation is the inverse problem of automatic speech recognition (ASR) and has been studied since the second half of the 18th century. The technology also helps people with voice impairments communicate with others who do not understand sign language. However, the audio sampling rate is very high, which leads to very long sequences that are computationally difficult to model. A second problem is that speech signals with the same semantic meaning can be represented by a large number of signals with considerable variability, caused by the transmission channel, pronunciation, or the timbre characteristics of the speaker. 
To overcome these problems, we train an autoencoder model to discretize the continuous audio signal into a finite set of discriminative audio tokens that have a lower sampling rate. Autoregressive models that do not depend on text are then trained on these representations to predict the next token from the preceding elements of the sequence. This modeling approach therefore resembles autoregressive language modeling. In our study we show that, unlike in the original MEGA work, the traditional mechanism outperforms the moving-average mechanism, which shows that the latter is not yet stable and requires careful hyperparameter optimization.	uk_UA
dc.description.abstract	In this paper we compare different attention mechanisms on the task of audio generation using unsupervised approaches, following previous work in language modeling. This is an important problem, since speech synthesis technology can be used to convert textual information into acoustic waveform signals. These representations can be conveniently integrated into mobile devices and used in applications such as voice messengers or email apps. It is sometimes difficult to read and understand important messages while abroad, where appropriate computer systems may be lacking or security problems may arise. With this technology, email messages can be listened to quickly and efficiently on smartphones, boosting productivity. It is also used to assist visually impaired people, so that, for instance, screen content can be read aloud automatically to a blind user. Nowadays, home appliances such as slow cookers can also use such a system to read culinary recipes, automobiles can use it for voice navigation to a destination, and language learners can use it for pronunciation training. Speech generation is the inverse problem of automatic speech recognition (ASR) and has been researched since the second half of the eighteenth century. The technology also helps vocally impaired people communicate with others who do not understand sign language. However, the audio sampling rate is very high, leading to very long sequences that are computationally difficult to model. A second challenge is that speech signals with the same semantic meaning can be represented by many signals with significant variability, caused by the channel environment, pronunciation, or speaker timbre characteristics. To overcome these problems, we train an autoencoder model to discretize the continuous audio signal into a finite set of discriminative audio tokens that have a lower sampling rate. 
Subsequently, autoregressive models that are not conditioned on text are trained on this representation space to predict the next token from the previous sequence elements. Hence, this modeling approach resembles causal language modeling. In our study, we show that, unlike in the original MEGA work, traditional attention outperforms moving average equipped gated attention, which indicates that EMA gated attention is not yet stable and requires careful hyperparameter optimization.	en_US
dc.identifier.citation	Ivaniuk A. Speech audio modeling by means of causal moving average equipped gated attention / A. Ivaniuk // Могилянський математичний журнал. - 2022. - Т. 5. - С. 53-56. - https://doi.org/10.18523/2617-70805202253-56	uk_UA
dc.identifier.issn	2617-7080
dc.identifier.issn	2663-0648
dc.identifier.uri	https://doi.org/10.18523/2617-70805202253-56
dc.identifier.uri	https://ekmair.ukma.edu.ua/handle/123456789/24883
dc.language.iso	en
dc.relation.source	Могилянський математичний журнал	uk_UA
dc.status	first published	en_US
dc.subject	audio modeling	en_US
dc.subject	artificial neural networks	en_US
dc.subject	attention mechanism	en_US
dc.subject	article	en_US
dc.subject	аудіомоделювання	uk_UA
dc.subject	штучні нейронні мережі	uk_UA
dc.subject	механізм уваги	uk_UA
dc.title	Speech audio modeling by means of causal moving average equipped gated attention	en_US
dc.title.alternative	Мовне моделювання аудiо з допомогою механiзму уваги з рухомим середнiм	uk_UA
dc.type	Article	en_US

Files

Original bundle

Name: Ivaniuk_Speech_audio_modeling_by_means_of_causal_oving_average_equipped_gated_attention.pdf
Size: 148.44 KB
Format: Adobe Portable Document Format

Collections

Volume 5
Department of Informatics