Restoring tone-marks in standard Yorùbá electronic text: improved model

Asahiah, Franklin Ọládiípò; Ọdéjọbí, Ọdétúnjí Àjàdí; Adagunodo, Emmanuel Rotimi

doi:https://doi.org/10.7494/csci.2017.18.3.2128

Article

creativeworkseries.issn	1508-2806
dc.contributor.author	Asahiah, Franklin Ọládiípò
dc.contributor.author	Ọdéjọbí, Ọdétúnjí Àjàdí
dc.contributor.author	Adagunodo, Emmanuel Rotimi
dc.date.available	2025-06-16T10:13:20Z
dc.date.issued	2017
dc.description	Bibliography: pp. 313-315.
dc.description.abstract	Diacritic restoration is a necessity in the processing of languages with Latin-based scripts that use letters outside the basic Latin alphabet used by English. Yorùbá is one such language, marking an underdot (dot-below) on three characters and tone marks on all seven vowels and two syllabic nasals. The problem of restoring underdotted characters has been fairly well addressed using characters as the linguistic units for restoration. However, the existing character-based and word-based approaches have not been able to sufficiently address the restoration of tone marks in Yorùbá. In this study, we address tone-mark restoration as a subset of diacritic restoration. We propose using the syllable (derived from the word) as the linguistic token for tone-mark restoration. In our experimental setup, we used Yorùbá text collected from various sources as data, with a total word count of 250,336 words. These words, on syllabification, yielded 464,274 syllables. The syllables were divided into training and testing data in different proportions, ranging from 99% for training and 1% for testing to 70% for training and 30% for testing. The aim of evaluating different proportions was to determine how the ratio of training to test data affects the variation that may occur in the results. We applied memory-based learning to train the models. We also set up a similar experiment using character tokens in order to compare performance. The results showed that using syllables increased word-level accuracy to 96.23%, an average of almost 15% over that obtained using characters. We also found that using 75% of the data for training and the remaining 25% for testing gives the results with the least variation in a ten-fold cross-validation test. Hybridizing this method, which uses syllables as the linguistic processing units, with other methods such as lexicon lookup may lead to improvement over the current results.	en
dc.description.placeOfPublication	Kraków
dc.description.version	publisher's version
dc.identifier.doi	https://doi.org/10.7494/csci.2017.18.3.2128
dc.identifier.eissn	2300-7036
dc.identifier.issn	1508-2806
dc.identifier.uri	https://repo.agh.edu.pl/handle/AGH/113187
dc.language.iso	eng
dc.publisher	Wydawnictwa AGH
dc.relation.ispartof	Computer Science
dc.rights	Attribution 4.0 International
dc.rights.access	open access
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/legalcode
dc.subject	diacritic restoration	en
dc.subject	syllables	en
dc.subject	characters	en
dc.subject	word-level accuracy	en
dc.title	Restoring tone-marks in standard Yorùbá electronic text: improved model	en
dc.title.related	Computer Science	en
dc.type	article
dspace.entity.type	Publication
publicationissue.issueNumber	No. 3
publicationissue.pagination	pp. 301-315
publicationvolume.volumeNumber	Vol. 18
relation.isJournalIssueOfPublication	370e9597-005d-43f6-b5d9-1e994e0b8a5c
relation.isJournalIssueOfPublication.latestForDiscovery	370e9597-005d-43f6-b5d9-1e994e0b8a5c
relation.isJournalOfPublication	020291ee-249b-4dcf-98a3-276a2f7981aa

Files

Original bundle

Name: csci.2017.18.3.301.pdf
Size: 876.01 KB
Format: Adobe Portable Document Format

Collections

Articles (CN-csci)