Article

Restoring tone-marks in standard Yorùbá electronic text: improved model

creativeworkseries.issn1508-2806
dc.contributor.authorAsahiah, Franklin Ọládiípò
dc.contributor.authorỌdéjọbí, Ọdétúnjí Àjàdí
dc.contributor.authorAdagunodo, Emmanuel Rotimi
dc.date.available2025-06-16T10:13:20Z
dc.date.issued2017
dc.descriptionBibliography: pp. 313-315.
dc.description.abstractDiacritic restoration is a necessity in the processing of languages with Latin-based scripts that use letters outside the basic Latin alphabet used by English. Yorùbá is one such language, marking the underdot (dot-below) on three characters and tone marks on all seven vowels and two syllabic nasals. The problem of restoring underdotted characters has been fairly well addressed using the character as the linguistic unit for restoration. However, the existing character-based and word-based approaches have not been able to sufficiently address the restoration of tone marks in Yorùbá. In this study, we address tone-mark restoration as a subset of diacritic restoration. We propose using the syllable (derived from the word) as the linguistic token for tone-mark restoration. In our experimental setup, we used Yorùbá text collected from various sources as data, with a total count of 250,336 words. These words, on syllabification, yielded 464,274 syllables. The syllables were divided into training and testing data in different proportions, ranging from 99% for training and 1% for testing to 70% for training and 30% for testing. The aim of evaluating different proportions was to determine how the ratio of training to test data affects the variation in the results. We applied memory-based learning to train the models. We also set up a similar experiment using character tokens in order to compare performance. The results showed that using syllables raised word-level accuracy to 96.23%, an average improvement of almost 15% over that obtained using characters. We also found that using 75% of the data for training and the remaining 25% for testing gives the results with the least variation in a ten-fold cross-validation test. Hybridizing this method, which uses syllables as the linguistic processing units, with other methods such as lexicon lookup may lead to improvement over the current result.en
dc.description.placeOfPublicationKraków
dc.description.versionpublisher's version
dc.identifier.doihttps://doi.org/10.7494/csci.2017.18.3.2128
dc.identifier.eissn2300-7036
dc.identifier.issn1508-2806
dc.identifier.urihttps://repo.agh.edu.pl/handle/AGH/113187
dc.language.isoeng
dc.publisherWydawnictwa AGH
dc.relation.ispartofComputer Science
dc.rightsAttribution 4.0 International
dc.rights.accessopen access
dc.rights.urihttps://creativecommons.org/licenses/by/4.0/legalcode
dc.subjectdiacritic restorationen
dc.subjectsyllablesen
dc.subjectcharactersen
dc.subjectword-level accuracyen
dc.titleRestoring tone-marks in standard Yorùbá electronic text: improved modelen
dc.title.relatedComputer Scienceen
dc.typearticle
dspace.entity.typePublication
publicationissue.issueNumberNo. 3
publicationissue.paginationpp. 301-315
publicationvolume.volumeNumberVol. 18
relation.isJournalIssueOfPublication370e9597-005d-43f6-b5d9-1e994e0b8a5c
relation.isJournalIssueOfPublication.latestForDiscovery370e9597-005d-43f6-b5d9-1e994e0b8a5c
relation.isJournalOfPublication020291ee-249b-4dcf-98a3-276a2f7981aa
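
The restoration task described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: it swaps in a plain most-frequent-form lookup where the study uses memory-based learning, and it takes already-segmented tokens (words or syllables) as input because the syllabification procedure is not given in this record. The function names `strip_tones`, `train`, and `restore` are hypothetical, and the sketch assumes that tone is written only with the combining grave (low) and acute (high) accents, while the underdot (U+0323) is left untouched.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumption: only grave (low tone) and acute (high tone) count as tone marks.
TONE_MARKS = {"\u0300", "\u0301"}


def strip_tones(text):
    """Remove tone marks but keep the underdot (combining dot below)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def train(tokens):
    """Map each tone-stripped token to its most frequent toned form."""
    counts = defaultdict(Counter)
    for tok in tokens:
        tok = unicodedata.normalize("NFC", tok)
        counts[strip_tones(tok)][tok] += 1
    return {bare: forms.most_common(1)[0][0] for bare, forms in counts.items()}


def restore(model, bare_tokens):
    """Look up each token; unseen tokens are returned unchanged."""
    return [model.get(tok, tok) for tok in bare_tokens]


if __name__ == "__main__":
    model = train(["b\u00e1", "b\u00e1", "b\u00e0"])  # "bá", "bá", "bà"
    print(strip_tones("Yor\u00f9b\u00e1"))            # tone marks removed
    print(restore(model, ["ba", "lo"]))               # "lo" was never seen
```

A lookup of this kind cannot use context, so it always returns the same toned form for a given stripped token; the context-sensitive classification reported in the abstract is exactly what this simplification leaves out.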

Files

Original bundle

Name:
csci.2017.18.3.301.pdf
Size:
876.01 KB
Format:
Adobe Portable Document Format