Ongoing Research Projects

Speech recognition system for Sinhala

This project aims to build a continuous speech recognition system for Sinhala, mainly for the purpose of dictating documents. The system is being developed using Hidden Markov Model technology with the help of the HTK toolkit. (Research Papers: 1, 2)

Sinhala-Tamil machine translation system

The ultimate goal of this research is to develop a machine translation system for the Sinhala-Tamil language pair, in order to reduce the language barrier between the Sinhala and Tamil communities and thereby help solve a burning issue in the country. We aim to develop systems for both directions, Sinhala to Tamil and Tamil to Sinhala. We are currently investigating how linguistic information can be integrated into the machine translation system, since Sinhala and Tamil are both morphologically rich languages. (Research Papers: 1, 2, 3)

Morphological analyzer for Sinhala

 Description will be added soon (Research Paper)

Building a Sinhala-Tamil parallel corpus


Building a Tamil corpus

The main objective of this task is to collect a 3-million-word corpus of Sri Lankan Tamil. Such a corpus is a very useful resource for building tools for machine-assisted and automatic translation between Sinhala and Tamil, as well as for translation studies.

Building a speech corpus for Sinhala

This project aims to build a speech corpus for Sinhala using news recordings from the Sri Lanka Broadcasting Corporation. The target is to collect 20,000 utterances together with their corresponding text transcriptions. (Research Paper)

Text To Speech (TTS) System for Sinhala

This project was to develop a human-sounding Sinhala text reader that enables the visually impaired Sinhala community to access digital text in their mother tongue using a computer. A unit selection approach on the OpenMARY Text-to-Speech synthesis platform was used to build the Sinhala voice.


Past Research Projects

Sinhala OCR (Beta version)

A font-specific Sinhala Optical Character Recognition system developed using the k-NN algorithm. The system supports four popular proprietary fonts: ‘Abhaya’, ‘Manel’, ‘Lakbima’ and ‘Divaina’. It attains an overall accuracy of approximately 85%. (Research Paper)
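As a rough illustration of the k-NN classification step mentioned above, the sketch below labels a feature vector by majority vote among its k nearest training examples. The feature vectors, the labels, the value of k and the function name `knn_classify` are all hypothetical; the features and parameters actually used by the OCR system are not described here.

```python
from collections import Counter
import math


def knn_classify(train, query, k=3):
    """Return the majority label among the k training vectors nearest to query.

    train: list of (feature_vector, label) pairs (toy data, not real glyph features).
    """
    # Sort training examples by Euclidean distance to the query vector.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of the k nearest neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D features for two character classes.
training_data = [
    ((0.10, 0.20), "ka"),
    ((0.15, 0.25), "ka"),
    ((0.20, 0.10), "ka"),
    ((0.80, 0.90), "ga"),
    ((0.85, 0.80), "ga"),
    ((0.90, 0.95), "ga"),
]

predicted = knn_classify(training_data, (0.12, 0.18), k=3)
```

In a font-specific setting, one plausible arrangement would be a separate training set per supported font, but that is an assumption rather than a description of the system.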

Shikshaka – framework for developing language learning tools

Shikshaka is a computer-based teaching framework developed for teaching the spoken aspect of languages using a dialogue-based andragogy. The system was developed as an interactive tool that follows the goal-oriented approach developed by linguistic scholars. We have used the framework to implement two scenarios: teaching Tamil through Sinhala and teaching Sinhala through English. The learner’s language can be customized with little effort, and the framework is flexible enough to be customized to teach other target languages as well, given some pedagogical input. (Research Paper)

Developing a Computational Grammar for Sinhala

This is an attempt to develop a feature-based context-free grammar (CFG) for non-trivial sentences in Sinhala. The resulting grammar covers a significant subset of Sinhala. A parser that produces the appropriate parse tree(s) for input sentences was developed using NLTK. (Research Paper)

kathabaha - Text To Speech (TTS) System for Sinhala

While some experimental TTS systems for Sinhala were already under development at the UCSC, the aim of this project was to produce one of commercial quality. To this end, considerable effort was spent on the quality aspects of this activity. Apart from identifying the phonetic alphabet of the language, recording relevant words and sentences in the database and building a text analysis component, the project also produced a synthesis engine that produces a natural-sounding Sinhala voice. The basic methodology is the diphone concatenation approach. (Research Papers: 1, 2, 3)
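To illustrate the diphone concatenation idea in general terms, the sketch below splits a phoneme sequence into the diphone units that a synthesizer would look up and join. The phoneme symbols, the `_` silence marker and the function name `to_diphones` are assumptions made for this example, not the project's actual inventory or code.

```python
def to_diphones(phonemes, silence="_"):
    """Return the list of diphone names covering a phoneme sequence.

    A diphone spans the transition from one phoneme to the next, so a
    sequence of n phonemes padded with silence at both ends yields n + 1 units.
    """
    padded = [silence] + list(phonemes) + [silence]
    # Pair each phoneme with its successor to name the transition unit.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Hypothetical phoneme sequence for a short word.
units = to_diphones(["a", "m", "m", "a"])
```

A real synthesizer would then fetch the recorded waveform for each unit and smooth the joins; that signal-processing step is outside the scope of this sketch.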

Akaradi - Trilingual Lexicon

Description will be added soon

Aharamariyi Tamil OCR

Aharamariyi is a Tamil Optical Character Recognition system built using the Tesseract OCR engine. We used several methods to identify the features of Tamil script and to extract the most accurate combinations of training data. Aharamariyi OCR achieves 81% accuracy on our evaluation data set. (Research Paper)