Research

Ongoing Research projects

Developing a Conversational Agent for Sinhala

The goal of the chatbot is to provide conversational self-service channels for general and account-related actions, as well as customer support.
Since no reliable chatbot is available for the Sinhala language, we are developing commercial-grade Sinhala conversational agent technology with strong market potential, as a way of generating a revenue stream for launching a startup and/or sustaining research into new language technologies. Our approach is to build on the recent successes of deep learning techniques.

Automatic Speech Recognition system for Sinhala

The main goal of this proposed project is to enhance human-computer interaction by developing applications that enable users to easily access computers using local languages. Automatic Speech Recognition (ASR) is one of the key technologies in facilitating this.

While many approaches to ASR have been tried, those employing statistical Hidden Markov Models (Rabiner, 1989) have emerged as the dominant approach (Jelinek, 1997). With the advent of deep learning, a new class of algorithms has allowed dramatic reductions in error rates.

This project aims to explore some of these new learning algorithms to improve the quality of the speech recognition currently implemented for Sinhala.

Research Papers:

  1. Efficient use of training data for Sinhala speech recognition using active learning
  2. Continuous Sinhala Speech Recognizer 

Human Quality Text To Speech (TTS) System for Sinhala

The main goal of this project is to enhance human-computer interaction by developing applications that enable users to easily access computers in local languages.

In this project we focus on developing a human-quality Text-to-Speech (TTS) system for Sinhala. Human-quality TTS enables the visually impaired community to use a computer and access digital text in their mother tongue. TTS also helps in creating applications such as learning materials and games, which let all users, not only the visually impaired, get feedback from the computer in a more natural way.

Human Quality Text To Speech (TTS) System for Tamil

The main goal of this project is to enhance human-computer interaction by developing applications that enable users to easily access computers in local languages.

In this project we focus on developing a human-quality Text-to-Speech (TTS) system for Sri Lankan Tamil. Human-quality TTS enables the visually impaired community to use a computer and access digital text in their mother tongue. TTS also helps in creating applications such as learning materials and games, which let all users, not only the visually impaired, get feedback from the computer in a more natural way.

Optical Character Recognition (OCR) system for Sinhala

Though Sinhala and Tamil OCR systems of acceptable quality have been developed by us and others in the past, this project aims to develop an accessible and distributable OCR engine that can be embedded in third-party applications.

It also aims to build on the recent successes of deep learning techniques to improve the performance of existing systems.

Sinhala Encyclopedia project

The official Sinhala encyclopaedia is a rich resource that is severely underutilized owing to its inaccessibility.

The Sinhala Encyclopedia Office at the Department of Cultural Affairs is now collaborating with the Language Technology Research Lab of the UCSC to web-enable its content and make it accessible to a wide audience.

Our approach is to use the MediaWiki framework to achieve this.

 

Past Research projects

Data Driven Spell Checker for Sinhala

This research covered the construction of a spell checker for Sinhala. The approach is based on n-gram statistics and is relatively inexpensive to construct, as it requires no deep linguistic knowledge. This is particularly useful because very few linguistic resources are available for Sinhala at present. The proposed algorithm has been shown to detect and correct many of the common spelling errors of the language. Results are promising, with an average accuracy of 82%. The technique can also be applied to construct spell checkers for other phonetic languages where linguistic resources are scarce or non-existent.
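As an illustration only, the following Python sketch shows one way n-gram statistics can be used to flag and correct misspellings. It is not the algorithm from the paper: the choice of character trigrams, the function names, the toy transliterated word list and the scoring rule are all assumptions made for the example.

```python
from collections import Counter

def char_ngrams(word, n=3):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(words, n=3):
    """Count character n-grams over a list of correctly spelled words."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word, n))
    return counts

def is_suspicious(word, counts, n=3):
    """Flag a word if it is too short to score or has an unseen n-gram."""
    grams = char_ngrams(word, n)
    return not grams or any(counts[g] == 0 for g in grams)

def score(word, counts, n=3):
    """Average n-gram count; higher means more typical of the training data."""
    grams = char_ngrams(word, n)
    return sum(counts[g] for g in grams) / len(grams) if grams else 0.0

def edits1(word, alphabet):
    """All strings one deletion, transposition, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts) - {word}

def suggest(word, counts, alphabet, n=3, top=3):
    """Rank nearby strings whose n-grams were all seen in training."""
    candidates = [c for c in edits1(word, alphabet)
                  if not is_suspicious(c, counts, n)]
    return sorted(candidates, key=lambda c: (-score(c, counts, n), c))[:top]

# Toy, transliterated word list (hypothetical data, for illustration only).
corpus = ["amma", "akka", "gama", "mama", "kama"]
counts = train(corpus)
print(is_suspicious("amna", counts))        # the trigram "amn" was never seen
print(suggest("amna", counts, alphabet="agkm"))
```

Because the model only knows which character sequences are typical, it can propose strings that look plausible but are not real words; a practical system would combine such scores with word frequencies from a corpus.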

Research Paper: A Data-Driven Approach to Checking and Correcting Spelling Errors in Sinhala

kathabaha - Text To Speech (TTS) System for Sinhala

While some experimental TTS systems for Sinhala were already under development at the UCSC, the aim of this project was to produce one of commercial quality. To this end, considerable effort was spent on the quality aspects of this activity. Apart from identifying the phonetic alphabet of the language, recording the relevant words and sentences in the database and building a text analysis component, the project also produced a synthesis engine that produces a natural-sounding Sinhala voice. The basic methodology is the diphone concatenation approach.
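The idea behind diphone concatenation can be sketched in a few lines of Python. This is a simplified illustration, not the project's synthesis engine: the unit naming scheme, the silence marker `_` and the toy sample database are assumptions, and a real system would also smooth the joins between units and adjust pitch and duration.

```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into the diphone units that span it.

    Each diphone runs from the middle of one phoneme to the middle of the
    next, so the sequence is padded with silence at both ends.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

def synthesize(phonemes, database):
    """Concatenate the recorded samples of each diphone, in order."""
    samples = []
    for unit in to_diphones(phonemes):
        if unit not in database:
            raise KeyError(f"no recording for diphone {unit!r}")
        samples.extend(database[unit])
    return samples

# Hypothetical database: diphone name -> list of audio samples.
database = {
    "_-a": [0, 1],
    "a-m": [2, 3],
    "m-a": [4, 5],
    "a-_": [6, 0],
}

print(to_diphones(["a", "m", "a"]))   # ['_-a', 'a-m', 'm-a', 'a-_']
print(synthesize(["a", "m", "a"], database))
```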

Research Papers:

  1. Festival-si: A Sinhala Text-to-Speech System
  2. Sinhala Grapheme-to-Phoneme Conversion and Rules for Schwa Epenthesis
  3. A Rule Based Syllabification Algorithm for Sinhala 

Aharamariyi - Optical Character Recognition (OCR) system for Tamil

Aharamariyi is a Tamil Optical Character Recognition system built using the Tesseract OCR engine. We used several methods to identify the features of Tamil script and to extract the most accurate combinations of training data. Aharamariyi OCR achieves 81% accuracy on our evaluation data set.

Research Paper: Developing a commercial grade Tamil OCR for recognizing font and size independent text

Shikshaka – framework for developing language learning tools

Shikshaka is a computer-based teaching framework developed for teaching the spoken aspects of languages using a dialogue-based andragogy. Effective technological methodologies were used to develop the system as an interactive tool using the goal-oriented approach developed by linguistic scholars. We have used the framework to implement two scenarios: teaching Tamil using Sinhala and teaching Sinhala using English. The learner’s language can be customized with little effort, and the framework is flexible enough to be customized to teach other target languages as well, given some pedagogical input.

Research Paper: Content independent open-source language teaching framework 

Developing a Computational Grammar for Sinhala

This is an attempt to develop a feature-based CFG for non-trivial sentences in Sinhala. The resulting grammar covers a significant subset of Sinhala. A parser for producing the appropriate parse tree(s) of input sentences was developed using the NLTK toolkit.

Research Paper: A Computational Grammar of Sinhala

Sinhala – Tamil Machine translation system

The ultimate goal of this research is to develop a machine translation system for the Sinhala-Tamil language pair, to reduce the language barrier between the Sinhala and Tamil communities and thereby help solve a burning issue in the country. We are aiming to develop a machine translation system for both the Sinhala to Tamil and Tamil to Sinhala directions. Currently we are investigating the possibility of integrating linguistic information into the machine translation system, since Sinhala and Tamil are morphologically rich languages.

Research Papers:

  1. Statistical Machine Translation from and into Morphologically Rich and Low Resourced Languages 
  2. Sinhala-Tamil Machine Translation: Towards better Translation Quality 
  3. Towards Sinhala Tamil machine translation

Morphological analyzer for Sinhala

Description will be added soon

Research Paper: Evaluating a Machine Learning Approach to Sinhala Morphological Analysis

Sinhala Wordnet project

Building a lexical resource such as a wordnet is essential for language processing applications for the less-resourced languages of the world. This research was carried out to build a Wordnet for the Sinhala language. The importance of entries was estimated using the word frequencies of the 10-million-word UCSC Sinhala corpus of Contemporary Sinhala, and the relevant lexico-semantic relations were extracted from the Princeton English WordNet (PWN).

Research Paper: Towards a Sinhala Wordnet

Internet Domain Names for Sinhala

Description will be added soon

Research Paper: Implementation of Internet Domain Names in Sinhala

Sinhala OCR (Beta version)

This is a font-specific Sinhala Optical Character Recognition system developed using the k-NN algorithm. It supports four popular proprietary fonts: ‘Abhaya’, ‘Manel’, ‘Lakbima’ and ‘Divaina’. The system attains an overall accuracy of approximately 85%.
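For readers unfamiliar with k-NN, the sketch below shows the basic classification step in Python. It is a generic illustration under stated assumptions, not the system's code: the four-element binary feature vectors, the glyph labels and the squared Euclidean distance are all hypothetical choices made for the example.

```python
from collections import Counter

def distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_classify(sample, training, k=3):
    """Label a sample by majority vote among its k nearest training items.

    training is a list of (feature_vector, label) pairs. Ties in the vote
    go to the label whose nearest member is closest to the sample.
    """
    nearest = sorted(training, key=lambda item: distance(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: tiny flattened glyph bitmaps with glyph labels.
training = [
    ([0, 0, 1, 1], "ka"),
    ([0, 1, 1, 1], "ka"),
    ([1, 1, 0, 0], "ga"),
    ([1, 0, 0, 0], "ga"),
]

print(knn_classify([0, 0, 1, 0], training, k=3))
```

A font-specific system fits this scheme naturally, since the training items can be glyph images rendered in each supported font.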

Research Paper: NLP Applications of Sinhala: TTS & OCR

Building a speech corpus for Sinhala

This project aims to build a speech corpus for Sinhala using news recordings from the Sri Lanka Broadcasting Corporation. The target is to collect 20,000 utterances and their corresponding text transcriptions.

Research Paper: Developing a Speech Corpus for Sinhala Speech Recognition 

Building a Sinhala – Tamil parallel corpus

The main objective of this task is to collect 1 million aligned Sinhala‐Tamil parallel text. This is also a very useful resource to build tools for machine assisted and automatic translation between the Sinhala and Tamil languages.

Building a Sri Lanka Tamil corpus

The main objective of this task is to collect a 3-million-word Sri Lankan Tamil language corpus. This is a very useful resource for building tools for machine-assisted and automatic translation between the two languages, as well as for translation studies.