Tools & Resources


Online Applications

Subasa online Spell-checker for Sinhala

The Subasa spell-checker is a free online application for spell-checking Sinhala text written in Sinhala Unicode fonts. It does not support proprietary Sinhala fonts. You can fix your Sinhala spelling problems by simply copying your text into the provided space.

Font Encoding Converters for Sinhala & Tamil

A real-time font encoding conversion tool that runs in your web browser. It converts text between proprietary font encodings and Unicode, in both directions, for both Sinhala and Tamil. The tool currently supports the DL-Manel, FM Abhaya, Thibus, kaputa and Amalee Sinhala fonts and the Bamini Tamil font.

Online phonetic keyboard for Sinhala

An easy-to-use virtual Sinhala keyboard for entering Sinhala text.


Stand Alone Applications

Text To Speech (TTS) System for Sinhala

‘Voice-Si’ is a Text to Speech (TTS) system for Sinhala, developed using the MaryTTS speech synthesizer. It can be integrated with the NVDA screen reader through the SpeechHub speech server, and can serve as a computer aid for visually impaired people who work in the Sinhala language.

Language Learning Tool to learn Tamil in Sinhala

A Computer Assisted Language Learning (CALL) tool to help native Sinhala speakers learn Tamil. The tool comprises 10 lessons designed using materials taken from a translation of a well-known Tamil language learning book written by Prof. Suseendirarajah et al. Its main aim is to teach spoken Tamil. Audio clips, photos and text materials help users understand the lessons.

Language Learning Tool to learn Sinhala in English

A Computer Assisted Language Learning (CALL) tool to help speakers of other languages learn Sinhala through the medium of English. It comprises 15 lessons built from audio clips, photos and text materials for various dialogues.

Sinhala Syllabification Tool

A tool that syllabifies Sinhala text (in Unicode) into phonetic representations. The input text is analyzed by a phonetic analyzer, which produces a word-by-word phonetic representation with syllable boundaries marked by brackets.

Aharamariyi Tamil OCR

Aharamariyi Tamil OCR is an Optical Character Recognition system for Tamil script. It converts images of Tamil text into machine-readable text. The system accepts .jpg files as input and produces unformatted Tamil Unicode text as output.

Corpus concordance for Sinhala/Tamil

A search tool that shows how words are used in Sinhala and Tamil. It is a helpful resource for people who work with language resources.
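As a rough illustration of what a concordance search does, the sketch below lists each occurrence of a keyword together with a few words of context on either side (a keyword-in-context view). It is not the tool's implementation: the function name `kwic`, the whitespace tokenization and the English placeholder sentence are all assumptions made for the example.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    Hypothetical helper: tokens are split on whitespace, which is an
    assumption and not how the actual concordance tool tokenizes text.
    """
    tokens = text.split()
    rows = []
    for i, tok in enumerate(tokens):
        # Ignore punctuation attached to the token when comparing.
        if tok.strip(".,;:!?\"'()") == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Placeholder English text; Sinhala or Tamil Unicode text works the same way.
sample = "the cat sat on the mat and the dog sat on the rug"
for left, kw, right in kwic(sample, "sat", width=2):
    print(f"{left:>15} [{kw}] {right}")
```

Aligning the keyword in a fixed column, as the print loop does, makes it easier to compare the contexts in which a word appears.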

Translation memory tool

A tool to aid translators working on Sinhala-to-Tamil text translation. It lets translators store previous translations and uses them to suggest the most likely translations for the current source text.
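The store-and-suggest idea behind a translation memory can be sketched as follows. This is only an illustration under stated assumptions: the class `TranslationMemory`, its methods, the similarity threshold and the use of Python's `difflib.SequenceMatcher` for fuzzy matching are choices made for the example, not a description of how the tool itself ranks suggestions. The stored strings are placeholders.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Hypothetical in-memory store of (source, target) segment pairs."""

    def __init__(self):
        self.entries = []

    def add(self, source, target):
        # Remember a finished translation for later reuse.
        self.entries.append((source, target))

    def suggest(self, query, threshold=0.6, limit=3):
        # Score every stored source segment against the new text and
        # return the closest matches, best first.
        scored = []
        for source, target in self.entries:
            score = SequenceMatcher(None, query, source).ratio()
            if score >= threshold:
                scored.append((score, source, target))
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:limit]


tm = TranslationMemory()
tm.add("good morning", "<stored translation 1>")  # placeholder target text
tm.add("good evening", "<stored translation 2>")  # placeholder target text
for score, source, target in tm.suggest("good morning"):
    print(f"{score:.2f}  {source} -> {target}")
```

The threshold trades recall for precision: a lower value returns more, but less similar, earlier translations.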

Font encoding converter for Sinhala & Tamil

A desktop application to convert Sinhala/Tamil proprietary fonts into Unicode fonts and vice versa. This supports 26 Sinhala proprietary fonts and 4 Tamil proprietary fonts.

ingiya English-Sinhala dictionary add-on for Firefox

The ingiya English-Sinhala dictionary is a pop-up dictionary add-on for the Firefox web browser. It helps people find the meanings of difficult words on English or Sinhala web pages.

ingiwadana Sinhala predictive text input system for Android

ingiwadana, the Sinhala predictive text input system, is an Android application that makes typing in Sinhala faster. It can be used to send SMS messages and emails and to search the web in Sinhala, enabling people to work in Sinhala on the Android platform.


Lexical Resources

10 Million word contemporary Sinhala text corpus for language research

The UCSC mini corpus contains 10 million Sinhala words collected from Sinhala newspaper articles. There are around 135,000 distinct words in the corpus, which comprises 2,794 text files containing editorials, feature articles, foreign news and sports news.

UCSC Sinhala POS tagset

The English-Sinhala parallel corpus is intended for language researchers working on English-Sinhala machine translation. The corpus contains 4,301 English sentences along with the corresponding Sinhala translations.

500k Sinhala tagged corpus

The UCSC tagged corpus contains 500K words, manually tagged by Sinhala linguists using the UCSC Sinhala POS tagset (version 1). Words that do not belong to any defined tag are tagged with a question mark (?).

1300 word Sinhala WordNet for language technology improvement

UCSC Sinhala wordnet (version 1) contains 1,075 word senses and each sense includes synsets along with the corresponding English word, Princeton ID for the synset, POS Category and the Gloss.

List of proper names for language research

A list of Sinhala proper names including country names, Sinhala personal names, names of Sri Lankan and international cities, names of Sinhala artists, Sri Lankan rivers and reservoirs. Currently there are around 20,800 proper name entries.

Named Entity Tagged Corpus

The Sinhala Named Entity Tagged Corpus consists of around 83K words in which person names, location names and organization names have been tagged as Named Entities.

Sinhala NEWS Corpus

A speech corpus with 8000 utterances of recorded Sinhala news read by both male and female announcers. This is still an ongoing project.

List of Sinhala Functional Words

A list of 425 Sinhala functional words, including conjunctions, determiners, interjections, particles and postpositions.

Sinhala_NE_Data (including person and place name entities)

ingiya English-Sinhala dictionary database

The English-Sinhala dictionary database used in the ingiya English-Sinhala dictionary add-on. This database consists of ≈36,000 English word entries and the corresponding Sinhala meanings.

400K Distinct word list

A list of 400K distinct words extracted from the UCSC Sinhala text corpus.
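For readers who want to build a similar list from their own text, the sketch below shows one way to collect distinct words and their frequencies. The function name `distinct_words`, the whitespace tokenization, the punctuation set and the NFC normalization step are all assumptions made for illustration; the procedure actually used to produce the UCSC list is not described here.

```python
from collections import Counter
import unicodedata


def distinct_words(lines):
    """Count each distinct word in an iterable of text lines.

    Hypothetical helper: splits on whitespace and strips a small set of
    punctuation marks, which is a simplification.
    """
    counts = Counter()
    for line in lines:
        # NFC-normalize so equivalent Unicode sequences compare equal.
        for token in unicodedata.normalize("NFC", line).split():
            word = token.strip(".,;:!?\"'()")
            if word:
                counts[word] += 1
    return counts


# Placeholder input; in practice the lines would come from corpus files.
counts = distinct_words(["a b a.", "b c"])
print(len(counts), "distinct words")
for word, freq in counts.most_common():
    print(word, freq)
```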

Speech corpora for Sinhala speech processing

Female voice corpus

A speech corpus with 3000 Sinhala utterances spoken by a single female speaker. This corpus was initially designed to build an Automatic Speech Recognition (ASR) system for Sinhala. The utterances were selected to cover the most frequently used words in Sinhala.

Male voice corpus

A speech corpus with 625 Sinhala utterances spoken by a single male speaker. This corpus was initially designed to build a Text to Speech (TTS) system for Sinhala.

2000 voice corpus

A speech corpus with 74,000 Sinhala utterances spoken by various speakers, both male and female, across different age groups. This corpus was initially designed to build a song request application for mobile phones.