Roadmap for Balochi NLP

This is a non-exhaustive list of areas where work is needed for Balochi NLP:

Short-term mini-projects

  • A custom-trained tokenizer (@strickvl working on this)
  • A stopword list (and other basics, such as lists of characters and punctuation with their associated Unicode code points)
  • A tool for converting text between the different scripts used to write Balochi
  • Dialect classifier
  • NER (named entity recognition) models
  • Good quality dataset(s) that are openly available for all to use
  • OCR support for Balochi texts (a computer vision task, but one that would probably help build datasets; it is highly likely we can benefit from work done for Arabic and Persian)
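
As a rough illustration of the "basic things" item above (character lists with code points, and a first pass at stopwords), here is a minimal Python sketch. It only uses the standard library. The function names `char_inventory` and `stopword_candidates` are hypothetical, whitespace splitting stands in for a real tokenizer, and taking the most frequent tokens is just one common heuristic for finding stopword candidates, not a method this project has settled on.

```python
import unicodedata
from collections import Counter


def char_inventory(text):
    """Return (char, code point, Unicode name, count) rows, most frequent first.

    Whitespace characters are skipped.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


def stopword_candidates(text, top_n=10):
    """Return the top_n most frequent whitespace-separated tokens.

    Assumption: high frequency is only a hint; the output still needs
    review by someone who knows the language.
    """
    tokens = text.split()
    return [word for word, _ in Counter(tokens).most_common(top_n)]


if __name__ == "__main__":
    # Arbitrary placeholder letters (not real Balochi text); swap in a corpus.
    sample = "\u0628\u0627\u0628"
    for row in char_inventory(sample):
        print(row)
```

Run over a real corpus, the first function gives a starting table of characters and punctuation to curate by hand, and the second gives a list of frequent tokens to review as possible stopwords.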

Medium- to long-term goals / projects

  • Embeddings
  • Benchmarks
  • Text-to-Speech (TTS) models (for generating audio)
  • Speech-to-Text (STT) models (for transcribing audio)
  • Language models (of various architectures)

Potential partner organisations

Support could possibly come from leading organisations in the space. Importantly, both have a strong track record of encouraging and supporting work on low-resource languages: