
Instruct Multilingual

This repository contains code to translate datasets into multiple languages using the NLLB (No Language Left Behind) model. The model can be used either through an inference server or directly from the command line.

Note: This repository has been tested on a Linux machine with an NVIDIA GPU. The code assumes access to a GPU. Depending on your hardware, you may need to modify the code to change the number of GPUs and the batch size.

Setup

It is recommended to use a virtual environment to install the dependencies.

conda create -n instructmultilingual python=3.8.10 -y
conda activate instructmultilingual
pip install -r requirements.txt
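
Once the environment is set up, a quick way to confirm that a GPU is visible is the minimal check below (it assumes PyTorch is pulled in by requirements.txt):

# Minimal GPU sanity check; assumes PyTorch is installed via requirements.txt.
import torch

print(torch.cuda.is_available())   # True if the code can see a GPU
print(torch.cuda.device_count())   # number of GPUs available for translation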

Inference Server

The inference server is a FastAPI application that can be used to translate a single text or entire datasets.

Convert the model first

For efficient inference, the model is converted using CTranslate2.

mkdir models
ct2-transformers-converter --model facebook/nllb-200-3.3B --output_dir models/nllb-200-3.3B-converted
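
To sanity-check the conversion, the converted model can be loaded directly with the ctranslate2 Python API. The sketch below follows CTranslate2's standard NLLB usage; the French target code fra_Latn is just an example choice, not something this repository prescribes:

# Minimal sketch to verify the converted model; assumes the conversion
# above completed and that transformers/sentencepiece are installed.
import ctranslate2
from transformers import AutoTokenizer

# Load the converted model on GPU (use device="cpu" if no GPU is available).
translator = ctranslate2.Translator("models/nllb-200-3.3B-converted", device="cuda")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")

# Tokenize the source text and prefix the target with a FLORES-200 code.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello, world!"))
results = translator.translate_batch([source], target_prefix=[["fra_Latn"]])
target = results[0].hypotheses[0][1:]  # drop the leading language-code token
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))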

Run the server locally

To start the server, run the following command:

uvicorn instructmultilingual.server:app --host 0.0.0.0 --port 8000
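
Once the server is running, FastAPI's auto-generated interactive documentation is available at http://localhost:8000/docs.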

Using docker to run the server

To run the server with Docker, build the image and then run the container:

# Build
docker build -t instruct-multilingual .

# Run
docker run -it --rm --gpus 1 -p 8000:8000 -v $(pwd):/instruct-multilingual instruct-multilingual

Client Side

This script translates the SAMSum dataset using the inference server. It demonstrates how to use the inference server to translate both a single text and entire datasets.

python main.py
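
For reference, a single translation request can also be sent to the server directly. The sketch below is an assumption about the request shape: the /translate URL comes from the dataset translation example later in this README, and the field names are guessed from the CLI flags of the translate script:

# Minimal sketch of a direct request to the inference server; the payload
# field names (text, source_language, target_language) are assumptions
# based on the CLI flags, not a documented schema.
import requests

response = requests.post(
    "http://localhost:8000/translate",
    json={
        "text": "Cohere For AI will make the best instruct multilingual model in the world",
        "source_language": "English",
        "target_language": "Egyptian Arabic",
    },
)
print(response.json())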

Translate

We also provide a script to translate a single text from the CLI. This script downloads the model from Hugging Face and translates the text provided into the target language.

python -m instructmultilingual.translate \
          --text="Cohere For AI will make the best instruct multilingual model in the world" \
          --source_language="English" \
          --target_language="Egyptian Arabic"
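
Internally this amounts to running NLLB through Hugging Face transformers. A rough sketch of the equivalent call using the translation pipeline is shown below; the script's actual internals may differ:

# Rough equivalent of the CLI call above, sketched with the transformers
# translation pipeline; not necessarily how the script is implemented.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-3.3B",
    src_lang="eng_Latn",   # FLORES-200 code for English
    tgt_lang="arz_Arab",   # FLORES-200 code for Egyptian Arabic
)
print(translator("Cohere For AI will make the best instruct multilingual model in the world"))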

Translate an instructional dataset from xP3 (or any dataset repo on the Hugging Face Hub)

An example of using translate_dataset_from_huggingface_hub to translate PIQA with the finish_sentence_with_correct_choice template into the languages used by the Multilingual T5 (mT5) model:

from instructmultilingual.translate_datasets import translate_dataset_from_huggingface_hub

translate_dataset_from_huggingface_hub(
    repo_id="bigscience/xP3",
    train_set=["en/xp3_piqa_None_train_finish_sentence_with_correct_choice.jsonl"],
    validation_set=["en/xp3_piqa_None_validation_finish_sentence_with_correct_choice.jsonl"],
    test_set=[],
    dataset_name="PIQA",
    template_name="finish_sentence_with_correct_choice",
    splits=["train", "validation"],
    translate_keys=["inputs", "targets"],
    url="http://localhost:8000/translate",
    output_dir="/home/weiyi/instruct-multilingual/datasets",
    source_language="English",
    checkpoint="facebook/nllb-200-3.3B",
)