
LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Luisa Shimabucoro, Sebastian Ruder, Julia Kreutzer, Marzieh Fadaee, Sara Hooker


Code for the LLM profiling detailed in "LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives".

We currently support base models from Hugging Face's Transformers library (PyTorch) as well as Cohere models via the Cohere API.

🖥️ Setup

Run the following to create the environment with all required dependencies:

conda create -n profiling python=3.11.7
conda activate profiling
pip install -r requirements.txt
pip install -e ~/profiling-toolkit/src/benchmarking/lm-evaluation-harness
pip install -e ~/profiling-toolkit/src/profiling/bias_bench

🧰 Usage

📊 Profiling LLMs

To profile a given LLM, run the following script:

python run_profiling.py \
    --profiling_tools <profiling_tools> \
    --model_type <model_type> \
    --basemodel_path <basemodel_path> \
    --batch_size <batch_size> \
    --max_new_tokens <max_tokens> \
    --experiment_dir <experiment_dir> \      # optional
    --seed <seed> \                          # optional
    --hf_auth_token <auth_token> \           # optional
    --quantize \                             # optional
    --quantization_type <quant_type> \       # optional
    --precision <precision> \                # optional
    --text_dataset <text_dataset> \          # optional
    --perspective_key <perspective_key>      # optional

All available options are described in the script's help output:

>>> python run_profiling.py --help
Parameters to perform profiling of a given model.

options:
  -h, --help            show this help message and exit
  --persistent_dir PERSISTENT_DIR
                        Directory where all persistent data will be stored, defaults to the directory of the cloned repository.
  --model_type {HuggingFaceModel,AyaHuggingFace,CohereModels}
                        Model type to evaluate on, AutoModelForCausalLM models should use HuggingFaceModel.
  --basemodel_path BASEMODEL_PATH
                        Path to folder where model checkpoint is stored, both local checkpoints and remote HF paths can be used.
  --batch_size BATCH_SIZE
                        Max batch size to use to collect generations for TextualCharacteristicsProfiling.
  --max_new_tokens MAX_NEW_TOKENS
                        Max number of tokens to be generated per generation for TextualCharacteristicsProfiling.
  --text_dataset {StrategyQA,Dolly200_val,Dolly200_test}
                        Dataset to be used to prompt models to calculate textual characteristics.
  --profiling_tools PROFILING_TOOLS
                        List of types of profiling tools to run separated by a comma (,), valid options are TextualCharacteristicsProfiling,SocialBiasProfiling,CalibrationProfiling,ToxicityProfiling.
  --experiment_dir EXPERIMENT_DIR
                        Directory where results should be stored, if no directory name is provided defaults to <persistent_dir>/results/profiling/.
  --quantize            Flag determining whether model should be quantized or not.
  --quantization_type {4_bit,8_bit}
                        What type of quantization to use.
  --precision {bf16,fp16,regular}
                        Whether to use mixed-precision when training or not.
  --seed SEED           Seed value for reproducibility.
  --auth_token AUTH_TOKEN
                        Hugging Face authorization token necessary to run restricted models (e.g. LLaMA models).
  --perspective_key PERSPECTIVE_KEY
                        Perspective API key to use to perform ToxicityProfiling.
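
For instance, a run that profiles a Hugging Face checkpoint for textual characteristics and social bias might look like the sketch below. The checkpoint path, batch size, and token budget are illustrative placeholders rather than values prescribed by the toolkit:

python run_profiling.py \
    --profiling_tools TextualCharacteristicsProfiling,SocialBiasProfiling \
    --model_type HuggingFaceModel \
    --basemodel_path meta-llama/Llama-2-7b-hf \
    --batch_size 8 \
    --max_new_tokens 256 \
    --text_dataset Dolly200_test \
    --seed 42

Adding ToxicityProfiling to --profiling_tools additionally requires a --perspective_key, and gated checkpoints (e.g. the LLaMA family) require the Hugging Face authorization token option described above.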

🩺 Benchmark General Performance

To benchmark the general performance of a given LLM on a selection of tasks from lm-evaluation-harness, run the following script:

python run_utility_benchmarking.py \
    --basemodel_path <basemodel_path> \
    --experiment_dir <experiment_dir> \      # optional
    --seed <seed> \                          # optional
    --quantize \                             # optional
    --quantization_type <quant_type> \       # optional
    --num_fewshot <num_fewshot>              # optional

All available options are described in the script's help output:

>>> python run_utility_benchmarking.py --help
Parameters to run general performance benchmarks.

options:
  -h, --help            show this help message and exit
  --persistent_dir PERSISTENT_DIR
                        Directory where all persistent data will be stored, defaults to the directory of the cloned repository.
  --basemodel_path BASEMODEL_PATH
                        Path to folder where model checkpoint is stored, both local checkpoints and remote HF paths can be used.
  --experiment_dir EXPERIMENT_DIR
                        Directory where results should be stored, if no directory name is provided defaults to <persistent_dir>/results/profiling/.
  --tasks TASKS         List of types of tasks from lm-evaluation-harness to evaluate your model on. To check the complete list of tasks run `lm-eval --tasks list`.
  --quantize            Flag determining whether model should be quantized or not.
  --quantization_type {4_bit,8_bit}
                        What type of quantization to use.
  --num_fewshot NUM_FEWSHOT
                        Number of few-shot examples to use during evaluation, defaults to 0.
  --seed SEED           Seed value for reproducibility.
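
A concrete benchmarking run might look like the sketch below. The checkpoint path, task selection, and few-shot setting are illustrative placeholders; the comma-separated --tasks value assumes the same list format as --profiling_tools above, and valid task names can be listed with `lm-eval --tasks list`:

python run_utility_benchmarking.py \
    --basemodel_path meta-llama/Llama-2-7b-hf \
    --tasks hellaswag,openbookqa \
    --num_fewshot 0 \
    --seed 42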

📖 Metrics Overview

| Category | Metric/Benchmark | Overview | Reference |
| --- | --- | --- | --- |
| Toxicity | Expected Maximum Toxicity (EMT) | Calculates the mean maximum toxicity score over k=25 generations collected for the same prompt; used to estimate worst-case toxicity. Scores are measured via the Perspective API. | RealToxicityPrompts, Perspective API |
| | Toxicity Probability | Calculates the empirical probability of a model generating at least one response with TOXICITY >= 0.5 over k=25 generations; measures how frequently a model generates toxic responses. | |
| Social Bias | SEAT | The Sentence Encoder Association Test (SEAT) is an embedding-based benchmark that extends the Word Embedding Association Test (WEAT) to sentence-level representations. It evaluates bias by measuring the association strength between sets of attribute words (e.g., gender-related words) and sets of target words (e.g., family- or career-related words). | SEAT |
| | StereoSet | StereoSet is a benchmark for measuring stereotypical bias in language models, using contexts with masked words and sets of stereotypical, anti-stereotypical, and unrelated associations. It quantifies bias via a stereotype score: the percentage of examples in which a model prefers the stereotypical association. | StereoSet |
| | CrowS-Pairs | Crowdsourced Stereotype Pairs (CrowS-Pairs) is a benchmark dataset of minimally distant sentence pairs, one sentence reflecting a stereotype and the other violating it. It quantifies bias by measuring a model's preference for stereotypical sentences over anti-stereotypical ones, similarly to StereoSet but with a different set of comparison sentences. | CrowS-Pairs |
| | BBQ | BBQ (Bias Benchmark for QA) is designed to measure social biases in the predictions of language models, particularly in question-answering tasks. It contains unique examples and templates, each consisting of two questions, answer choices, and two contexts: a partial context missing relevant information and a disambiguating context that provides the necessary information. | BBQ |
| Textual Characteristics | Measure of Textual Lexical Diversity (MTLD) | MTLD employs a sequential analysis of a body of text to estimate a lexical diversity score; it reflects the average number of consecutive words for which a given Type-Token Ratio (TTR) is maintained. | MTLD |
| | Length | Calculates a group of metrics related to the length of generations: number of characters/tokens/sentences, sentence/token length, etc. | TextDescriptives |
| | Gunning-Fog | Readability index that estimates the years of formal education needed to understand the text on a first reading. Grade level = 0.4 × (ASL + PHW), where ASL is the average sentence length (total words / total sentences) and PHW is the percentage of hard words (words with three or more syllables). | |
| | Rix | Readability measure that estimates the difficulty of a text based on the proportion of long words (more than six characters): Rix = n_long_words / n_sentences. | |
| | Miscellaneous | In addition to the metrics described above, further metrics and descriptive statistics are computed; see the TextDescriptives reference. | TextDescriptives |
| Calibration | Expected Calibration Error (ECE) | ECE evaluates the reliability of a model's predicted probabilities by measuring the difference between accuracy and confidence across multiple bins of predictions. A lower ECE indicates better calibration, with a perfectly calibrated model achieving an ECE of zero. We calculate 1-bin and 10-bin ECE on HellaSwag and OpenBookQA. | HELM |
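
As a point of reference for the calibration entry above, the standard binned formulation it describes is ECE = Σ_b (n_b / N) × |acc(b) − conf(b)|, where n_b is the number of predictions falling into confidence bin b, N is the total number of predictions, acc(b) is the accuracy within the bin, and conf(b) is the mean predicted confidence within the bin. The 1-bin variant therefore reduces to the absolute difference between overall accuracy and overall mean confidence.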

🗣️ Acknowledgments

This repository makes use of code and/or data from the following repositories:

We thank the authors for making their code publicly available.

📄 Citation

@misc{shimabucoro2024llmseellmdo,
      title={LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives}, 
      author={Luísa Shimabucoro and Sebastian Ruder and Julia Kreutzer and Marzieh Fadaee and Sara Hooker},
      year={2024},
      eprint={2407.01490},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.01490}
}
