MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages (ACL'25 Findings)

MiLiC-Eval is an NLP evaluation suite for Minority Languages in China, covering Tibetan (bo), Uyghur (ug), Kazakh (kk, in the Kazakh Arabic script), and Mongolian (mn, in the traditional Mongolian script).

📑 Preprint

📊 HuggingFace

Statistics

Tasks

Currently, MiLiC-Eval consists of 9 tasks in 4 minority languages (several tasks also include parallel Chinese and English data), totaling about 24K instances. The statistics of each task are shown in the following table.

| Task | Size | Metric | Languages |
| --- | --- | --- | --- |
| Vocabulary Understanding | 1,000/lang | Accuracy | bo, ug, kk, mn |
| Topic Classification (Sentence) | 492/lang | Accuracy | bo, ug, kk, mn, zh, en |
| Topic Classification (Passage) | 600/lang | Accuracy | bo, ug, kk, mn |
| Reading Comprehension | 250/lang | Accuracy | bo, ug, kk, mn, zh, en |
| Response Selection | 507/lang | Accuracy | bo, ug, kk, mn, zh, en |
| Title Generation | 1,000/lang | ROUGE-L | bo, ug, kk, mn |
| Machine Translation (Article) | 1,012/lang | chrF++ | bo, ug, kk, mn, zh, en |
| Machine Translation (Dialogue) | 773/lang | chrF++ | bo, ug, kk, mn, zh, en |
| Math Reasoning | 250/lang | Accuracy | bo, ug, kk, mn, zh, en |

Data Splits

For each task, we provide a data split, including training, development, and test sets.

The training sets are small and intended for in-context learning. For each task, we provide three training sets sampled with different seeds to reduce the impact of randomness during prompting.

The development sets are used for hyperparameter tuning. The test sets are used for evaluation.

For each language, the data split is shown in the following table.

| Task | Train | Dev | Test |
| --- | --- | --- | --- |
| Vocabulary Understanding | 20 * 3 | 40 | 900 |
| Topic Classification (Sentence) | 10 * 3 | 30 | 432 |
| Topic Classification (Passage) | 16 * 3 | 48 | 504 |
| Reading Comprehension | 10 * 3 | 20 | 200 |
| Response Selection | 20 * 3 | 40 | 407 |
| Title Generation | 20 * 3 | 40 | 900 |
| Machine Translation (Article) | 20 * 3 | 40 | 912 |
| Machine Translation (Dialogue) | 20 * 3 | 40 | 673 |
| Math Reasoning | 10 * 3 | 20 | 200 |
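
As an illustration of how the three seeded training sets might be consumed for in-context learning, here is a minimal Python sketch. The directory layout, file names, and JSON fields (`text`, `label`) below are assumptions for illustration only, not the repository's actual format.

```python
import json
from pathlib import Path

# Hypothetical layout: data/<task>/<lang>/<split>.json with "text"/"label" fields.
DATA_DIR = Path("data")

def load_split(task: str, lang: str, split: str) -> list[dict]:
    with open(DATA_DIR / task / lang / f"{split}.json", encoding="utf-8") as f:
        return json.load(f)

def build_prompt(train_examples: list[dict], query: str) -> str:
    # Use the small training set as in-context demonstrations.
    demos = "\n".join(f"Input: {ex['text']}\nLabel: {ex['label']}" for ex in train_examples)
    return f"{demos}\nInput: {query}\nLabel:"

# Run prompting once per seeded training set and average the scores,
# which is how the three seeds help reduce prompt-sampling variance.
for seed in range(3):
    train = load_split("topic_classification_sentence", "bo", f"train_{seed}")
    test = load_split("topic_classification_sentence", "bo", "test")
    prompt = build_prompt(train, test[0]["text"])
    # ... run model inference on `prompt` and compare against test[0]["label"]
```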

Usage

Download

The dataset can be downloaded from Hugging Face. Put the downloaded dataset in the data directory.
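
If you prefer to fetch the data programmatically, the snapshot can be pulled with `huggingface_hub`. This is a minimal sketch; the `repo_id` below is an assumption based on the repository name, so check the Hugging Face link above for the exact dataset id.

```python
from huggingface_hub import snapshot_download

# Download the dataset snapshot into the local `data` directory.
# The repo_id is assumed; replace it with the id shown on the Hugging Face page.
snapshot_download(
    repo_id="luciusssss/MiLiC-Eval",
    repo_type="dataset",
    local_dir="data",
)
```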

Setup

  1. Install the packages required for inference:
pip install -r requirements.txt
  2. Install the package required for multilingual ROUGE scoring. See https://github.com/csebuetnlp/xl-sum/tree/master/multilingual_rouge_scoring

  3. Run the scripts for inference and metric calculation (with Qwen-2.5 as an example):

cd scripts
bash run_eval.sh
bash calculate_metrics.sh

The evaluation results will be saved in the output directory.
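
For reference, chrF++ (the metric used for the translation tasks) can be computed with the `sacrebleu` library. This is a standalone sketch for orientation; `calculate_metrics.sh` may use different settings or a different implementation.

```python
from sacrebleu.metrics import CHRF

# chrF++ is chrF with word n-grams enabled (word_order=2).
chrf_pp = CHRF(word_order=2)

# Placeholder strings; in practice these come from model outputs and test references.
hypotheses = ["example system output"]
references = [["example reference translation"]]

score = chrf_pp.corpus_score(hypotheses, references)
print(score)
```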

Pretraining Corpus

Current LLMs have limited performance in minority languages due to the lack of pretraining data. We provide a pretraining corpus, MC^2, for the four languages in MiLiC-Eval.

The corpus can be downloaded from Hugging Face. You can read the details of the corpus in our paper MC^2: Towards Transparent and Culturally-Aware NLP for Minority Languages in China (ACL 2024).
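
As a sketch of how the corpus might be consumed, it can be streamed with the `datasets` library. The dataset id and configuration name below are assumptions; check the MC^2 page on Hugging Face for the exact identifiers.

```python
from datasets import load_dataset

# Stream the MC^2 corpus instead of downloading it in full.
# The dataset id and config name are assumed for illustration.
corpus = load_dataset("pkupie/mc2_corpus", "bo", split="train", streaming=True)
for doc in corpus:
    print(doc)
    break
```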

Citation

If you use MiLiC-Eval in your research, please cite our paper:

@article{zhang2025milic,
      title={MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages},
      author={Zhang, Chen and Tao, Mingxu and Liao, Zhiyuan and Feng, Yansong},
      journal={arXiv preprint arXiv:2503.01150},
      year={2025},
      url={https://arxiv.org/abs/2503.01150},
}
