
Commit 981e293

Jize Wang authored and committed
update README
1 parent 816710a commit 981e293

File tree: 1 file changed, +31 -19 lines changed


README.md

+31-19
@@ -15,15 +15,17 @@
 [📃 [Paper](https://xxx)]
 [🌐 [Project Page](https://xxx)]
 [🤗 [Hugging Face](https://xxx)]
-[📌 [License](https://xxx)]
+[📌 [License](https://github.com/open-compass/GTA/blob/main/LICENSE.txt)]
 </div>

 ## 🌟 Introduction

+> In developing general-purpose agents, significant focus has been placed on integrating large language models (LLMs) with various tools. This poses a challenge to the tool-use capabilities of LLMs. However, there are evident gaps between existing tool evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only inputs, which fail to reveal the agents' real-world problem-solving abilities effectively.
+
 GTA is a benchmark to evaluate the tool-use capability of LLM-based agents in real-world scenarios. It features three main aspects:
 - **Real user queries.** The benchmark contains 229 human-written queries with simple real-world objectives but implicit tool-use, requiring the LLM to reason about the suitable tools and plan the solution steps.
-- **Real deployed tools.** an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance.
-- **Real multimodal inputs.** authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as the query contexts to align with real-world scenarios closely.
+- **Real deployed tools.** GTA provides an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance.
+- **Real multimodal inputs.** Each query is paired with authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, which serve as query contexts that closely align with real-world scenarios.
 <div align="center">
 <img src="figs/dataset.jpg" width="800"/>
 </div>
@@ -32,10 +34,10 @@ GTA is a benchmark to evaluate the tool-use capability of LLM-based agents in re

 - **[2024.x.xx]** Paper available on Arxiv. ✨✨✨
 - **[2024.x.xx]** Release the evaluation and tool deployment code of GTA. 🔥🔥🔥
-- **[2024.x.xx]** Release the GTA dataset on Hugging Face and OpenDataLab. 🎉🎉🎉
+- **[2024.x.xx]** Release the GTA dataset on Hugging Face. 🎉🎉🎉

 ## 📚 Dataset Statistics
-GTA comprises a total of 229 questions. The basic dataset statistics is presented below.
+GTA comprises a total of 229 questions. The basic dataset statistics are presented below. The number of tools involved in each question varies from 1 to 4, and the number of steps required to resolve a question ranges from 2 to 8.

 <div align="center">
 <img src="figs/statistics.jpg" width="800"/>
@@ -76,48 +78,58 @@ yi-6b-chat | 21.26 | 14.72 | 0 | 32.54 | 1.47 | 0 | 1.18 | 0 | 0.58
 ## 🚀 Evaluate on GTA
 To evaluate on GTA, we prepare the model, tools, and evaluation process with [LMDeploy](https://github.com/InternLM/lmdeploy), [AgentLego](https://github.com/InternLM/agentlego), and [OpenCompass](https://github.com/open-compass/opencompass), respectively. These three parts need three different conda environments.

+### Prepare GTA Dataset
 1. Clone this repo.
 ```
 git clone https://github.com/open-compass/GTA.git
+cd GTA
+```
+2. Download the dataset via Hugging Face.
 ```
-2. Prepare the dataset.
+pip install -U huggingface_hub
+export HF_ENDPOINT=https://hf-mirror.com
+mkdir -p ./opencompass/data/gta
+huggingface-cli download --repo-type dataset --resume-download Jize1/GTA --local-dir ./opencompass/data/gta --local-dir-use-symlinks False
 ```
-pip install openxlab
-openxlab login
-openxlab dataset get --dataset-repo Jize/GTA
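
Once the Hugging Face download finishes, a quick sanity check that the dataset actually landed in `./opencompass/data/gta` (the directory created in the step above) might look like this:

```
# Confirm the GTA dataset files are in the directory the evaluation step reads from.
ls ./opencompass/data/gta
# Rough size check; a near-empty directory usually means the download was interrupted.
du -sh ./opencompass/data/gta
```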
+### Prepare Your Model
+1. Download the model weights. Take Qwen1.5-7B-Chat as an example.
 ```
-3. Prepare the model with LMDeploy.
+mkdir -p models/qwen1.5-7b-chat
+huggingface-cli download --resume-download Qwen/Qwen1.5-7B-Chat --local-dir ./models/qwen1.5-7b-chat --local-dir-use-symlinks False
+```
+2. Install LMDeploy.
 ```
-# install LMDeploy
 conda create -n lmdeploy python=3.10
 conda activate lmdeploy
 pip install lmdeploy
 ```
+3. Launch a model service.
 ```
-# launch a model service with LMDeploy
-lmdeploy serve api_server /path/to/your/model --server-port [port_number] --model-name [your_model_name]
+# lmdeploy serve api_server path/to/your/model --server-port [port_number] --model-name [your_model_name]
+
+lmdeploy serve api_server models/qwen1.5-7b-chat --server-port 12580 --model-name qwen1.5-7b-chat
 ```
-4. Prepare tools with AgentLego
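
With the model service from the last step running, it is worth confirming it is reachable before deploying the tools. The calls below reuse the port and model name from the example command and assume LMDeploy's `api_server` exposes its usual OpenAI-compatible routes:

```
# List the models the api_server is currently serving.
curl http://0.0.0.0:12580/v1/models

# Send a one-turn chat request to confirm the model actually generates a reply.
curl http://0.0.0.0:12580/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen1.5-7b-chat", "messages": [{"role": "user", "content": "Hello!"}]}'
```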
+### Deploy Tools
+1. Install AgentLego.
 ```
-# install AgentLego
-conda deactivate
 conda create -n agentlego python=3.10
 conda activate agentlego
 cd agentlego
 pip install -e .
 ```
+2. Deploy the tools for the GTA benchmark.
 ```
-# deploy tools for GTA benchmark
 agentlego-server start --port 16181 --extra ./benchmark.py `cat benchmark_toollist.txt` --host 0.0.0.0
 ```
-5. Infer and evaluate with OpenCompass
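
Before moving on, you can check that the tool server is actually serving. The first call below only assumes the server answers HTTP on port 16181; the second assumes it keeps FastAPI's default documentation routes enabled, which may not hold for every AgentLego version:

```
# Confirm the tool server answers on the expected host and port.
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://0.0.0.0:16181/

# If the default OpenAPI route is enabled (an assumption), this lists the deployed tool endpoints.
curl -s http://0.0.0.0:16181/openapi.json
```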
+### Start Evaluation
+1. Install OpenCompass.
 ```
-# install OpenCompass
 conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
 conda activate opencompass
 cd opencompass
 pip install -e .
 ```
+2. Infer and evaluate with OpenCompass.
 ```
 # infer and evaluate
 python run.py configs/eval_gta_bench.py -p llmit -q auto --max-num-workers 32 --debug
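
# Note: `-p llmit -q auto` submits the run to a Slurm partition through OpenCompass's
# Slurm launcher. On a single machine without Slurm (an assumption about your setup),
# the local runner can be used by dropping those two flags, e.g.:
# python run.py configs/eval_gta_bench.py --max-num-workers 8 --debug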
