> In developing general-purpose agents, significant focus has been placed on integrating large language models (LLMs) with various tools, which poses a challenge to the tool-use capabilities of LLMs. However, there are evident gaps between existing tool evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only inputs, which fail to effectively reveal the agents' real-world problem-solving abilities.
GTA is a benchmark to evaluate the tool-use capability of LLM-based agents in real-world scenarios. It features three main aspects:
- **Real user queries.** The benchmark contains 229 human-written queries with simple real-world objectives but implicit tool use, requiring the LLM to reason about the suitable tools and plan the solution steps.
- **Real deployed tools.** GTA provides an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance.
- **Real multimodal inputs.** Each query is accompanied by authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, which serve as the query contexts and align closely with real-world scenarios.
<div align="center">
<img src="figs/dataset.jpg" width="800"/>
</div>
## 🔥 News
- **[2024.x.xx]** Paper available on arXiv. ✨✨✨
- **[2024.x.xx]** Release the evaluation and tool deployment code of GTA. 🔥🔥🔥
- **[2024.x.xx]** Release the GTA dataset on Hugging Face. 🎉🎉🎉
## 📚 Dataset Statistics
GTA comprises a total of 229 questions. The basic dataset statistics are presented below. The number of tools involved in each question varies from 1 to 4, and the number of steps required to resolve a question ranges from 2 to 8.
To evaluate on GTA, we prepare the model, tools, and evaluation process with [LMDeploy](https://github.com/InternLM/lmdeploy), [AgentLego](https://github.com/InternLM/agentlego), and [OpenCompass](https://github.com/open-compass/opencompass), respectively. These three parts require three separate conda environments, as sketched below.
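
The commands below are a minimal sketch of this three-environment setup, assuming each toolkit can be installed from PyPI. The environment names and Python version are placeholders; consult each project's installation guide for the authoritative steps.

```bash
# Sketch only: environment names and Python version are assumptions.

# 1) Model serving with LMDeploy
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy

# 2) Tool deployment with AgentLego
conda create -n agentlego python=3.10 -y
conda activate agentlego
pip install agentlego

# 3) Evaluation with OpenCompass
conda create -n opencompass python=3.10 -y
conda activate opencompass
pip install opencompass
```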