Akashah Shabbir*, Muhammad Akhtar Munir*, Akshay Dudhane*, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan and Salman Khan
Mohamed bin Zayed University of Artificial Intelligence, IBM Research, Linköping University, Australian National University
*Equal Contribution
- May-29-2025: The ThinkGeo benchmark is released on Hugging Face: MBZUAI/ThinkGeo.
- May-29-2025: The technical report of ThinkGeo is released on arXiv: https://arxiv.org/abs/2505.23752.
ThinkGeo is a specialized benchmark designed to evaluate how language model agents handle complex remote sensing tasks through structured tool use and step-by-step reasoning. It features human-curated queries grounded in satellite and aerial imagery across diverse real-world domains such as disaster response, urban planning, and environmental monitoring. Using a ReAct-style interaction loop, ThinkGeo tests both open and closed-source LLMs on over 400 multi-step agentic tasks. The benchmark measures not only final answer correctness but also the accuracy and consistency of tool usage throughout the process. By focusing on spatially grounded, domain-specific challenges, ThinkGeo fills a critical gap left by general-purpose evaluation frameworks.
- A dataset comprising 436 remote sensing tasks, linked with medium- to high-resolution earth observation imagery across domains such as urban planning, disaster response, aviation, and environmental monitoring.
- A set of 14 executable tools simulating real-world RS workflows, with modules for perception, computation, logic, and visual annotation.
- Two evaluation modes (step-by-step and end-to-end) with detailed metrics to assess instruction adherence, argument structure, reasoning steps, and final accuracy.
- Benchmarking of advanced LLMs (GPT-4o, Claude-3, Qwen-2.5, LLaMA-3), revealing ongoing challenges in multimodal reasoning and tool integration.
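To make the agentic setup concrete, the sketch below shows the general shape of a ReAct-style execution loop as ThinkGeo exercises it: the agent alternates between a reasoning step and a tool call until it emits a final answer. This is only an illustrative sketch; the call_llm and run_tool helpers and the step schema are hypothetical placeholders, not the actual ThinkGeo or AgentLego APIs.

# Illustrative ReAct-style loop (placeholder helpers, not the ThinkGeo implementation).
def react_episode(query, image, tool_descriptions, call_llm, run_tool, max_turns=10):
    """Alternate LLM 'thought + action' steps with tool observations."""
    transcript = [f"Question: {query} (image: {image})"]
    for _ in range(max_turns):
        # The LLM reads the transcript and proposes either a tool call or a final answer.
        step = call_llm(transcript, tool_descriptions)
        transcript.append(f"Thought: {step['thought']}")
        if step["type"] == "final_answer":
            return step["answer"], transcript
        # Execute the chosen tool (e.g. a detector, an area calculator, or a drawing tool).
        observation = run_tool(step["tool"], **step["arguments"])
        transcript.append(f"Observation: {observation}")
    return None, transcript  # ran out of turns without a final answer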
The following figure presents a set of representative samples from the ThinkGeo benchmark, a comprehensive evaluation framework for geospatial tasks. Each example showcases a complete interaction flow, beginning with a user query grounded in remote sensing (RS) imagery and followed by a ReAct-based execution chain, an approach that interleaves reasoning and action through a combination of tool calls and logical steps. These execution chains involve the dynamic selection and use of various tools, depending on the demands of the specific query.
The samples span a wide range of application domains, underscoring the benchmark's diversity: transportation analysis, urban planning, disaster assessment and change analysis, recreational infrastructure, and environmental monitoring. Together they highlight multi-tool reasoning and the complexity of spatial tasks.
Evaluation results across models on the ThinkGeo benchmark are summarized in the table. The left side presents step-by-step execution metrics, while the right side reports end-to-end performance. Metrics include tool-type accuracy, categorized by Perception (P), Operation (O), and Logic (L), as well as final answer accuracy (Ans.) and answer accuracy with image grounding (Ans_I).
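As a rough illustration of how step-wise tool accuracy could be scored, the snippet below compares a predicted tool chain against a ground-truth chain and credits a step only when both the tool name and its arguments match. It is a simplified sketch with a hypothetical step schema, not ThinkGeo's actual evaluator.

# Simplified sketch of step-wise tool accuracy; the step schema here is hypothetical.
def tool_chain_accuracy(predicted_steps, gold_steps):
    """Fraction of ground-truth steps whose tool name and arguments are reproduced.
    Each step is assumed to look like {"tool": "ObjectDetection", "arguments": {...}}."""
    if not gold_steps:
        return 0.0
    correct = sum(
        1
        for pred, gold in zip(predicted_steps, gold_steps)
        if pred["tool"] == gold["tool"] and pred["arguments"] == gold["arguments"]
    )
    return correct / len(gold_steps)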
- Clone this repo.
git clone https://github.com/mbzuai-oryx/ThinkGeo.git
cd ThinkGeo
- Download the dataset from Hugging Face: ThinkGeo.
mkdir ./opencompass/data
Put it under the folder ./opencompass/data/ (a programmatic download option is sketched below the file tree). The file structure should be:
ThinkGeo/
├── agentlego
├── opencompass
│   ├── data
│   │   └── ThinkGeo_dataset
│   └── ...
└── ...
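Alternatively, instead of downloading the dataset manually, a snippet like the following should fetch it with the huggingface_hub Python client; the repository id MBZUAI/ThinkGeo is taken from the release note above, and the local path assumes the layout shown in the tree (adjust both if yours differ).

from huggingface_hub import snapshot_download

# Download the ThinkGeo dataset repository into the folder OpenCompass expects.
snapshot_download(
    repo_id="MBZUAI/ThinkGeo",
    repo_type="dataset",
    local_dir="./opencompass/data/ThinkGeo_dataset",
)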
- Download the model weights.
pip install -U huggingface_hub
# huggingface-cli download --resume-download hugging/face/repo/name --local-dir your/local/path --local-dir-use-symlinks False
huggingface-cli download --resume-download Qwen/Qwen1.5-7B-Chat --local-dir ~/models/qwen1.5-7b-chat --local-dir-use-symlinks False
- Install LMDeploy.
conda create -n lmdeploy python=3.10
conda activate lmdeploy
For CUDA 12:
pip install lmdeploy
- Launch a model service.
# lmdeploy serve api_server path/to/your/model --server-port [port_number] --model-name [your_model_name]
lmdeploy serve api_server ~/models/qwen1.5-7b-chat --server-port 12580 --model-name qwen1.5-7b-chat
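Optionally, you can sanity-check the service before moving on. LMDeploy's api_server exposes an OpenAI-compatible endpoint (the same /v1/chat/completions path used in the OpenCompass config below), so a minimal request like the one here should return a chat completion; the host and port assume the service runs locally on 12580 as launched above.

import requests

# Minimal request against the OpenAI-compatible endpoint served by lmdeploy.
resp = requests.post(
    "http://127.0.0.1:12580/v1/chat/completions",
    json={
        "model": "qwen1.5-7b-chat",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
print(resp.json())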
- Install AgentLego.
conda create -n agentlego python=3.11.9
conda activate agentlego
cd agentlego
pip install -r requirements_all.txt
pip install agentlego
pip install -e .
mim install mmengine
mim install mmcv==2.1.0
Open ~/anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py and change _supports_sdpa = False to _supports_sdpa = True on line 1279.
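If you prefer not to edit the file by hand, a small script along these lines applies the same change; the site-packages path assumes the Anaconda environment location mentioned above, so adjust it to your installation, and verify that the replaced line matches line 1279.

from pathlib import Path

# Flip the first occurrence of `_supports_sdpa = False` to True in the installed transformers package.
path = Path.home() / "anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py"
text = path.read_text()
path.write_text(text.replace("_supports_sdpa = False", "_supports_sdpa = True", 1))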
- Deploy tools for the ThinkGeo benchmark.
To use the GoogleSearch tool, first obtain a Serper API key from https://serper.dev and export it as an environment variable:
export SERPER_API_KEY='your_serper_key_for_google_search_tool'
Start the tool server.
agentlego-server start --port 16181 --extra ./benchmark.py `cat benchmark_toollist.txt` --host 0.0.0.0
- Install OpenCompass.
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
cd agentlego
pip install -e .
cd ../opencompass
pip install -e .
Pin the following package versions:
huggingface_hub==0.25.2  # must be <0.26.0
transformers==4.40.1
- Modify the config file configs/eval_ThinkGeo_bench.py as below.
The IP address and port in openai_api_base should point to your model service, using the port you specified when launching LMDeploy.
The IP address and port in tool_server should point to your tool service, using the port you specified when starting the AgentLego tool server.
models = [
dict(
abbr='qwen1.5-7b-chat',
type=LagentAgent,
agent_type=ReAct,
max_turn=10,
llm=dict(
type=OpenAI,
path='qwen1.5-7b-chat',
key='EMPTY',
openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
query_per_second=1,
max_seq_len=4096,
stop='<|im_end|>',
),
tool_server='http://10.140.0.138:16181',
tool_meta='data/ThinkGeo_dataset/toolmeta.json',
batch_size=8,
),
]
To infer and evaluate in step-by-step mode, comment out tool_server and enable tool_meta in configs/eval_ThinkGeo_bench.py, and set the infer mode and eval mode to every_with_gt in configs/datasets/ThinkGeo_bench.py:
models = [
dict(
abbr='qwen1.5-7b-chat',
type=LagentAgent,
agent_type=ReAct,
max_turn=10,
llm=dict(
type=OpenAI,
path='qwen1.5-7b-chat',
key='EMPTY',
openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
query_per_second=1,
max_seq_len=4096,
stop='<|im_end|>',
),
# tool_server='http://10.140.0.138:16181',
tool_meta='data/ThinkGeo_dataset/toolmeta.json',
batch_size=8,
),
]
ThinkGeo_bench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="""{questions}""",
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=AgentInferencer, infer_mode='every_with_gt'),
)
ThinkGeo_bench_eval_cfg = dict(evaluator=dict(type=ThinkGeoBenchEvaluator, mode='every_with_gt'))
To infer and evaluate in end-to-end mode, comment out tool_meta and enable tool_server in configs/eval_ThinkGeo_bench.py, and set the infer mode and eval mode to every in configs/datasets/ThinkGeo_bench.py:
models = [
dict(
abbr='qwen1.5-7b-chat',
type=LagentAgent,
agent_type=ReAct,
max_turn=10,
llm=dict(
type=OpenAI,
path='qwen1.5-7b-chat',
key='EMPTY',
openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
query_per_second=1,
max_seq_len=4096,
stop='<|im_end|>',
),
tool_server='http://10.140.0.138:16181',
# tool_meta='data/ThinkGeo_dataset/toolmeta.json',
batch_size=8,
),
]
ThinkGeo_bench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="""{questions}""",
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=AgentInferencer, infer_mode='every'),
)
ThinkGeo_bench_eval_cfg = dict(evaluator=dict(type=ThinkGeoBenchEvaluator, mode='every'))
- Infer and evaluate with OpenCompass.
# infer only
python run.py configs/eval_ThinkGeo_bench.py --max-num-workers 32 --debug --mode infer
# evaluate only
# srun -p llmit -q auto python run.py configs/eval_ThinkGeo_bench.py --max-num-workers 32 --debug --reuse [time_stamp_of_prediction_file] --mode eval
srun -p llmit -q auto python run.py configs/eval_ThinkGeo_bench.py --max-num-workers 32 --debug --reuse 20250616_112233 --mode eval
# infer and evaluate
python run.py configs/eval_ThinkGeo_bench.py -p llmit -q auto --max-num-workers 32 --debug
@misc{shabbir2025thinkgeoevaluatingtoolaugmentedagents,
title={ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks},
author={Akashah Shabbir and Muhammad Akhtar Munir and Akshay Dudhane and Muhammad Umer Sheikh and Muhammad Haris Khan and Paolo Fraccaro and Juan Bernabe Moreno and Fahad Shahbaz Khan and Salman Khan},
year={2025},
eprint={2505.23752},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.23752},
}