
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks

Akashah Shabbir*, Muhammad Akhtar Munir*, Akshay Dudhane*, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan and Salman Khan

Mohamed bin Zayed University of Artificial Intelligence, IBM Research, Linköping University, Australian National University

Website | Paper | HuggingFace

*Equal Contribution


📢 Latest Updates

  • May-29-2025: 📂 The ThinkGeo benchmark is released on HuggingFace: MBZUAI/ThinkGeo.
  • May-29-2025: 📜 The technical report of ThinkGeo is released on arXiv (2505.23752).

ThinkGeo Overview

ThinkGeo is a specialized benchmark designed to evaluate how language model agents handle complex remote sensing tasks through structured tool use and step-by-step reasoning. It features human-curated queries grounded in satellite and aerial imagery across diverse real-world domains such as disaster response, urban planning, and environmental monitoring. Using a ReAct-style interaction loop, ThinkGeo tests both open- and closed-source LLMs on over 400 multi-step agentic tasks. The benchmark measures not only final answer correctness but also the accuracy and consistency of tool usage throughout the process. By focusing on spatially grounded, domain-specific challenges, ThinkGeo fills a critical gap left by general-purpose evaluation frameworks.

(Figure: ThinkGeo benchmark statistics)

Key Features

  • A dataset of 436 remote sensing tasks paired with medium- to high-resolution Earth observation imagery, spanning domains such as urban planning, disaster response, aviation, and environmental monitoring.

  • A set of 14 executable tools simulates real-world RS workflows, including modules for perception, computation, logic, and visual annotation.

  • Two evaluation modes, step-by-step and end-to-end, with detailed metrics to assess instruction adherence, argument structure, reasoning steps, and final accuracy.

  • Benchmarking advanced LLMs (GPT-4o, Claude-3, Qwen-2.5, LLaMA-3) reveals ongoing challenges in multimodal reasoning and tool integration.

Dataset Examples

The following figure presents a set of representative samples from the ThinkGeo benchmark, a comprehensive evaluation framework for geospatial tasks. Each example showcases a complete interaction flow: a user query grounded in remote sensing (RS) imagery, followed by a ReAct-based execution chain, an approach that interleaves reasoning and action through tool calls and logical steps. These chains dynamically select and invoke different tools depending on the demands of the specific query.

(Figure: representative ThinkGeo samples with ReAct-style execution chains)

The data samples span a wide range of application domains, underscoring the benchmark's diversity. These domains include transportation analysis, urban planning, disaster assessment and change analysis, recreational infrastructure, and environmental monitoring, highlighting multi-tool reasoning and spatial task complexity.

Results

Evaluation results across models on the ThinkGeo benchmark are summarized in the table. The left side presents step-by-step execution metrics, while the right side reports end-to-end performance. Metrics include tool-type accuracy, categorized by Perception (P), Operation (O), and Logic (L), as well as final answer accuracy (Ans.) and answer accuracy with image grounding (Ans_I).

(Table: evaluation results across models on ThinkGeo)

🚀 Evaluating ThinkGeo

Prepare ThinkGeo Dataset

  1. Clone this repo.
git clone https://github.com/mbzuai-oryx/ThinkGeo.git
cd ThinkGeo
  2. Download the dataset from Hugging Face: MBZUAI/ThinkGeo (an example download command is shown after the directory structure below).
mkdir ./opencompass/data

Put it under the folder ./opencompass/data/. The file structure should be:

ThinkGeo/
├── agentlego
├── opencompass
│   ├── data
│   │   ├── ThinkGeo_dataset
│   ├── ...
├── ...
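
A minimal sketch of the download step using the huggingface_hub CLI; the dataset repository id MBZUAI/ThinkGeo comes from the release note above, while the target directory name is an assumption matching the structure shown:

huggingface-cli download --repo-type dataset MBZUAI/ThinkGeo --local-dir ./opencompass/data/ThinkGeo_dataset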

Prepare Your Model

  1. Download the model weights.
pip install -U huggingface_hub
# huggingface-cli download --resume-download hugging/face/repo/name --local-dir your/local/path --local-dir-use-symlinks False
huggingface-cli download --resume-download Qwen/Qwen1.5-7B-Chat --local-dir ~/models/qwen1.5-7b-chat --local-dir-use-symlinks False
  2. Install LMDeploy.
conda create -n lmdeploy python=3.10
conda activate lmdeploy

For CUDA 12:

pip install lmdeploy
  3. Launch a model service.
# lmdeploy serve api_server path/to/your/model --server-port [port_number] --model-name [your_model_name]
lmdeploy serve api_server ~/models/qwen1.5-7b-chat --server-port 12580 --model-name qwen1.5-7b-chat
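
Optionally, the OpenAI-compatible endpoint exposed by lmdeploy can be probed with a quick request before wiring it into the config; the localhost address and port here mirror the example above and should be adjusted to your setup:

curl http://127.0.0.1:12580/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen1.5-7b-chat", "messages": [{"role": "user", "content": "Hello"}]}'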

Deploy Tools

  1. Install AgentLego.
conda create -n agentlego python=3.11.9
conda activate agentlego
cd agentlego
pip install -r requirements_all.txt
pip install agentlego
pip install -e .
mim install mmengine
mim install mmcv==2.1.0

Open ~/anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py and change _supports_sdpa = False to _supports_sdpa = True on line 1279.
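
If you prefer not to edit the file by hand, a one-line substitution along these lines achieves the same change; the environment path and line number are taken from the instruction above and may differ on your machine:

sed -i '1279s/_supports_sdpa = False/_supports_sdpa = True/' ~/anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py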

  2. Deploy tools for the ThinkGeo benchmark.

To use the GoogleSearch tool, first obtain a Serper API key from https://serper.dev, then export it as an environment variable.

export SERPER_API_KEY='your_serper_key_for_google_search_tool'

Start the tool server.

agentlego-server start --port 16181 --extra ./benchmark.py  `cat benchmark_toollist.txt` --host 0.0.0.0
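
Before moving on to evaluation, a simple port check can confirm the tool server is listening; this assumes netcat is available and the port matches the command above:

nc -z 127.0.0.1 16181 && echo "tool server is up"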

Start Evaluation

  1. Install OpenCompass.
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
cd agentlego
pip install -e .
cd ../opencompass
pip install -e .
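
The evaluation expects specific versions of huggingface_hub and transformers (see the note just below); one way to pin them explicitly after the editable installs is:

pip install "huggingface_hub==0.25.2" "transformers==4.40.1"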

Note: the evaluation scripts expect huggingface_hub==0.25.2 (any version below 0.26.0) and transformers==4.40.1.

  2. Modify the config file at configs/eval_ThinkGeo_bench.py as below.

The IP address and port in openai_api_base are those of your model service (the port you specified when launching lmdeploy).

The IP address and port in tool_server are those of your tool service (the port you specified when starting agentlego-server).

models = [
  dict(
        abbr='qwen1.5-7b-chat',
        type=LagentAgent,
        agent_type=ReAct,
        max_turn=10,
        llm=dict(
            type=OpenAI,
            path='qwen1.5-7b-chat',
            key='EMPTY',
            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
            query_per_second=1,
            max_seq_len=4096,
            stop='<|im_end|>',
        ),
        tool_server='http://10.140.0.138:16181',
        tool_meta='data/ThinkGeo_dataset/toolmeta.json',
        batch_size=8,
    ),
]

To infer and evaluate in step-by-step mode, comment out tool_server and enable tool_meta in configs/eval_ThinkGeo_bench.py, and set the infer and eval modes to every_with_gt in configs/datasets/ThinkGeo_bench.py:

models = [
  dict(
        abbr='qwen1.5-7b-chat',
        type=LagentAgent,
        agent_type=ReAct,
        max_turn=10,
        llm=dict(
            type=OpenAI,
            path='qwen1.5-7b-chat',
            key='EMPTY',
            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
            query_per_second=1,
            max_seq_len=4096,
            stop='<|im_end|>',
        ),
        # tool_server='http://10.140.0.138:16181',
        tool_meta='data/ThinkGeo_dataset/toolmeta.json',
        batch_size=8,
    ),
]
ThinkGeo_bench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template="""{questions}""",
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=AgentInferencer, infer_mode='every_with_gt'),
)
ThinkGeo_bench_eval_cfg = dict(evaluator=dict(type=ThinkGeoBenchEvaluator, mode='every_with_gt'))

To infer and evaluate in end-to-end mode, comment out tool_meta and enable tool_server in configs/eval_ThinkGeo_bench.py, and set the infer and eval modes to every in configs/datasets/ThinkGeo_bench.py:

models = [
  dict(
        abbr='qwen1.5-7b-chat',
        type=LagentAgent,
        agent_type=ReAct,
        max_turn=10,
        llm=dict(
            type=OpenAI,
            path='qwen1.5-7b-chat',
            key='EMPTY',
            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
            query_per_second=1,
            max_seq_len=4096,
            stop='<|im_end|>',
        ),
        tool_server='http://10.140.0.138:16181',
        # tool_meta='data/ThinkGeo_dataset/toolmeta.json',
        batch_size=8,
    ),
]
ThinkGeo_bench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template="""{questions}""",
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=AgentInferencer, infer_mode='every'),
)
ThinkGeo_bench_eval_cfg = dict(evaluator=dict(type=ThinkGeoBenchEvaluator, mode='every'))
  3. Infer and evaluate with OpenCompass.
# infer only
python run.py configs/eval_ThinkGeo_bench.py --max-num-workers 32 --debug --mode infer
# evaluate only
# srun -p llmit -q auto python run.py configs/eval_ThinkGeo_bench.py --max-num-workers 32 --debug --reuse [time_stamp_of_prediction_file] --mode eval
srun -p llmit -q auto python run.py configs/eval_ThinkGeo_bench.py --max-num-workers 32 --debug --reuse 20250616_112233 --mode eval
# infer and evaluate
python run.py configs/eval_ThinkGeo_bench.py -p llmit -q auto --max-num-workers 32 --debug

📜 Citation

@misc{shabbir2025thinkgeoevaluatingtoolaugmentedagents,
      title={ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks}, 
      author={Akashah Shabbir and Muhammad Akhtar Munir and Akshay Dudhane and Muhammad Umer Sheikh and Muhammad Haris Khan and Paolo Fraccaro and Juan Bernabe Moreno and Fahad Shahbaz Khan and Salman Khan},
      year={2025},
      eprint={2505.23752},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.23752}, 
}
