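GroundingDINO Fine-Tuning Guide for Custom Data

This guide explains how to fine-tune GroundingDINO on your own data so that the model keeps its open-vocabulary text generalization while localizing your target objects more accurately and more consistently.

Contents:

  • Basic Requirements
  • Full Workflow
  • FAQ
  • References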


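Basic Requirements

Environment

  • Python: 3.11.14
  • Operating system: Windows or Linux
  • CUDA: 11.8 or later recommended
  • Compiler (Windows): Microsoft Visual Studio 2017-2022 with Visual C++ Build Tools >= 14.0
  • Compatibility: CUDA 11.8 requires a Visual Studio version earlier than 2022; CUDA 12.8 works with Visual Studio 2022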

Compatibility

Tested working environments

  • RTX 4060 + CUDA 11.8 + Python 3.11.14 + PyTorch 2.4.1+cu118
  • RTX 5090 + CUDA 12.8 + Python 3.12.0 + PyTorch 2.7.0+cu128

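Installing Dependencies

Create a conda environment (conda already installed)

conda create -n <env_name> python=3.11.14  # create a new conda environment with Python 3.11.14; replace <env_name>
conda activate <env_name>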

Step 1: Install PyTorch

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
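After the install, it can help to confirm which PyTorch build the interpreter actually sees. The sketch below is a hypothetical helper (describe_environment is not part of either repository); it only reads version attributes that PyTorch exposes and returns None values when PyTorch is not installed.

```python
import platform


def describe_environment():
    """Collect the Python, PyTorch and CUDA versions visible to this interpreter."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "gpu_available": False,
    }
    try:
        import torch  # only importable once Step 1 has been completed
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    info["gpu_available"] = torch.cuda.is_available()
    return info
```

Calling print(describe_environment()) in the first tested environment above should report a torch version of 2.4.1+cu118; a cuda value of None suggests that a CPU-only wheel was installed.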

Step 2: Install GroundingDINO

Clone the Open-GroundingDino and GroundingDINO repositories:

# Open-GroundingDino
git clone https://github.com/longzw1997/Open-GroundingDino
# GroundingDINO
git clone https://github.com/IDEA-Research/GroundingDINO

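Place the GroundingDINO source tree inside the Open-GroundingDino directory, then run:

cd GroundingDINO
# Edit requirements.txt first and comment out the torch and torchvision lines,
# so that pip does not replace the PyTorch build installed in Step 1
pip install -e .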

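⚠️ Notes for Windows users

  • Use the Visual Studio 2019 C++ toolchain (building with Visual Studio 2022 raised errors; per the compatibility note above, 2022 is only expected to work with CUDA 12.8).
  • The absolute path to GroundingDINO must not contain Chinese characters, otherwise compilation fails.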
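The path restriction is easy to check before starting a long build. The sketch below is a hypothetical helper; it flags every non-ASCII path component, which is a broader test than "Chinese characters" and is used here as a conservative assumption.

```python
from pathlib import Path


def non_ascii_parts(path):
    """Return the components of a path that contain non-ASCII characters."""
    return [part for part in Path(path).parts if not part.isascii()]
```

For example, non_ascii_parts(Path.cwd()) run from the GroundingDINO directory should return an empty list; any returned component is a folder worth renaming before compiling.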

Step 3: Install setuptools

pip install setuptools==69.5.1

Step 4: Build the MultiScaleDeformableAttention module

cd Open-GroundingDino/models/GroundingDINO/ops
python setup.py build install

Step 5: Verify the installation

Still inside the ops directory from Step 4, run:

python test.py

A successful run prints lines like the following (the exact error values may differ):

* True check_forward_equal_with_pytorch_double: max_abs_err 8.67e-19 max_rel_err 2.35e-16
* True check_forward_equal_with_pytorch_float: max_abs_err 4.66e-10 max_rel_err 1.13e-07
* True check_gradient_numerical(D=30)
* True check_gradient_numerical(D=32)
* True check_gradient_numerical(D=64)
* True check_gradient_numerical(D=71)
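When the build runs unattended, the captured output can be scanned automatically. This is a minimal sketch under one assumption: a failing check prints the same line with False in place of True. failed_checks is a hypothetical helper name.

```python
def failed_checks(output):
    """Return the check lines in captured test.py output that do not report True."""
    failed = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("*"):
            continue  # ignore anything that is not a check result line
        if not line.startswith("* True"):
            failed.append(line)
    return failed
```

An empty list means every reported check passed; note that it says nothing about checks that never ran, so the number of "* True" lines is still worth comparing against the six shown above.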

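Full Workflow

Data Preparation

Step 1: Prepare the training data (ODVG format)

This guide assumes the images were annotated with labelImg, which produces Pascal VOC XML label files. Use voc2jsonl.py to convert the XML annotations to JSONL and save the result as the training file train.jsonl.

voc2jsonl.py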

import os
from pathlib import Path
import xml.etree.ElementTree as ET

import jsonlines  # pip install jsonlines


def main(xml_dir, category_dict):
    """Convert every VOC XML file in xml_dir into a list of ODVG records."""
    metas = []
    for xml_name in sorted(os.listdir(xml_dir)):
        if not xml_name.lower().endswith(".xml"):
            continue  # skip anything in the directory that is not an XML file
        xml_path = os.path.join(xml_dir, xml_name)
        root = ET.parse(xml_path).getroot()

        # Standard case: read the name from <filename>
        # file_name = root.find('filename').text
        # Non-standard case: take the base name of <path> instead
        file_name = str(Path(root.find('path').text).name)

        # Strip half-width and full-width parentheses and spaces from the name
        file_name = file_name.translate(str.maketrans("", "", "()\uff08\uff09 "))
        height = root.find('size').find('height').text
        width = root.find('size').find('width').text

        instances = []
        for obj in root.findall('object'):
            cls_name = obj.find('name').text
            if cls_name not in category_dict:
                print(f"Warning: '{cls_name}' is not in the category dictionary.")
                continue
            label = category_dict[cls_name]

            xmlbox = obj.find('bndbox')
            xmin = float(xmlbox.find('xmin').text)
            xmax = float(xmlbox.find('xmax').text)
            ymin = float(xmlbox.find('ymin').text)
            ymax = float(xmlbox.find('ymax').text)
            # ODVG boxes are [xmin, ymin, xmax, ymax] in pixels
            bbox = [xmin, ymin, xmax, ymax]
            instances.append({"bbox": bbox, "label": label, "category": cls_name})

        metas.append(
            {
                "filename": file_name,
                "height": height,
                "width": width,
                "detection": {
                    "instances": instances
                }
            }
        )

    return metas


if __name__ == "__main__":
    xml_dir = "xml dir path"  # change to the directory holding your XML annotations
    jsonl_save_path = "train.jsonl"  # where to save the JSONL file
    category_dict = {
        "cat": 0
    }  # class name -> label index
    metas = main(xml_dir, category_dict)
    with jsonlines.open(jsonl_save_path, mode="w") as writer:
        writer.write_all(metas)
    print("finish!")

Example train.jsonl in ODVG format (annotations for two images, one JSON object per line):

{"filename": "IMG_20230306_110523.jpg", "height": "4096", "width": "3072", "detection": {"instances": [{"bbox": [1059.0, 1498.0, 1289.0, 1720.0], "label": 0, "category": "cat"}, ...]}}
{"filename": "IMG_20230213_153518.jpg", "height": "4624", "width": "3472", "detection": {"instances": [{"bbox": [879.0, 2701.0, 1302.0, 3131.0], "label": 0, "category": "cat"}, ...]}}
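Before training, each line of train.jsonl can be sanity-checked against the layout shown above. The following is a minimal sketch, not part of Open-GroundingDino: check_odvg_line is a hypothetical helper, and the rule that a box must lie inside the image with xmin < xmax and ymin < ymax is an assumption about what counts as a valid annotation.

```python
import json


def check_odvg_line(line):
    """Return a list of problems found in one line of an ODVG detection JSONL file."""
    record = json.loads(line)
    problems = [f"missing key: {key}"
                for key in ("filename", "height", "width", "detection")
                if key not in record]
    if problems:
        return problems
    # Width and height may be stored as strings or numbers, so normalise them
    width, height = float(record["width"]), float(record["height"])
    for i, inst in enumerate(record["detection"].get("instances", [])):
        if "label" not in inst or "category" not in inst:
            problems.append(f"instance {i}: missing label or category")
        xmin, ymin, xmax, ymax = inst["bbox"]  # pixel coordinates, xyxy order
        if not (0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height):
            problems.append(f"instance {i}: bbox {inst['bbox']} is empty or outside the image")
    return problems
```

Running it over every line of train.jsonl and printing the non-empty results is usually enough to catch swapped coordinates or boxes that were annotated on a resized copy of the image.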

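Step 2: Prepare the validation data (COCO format)

As in Step 1, this assumes labelImg XML label files. Use voc2coco.py to convert the validation XML annotations to a COCO-format JSON file (for example, val.json).

voc2coco.py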

"""
Purpose:
--------------------------------------------------
Convert Pascal VOC XML annotation files into a
COCO Detection (instances) JSON file.
"""

import os
import json
import argparse
import xml.etree.ElementTree as ET
from pathlib import Path
from datetime import datetime

class VOC2COCOConverter:
    def __init__(self, categories):
        self.keep_categories = set(categories)

        self.coco = {
            "info": {
                "description": "VOC to COCO Dataset",
                "version": "1.0",
                "year": datetime.now().year,
                "date_created": datetime.now().strftime("%Y-%m-%d")
            },
            "type": "instances",
            "images": [],
            "annotations": [],
            "categories": []
        }

        self.category_name_to_id = {}
        self.image_name_set = set()
        self.next_category_id = 0
        self.next_image_id = 0
        self.next_annotation_id = 0

        # Register categories up front, in the order given, so that category
        # ids (0, 1, 2, ...) do not depend on which XML file is parsed first
        for name in categories:
            self._add_category(name)

    # ---------- Category ----------
    def _add_category(self, name):
        if name in self.category_name_to_id:
            return self.category_name_to_id[name]

        cid = self.next_category_id
        self.next_category_id += 1

        self.coco["categories"].append({
            "id": cid,
            "name": name,
            "supercategory": "none"
        })
        self.category_name_to_id[name] = cid
        return cid

    # ---------- Image ----------
    def _add_image(self, file_name, width, height):
        if file_name in self.image_name_set:
            raise ValueError(f"Duplicate image name: {file_name}")

        self.next_image_id += 1  # image ids start at 1
        self.image_name_set.add(file_name)

        self.coco["images"].append({
            "id": self.next_image_id,
            "file_name": file_name,
            "width": width,
            "height": height,
            "date_captured": datetime.now().isoformat()
        })
        return self.next_image_id

    # ---------- Annotation ----------
    def _add_annotation(self, image_id, category_id, bbox):
        x, y, w, h = bbox
        # Rectangle polygon that traces the box outline
        segmentation = [[
            x, y,
            x, y + h,
            x + w, y + h,
            x + w, y
        ]]

        self.next_annotation_id += 1  # annotation ids start at 1
        self.coco["annotations"].append({
            "id": self.next_annotation_id,
            "image_id": image_id,
            "category_id": category_id,
            "bbox": bbox,
            "area": w * h,
            "iscrowd": 0,
            "segmentation": segmentation
        })

    # ---------- XML ----------
    def parse_xml(self, xml_path):
        tree = ET.parse(xml_path)
        root = tree.getroot()

        # Prefer <path>, fall back to <filename>. Test against None explicitly:
        # an Element without children is falsy, so "find(a) or find(b)" would
        # silently skip a valid <path> node.
        file_node = None
        for tag in ("path", "filename"):
            node = root.find(tag)
            if node is not None and node.text:
                file_node = node
                break
        if file_node is None:
            raise ValueError(f"No filename/path in {xml_path}")

        file_name = Path(file_node.text).name
        # Strip half-width and full-width parentheses and spaces from the name
        file_name = file_name.translate(str.maketrans("", "", "()\uff08\uff09 "))

        size = root.find("size")
        width = int(size.findtext("width"))
        height = int(size.findtext("height"))

        image_id = self._add_image(file_name, width, height)

        for obj in root.findall("object"):
            name = obj.findtext("name")
            if name not in self.keep_categories:
                continue

            category_id = self._add_category(name)

            box = obj.find("bndbox")
            xmin = int(float(box.findtext("xmin")))
            ymin = int(float(box.findtext("ymin")))
            xmax = int(float(box.findtext("xmax")))
            ymax = int(float(box.findtext("ymax")))

            if xmax <= xmin or ymax <= ymin:
                continue  # skip degenerate boxes

            # COCO boxes are [x, y, width, height]
            bbox = [xmin, ymin, xmax - xmin, ymax - ymin]
            self._add_annotation(image_id, category_id, bbox)

    def save(self, json_path):
        # dirname is empty for a bare file name such as "val.json";
        # os.makedirs("") would raise, so only create a directory when there is one
        out_dir = os.path.dirname(json_path)
        if out_dir:
            os.makedirs(out_dir, exist_ok=True)
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(self.coco, f, indent=2, ensure_ascii=False)

        print("✅ COCO json saved:", json_path)
        print("Categories:", len(self.coco["categories"]))
        print("Images:", len(self.coco["images"]))
        print("Annotations:", len(self.coco["annotations"]))


# ==================================================
# CLI
# ==================================================
def load_categories_from_file(path):
    """Read one category name per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def main():
    parser = argparse.ArgumentParser(description="VOC XML → COCO JSON")

    parser.add_argument("--voc-dir", type=str, required=True, help="directory containing the VOC XML files")
    parser.add_argument("--save-path", type=str, required=True, help="output path of the COCO JSON file")

    # Category arguments (provide one of the two)
    parser.add_argument(
        "--categories",
        nargs="+",
        help="category list, e.g. --categories corn soybean leaf"
    )
    parser.add_argument(
        "--categories-file",
        type=str,
        help="category file with one category per line"
    )

    args = parser.parse_args()

    # ---------- Resolve categories ----------
    if args.categories:
        categories = args.categories
    elif args.categories_file:
        categories = load_categories_from_file(args.categories_file)
    else:
        parser.error("either --categories or --categories-file is required")

    converter = VOC2COCOConverter(categories)

    # Sort so that image ids are assigned in a reproducible order
    xml_files = sorted(
        os.path.join(args.voc_dir, f)
        for f in os.listdir(args.voc_dir)
        if f.endswith(".xml")
    )

    print(f"Found {len(xml_files)} XML files")
    print("Categories:", categories)

    for xml in xml_files:
        converter.parse_xml(xml)

    converter.save(args.save_path)


if __name__ == "__main__":
    main()

# Usage
# python voc2coco.py --voc-dir /path/to/voc/xmls --save-path val.json --categories cat dog
# python voc2coco.py --voc-dir /path/to/voc/xmls --save-path val.json --categories-file classes.txt

Example val.json in COCO format (in "categories", replace cat with your own class name):

{
  "images": [
    {"id": 1, "file_name": "IMG_20241216_110229.jpg", "width": 3072, "height": 4096, ...},
    {"id": 2, "file_name": "IMG_20241217_161005.jpg", "width": 3072, "height": 4096, ...},
    ...
  ],
  "type": "instances",
  "annotations": [
    {"image_id": 1, "bbox": [550, 903, 173, 179], "category_id": 0, "id": 1, ...},
    {"image_id": 1, "bbox": [1396, 714, 224, 337], "category_id": 0, "id": 2, ...},
    ...
  ],
  "categories": [
    {"supercategory": "none", "id": 0, "name": "cat"}
  ]
}
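The three top-level lists in the file reference each other by id, so a quick consistency check can catch a broken conversion early. This is a hedged sketch with a hypothetical helper name (check_coco); it assumes only the fields shown in the example above.

```python
def check_coco(coco):
    """Return a list of consistency problems in a COCO detection dictionary."""
    problems = []
    images = {img["id"]: img for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    for ann in coco.get("annotations", []):
        img = images.get(ann["image_id"])
        if img is None:
            problems.append(f"annotation {ann['id']}: unknown image_id {ann['image_id']}")
            continue
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category_id {ann['category_id']}")
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        inside = x >= 0 and y >= 0 and x + w <= img["width"] and y + h <= img["height"]
        if w <= 0 or h <= 0 or not inside:
            problems.append(f"annotation {ann['id']}: bbox {ann['bbox']} does not fit the image")
    return problems
```

A typical use is check_coco(json.load(open("val.json", encoding="utf-8"))), which should return an empty list for a file written by voc2coco.py.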

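Step 3: Create the class label file

Create a label map file (for example, label.json) that records which class name each class ID corresponds to. Replace cat with your own class name, and keep the IDs identical to the label values used in train.jsonl.

label.json format

{
  "0": "cat"
}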
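Because label.json is the inverse of the category_dict used in voc2jsonl.py, it can be derived from that dictionary rather than typed by hand. A minimal sketch, with build_label_map as a hypothetical helper name:

```python
import json


def build_label_map(category_dict):
    """Invert {"cat": 0} into the {"0": "cat"} layout used by label.json."""
    ids = list(category_dict.values())
    if len(set(ids)) != len(ids):
        raise ValueError("each class must have a unique label index")
    ordered = sorted(category_dict.items(), key=lambda item: item[1])
    return {str(index): name for name, index in ordered}


def write_label_map(category_dict, path="label.json"):
    """Write the label map as JSON (keys are strings, as JSON requires)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_label_map(category_dict), f, indent=2, ensure_ascii=False)
```

Generating the file this way keeps the training labels and the label map in sync when classes are added later.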

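Step 4: Configure the dataset paths

Using ./config/datasets_od_example.json as a template, create your own dataset config file, datasets_self.json.

datasets_self.json format

Each entry has four fields: root is the image directory, anno is the annotation file, label_map is the class label file (null for the COCO validation set), and dataset_mode is the data format (odvg for training, coco for validation). JSON does not allow comments, so keep the file itself free of them.

{
  "train": [
    {
      "root": "E://images",
      "anno": "./train.jsonl",
      "label_map": "./label.json",
      "dataset_mode": "odvg"
    }
  ],
  "val": [
    {
      "root": "E://images",
      "anno": "./val.json",
      "label_map": null,
      "dataset_mode": "coco"
    }
  ]
}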
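A wrong path in this file typically only surfaces once training starts, so it can be worth checking up front. The sketch below is a hypothetical helper (missing_dataset_paths) that assumes the train/val layout shown above and resolves relative paths against the current working directory.

```python
import json
import os


def missing_dataset_paths(config_path):
    """Return the root/anno/label_map entries of a dataset config that do not exist on disk."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)  # raises if the file is not valid JSON, e.g. contains comments
    missing = []
    for split in ("train", "val"):
        for entry in config.get(split, []):
            for key in ("root", "anno", "label_map"):
                path = entry.get(key)
                if path is not None and not os.path.exists(path):
                    missing.append(f"{split}.{key}: {path}")
    return missing
```

Run it from the directory where training will be launched, for example missing_dataset_paths("config/datasets_self.json"); an empty list means every referenced path was found.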

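Model Training

Step 1: Edit the training config

Copy ./config/cfg_odvg.py to a custom config file, ./config/cfg_odvg_self.py, and make the changes below.

Change 1: Set the class text prompts

Append this to the end of the file:

# Leave everything above unchanged (copied from cfg_odvg.py)
# Add this as the last line
label_list = ['cat']  # text prompts; replace with your own class names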
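With more than a few classes, label_list can be generated from label.json instead of being maintained separately. This sketch assumes that the order of label_list should follow the integer label ids, which is an assumption about how the training code indexes the list; label_list_from_map and load_label_list are hypothetical helper names.

```python
import json


def label_list_from_map(label_map):
    """Return class names ordered by their integer label id."""
    # Sort by int(key) so that "10" comes after "2", unlike plain string sorting
    return [label_map[key] for key in sorted(label_map, key=int)]


def load_label_list(label_json_path):
    """Read label.json and return the class names in label-id order."""
    with open(label_json_path, "r", encoding="utf-8") as f:
        return label_list_from_map(json.load(f))
```

The printed result of load_label_list("label.json") can then be pasted into cfg_odvg_self.py as the label_list value.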

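Step 2: Download the pretrained models

Required models

  1. BERT pretrained model: download bert-base-uncased (used as the text encoder)
  2. GroundingDINO weights: download one of the pretrained checkpoints, groundingdino_swint_ogc.pth (Swin-T) or groundingdino_swinb_cogcoor.pth (Swin-B)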

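Change 2: Choose the backbone (optional)

To start from the groundingdino_swinb_cogcoor.pth checkpoint, change the backbone setting in cfg_odvg_self.py so that it matches the Swin-B weights:

# Before
backbone = 'swin_T_224_1k'

# After
backbone = 'swin_B_384_22k'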

Step 3: Start training

Edit the path arguments in train_mydata.sh to match your setup.

train_mydata.sh:

python main.py \
    --output_dir ./output/ \
    -c config/cfg_odvg_self.py \
    --datasets config/datasets_self.json \
    --pretrain_model_path ./weights/groundingdino_swinb_cogcoor.pth \
    --options text_encoder_type=bert-base-uncased \
    --amp

Run it:

bash train_mydata.sh

Alternatively, run the command from the script directly in a terminal as a single line:

python main.py --output_dir ./output/ -c config/cfg_odvg_self.py --datasets ./config/datasets_self.json --pretrain_model_path ./weights/groundingdino_swinb_cogcoor.pth --options text_encoder_type=bert-base-uncased --amp
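Where bash is not available, the same command can be assembled in Python and started with subprocess. The sketch below is a hypothetical launcher, not part of Open-GroundingDino; it uses only the flags that appear in train_mydata.sh above.

```python
import sys


def build_train_command(config, datasets, pretrained, output_dir="./output/", use_amp=True):
    """Assemble the main.py command line from train_mydata.sh as an argument list."""
    cmd = [
        sys.executable, "main.py",
        "--output_dir", output_dir,
        "-c", config,
        "--datasets", datasets,
        "--pretrain_model_path", pretrained,
        "--options", "text_encoder_type=bert-base-uncased",
    ]
    if use_amp:
        cmd.append("--amp")  # mixed-precision flag, as in the shell script
    return cmd
```

From the Open-GroundingDino directory, subprocess.run(build_train_command("config/cfg_odvg_self.py", "config/datasets_self.json", "./weights/groundingdino_swinb_cogcoor.pth"), check=True) would start the same run. Passing a list avoids shell quoting differences between Windows and Linux.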

Model Inference

Running inference

Step 1: In the project root, create a file named dino_conf.yaml with the following content:

# =================================================
# GroundingDINO model settings
# Note: infer_dino.py looks these keys up at the top level of the file,
# so they are written without indentation.
# =================================================
model:
# -------------------------
# Model architecture config file
# ⚠️ Must point to your local GroundingDINO config
config_file: ./tools/GroundingDINO_SwinB_cfg.py

# -------------------------
# Model weights path
# ⚠️ Must point to your fine-tuned checkpoint or the official weights
checkpoint_path: ./weights/groundingdino_swinb_cogcoor.pth

# -------------------------
# CPU-only inference
# True: use the CPU only; False: use the GPU
cpu_only: False

# =================================================
# Inference parameters
# =================================================
inference:
# -------------------------
# Box confidence threshold
box_threshold: 0.3 # range [0, 1]; higher values keep fewer, more confident boxes

# -------------------------
# Text token matching threshold
text_threshold: 0.25 # range [0, 1]; higher values require a stronger token match

# -------------------------
# Exact phrase mode: null or [[(start, end), ...]]
# Usually leave this as null
token_spans: null
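The two thresholds act at different stages, which a toy example makes concrete. The sketch below mirrors the filtering logic of infer_dino.py using plain Python lists instead of tensors; it is a simplification (real token sequences also contain special tokens), and filter_predictions is a hypothetical name.

```python
def filter_predictions(token_scores, tokens, box_threshold=0.3, text_threshold=0.25):
    """Apply box_threshold and text_threshold to per-box token scores in [0, 1]."""
    kept = []
    for scores in token_scores:
        best = max(scores)
        if best <= box_threshold:
            continue  # box_threshold: drop boxes whose best token score is too low
        # text_threshold: keep only the tokens this box matches strongly enough
        phrase = " ".join(tok for tok, s in zip(tokens, scores) if s > text_threshold)
        kept.append((phrase, best))
    return kept
```

With tokens ["cat", "dog"] and the default thresholds, a box scoring [0.2, 0.28] is discarded by box_threshold, while a box scoring [0.35, 0.3] survives and is labeled "cat dog" because both tokens exceed text_threshold. Raising text_threshold is therefore the usual way to get shorter, cleaner phrases.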

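Step 2: In the project root, create infer_dino.py and edit the following parameters at the bottom of the script:

  • cfg_path: path to the config file
  • image_dir: directory of images to run inference on
  • save_dir: directory where the results are saved
  • text_prompt: text prompt

infer_dino.py: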

"""
Run inference and visualize the results.
"""

import os
import yaml
import torch
import numpy as np
from PIL import Image, ImageDraw, ImageFont, ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True  # tolerate truncated image files

# GroundingDINO (requires the package installed with "pip install -e .").
# If you import it from a local source folder instead of the installed
# package, adjust the "groundingdino." prefix to match your directory layout.
import groundingdino.datasets.transforms as T
from groundingdino.models import build_model
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import (
    clean_state_dict,
    get_phrases_from_posmap,
)


# =================================================
# Config
# =================================================
def load_config(cfg_path):
    """
    Read the YAML config file.
    """
    with open(cfg_path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


# =================================================
# Image
# =================================================
def load_image(image_path):
    """
    Load and preprocess a single image.
    ⚠️ Usually needs no changes.
    """
    image_pil = Image.open(image_path).convert("RGB")

    transform = T.Compose([
        T.RandomResize([800], max_size=1333),  # lower 800 if GPU memory runs out
        T.ToTensor(),
        T.Normalize(
            [0.485, 0.456, 0.406],
            [0.229, 0.224, 0.225],
        ),
    ])
    image, _ = transform(image_pil, None)
    return image_pil, image


# =================================================
# Model
# =================================================
def build_dino_model(cfg):
    """
    Build the GroundingDINO model and load its weights.
    """
    # Device: controlled by cpu_only in the YAML config
    device = "cpu" if cfg.get("cpu_only", False) else "cuda"

    # Model architecture config (usually needs no changes)
    args = SLConfig.fromfile(cfg["config_file"])
    args.device = device

    model = build_model(args)

    # Model weights: controlled by checkpoint_path in the YAML config
    checkpoint = torch.load(cfg["checkpoint_path"], map_location="cpu")
    model.load_state_dict(
        clean_state_dict(checkpoint["model"]),
        strict=False,
    )

    model.eval()
    model.to(device)

    return model, device


# =================================================
# Inference
# =================================================
@torch.no_grad()
def groundingdino_infer(
    model,
    image,
    text_prompt,
    box_threshold,
    text_threshold,
    device="cuda",
    token_spans=None,
    with_logits=True,
):
    """
    Run inference on a single image.

    token_spans is accepted for interface compatibility but is not used:
    this function only implements threshold-based phrase extraction.
    """
    # Text prompt: lowercase it and make sure it ends with a period
    caption = text_prompt.lower().strip()
    if not caption.endswith("."):
        caption += "."

    image = image.to(device)
    outputs = model(image[None], captions=[caption])

    logits = outputs["pred_logits"].sigmoid()[0]  # (num_queries, num_tokens)
    boxes = outputs["pred_boxes"][0]              # (num_queries, 4), normalized cxcywh

    # Keep queries whose best token score exceeds box_threshold
    logits_filt = logits.cpu()
    boxes_filt = boxes.cpu()
    mask = logits_filt.max(dim=1)[0] > box_threshold

    logits_filt = logits_filt[mask]
    boxes_filt = boxes_filt[mask]

    tokenizer = model.tokenizer
    tokenized = tokenizer(caption)

    phrases = []
    for logit in logits_filt:
        # Tokens scoring above text_threshold form the phrase for this box
        phrase = get_phrases_from_posmap(
            logit > text_threshold,
            tokenized,
            tokenizer,
        )
        if with_logits:
            phrase += f"({logit.max().item():.3f})"
        phrases.append(phrase)

    return boxes_filt, phrases


# =================================================
# Box utils
# =================================================
def cxcywh_to_xyxy(boxes, image_size):
    """
    Convert normalized (cx, cy, w, h) model boxes to pixel (x0, y0, x1, y1).
    """
    H, W = image_size
    boxes = boxes * torch.tensor([W, H, W, H])

    xyxy = boxes.clone()
    xyxy[:, :2] -= xyxy[:, 2:] / 2  # top-left corner = center - size / 2
    xyxy[:, 2:] += xyxy[:, :2]      # bottom-right corner = top-left + size

    return xyxy.int().tolist()


# =================================================
# Visualization
# =================================================
def visualize_and_save(image_pil, boxes, phrases, save_path):
    """
    Draw the detections on the image and save it.
    """
    draw = ImageDraw.Draw(image_pil, "RGBA")

    # Optional: change the font size or font path
    try:
        font = ImageFont.truetype("SimHei.ttf", 32)
    except OSError:
        font = ImageFont.load_default()

    for phrase, box in zip(phrases, boxes):
        color = tuple(np.random.randint(0, 255, size=3).tolist())
        x0, y0, x1, y1 = box

        draw.rectangle([x0, y0, x1, y1], outline=color, width=4)

        # textbbox is the current Pillow API; textsize only exists in older releases
        if hasattr(draw, "textbbox"):
            tb = draw.textbbox((x0, y0), phrase, font=font)
        else:
            w, h = draw.textsize(phrase, font=font)
            tb = (x0, y0, x0 + w, y0 + h)

        draw.rectangle(tb, fill=color + (160,))
        draw.text((x0, y0), phrase, fill=(255, 255, 255), font=font)

    # dirname is empty for a bare file name, and os.makedirs("") would raise
    out_dir = os.path.dirname(save_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    image_pil.save(save_path)


# =================================================
# Batch inference
# =================================================
def batch_infer(
    image_dir,
    save_dir,
    model,
    device,
    cfg,
    text_prompt,
):
    """
    Run inference on every image in a directory.
    """
    os.makedirs(save_dir, exist_ok=True)

    # Supported image extensions (extend if needed)
    image_files = [
        f for f in os.listdir(image_dir)
        if f.lower().endswith((".jpg", ".png", ".jpeg", ".bmp", ".tif"))
    ]

    print(f"Found {len(image_files)} images")

    for idx, img_name in enumerate(image_files):
        print(f"[{idx+1}/{len(image_files)}] Processing {img_name}")

        img_path = os.path.join(image_dir, img_name)
        image_pil, image = load_image(img_path)

        # PIL reports size as (width, height)
        H, W = image_pil.size[1], image_pil.size[0]

        boxes, phrases = groundingdino_infer(
            model=model,
            image=image,
            text_prompt=text_prompt,
            box_threshold=cfg["box_threshold"],    # from the YAML config
            text_threshold=cfg["text_threshold"],  # from the YAML config
            device=device,
        )

        if len(boxes) == 0:
            # No detections: save the image unchanged
            image_pil.save(os.path.join(save_dir, img_name))
            continue

        boxes_xyxy = cxcywh_to_xyxy(boxes, (H, W))
        save_path = os.path.join(save_dir, img_name)

        visualize_and_save(image_pil, boxes_xyxy, phrases, save_path)

# =================================================
# Main
# =================================================
if __name__ == "__main__":

    # [Required] Path to the config file
    cfg_path = "./dino_conf.yaml"

    # [Required] Input image directory
    image_dir = "./test_images"

    # [Required] Directory for the visualized results
    save_dir = "./vis_results"

    # [Required] Text prompt describing the target objects
    text_prompt = "cat"

    cfg = load_config(cfg_path)

    model, device = build_dino_model(cfg)

    batch_infer(
        image_dir=image_dir,
        save_dir=save_dir,
        model=model,
        device=device,
        cfg=cfg,
        text_prompt=text_prompt,
    )
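The box conversion in cxcywh_to_xyxy can be checked by hand on a single box. The sketch below performs the same arithmetic in plain Python, without torch, so the numbers are easy to follow; cxcywh_to_xyxy_pixels is a hypothetical stand-alone helper, not a replacement for the function in the script.

```python
def cxcywh_to_xyxy_pixels(box, image_width, image_height):
    """Convert one normalized (cx, cy, w, h) box to integer pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * image_width
    y0 = (cy - h / 2) * image_height
    x1 = (cx + w / 2) * image_width
    y1 = (cy + h / 2) * image_height
    # Truncate toward zero, like the .int() call in the tensor version
    return [int(x0), int(y0), int(x1), int(y1)]
```

For a box centered in a 3072 x 4096 image with a quarter of its width and half of its height, cxcywh_to_xyxy_pixels((0.5, 0.5, 0.25, 0.5), 3072, 4096) gives [1152, 1024, 1920, 3072].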

Step 3: Run inference

python infer_dino.py

Multi-class prompts

To detect several classes at once, separate the class names with periods, for example:

text_prompt = "cat"
# text_prompt = "dog. cat. pig."
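When the class names already live in a list, the prompt string can be built instead of typed. A minimal sketch, with build_text_prompt as a hypothetical helper that produces the period-separated form shown above:

```python
def build_text_prompt(class_names):
    """Join class names into a lowercase, period-separated prompt."""
    cleaned = [name.strip().lower() for name in class_names if name.strip()]
    if not cleaned:
        raise ValueError("at least one class name is required")
    return ". ".join(cleaned) + "."
```

For example, build_text_prompt(["dog", "cat", "pig"]) returns "dog. cat. pig.", which can be assigned to text_prompt in infer_dino.py.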

FAQ

Installation Issues

Q1: Compilation error with CUDA 12.2+

Problem: When compiling with CUDA 12.2 or later, the build fails on ms_deform_attn_cuda.cu, or reports:

no suitable conversion function from 'const at::DeprecatedTypeProperties' to 'c10::ScalarType' exists

Solution

Edit the file GroundingDINO/groundingdino/models/GroundingDINO/csrc/MsDeformAttn/ms_deform_attn_cuda.cu

On lines 65 and 135, replace value.type() with value.scalar_type():

// Before
AT_DISPATCH_FLOATING_TYPES(value.type(), "ms_deform_attn_forward_cuda", ([&] {

// After
AT_DISPATCH_FLOATING_TYPES(value.scalar_type(), "ms_deform_attn_forward_cuda", ([&] {
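The same edit can be scripted, which helps when the source tree is re-cloned often. The sketch below is a hypothetical helper that performs a plain text substitution on the file contents; it assumes both affected lines use the exact AT_DISPATCH_FLOATING_TYPES(value.type(), prefix shown above.

```python
def patch_dispatch_calls(source):
    """Replace value.type() with value.scalar_type() in AT_DISPATCH_FLOATING_TYPES calls.

    Returns the patched text and the number of replacements made.
    """
    old = "AT_DISPATCH_FLOATING_TYPES(value.type(),"
    new = "AT_DISPATCH_FLOATING_TYPES(value.scalar_type(),"
    return source.replace(old, new), source.count(old)
```

To apply it, read ms_deform_attn_cuda.cu as text, call the function, and write the result back after keeping a copy of the original. A replacement count of 2 matches the two lines named above; a count of 0 means the file is already patched or uses a different spelling.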

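Q2: NameError: name '_C' is not defined

Problem: At runtime, Python reports that _C is not defined. This indicates that the compiled CUDA extension was not built or could not be loaded.

Solution

Check that the CUDA environment variable is set correctly:

echo $CUDA_HOME    # should print the CUDA path, e.g. /usr/local/cuda-11.8

For more details, see the official GroundingDINO documentation.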
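The echo command above is for Linux shells. A portable check can be written in Python; this sketch also looks at CUDA_PATH, the variable the CUDA installer sets on Windows, and find_cuda_home is a hypothetical helper name.

```python
import os


def find_cuda_home(environ=None):
    """Return (variable name, path) for the first CUDA variable that points to a real directory."""
    environ = os.environ if environ is None else environ
    for name in ("CUDA_HOME", "CUDA_PATH"):
        path = environ.get(name)
        if path and os.path.isdir(path):
            return name, path
    return None
```

If find_cuda_home() returns None, set CUDA_HOME to the CUDA installation directory and reinstall GroundingDINO so the extension is compiled again.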

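Q3: Installation reports that the torch module is missing

Problem: Running pip install -e . fails with:

Traceback (most recent call last):
  File "<string>", line 32, in install_torch
ModuleNotFoundError: No module named 'torch'

Solution

pip builds the package in an isolated environment, which cannot see the torch that is already installed. Disable build isolation with:

pip install -e . --no-build-isolation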

Training Issues

Q1: TypeError from random.sample

Problem: datasets/odvg.py fails when it reaches this line:

vg_labels.extend(random.sample(neg_labels, num_to_add))

Error message:

TypeError: Population must be a sequence. For dicts or sets, use sorted(d).

Cause: Sampling directly from a set was deprecated in Python 3.9 and removed in Python 3.11, so random.sample() now raises a TypeError when given a set.

Solution

Edit datasets/odvg.py and convert the set to a list before sampling:

# Before
vg_labels.extend(random.sample(neg_labels, num_to_add))  # neg_labels is a set

# After
vg_labels.extend(random.sample(list(neg_labels), num_to_add))
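The fix can be tried in isolation before editing the repository. The sketch below uses sorted() rather than list(), following the hint in the error message; this is a deliberate variation, chosen because a sorted population yields the same draw for the same seed, whereas set iteration order can differ between runs. sample_negative_labels is a hypothetical helper, not code from odvg.py.

```python
import random


def sample_negative_labels(neg_labels, num_to_add, seed=None):
    """Sample distinct labels from a set without passing the set itself to random.sample."""
    rng = random.Random(seed)
    population = sorted(neg_labels)  # a sequence, as random.sample requires
    # Cap the request so asking for more labels than exist does not raise
    return rng.sample(population, min(num_to_add, len(population)))
```

Either conversion resolves the TypeError; the sorted variant additionally assumes that all labels are mutually comparable, for example all strings.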

Q2: KeyError when loading the COCO validation set

Problem: Loading the COCO-format validation set fails with:

res.dataset['info'] = copy.deepcopy(self.dataset['info'])
KeyError: 'info'

Cause: Depending on its version, pycocotools copies the info field of the annotation file, so the COCO-format dataset must contain an info key.

Solution

Add an info key at the top of the COCO-format validation JSON file (see ./config/instances_val2017.json for reference). Files written by voc2coco.py above already include one.

{
  "info": {
    "description": "COCO 2017 Dataset",
    "url": "http://cocodataset.org",
    "version": "1.0",
    "year": 2017,
    "contributor": "COCO Consortium",
    "date_created": "2017/09/01"
  },
  "images": [...],
  "annotations": [...],
  "categories": [...]
}
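For an existing annotation file, the key can be added programmatically instead of by hand. A minimal sketch, with ensure_info as a hypothetical helper and placeholder field values that you would adapt to your own dataset:

```python
from datetime import date


def ensure_info(coco, description="custom validation set"):
    """Add a minimal info block to a COCO dictionary if it has none; return True if added."""
    if "info" in coco:
        return False
    coco["info"] = {
        "description": description,  # placeholder text, not a required value
        "version": "1.0",
        "year": date.today().year,
        "date_created": date.today().isoformat(),
    }
    return True
```

Load val.json with json.load, call ensure_info on the result, and write it back with json.dump. The function leaves files that already carry an info key untouched.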

References