
【Open-Source Task】Fixing Paddle CPU/GPU Kernel Accuracy Issues Across the Board #72667

Opened by @lshpku

1. Background

Paddle is currently carrying out a systematic audit of the boundary correctness of its full API surface. We developed PaddleAPITest to test APIs suspected of correctness issues: it runs the same API in Torch and compares the results, and this comparison revealed accuracy diffs between some Paddle APIs and their Torch counterparts. An initial check of a small number of APIs confirmed that some Paddle APIs do have genuine correctness problems (along the way we also found a few correctness issues in Torch APIs, such as torch.tril and torch.triu). We are now publishing the list of affected Paddle APIs and inviting the community to help fix them.
By participating in this activity, you will learn the design of Paddle's operator library framework, gain a detailed understanding of how Paddle CPU and GPU kernels are implemented, and build experience in debugging operator accuracy issues.

2. Task Description

2.1 Task Overview and Assignment

For each Paddle API that PaddleAPITest found to have an accuracy diff against Torch, locate the root cause of the diff and fix it.
The APIs with accuracy diffs, and their assignments, are listed below:

Important

Difficulty of each task: 0.15×🌟
A walkthrough of the tasks is available in the recording: https://meeting.tencent.com/crm/l59EWmRZc4 (00:52:00~00:59:30)

No. API Kernel type Assignee / Status / PR
1 paddle.add_n CPU @BeingGod @cangtianhuang
2 paddle.all GPU @BeingGod @cangtianhuang @co63oc #280
3 paddle.all CPU @BeingGod @cangtianhuang @co63oc #280
4 paddle.any GPU @BeingGod @cangtianhuang @co63oc #280
5 paddle.any CPU @BeingGod @cangtianhuang @co63oc #280
6 paddle.argsort GPU @cszdrg #219
7 paddle.argsort CPU @cszdrg #219
8 paddle.bitwise_right_shift CPU @ooooo-create #194
9 paddle.broadcast_to GPU @ooooo-create #72992
10 paddle.broadcast_to CPU @ooooo-create #72992
11 paddle.clip CPU @BeingGod @ooooo-create @co63oc #73492
12 paddle.concat CPU @BeingGod
13 paddle.copysign GPU @BeingGod
14 paddle.count_nonzero CPU @BeingGod @co63oc #280
15 paddle.cumprod CPU @BeingGod @ooooo-create #72897 #212
16 paddle.cumsum GPU @BeingGod
17 paddle.cumsum CPU @BeingGod
18 paddle.diag CPU @cangtianhuang #269
19 paddle.dot GPU @co63oc #316
20 paddle.expand CPU @ccsuzzh @ooooo-create #72992
21 paddle.heaviside CPU @ooooo-create #72894
22 paddle.kron CPU
23 paddle.linalg.cholesky_solve GPU
24 paddle.linalg.cholesky_solve CPU
25 paddle.linalg.eigh GPU @NKNaN #276
26 paddle.linalg.eigvals CPU
27 paddle.linalg.eigvalsh CPU
28 paddle.linalg.norm GPU @ooooo-create #197
29 paddle.linalg.norm CPU @ooooo-create #197
30 paddle.linalg.pinv GPU
31 paddle.linalg.pinv CPU
32 paddle.linalg.svd_lowrank GPU
33 paddle.linalg.svd_lowrank CPU
34 paddle.logit GPU @ooooo-create #72973
35 paddle.logit CPU @ooooo-create #72973
36 paddle.maximum CPU @BeingGod @Z-NAVY @co63oc #286
37 paddle.minimum CPU @BeingGod @Z-NAVY @co63oc #286
38 paddle.nextafter GPU @ooooo-create #72965
39 paddle.nextafter CPU @ooooo-create #72965
40 paddle.nn.functional.cosine_similarity GPU @BeingGod @Z-NAVY @Cutelemon6 #73014
41 paddle.nn.functional.cosine_similarity CPU @BeingGod @Z-NAVY @Cutelemon6 #73014
42 paddle.nn.functional.embedding CPU @Z-NAVY @ooooo-create #73445
43 paddle.nn.functional.grid_sample GPU @Juggler-YAN
44 paddle.nn.functional.grid_sample CPU @Juggler-YAN
45 paddle.nn.functional.gumbel_softmax GPU @ooooo-create #303
46 paddle.nn.functional.gumbel_softmax CPU @ooooo-create #303
47 paddle.nn.functional.hardsigmoid CPU @BeingGod @ooooo-create @co63oc #281
48 paddle.nn.functional.rrelu GPU @BeingGod @Z-NAVY @ooooo-create #234
49 paddle.nn.functional.rrelu CPU @BeingGod @Z-NAVY @ooooo-create #234
50 paddle.prod CPU @BeingGod @ooooo-create #251
51 paddle.put_along_axis CPU @ooooo-create #218
52 paddle.reciprocal GPU @BeingGod @Z-NAVY @ooooo-create #73128
53 paddle.rsqrt GPU @BeingGod @Z-NAVY @co63oc #314
54 paddle.signal.istft CPU @co63oc #287
55 paddle.std GPU @Z-NAVY @ooooo-create #72879
56 paddle.std CPU @Z-NAVY @ooooo-create #72879
57 paddle.sum CPU @BeingGod @ooooo-create #73012
58 paddle.Tensor.argsort CPU @cszdrg #219
59 paddle.Tensor.cholesky_solve GPU @Juggler-YAN
60 paddle.Tensor.cholesky_solve CPU @Juggler-YAN
61 paddle.Tensor.expand CPU @ooooo-create #72992
62 paddle.Tensor.fill_diagonal_ CPU @co63oc #288
63 paddle.Tensor.logit GPU @ooooo-create #72973
64 paddle.Tensor.logit CPU @ooooo-create #72973
65 paddle.Tensor.median GPU @Z-NAVY @NKNaN #265
66 paddle.Tensor.median CPU @Z-NAVY @NKNaN #265
67 paddle.Tensor.put_along_axis CPU @ooooo-create #218
68 paddle.Tensor.set_ CPU @NKNaN #73294
69 paddle.Tensor.sum CPU @BeingGod @ooooo-create #73012
70 paddle.Tensor.topk CPU @ooooo-create #217
71 paddle.tensordot CPU @co63oc #288
72 paddle.trace CPU @ooooo-create #73018
73 paddle.unique CPU @ccsuzzh
74 paddle.unique_consecutive GPU @ccsuzzh #72948 #192
75 paddle.unique_consecutive CPU @ccsuzzh #72948 #192
76 paddle.var GPU @ooooo-create #72879
77 paddle.var CPU @ooooo-create #72879
78 paddle.cast CPU
79 paddle.combinations GPU @ooooo-create #73293
80 paddle.cumulative_trapezoid CPU @ooooo-create #73317
81 paddle.diag GPU @cangtianhuang #269
82 paddle.diagflat GPU @cangtianhuang #269
83 paddle.diagonal_scatter GPU @ooooo-create #304
84 paddle.einsum GPU
85 paddle.einsum CPU
86 paddle.gammaln GPU @NKNaN #73344
87 paddle.geometric.send_ue_recv CPU
88 paddle.geometric.send_uv CPU @cangtianhuang #249
89 paddle.incubate.nn.functional.fused_bias_act GPU
90 paddle.incubate.nn.functional.fused_bias_dropout_residual_layer_norm GPU
91 paddle.incubate.nn.functional.fused_layer_norm CPU
92 paddle.incubate.nn.functional.fused_linear GPU
93 paddle.incubate.nn.functional.fused_linear_activation GPU
94 paddle.incubate.nn.functional.fused_multi_head_attention GPU
95 paddle.incubate.nn.functional.fused_rotary_position_embedding GPU
96 paddle.incubate.nn.functional.variable_length_memory_efficient_attention GPU
97 paddle.incubate.softmax_mask_fuse GPU
98 paddle.index_put GPU @ooooo-create #308
99 paddle.lerp GPU @co63oc #278
100 paddle.linalg.cond CPU @ooooo-create #73229
101 paddle.linalg.eigh CPU
102 paddle.linalg.lstsq CPU
103 paddle.linalg.matrix_rank CPU @ooooo-create #73295
104 paddle.matmul GPU
105 paddle.max CPU @ooooo-create #73229
106 paddle.nn.functional.adaptive_avg_pool2d GPU
107 paddle.nn.functional.adaptive_avg_pool3d GPU
108 paddle.nn.functional.avg_pool3d GPU
109 paddle.nn.functional.batch_norm GPU
110 paddle.nn.functional.binary_cross_entropy GPU @NKNaN #267 #270
111 paddle.nn.functional.binary_cross_entropy_with_logits GPU @NKNaN #267 #270
112 paddle.nn.functional.conv1d CPU @co63oc #306
113 paddle.nn.functional.conv2d GPU
114 paddle.nn.functional.conv2d_transpose CPU
115 paddle.nn.functional.cross_entropy GPU
116 paddle.nn.functional.ctc_loss GPU @ooooo-create #277
117 paddle.nn.functional.dice_loss GPU @NKNaN #319
118 paddle.nn.functional.embedding GPU @ooooo-create #73445
119 paddle.nn.functional.gaussian_nll_loss GPU @cangtianhuang #272
120 paddle.nn.functional.interpolate GPU
121 paddle.nn.functional.interpolate CPU
122 paddle.nn.functional.kl_div GPU @co63oc #279
123 paddle.nn.functional.linear CPU
124 paddle.nn.functional.log_softmax GPU @ooooo-create #271
125 paddle.nn.functional.multi_margin_loss GPU @co63oc #282
126 paddle.nn.functional.rnnt_loss CPU @ooooo-create #266
127 paddle.nn.functional.sigmoid_focal_loss GPU @NKNaN #73430 #292
128 paddle.nn.functional.softmax_with_cross_entropy GPU @NKNaN #317
129 paddle.nn.functional.temporal_shift GPU
130 paddle.nn.functional.upsample GPU
131 paddle.nn.utils.parameters_to_vector GPU @co63oc #278
132 paddle.outer GPU @NKNaN #73324
133 paddle.pow GPU @cszdrg #73244 #73274
134 paddle.prod GPU @ooooo-create #251
135 paddle.put_along_axis GPU
136 paddle.scale GPU @NKNaN #264
137 paddle.scatter GPU @ooooo-create #302
138 paddle.sort CPU @ooooo-create #259
139 paddle.strided_slice CPU @ooooo-create #250
140 paddle.Tensor.matmul CPU
141 paddle.Tensor.mul CPU
142 paddle.Tensor.astype CPU
143 paddle.Tensor.cast CPU
144 paddle.Tensor.clip GPU @co63oc #73492
145 paddle.Tensor.fill_diagonal_tensor GPU @cangtianhuang #272
146 paddle.Tensor.mean GPU @NKNaN #73320
147 paddle.Tensor.set_ GPU @NKNaN #73294
148 paddle.Tensor.tile GPU @ooooo-create #73454
149 paddle.tensordot GPU @co63oc #288
150 paddle.trapezoid CPU @ooooo-create #73317
151 paddle.vander CPU @cszdrg #252
152 paddle.vision.ops.deform_conv2d GPU
153 paddle.vision.ops.deform_conv2d CPU
154 paddle.vision.ops.distribute_fpn_proposals GPU @cszdrg #252
155 paddle.incubate.nn.functional.fused_layer_norm GPU output dtype differs; do not sign up yet
156 paddle.nn.functional.conv2d_transpose GPU output dtype differs; do not sign up yet
157 paddle.nn.functional.linear GPU output dtype differs; do not sign up yet
158 paddle.vision.ops.roi_align GPU output dtype differs; do not sign up yet
159 paddle.Tensor.cumsum GPU output dtype differs; do not sign up yet
160 paddle.Tensor.frexp GPU output dtype differs; do not sign up yet
161 paddle.add GPU output dtype differs; do not sign up yet
162 paddle.add_n GPU output dtype differs; do not sign up yet
163 paddle.clip GPU output dtype differs; do not sign up yet
164 paddle.copysign GPU output dtype differs; do not sign up yet
165 paddle.cummax GPU output dtype differs; do not sign up yet
166 paddle.cummin GPU output dtype differs; do not sign up yet
167 paddle.cumsum GPU output dtype differs; do not sign up yet
168 paddle.floor GPU output dtype differs; do not sign up yet
169 paddle.frexp GPU output dtype differs; do not sign up yet
170 paddle.histogram GPU output dtype differs; do not sign up yet
171 paddle.linalg.lstsq GPU output dtype differs; do not sign up yet
172 paddle.nn.functional.adaptive_max_pool1d GPU output dtype differs; do not sign up yet
173 paddle.nn.functional.adaptive_max_pool2d GPU output dtype differs; do not sign up yet
174 paddle.nn.functional.adaptive_max_pool3d GPU output dtype differs; do not sign up yet
175 paddle.nn.functional.max_pool1d GPU output dtype differs; do not sign up yet
176 paddle.nn.functional.max_pool2d GPU output dtype differs; do not sign up yet
177 paddle.nn.functional.max_pool3d GPU output dtype differs; do not sign up yet
178 paddle.nn.functional.one_hot GPU output dtype differs; do not sign up yet
179 paddle.nn.functional.smooth_l1_loss GPU output dtype differs; do not sign up yet
180 paddle.where GPU output dtype differs; do not sign up yet
181 paddle.maximum GPU @wanghuancoder #71716
182 paddle.minimum GPU @wanghuancoder #71716
183 paddle.Tensor.getitem GPU @wanghuancoder #71716

2.2 Fix Suggestions

  1. For accuracy diffs, save the input and output tensors to files with numpy.savetxt and analyze the diff data to see whether the errors follow a pattern (a helper sketch follows this item)
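A minimal sketch of this approach (the helper name dump_and_diff and the tolerances are illustrative, not part of PaddleAPITest):

import numpy

def dump_and_diff(paddle_out, torch_out, prefix="debug"):
    """Save both outputs with numpy.savetxt and report where they differ."""
    p = paddle_out.numpy().reshape(-1)                # flatten: savetxt only handles 1-D/2-D
    t = torch_out.cpu().detach().numpy().reshape(-1)
    numpy.savetxt(prefix + "_paddle.txt", p)
    numpy.savetxt(prefix + "_torch.txt", t)
    # Indices where the two results disagree beyond the tolerance.
    bad = numpy.flatnonzero(~numpy.isclose(p, t, rtol=1e-6, atol=1e-6))
    print(bad.size, "mismatching elements; first indices:", bad[:10])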
  2. For CUDA Error crashes, you can debug with code such as the following:
void test_cuda(const std::string& str) {
  std::cout << str << " begin" << std::endl;
  // 1. wait for all kernels to finish
  PADDLE_ENFORCE_GPU_SUCCESS(cudaDeviceSynchronize());
  // 2. fetch the sticky error state
  PADDLE_ENFORCE_GPU_SUCCESS(cudaGetLastError());
  // 3. check for CUDA error 700 (illegal memory access)
  size_t bytes = 256;
  char* cuda_mem;
  char* cpu_mem = new char[bytes + 1];
  cudaMalloc(&cuda_mem, bytes + 1);
  cudaMemset(cuda_mem, 0, bytes + 1);
  cudaMemcpyAsync(cpu_mem, cuda_mem, bytes, cudaMemcpyDeviceToHost);
  cudaFree(cuda_mem);
  delete[] cpu_mem;
  std::cout << str << " end" << std::endl;
}

Place ("instrument") calls to this function at each stage of the XXXKernel implementation; the code just before the first probe that crashes is usually what caused the CUDA Error.
  3. For tensor accuracy debugging, when the files saved by numpy.savetxt would be too large to be practical, you can instead debug as follows:

import torch
import paddle
import numpy

device = torch.device("cuda:0")
torch.set_default_device(device)

def init_input(numpy_tensor):
    paddle_x = paddle.to_tensor(numpy_tensor)
    torch_x = torch.tensor(numpy_tensor, requires_grad=True)
    paddle_x.stop_gradient = False
    numpy.testing.assert_allclose(
        paddle_x.numpy(),
        torch_x.cpu().detach().numpy(),
        1e-10,
        1e-10,
        err_msg='input diff'
    )
    return paddle_x, torch_x

# paddle.amax(Tensor([3, 38028357, 4, 5],"float32"), axis=-1, keepdim=True, )
input_tensor = (numpy.random.random([3, 38028357, 4, 5]) - 0.5).astype("float32")
paddle_x, torch_x = init_input(input_tensor)
paddle_out = paddle.amax(paddle_x, axis=-1, keepdim=True)
torch_out = torch.amax(torch_x, dim=-1, keepdim=True)
print(paddle_out.shape)
print(torch_out.shape)

numpy_tensor = (numpy.random.random([3, 38028357, 4, 1]) - 0.5).astype("float32")
paddle_grad, torch_grad = init_input(numpy_tensor)
torch_x_grad = torch.autograd.grad([torch_out], [torch_x], grad_outputs=torch_grad)
paddle_x_grad = paddle.grad([paddle_out], [paddle_x], grad_outputs=paddle_grad, allow_unused=True)

# Compare the gradients element by element and print every mismatch with its context.
p = paddle_x_grad[0].numpy()
t = torch_x_grad[0].cpu().detach().numpy()
for i in range(3):
    for j in range(38028357):
        for k in range(4):
            for m in range(5):
                if p[i][j][k][m] != t[i][j][k][m]:
                    print("i = {}".format(i), "j = {}".format(j), "k = {}".format(k), "m = {}".format(m),
                          "paddle = {}".format(p[i][j][k][m]), "torch = {}".format(t[i][j][k][m]),
                          "numpy_tensor = {}".format(numpy_tensor[i][j][k][0]),
                          "paddle_out = {}".format(paddle_out[i][j][k][0]),
                          "torch_out = {}".format(torch_out[i][j][k][0]))
                    print("input_tensor = ", "{}".format(input_tensor[i][j][k][0]), "{}".format(input_tensor[i][j][k][1]),
                          "{}".format(input_tensor[i][j][k][2]), "{}".format(input_tensor[i][j][k][3]),
                          "{}".format(input_tensor[i][j][k][4]))
                    print("p = ", "{}".format(p[i][j][k][0]), "{}".format(p[i][j][k][1]), "{}".format(p[i][j][k][2]),
                          "{}".format(p[i][j][k][3]), "{}".format(p[i][j][k][4]))
                    print("t = ", "{}".format(t[i][j][k][0]), "{}".format(t[i][j][k][1]), "{}".format(t[i][j][k][2]),
                          "{}".format(t[i][j][k][3]), "{}".format(t[i][j][k][4]))

numpy.testing.assert_allclose(
    paddle_out.numpy(),
    torch_out.cpu().detach().numpy(),
    1e-2,
    1e-2,
    err_msg='output diff'
)
numpy.testing.assert_allclose(
    paddle_x_grad[0].numpy(),
    torch_x_grad[0].cpu().detach().numpy(),
    1e-2,
    1e-2,
    err_msg='grad diff'
)
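If scanning a tensor of this size with nested Python loops is too slow, a vectorized variant can find all mismatching coordinates at once (a sketch that only assumes p and t from the script above):

import numpy

# Vectorized alternative to the nested loops: collect every mismatching
# coordinate in one pass instead of iterating element by element.
mismatch = numpy.argwhere(p != t)        # shape [n_mismatch, 4]
print(len(mismatch), "mismatching elements")
for i, j, k, m in mismatch[:20]:         # inspect only the first few
    print("index =", (i, j, k, m), "paddle =", p[i, j, k, m], "torch =", t[i, j, k, m])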

2.3 Notes

  1. The failing configs and error logs can be found in https://github.com/PFCCLab/PaddleAPITest/tree/main/report/ci_ce_cpu and https://github.com/PFCCLab/PaddleAPITest/tree/main/report/ci_ce_gpu
  2. Submit PRs to Paddle, one PR per fixed API
  3. Some APIs, such as paddle.max and paddle.Tensor.max, are really the same API; please sign up for them together to avoid duplicated fixes. Likewise, related APIs such as paddle.max and paddle.min are best analyzed and fixed together
  4. After fixing an API, add new test cases to that API's unit tests
  5. If none of the configs reproduces the error, it may already have been fixed by someone else; please report the non-reproducible API in this issue
  6. Reference fix PR (accuracy issue): PR71716

3. Testing Tool

This project uses the PaddleAPITest tool to identify problematic APIs and cases.
The following command runs a forward-and-backward accuracy comparison between paddle.abs and torch.abs: python engine.py --accuracy=True --api_config='paddle.abs(Tensor([1, 100],"float64"), )'. Here "accuracy" selects the accuracy comparison test and "api_config" is the configuration to test. Note: the double quotes inside the config must be escaped with backslashes (\") on the command line.
Before running any test, you must create the directory PaddleAPITest/tester/api_config/test_log/ to hold the test results and checkpoints.
PaddleAPITest currently supports three test modes: paddle_only, accuracy, and paddle_cinn:

  • paddle_only: simply runs the config once in Paddle dynamic-graph mode to verify that the PaddleAPITest "engine" supports the config; it does not check accuracy.
  • accuracy: compares the forward and backward results of a Paddle API against Torch. This is the mode this task mainly uses.
  • paddle_cinn: compares Paddle dynamic-graph results against the Paddle static-graph compiler. Not relevant to this task.
    In addition:
  • api_config tests a single config, e.g.: python engine.py --accuracy=True --api_config='paddle.abs(Tensor([1, 100],"float64"), )'
  • api_config_file tests multiple configs; put all the configs to test into one file, e.g.: python engine.py --api_config_file=PaddleAPITest/tester/api_config/api_config.txt --accuracy=True > tester/api_config/test_log/log.log 2>&1
  • Bugs triggered during testing can make the process core-dump and interrupt the run, so PaddleAPITest implements a checkpoint mechanism that records already-tested configs. A batch test can then be launched with: for i in {1..10000}; do python engine.py --api_config_file=PaddleAPITest/tester/api_config/api_config_merge.txt --accuracy=True >> tester/api_config/test_log/log.log 2>&1; done (see the example config file below)
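For reference, an api_config file is plain text with one configuration per line, in the same format as the --api_config argument (the second line's shape and dtype are illustrative, not taken from the task list):

paddle.abs(Tensor([1, 100],"float64"), )
paddle.abs(Tensor([32, 32],"float32"), )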

4. How to Get Started

4.1 Task Sign-up

Important

Please follow this format exactly; otherwise the happy open-source assistant bot cannot update the information automatically.
Please claim tasks by commenting on this issue; otherwise the sign-up cannot be registered. The format is:

【报名】: 1、2、3-5

Separate multiple tasks with the Chinese enumeration comma (、); a run of consecutive tasks can be written as a range with a hyphen.

4.2 Environment Setup

Developing in a Docker environment is recommended. Install Torch from the prebuilt wheel bundle:

wget https://paddle-qa.bj.bcebos.com/benchmark/pretrained/torch_whl_250.tar.gz
tar xf torch_whl_250.tar.gz && cd torch_whl_250
pip install torch*.whl --no-index --find-links .
  • Install PaddleAPITest: just clone the repository; no installation is needed (a quick environment check is sketched below)
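After setup, a quick sanity check (a suggestion, not part of the official steps) is to confirm that both frameworks import in the same environment and see the GPU:

import paddle
import torch

# Both frameworks must be importable side by side for the accuracy
# comparisons to run.
print("torch:", torch.__version__, "cuda available:", torch.cuda.is_available())
print("paddle:", paddle.__version__, "compiled with cuda:", paddle.device.is_compiled_with_cuda())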

4.3 提交PR

Submit your PR using the following template:
// ------- PR 标题 --------

[Accuracy diff No.xxx] Fix accuracy diff for xxx API

// ------- PR 内容 --------

PR Category
Accuracy diff
PR types
Improvements
Description
The fixing process for xx.
(details of the changes, test results, etc.)

Note: to facilitate review, please be sure to include the following information in your PR:

Board Information

Task direction Task count Submissions / Claims Submission rate Completed Completion rate
Accuracy issue fixes 183 103 / 139 56.28% 86 46.99%

Statistics

In no particular order: @co63oc (15) @cszdrg (6) @ooooo-create (43) @cangtianhuang (6) @NKNaN (9) @Cutelemon6 (2) @ccsuzzh (2) @wanghuancoder (3)
