
Fine-tuning on a custom dataset: both ch_PP-OCRv3_rec_distillation.yml and ch_PP-OCRv4_rec_distillation.yml have bugs #14872


Open
3 tasks done
hecheng64 opened this issue Mar 17, 2025 · 4 comments
Labels
training this is a training related issue

Comments

@hecheng64

🔎 Search before asking

  • I have searched the PaddleOCR Docs and found no similar bug report.
  • I have searched the PaddleOCR Issues and found no similar bug report.
  • I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (problem description)

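Issue #14870 (ch_PP-OCRv3_rec_distillation.yml vs ch_PP-OCRv4_rec_distillation.yml) gave this advice:
"If you use PP-OCRv3 as the base model, use ch_PP-OCRv3_rec_distillation.yml for distillation training, to stay consistent with the PP-OCRv3 training strategy."
My reply: that is what I did. I originally trained by following the official site, with this configuration: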
Architecture:
  model_type: &model_type "rec"
  name: DistillationModel
  algorithm: Distillation
  Models:
    Teacher:
      pretrained:
      freeze_params: false # official default is false
      return_all_feats: true
      model_type: *model_type
      algorithm: SVTR_LCNet
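With this configuration, training runs and converges, but the resulting model seems to work only on handwritten text; printed text, which was recognized correctly before, now fails. I then suspected the freeze_params: false setting. Following GreatV's suggestion in issue #14866, I changed it to freeze_params: true, but after a long training run the loss is still large and training does not converge: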
  Models:
    Teacher:
      pretrained:
      freeze_params: true
      return_all_feats: true
      model_type: *model_type
      algorithm: SVTR_LCNet
      Transform:

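The dataset is a handwritten-OCR compilation that merges the Chinese Academy of Sciences handwriting data with open-source data from the web: https://aistudio.baidu.com/datasetdetail/102884/0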

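The same advice continued: "If you use PP-OCRv4 as the base model, use ch_PP-OCRv4_rec_distillation.yml, because PP-OCRv4 may include optimizations or new adjustments to the distillation strategy."
Training with ch_PP-OCRv4_rec_distillation.yml raises an error: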
python tools/train.py -c configs/rec/PP-OCRv4/ch_PP-OCRv4_rec_distillation.yml

[2025/03/17 15:13:06] ppocr ERROR: When parsing line train_data/138440.png 率普遍较高如方正
, error happened with msg: Traceback (most recent call last):
File "/home/PaddleOCR/ppocr/data/simple_dataset.py", line 137, in __getitem__
outs = transform(data, self.ops)
File "/home/PaddleOCR/ppocr/data/imaug/__init__.py", line 73, in transform
data = op(data)
File "/home/PaddleOCR/ppocr/data/imaug/operators.py", line 123, in __call__
data_list.append(data[key])
KeyError: 'valid_ratio'

I then tried removing valid_ratio from the KeepKeys entry in the config:

- KeepKeys:
    keep_keys:
    - image
    - label_ctc
    - label_gtc
    - length
    - valid_ratio

With valid_ratio removed, a different error is raised:
Exception in thread Thread-1 (_thread_loop):
Traceback (most recent call last):
File "/root/anaconda3/envs/ocr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/anaconda3/envs/ocr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/anaconda3/envs/ocr/lib/python3.10/site-packages/paddle/io/dataloader/dataloader_iter.py", line 619, in _thread_loop
batch = self._get_data()
File "/root/anaconda3/envs/ocr/lib/python3.10/site-packages/paddle/io/dataloader/dataloader_iter.py", line 766, in _get_data
batch.reraise()
File "/root/anaconda3/envs/ocr/lib/python3.10/site-packages/paddle/io/dataloader/worker.py", line 195, in reraise
raise self.exc_type(msg)
ValueError: DataLoader worker(1) caught ValueError with message:
Traceback (most recent call last):
File "/root/anaconda3/envs/ocr/lib/python3.10/site-packages/paddle/io/dataloader/worker.py", line 380, in _worker_loop
batch = fetcher.fetch(indices)
File "/root/anaconda3/envs/ocr/lib/python3.10/site-packages/paddle/io/dataloader/fetcher.py", line 85, in fetch
data = self.collate_fn(data)
File "/root/anaconda3/envs/ocr/lib/python3.10/site-packages/paddle/io/dataloader/collate.py", line 75, in default_collate_fn
return [default_collate_fn(fields) for fields in zip(*batch)]
File "/root/anaconda3/envs/ocr/lib/python3.10/site-packages/paddle/io/dataloader/collate.py", line 75, in <listcomp>
return [default_collate_fn(fields) for fields in zip(*batch)]
File "/root/anaconda3/envs/ocr/lib/python3.10/site-packages/paddle/io/dataloader/collate.py", line 56, in default_collate_fn
batch = np.stack(batch, axis=0)
File "/root/anaconda3/envs/ocr/lib/python3.10/site-packages/numpy/core/shape_base.py", line 449, in stack
raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape

Traceback (most recent call last):
File "/home/PaddleOCR/tools/train.py", line 270, in <module>
main(config, device, logger, vdl_writer, seed)
File "/home/PaddleOCR/tools/train.py", line 223, in main
program.train(
File "/home/PaddleOCR/tools/program.py", line 312, in train
for idx, batch in enumerate(train_dataloader):
File "/root/anaconda3/envs/ocr/lib/python3.10/site-packages/paddle/io/dataloader/dataloader_iter.py", line 840, in __next__
self.reader.read_next_list()[0]
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
[Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at ../paddle/phi/core/operators/reader/blocking_queue.h:175)
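The first traceback ends in `data_list.append(data[key])`, which suggests that KeepKeys simply indexes the sample dict with every configured key. The sketch below mimics that behavior to show why a key that no earlier transform wrote produces `KeyError: 'valid_ratio'`. The `KeepKeys` class here is a simplified stand-in, not PaddleOCR's implementation, and the sample dict and `missing_keys` helper are hypothetical.

```python
class KeepKeys:
    """Simplified stand-in for the KeepKeys operator seen in the traceback."""

    def __init__(self, keep_keys):
        self.keep_keys = keep_keys

    def __call__(self, data):
        # Mirrors `data_list.append(data[key])`: a missing key raises KeyError.
        return [data[key] for key in self.keep_keys]


def missing_keys(sample, keep_keys):
    """Return the requested keys that the upstream transforms did not produce."""
    return [key for key in keep_keys if key not in sample]


# Hypothetical sample after decoding and label encoding, where no step
# has written "valid_ratio".
sample = {"image": "<pixels>", "label_ctc": [1, 2], "label_gtc": [1, 2], "length": 2}
keep_keys = ["image", "label_ctc", "label_gtc", "length", "valid_ratio"]

print("missing:", missing_keys(sample, keep_keys))
try:
    KeepKeys(keep_keys)(sample)
except KeyError as err:
    print("KeyError:", err)
```

If, as seems likely, the step that writes valid_ratio is also the one that resizes images, then deleting the key from keep_keys only hides the symptom: the images would still reach the collate function unresized, which would be consistent with the second traceback.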

🏃‍♂️ Environment

release/2.10.0

🌰 Minimal Reproducible Example

See the bug description above.
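The second traceback fails in `np.stack(batch, axis=0)` with "all input arrays must have the same shape", meaning at least one sample in the batch differs in shape from the others. A rough way to narrow this down is to count how many samples share each shape before collation. The sketch below uses only the standard library; the helper names and the example shapes are hypothetical, not taken from the reporter's data.

```python
from collections import Counter


def shape_report(shapes):
    """Count how many samples share each shape."""
    return Counter(tuple(shape) for shape in shapes)


def can_stack(shapes):
    """True when every sample has the same shape, so stacking would succeed."""
    return len(shape_report(shapes)) <= 1


# Hypothetical (channels, height, width) shapes: the third image was never
# resized to the common width.
batch_shapes = [(3, 48, 320), (3, 48, 320), (3, 48, 517)]

print("stackable:", can_stack(batch_shapes))
for shape, count in shape_report(batch_shapes).most_common():
    print(shape, count)
```

Shapes that appear only once or twice are the first candidates to inspect, since they usually point to samples that skipped a resize or padding step.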

@GreatV
Collaborator

GreatV commented Mar 17, 2025

Could you provide a minimal reproducible demo? I'll give it a try.

@hecheng64
Author

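Do you mean the error when training with ch_PP-OCRv4_rec_distillation.yml? That one reproduces every time. I have seen five or six issues reporting it, and a fix seems to be scheduled. @GreatV As for distillation training with ch_PP-OCRv3_rec_distillation.yml not converging, here is my config:

ch_PP-OCRv3_rec_distillation.yml.txt

The dataset is at https://aistudio.baidu.com/datasetdetail/102884/0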

@GreatV
Collaborator

GreatV commented Mar 17, 2025

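"It seems to work on handwritten text, but printed text that used to work now fails."

    Teacher:
      pretrained:
      freeze_params: false

Possible causes

  1. The pretrained model was not loaded

If the pretrained field is empty, the teacher model may start training from random weights instead of the PP-OCRv3 pretrained weights. This can degrade performance on printed text while the model adapts to the handwritten text in the training data.

  2. Training data distribution

The training data may contain mostly handwritten samples, so the model performs well on handwriting but poorly on printed text. Make sure the training data includes enough printed samples.

  3. Teacher model parameters are not frozen

freeze_params: false in the config means the teacher model's parameters are updated during training. This can bias the teacher toward the handwritten features in the training data and hurt performance on printed text.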
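The first and third points can be turned into a quick sanity check on the Models section of a config. The sketch below assumes the config has already been loaded into a plain dict; the function name, the `global_pretrained` parameter (standing in for a top-level pretrained_model setting), and the example values are hypothetical. It encodes only the reasoning in this comment and does not inspect how PaddleOCR actually loads weights.

```python
def check_models(models, global_pretrained=None):
    """Return notes about weight initialization and freezing for each sub-model."""
    notes = []
    for name, cfg in models.items():
        # Weights may come from the sub-model's own field or a global setting.
        has_weights = bool(cfg.get("pretrained")) or bool(global_pretrained)
        frozen = bool(cfg.get("freeze_params"))
        if not has_weights:
            notes.append(f"{name}: no pretrained weights configured; may start from random weights")
            if frozen:
                notes.append(f"{name}: frozen but apparently uninitialized; it cannot learn")
        elif name == "Teacher" and not frozen:
            notes.append("Teacher: freeze_params is false; its weights will be updated by the new data")
    return notes


# Hypothetical Models section with both 'pretrained' fields left empty.
models = {
    "Teacher": {"pretrained": None, "freeze_params": False},
    "Student": {"pretrained": None, "freeze_params": False},
}

for note in check_models(models):
    print(note)
```

Under these assumptions, an empty pretrained field combined with freeze_params: true is the combination to watch for: a frozen teacher that never received pretrained weights would give the student nothing useful to imitate, which could look like training that never converges.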

@hecheng64
Author

@GreatV https://aistudio.baidu.com/projectdetail/4330587 Could you share the modified ch_PP-OCRv3_rec_distillation.yml from your official handwritten-text OCR example? The documentation in your repo (https://paddlepaddle.github.io/PaddleOCR/latest/applications/%E6%89%8B%E5%86%99%E6%96%87%E5%AD%97%E8%AF%86%E5%88%AB.html) describes configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml like this:

epoch_num: 100 # number of training epochs
save_model_dir: ./output/ch_PP-OCR_v3_rec
save_epoch_step: 10
eval_batch_step: [0, 100] # evaluation interval: evaluate every 100 steps
pretrained_model: ./pretrained_models/ch_PP-OCRv3_rec_train/best_accuracy # path to the pretrained model

lr:
  name: Cosine # learning-rate decay schedule changed to Cosine
  learning_rate: 0.0001 # learning rate changed for fine-tuning
  warmup_epoch: 2 # number of warmup epochs changed

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data # path to the training images
    ext_op_transform_idx: 1
    label_file_list:
    - ./train_data/chineseocr-data/rec_hand_line_all_label_train.txt # training labels
    - ./train_data/handwrite/HWDB2.0Train_label.txt
    - ./train_data/handwrite/HWDB2.1Train_label.txt
    - ./train_data/handwrite/HWDB2.2Train_label.txt
    - ./train_data/handwrite/hwdb_ic13/handwriting_hwdb_train_labels.txt
    - ./train_data/handwrite/HW_Chinese/train_hw.txt
    ratio_list:
    - 0.1
    - 1.0
    - 1.0
    - 1.0
    - 0.02
    - 1.0
  loader:
    shuffle: true
    batch_size_per_card: 64
    drop_last: true
    num_workers: 4
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data # path to the test images
    label_file_list:
    - ./train_data/chineseocr-data/rec_hand_line_all_label_val.txt # test labels
    - ./train_data/handwrite/HWDB2.0Test_label.txt
    - ./train_data/handwrite/HWDB2.1Test_label.txt
    - ./train_data/handwrite/HWDB2.2Test_label.txt
    - ./train_data/handwrite/hwdb_ic13/handwriting_hwdb_val_labels.txt
    - ./train_data/handwrite/HW_Chinese/test_hw.txt
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 64
    num_workers: 4

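The description feels vague. The example above does not seem to include a Teacher or a Student, yet a file named ch_PP-OCRv3_rec_distillation.yml surely should have both. Could you provide the complete ch_PP-OCRv3_rec_distillation.yml as modified for this example? I have already lost several days to detours, and many examples online simply copy this description. Does this example configure Teacher and Student at all? If it does, how exactly are they configured? One moment the advice is freeze_params: false, the next it is true. I just want to see the ch_PP-OCRv3_rec_distillation.yml you used when you ran the handwriting example successfully.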
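One detail of the quoted Train section that is easy to get wrong is ratio_list, which has to line up one-to-one with label_file_list (six entries each here). The sketch below checks that pairing and estimates how many lines each label file contributes, assuming the dataset keeps roughly `round(len(lines) * ratio)` lines per file, which is my understanding of how SimpleDataSet uses the ratios. The line counts are hypothetical placeholders, not the real sizes of these files.

```python
def samples_per_file(line_counts, ratio_list):
    """Estimate how many label lines each file contributes per epoch."""
    if len(line_counts) != len(ratio_list):
        raise ValueError("label_file_list and ratio_list must have the same length")
    return [round(count * ratio) for count, ratio in zip(line_counts, ratio_list)]


# Ratios copied from the quoted config; line counts are hypothetical.
ratio_list = [0.1, 1.0, 1.0, 1.0, 0.02, 1.0]
line_counts = [100000, 5000, 5000, 5000, 200000, 3000]

picked = samples_per_file(line_counts, ratio_list)
print(picked, sum(picked))
```

With ratios like 0.1 and 0.02 on the two large files, most of each epoch would come from the handwriting-only files, which is worth keeping in mind when judging how the fine-tuned model behaves on printed text.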

@SWHL SWHL added the training this is a training related issue label Apr 2, 2025