GPU training #29

yhl-97 · 2020-04-14T14:26:45Z

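Hello, when I run your CRNN training code on the CPU I get a ValueError (width or height must be greater than 0). Changing the parameters in trans.py fixes it there, but training on the GPU with the same parameters still raises the ValueError. What could be causing this?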

courao · 2020-04-14T14:34:49Z

Could you paste the full error message?

yhl-97 · 2020-04-15T02:41:24Z

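I think the problem is still in the image processing in trans.py: the input image itself may be too narrow or too short, so that after resizing its width or height ends up less than or equal to 0. How should I change the code to fix this?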
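The failure mode described above can be illustrated without any image library. This is a minimal sketch, assuming the transform scales each image to a fixed height while keeping its aspect ratio; `safe_target_size`, `target_height=32`, and `min_width` are hypothetical names and values, not taken from trans.py.

```python
def safe_target_size(width, height, target_height=32, min_width=1):
    """Scale (width, height) to target_height, keeping the aspect ratio.

    The scaled width is clamped to min_width so that a very narrow or very
    tall input can never produce a zero-width target size.
    """
    if width <= 0 or height <= 0:
        raise ValueError("input image has a non-positive dimension: %dx%d"
                         % (width, height))
    scaled_width = int(round(width * target_height / height))
    return max(scaled_width, min_width), target_height


# A 3x400 image would round to width 0 without the clamp.
print(safe_target_size(3, 400))
print(safe_target_size(100, 50))
```

The resulting pair could then be passed to whatever resize call the transform uses. Whether clamping or skipping such images is the better choice depends on the dataset.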

yhl-97 · 2020-04-15T02:41:56Z

courao · 2020-04-15T13:02:47Z

A quick-and-dirty fix is to wrap the body of __getitem__ in mydataset.py in a try/except, for example by replacing it with the code below:
def __getitem__(self, index): try: img = Image.open(self.files[index]) if self.transform is not None: img = self.transform( img ) img = img.convert('L') label = self.labels[index] if self.target_transform is not None: label = self.target_transform( label ) return (img,label) except: return self.__getitem__(np.random.randint(self.__len__()))

courao · 2020-04-15T13:04:34Z

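Hmm, the code I pasted came out badly formatted, so here it is again:

    def __getitem__(self, index):
        try:
            img = Image.open(self.files[index])
            if self.transform is not None:
                img = self.transform(img)
            img = img.convert('L')  # convert to grayscale
            label = self.labels[index]
            if self.target_transform is not None:
                label = self.target_transform(label)
            return (img, label)
        except Exception:
            # On any failure, fall back to a randomly chosen sample.
            return self.__getitem__(np.random.randint(self.__len__()))

In other words, wrap the original code in a try/except; if an exception is raised, load a different sample instead.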

yhl-97 · 2020-04-16T02:13:30Z

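Thank you for the pointer. After adding the exception handling, however, a new error appeared. Could you take a look at what is causing it, and what else I need to change?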

courao · 2020-04-16T07:01:55Z

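This error was most likely raised while evaluating on the validation set. It looks as though the model is on the CPU while the input images are on the GPU. Are you training on the CPU?
When training normally on the GPU, this should not happen.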

yhl-97 · 2020-04-16T09:43:14Z

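Sorry, I found that I had not switched the config back, so this must be the result of training on the CPU. If I train on the CPU, what do I need to change in the test code? My GPU is having some problems right now, so I cannot produce any output from it; I will show you the error message once I have it.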

courao · 2020-04-16T09:47:26Z

In the training code, find this line:
num_correct, num_all = val_model(config.val_infofile,net,True,log_file='compare-'+config.saved_model_prefix+'.log')
and change True to False.
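Rather than editing the flag by hand each time, it could be derived from where the model actually lives. This is a minimal sketch, assuming the third positional argument of val_model is a use-GPU flag (the thread implies this but does not state it); `model_is_on_gpu` is a hypothetical helper, not part of the repository.

```python
def model_is_on_gpu(net):
    """Return True if the first parameter of `net` is a CUDA tensor.

    `net` is expected to behave like a torch.nn.Module: parameters()
    yields tensors, and each tensor exposes a boolean `is_cuda` attribute.
    A model without parameters is treated as being on the CPU.
    """
    first_param = next(iter(net.parameters()), None)
    return bool(first_param is not None and first_param.is_cuda)


# Hypothetical call site, mirroring the line quoted above:
# num_correct, num_all = val_model(
#     config.val_infofile, net, model_is_on_gpu(net),
#     log_file='compare-' + config.saved_model_prefix + '.log')
```

With this, the validation inputs would follow the model's device, so a CPU run and a GPU run would not need different source edits.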

yhl-97 · 2020-04-16T13:59:15Z

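Thank you very much. Running on the CPU now works without problems, but when training on the GPU some input images still cause an error. I have attached one of the failing images below; could you help me work out why it fails?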

courao · 2020-04-16T14:05:10Z

I can't see anything wrong with the image itself. What error do you get? Doesn't the exception handling take care of it either?

yhl-97 · 2020-04-16T14:08:36Z

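I am running on a remote server, and as soon as it reaches an image like this the server kills my process without any error message; even with nohup I cannot get the error written out. Could you check whether there is anything wrong with this image? Thank you very much!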

courao · 2020-04-16T14:18:38Z

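Hi, I don't have a spare GPU server at the moment, so I can't test on a GPU. I did test this image locally and it seems fine. I wonder whether something like the file path is wrong; beyond that I can't say what the cause might be.
I think we need at least some error output to tell which step is failing. Alternatively, in that try/except, print the caught exception in the except branch and see what it says.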
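One way to act on this suggestion is to print the caught exception together with the file that triggered it. This is a minimal sketch, assuming a dataset shaped like the one in mydataset.py; `RetryingDataset`, `loader`, and `max_retries` are hypothetical names, not part of the repository.

```python
import random
import traceback


class RetryingDataset:
    """Hypothetical stand-in for the dataset class; not the repository's code."""

    def __init__(self, files, loader, max_retries=10):
        self.files = files
        self.loader = loader            # callable that turns a path into a sample
        self.max_retries = max_retries  # upper bound on attempts per request

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        for _ in range(self.max_retries):
            try:
                return self.loader(self.files[index])
            except Exception:
                # Report which file failed and why, then try another sample.
                print("failed to load %r:" % self.files[index])
                traceback.print_exc()
                index = random.randrange(len(self))
        raise RuntimeError("gave up after %d failed samples" % self.max_retries)
```

Bounding the number of retries avoids unbounded recursion when many samples are bad. Note that this only helps for Python exceptions; a process killed by the server from outside never reaches the except branch.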

yhl-97 · 2020-04-16T14:21:28Z

OK, thank you for your help, I'll keep looking into it. What's odd is that this image causes no problem with the CPU model but fails on the GPU.

yhl-97 · 2020-04-16T16:13:13Z

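Hello, I asked a friend to test it for me and he ran into a strange error. If this error were real, I don't think I could ever have run the code successfully before. Why would it occur?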

yhl-97 · 2020-04-16T16:36:35Z

One more question: in the test part of CPU training I ran into an "outputsize too small" problem. Is there a way to fix it?

courao · 2020-04-17T02:55:00Z

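> Hello, I asked a friend to test it for me and he ran into a strange error. If this error were real, I don't think I could ever have run the code successfully before. Why would it occur?

This is most likely a PyTorch version issue: older versions do not have the zero_infinity parameter.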
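If upgrading PyTorch is not an option, the argument could be passed only when the installed version accepts it. This is a minimal sketch using signature inspection; `accepts_kwarg` and `build_ctc_loss` are hypothetical helpers, and `zero_infinity=True` is an assumed value, since the thread does not show how the repository sets it.

```python
import inspect


def accepts_kwarg(func, name):
    """Return True if `func` declares a parameter `name` or takes **kwargs."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def build_ctc_loss(ctc_loss_cls):
    """Construct a CTC loss, passing zero_infinity only when it is supported."""
    if accepts_kwarg(ctc_loss_cls.__init__, "zero_infinity"):
        return ctc_loss_cls(zero_infinity=True)
    # Older releases: fall back to the default constructor.
    return ctc_loss_cls()
```

With PyTorch installed this would be called as `build_ctc_loss(torch.nn.CTCLoss)`. On a version without the parameter, infinite losses are not zeroed, so training behaviour may differ from the newer setup.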

courao · 2020-04-17T02:56:12Z

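> One more question: in the test part of CPU training I ran into an "outputsize too small" problem. Is there a way to fix it?

Sorry, I can't see an outputsize-too-small problem in the screenshot.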

yhl-97 · 2020-04-20T11:17:16Z

Hello, training on the CPU now runs mostly without errors, but by the second epoch it becomes extremely slow and essentially stalls. What might be causing this?
