You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
-
Data in pretrain_hq.jsonl use im_start and im_end labels to separate multiple sentences in one line。
(https://www.modelscope.cn/datasets/gongjy/minimind_dataset/files)
This is one sample line:
{"text": "<|im_start|>如果一等奖有三个获奖者,二等奖有五个获奖者,那么一等奖和二等奖中谁的概率更高?获得二等奖的概率更高。<|im_end|> <|im_start|>我喜欢墨西哥的坦塔罗卡海滩最多,因为那里的海水非常清澈,沙滩也非常干净。谢谢你和我分享你的旅行经历!<|im_end|> <|im_start|>写一首五言绝句,表达家乡的景色与情感。故园绿柳依依,\n江水波光滟滟。\n儿时玩耍无尽,\n远离时空嬉戏。<|im_end|> <|im_start|>请为我生成一个五言绝句。秋叶飘零落,寒风吹过枝。\n瑟瑟霜华夜,孤影伴人迟。\n思念随风去,回忆在心悲。<|im_end|> <|im_start|>什么是万有引力定律?万有引力定律提出物体之间的引力与它们的质量和距离成反比,这是一个基本的物理定律。<|im_end|> <|im_start|>谢谢你!那请你为我开一个提醒,让我在明天早上带伞出门。好的,我将为您设置一个明天早上带伞出门的提醒。<|im_end|> <|im_start|>产生一首五言诗,以“梅花”为主题。白玉冰肌,花香清韵,寒冬沁心。山千二万重,雪遮满地,唯见梅花傲世。<|im_end|> <|im_start|>为什么寒冷的天气里,老人们喜欢喝姜茶?因为姜可以帮助身体保持温暖,增强免疫力,减少感冒和流感的风险。<|im_end|>"}
————————————
(https://github.com/jingyaogong/minimind/blob/master/dataset/lm_dataset.py#L16)
When I read the source code of PretrainDataset class, I didn't find any special process to trim those two labels. Anyone who can tell me which code did this trim action? Thanks.
Beta Was this translation helpful? Give feedback.
All reactions