Possible data race when saving large results in file storage #869
Hi @alex-van-vliet! I think the only viable solution would be a lock. This would have the drawback that it blocks the main thread when an event overlaps with saving the heartbeat event. Would you be willing to implement/test this and provide a pull request? BTW, what are you saving that is so large that it takes several seconds to serialize? Maybe another way of saving your results could solve the issue for you.
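For concreteness, here is a minimal sketch of that lock idea, assuming a single `threading.Lock` shared by every loop that emits events to the observers. The class and method names below only mirror the discussion in this thread; this is not sacred's actual implementation.

```python
import threading


class Run:
    """Illustrative only: one lock guarding every loop that emits to observers."""

    def __init__(self, observers):
        self._observers = observers
        self._emit_lock = threading.Lock()

    def _emit_heartbeat(self, info, captured_out, beat_time, result):
        # The heartbeat thread holds the lock while the observers serialize
        # the run entry ...
        with self._emit_lock:
            for observer in self._observers:
                observer.heartbeat_event(info=info, captured_out=captured_out,
                                         beat_time=beat_time, result=result)

    def _emit_completed(self, stop_time, result):
        # ... so the completed event (main thread) blocks until a running
        # heartbeat has finished, instead of mutating the entry mid-serialization.
        with self._emit_lock:
            for observer in self._observers:
                observer.completed_event(stop_time=stop_time, result=result)
```

As noted above, the trade-off is that the completed event on the main thread can block behind a slow heartbeat serialization.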
Yes, that is true. I am not familiar enough with the codebase: could another event clash with the heartbeat? If not, maybe `completed` could be triggered after the heartbeat has been joined? In this sense, this seems similar to #345. I agree that one solution could be to remove the timeout on the heartbeat stop, but I don't think that killing the thread is a good option, as it could corrupt data.

We would need to lock every loop that iterates over the observers (basically all the emit functions), right (to prevent one observer from receiving a different value than another)? Then we could probably also remove the timeout, because if the heartbeat blocks it will never release the lock and the experiment will never finish anyway. But this raises another question from #345: what if the observer is reused in another experiment and the heartbeat from one run clashes with an event from the second run? (I have never had this use case, so I can't really comment.) I don't have the time to implement this right now, but I can try when I do.

As for what I am saving: just a lot of segmentations for further analysis. Thanks to the exception handling, the completed event is called anyway and saves the results properly, so in this case no data is corrupted, but I could indeed also save them to a file.
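Regarding saving the segmentations to a file instead of returning them, a possible sketch (the experiment name, array shape, and file name are all illustrative) would be to write them to disk inside the run and attach them with `add_artifact`, so that only a small summary ends up in `run_entry` / `run.json`:

```python
import numpy as np
from sacred import Experiment
from sacred.observers import FileStorageObserver

ex = Experiment("segmentation")
ex.observers.append(FileStorageObserver("my_runs"))


@ex.main
def main(_run):
    segmentations = np.zeros((100, 256, 256), dtype=np.uint8)  # placeholder data
    # Save the large array to disk and register it as an artifact, so only a
    # small result goes into run_entry and run.json.
    np.save("segmentations.npy", segmentations)
    _run.add_artifact("segmentations.npy")
    return {"num_segmentations": int(segmentations.shape[0])}


if __name__ == "__main__":
    ex.run()
```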
This could cause the behavior that I and others have observed, in which, after completion, the `info` dict is not fully saved in some cases.
Hi,
When returning a large result from the main function (one that takes several seconds to serialize), there seems to be a data race on the `run_entry` object while the heartbeat is saving the file. For example, with:
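(The original snippet is not reproduced in this excerpt; the following is a hypothetical minimal example of the pattern described, a sacred 0.8.x experiment with a `FileStorageObserver` whose main function returns a result large enough to take seconds to JSON-serialize. All names are illustrative.)

```python
import numpy as np
from sacred import Experiment
from sacred.observers import FileStorageObserver

ex = Experiment("large_result")
ex.observers.append(FileStorageObserver("my_runs"))


@ex.main
def main():
    # e.g. many segmentation masks kept for later analysis; converting to
    # nested Python lists makes the result very slow to serialize to JSON.
    return [np.zeros((256, 256), dtype=np.uint8).tolist() for _ in range(100)]


if __name__ == "__main__":
    ex.run()
```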
I get the following error:
I am using sacred 0.8.2.
My understanding of the issue is the following:

There is a heartbeat; that heartbeat triggers the `heartbeat_event` in the observer, which calls `save_json` on the `run_entry`. The problem is that this takes some time, so while `save_json` is flattening the `run_entry`, the job completes and calls `completed_event`, which adds `stop_time` to the `run_entry`. This gives the `run_entry` a new key (because `stop_time` was not present before), so the `flatten` from the heartbeat fails because it is currently iterating over the `run_entry`.

Now technically, because of the exception handling, the `completed_event` will overwrite the `run.json` anyway, so the file does not seem to be corrupted. But this could still create other issues I am not aware of.
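To make the mechanism concrete, here is a minimal, self-contained sketch of the same class of failure, independent of sacred. The `flatten` stand-in below is purely illustrative: a background "heartbeat" thread iterates over the dict while the main thread adds a `stop_time` key, which makes Python raise `RuntimeError: dictionary changed size during iteration` in the background thread.

```python
import threading
import time

# Stand-in for the run_entry dict that the FileStorageObserver serializes.
run_entry = {"status": "RUNNING", "result": list(range(1000))}


def flatten(entry):
    # Stand-in for the slow serialization in save_json: iterating over the
    # entry takes long enough for another thread to modify it concurrently.
    flat = {}
    for key, value in entry.items():
        time.sleep(0.5)  # pretend each value takes a while to serialize
        flat[key] = value
    return flat


# The "heartbeat" thread starts serializing the entry ...
heartbeat = threading.Thread(target=flatten, args=(run_entry,))
heartbeat.start()

# ... and the "completed event" adds a new key while that iteration is still
# running, so the heartbeat thread dies with an unhandled
# "RuntimeError: dictionary changed size during iteration".
time.sleep(0.2)
run_entry["stop_time"] = time.time()
heartbeat.join()
```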