overwrite/"copy files to one node" behaviour unexpected #1419
With overwrite set to true, changing the second `if` to

```python
if not self.file_should_be_downloaded(msg):
    worklist.ok.append(msg)
    continue
```

makes it do what I want. The file is still not downloaded, because it's already there and identical to the file described by the message, but the message does get posted as if the file had just been downloaded. So my understanding of the current behaviour is:
My modified behaviour is:
I think both situations are inconsistent with the documentation: https://metpx.github.io/sarracenia/Reference/sr3_options.7.html#overwrite-flag-default-off
The current behaviour is correct. It assumes that the existing file (not being overwritten) has already been posted. How does the original file version get there, and why didn't the process that downloaded it post it?

ok... I think you are saying that each node has an xs_singleton exchange, so when the publisher receives a file that was downloaded by another node, it sees it on the local disk, which normally means it has already been published. You want it to re-publish anyways?

I think the way this would be handled with current code would be shovels with duplicate suppression between the xs_singleton exchanges. But with the new behaviour, duplicate suppression will be broken: if the same file gets produced multiple times, it will get published multiple times. Usually we want duplicate suppression to suppress the messages, and not just the data.

I think the current behaviour described is the 99% correct case, so I don't see there being anything wrong. otoh, I understand the attraction of this unusual behaviour, and it could be enabled with a switch, I guess.
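A minimal sketch of the shovel approach mentioned above, assuming placeholder broker URLs and credentials; `nodupe_ttl` enables the duplicate-suppression cache:

```
# shovel sketch: relay messages between the nodes' xs_singleton
# exchanges while suppressing duplicates; hostnames are placeholders
broker amqps://feeder@ddsr-dev1/
exchange xs_singleton
post_broker amqps://feeder@ddsr-dev2/
post_exchange xs_singleton
nodupe_ttl 300
```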
Another option might be to make overwrite more complicated: yes, no, republish (or republish_only, or message_only)... where republish is like no but with your new wrinkle. I would definitely put a guard of some kind on that behaviour and document it well. Other ideas for a name or scheme for the option are welcome.
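For illustration, the proposed multi-valued option might read like this in a config; the republish value is purely hypothetical and does not exist in sr3:

```
# hypothetical value, not implemented: behave like 'no' (keep the
# existing file) but still post the message, per the wrinkle above
overwrite republish
```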
Not quite, but close. I could accomplish what I want with a shovel and a sarra. The data is currently spread across two nodes (ddsr-dev):

```mermaid
graph LR;
sarra_get_source1("sarra<br>get_source");
sarra_get_source2("sarra<br>get_source");
xs_source{"xs_SOURCE"};
xpublic{"xpublic"};
subgraph ddsr-dev1;
sarra_get_source1;
end
subgraph ddsr-dev2;
sarra_get_source2;
end
xs_source --> sarra_get_source1;
xs_source --> sarra_get_source2;
sarra_get_source1 --> xpublic;
sarra_get_source2 --> xpublic;
```
I would like to have a sarra process running on one node (e.g. ddsr-dev2) that subscribes to xpublic and posts to xs_singleton:
```mermaid
graph LR;
sarra_get_source1("sarra<br>get_source");
sarra_get_source2("sarra<br>get_source");
sender_to_dest("sender<br>to_dest");
sarra_copy_source_singleton("sarra<br>copy_source_to_singleton");
xs_source{"xs_SOURCE"};
xpublic{"xpublic"};
xs_singleton{"xs_singleton"};
subgraph ddsr-dev1;
sarra_get_source1;
end
subgraph ddsr-dev2;
sarra_get_source2;
sarra_copy_source_singleton
end
xs_source --> sarra_get_source1;
xs_source --> sarra_get_source2;
sarra_get_source1 --> xpublic;
sarra_get_source2 --> xpublic;
xpublic -- amqp --> sarra_copy_source_singleton;
sarra_copy_source_singleton -- post msgs for files downloaded from dev1<br>and 'shovel' from xpublic to xs_singleton for files already on dev2 --> xs_singleton;
xs_singleton --> sender_to_dest;
sarra_get_source1 -- copy files from <br> dev1 to dev2 --> sarra_copy_source_singleton;
sarra_get_source2 -- files already on dev2 --> sarra_copy_source_singleton;
```
This isn't a critical feature - I just needed to temporarily have a subset of DMS data on one dev node, and thought this would work by just adding an extra sarra (I didn't want to modify the existing configs). But it didn't work. I still think the overwrite behaviour is inconsistent with the documentation and v2. I'm going to do some tests to figure it out.
I confirmed there is an inconsistency between sr3's overwrite behaviour and the documentation/v2; v2 matches the documentation. The documentation says: …

(point 2 should also be updated, the …)

sr3, `overwrite false`: … but in this test, the checksum and size are different and the file is not overwritten.

sr3, `overwrite true`: …

v2, `overwrite false`: …

v2, `overwrite true`: …

summary: …
one solution is supporting all 4 possibilities:
Although I'm not sure if this is necessary. My desired use case is satisfied by the v2/documented behaviour, although it's inefficient to redownload an identical file. I'm leaning towards making sr3 behave exactly like v2 and not adding the extra complexity. I can't think of cases where we'd want …
I think I agree with Reid. I would expect that the …
The reason "overwrite" was created as an option is because there were lots of files where there was no way to tell which "version" of a file is better... I think the example is RADAR products, where the products with the same name from different processing clusters NEVER have the same checksum or date. In that case, the products are equivalent, and you want to take the first one, and overwriting it with a later version, because receiving files with the same name but different content just confuses people and creates redundant downloads. So is overwrite then trying to solve something better solved by nodupe methods like name_only regardless of how it is done, while upstream there maybe a product with different checksums sharing the same name, the idea is that downstream clients will see a product with a single checksum, and everything will behave normally. The point of the "overwrite no" option is that some files, once created, should be treated as immutable... because no reliable duplicate suppression is available. Yes, because the file metadata and checksum is useless, just treat the first one received as the definitive version. That's what sr3 does with overwrite no and it's as intended, though perhaps not documented properly. overwrite no means never overwrite a file once it has been received. It's for exceptional usage... yeah, given the current behaviour of sr3, I think the default should be overwrite yes When you receive a message, and it is clearly a repeat... I don't understand why you would want to republish. It feels like maybe "overwrite" isn't a helpful keyword for this complexity. perhaps it is fine as it is (though default should likely change to yes.) and we need something else...
In the above scheme... "default" would best be republish=download, overwrite=yes for sarra and subscribe. I'm not convinced of any of it. Still confused. I still think the current behaviour of overwrite in sr3 is correct, but the default should likely change to true/on/yes.

download_publish_even_if_file_is_already_here on|off, default off... but for your special case, we turn it on. Still strange... OK, maybe make this other option specifically about what to do with download results: download_publish successful (the default) publishes only successful downloads ("normal"), or make it a set: successful,too_old,too_new,identical, e.g. download_publish successful,identical. dunno,... unclear.
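To make the floated option concrete, a hypothetical config line using only the name and values mentioned in the comment above (nothing like this exists in sr3 today):

```
# hypothetical, not implemented: post messages both for successful
# downloads and for files found already present and identical
download_publish successful,identical
```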
Another thing that occurs to me is that the new case could be addressed succinctly if the original sarra on ddsr-dev2 did one of two things: …

With either method, the sarra would not need to publish the "missing" files.
What I'm thinking now:
I like taking the opportunity of removing options when possible. Now the question is how to deal with the special case, where the file is present and we want to publish. My guess is that that is just too weird. We should probably implement the more common/intuitive case of multiple post_brokers.
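A sketch of the multi-publish idea; whether repeating post_exchange (or post_broker) like this would be the actual syntax is exactly what such a feature would have to define, so treat this as hypothetical:

```
# hypothetical, not current sr3 syntax: publish each message to two
# destinations after the download
post_broker amqps://tsource@ddsr-dev2/
post_exchange xpublic
post_exchange xs_singleton
```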
If we really want to solve the weird edge case... I guess I would go with download_publish as a new option: …

It still feels wrong, like inserting an excessive complication to deal with a really niche case. I think multi-publishing is cleaner.
I just checked the source code, and the default for overwrite is already True in the code... it's just that the man page is wrong.
This does not fix the problem that started everything... but it does simplify things a bit. A second PR would be needed to enable the sarra get_source on node 2 to publish to xpublic AND xs_singleton, so that the second sarra doesn't need to publish messages about files already present.
Oh, I just saw Reid had it assigned... didn't realize... well, the PR wasn't much work... if it isn't wanted, we just won't merge it. No problem.
Replace `overwrite=no` with `nodupe_basis name_only`. `sr3 convert` does the replacement, but v3 configs will need manual modification.
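A before/after sketch of that migration; only the two option lines matter, the rest of any real config stays unchanged:

```
# v2-era config line being retired:
#   overwrite no
# sr3 replacement: suppress duplicates by filename, so the first
# version of a file that arrives is treated as the definitive one
nodupe_basis name_only
```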
On a cluster with multiple nodes, I want to have one node that contains all the files and posts them to a different exchange.
a config like this:
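A minimal sketch of the kind of sarra config this describes, assuming placeholder broker URLs, credentials, and paths:

```
# sketch only: hostnames, credentials, and paths are placeholders
broker amqps://anonymous@ddsr-dev1/
exchange xpublic
subtopic #
accept .*
directory /data/dms
post_broker amqps://tsource@ddsr-dev2/
post_exchange xs_singleton
```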
What I expect to happen:
I think it might be related to this code. In the second `if` statement, I think the messages should be added to `worklist.ok` to cause them to be posted, even if they don't need to be downloaded. (sarracenia/sarracenia/flow/__init__.py, lines 1953 to 1961 at 9c44166)
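A paraphrased sketch of the logic being referenced, not the verbatim source at that commit; everything except `file_should_be_downloaded` and `worklist.ok` is a simplified placeholder:

```python
# paraphrase of the download-selection logic in
# sarracenia/flow/__init__.py around lines 1953-1961; NOT verbatim
def select_and_download(self, worklist):
    for msg in list(worklist.incoming):
        if not self.file_should_be_downloaded(msg):
            # file already present and identical: in the current code the
            # message stops here and is never posted. The change proposed
            # in this issue is, roughly:
            #     worklist.ok.append(msg)
            #     continue
            # so the message is republished even though no download happens.
            continue
        if self.download(msg):          # placeholder for the real download
            worklist.ok.append(msg)     # only successful downloads get posted
```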
I need to do some testing to confirm, but I'm also worried about the consequences of changing this behaviour.