Skip to content

overwrite/"copy files to one node" behaviour unexpected #1419

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
reidsunderland opened this issue Apr 9, 2025 · 12 comments
Open

overwrite/"copy files to one node" behaviour unexpected #1419

reidsunderland opened this issue Apr 9, 2025 · 12 comments
Assignees
Labels
Developer not a problem, more of a note to self for devs about work to do. Discussion_Needed developers should discuss this issue. enhancement New feature or request NewUseCase needed to address a use case, we can't yet support. Refactor change implementation of existing functionality. Sugar nice to have features... not super important. UserStory interesting to read to consider improving

Comments

@reidsunderland
Copy link
Member

On a cluster with multiple nodes, I want to have one node that contains all the files and post them to a different exchange.

a config like this:

vip 123

broker     amqp://feeder@shared_broker/
exchange   xpublic
queueName  q_${BROKER_USER}.${PROGRAM}.${CONFIG}.dev

subtopic *.MY-SOURCE.SOME.FILES.#

post_broker    amqp://feeder@shared_broker/
post_exchange  xs_singleton
post_baseUrl   http://${HOSTNAME}/
post_baseDir   /apps/sarra/public_data

acceptUnmatched False
mirror          True

# tried with and without overwrite true, doesn't seem to make a difference
overwrite True

directory ${PBD}
accept .*MY-SOURCE.SOME.FILES.*something.*

What I expect to happen:

  1. Files not on the node with the vip are downloaded and posted (this works)
  2. Files already on the node with the vip are not re-downloaded, but they should be posted (they're not posted)

I think it might be related to this code. In the second if statement, I think the messages should be added to worklist.ok to cause them to be posted, even if they don't need to be downloaded.

# FIXME: decision of whether to download, goes here.
if os.path.isfile(new_path):
if not self.o.overwrite:
self.reject(msg, 204,
"not overwriting existing file %s" % new_path)
continue
if not self.file_should_be_downloaded(msg):
continue

I need to do some testing to confirm, but I'm also worried about the consequences of changing this behaviour.

@reidsunderland reidsunderland added Developer not a problem, more of a note to self for devs about work to do. enhancement New feature or request NewUseCase needed to address a use case, we can't yet support. Refactor change implementation of existing functionality. Sugar nice to have features... not super important. UserStory interesting to read to consider improving labels Apr 9, 2025
@reidsunderland reidsunderland self-assigned this Apr 9, 2025
@reidsunderland
Copy link
Member Author

With overwrite set to true, changing the second if to

if not self.file_should_be_downloaded(msg): 
         worklist.ok.append(msg)
         continue 

makes it do what I want. The file is still not downloaded, because it's already there and identical to the file described by the message. But the message does get posted as if the file was just downloaded.


So my understanding of the current behaviour is:

  • Overwrite off:
    • if file exists, don't overwrite, don't post, even if the file on disk is different than the message
  • Overwrite on:
    • if file exists, download and post ONLY when the file on disk is different than the message

My modified behaviour is:

  • Overwrite off:
    • if file exists, don't overwrite, don't post, even if the file on disk is different than the message
  • Overwrite on:
    • if file exists, download ONLY if the file on disk is different than the message
    • message always gets posted
    • (Note to self: after_work still logs "downloaded ok" in this case, can I make it say "file already exists, not downloaded again"?)

I think both situations are inconsistent with the documentation:

https://metpx.github.io/sarracenia/Reference/sr3_options.7.html#overwrite-flag-default-off

The overwrite option,if set to false, avoid unnecessary downloads under these conditions :

1- the file to be downloaded is already on the user’s file system at the right place and

2- the checksum of the amqp message matched the one of the file.

The default is False.

@reidsunderland reidsunderland added the Discussion_Needed developers should discuss this issue. label Apr 10, 2025
@petersilva
Copy link
Contributor

The current behaviour is correct. It assumes that the existing file (not being overwritten) has already been posted. How does the original file version get there, and why didn't the process that downloaded it post it?

ok... I think you are saying that each node has an xs_singleton exchange, so when the publisher receives a file that was downloaded by another node... it sees it on the local disk, which normally means it has already been published. You want it to re-publish anyways?

I think the way this would be handled with current code would be shovels with duplicate suppression betwen the xs_singleton exchanges.

But with the new behaviour, duplicate suppression will be broken. if the same file gets produced multiple times, it will get published multiple times... usually we want duplicate suppression to suppress the messages, and not just the data. I think the current behaviour described is the 99% correct case, so I don't see there being anything wrong.

otoh, I understand the attraction of this unusual behaviour, and it could be enabled with a switch I guess.

overwrite no
overwrite_no_but_publish  yes

Another option might be to make the overwrite more complicated: yes, no, republish or republish_only or message_only ... where republish is like no but with your new wrinkle.

I would think you definitely put a guard of some kind on that behaviour and document it well. Other ideas for a name or scheme for the option are welcome.
Does that make sense?

@reidsunderland
Copy link
Member Author

ok... I think you are saying that each node has an xs_singleton exchange, so when the publisher receives a file that was downloaded by another node... it sees it on the local disk, which normally means it has already been published. You want it to re-publish anyways?

Not quite, but close. I could accomplish what I want with a shovel and a sarra.

The data is currently spread across two nodes (ddsr-dev):

graph LR;
  sarra_get_source1("sarra<br>get_source");
  sarra_get_source2("sarra<br>get_source");
  xs_source{"xs_SOURCE"};
  xpublic{"xpublic"};

  subgraph ddsr-dev1;
    sarra_get_source1;
  end

  subgraph ddsr-dev2;
    sarra_get_source2;
  end

  xs_source --> sarra_get_source1;
  xs_source --> sarra_get_source2;
  sarra_get_source1 --> xpublic;
  sarra_get_source2 --> xpublic;
Loading

I would like to have a sarra process running on one node (e.g. ddsr-dev2) that subscribes to xpublic and posts to xs_singleton:

  1. Copies the data from ddsr-dev1 to ddsr-dev2 (the sarra already does this without any code modification)
  2. For any data that originally ended up on ddsr-dev2, it doesn't need to be copied since it's already there, but I do want the message to get posted to xs_singleton (I could use a shovel for this part)
graph LR;
  sarra_get_source1("sarra<br>get_source");
  sarra_get_source2("sarra<br>get_source");
  sender_to_dest("sender<br>to_dest");
  sarra_copy_source_singleton("sarra<br>copy_source_to_singleton");
  xs_source{"xs_SOURCE"};
  xpublic{"xpublic"};
  xs_singleton{"xs_singleton"};

  subgraph ddsr-dev1;
    sarra_get_source1;
  end

  subgraph ddsr-dev2;
    sarra_get_source2;
   sarra_copy_source_singleton
  end

  xs_source --> sarra_get_source1;
  xs_source --> sarra_get_source2;
  sarra_get_source1 --> xpublic;
  sarra_get_source2 --> xpublic;

  xpublic -- amqp --> sarra_copy_source_singleton;
  sarra_copy_source_singleton -- post msgs for files downloaded from dev1<br>and 'shovel' from xpublic to xs_singleton for files already on dev2 --> xs_singleton;

  xs_singleton --> sender_to_dest;

  sarra_get_source1 -- copy files from <br> dev1 to dev2 --> sarra_copy_source_singleton;
  sarra_get_source2 -- files already on dev2 --> sarra_copy_source_singleton;
Loading

This isn't a critical feature - I just needed to temporarily have a subset of DMS data on one dev node and thought this would work with just adding an extra sarra (I didn't want to modify the existing configs). But it didn't work.


I still think the overwrite behaviour is inconsistent with the documentation and v2. I'm going to do some tests to figure it out.

@reidsunderland
Copy link
Member Author

I confirmed there is an inconsistency between sr3's overwrite behaviour and the documentation/v2. v2 matches the documentation.

The documentation says:

The overwrite option,if set to false, avoid unnecessary downloads under these conditions :

1- the file to be downloaded is already on the user’s file system at the right place and

2- the checksum of the amqp message matched the one of the file.

(point 2 should also be updated, the file_should_be_downloaded method checks the size, checksum, modification time)

sr3

overwrite false:

  • identical file re-published: not downloaded, not posted (good).

  • different file with same name but different size/checksum published: not downloaded, not posted (inconsistent with documentation)

    • this is because when overwrite is false, we only check if the file exists. The code does not call file_should_be_downloaded to check if the size/checksum/modification time/anything is different.
    • I would call this overwrite never, if we decide to make the overwrite option take multiple keywords
    • I don't think this is a good default.
# first time seeing the file:  downloaded and posted
2025-04-11 15:57:41,575 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 2.56 )  exchange: xs_test subtopic: tmp.watching a file with baseUrl: sftp://localhost/ relPath: tmp/watching/testfile id: nsXFaFJ size: 644342
2025-04-11 15:57:42,147 [INFO] sarracenia.flowcb.log after_work downloaded ok: /net/local/home/sunderlandr/watch_download/tmp/watching/testfile
2025-04-11 15:57:42,148 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_test_sarra_out topic: v02.post.watch_download.tmp.watching a file with baseUrl: sftp://localhost relPath: watch_download/tmp/watching/testfile size: 644342 id: nsXFaFJ

# identical file published again: not downloaded, not posted
2025-04-11 15:58:54,598 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 0.48 )  exchange: xs_test subtopic: tmp.watching a file with baseUrl: sftp://localhost/ relPath: tmp/watching/testfile id: nsXFaFJ size: 644342
2025-04-11 15:58:54,599 [WARNING] sarracenia setReport unknown report code supplied: 204:not overwriting existing file /net/local/home/sunderlandr/watch_download/tmp/watching/testfile

# different file with same name published: not downloaded, not posted (bad)
2025-04-11 16:01:26,182 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.93 )  exchange: xs_test subtopic: tmp.watching a file with baseUrl: sftp://localhost/ relPath: tmp/watching/testfile id: hL0mrUr size: 12714
2025-04-11 16:01:26,183 [WARNING] sarracenia setReport unknown report code supplied: 204:not overwriting existing file /net/local/home/sunderlandr/watch_download/tmp/watching/testfile

but in this test, the checksum and size are different and the file is not overwritten.

overwrite true:

  • identical file re-published: not downloaded, not posted.

  • different file with same name but different size/checksum published: is downloaded, is posted (good)

# file is downloaded for the first time, does not exist at the destination. Correctly downloaded and posted.
2025-04-11 16:09:26,598 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.81 )  exchange: xs_test subtopic: tmp.watching a file with baseUrl: sftp://localhost/ relPath: tmp/watching/testfile id: nsXFaFJ size: 644342
2025-04-11 16:09:27,276 [INFO] sarracenia.flowcb.log after_work downloaded ok: /net/local/home/sunderlandr/watch_download/tmp/watching/testfile
2025-04-11 16:09:27,277 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_test_sarra_out topic: v02.post.watch_download.tmp.watching a file with baseUrl: sftp://localhost relPath: watch_download/tmp/watching/testfile size: 644342 id: nsXFaFJ

# identical file is posted again by the watch, it gets rejected
2025-04-11 16:28:11,622 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 2.50 )  exchange: xs_test subtopic: tmp.watching 
2025-04-11 16:28:11,643 [INFO] sarracenia.flowcb.log after_work rejected: 406 mtime not newer /net/local/home/sunderlandr/watch_download/tmp/watching/testfile

# different file with same name is posted, overwritten and posted as expected
2025-04-11 16:33:17,406 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 2.28 )  exchange: xs_test subtopic: tmp.watching a file with baseUrl: sftp://localhost/ relPath: tmp/watching/testfile id: hL0mrUr size: 12714
2025-04-11 16:33:17,445 [INFO] sarracenia.flowcb.log after_work downloaded ok: /net/local/home/sunderlandr/watch_download/tmp/watching/testfile
2025-04-11 16:33:17,447 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_test_sarra_out topic: v02.post.watch_download.tmp.watching a file with baseUrl: sftp://localhost relPath: watch_download/tmp/watching/testfile size: 12714 id: hL0mrUr

v2

overwrite false

# first time seeing the file: downloaded and posted
2025-04-11 16:59:41,930 [INFO] file_log downloaded to: /net/local/home/sunderlandr/watch_download_v2/tmp/watching/testfile
2025-04-11 16:59:41,930 [INFO] post_log notice=20250411165935.0416693687 sftp://localhost watch_download_v2/tmp/watching/testfile headers={'from_cluster': 'localhost', 'source': 'bunnymaster', 'mode': '644', 'mtime': '20241018185038', 'atime': '20250411165925.563759565', 'contentType': 'application/octet-stream', 'parts': '1,644342,1,0,0', 'sum': 's,9ec5c568524769f40b17624b25351559393176ba7d50dc502c60af0f0fd76c6651b77c9a564f8ffb58385287cf70362ccf851afb9cd9f117d305ce46d578571b', 'to_clusters': 'localhost'}

# identical file published again: not downloaded, not posted

# different file published: downloaded and posted
2025-04-11 17:15:04,075 [INFO] file_log downloaded to: /net/local/home/sunderlandr/watch_download_v2/tmp/watching/testfile
2025-04-11 17:15:04,076 [INFO] post_log notice=20250411171458.693273544 sftp://localhost watch_download_v2/tmp/watching/testfile headers={'from_cluster': 'localhost', 'source': 'bunnymaster', 'mode': '644', 'mtime': '20250411171445.499850273', 'atime': '20250411171445.499850273', 'contentType': 'text/x-script.python', 'parts': '1,12714,1,0,0', 'sum': 's,84bd26ad4aead91bae87b1297344cdd4b553cd1fd35e4bbc0236a6baf6eb47ecbe71e6f1abb1598d283b2acb572dfe3a89c6fc6d37ed8f728282c258fa546274', 'to_clusters': 'localhost'}

overwrite true

# first time seeing the file:  downloaded and posted
2025-04-11 16:50:26,816 [INFO] file_log downloaded to: /net/local/home/sunderlandr/watch_download_v2/tmp/watching/testfile
2025-04-11 16:50:26,816 [INFO] post_log notice=20250411165021.520756483 sftp://localhost watch_download_v2/tmp/watching/testfile headers={'from_cluster': 'localhost', 'source': 'bunnymaster', 'mode': '644', 'mtime': '20241018185038', 'atime': '20250411165018.0135202408', 'contentType': 'application/octet-stream', 'parts': '1,644342,1,0,0', 'sum': 's,9ec5c568524769f40b17624b25351559393176ba7d50dc502c60af0f0fd76c6651b77c9a564f8ffb58385287cf70362ccf851afb9cd9f117d305ce46d578571b', 'to_clusters': 'localhost'}

# identical file published again:  downloaded and posted again
2025-04-11 16:52:57,356 [INFO] file_log downloaded to: /net/local/home/sunderlandr/watch_download_v2/tmp/watching/testfile
2025-04-11 16:52:57,356 [INFO] post_log notice=20250411165247.902385712 sftp://localhost watch_download_v2/tmp/watching/testfile headers={'from_cluster': 'localhost', 'source': 'bunnymaster', 'mode': '644', 'mtime': '20241018185038', 'atime': '20250411165239.930544138', 'contentType': 'application/octet-stream', 'parts': '1,644342,1,0,0', 'sum': 's,9ec5c568524769f40b17624b25351559393176ba7d50dc502c60af0f0fd76c6651b77c9a564f8ffb58385287cf70362ccf851afb9cd9f117d305ce46d578571b', 'to_clusters': 'localhost'}

# different file published: downloaded and posted
2025-04-11 16:54:58,216 [INFO] file_log downloaded to: /net/local/home/sunderlandr/watch_download_v2/tmp/watching/testfile
2025-04-11 16:54:58,216 [INFO] post_log notice=20250411165452.38927865 sftp://localhost watch_download_v2/tmp/watching/testfile headers={'from_cluster': 'localhost', 'source': 'bunnymaster', 'mode': '644', 'mtime': '20250411165441.849715471', 'atime': '20250411165441.849715471', 'contentType': 'text/x-script.python', 'parts': '1,12714,1,0,0', 'sum': 's,84bd26ad4aead91bae87b1297344cdd4b553cd1fd35e4bbc0236a6baf6eb47ecbe71e6f1abb1598d283b2acb572dfe3a89c6fc6d37ed8f728282c258fa546274', 'to_clusters': 'localhost'}

summary:

version overwrite file exists and is identical file exists and is different
v2 false no download, no post download and post
v2 true download and post download and post
sr3 false no download, no post no download, no post
sr3 true no download, no post download and post

@reidsunderland
Copy link
Member Author

one solution is supporting all 4 possibilities:

overwrite file exists and is identical file exists and is different same as?
never? no download, no post no download, no post current sr3 overwrite false
false/off no download, no post download and post current v2 overwrite false and sr3 overwrite true
true/on no download, do post download and post none, but what I want for my use case
always? force? download and post download and post current v2 overwrite true

Although I'm not sure if this is necessary. My desired use case is satisfied by the v2/documented behaviour, although it's inefficient to redownload an identical file.

I'm leaning towards making sr3 behave exactly like v2 and not adding the extra complexity. I can't think of cases where we'd want overwrite never, but maybe there's a good reason sr3 behaves the way it does and that would justify having the never option?

@andreleblanc11
Copy link
Member

I think I agree with Reid. I would expect that the overwrite would work as v2 did. I think the current behaviour of overwrite True and with it not downloading and not posting the file is bizarre. That's not what I would expect at all to happen when adding this option to a sarra component.

@petersilva
Copy link
Contributor

petersilva commented Apr 12, 2025

The reason "overwrite" was created as an option is because there were lots of files where there was no way to tell which "version" of a file is better... I think the example is RADAR products, where the products with the same name from different processing clusters NEVER have the same checksum or date. In that case, the products are equivalent, and you want to take the first one, and overwriting it with a later version, because receiving files with the same name but different content just confuses people and creates redundant downloads.

So is overwrite then trying to solve something better solved by nodupe methods like name_only
maybe overwrite isn't needed anymore at all because of name_only.

regardless of how it is done, while upstream there maybe a product with different checksums sharing the same name, the idea is that downstream clients will see a product with a single checksum, and everything will behave normally.

The point of the "overwrite no" option is that some files, once created, should be treated as immutable... because no reliable duplicate suppression is available. Yes, because the file metadata and checksum is useless, just treat the first one received as the definitive version. That's what sr3 does with overwrite no and it's as intended, though perhaps not documented properly. overwrite no means never overwrite a file once it has been received. It's for exceptional usage... yeah, given the current behaviour of sr3, I think the default should be overwrite yes

When you receive a message, and it is clearly a repeat... I don't understand why you would want to republish.

It feels like maybe "overwrite" isn't a helpful keyword for this complexity. perhaps it is fine as it is (though default should likely change to yes.) and we need something else...

republish overwrite file exists and is identical file exists and is different same as?
no/off no/off no download, no post no download, no post current sr3 overwrite false
download yes/on no download, no post download and post current v2 overwrite false and sr3 overwrite true
yes/on yes/on no download, do post download and post none, but what I want for my use case
yes/on/download force download and post download and post current v2 overwrite true
no/off force download and no post download and no post don't specify a post_broker?

In the above scheme... "default" would best be republish=download, overwrite=yes for sarra and subscribe,
and republish=yes for all others (the ones that don't download?)

I'm not convinced of any of it. Still confused.

I still think current behaviour of overwrite in sr3 is correct, but default should likely change to true/on/yes
but this new case feels really weird... like it is a 0.00001% one... maybe just make an option for that. make the name start with download_ to signify that it modifies download result processing:

download_publish_even_if_file_is_already_here on|off default off, but for your special case, we turn it on.

still strange... OK, maybe make this other option specifically about what do with download results.

download_publish successful (default) publishes only successful downloads ("normal")
download_publish all ( regardless of whether it was downloaded or not... the new behaviour wanted.)
download_publish none (same as not having a post_broker.)

or make it a set: successful,too_old,too_new,identical

download_publish successful,identical

dunno,... unclear.

@petersilva
Copy link
Contributor

Another thing that occurs to me is that the new case could be addresses succinctly if the original sarra on ddsr-dev2 did one of two options:

  • publish to a third exchange xs_dev2 and have a pair of shovels publish to xpublic and xs_singleton.
  • implement a feature to have sarra able to publish to both xpublic and xs_singleton.

with either method, the sarra would not need to publish the "missing" files.

@petersilva
Copy link
Contributor

petersilva commented Apr 14, 2025

What I'm thinking now:

  • overwrite off ... is really just nodupe_basis name_only should probably implement that in config conversions.
  • the overwrite on behaviour should be the default.... in fact, don't need a setting at all... the nodupe_basis name_only takes care of it.
  • so remove overwrite option entirely.

I like taking the opportunity of removing options when possible.

Now the question is how to deal with the special case, where the file is present and want to publish. My guess is that that is just too weird. We should probably implement the more common/intuitive case of multiple post_brokers.

@petersilva
Copy link
Contributor

If we really want to solve the weird edge case... I guess I would go with download_publish as a new option:

  • none: same as no post_broker being defined... never publish after download.
  • accepted: what happens in shovels/winnows... (or since download=off, ... skipped.)
  • present: all messages related to files present on the file system (the weird edge case where file exists and receive message about it.
  • downloaded: what happens in sarra & subscribe today... only publish files that are downloaded.

It still feels wrong, like inserting an excessive complication to deal with a really niche case. I think multi-publishing is cleaner.

@petersilva
Copy link
Contributor

I just checked the source code, and the default for overwrite is already True in the code... It's just the man page is wrong.
Anyways, this option isn't needed anymore... so here is a PR:

  • removes the overwrite option (going forward the code acts as if it is always true, which was already the default.)
  • in sr3 convert ... rewrite overwrite false as nodupe_basis name_only
  • removes documentation of overwrite

This does not fix the problem that started everything... but it does simplify things a bit.

A second PR would be needed to allow enable the sarra get_source on node 2 to publish to xpublic AND xs_singleton... so that the second sarra doesn't need to publish messages about files already present.

@petersilva
Copy link
Contributor

Oh, I just saw Reid had it assigned... didn't realize... well the PR wasn't much work... if it isn't wanted we just won't merge it. no problem.

petersilva added a commit that referenced this issue Apr 17, 2025
replace overwrite=no with nodupe_basis name_only
sr3 convert does the replacement but v3 configs will need manual
modification.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Developer not a problem, more of a note to self for devs about work to do. Discussion_Needed developers should discuss this issue. enhancement New feature or request NewUseCase needed to address a use case, we can't yet support. Refactor change implementation of existing functionality. Sugar nice to have features... not super important. UserStory interesting to read to consider improving
Projects
None yet
Development

No branches or pull requests

3 participants