Use cloudpathlib for the storage API? #1416
Replies: 7 comments
-
Hi @TomNicholas, nice catch! Just note that this API has always been treated as a research prototype due to the ease of use provided by Lithops. In Lithops, we've always pursued simplicity in cloud usage. The main API has consistently been the Storage API, but we realized that users who want to use Lithops can transition more easily if they get an API they already know; that's why we implemented it. You may have noticed that we also implemented the multiprocessing.Process API. Other projects like Ray or Dask have also implemented similar functionality, but this doesn't mean we cannot implement our own. There are likely tens of frameworks that do similar things, but with Lithops it's quite easy to implement, and it allows users to change just one line of code without learning a new API. When we began developing the cloud proxy internally in 2019, the framework you mentioned didn't exist. Our intention was simply to provide more APIs familiar to users for adoption. I agree the API you mentioned looks good, but in any case, there is probably more than one API attempting to achieve the same goal. We developed it, but our focus has always been on the main Storage API and, of course, the Compute API.
-
Thanks for the quick response, @JosepSampe!
That's totally reasonable, but I'm not clear from your answer whether you would be for or against using
These are the reasons I like The other reason it's nice is that the two projects seem to care about exactly the same scope: abstracting away details of different cloud providers by providing a common interface, but not trying to extend that interface to work in non-cloud contexts. It seems to me that what lithops aims to do for cloud serverless APIs
-
Thanks to @TomNicholas for voicing the question and @JosepSampe for the fast engagement! The only thing I think I'd add, as a relative newcomer to Lithops, is that the documentation placing a (roughly) equal emphasis on the Storage API alongside the Compute API has been hard for me to understand. Perhaps I misunderstand the focus of the project, but from my naive newcomer perspective, it feels to me that in 2024 at least, the truly novel and unique contribution of Lithops is the Compute API, which other packages such as Dask do not replicate (even if they try to solve similar user problems, they do it in different ways, with different tradeoffs). I am not aware of another fully OSS Python project that offers a seamless abstraction over both local multiprocessing and cloud-agnostic serverless parallel data processing. This uniqueness and the elegance of the Compute API implementation is what led me to recommend we adopt Lithops as the core framework for a contract I am currently working on. By contrast, to me it has seemed that the Storage API mostly exists as an enabler of the uniqueness of the Compute API... (am I misunderstanding, for example, that it's used to facilitate storage monitoring?)... Have I misunderstood the relationship of these components?
-
After developing several workflows with Lithops, I think it would be a great addition. In my opinion, one of the most painful things with Lithops is the Storage layer, not because of the API, but because you need to work with files after downloading them to a remote worker, and that means using the OS storage API. One thing I've talked about with @danielBCN was the ability to have some sort of abstraction or object that you can instantiate and pass around; this object would be a lazy/transparent representation of some path in object storage. At some point I also tried to create an adapter between cloudpathlib and Lithops: https://github.com/abourramouss/cloudpathlib-lithops-adapter
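To illustrate the "lazy/transparent representation" idea, here is a minimal sketch. Everything in it is hypothetical: `InMemoryBucket` is a stand-in for a real object store and `LazyObjectPath` is an invented name, not part of Lithops or cloudpathlib. The point is only that path manipulation costs nothing and the download happens on first read.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


class InMemoryBucket:
    """Hypothetical stand-in for an object store; counts downloads."""

    def __init__(self, objects: Dict[str, bytes]):
        self._objects = dict(objects)
        self.downloads = 0

    def get(self, key: str) -> bytes:
        self.downloads += 1
        return self._objects[key]


class LazyObjectPath:
    """Hypothetical path-like handle: nothing is fetched until data is read."""

    def __init__(self, bucket: InMemoryBucket, key: str):
        self._bucket = bucket
        self._key = PurePosixPath(key)
        self._cache: Optional[bytes] = None

    def __truediv__(self, other: str) -> "LazyObjectPath":
        # Joining paths is pure string manipulation, so no download happens.
        return LazyObjectPath(self._bucket, str(self._key / other))

    @property
    def name(self) -> str:
        return self._key.name

    def read_bytes(self) -> bytes:
        # Fetch on first access, then serve from the local cache.
        if self._cache is None:
            self._cache = self._bucket.get(str(self._key))
        return self._cache

    def read_text(self, encoding: str = "utf-8") -> str:
        return self.read_bytes().decode(encoding)


# Made-up object key and contents, for illustration only.
bucket = InMemoryBucket({"data/part-0.csv": b"a,b\n1,2\n"})
path = LazyObjectPath(bucket, "data") / "part-0.csv"
header = path.read_text().splitlines()[0]
path.read_text()  # the second read is served from the cache
```

A worker function could then accept such an object and call `read_text()` without caring where the bytes live; a real implementation would also need writes, cache invalidation, and serialization across workers, which this sketch leaves out.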
-
@abourramouss @cisaacstern @TomNicholas I understand there might be different APIs for the Storage layer. We implemented our own in Lithops; you suggest another way. What is the rationale for your suggestion? Do you have a specific use case where the existing API doesn't work? Or is it just a matter of convenience? In any case, even if we implement another Storage API, all changes should be backward compatible, so Lithops would need to have 2 Storage APIs... We can't just replace the existing one with a new one and that's it.
-
Hi, IMO this is quite straightforward. First, it is true that the storage API was implemented when smart_open or cloudpathlib did not exist yet, and now it would be costly to refactor and get equivalent functionality. But if someone needs some cloudpathlib functionality that the Lithops storage does not implement (or simply wants it because it's more robust or efficient, for instance), it can be installed in the runtime and used in the code without any issue. Both can coexist and be used when needed for each specific use case.
-
My rationale is (a) convenience for the user (
Of course there is always a trade-off with any refactoring. But if we felt that this idea would really save effort in the long term, it's not impossible to make a switch - you just have a (long) deprecation cycle. i.e:
Totally up to you if you (or others) think this is worthwhile, but it can be done without keeping two APIs around indefinitely. EDIT: Another thought: Arguably the best time to make this kind of change is earlier on in a project's life span, when you have fewer users who will be impacted.
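For what a deprecation cycle can look like in code, here is a rough sketch using only the standard `warnings` module. The class names `NewStoragePath` and `OldFileProxy` are hypothetical placeholders, not real Lithops or cloudpathlib classes: the old name keeps working for a while but warns on every use, and it is removed once users have had time to migrate.

```python
import warnings


class NewStoragePath:
    """Hypothetical replacement class (stands in for the new API)."""

    def __init__(self, uri: str):
        self.uri = uri


class OldFileProxy(NewStoragePath):
    """Hypothetical legacy class kept alive during the deprecation window."""

    def __init__(self, uri: str):
        # stacklevel=2 points the warning at the caller, not at this shim.
        warnings.warn(
            "OldFileProxy is deprecated; use NewStoragePath instead",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(uri)


# Existing user code keeps working, but the warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = OldFileProxy("s3://bucket/key")  # made-up URI

categories = [w.category for w in caught]
```

Because the old class subclasses the new one here, code that only relies on the shared behaviour does not have to change during the window. Whether the two real APIs are close enough for that is an open question.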
-
@cisaacstern and I were wondering what the rationale was for the lithops project writing its own implementation of the lithops cloud proxy storage API.
There are other libraries that deal with this problem already - why not use one of them?
In particular, one I like is cloudpathlib, which provides classes that deliberately follow the same interface as the Python pathlib standard library module, but for different cloud storage providers. It seems to me that your lithops.storage.CloudFileProxy class could possibly just be replaced with cloudpathlib.CloudPath?
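To make "the same interface as pathlib" concrete, here is a small sketch that uses only the standard library. As far as I know, cloudpathlib's CloudPath supports the same operations (the `/` operator, `.name`, `.suffix`, `.parent`, `.with_suffix()`) on URIs such as `s3://bucket/key`, but treat that as an assumption and check its docs. The paths below are made up.

```python
from pathlib import PurePosixPath

# Made-up key prefix, for illustration only.
base = PurePosixPath("results/2024")

# Build a path with the "/" operator instead of string concatenation.
target = base / "run-01" / "output.csv"

name = target.name
suffix = target.suffix
parent = str(target.parent)
renamed = str(target.with_suffix(".parquet"))
```

The appeal is that users who already know pathlib would not need to learn a separate vocabulary for cloud objects.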