Job arrays support for HTCondor #5960

Open
vivekvenkris opened this issue Apr 11, 2025 · 3 comments
Comments

@vivekvenkris

New feature

Hello,

I would like to ask whether the new "Job Array" functionality could also be extended to HTCondor.

The main problem we currently face is that the Condor scheduler daemon gets overwhelmed if we submit more than 10,000 jobs across the different Nextflow pipelines running on our cluster. If these could be submitted as job arrays, the load on the scheduler would be greatly reduced.

The implementation (from the user-experience side) could be the same as for other schedulers such as SLURM: an array <X> setting would be supplied as an additional parameter to the executor, and Nextflow would then launch one job, as an array, for every <X> process instances.
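As a rough illustration of the batching described above (not Nextflow's actual implementation), the following Python sketch groups task identifiers into arrays of at most X tasks, so that each group would correspond to a single scheduler submission. The function name chunk_tasks, the task IDs, and the array size of 100 are hypothetical and chosen only for the example.

```python
from typing import Iterator, List, Sequence


def chunk_tasks(task_ids: Sequence[int], array_size: int) -> Iterator[List[int]]:
    """Yield consecutive groups of at most `array_size` task IDs."""
    if array_size < 1:
        raise ValueError("array_size must be at least 1")
    for start in range(0, len(task_ids), array_size):
        yield list(task_ids[start:start + array_size])


# Hypothetical workload: 10,000 process instances, array size X = 100.
tasks = range(10_000)
batches = list(chunk_tasks(tasks, array_size=100))

# Each batch would be one submission rather than 100 individual ones.
print(len(batches))
```

With these assumed numbers the scheduler would see one submission per batch rather than one per process instance, which is the reduction in load the request is after.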

Thanks,
Vivek

@vivekvenkris vivekvenkris changed the title Job arrays support for HTcondor Job arrays support for HTCondor Apr 11, 2025
@bentsherman
Member

cc @JosephLalli

@JosephLalli

@vivekvenkris A few questions, since I've been working on the side to improve Nextflow's support for HTCondor:

Are you at UWisc by chance? I'm most familiar with UWisc's CHTC workspace, but if you are at another institution I can do my best to provide assistance/advice.

Could you flesh out your reasoning behind job arrays for NXF-HTCondor? I have found that the "executor.submitRateLimit" and (perhaps more importantly) "executor.queueSize" options limit the number of jobs that are simultaneously submitted to the HTCondor scheduler, and ensure that new jobs are submitted at a rate the scheduler node can handle.

I am curious to hear more about how you have configured your environment. The lack of a shared POSIX filesystem has hampered my ability to use HTCondor w/ Nextflow in the past, and I have had to work to update Nextflow's Condor implementation to fully support the use of Seqera's Fusion to simulate a shared POSIX filesystem. It would be great if you have a simpler solution.

PS - While it's not fit for public consumption quite yet, my Nextflow-condor branch is available here: https://github.com/JosephLalli/nextflow

@vivekvenkris
Author

Hi @JosephLalli

No, I am not at UWisc.

Yes, we tried submitRateLimit, but to keep the scheduler load manageable we had to throttle submissions to almost one per minute, which was unacceptable.

We do have a distributed file system that is available across all the compute nodes and the head node. This is also where the Nextflow executable, the project, and the work directories are stored. For now, other than the stress on the scheduler, this seems to run quite seamlessly. We tried submitting the same amount of load outside Nextflow via job arrays, and that reduced the load on the scheduler by 80%.
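To put that throttling rate in perspective, here is a small back-of-the-envelope calculation in Python. It assumes exactly one submission per minute and the 10,000-job figure mentioned earlier in the thread, all submitted through a single throttled queue; the function name submission_hours is hypothetical, and real behaviour would also depend on queue size and job run times.

```python
def submission_hours(n_jobs: int, jobs_per_minute: float) -> float:
    """Hours needed just to submit n_jobs at a fixed submission rate."""
    return n_jobs / jobs_per_minute / 60.0


# Assumed figures: 10,000 jobs, throttled to one submission per minute.
hours = submission_hours(10_000, jobs_per_minute=1.0)
print(f"{hours:.1f} hours (~{hours / 24:.1f} days)")
```

Under these assumptions, submission alone would take roughly a week, before accounting for any time the jobs spend running.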
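The thread does not show how the jobs were submitted outside Nextflow. One plausible reading, sketched below as an assumption, is a single HTCondor submit description that queues many jobs at once with a "queue N" statement, with each job distinguished by the $(Process) macro. The Python helper array_submit_description, the script name run_task.sh, and the logs directory are all hypothetical.

```python
def array_submit_description(executable: str, n_jobs: int, log_dir: str = "logs") -> str:
    """Build an HTCondor submit description that queues n_jobs in one cluster."""
    lines = [
        f"executable = {executable}",
        # $(Process) expands to 0..n_jobs-1, giving each job its own index.
        "arguments  = $(Process)",
        f"output     = {log_dir}/job.$(Cluster).$(Process).out",
        f"error      = {log_dir}/job.$(Cluster).$(Process).err",
        f"log        = {log_dir}/job.$(Cluster).log",
        # A single queue statement submits all jobs in one request.
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: 100 jobs sharing one submit description.
description = array_submit_description("run_task.sh", 100)
print(description)
```

The resulting text could be written to a file and passed to condor_submit, so the scheduler handles one submission for the whole group instead of one per job.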

I will have a look at your condor branch, thanks!

Is job array support something you envision developing in the near future?
