-
I have sent out feelers to a couple of people who work in cloud/HPC environments to see if they have thoughts on this. So far my impression is that nobody uses the monstrosities in our SLURM/ folder; they basically seem like examples that were used in-house. 🤞
-
Our HPC cluster is run with SLURM. Since the nodes have 128 cores each, we haven't run into any shortcomings, whether running multiple jobs on a node at the same time or a single job consuming up to the maximum number of cores. I have tried to run Caiman across more than one node, though, and it always failed. For clusters with smaller nodes, or for larger datasets, it might nevertheless be useful to have that option. Maybe it would work to decouple Caiman from SLURM, using SLURM only to make the reservation and provide a hostfile, and claiming whole nodes so as not to interfere with other jobs.
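A minimal sketch of what "SLURM only does the reservation and provides a hostfile" could look like, run from inside the job allocation. The `scontrol show hostnames` command and the `SLURM_JOB_NODELIST` / `SLURM_CPUS_ON_NODE` variables come from SLURM itself; the function names, the `hostfile.txt` path, and the Open MPI style `host slots=N` format are assumptions for illustration only, and nothing here is something Caiman reads today.

```python
import os
import subprocess


def slurm_hostnames(nodelist):
    """Expand a compact SLURM node list (e.g. 'node[01-03]') into hostnames."""
    # `scontrol show hostnames <nodelist>` prints one hostname per line.
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def format_hostfile(hostnames, slots_per_node):
    """Render 'host slots=N' lines (Open MPI style; the format is an assumption)."""
    return "".join(f"{host} slots={slots_per_node}\n" for host in hostnames)


if __name__ == "__main__":
    # Both variables are set by SLURM inside a job allocation.
    nodelist = os.environ["SLURM_JOB_NODELIST"]
    slots = int(os.environ["SLURM_CPUS_ON_NODE"])
    # Hypothetical output path; whatever launches the workers would read it.
    with open("hostfile.txt", "w") as fh:
        fh.write(format_hostfile(slurm_hostnames(nodelist), slots))
```

Claiming whole nodes would be handled on the SLURM side, for example by submitting the job with `--exclusive`, so the script itself only has to describe the nodes it was given.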
-
Only seeing this discussion now, but just in case, we use SLURM and I confirm that we do not use the SLURM folder integration :)
-
Hello,
I'm currently working on cleaning up the clustering code in Caiman, and I'd like to hear from anyone using SLURM with Caiman about how you're using it.
Currently, the main way we expect people to use SLURM with Caiman is to allocate a powerful CPU (or GPU) node and then run one of the CLI demos on it, probably modified to load your data with your particular parameters, and probably using the "multiprocessing" backend.
There is a possibility that a few of you are doing something different: there's a way to run Caiman with SLURM integration, where Caiman uses ipyparallel.Client() in a way that understands SLURM and computes over multiple nodes, but it is currently broken in the codebase (and probably would not perform well even if it were functional).
I am thinking about removing that explicit SLURM integration, leaving people with only the possibility above. The code for it is hard to maintain, it was never documented, and setting it up would be pretty complex. Anyone using it would have needed to modify the sources and rebuild Caiman themselves, without much help. I wanted to reach out to anyone using SLURM, in case my understanding is incorrect.
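To make the single-node case described above concrete, here is a rough sketch of sizing a multiprocessing pool from the SLURM allocation instead of from the whole machine. It uses only the Python standard library and does not call any Caiman API; `slurm_worker_count` and `square` are hypothetical names for illustration, while `SLURM_CPUS_PER_TASK` and `SLURM_CPUS_ON_NODE` are environment variables that SLURM sets inside a job.

```python
import os
from multiprocessing import Pool


def slurm_worker_count(env=None):
    """Return how many worker processes to start inside a SLURM allocation."""
    env = os.environ if env is None else env
    # SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested,
    # so fall back to SLURM_CPUS_ON_NODE, then to the local CPU count.
    for var in ("SLURM_CPUS_PER_TASK", "SLURM_CPUS_ON_NODE"):
        value = env.get(var, "")
        if value.isdigit() and int(value) > 0:
            return int(value)
    return os.cpu_count() or 1


def square(x):
    """Stand-in for real per-chunk work."""
    return x * x


if __name__ == "__main__":
    n_workers = slurm_worker_count()
    with Pool(processes=n_workers) as pool:
        print(pool.map(square, range(8)))
```

The point of reading the environment is that, on a shared node, the CPU count of the machine can be larger than what the job was actually granted, so the allocation is the safer number to size the pool from.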