When we use a non-zero minimum in the cluster config for resources, the compute nodes come alive at cluster launch, so this job-related check will never evaluate to true:
```shell
if [[ $job_comment == *"Key=Monitoring,Value=ON"* ]]; then
```
Because this must be run in the root context, the only chance to attach it to a job is in the prolog script, so basically the plan would be to:
- install the Docker container in post-install anyway, but do not start it
- use the prolog and epilog to start and stop the container depending on the user's choice to monitor or not
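To make the plan concrete, here is a minimal sketch of the prolog/epilog pair. The container name `monitoring-agent` and the `monitoring_requested` predicate are placeholders, not actual 1click-hpc names; how to implement the predicate is exactly the open question below.

```shell
#!/bin/bash
# Sketch only: container name and the monitoring_requested predicate
# are placeholder assumptions, not the actual 1click-hpc names.
CONTAINER=monitoring-agent

# Prolog side: start the pre-installed (but stopped) container on demand.
prolog_action() {
    if monitoring_requested; then
        docker start "$CONTAINER" >/dev/null 2>&1 || true
    fi
}

# Epilog side: stopping unconditionally is safe; "docker stop" on an
# already-stopped container is a no-op.
epilog_action() {
    docker stop "$CONTAINER" >/dev/null 2>&1 || true
}
```

Whatever signalling mechanism we pick only has to supply `monitoring_requested`; the rest stays the same.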
The problem is how to send a signal about the job to the prolog and epilog, since custom user env variables are not passed, and neither is the job comment. Per the Slurm manuals we should not run `scontrol` from a prolog: like the API calls, it would impair job scaling (this is related to #34).
Looking at the variables available at prolog/epilog time, I only have two ideas so far:
- `SLURM_PRIO_PROCESS`: the scheduling priority (nice value) at the time of submission, available in SrunProlog, TaskProlog, SrunEpilog and TaskEpilog. We can set `#SBATCH --nice=0` (or some sensible value) to uniquely identify the intention, then use the TaskProlog and TaskEpilog to start/stop the monitoring container.
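For the first idea, the TaskProlog side could look like this sketch. The sentinel nice value (7 here) and the container name are arbitrary assumptions for illustration:

```shell
#!/bin/bash
# TaskProlog sketch for the --nice idea. The sentinel value 7 and the
# container name "monitoring-agent" are assumptions, not real names.
MONITOR_NICE=7

monitoring_requested() {
    # SLURM_PRIO_PROCESS carries the nice value set at submission time
    [[ "${SLURM_PRIO_PROCESS:-0}" -eq "$MONITOR_NICE" ]]
}

if monitoring_requested; then
    docker start monitoring-agent >/dev/null 2>&1 || true
fi
```

The user would then submit with `#SBATCH --nice=7`. One caveat: the nice value actually changes scheduling priority, so the sentinel has to be a value whose scheduling effect we can live with.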
- use a crafted Slurm job name like `[GM] my job name`, then pick it up and interpret it from `SLURM_JOB_NAME` (the name of the job, available in PrologSlurmctld, SrunProlog, TaskProlog, EpilogSlurmctld, SrunEpilog and TaskEpilog). This also means using the TaskProlog and TaskEpilog to start/stop the monitoring container.
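For the second idea, the TaskProlog check is just a prefix match on `SLURM_JOB_NAME`, with the `[GM] ` prefix proposed above (container name again an assumption):

```shell
#!/bin/bash
# TaskProlog sketch for the job-name idea: interpret a "[GM] " prefix
# in the job name as "start monitoring". Container name is an assumption.
monitoring_requested() {
    [[ "${SLURM_JOB_NAME:-}" == "[GM] "* ]]
}

if monitoring_requested; then
    docker start monitoring-agent >/dev/null 2>&1 || true
fi
```

Usage would be `sbatch --job-name '[GM] my job name' ...`; jobs without the prefix are untouched.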
(The check above is in 1click-hpc/modules/40.install.monitoring.compute.sh, line 59 at commit 7a833d4.)