Commit

Add telegraf script to monitor condor queue jobs
including the compute resources requested by each job and its queue and start dates
sanjaysrikakulam committed Oct 31, 2023
1 parent 4517c21 commit 94cf30c
Showing 3 changed files with 38 additions and 0 deletions.
7 changes: 7 additions & 0 deletions group_vars/maintenance.yml
@@ -254,6 +254,13 @@ telegraf_plugins_extra:
      - timeout = "10s"
      - data_format = "influx"
      - interval = "1m"
  monitor_condor_queue_jobs:
    plugin: "exec"
    config:
      - commands = ["sudo /usr/bin/monitor-condor-queue-jobs"]
      - timeout = "10s"
      - data_format = "influx"
      - interval = "1m"
  postgres_extra:
    plugin: "exec"
    config:
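
The `timeout = "10s"` setting above means the command has to finish within ten seconds on every one-minute run. A minimal sketch of how that could be checked by hand is shown here; it is not part of the commit, the function name `fits_in_timeout` is hypothetical, and it assumes the GNU coreutils `timeout` utility is installed.

```shell
#!/bin/bash
# Hypothetical helper (not part of the commit): run a command under a time
# limit and report whether it finished before the limit was reached.
# Assumes GNU coreutils `timeout` is available.
fits_in_timeout() {
    local limit="$1"; shift
    timeout "$limit" "$@" > /dev/null 2>&1
    local rc=$?
    # `timeout` exits with status 124 when the time limit is hit.
    if [ "$rc" -eq 124 ]; then
        echo "timed out"
        return 1
    fi
    # Any other status means the command ended on its own within the limit,
    # whether or not it succeeded.
    echo "finished"
    return 0
}
```

A call such as `fits_in_timeout 10s sudo /usr/bin/monitor-condor-queue-jobs` on the monitored host would then show whether the script is likely to fit inside the configured window.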
15 changes: 15 additions & 0 deletions roles/hxr.monitor-cluster/files/cluster_queue-condor-jobs.sh
@@ -0,0 +1,15 @@
#!/bin/bash

# This script monitors the status of condor jobs in the cluster, including the requested compute resources, the time each job was submitted to the queue, the job start time, the job description, etc.
condor_q -global -autoformat ClusterId JobStatus Cmd RemoteHost RequestCpus RequestMemory QDate JobStartDate JobDescription | awk '{
    if ($8 != "undefined") $8 = strftime("%Y-%m-%d %H:%M:%S", $8);
    status["0"]="Unexpanded"; status["1"]="Idle"; status["2"]="Running"; status["3"]="Removed"; status["4"]="Completed"; status["5"]="Held"; status["6"]="Submission_err";
    jobdesc = $9;
    for (i = 10; i <= NF; i++) {
        jobdesc = jobdesc "_" $i;
    }
    printf "condor_queued_jobs_status,clusterid=\"%s\" jobstatus=\"%s\",cmd=\"%s\",remotehost=\"%s\",requestcpus=%s,requestmemory=%s,qdate=\"%s\",jobstartdate=\"%s\",jobdescription=\"%s\"\n", $1, status[$2], $3, $4, $5, $6, strftime("%Y-%m-%d %H:%M:%S", $7), $8, jobdesc
}'
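
The status lookup and the joining of the `JobDescription` words can be tried without a running HTCondor pool. The sketch below is only an illustration: it reuses the same awk logic on a row read from stdin, the function name `summarize_job` and the sample row are made up, and the date conversion is left out so that it runs on any awk.

```shell
#!/bin/bash
# Sketch only: reproduces the status lookup and description joining from the
# script above on input read from stdin, so it can be tried without condor_q.
# The function name and the sample row are made up for illustration.
summarize_job() {
    awk '
    BEGIN {
        status["0"]="Unexpanded"; status["1"]="Idle"; status["2"]="Running";
        status["3"]="Removed"; status["4"]="Completed"; status["5"]="Held";
        status["6"]="Submission_err";
    }
    {
        # Fields 9..NF are the words of JobDescription; join them with "_".
        jobdesc = $9;
        for (i = 10; i <= NF; i++) {
            jobdesc = jobdesc "_" $i;
        }
        printf "%s %s %s\n", $1, status[$2], jobdesc
    }'
}

# Made-up sample row in the column order passed to -autoformat.
echo '42 2 /bin/run.sh slot1@node1 4 8192 1698740000 1698740100 bwa mem on data 7' | summarize_job
# prints: 42 Running bwa_mem_on_data_7
```

Because awk splits on whitespace, a description with several words arrives as several fields, which is why the original script loops from field 10 to `NF`.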
16 changes: 16 additions & 0 deletions roles/hxr.monitor-cluster/tasks/condor.yml
@@ -46,3 +46,19 @@
    insertafter: EOF
    line: 'telegraf ALL=(ALL) NOPASSWD: /usr/bin/monitor-condor-queue'
    validate: 'visudo -cf %s'

- name: "Add condor queue jobs status script"
  copy:
    src: "cluster_queue-condor-jobs.sh"
    dest: "/usr/bin/monitor-condor-queue-jobs"
    owner: root
    group: root
    mode: 0755

- name: Allow telegraf to run monitor-condor-queue-jobs
  lineinfile:
    path: /etc/sudoers
    state: present
    insertafter: EOF
    line: 'telegraf ALL=(ALL) NOPASSWD: /usr/bin/monitor-condor-queue-jobs'
    validate: 'visudo -cf %s'
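
With `state: present` and no `regexp`, the `lineinfile` task adds the rule only if that exact line is missing, so the new rule can sit next to the similar `monitor-condor-queue` rule. A rough shell equivalent is sketched below under stated assumptions: the function name `ensure_line` is hypothetical, and it is meant for a scratch file, not for `/etc/sudoers`, which should only be changed with `visudo` validation as in the task above.

```shell
#!/bin/bash
# Sketch only: mimics the "add the line unless it is already there" behaviour
# of lineinfile with state: present, on an arbitrary file. The function name
# is hypothetical. Do not point this at /etc/sudoers.
ensure_line() {
    local file="$1" line="$2"
    # -F: fixed string, -x: match the whole line, -q: no output.
    if grep -Fxq -- "$line" "$file"; then
        echo "unchanged"
    else
        printf '%s\n' "$line" >> "$file"
        echo "changed"
    fi
}
```

The whole-line match is what keeps the older rule, which is a prefix of the new one, from being mistaken for it. On the target host, `sudo -l -U telegraf` lists the commands the `telegraf` user is allowed to run.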