Data provided by enabling globus audit is already used in Grid Operations Center.
But having this data locally gives us the ability to perform various queries to analyse performance of the cluster, create usage reports and identify potential problems. Of course, it would be even better to have this functionality in status.arcs directly...
original auditquery script deletes data form gram_audit_table after sending it to status.arcs
Just retaining this data is not good because the size of data sent is only going to increase. Therefore we create backup_gram_audit_table and modify auditquery to move data into it before deleting:
CREATE TABLE
backup_gram_audit_table
(job_grid_id
text NOT NULL,local_job_id
text,subject_name
text NOT NULL,username
varchar(16) NOT NULL default ,
idempotence_id
varchar(128) default NULL,
creation_time
varchar(40) NOT NULL default,
queued_time
varchar(40) default NULL,stage_in_grid_id
text,stage_out_grid_id
text,clean_up_grid_id
text,globus_toolkit_version
varchar(16) NOT NULL default ,
resource_manager_type
varchar(16) NOT NULL default,
job_description
text NOT NULL,success_flag
varchar(5) NOT NULL default ,
finished_flag
varchar(5) NOT NULL default,
PRIMARY KEY (
job_grid_id
(256)), KEYidxLocalJobId
(local_job_id
(50)) )
example auditquery (because mysql audit database is also password-protected, I removed read permission from this script).
#!/bin/sh
# auditquery Queries the VDT Globus-WS Audit database and emails
# Job-Id/DN records to a monitoring address. Records
# are also logged via syslog, then dropped from the
# database. Intended for invocation at hourly intervals.
# Graham Jenkins <[email protected]> Last changed: 20061220
#
# Specify PATH, initiallise scratch-file
PATH=/usr/bin:/bin:/usr/sbin:/usr/sbin:/sbin:$PATH
. /etc/profile >/dev/null 2>&1
File=/tmp/`basename $0`.$$
trap 'rm -f $File; exit 0' 0 1 2 3 14 15
echo "use auditDatabase;" >$File
#
# Read and log records, create delete statements and send email
mysql -u grid-mysql --port 3306 -h mysql-bg.ceres.auckland.ac.nz -p***** <<EOF 2>/dev/null | sed -n '2,$p' |
use ng2_auditDatabase;
select concat_ws(' ',local_job_id, subject_name)
from gram_audit_table;
EOF
while read Line; do
logger -t Job-DN "$Line"
echo "Job-DN: $Line"
JobId="`echo $Line | awk '{print $1}'`"
echo "insert into backup_gram_audit_table select * from gram_audit_table where local_job_id = '$JobId';" >> $File
echo "delete from gram_audit_table " >>$File
echo "where local_job_id = '$JobId';">>$File
#done | /bin/mail -s "`hostname` JobID `date +%Y%m%d`" [email protected] >/dev/null 2>&1
done | /bin/mail -c [email protected] -s "`hostname` JobID `date +%Y%m%d`" [email protected] >/dev/null 2>&1
#
# Perform deletions
( cat $File
echo "delete from gram_audit_table "
echo "where local_job_id is NULL;" ) | mysql -u grid-mysql --port 3306 -h mysql-bg.ceres.auckland.ac.nz -p***** ng2_auditDatabase >/dev/null 2>&1
exit 0
This section is specific to torque resource manager. Resources with other RMS will need different scripts.
We need some extra tables:
CREATE TABLE
pbs
(local_job_id
text NOT NULL,queued_time
varchar(40) NOT NULL default ,
start_time
varchar(40) NOT NULL default,
end_time
varchar(40) NOT NULL default '',cpu_count
int(11) default NULL,queue
text, PRIMARY KEY (local_job_id
(256)), KEYidxLocalJobId
(local_job_id
(50)) );
CREATE TABLE exit_status
(
local_job_id
varchar(256) NOT NULL,exit_status
int(11) NOT NULL, PRIMARY KEY (local_job_id
) );
This is ftp.bestgrid.org/pub/torque2mysql.py sample python script to convert torque logs into SQL. It can be used as
cat /opt/torque/server_priv/accounting/|python torque2mysql.py |mysql -h mysql-bg.ceres.auckland.ac.nz --port=3306 -u grid-mysql -p*** ng2_auditDatabase
to populate our tables. Duplicate entries are ignored so it is ok to run with old data. A bit slow with all logs, so may be worthwhile to run only with latest logs.
select subject_name, count⭐ as count, sum((unix_timestamp(str_to_date(pbs.end_time,"%a %b %d %H:%i:%s %Y"))- unix_timestamp(str_to_date(pbs.start_time,"%a %b %d %H:%i:%s %Y")))* pbs.cpu_count) / 60.0 / 60.0 as dif from backup_gram_audit_table,pbs where pbs.local_job_id = backup_gram_audit_table.local_job_id and backup_gram_audit_table.subject_name LIKE '%' and
str_to_date(pbs.end_time,"%a %b %d %H:%i:%s %Y") > subdate(now(), interval 7 day) group by subject_name order by dif;
subject_name | count | dif |
---|---|---|
/C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Yuriy Halytskyy | 10 | 0.00000000 |
/C=AU/O=APACGrid/OU=VPAC/CN=Markus Binsteiner | 54 | 0.00027778 |
/DC=au/DC=org/DC=arcs/DC=slcs/O=ARCS IdP/CN=Antoine Fouquet cpTxacQy-chYzfVDjjuAEWLyriM | 1 | 65.70916667 |
/C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Peter Shao-Wei Tsai | 10 | 114.40944444 |
/C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Jack Flanagan | 11 | 152.33694444 |
/C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Peter Wills | 2 | 161.89111111 |
/C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Sidney Markowitz | 331 | 205.25861111 |
/C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Steven Wu | 115 | 287.16194444 |
/C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Asad Ali | 69 | 451.54666667 |
/C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Brian Browning | 8 | 497.19055556 |
/C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Joseph Heled | 1600 | 534.40166667 |
/C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Sharon Browning | 537 | 1473.76027778 |
Example over the last 7 days
select substr(subject_name, instr(subject_name,'CN=') + 3) as name, pbs.local_job_id, idempotence_id, pbs.queued_time as queued, (unix_timestamp(str_to_date(pbs.end_time,"%a %b %d %H:%i:%s %Y")) - unix_timestamp(str_to_date(pbs.queued_time,"%a %b %d %H:%i:%s %Y"))) / 60.0 / 60.0 as total, (unix_timestamp(str_to_date(pbs.end_time,"%a %b %d %H:%i:%s %Y")) - unix_timestamp(str_to_date(pbs.start_time,"%a %b %d %H:%i:%s %Y"))) / 60.0 / 60.0 as run, exit_status from backup_gram_audit_table,pbs,exit_status where subject_name LIKE '%Brian Browning%' and str_to_date(pbs.queued_time,"%a %b %d %H:%i:%s %Y") >
subdate(now(), interval 7 day) and backup_gram_audit_table.local_job_id LIKE '%.hpc-bestgrid.auckland.ac.nz%' and pbs.local_job_id = backup_gram_audit_table.local_job_id and pbs.local_job_id = exit_status.local_job_id
order by str_to_date(backup_gram_audit_table.queued_time,"%Y-%m-%d %H:%i:%s %Y");
name | local_job_id | idempotence_id | queued | total | run | exit_status |
---|
Brian Browning | 101526.hpc-bestgrid.auckland.ac.nz | a4d427fa-7b04-11de-90b4-e2e160fc8bd9 | Tue Jul 28 11:24:37 2009 | 24.01000000 | 18.49138889 | 271 |
---|---|---|---|---|---|---|
Brian Browning | 101527.hpc-bestgrid.auckland.ac.nz | a8b212ba-7b04-11de-a4f3-e2e160fc8bd9 | Tue Jul 28 11:24:43 2009 | 24.01027778 | 18.46694444 | 271 |
Brian Browning | 102814.hpc-bestgrid.auckland.ac.nz | d928de24-7cad-11de-bc0f-e2e160fc8bd9 | Thu Jul 30 14:08:21 2009 | 24.00166667 | 24.00138889 | 271 |
Brian Browning | 102815.hpc-bestgrid.auckland.ac.nz | dd9200da-7cad-11de-a111-e2e160fc8bd9 | Thu Jul 30 14:08:29 2009 | 24.01888889 | 24.01861111 | 271 |
Brian Browning | 102816.hpc-bestgrid.auckland.ac.nz | df649f58-7cad-11de-9848-e2e160fc8bd9 | Thu Jul 30 14:08:31 2009 | 24.02000000 | 24.01972222 | 271 |
Brian Browning | 103973.hpc-bestgrid.auckland.ac.nz | eb43ec7e-7e39-11de-8548-e2e160fc8bd9 | Sat Aug 1 13:23:32 2009 | 24.07500000 | 0.00138889 | 271 |