BB Queue problem 23rd June 09

It was noted at 11:17 on 23rd June 09 that all grid jobs on BB were queued. Jobs resumed at around 12:22. The likely cause was certificates not being updated on epgce4. Although the only manual action taken was to update lcg-CA and the CRLs, the possibility remains that an automatic action on another system actually fixed the problem.

26/06/09 22:32

Short ops job spotted running in the BB queue. All other jobs queued. Status page suggests that BB is still down.

26/06/09 17:30

John Owen confirmed that all BB nodes were switched off at roughly 3pm due to a fire alarm in the computing centre. This would explain the queued jobs after that time, but not why only one job was seen to be running before it. It may well be that the CRL update fixed the problem but the power cut then caused the jobs to fail. Will keep an eye on the status page and retest once BB is back up, although this may not happen before the scheduled outage on Monday.

26/06/09 16:36

qs output has returned on epgce4. All jobs remain queued though. SAM tests are starting to fail. Discussed jobs with Dave Hadley. According to Ganga, all of his jobs are running on epgce3 so I've killed those running on epgce4.

26/06/09 15:48

No longer able to qstat or qs on epgce4. Logged onto BlueBEAR proper, same problem. Call with the Uni helpdesk logged (HD486255). It might be due to the scheduled outage on Monday...

26/06/09 14:56

Steve Lloyd's jobs are queued in gshort.

26/06/09 14:28

Observe that the only queued jobs are in glong. Two short jobs have appeared and are running.

26/06/09 14:05

Jobs noted to be queued again. Re-ran the fetch CRL script.

23/06/09 16:04

My own ganga jobs start to complete (and fail!) as expected.

23/06/09 13:25

Ganga jobs still running (expected). listdone now returns jobs completed on 23rd June. Curiously, completion times are listed as being throughout the day (could they be hitting the walltime/grid expiry time?).

23/06/09 12:28

Ganga jobs start running. Email notification of SAM test (CE-sft-job) failure on epgce4.

23/06/09 12:22

Jobs started running. Waiting for my test ganga job to show some life.

23/06/09 12:09

On epgce3, qstat |grep "R long"|wc -l now returns 61, so there is probably no problem with epgce3.

23/06/09 12:04

The listdone command appears to list details of completed jobs. The last job to complete on epgce4 was at 20090622T175107; on epgce3 it was at 20090623T120033. Does this mean jobs are still completing on epgce3? Does the running job count change? qstat |grep "R long"|wc -l returns 57.
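
To see whether the running job count changes, the count above can be wrapped in a small helper and re-run a few minutes apart. This is only a sketch: the name count_running_long is made up for illustration, and it assumes, as in the command above, that running long-queue jobs appear in qstat output as lines containing "R long".

```shell
# Hypothetical helper: count lines on stdin that contain "R long",
# i.e. the same match as: qstat | grep "R long" | wc -l
count_running_long() {
    # grep -c prints the count; "|| true" keeps the exit status 0
    # even when the count is zero.
    grep -c "R long" || true
}

# Example use on a CE (commented out so the snippet has no side effects):
# qstat | count_running_long
```

Two readings taken a few minutes apart that differ would suggest jobs are still starting or finishing on that CE.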

23/06/09 11:56

Checked /var/log/messages

The line

Jun 23 11:50:07 epgce4 GRAM gatekeeper[29499]: GSS failed Major:01090000 Minor:00000000 Token:00000003

keeps appearing. Its first entry is at Jun 23 04:04:12, which could coincide with the beginning of the log file or the start of yum updates on other machines.

The line also appears in /var/log/messages on epgce3.
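
A quick way to repeat this check on either CE is to count the "GSS failed" lines and print the first one. The sketch below is illustrative only: the function name gss_failures is made up, and it assumes the message text is exactly as shown above.

```shell
# Hypothetical helper: summarise "GSS failed" gatekeeper lines in a log file.
# Prints the number of matching lines, then the first matching line (if any).
gss_failures() {
    logfile=$1
    grep -c "GSS failed" "$logfile" || true
    grep -m 1 "GSS failed" "$logfile" || true
}

# Example use (commented out; normally needs root to read the log):
# gss_failures /var/log/messages
```

Comparing the first timestamp with the start of the log file would show whether the errors began at 04:04:12 or were already present before the log was rotated.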


Checked /var/log/fetch-crl-cron.log; the last update appears to have been on 1st June. Ran the CRL update script manually ( /opt/glite/libexec/ >> /var/log/fetch-crl-cron.log 2>&1). Jobs still appear to be queued.
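
Besides reading the cron log, stale CRLs can be spotted from the file timestamps. The following is a sketch under assumptions not stated in these notes: that the CRLs are stored as *.r0 files, and that they live in a directory such as /etc/grid-security/certificates. The function name stale_crls is made up for illustration.

```shell
# Hypothetical helper: list CRL files not modified for more than N days.
# Assumption: CRLs are stored as *.r0 files in the given directory.
stale_crls() {
    dir=$1
    days=$2
    find "$dir" -name '*.r0' -mtime +"$days" -print
}

# Example use (commented out; the directory is an assumption):
# stale_crls /etc/grid-security/certificates 1
```

An empty result would suggest the CRLs are being refreshed; a long list would point to the update job not running.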


Jobs are running on epgce3, but there seem to be a large number queued.


Logged onto epgce4. qs command shows all user jobs as being queued. None are running. Checked lcg-CA package with rpm -qi lcg-CA. Version 1.29 was installed. Version 1.30 was installed on some of our grid components automatically by yum, but not BB. Ran yum install lcg-CA on epgce4. Version 1.30 installed but no change to queued job status.
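
The version check could be scripted so that a node left behind by yum is easier to spot. This is a minimal sketch: the function names version_older and installed_lcg_ca are made up for illustration, and the comparison relies on GNU sort -V being available.

```shell
# Hypothetical helper: succeed (exit 0) if version $1 is strictly older than $2.
# Relies on GNU sort -V for version ordering.
version_older() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical helper: print the installed lcg-CA version on this host.
installed_lcg_ca() {
    rpm -q --qf '%{VERSION}\n' lcg-CA
}

# Example use (commented out; needs rpm and the lcg-CA package):
# version_older "$(installed_lcg_ca)" 1.30 && echo "lcg-CA needs updating"
```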

