BB Queue problem 23rd June 09

At 11:17 on 23rd June 09 it was noted that all grid jobs on BB were queued. Jobs resumed at around 12:22. The likely cause was certificates not being updated on epgce4. Although the only manual action taken was to update lcg-CA and the CRLs, the possibility remains that an automatic action on another system was what actually fixed the problem.


Ganga jobs still running (expected). listdone now returns jobs completed on 23rd June. Curiously, the completion times listed are spread throughout the day.


Ganga jobs start running. Email notification of SAM test (CE-sft-job) failure on epgce4.


Jobs started running. Waiting for my test ganga job to show some life.


qstat | grep "R long" | wc -l on epgce3 now returns 61. Probably no problem with epgce3.


The listdone command appears to list details of completed jobs. The last job to complete on epgce4 was at 20090622T175107. On epgce3 it was at 20090623T120033. Does this mean jobs are still completing on epgce3? Does the running job count change? qstat | grep "R long" | wc -l returns 57.
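
The running-job count used in this entry can be made a little stricter than a plain grep. The following is only a sketch: it assumes the default qstat listing, where the last two columns are the job state and the queue name, and the function name count_running_long is made up for illustration.

```shell
# Sketch only: assumes default qstat output, where the last two
# whitespace-separated columns are the job state (S) and the queue name.
# count_running_long is a hypothetical helper name, not an existing tool.
count_running_long() {
    # Count lines whose state is exactly "R" and whose queue is exactly "long".
    # "n + 0" makes awk print 0 rather than an empty string when nothing matches.
    awk '$(NF-1) == "R" && $NF == "long" { n++ } END { print n + 0 }'
}
```

Typical use would be qstat | count_running_long, repeated a few minutes apart to see whether the count changes. Matching on exact columns avoids counting a job whose name happens to contain the text "R long".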


Checked /var/log/messages

The line

Jun 23 11:50:07 epgce4 GRAM gatekeeper[29499]: GSS failed Major:01090000 Minor:00000000 Token:00000003

keeps appearing. Its first entry is at Jun 23 04:04:12, which could coincide with either the start of the log file or the start of yum updates on other machines.

The line also appears in /var/log/messages on epgce3.
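
To see when the gatekeeper error first and last appeared, and how often, the log can be summarised in one pass. This is a rough sketch that assumes the standard syslog layout (month, day and time as the first three fields); the function name gss_failure_summary is invented for this example.

```shell
# Sketch only: assumes syslog-style lines starting with "Mon DD HH:MM:SS".
# gss_failure_summary is a hypothetical helper name.
gss_failure_summary() {
    awk '/GRAM gatekeeper/ && /GSS failed/ {
             n++
             stamp = $1 " " $2 " " $3      # month, day, time
             if (n == 1) first = stamp      # remember the earliest match
             last = stamp                   # keep overwriting with the latest
         }
         END { printf "count=%d first=%s last=%s\n", n, first, last }'
}
```

It would be run as gss_failure_summary < /var/log/messages on each CE. Comparing the first timestamp on epgce3 and epgce4 might help decide whether the errors started with the log rotation or with some other event.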


Checked /var/log/fetch-crl-cron.log; the last update appears to have been on 1st June. Ran the CRL update script manually: ( /opt/glite/libexec/ >> /var/log/fetch-crl-cron.log 2>&1). Jobs still appear to be queued.
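
Besides reading the cron log, the age of the CRL files themselves can be checked directly. The sketch below assumes the conventional layout in which CRLs are stored as *.r0 files under /etc/grid-security/certificates; both that path and the function name stale_crls are assumptions and were not verified on epgce4.

```shell
# Sketch only: the default directory and the "*.r0" suffix are assumptions
# based on the usual grid CA layout; stale_crls is a hypothetical helper name.
stale_crls() {
    dir=${1:-/etc/grid-security/certificates}
    days=${2:-1}
    # List CRL files not modified recently. find rounds ages down to whole
    # days, so "-mtime +7" matches files at least 8 days old.
    find "$dir" -maxdepth 1 -name '*.r0' -mtime +"$days" -print
}
```

For example, stale_crls /etc/grid-security/certificates 7 would list CRL files untouched for over a week. An empty result after a manual update would suggest the update actually refreshed the files.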


Jobs are running on epgce3, but there seems to be a large number queued.


Logged onto epgce4. The qs command shows all user jobs as queued; none are running. Checked the lcg-CA package with rpm -qi lcg-CA: version 1.29 was installed. Version 1.30 had been installed automatically by yum on some of our grid components, but not on BB. Ran yum install lcg-CA on epgce4. Version 1.30 was installed, but there was no change to the queued job status.
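
Since the package version differed between machines, a quick way to spot stragglers is to compare a list of host and version pairs. This is a minimal sketch: the function name outdated_hosts and the two-column input format are invented for illustration, and it relies on GNU sort -V for version ordering.

```shell
# Sketch only: reads "host version" pairs on stdin and prints the hosts that
# are not at the newest version seen. outdated_hosts is a hypothetical name.
outdated_hosts() {
    input=$(cat)
    # Find the highest version in the second column (GNU sort -V).
    newest=$(printf '%s\n' "$input" | awk '{ print $2 }' | sort -V | tail -n 1)
    # Print every host whose version differs from the newest one.
    printf '%s\n' "$input" | awk -v newest="$newest" '$2 != newest { print $1 " " $2 }'
}
```

The input could be collected by running rpm -q --qf '%{VERSION}\n' lcg-CA on each machine and prefixing each result with the host name. How the hosts are reached (ssh, a management tool, by hand) is left open here.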

-- ChristopherCurtis - 23 Jun 2009

Topic revision: r1 - 23 Jun 2009 - /C=UK/O=eScience/OU=Birmingham/L=ParticlePhysics/CN=christopher curtis