---++ BB Queue problem 23rd June 09

It was noted at 11:17 on 23rd June 09 that all grid jobs on BB were queued. Jobs resumed at around 12:22.

Conclusion: the problem was potentially caused by certificates not being updated on epgce4. Although the only manual actions taken were to update lcg-CA and the CRLs, the possibility remains that an automatic action on another system was what actually fixed the problem.

Entries below are in reverse chronological order.

---+++ 16:04

My own ganga jobs start to complete (and fail!) as expected.

---+++ 13:25

Ganga jobs still running (expected). =listdone= now returns jobs completed on 23rd June. Curiously, the completion times are listed as being spread throughout the day.

---+++ 12:28

Ganga jobs start running. Received email notification of a SAM test (CE-sft-job) failure on epgce4.

---+++ 12:22

Jobs started running. Waiting for my test ganga job to show some life.

---+++ 12:09

=qstat |grep "R long"|wc -l= on epgce3 is now equal to 61. Probably no problem with epgce3.

---+++ 12:04

The =listdone= command appears to list details of completed jobs. The last job to complete on epgce4 was at 20090622T175107. On epgce3 it was at 20090623T120033. Does this mean jobs are still completing on epgce3? Does the running job count change? =qstat |grep "R long"|wc -l= is equal to 57.

---+++ 11:56

Checked =/var/log/messages=. The line
<verbatim>
Jun 23 11:50:07 epgce4 GRAM gatekeeper[29499]: GSS failed Major:01090000 Minor:00000000 Token:00000003
</verbatim>
keeps appearing. Its first entry is at Jun 23 04:04:12, which could coincide with the beginning of the log file or the start of yum updates on other machines. The line also appears in =/var/log/messages= on epgce3.

---+++ 11:46

Checked =/var/log/fetch-crl-cron.log=; the last update appears to be 1st June. Ran the CRL update script manually ( =/opt/glite/libexec/fetch-crl.sh >> /var/log/fetch-crl-cron.log 2>&1= ). Jobs still appear to be queued.

---+++ 11:42

Jobs are running on epgce3, but there seem to be a large number queued.

---+++ 11:17

Logged onto =epgce4=. The =qs= command shows all user jobs as being queued. None are running.

Checked the =lcg-CA= package with =rpm -qi lcg-CA=. Version 1.29 was installed. Version 1.30 had been installed on some of our grid components automatically by yum, but not on BB. Ran =yum install lcg-CA= on =epgce4=. Version 1.30 installed, but no change to the queued job status.

-- Main.ChristopherCurtis - 23 Jun 2009
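
As a rough check on the 12:04 entry, the two =listdone= timestamps can be compared directly to see how long epgce4 had gone without completing a job. The following is only a minimal sketch: the helper names =parse_stamp= and =completion_gap= are made up for illustration, and it assumes both timestamps are in the same time zone as the 11:17 observation.

```python
from datetime import datetime, timedelta

# Layout of the timestamps quoted from listdone in the 12:04 entry,
# e.g. 20090622T175107 (assumed to be YYYYMMDD, 'T', HHMMSS).
STAMP_FORMAT = "%Y%m%dT%H%M%S"


def parse_stamp(stamp: str) -> datetime:
    """Convert a listdone-style timestamp string to a datetime (hypothetical helper)."""
    return datetime.strptime(stamp, STAMP_FORMAT)


def completion_gap(earlier: str, later: str) -> timedelta:
    """Time elapsed between two listdone-style timestamps (hypothetical helper)."""
    return parse_stamp(later) - parse_stamp(earlier)


# Values taken from the 12:04 entry.
last_epgce4 = "20090622T175107"
last_epgce3 = "20090623T120033"

# How far epgce4's last completion lags behind epgce3's.
gap = completion_gap(last_epgce4, last_epgce3)

# How long epgce4 had been idle when the problem was noticed at 11:17.
noticed = datetime(2009, 6, 23, 11, 17)
idle_when_noticed = noticed - parse_stamp(last_epgce4)

print("epgce4 behind epgce3 by:", gap)
print("epgce4 idle when noticed:", idle_when_noticed)
```

On these figures epgce4 had completed nothing for over 17 hours by the time the problem was noticed, while epgce3 was still completing jobs minutes before the 12:04 check.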
Topic revision: r2 - 23 Jun 2009 - /C=UK/O=eScience/OU=Birmingham/L=ParticlePhysics/CN=christopher curtis