---++ BB Queue problem 23rd June 09

It was noted at 11:17 on 23rd June 09 that all grid jobs on BB were queued. Jobs resumed at around 12:22. The conclusion is that the problem was potentially caused by certificates not being updated on epgce4. Although the only manual action taken was to update lcg-CA and the CRLs, the possibility remains that an automatic action on another system was what actually fixed the problem.

---+++ 26/06/09 17:30

John Owen confirmed that all BB nodes were switched off at roughly 3pm due to a fire alarm in the computing centre. This would explain the queued jobs after this time, but not why only one job was seen to be running before this time. It may well be that the CRL update fixed the problem but the power cut caused the jobs to fail. Will keep an eye on the status page and retest once BB is back up. This may not happen before the scheduled outage on Monday, though.

---+++ 26/06/09 16:36

=qs= output has returned on epgce4. All jobs remain queued, though. SAM tests are starting to fail. Discussed jobs with Dave Hadley. According to Ganga, all of his jobs are running on epgce3, so I've killed those running on epgce4.

---+++ 26/06/09 15:48

No longer able to =qstat= or =qs= on epgce4. Logged onto BlueBEAR proper, same problem. Call logged with the Uni helpdesk (HD486255). It might be due to the scheduled outage on Monday...

---+++ 26/06/09 14:56

Steve Lloyd's jobs are queued in gshort.

---+++ 26/06/09 14:28

Observed that the only queued jobs are in glong. Two short jobs have appeared and are running.

---+++ 26/06/09 14:05

Jobs noted to be queued again. Re-ran the fetch CRL script.

---+++ 23/06/09 16:04

My own Ganga jobs start to complete (and fail!) as expected.

---+++ 23/06/09 13:25

Ganga jobs still running (expected). =listdone= now returns jobs completed on 23rd June. Curiously, completion times are listed as being throughout the day ( _Could be hitting walltime/grid expire time?_ )

---+++ 23/06/09 12:28

Ganga jobs start running. Email notification of SAM test (CE-sft-job) failure on epgce4.

---+++ 23/06/09 12:22

Jobs started running. Waiting for my test Ganga job to show some life.

---+++ 23/06/09 12:09

=qstat |grep "R long"|wc -l= on epgce3 now returns 61. Probably no problem with epgce3.

---+++ 23/06/09 12:04

The =listdone= command appears to list details of completed jobs. The last job to complete on epgce4 was at 20090622T175107. On epgce3 it was at 20090623T120033. Does this mean jobs are still completing on epgce3? Does the running job count change? =qstat |grep "R long"|wc -l= returns 57.

---+++ 23/06/09 11:56

Checked =/var/log/messages=. The line
<verbatim>
Jun 23 11:50:07 epgce4 GRAM gatekeeper[29499]: GSS failed Major:01090000 Minor:00000000 Token:00000003
</verbatim>
keeps appearing. Its first entry is at Jun 23 04:04:12, which could coincide with the beginning of the log file or the start of yum updates on other machines. The line also appears in =/var/log/messages= on epgce3.

---+++ 23/06/09 11:46

Checked =/var/log/fetch-crl-cron.log=; the last update appears to be 1st June. Ran the CRL update script manually ( =/opt/glite/libexec/fetch-crl.sh >> /var/log/fetch-crl-cron.log 2>&1= ). Jobs appear to still be queued.

---+++ 23/06/09 11:42

Jobs are running on epgce3, but there seem to be a large number queued.

---+++ 23/06/09 11:17

Logged onto =epgce4=. The =qs= command shows all user jobs as queued. None are running. Checked the =lcg-CA= package with =rpm -qi lcg-CA=. Version 1.29 was installed. Version 1.30 was installed automatically by yum on some of our grid components, but not on BB. Ran =yum install lcg-CA= on =epgce4=. Version 1.30 installed, but no change to queued job status.

-- Main.ChristopherCurtis - 23 Jun 2009
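
The two checks repeated by hand in the log above (counting running jobs in a queue, and checking whether the CRL log has gone stale) could be wrapped in small shell helpers. The following is only a minimal sketch: the function names =count_running= and =crl_log_stale= are hypothetical, the running-job match simply reuses the ="R long"= pattern from the log, and file modification time is assumed to be a reasonable proxy for "last CRL update" (it is not if the cron job writes nothing on success).

```shell
#!/bin/sh
# Hypothetical helpers; names and thresholds are illustrative assumptions,
# not part of any installed tool.

# Count jobs in state R for one queue, reading qstat-style text on stdin.
# Assumes the state letter and queue name are adjacent, e.g. "... R long",
# which is what the grep "R long" used in the log relies on.
count_running() {
    queue="$1"
    # grep -c prints 0 but exits non-zero when nothing matches; mask that.
    grep -c "R ${queue}" || true
}

# Succeed (exit 0) if the given log file is missing or was last modified
# more than max_days days ago; fail (exit 1) if it is fresh.
crl_log_stale() {
    log="$1"
    max_days="$2"
    [ -e "$log" ] || return 0
    [ -n "$(find "$log" -mtime +"$max_days" -print)" ]
}
```

Possible usage on a CE would be =qstat | count_running long= to get the running count, and =crl_log_stale /var/log/fetch-crl-cron.log 2 && echo "CRL log looks stale"= as a cron-driven warning. The two-day threshold is an arbitrary choice for illustration.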
Topic revision: r3 - 26 Jun 2009 - /C=UK/O=eScience/OU=Birmingham/L=ParticlePhysics/CN=christopher curtis