A diary of events related to the
HammerCloud tests at Birmingham.
11th Jan 10 - 1060
PANDA/ANALY_BHAM test - failed. Problems with pilot jobs?
11th Jan 10 - 1036
PANDA/ANALY_BHAM test - failed. Problems with pilot jobs?
9th Nov 09 - 805
Test requested just for the Birmingham site to ensure that the HammerCloud mechanism works.
5th Aug 09 - 548
05/08/09 11:11
Test begins. Both epgce3 and epgce4 are accepting pilot jobs from Peter. epgce3 is set to run 64 jobs simultaneously; epgce4 will run 20.
July 09 - 540
30/07/09 09:10
The SRMV2.2 service failed on the SE late last night, which caused some of the
HammerCloud tests to fail. This is most likely due to the DPM 1.7 bug reported last week on TB-SUPPORT. The service has been restarted, but the fix has not yet been installed.
The number of ATLAS pilot jobs has been throttled so that Camont jobs can run. The Maui config removes the extra privileges of Peter's pool account and the pilatl group returns to a MAXPROC of 20,24. We completed 960 jobs and failed 195.
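For reference, a minimal sketch of the throttled state in maui.cfg (the group name pilatl is taken from this entry; the exact file layout on epgce3 is an assumption). The 20,24 pair is Maui's soft,hard limit syntax:
  # epgce3 maui.cfg (sketch) - pilot group back to normal limits
  GROUPCFG[pilatl]  MAXPROC=20,24   # soft limit 20, hard limit 24
  # the temporary per-user MAXPROC override for Peter's pool account
  # (see the 29/07/09 00:55 entry below) has been removed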
29/07/09 - 13:06
The Maui attribute NODEALLOCATIONPOLICY on epgce3 was changed from LOAD to CPULOAD in an attempt to better balance the job allocation. Birmingham has now processed 546 pilot jobs, with only 2 failures. A small number of pilots are still seen on epgce4.
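A minimal sketch of the change in maui.cfg (the file path and restart step are assumptions; CPULOAD favours nodes with the most unused CPU when placing new jobs):
  # epgce3:/opt/maui/maui.cfg (path is an assumption)
  NODEALLOCATIONPOLICY  CPULOAD    # was LOAD
  # restart the scheduler to pick up the change, e.g.
  #   service maui restart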
29/07/09 10:48
Small number (6) of Peter's jobs spotted on epgce4 (along with other, non-LHCb jobs!). 411 jobs have now been completed by epgce3, with only one failure! Are we saturated though? Some of the Twin CPUs are certainly very busy, but some are empty, so perhaps better load balancing could be investigated? No obvious evidence of network saturation on se1 or sr1, but should there be? I think data is DQ2_COPY'ed to the worker node before it's processed. There is evidence of the sr1 CPU becoming busy late last night, corresponding to when the MAXPROC was increased - presumably because of the large number of DQ2_COPIES which would have taken place all at once.
[Attached plot: epgsr1 - 29/07/09 10:48]
There is also a large busy period on sr1 late Tuesday afternoon. This corresponds to a large number of jobs running on the local farm using RFIO to access data on the SE. Perhaps SE access should be rethought, or the RFIO buffers tweaked again.
[Attached plots: epgsr1 - 29/07/09 10:48; Twins - 29/07/09 10:48; epgd12 - 29/07/09 12:16]
I don't understand the epgce4 network access plot. Is this a saturation on input speed? I hope not!
[Attached plot: epgce4 network - 29/07/09 10:48]
29/07/09 00:55
Increased Peter Love's pilot DN MAXPROC quota on epgce3 to allow more pilot jobs to run. The target number of pilot jobs would normally be 12 (25% of 40% of 128 slots) on epgce3, but this is a stress test! Birmingham has up until now only completed 160 jobs, and we're falling behind! It would be more useful to have epgce4 online, but it is still refusing jobs.
28/07/09 13:40
Still only one pilot job on epgce4; 12 located on epgce3. Increased MAXPROC to 48 on epgce3 for ATLAS pilot jobs.
This in itself won't work, as the pilot jobs all belong to one user and will be limited by the USERCFG[DEFAULT] FSTARGET=40+ MAXPROC=12 directive!
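A sketch of the per-user override that gets around this (the pool account name pilatl14 is taken from the gatekeeper log further down; the exact values are assumptions):
  # epgce3 maui.cfg (sketch)
  USERCFG[DEFAULT]   FSTARGET=40+  MAXPROC=12   # existing default: caps any single user at 12 procs
  USERCFG[pilatl14]  MAXPROC=48                 # explicit override for the pilot pool account
  GROUPCFG[pilatl]   MAXPROC=48                 # group limit raised as per this entry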
28/07/09 11:30
One pilot job from Peter Love spotted on epgce4.
28/07/09 11:00
Peter Love's pilot jobs spotted on epgce3. The lcg-CE was upgraded on epgce4, which might fix the job acceptance problem.
28/07/09 10:00
Test begins. Possible problem on epgce4 not receiving jobs?
July 09 - 519 and 520
23/07/09
Tests finish and the remaining submitted jobs are killed. These are then categorised as "failed", hence the high failure rate at Birmingham. Almost all of the "real" failures correspond to an error code of 1123 - "Missing guid(s) for output file(s) in metadata".
The statistics are too limited to draw concrete conclusions. It might be worth finding out why QMUL and Liverpool have the best CPU/walltime ratio. It also looks like the "Output Storage Time" metric could be improved.
[Attached plot: epgsr1 - 23/07/09 10:15]
22/07/09
Number of running pilot jobs on epgce3 hits 10. Increased the maui.cfg MAXPROC to match atlprd (MAXPROC=24, no limit on jobs per node).
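As a sketch, assuming the group names atlprd and pilatl used elsewhere in this diary:
  # epgce3 maui.cfg (sketch)
  GROUPCFG[atlprd]  MAXPROC=24
  GROUPCFG[pilatl]  MAXPROC=24   # raised to match atlprd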
22/07/09
epgce4 hits its MaxProc limit of 20 pilot jobs. epgce3 is still on a very low number (4) while the rest are queued. It is also starting to feel very unresponsive.
22/07/09 16:00
45 pilots and counting on epgce3. All pilot and production jobs are queued; the number running is increasing.
22/07/09 15:54
20 pilot jobs have now arrived on epgce3. All of them are queued :s
22/07/09 15:23
Pilot jobs spotted! Running under Peter's credentials. 5 jobs on epgce3, 25 on ce4. Nodes don't look very balanced on ce3 - epgd16 is very loaded (75-100%) whereas some other nodes are empty. The CE3 pilot jobs are queued.
[Attached plots: sr1 network; se1 CPU; se1 network]
22/07/09 13:57
Removed empty directories from epgce3 (rmdir /home/<user>/.globus/job/epgce3.ph.bham.ac.uk/) for users atl073, prdatl08 (Graeme accounts) and atl052, pilatl14, prdatl11, prdatl19 (Peter accounts). Repeated on epgce4 for g-atl012, g-atlo08, g-atlp13, g-atlp17 (Peter) and g-atl057, g-atlp08 (Graeme).
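A minimal sketch of the cleanup, assuming the per-user path above (the find options are an assumption; this entry only says rmdir was used):
  # on epgce3; same loop on epgce4 with its own account names and hostname in the path
  for u in atl073 prdatl08 atl052 pilatl14 prdatl11 prdatl19; do
      find /home/$u/.globus/job/epgce3.ph.bham.ac.uk/ -mindepth 1 -type d -empty -delete
  done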
22/07/09 13:03
Rebooted epgce3 to see if some vital service is missing.
22/07/09 12:16
No pilot jobs have ever run on epgce3 (I had previously confused them with production jobs).
22/07/09 11:40
Using Peter Love's scripts to submit test pilot jobs to epgce3. There is evidence in epgce3:/var/log/messages that these jobs are arriving at Birmingham:
Jul 22 11:33:26 epgce3 GRAM gatekeeper[32593]: "/C=UK/O=eScience/OU=Lancaster/L=Physics/CN=peter love" mapped to pilatl14 (4214/3003)
Jul 22 11:33:26 epgce3 GRAM gatekeeper[32593]: JMA 2009/07/22 11:33:26 GATEKEEPER_JM_ID 2009-07-22.11:33:26.0000032593.0000000000 has EDG_WL_JOBID ''
Jul 22 11:33:26 epgce3 gridinfo[32598]: JMA 2009/07/22 11:33:26 GATEKEEPER_JM_ID 2009-07-22.11:33:26.0000032593.0000000000 JM exiting
It also seems to have been approved by the Gatekeeper (epgce3:/var/log/globus-gatekeeper.log):
TIME: Wed Jul 22 11:33:25 2009
PID: 32593 -- Notice: 6: Got connection 128.142.167.146 at Wed Jul 22 11:33:25 2009
TIME: Wed Jul 22 11:33:26 2009
PID: 32593 -- Notice: 5: Authenticated globus user: /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=peter love
lcas client name: /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=peter love
...
Successfull mapping done
Mapping service "LCMAPS" returned local user "pilatl14"
Unlike Graeme's jobs, there is no evidence of Peter's pilot jobs in epgce3:/var/spool/pbs/server_priv/accounting/20090722, so the job is failing between the gatekeeper and the PBS server.
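A quick way to see this (a sketch; the account names are taken from the gatekeeper log above and the pool-account list earlier in this diary):
  # on epgce3: Peter's pilot account never reaches the PBS accounting log...
  grep -c pilatl14 /var/spool/pbs/server_priv/accounting/20090722
  # ...whereas Graeme's production account does
  grep -c prdatl08 /var/spool/pbs/server_priv/accounting/20090722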
22/07/09 10:45
Birmingham still hasn't received any pilot jobs. Lawrie notes that only one ATLAS pilot pool account is configured on epgce3 (Peter Love). Referring to the Glasgow instructions for enabling pilot accounts:
- Pilot roles appear in the numerous user.conf files on epgce3. This should be standardised!
- Pilot roles appear in epgce3:/etc/shadow, epgce3:/etc/passwd and epgce3:/etc/group.
- There appear to be pilot account directories in epgce3:/home.
- Pilot group entries can be found in epgce3:/etc/grid-security/groupmapfile.
- atlaspil is present in acl_groups in qmgr on epgce3 for both the long and short queues (checked as sketched below).
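The checks above boil down to something like the following (a sketch, run on epgce3; the queue names long and short and the group atlaspil are taken from the list above):
  qmgr -c "list queue long acl_groups"     # should list atlaspil
  qmgr -c "list queue short acl_groups"
  grep atlaspil /etc/grid-security/groupmapfile
  grep -c pilatl /etc/passwd               # count of pilot pool accounts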
21/07/09 14:03
"Your CEs broke" is the advice from Graeme! At his suggestion removed all empty directories in
epgce3:~prdatl08/.globus/job/
and
epgce4:~g-atlp08/.globus/job/
.
Actually, this advice may not have been intended for the Birmingham system! Small numbers of pilot jobs continue to trickle through the CEs.
21/07/09 12:43
Load graphs for epgse1 and epgsr1
[Attached plots: epgse1 - 21/07/09 12:43; epgsr1 - 21/07/09 12:43]
21/07/09 11:31
Birmingham is slated for 350 pilot jobs per
HammerCloud test over the next two days. I can only see 4 pilot jobs belonging to Graeme Stewart running periodically on the long queue on epgce4 and 3 jobs on epgce3. The cluster is far from full.
The MaxProc Maui setting for the ATLAS pilot job group is set to 12 on epgce3 and 20 on epgce4.
--
ChristopherCurtis - 21 Jul 2009