HammerCloud

A diary of events related to the HammerCloud tests at Birmingham.

11th Jan 10 - 1060

PANDA/ANALY_BHAM test - failed. Problems with pilot jobs?

11th Jan 10 - 1036

PANDA/ANALY_BHAM test - failed. Problems with pilot jobs?

9th Nov 09 - 805

Test requested just for the Birmingham site to ensure that the HammerCloud mechanism works.

5th Aug 09 - 548

05/08/09 11:11

Test begins. Both epgce3 and epgce4 are accepting pilot jobs from Peter. epgce3 is set to run 64 jobs simultaneously; epgce4 will run 20.

July 09 - 540

30/07/09 09:10

The SRMv2.2 service failed on the SE late last night, which caused some of the HammerCloud tests to fail. This is most likely due to the DPM 1.7 bug reported last week on TB-SUPPORT. The service has been restarted, but the fix has not yet been installed.

The number of ATLAS pilot jobs has been throttled so that Camont jobs can run. Maui config removes extra privileges of Peter's pool account and pilatl group returns to 20,24 MAXPROC. We completed 960 jobs and failed 195.
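As a quick check on those totals, the failure fraction can be worked out as below. This is only a sketch: the function name is mine, and it assumes the 960 completed and 195 failed jobs are separate counts rather than the failures being a subset of the 960.

```python
def failure_fraction(completed, failed):
    """Fraction of finished jobs that failed, treating the two counts as disjoint."""
    finished = completed + failed
    if finished == 0:
        raise ValueError("no finished jobs to compute a fraction from")
    return failed / finished

# Numbers from the entry above: 960 completed, 195 failed.
rate = failure_fraction(960, 195)
print(f"{rate:.1%}")
```

Under that assumption roughly one job in six failed during this test.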

29/07/09 - 13:06

The Maui attribute NODEALLOCATIONPOLICY on epgce3 changed from LOAD to CPULOAD in an attempt to better balance the job allocation. Birmingham has now processed 546 pilot jobs, with only 2 failures. A small number of pilots are still seen on epgce4.

29/07/09 10:48

Small number (6) of Peter's jobs spotted on epgce4 (along with other, non-LHCb jobs!). 411 jobs now completed by epgce3, with only one failure! Are we saturated though? Some of the Twin CPUs are certainly very busy, but some are empty. So perhaps better load balancing could be investigated? No obvious evidence of network saturation on se1 or sr1, but should there be? I think data is DQ2_COPY'ed to the worker node before it's processed. There is evidence of sr1 CPU becoming busy late last night, corresponding with when the MAXPROC increased - presumably because of the large number of DQ2_COPIES which would have taken place all at once.

  • epgsr1 - 29/07/09 10:48:
    epgsr1-network-540.png

There is also a large busy period on sr1 late Tuesday afternoon. This corresponds to a large number of jobs running on the local farm using RFIO to access data on the SE. Perhaps SE access should be rethought, or the RFIO buffers tweaked again.

  • epgsr1 - 29/07/09 10:48:
    epgsr1-cpu-540.png
  • Twins - 29/07/09 10:48:
    twins-cpu-540.png
  • epgd12 - 29/07/09 12:16:
    epgd12-cpu-540.png

I don't understand the epgce4 network access plot. Is this a saturation on input speed? I hope not!

  • epgce4 - 29/07/09 10:48:
    epgce4-network-540.png

29/07/09 00:55

Increased Peter Love's pilot DN MAXPROC quota on epgce3 to allow more pilot jobs to run. The target number of pilot jobs would normally be 12 (25% of 40% of 128 slots) on epgce3, but this is a stress test! Birmingham has up until now only completed 160 jobs, and we're falling behind! It would be more useful to have epgce4 online, but this is still refusing jobs.
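For reference, the arithmetic behind the target of 12 can be sketched as below. This is just a worked example: the names are mine, and it assumes the two percentages are applied one after the other to the 128 slots and that the result is rounded down to whole slots.

```python
import math

TOTAL_SLOTS = 128    # job slots on epgce3
FIRST_SHARE = 0.40   # the "40%" from the entry above (assumed to apply to all slots)
PILOT_SHARE = 0.25   # the "25%" from the entry above (assumed to apply within that 40%)

def target_pilot_slots(total_slots, first_share, pilot_share):
    """Whole number of slots left after applying both shares in turn."""
    return math.floor(total_slots * first_share * pilot_share)

target = target_pilot_slots(TOTAL_SLOTS, FIRST_SHARE, PILOT_SHARE)
print(target)  # 128 * 0.40 * 0.25 = 12.8, rounded down to 12
```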

28/07/09 13:40

Still only one pilot job on epgce4. 12 located on epgce3. Increased MAXPROC to 48 on epgce3 for ATLAS pilot jobs. This in itself won't work, as the pilot jobs all belong to one user and will be limited by USERCFG[DEFAULT] FSTARGET=40+ MAXPROC=12 directive!
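A rough model of why raising the group limit alone has no effect: when all the pilots belong to one user, the number that can run is bounded by the tightest limit that applies to them. The sketch below is not Maui code, only the min() reasoning with the numbers from this entry. The function name and the queue depth of 100 are made up for illustration.

```python
def effective_running_limit(queued_jobs, group_maxproc, user_maxproc):
    """Jobs from one user in one group can run only up to the tightest limit."""
    return min(queued_jobs, group_maxproc, user_maxproc)

# Group MAXPROC raised to 48, but USERCFG[DEFAULT] still carries MAXPROC=12.
# queued_jobs=100 is a hypothetical backlog, not a figure from the test.
running = effective_running_limit(queued_jobs=100, group_maxproc=48, user_maxproc=12)
print(running)  # 12
```

So the per-user limit would also need to be raised (or overridden for that one user) before the new group limit makes any difference.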

28/07/09 11:30

One pilot job from Peter Love spotted on epgce4.

28/07/09 11:00

Peter Love's pilot jobs spotted on epgce3. lcg-CE upgraded on epgce4, might fix job acceptance problem.

28/07/09 10:00

Test begins. Possible problem on epgce4 not receiving jobs?

July 09 - 519 and 520

23/07/09

Tests finish and the remaining submitted jobs are killed. These are then categorised as "failed", hence the high failure rate at Birmingham. Almost all of the "real" failures correspond to error code 1123 - "Missing guid(s) for output file(s) in metadata".

The statistics are too low to draw concrete conclusions. It might be worth finding out why QMUL and Liverpool have the best CPU/Walltime ratio. It also looks like the "Output Storage Time" metric could be improved.

  • epgsr1 - 23/07/09 10:15:
    sr1-network-22.png

22/07/09

Number of running pilot jobs on epgce3 hits 10. Increased maui.cfg MaxProc to match atlprd (MaxProc=24, no limit on jobs per node).

22/07/09

epgce4 hits the MaxProc limit of 20 pilot jobs. epgce3 is still on a very low number (4) while the rest are queued. Also starting to feel very unresponsive.

22/07/09 16:00

45 pilots and counting on epgce3. All pilot and production jobs queued. Number of running increasing.

22/07/09 15:54

20 pilot jobs have now arrived on epgce3. All of them are queued :s

22/07/09 15:23

Pilot jobs spotted! Running under Peter's credentials. 5 jobs on epgce3, 25 on ce4. Nodes don't look very balanced on ce3 - epgd16 is very loaded (75-100%) whereas some other nodes are empty. CE3 pilot jobs are queued.

  • sr1 network:
    sr1-network.png

  • se1 cpu:
    se1-cpu.png

  • se1 network:
    se1-network.png

22/07/09 13:57

Removed empty directories from epgce3: rmdir /home//.globus/job/epgce3.ph.bham.ac.uk/ for users atl073, prdatl08 (Graeme accounts) and atl052, pilatl14, prdatl11, prdatl19 (Peter accounts). Repeated on epgce4 for g-atl012, g-atlo08, g-atlp13, g-atlp17 (Peter) and g-atl057, g-atlp08 (Graeme).
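The clean-up above was done by hand with rmdir. If it ever needs repeating across many pool accounts, something like the sketch below might do. The function name is made up, and the demonstration runs on a throwaway temporary directory rather than a real .globus/job tree.

```python
import os
import tempfile

def remove_empty_job_dirs(job_root):
    """Remove empty immediate subdirectories of job_root; return the names removed."""
    removed = []
    for name in sorted(os.listdir(job_root)):
        path = os.path.join(job_root, name)
        is_real_dir = os.path.isdir(path) and not os.path.islink(path)
        if is_real_dir and not os.listdir(path):
            os.rmdir(path)  # like the shell rmdir, this only deletes empty directories
            removed.append(name)
    return removed

# Demonstration: one empty directory, one that still holds a file.
with tempfile.TemporaryDirectory() as job_root:
    os.mkdir(os.path.join(job_root, "empty_job"))
    os.mkdir(os.path.join(job_root, "active_job"))
    with open(os.path.join(job_root, "active_job", "stdout"), "w") as handle:
        handle.write("still in use\n")
    removed = remove_empty_job_dirs(job_root)
    remaining = sorted(os.listdir(job_root))

print(removed, remaining)
```

Directories that still contain files are left alone, so jobs that are in flight should not be disturbed.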

22/07/09 13:03

Rebooted epgce3 to see if some vital service is missing.

22/07/09 12:16

No pilot jobs have ever run on epgce3 (previously confused them for production jobs).

22/07/09 11:40

Using Peter Love's scripts to submit test pilot jobs to epgce3. There is evidence in epgce3:/var/log/messages that these jobs are arriving at Birmingham:

Jul 22 11:33:26 epgce3 GRAM gatekeeper[32593]: "/C=UK/O=eScience/OU=Lancaster/L=Physics/CN=peter love" mapped to pilatl14 (4214/3003)
Jul 22 11:33:26 epgce3 GRAM gatekeeper[32593]: JMA 2009/07/22 11:33:26 GATEKEEPER_JM_ID 2009-07-22.11:33:26.0000032593.0000000000 has EDG_WL_JOBID ''
Jul 22 11:33:26 epgce3 gridinfo[32598]: JMA 2009/07/22 11:33:26 GATEKEEPER_JM_ID 2009-07-22.11:33:26.0000032593.0000000000 JM exiting

It also seems to have been approved by the Gatekeeper (epgce3:/var/log/globus-gatekeeper.log):

TIME: Wed Jul 22 11:33:25 2009
 PID: 32593 -- Notice: 6: Got connection 128.142.167.146 at Wed Jul 22 11:33:25 2009

TIME: Wed Jul 22 11:33:26 2009
 PID: 32593 -- Notice: 5: Authenticated globus user: /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=peter love
lcas client name: /C=UK/O=eScience/OU=Lancaster/L=Physics/CN=peter love
...
Successfull mapping done
Mapping service "LCMAPS" returned local user "pilatl14"

Unlike Graeme's jobs, there is no evidence of Peter's pilot jobs in epgce3:/var/spool/pbs/server_priv/accounting/20090722, so the job is failing between the gatekeeper and the PBS server.
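The comparison above (mapped by the gatekeeper, but absent from PBS accounting) could be scripted roughly as follows. This is a sketch with made-up function names. The gatekeeper line is the one quoted above from /var/log/messages. The accounting record is invented for illustration and assumes the usual PBS layout of semicolon-separated header fields followed by space-separated key=value pairs.

```python
import re

MAPPING_RE = re.compile(
    r'GRAM gatekeeper\[(?P<pid>\d+)\]: "(?P<dn>[^"]+)" mapped to (?P<user>\S+)'
)

def parse_mapping(line):
    """Return (pid, dn, local_user) for a gatekeeper mapping line, else None."""
    match = MAPPING_RE.search(line)
    if match is None:
        return None
    return int(match.group("pid")), match.group("dn"), match.group("user")

def user_in_accounting(local_user, accounting_lines):
    """True if any accounting record carries user=<local_user> as a whole field."""
    pattern = re.compile(r"(?:^|[;\s])user=" + re.escape(local_user) + r"(?:\s|$)")
    return any(pattern.search(record) for record in accounting_lines)

# Line copied from the /var/log/messages excerpt above.
gatekeeper_line = (
    'Jul 22 11:33:26 epgce3 GRAM gatekeeper[32593]: '
    '"/C=UK/O=eScience/OU=Lancaster/L=Physics/CN=peter love" '
    'mapped to pilatl14 (4214/3003)'
)

# Hypothetical accounting record, NOT taken from the real 20090722 file.
accounting = ["07/22/2009 11:40:00;S;123.epgce3.ph.bham.ac.uk;user=atl073 group=atlas queue=long"]

pid, dn, local_user = parse_mapping(gatekeeper_line)
print(local_user, user_in_accounting(local_user, accounting))
```

A mapped account that never shows up in the accounting file points at the same gap between the gatekeeper and the PBS server.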

22/07/09 10:45

Birmingham still hasn't received any pilot jobs. Lawrie notes that only one atlas pilot pool account is configured on epgce3 (Peter Love). Referring to Glasgow instructions for enabling pilot accounts.

  • Pilot roles appear in the numerous user.conf files on epgce3. This should be standardised!
  • Pilot roles appear in epgce3:/etc/shadow, epgce3:/etc/passwd and epgce3:/etc/group
  • There appear to be pilot account directories in epgce:/home.
  • Pilot group entries can be found in epgce:/etc/grid-security/groupmapfile
  • atlaspil are present in acl_groups in qmgr on epgce3 for both long and short queues.

21/07/09 14:03

"Your CEs broke" is the advice from Graeme! At his suggestion removed all empty directories in epgce3:~prdatl08/.globus/job/ and epgce4:~g-atlp08/.globus/job/. Actually, this advice may not have been intended for the Birmingham system! Small numbers of pilot jobs continue to trickle through the CEs.

21/07/09 12:43

Load graphs for epgse1 and epgsr1

  • epgse1 - 21/07/09 12:43:
    epgse1.png

  • epgsr1 - 21/07/09 12:43:
    epgsr1-day.png

21/07/09 11:31

Birmingham is slated for 350 pilot jobs per HammerCloud test over the next two days. I can only see 4 pilot jobs belonging to Graeme Stewart running periodically on the long queue on epgce4 and 3 jobs on epgce3. The cluster is far from full. The Maui MaxProc setting for the ATLAS pilot job group is 12 on epgce3 and 20 on epgce4.

-- ChristopherCurtis - 21 Jul 2009

Topic revision: r15 - 26 Jan 2010 - 10:21:22 - ChristopherCurtis
 