ATLAS Grid Monitoring Links

A list of useful links concerning ATLAS Grid Activities.

Panda Production

Production jobs tend to use a lot of CPU time, so it's important that they run successfully. The status of production jobs running in any cloud can be checked here. Clicking on the UK link gives more detailed information about individual sites.

Failed jobs for SouthGrid sites during the last 12 hours can be found here:

BHAM CAM OX RAL

Not all errors are site-specific. Some are related to the way the task has been defined (e.g. wrong database file requested, wrong jobOptions file used). More information can usually be found in the athena_stdout.txt log file, but it's usually easier to let ATLAS come to you with problems!

If there is a problem with a site, the production queues are usually moved offline. The status of the production queues can be checked here:

BHAM CAM OX RAL

Queues are normally in the online state. If a queue is marked as offline or brokeroff, it won't receive any jobs! Once a problem with a site has been reported as fixed, its queues are first moved to the test state while shifters manually send batches of test jobs. If the test jobs complete successfully, the site is moved back online.
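
The queue states can be summarised in a short Python sketch. This is only an illustration of the rules described here, not the Panda API: the function names and the dictionary of site labels are hypothetical, and the sketch assumes that a queue in the test state gets only the shifters' test jobs.

```python
# Hypothetical model of the queue states described on this page.
# It does not talk to Panda; states are passed in as plain strings.
KNOWN_STATES = {"online", "offline", "brokeroff", "test"}


def receives_jobs(state: str) -> bool:
    """Return True only if a queue in this state receives normal jobs."""
    state = state.strip().lower()
    if state not in KNOWN_STATES:
        raise ValueError(f"unknown queue state: {state!r}")
    # Assumption: "test" queues only get manually submitted test jobs.
    return state == "online"


def queues_needing_attention(queue_states: dict) -> list:
    """List the queues (sorted by name) that are not receiving normal jobs."""
    return sorted(
        name for name, state in queue_states.items() if not receives_jobs(state)
    )


if __name__ == "__main__":
    # Made-up example states, keyed by the site labels used on this page.
    example = {"BHAM": "online", "CAM": "online", "OX": "test", "RAL": "brokeroff"}
    print(queues_needing_attention(example))
```

Anything this reports is worth following up by checking the queue status links above.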

You can request queue states be changed by emailing atlas-support-cloud-uk@cern.ch.

Panda Pilots

These jobs usually represent user jobs, so they're more I/O bound and more prone to crashing. They can also represent GangaRobot/HammerCloud tests. Queue status can be checked here:

BHAM CAM OX RAL

The same queue states are available for the pilot queues as for the production queues. For a site to receive jobs, its queue must be in the online state!

The UK pilot factories are managed by Peter Love and Graeme Stewart, both of whom are very helpful when ploughing through Athena log files!

Log files from the Glasgow pilot factory are available here: Glasgow Factory (replace the date in the URL and click through to the relevant site).

Logs from the Sheffield pilot factory are available on request...

DDM

The status of ATLAS data transfers between sites over the last 4 hours is shown here.

Each row represents transfers into a site. Clicking on the number in the Transfer Errors column will expand the errors and give more information about why the transfers failed. Note that errors are classified according to the destination site, and not the cause of the error. For example, if there is a problem at RAL, transfer errors from RAL to Oxford will appear under Oxford.
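
To see why classifying by destination can be misleading, here is a minimal Python sketch. The list of (source, destination) pairs is made-up example data, and the function names are hypothetical; nothing here reads from the DDM monitoring itself.

```python
from collections import Counter


def errors_by_destination(failed_transfers):
    """Count failed transfers per destination site, as the monitoring page does."""
    return Counter(dest for _src, dest in failed_transfers)


def errors_by_source(failed_transfers):
    """Count the same failures per source site instead."""
    return Counter(src for src, _dest in failed_transfers)


if __name__ == "__main__":
    # Made-up failures as (source, destination) pairs.
    failures = [("RAL", "OX"), ("RAL", "OX"), ("RAL", "BHAM"), ("CAM", "OX")]
    print(errors_by_destination(failures))
    print(errors_by_source(failures))
```

In this example most errors are listed under OX, but regrouping by source shows that three of the four failures involve RAL, which hints that the problem may be at RAL rather than at Oxford.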

A list of error codes is available here.

ATLAS eLog

All known problems should be logged by shifters here. Sometimes it's useful to know what ATLAS knows about a problem!

BRIS BHAM CAM OX RAL

Other Useful Links

On the subject of software installations, if it looks like there is a problem with a software release at your site, get Alessandro de Salvo involved as soon as possible! He can reinstall anything/everything very quickly!

-- ChristopherCurtis - 29 Nov 2010

Topic revision: r1 - 29 Nov 2010 - 22:30:35 - ChristopherCurtis
 