ATLAS Grid Monitoring Links
A list of useful links concerning ATLAS Grid Activities.
Panda Production
Production jobs tend to use a lot of CPU time, so it's important that they run successfully. The status of production jobs running in any cloud can be checked here. Click on the UK link for more detailed information about individual sites.
Failed jobs for SouthGrid sites during the last 12 hours can be found here:
BHAM CAM OX RAL
Not all errors are site-specific. Some are related to the way the task has been defined (e.g. wrong database file requested, wrong jobOptions file used). More information can usually be found in the athena_stdout.txt log file, but it's usually easier to let ATLAS come to you with problems!
If there is a problem with a site, the production queues are usually moved offline. The status of the production queues can be checked here:
BHAM CAM OX RAL
Queues only receive jobs while they are in the online state! They will normally be online, but could be marked as offline or brokeroff, in which case they won't receive any jobs. Once a problem with a site has been reported as fixed, queues are first moved to the test status while shifters manually send batches of test jobs. If the test jobs complete successfully, the site is moved back online.
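The queue states described above can be sketched as a small state model. This is only an illustration of the rules in the text, not PanDA code: the names `QueueState`, `receives_jobs` and `next_state_after_fix` are hypothetical, and the behaviour when test jobs fail (the queue stays in test) is an assumption, since the text doesn't say what happens in that case.

```python
from enum import Enum


class QueueState(Enum):
    """The four queue states mentioned in the text."""
    ONLINE = "online"
    OFFLINE = "offline"
    BROKEROFF = "brokeroff"
    TEST = "test"


def receives_jobs(state: QueueState) -> bool:
    """Only queues in the online state receive regular jobs."""
    return state is QueueState.ONLINE


def next_state_after_fix(state: QueueState, test_jobs_ok: bool) -> QueueState:
    """Hypothetical model of the recovery path once a fix has been reported."""
    if state in (QueueState.OFFLINE, QueueState.BROKEROFF):
        # A queue reported as fixed goes to test first, never straight to online.
        return QueueState.TEST
    if state is QueueState.TEST:
        # Assumption: a queue whose test jobs fail simply stays in test.
        return QueueState.ONLINE if test_jobs_ok else QueueState.TEST
    return state
```

For example, `next_state_after_fix(QueueState.OFFLINE, test_jobs_ok=True)` still returns `QueueState.TEST`: the queue has to pass through the test status before it can come back online.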
You can request that queue states be changed by emailing atlas-support-cloud-uk@cern.ch.
Panda Pilots
These jobs usually represent user jobs, so they're more I/O bound and more prone to crashing. They can also represent GangaRobot/HammerCloud tests. Queue status can be checked here:
BHAM CAM OX RAL
The same queue states are available for the pilot queues as for the production queues. For sites to receive jobs, a queue must be in the online state!
The UK pilot factories are managed by Peter Love and Graeme Stewart, both of whom are very helpful when ploughing through Athena log files!
Log files from the Glasgow pilot factory are available here: Glasgow Factory (replace the date in the URL and click through to the relevant site).
Logs from the Sheffield pilot factory are available on request...
DDM
The status of ATLAS data transfers between sites over the last 4 hours is shown here. Each row represents transfers into a site. Clicking on the number in the Transfer Errors column will expand the errors and give more information about why the transfers failed. Note that errors are classified according to the destination site, not the cause of the error. For example, if there is a problem at RAL, transfer errors from RAL to Oxford will appear under Oxford.
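The destination-based classification can be illustrated with a short sketch. The function name `errors_by_destination` and the sample records are made up for illustration; this is not how the DDM dashboard itself is implemented, only a model of the counting rule described above.

```python
from collections import defaultdict


def errors_by_destination(transfers):
    """Count failed transfers per destination site.

    Each transfer is a (source, destination, failed) tuple. The source site
    is ignored when counting, even if it caused the failure.
    """
    counts = defaultdict(int)
    for _source, destination, failed in transfers:
        if failed:
            counts[destination] += 1
    return dict(counts)


# Made-up sample data: RAL has a problem, so transfers out of RAL fail.
sample_transfers = [
    ("RAL", "OX", True),
    ("RAL", "OX", True),
    ("BHAM", "CAM", False),
    ("RAL", "BHAM", True),
]

print(errors_by_destination(sample_transfers))
```

With this sample, every error is booked against OX or BHAM and none against RAL, even though RAL is the site with the problem. When a site shows errors, it is therefore worth checking whether they all share one source.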
A list of error codes is available here.
ATLAS eLog
All known problems should be logged by shifters here. Sometimes it's useful to know what ATLAS knows about a problem!
BRIS BHAM CAM OX RAL
Other Useful Links
On the subject of software installations, if it looks like there is a problem with a software release at your site, get Alessandro de Salvo involved as soon as possible! He can reinstall anything/everything very quickly!
--
ChristopherCurtis - 29 Nov 2010