Running on the GRID

As more real data and MC productions become available, the GRID will be the only place where all of it can be accessed. This page explains how to run an analysis task on the GRID.

Prerequisites

You need to be running on a computer with an up-to-date version of Alien and recent versions of ROOT, GEANT and AliRoot. Installation of these packages is not covered here; consult the ALICE offline pages, in particular those on Alien and ROOT. There is also a useful guide to maintaining multiple combinations of versions here.

You should also have an analysis task which can successfully be compiled and run both locally and on a PROOF cluster (e.g. CAF or SKAF). The first requirement should ensure that when your code is sent to the GRID it can be compiled on each worker node as necessary. The second should allow the output from your task to be merged at the end, as PROOF also does this. The task should have the usual run.C macro for running it.

Introduction

The simplest means to run on the GRID is to use the Alien plugin. This involves modifying one configuration macro followed by some changes to the macro used to run your task. For clarity these will be dealt with separately below. The very first task is to download the tutorial example from http://aliweb.cern.ch/secure/Offline/sites/aliceinfo.cern.ch.secure.Offline/files/uploads/AnalysisTrain/alienplugin.tgz as advised here.

This should be untar'd in a clean directory.

Configuration Modifications

Here is the CreateAlienHandler.C macro. The lines with red portions show the changes made.

AliAnalysisGrid* CreateAlienHandler()
{
// Check if user has a valid token, otherwise make one. This has limitations.
// One can always follow the standard procedure of calling alien-token-init then
// source /tmp/gclient_env_$UID in the current shell.
  if (!AliAnalysisGrid::CreateToken()) return NULL;
  AliAnalysisAlien *plugin = new AliAnalysisAlien();
// Set the run mode (can be "full", "test", "offline", "submit" or "terminate")
  plugin->SetRunMode("full");
// Set versions of used packages
  plugin->SetAPIVersion("V1.1x");
  plugin->SetROOTVersion("v5-26-00b");
  plugin->SetAliROOTVersion("v4-19-04-AN");
// Declare input data to be processed.
// Method 1: Create automatically XML collections using alien 'find' command.
// Define production directory LFN
  plugin->SetGridDataDir("/alice/data/2010/LHC10b");
  // On real reconstructed data:
// plugin->SetGridDataDir("/alice/data/2009/LHC09d");
// Set data search pattern
  plugin->SetDataPattern("*ESDs.root");
// Data pattern for reconstructed data
// plugin->SetDataPattern("*ESDs/pass4/*ESDs.root");
 plugin->SetRunPrefix("000"); //  *uncomment* this for real data
// ...then add run numbers to be considered
  plugin->AddRunNumber(117054);
// plugin->AddRunNumber(104065); // real data
// plugin->SetOutputSingleFolder("output");
// plugin->SetOutputToRunNo();
// Method 2: Declare existing data files (raw collections, xml collections, root file)
// If no path mentioned data is supposed to be in the work directory (see SetGridWorkingDir())
// XML collections added via this method can be combined with the first method if
// the content is compatible (using or not tags)
// plugin->AddDataFile("tag.xml");
// plugin->AddDataFile("/alice/data/2008/LHC08c/000057657/raw/Run57657.Merged.RAW.tag.root");
// Define alien work directory where all files will be copied. Relative to alien $HOME.
  plugin->SetGridWorkingDir("pt117054");
// Declare alien output directory. Relative to working directory.
  plugin->SetGridOutputDir("output117054"); // In this case will be $HOME/pt117054/output117054
// Declare the analysis source files names separated by blanks. To be compiled at runtime
// using ACLiC on the worker nodes.
  plugin->SetAnalysisSource("AliAnalysisTaskPt.cxx");
// Declare all libraries (other than the default ones for the framework). These will be
// loaded by the generated analysis macro. Add all extra files (task .cxx/.h) here.
  plugin->SetAdditionalLibs("AliAnalysisTaskPt.h AliAnalysisTaskPt.cxx");
// Declare the output file names separated by blanks.
// (can be like: file.root or file.root@ALICE::Niham::File)
// plugin->SetOutputFiles("Pt.ESD.1.root");
  plugin->SetDefaultOutputs();
// Optionally define the files to be archived.
// plugin->SetOutputArchive("log_archive.zip:stdout,stderr@ALICE::NIHAM::File root_archive.zip:*.root@ALICE::NIHAM::File");
// plugin->SetOutputArchive("log_archive.zip:stdout,stderr"); // Comment out as now done automatically
// Optionally set a name for the generated analysis macro (default MyAnalysis.C)
  plugin->SetAnalysisMacro("TaskPt.C");
// Optionally set maximum number of input files/subjob (default 100, put 0 to ignore)
  plugin->SetSplitMaxInputFileNumber(100);
// Optionally modify the executable name (default analysis.sh)
  plugin->SetExecutable("TaskPt.sh");
// Optionally set number of failed jobs that will trigger killing waiting sub-jobs.
// plugin->SetMaxInitFailed(5);
// Optionally resubmit threshold.
// plugin->SetMasterResubmitThreshold(90);
// Optionally set time to live (default 30000 sec)
  plugin->SetTTL(30000);
// Optionally set input format (default xml-single)
  plugin->SetInputFormat("xml-single");
// Optionally modify the name of the generated JDL (default analysis.jdl)
  plugin->SetJDLName("TaskPt.jdl");
// Optionally modify job price (default 1)
  plugin->SetPrice(1);   
// Optionally modify split mode (default 'se')  
  plugin->SetSplitMode("se");
  return plugin;
}

Explanation of changes

  1. Enable the most up to date ROOT and AliRoot versions deployed in the GRID. There are regular announcements to the alice-project-analysis-task-force list (to which you should subscribe) telling you the latest versions.
  2. Set the directory from which the search for data begins. Consult Monalisa to find the period and pass you are interested in, click on it and see the directory name in the 'Output dir' column.
  3. Uncomment the SetRunPrefix line for real data: the plugin uses the run number to search the directory hierarchy, and the run subdirectories for real data have a 000 in their names.
  4. Choose a run number to look at, again consulting Monalisa.
  5. Change the name of the working directory from the default - not compulsory but useful.
  6. Change the name of the output directory from the default - again useful to see what is happening.
  7. Comment out the line plugin->SetOutputArchive("log_archive.zip:stdout,stderr"); as this is now handled automatically. (Mail from A. Gheata 29/6/10).
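
Changes 2 to 6 come down to how a few strings are combined. The sketch below is not the plugin's own code. It is a minimal stand-alone illustration, assuming that the directory searched for a run is the data directory followed by the run prefix and run number (as change 3 describes), and that the output directory sits inside the working directory under your alien home. The function names and the home directory used in the note after the code are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: directory searched for one run, assuming the plugin
// appends <prefix><run number> to the data directory (see change 3).
std::string buildRunDir(const std::string& gridDataDir,
                        const std::string& runPrefix,
                        int runNumber)
{
  assert(runNumber > 0);  // run numbers are positive
  return gridDataDir + "/" + runPrefix + std::to_string(runNumber);
}

// Hypothetical helper: full output location, given that the working directory
// is relative to the alien home and the output directory is relative to the
// working directory.
std::string buildOutputDir(const std::string& alienHome,
                           const std::string& workDir,
                           const std::string& outputDir)
{
  return alienHome + "/" + workDir + "/" + outputDir;
}
```

With the values used in the macro, buildRunDir("/alice/data/2010/LHC10b", "000", 117054) gives /alice/data/2010/LHC10b/000117054, which is where the "*ESDs.root" pattern would then be searched for. For an assumed home of /alice/cern.ch/user/j/jdoe, buildOutputDir with "pt117054" and "output117054" gives /alice/cern.ch/user/j/jdoe/pt117054/output117054.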

Testing

At this point you can test your changes to the CreateAlienHandler.C macro simply by using the runGrid.C which you downloaded to run the simple pT analysis task. First obtain an alien token in the usual way then do:

root runGrid.C

This should submit your jobs, leaving you in the alien shell. You can check on the progress of your jobs using commands such as 'ps' (see 'Monitoring your task once you have submitted it' below). Basically they should start with status 'I' for inserting, move to 'W' for waiting and then to 'R' for running. You should be patient! It may take some time for a job to start running if there are already jobs running at the site hosting the copy or copies of the files that you request. Ideally they will all end in status 'D' for done (and not EV, ESV etc., which are various error states).

Merging

You can wait until all jobs are done and then exit the alien shell. The merging of the outputs from the subjobs will then take place and the output file will be returned to your working directory. If you do not want to wait for all your jobs to be done you can exit, and merging of the completed jobs takes place. At a later time, after checking in aliensh that all the jobs are done, you can re-run the merging. To do that, edit CreateAlienHandler.C, changing the line

plugin->SetRunMode("full");
to 
plugin->SetRunMode("terminate");

Don't forget to change it back the next time you run.
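
One way to avoid forgetting is to choose the run mode from a flag rather than editing the string by hand each time. The following is only a sketch of that idea in plain C++, independent of AliRoot: the function names are hypothetical, and the list of modes is the one given in the comment in CreateAlienHandler.C.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>

// Run modes listed in the comment in CreateAlienHandler.C.
bool isKnownRunMode(const std::string& mode)
{
  static const std::array<std::string, 5> kModes = {
      {"full", "test", "offline", "submit", "terminate"}};
  return std::find(kModes.begin(), kModes.end(), mode) != kModes.end();
}

// Hypothetical helper: "terminate" only re-runs the merging step, while
// "full" submits the jobs and merges when you leave the alien shell.
std::string chooseRunMode(bool mergeOnly)
{
  const std::string mode = mergeOnly ? "terminate" : "full";
  assert(isKnownRunMode(mode));  // guards against typos if this is edited
  return mode;
}
```

The returned string could then be handed to plugin->SetRunMode() (via c_str()), so that the default stays "full" unless merging alone is explicitly requested.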

Running your own analysis task

By running the procedure outlined above you have verified that all of the grid infrastructure is working correctly. However it is not very interesting to run the AliAnalysisTaskPt task. You could modify it to do more interesting things, but more likely you already have an analysis task that you do not want to re-write. To get your task running on the grid you need to make further modifications to the CreateAlienHandler.C macro. The line

plugin->SetAnalysisSource("AliAnalysisTaskPt.cxx");

is changed to use your own source code, for example

plugin->SetAnalysisSource("AliMult.cxx");

Similarly,

plugin->SetAdditionalLibs("AliAnalysisTaskPt.h AliAnalysisTaskPt.cxx");

is changed to include all the .cxx and .h files that you need, e.g.

plugin->SetAdditionalLibs("AliMult.h AliMult.cxx");

You should also change the following lines so that the names given to the various automatically produced files are more sensible, although this is not strictly necessary:

plugin->SetGridWorkingDir("pt117054");

to, e.g.,

plugin->SetGridWorkingDir("mult117054");

then

plugin->SetAnalysisMacro("TaskPt.C");

to, e.g.,

plugin->SetAnalysisMacro("TaskMult.C");

then

plugin->SetExecutable("TaskPt.sh");

to, e.g.,

plugin->SetExecutable("TaskMult.sh");

and finally

plugin->SetJDLName("TaskPt.jdl");

to, e.g.,

plugin->SetJDLName("TaskMult.jdl");
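
Since all of these names follow the same pattern, they can be derived from a single tag and run number so that they stay consistent. The sketch below is a stand-alone illustration of that pattern only: the struct and function are hypothetical and not part of the plugin.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical container for the names set in CreateAlienHandler.C.
struct GridJobNames {
  std::string workDir;     // for SetGridWorkingDir
  std::string outputDir;   // for SetGridOutputDir
  std::string macro;       // for SetAnalysisMacro
  std::string executable;  // for SetExecutable
  std::string jdl;         // for SetJDLName
};

// Build all names from one tag (e.g. "Mult") and a run number, following the
// naming pattern used in the examples on this page.
GridJobNames makeGridJobNames(const std::string& tag, int runNumber)
{
  assert(!tag.empty() && runNumber > 0);
  std::string lower = tag;
  for (char& c : lower)
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  const std::string run = std::to_string(runNumber);

  GridJobNames names;
  names.workDir = lower + run;
  names.outputDir = "output" + run;
  names.macro = "Task" + tag + ".C";
  names.executable = "Task" + tag + ".sh";
  names.jdl = "Task" + tag + ".jdl";
  return names;
}
```

For example, makeGridJobNames("Mult", 117054) yields mult117054, output117054, TaskMult.C, TaskMult.sh and TaskMult.jdl, each of which could be passed to the corresponding plugin setter with c_str().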

The next step is to modify the macro which you use to run your job. It is not possible to specify exactly what needs to be done for each individual case. Basically you are merging the runGrid.C macro provided with your existing "run.C" macro. The following changes are generally needed though. Add the following to the libraries loaded at the start:

  gSystem->Load("libCore.so"); 
  gSystem->Load("libTree.so");
  gSystem->Load("libGeom.so");

Add the following in place of the connection to PROOF:

// Create and configure the alien handler plugin
  gROOT->LoadMacro("CreateAlienHandler.C");
  AliAnalysisGrid *alienHandler = CreateAlienHandler(); 
  if (!alienHandler) return;

Remove

   gROOT->LoadMacro("$ALICE_ROOT/PWG0/CreateESDChain.C");

When creating your task change

  gProof->Load("AliMult.cxx++g"); 

to

  gROOT->LoadMacro("AliMult.cxx++g"); 

After creating the analysis manager, add the line

    mgr->SetGridHandler(alienHandler); 

Finally change

    mgr->StartAnalysis("proof","/ALICE/pp007000/LHC10b_000114783_p1");

to

  mgr->StartAnalysis("grid");

As an example, I have attached two files showing the changes made to Arvinder's macro for running on SKAF so that it runs on the grid.

  • skaf.C: Original macro to run user task before adaptation

  • runGridAliMult.C: Real example of macro adapted to run on Grid using the Alien plugin.

Monitoring your task once you have submitted it

When you have submitted your grid job you are given a job number. You can check on all the jobs you have submitted from within aliensh with

  ps

or, for only the jobs that are currently running,

  ps -X

To check what is happening with a specific job, do

  ps -trace jobnumber all

where "all" is an optional argument that prints the entire status history of the job.

To look at the output of a job when it is in progress use the spy command:

  spy jobnumber workdir

or

  spy jobnumber stdout

where workdir prints the job working directory and stdout prints the job's stdout. These only work when a job is running, i.e. has status "R".
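
If you find yourself typing these commands for many jobs, the text of the command can be assembled programmatically. This is only a sketch: the helpers are hypothetical, they build the command strings shown above and nothing more, and the job number in the note below is made up.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: text of the trace command for one job.
std::string traceCommand(long jobNumber, bool fullHistory)
{
  assert(jobNumber > 0);
  std::string cmd = "ps -trace " + std::to_string(jobNumber);
  if (fullHistory) cmd += " all";  // optional: entire status history
  return cmd;
}

// The two spy targets described on this page.
enum class SpyTarget { WorkDir, StdOut };

// Hypothetical helper: text of the spy command for one job.
std::string spyCommand(long jobNumber, SpyTarget target)
{
  assert(jobNumber > 0);
  const char* what = (target == SpyTarget::WorkDir) ? "workdir" : "stdout";
  return "spy " + std::to_string(jobNumber) + " " + what;
}
```

For a made-up job number 12345, traceCommand(12345, true) gives "ps -trace 12345 all" and spyCommand(12345, SpyTarget::StdOut) gives "spy 12345 stdout".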

There are other commands (see the offline bible/alien tutorial) but these are probably the most useful.

When you look at a job with ps, it can have various status codes. For more information see the offline bible or this job status flow chart. When a job runs without problems, it will go through the codes in the order below:

I: inserting

W: waiting

ST: started/staging

R: running

SV: saving

SVD: saved

D: done

If there is a problem, it can give one of these error codes:

ESP: typically, the required dataset path does not exist. Check the run has been reconstructed, and the right pass chosen.

EI / EA: normally a service failure rather than a problem with your job. Try again later.

EIB: error in download of input files. Probably the input file does not exist in the storage element or the storage element is unreachable from the job worker node.

EV: job validation failed (your validation script returned nonzero)

ESV: an output file could not be saved, probably due to an unavailable storage element.

Z / EXP: job got lost on a worker node due to a node or network failure. Resubmit the job.

ESP has also been seen when there is a problem (like a typo) in the jdl, although this does not seem to be documented anywhere.
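
The status codes above can be captured in a small lookup, for instance if you want to sort the codes reported by ps into normal and failed states. The sketch below encodes only what is listed on this page: the function names are hypothetical and the lists are not an official or complete table of AliEn states.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Status codes in the order a healthy job passes through them.
const std::vector<std::string>& normalStatusOrder()
{
  static const std::vector<std::string> kOrder = {
      "I", "W", "ST", "R", "SV", "SVD", "D"};
  return kOrder;
}

// Position of a status in the normal sequence, or -1 if it is not part of it.
int normalStageIndex(const std::string& status)
{
  const auto& order = normalStatusOrder();
  const auto it = std::find(order.begin(), order.end(), status);
  return it == order.end() ? -1 : static_cast<int>(it - order.begin());
}

// True for the error states described on this page.
bool isErrorStatus(const std::string& status)
{
  static const std::vector<std::string> kErrors = {
      "ESP", "EI", "EA", "EIB", "EV", "ESV", "Z", "EXP"};
  return std::find(kErrors.begin(), kErrors.end(), status) != kErrors.end();
}

// The spy command only works while the job has status "R".
bool canSpy(const std::string& status) { return status == "R"; }
```

A job whose code returns -1 from normalStageIndex and true from isErrorStatus is one to investigate with ps -trace or to resubmit.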

An alternative monitoring solution is provided by Monalisa in the 'My Jobs' link in the bar at the top.

-- PatrickScott - 20 May 2010

Topic attachments
Attachment         Size   Date                  Who        Comment
runGridAliMult.C   2.0 K  19 May 2010 - 17:13   LeeBarnby  Real example of macro adapted to run on Grid using the Alien plugin.
skaf.C             2.0 K  19 May 2010 - 17:14   LeeBarnby  Original macro to run user task before adaptation
Topic revision: r11 - 01 Dec 2010 - 01:28:01 - ArvinderPalaha
 