TWiki> General Web>Lhcb? >TMVA (30 Sep 2013, _47C_61UK_47O_61eScience_47OU_61Birmingham_47L_61ParticlePhysics_47CN_61james_32mccarthy? ) EditAttach

Using the TMVA Tool in ROOT

This page describes how to use the TMVA tool, which is a tool for performing a Multivariate analysis in root. More details can be found in the TMVA manual here http://tmva.sourceforge.net/docu/TMVAUsersGuide.pdf This page will focus on using TMVA with a Boosted Decision Tree (BDT), but other techniques work in a similar way. For more detailed code see the test code available locally in:
 /home/lhsoft/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc46-opt/root/tmva/test/ 
or from lxplus:
 /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc46-opt/root/tmva/test/ 

The TMVA tool is split into two stages: the training stage, and the application stage.

The Training Stage

First you need to load the TMVA libraries so ROOT can interpret them.
 TMVA::Tools::Instance(); 
You need to make an output file which will contain the restults from the TMVA.
 TFile* outputFile = TFile::Open(outfileName, "RECREATE"); 
You need to declare a TMVA Factory, which carries out the analysis for you.
 TMVA::Factory *factory = new TMVA::Factory("TMVAClassification",outputFile,"V:!Silent:Color:Transformations=I:DrawProgressBar:AnalysisType=Classification"); 
You can specify the configuration options here, including the verbosity level of the output. For the list of transformations, use I (the identity transformation) to get plots of input variables for signal and background. This is very useful for comparisons. All the options are available here: http://tmva.sourceforge.net/optionRef.html#Factory

You need to have 2 samples for the TMVA, one background sample and one signal sample. You can apply cuts to the samples to ensure they are signal and background. For example, you might want to match the signal to MCTruth, and cut on the reconstructed mass of the background to ensure it is in the sideband. It is important that the signal and background represent the real signal and background as closely as possible for optimal performance of the BDT. This means you might want to cut on your trigger lines before the BDT, so that only signal which passes the trigger is used for training. In addition, any cuts applied to variables used in the training (or correlated with such variables) need to be the same in data and Monte Carlo, or TMVA will have a hard time using them.

  Double_t signalWeight = 1.0;
  Double_t backgroundWeight=1.0;
  
  TFile* signal = new TFile("/home/lhdata1/BdEtapKs/Rhog/MonteC/Stripping20/MC11-magall-BdetapKs-rhog-DD-stripping20-2.root");
  TTree* sigTree = (TTree*)(signal->Get("tree"));
   
  TFile* data = new TFile("/home/lhdata1/BdEtapKs/Rhog/data/2012/Stripping20/data2012-magup-Bhadron-B2XEtaB2etapKSDDLine-stripping20-20mil.root");
  TTree* dataTree = (TTree*)(data->Get("DecayTreeTuple/DecayTree"));
             
  factory->AddSignalTree(sigTree,signalWeight);
  factory->AddBackgroundTree(dataTree,backgroundWeight);

  TCut mycuts="(B0_TRUEID==511||B0_TRUEID==-511)";
  TCut mycutb="B0_MM>5400."; 

Next, define the list of variables you want to use to train the BDT. These can either be variables found in your ntuple (i.e. B0_PT) or you can perform mathematical operations on the variables first (i.e. log(B0_FDCHI2_OWNPV)). For example, for a variable which tends to be close to zero, you can get a better separation between signal and background by taking the logarithm of the variable. You can rename the variable if you wish, using the format "name := operation(variable)". Finally you need ned to specify whether the variable is float or double ('F') or integer/boolean ('I')

  factory->AddVariable("log_B0_FD := log(B0_FDCHI2_OWNPV)",'F');
  factory->AddVariable("B0_PT",'F');
  factory->AddVariable("log_B0_DIRA := log(1-B0_DIRA_OWNPV)",'F'); 

Prepare the factory for training, along with the cuts defined for the signal and background trees. The samples you have will be split with half being used to train, and half to test that you don't have any overtraining. You can define how you want them split, here randomly.

   factory->PrepareTrainingAndTestTree(mycuts,mycutb,"random"); 

Now define the TMVA method you want, in this case a BDT. You can also specify options for the method here. We specify the number of trees to train in the BDT and the maximum depth of each tree. More trees and a greater depth is better, but if you don't have the statistics to train it then you will overtrain the BDT, so you need to compromise.

   factory->BookMethod(TMVA::Types::kBDT,"BDT","NTrees=400:MaxDepth=2"); 

Finally, train and test the BDT.

  factory->TrainAllMethods();
  
  factory->TestAllMethods();
  
  factory->EvaluateAllMethods();

Once this is done, a root file is created with the results of the training. you can view this file automatically using:

    if (!gROOT->IsBatch()) TMVAGui( outfileName ); 

Using this, you can see the performance of each variable (signal vs background), and also the BDT response for signal and background after training. If you look at the overtraining test, then you can see the BDT response of the test and training samples superimposed. If they do not match, then the BDT has been overtrained, which usually means you don't have enough events in you training samples, and the BDT will not perform as well as you expect.

Variables

Signal vs Bkg

variables_id_c1_DD.png

BDT response.

No overtraining is present

overtrain_BDT.png

The important output from the TMVA tool is the weights file (TMVAClassification.weights.xml and TMVAClassification.class.C). These are the files used to apply the BDT on some data.

The Application Stage

The application stage takes the BDT trained using signal and background samples, and applies it to a sample with an unknown mixture of signal and background (i.e. real data).

Define the input file with the data:

  TFile* data = new TFile("real_data.root"); 
  TTree* dataTree = (TTree*)(data->Get("tree"));

Also the name of the output file which will contain the BDT response of each event:

  TFile *target = new TFile("real_data-mva_output.root","RECREATE" );
  TTree *tree = new TTree("tree","treelibrated tree");

The application is handled using the Reader class in TMVA

   TMVA::Tools::Instance();
  TMVA::Reader *reader = new TMVA::Reader( "V:Color:!Silent" );   

Define the varibles which were used in training, and assign them to a variable. The names of the variables must match exactly the weights files, and must also be defined in the same order.

  Float_t var[3];

  reader->AddVariable("log_B0_FD := log(B0_FDCHI2_OWNPV)",&var[0]);
  reader->AddVariable("B0_PT",&var[1]);
  reader->AddVariable("log_B0_DIRA := log(1-B0_DIRA_OWNPV)",&var[2]);

Define the TMVA method used, and add the weights files obtained from the training stage to the reader:

  reader->BookMVA("BDT method", "TMVAClassification.weights.xml");  

Now define the variables in the data sample, corresponding to the variables used in the BDT.

  Float_t userVar[3];

  dataTree->SetBranchAddress("B0_FDCHI2_OWNPV",&userVar[0]);
  dataTree->SetBranchAddress("B0_PT",&userVar[1]);
  dataTree->SetBranchAddress("B0_DIRA_OWNPV",&userVar[2]);

You might need to define some variables to add to the output file to view later

 
 Float_t B0_MM, KS0_MM, eta_prime_MM, B0_PVFit_M;

  dataTree->SetBranchAddress("B0_MM",&B0_MM);
  dataTree->SetBranchAddress("KS0_MM",&KS0_MM);
  dataTree->SetBranchAddress("eta_prime_MM",&eta_prime_MM);
  dataTree->SetBranchAddress("B0_PVFit_M",&B0_PVFit_M);

  tree->Branch("B0_MM",&B0_MM);
  tree->Branch("KS0_MM",&KS0_MM);
  tree->Branch("eta_prime_MM",&eta_prime_MM);
  tree->Branch("B0_PVFit_M",&B0_PVFit_M);

Add a branch to store the BDT response of events:

  Float_t BDT_response
  tree->Branch("BDT_response",&BDT_response);

Loop over all events in the data sample, and evaluate the BDT response. This is also where you will need to perform the mathematical operation such as log(var) etc. Finally write out the tree to the output file.

  for (Long64_t ievt=0; ievt<dataTree->GetEntries();ievt++) {

    if (ievt%100000 == 0) std::cout << "--- ... Processing event: " << ievt <<std::endl;
    
    dataTree->GetEntry(ievt);

    var[0]=TMath::Log(userVar[0]);
    var[1]=userVar[1];
    var[2]=TMath::Log(1-userVar[2]);

    BDT_response=reader->EvaluateMVA("BDT method");

    tree->Fill();

  }
  
  tree->Write();
  

Now you can get cut on the BDT response and cut out all of that nasty background (almost).

Fig2a.png

The full example scripts, from which the snippets of code used here are taken, can be found and run in

 /home/lhdata1/Y4Project/ExampleScripts 

For the training stage, run

 root -l tmvaBDT.C 

and for the application stage:

 root -l tmvaApplication.C 

-- JamesMccarthy - 23 Nov 2012

Topic attachments
I Attachment History Action Size Date Who Comment
Pngpng Fig2a.png r1 manage 20.2 K 30 Sep 2013 - 20:16 UnknownUser  
Pngpng overtrain_BDT.png r1 manage 29.3 K 30 Sep 2013 - 19:40 UnknownUser  
Pngpng variables_id_c1_DD.png r1 manage 28.0 K 30 Sep 2013 - 19:39 UnknownUser  
Edit | Attach | Watch | Print version | History: r2 < r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r2 - 30 Sep 2013 - _47C_61UK_47O_61eScience_47OU_61Birmingham_47L_61ParticlePhysics_47CN_61james_32mccarthy?
 
This site is powered by the TWiki collaboration platform Powered by Perl This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback