Local Grid Bonding

General DPM performance issues

The rate at which data can be read from or written to a disk pool node is limited by several factors:

  • the speed at which data can be read from or written to the DPM disk areas. This is limited by the intrinsic speed of the disk devices in the RAID, and by the method of connection of the RAID units to the node, which in our case is a dual SCSI interface, each channel at 320 MBytes/sec.
  • the speed at which data can pass through the network between client and server. This is affected by the number and speed of the network interfaces, and by the network switch setup.
  • in both these aspects, there is contention between the different clients accessing the SE or pool nodes simultaneously: each receives a share of the available bandwidth. This favours a setup with multiple disk pool nodes, either each exclusively handling a reasonably small quantity of data, or each having uncontended access to all the storage.
  • there is also contention within a worker node (WN), between the different clients (e.g. rfcp) on the same node, as these share the same WN network interface(s).

Implementing network bonding

I've implemented network bonding on the disk pool node epgsr1, to help alleviate a bottleneck that became apparent during the running of STEP 09. We have since implemented bonding on all our disk pool nodes.

Switch Setup

The two gigabit interfaces on epgsr1 are now both connected to the dLink 48-port switch, on ports 17 and 18. In readiness, those two ports have been declared as trunked, using the switch GUI. The switch has been physically labelled accordingly.

Trunking means that the switch is aware that outgoing packets (from epgsr1) on those two ports can have the same source MAC address, and that incoming packets (to epgsr1) are to be distributed between the two ports. In practice, the port offset (0, 1, ...) within the trunk-set is given by (source MAC address XOR destination MAC address) modulo (number of trunked ports). This was confirmed by tests on several combinations of nodes. It is a form of load sharing which generally works quite well on average.

It's an unfortunate fact, though, that all the MAC addresses of our Supermicro twin nodes are even, as is the MAC address of the edge switch gateway that communicates through to the outside world (including BlueBEAR WNs). The XOR of two even numbers is itself even, so with two trunked ports the offset is always 0: for incoming data, one switch port is used predominantly, rather than the load being shared across both ports in the trunk. (Possible solutions: use the eth1 port on half the workers in place of eth0; or increase the number of gigabit interfaces on epgsr1 to 3, since with an odd modulus the even XOR values no longer all map to the same port.)
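The port-selection rule can be illustrated with a couple of lines of shell. The MAC octets below are hypothetical (in practice only the low-order bytes matter for the parity argument; real values would come from ip link or ifconfig):

```shell
# Hypothetical last octets of two MAC addresses, both even, as on our
# Supermicro twins and the edge switch gateway.
src=0x9c   # worker-node MAC (even)
dst=0x2e   # pool-node / gateway MAC (even)
ports=2    # number of ports in the trunk-set

# Port offset within the trunk = (src XOR dst) mod ports
offset=$(( (src ^ dst) % ports ))
echo "trunk port offset: $offset"
# Even XOR even is always even, so with 2 trunked ports the offset is
# always 0 for such pairs: one port carries nearly all the incoming data.
```

With any pair of even MACs the result is 0, which is exactly the one-sided incoming traffic described above.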

This is not a big issue for us, because most of the benefit of trunking is in reading from the disk pool node, which is outgoing traffic, not incoming. The port used for outgoing traffic is determined by the bonding module in the Linux system on epgsr1: see next section.

Note added 2011: for some time now we have been making use of 4-way bonding on all our disk pool nodes. This improves the sharing of incoming data between the eth0-3 ports.

Pool node Setup

A good reference for the kernel bonding module is /usr/share/doc/kernel-doc-<version>/Documentation/networking/bonding.txt, which is also available on the web. I've chosen the balance-rr mode of distributing packets, which means that packets are transmitted in round-robin (sequential) order across the ports in the bond. As the documentation says, this provides load balancing and fault tolerance.

In brief,

  • /etc/modprobe.conf has two lines added, to allow loading of the bonding driver when the interface is referenced:
                alias bond0 bonding
                options bond0 mode=balance-rr miimon=100

  • In /etc/sysconfig/network-scripts directory, files ifcfg-eth0, ifcfg-eth1 and ifcfg-bond0 were modified / created as required. In practice, this was first done in a duplicate directory network-scripts.bonding, with a copy of the original at directory network-scripts.normal, to make it easy to move between the two scenarios.
  • ifcfg-eth0:
    DEVICE=eth0
    BOOTPROTO=none
    ONBOOT=yes  
    MASTER=bond0
    SLAVE=yes
    USERCTL=no
    
  • ifcfg-eth1 to ifcfg-eth3: same as above, except with DEVICE=eth1 through DEVICE=eth3 respectively
  • ifcfg-bond0:
    DEVICE=bond0
    BOOTPROTO=none
    ONBOOT=yes
    IPADDR=147.188.xx.8
    GATEWAY=147.188.xx.1
    NETWORK=147.188.xx.0
    NETMASK=255.255.255.0
    USERCTL=no
    TYPE=Ethernet
    
  • The gigabit ports were connected to the new trunked pair on the dLink switch.
  • A service network restart was done, from the node console.
  • It took several seconds before network connectivity was resumed, while the switch worked out where to send packets in the new scenario. This was done on a live system, and network connections that were already in place continued without break.
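After the restart, the state of the bond can be checked through the standard Linux bonding interfaces. A sketch of the sort of sanity checks involved (run on the pool node itself; a host with no bond0 simply takes the else branch):

```shell
# Verify the bond after 'service network restart'. /proc/net/bonding/bond0
# is provided by the bonding module and reports the mode, the MII link
# status, and the enslaved interfaces.
if [ -r /proc/net/bonding/bond0 ]; then
    grep -E 'Bonding Mode|MII Status|Slave Interface' /proc/net/bonding/bond0
    ip addr show bond0 | grep 'inet '   # the IP should sit on bond0, not ethN
else
    echo "no bond0 on this host"
fi
```

The Bonding Mode line should read "load balancing (round-robin)", and each slave should show an MII Status of "up".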

Monitoring of the interfaces

My ifrate command gives output like this, in a busy environment with lots of outgoing data, and a little incoming data:

bond0:    8 MB/s in  245 MB/s out
eth0:     0 MB/s in  122 MB/s out
eth1:     8 MB/s in  123 MB/s out
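ifrate is my own script, but the idea is easily reimplemented from the standard /proc/net/dev byte counters. A minimal sketch (the function name and output format are mine, not the original tool's): sample the counters one second apart and report the difference as MB/s.

```shell
# Minimal ifrate-style monitor: reads rx/tx byte counters for each named
# interface from /proc/net/dev, samples them 1 second apart, and prints
# the rates in MB/s. Usage: ifrate_sketch bond0 eth0 eth1
ifrate_sketch() {
    for dev in "$@"; do
        # Fields after the interface name: rx bytes is field 2, tx bytes
        # is field 10 (8 rx fields precede the tx group).
        s1=$(awk -v d="$dev" '{sub(/^ */,""); split($0,f,/[: ]+/);
                               if (f[1]==d) print f[2], f[10]}' /proc/net/dev)
        [ -n "$s1" ] || { echo "$dev: no such interface"; continue; }
        sleep 1
        s2=$(awk -v d="$dev" '{sub(/^ */,""); split($0,f,/[: ]+/);
                               if (f[1]==d) print f[2], f[10]}' /proc/net/dev)
        r1=${s1% *}; t1=${s1#* }
        r2=${s2% *}; t2=${s2#* }
        echo "$dev: $(( (r2 - r1) / 1048576 )) MB/s in  $(( (t2 - t1) / 1048576 )) MB/s out"
    done
}

ifrate_sketch lo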

Effect on Pool node CPU utilisation

The rfiod daemons are more active now they can deliver data faster, but are still well within the capabilities of the AMD quad core processor on this node. Note added 2011: the disk pool nodes now use dual quad Intel processors.
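The rfiod load itself can be summed up from ps. A small sketch (on a host not running DPM the sum is simply 0.0):

```shell
# Aggregate CPU utilisation of the rfiod daemons (the RFIO data-serving
# processes on a DPM disk pool node). Headerless ps output keeps the awk
# filter simple; the END block prints even when no rfiod is running.
ps -e -o pcpu=,comm= | awk '$2 == "rfiod" {sum += $1}
                            END {printf "rfiod total %%CPU: %.1f\n", sum + 0}'
```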

-- Originally created by LawrenceLowe - 11 Jun 2009

Topic revision: r5 - 06 Oct 2011 - 15:01:15 - LawrenceLowe