Local Grid Bonding
General DPM performance issues
The rate at which data can be read from or written to a disk pool node is limited by several factors:
- the speed at which data can be read from or written to the DPM disk areas. This is limited by the intrinsic speed of the disk devices in the RAID, and by the method of connection of the RAID units to the node, which in our case is via a dual SCSI interface, each channel running at 320 MBytes/sec.
- the speed at which data can pass through the network between client and server. This is affected by the number and speed of the network interfaces, and by the network switch setup.
- in both these aspects, there is contention between the different clients accessing the SE or pool nodes simultaneously: each receives a share of the available bandwidth. This favours a setup with multiple disk pool nodes, either each exclusively handling a reasonably small quantity of data, or each having uncontended access to all the storage.
- there is also contention within a worker node (WN) between the different clients (e.g. rfcp) running on the same node, as these share the same WN network interface(s).
Implementing network bonding
I've implemented network bonding on the disk pool node epgsr1, to help with a bottleneck which became clear during the running of STEP 09. Subsequently we have implemented bonding on all our disk pool nodes.
Switch Setup
The two gigabit interfaces on epgsr1 are now both connected to the dLink 48-port switch, on ports 17 and 18. In readiness, those two ports have been declared as trunked, using the switch GUI. The switch has been physically labelled accordingly.
Trunking means that the switch is aware that outgoing packets (from epgsr1) on those two ports can have the same source MAC address, and that incoming packets (to epgsr1) are to be distributed between the two ports. In practice, the port offset (0, 1, ...) within the trunk-set is given by (source MAC address XOR destination MAC address) modulo (number of trunked ports); this has been confirmed by tests on several combinations of nodes. It is a form of load sharing which generally works quite well on average.
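As an illustration of that port-selection rule (the MAC addresses here are made-up, reduced to their last octet, which is all that matters for a 2-port trunk):

```shell
# Illustrative only: two made-up MAC last octets and a 2-port trunk.
src_mac=0x1A        # last octet of the source MAC address
dst_mac=0x2D        # last octet of the destination MAC address
ports=2             # number of ports in the trunk-set

# Port offset within the trunk-set: (src XOR dst) mod ports
offset=$(( (src_mac ^ dst_mac) % ports ))
echo "offset $offset"    # 0x1A ^ 0x2D = 0x37 (odd), so offset 1

# Note: if both MAC addresses are even (low bit 0), the XOR is even
# too, so on a 2-port trunk the offset is always 0 -- the skew
# described in the next paragraph.
```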
It's an unfortunate fact, though, that the MAC addresses of all our Supermicro twin nodes are even, as is the MAC address of the edge-switch gateway that communicates with the outside world (including the BlueBEAR WNs). So for incoming data, one switch port is used predominantly, rather than the load being shared across both ports in the trunk. (Possible solutions: use the eth1 port on half the workers in place of eth0, or increase the number of gigabit interfaces on epgsr1 to 3.)
This is not a big issue for us, because most of the benefit of trunking is in reading from the disk pool node, which is outgoing traffic, not incoming. The port used for outgoing traffic is determined by the bonding module in the Linux system on epgsr1: see the next section.
Note added 2011: for some time now we have been making use of 4-way bonding on all our disk pool nodes. This improves the sharing of incoming data between the eth0-3 ports.
Pool node Setup
A good reference for the kernel bonding module is /usr/share/doc/kernel-doc-version/Documentation/networking/bonding.txt (where version is the kernel version), which is also available on the web. I've chosen the balance-rr mode of distributing packets, which means that packets are transmitted in round-robin (sequential) order to the ports in the bond. As the documentation says, this provides both load balancing and fault tolerance.
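The mode is selected when the bonding module is loaded. A minimal sketch for a modprobe-era system, using the standard module options described in bonding.txt (the miimon value here is illustrative), in /etc/modprobe.conf:

```text
alias bond0 bonding
options bond0 mode=balance-rr miimon=100
```

miimon=100 makes the driver check link state every 100 ms, giving the fault-tolerance part of balance-rr.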
In brief,
- In /etc/sysconfig/network-scripts directory, files ifcfg-eth0, ifcfg-eth1 and ifcfg-bond0 were modified / created as required. In practice, this was first done in a duplicate directory network-scripts.bonding, with a copy of the original at directory network-scripts.normal, to make it easy to move between the two scenarios.
- ifcfg-eth0:
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no
- ifcfg-eth1 to ifcfg-eth3: same as above, except with DEVICE=eth1, eth2 or eth3 respectively
- ifcfg-bond0:
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=147.188.xx.8
GATEWAY=147.188.xx.1
NETWORK=147.188.xx.0
NETMASK=255.255.255.0
USERCTL=no
TYPE=Ethernet
- The gigabit ports were connected to the new trunked pair on the dLink switch.
- A service network restart was done from the node console.
- It took several seconds before network connectivity resumed, while the switch worked out where to send packets in the new scenario. This was done on a live system, and network connections that were already in place continued without a break.
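After the restart, the state of the bond can be checked via the bonding driver's proc file, /proc/net/bonding/bond0. The contents shown below are an abridged, illustrative sample, not captured from epgsr1:

```shell
# Abridged, illustrative sample of /proc/net/bonding/bond0; on the
# node itself one would simply run: cat /proc/net/bonding/bond0
sample='Bonding Mode: load balancing (round-robin)
Slave Interface: eth0
MII Status: up
Slave Interface: eth1
MII Status: up'

# Check the driver is in balance-rr mode, and count the enslaved ports
echo "$sample" | grep -q 'round-robin' && echo "balance-rr active"
echo "$(echo "$sample" | grep -c 'Slave Interface') slave interfaces"
```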
Monitoring of the interfaces
My ifrate command gives output like this, in a busy environment with lots of outgoing data, and a little incoming data:
bond0: 8 MB/s in 245 MB/s out
eth0: 0 MB/s in 122 MB/s out
eth1: 8 MB/s in 123 MB/s out
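ifrate is a local command; a rough sketch of how such a monitor can work (this is an assumed reimplementation, not its actual source) is to sample the byte counters in /proc/net/dev twice over an interval:

```shell
#!/bin/bash
# Sketch of an ifrate-style monitor (assumed reimplementation, not
# the actual ifrate source): prints per-interface in/out rates.

snap() {
    # /proc/net/dev: two header lines, then per-interface counters;
    # after turning the colon into a space, field 2 is rx bytes and
    # field 10 is tx bytes
    awk 'NR > 2 { sub(/:/, " "); print $1, $2, $10 }' /proc/net/dev
}

rates() {                     # $1 = sampling interval in seconds
    local t=${1:-1}
    paste <(snap) <(sleep "$t"; snap) | awk -v t="$t" '{
        printf "%s: %.0f MB/s in %.0f MB/s out\n",
               $1, ($5 - $2) / t / 1048576, ($6 - $3) / t / 1048576
    }'
}

rates 1
```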
Effect on Pool node CPU utilisation
The rfiod daemons are more active now that they can deliver data faster, but are still well within the capabilities of the AMD quad-core processor on this node. Note added 2011: the disk pool nodes now use dual quad-core Intel processors.
-- Originally created by LawrenceLowe - 11 Jun 2009