chroot as an alternative to full virtualization

Author: L.S.Lowe. File: syschroot. This update: 20120602. Part of Guide to the Local System.

We have a cluster and a set of desktop PCs where the base system is Fedora 12 or Fedora 15, but where there are also available complete SL4.8 and SL5.4 sub-systems, where SL is Scientific Linux, similar to CentOS and based on Red Hat Enterprise Linux. There's the opportunity to extend that to have a bigger set of different sub-systems, including 32-bit and 64-bit versions of the same system, if we wanted them.

Additional filesystems which you would normally expect to see in a native system (like the filesystem that contains your $HOME directory, and optionally the /tmp filesystem) are specially mounted inside the alternative images too.

Users switch to one of the alternative sub-systems using a specially written binary like my imageswitch command invoked as sl4 or sl5 or (earlier) /bin4/bash which provides the required chroot to the alternative system. For our desktop systems, users get a SL4 and SL5 icon which opens a window in the corresponding environment. For jobs, users use the qsub4 or qsub5 commands, which make use of the -D option of the qsub command in Torque, as Torque provides its own perfectly useable internal chroot facility (since version 1.1.0p3).

This is a bit different to the normal use of the chroot system call, which normally is used in order to give less access to system facilities, rather than similar access to a different system: for example a chrooted environment used in a web server for security purposes. See reference links at the bottom of the page for more examples of Container virtualization.

Why does it work?

When you enter one of the special commands like /bin4/sh, then the command changes the filesystem root to the root of the alternative installation, eg SL4, changes the current directory to the same directory but in the SL4 image, and then invokes the command of the same name in the SL4 image. So /bin4/bash invokes bash in the changed-root system. Because your $HOME and other files are mounted in the SL4 image also, you can continue to see them.

Your programs will therefore use the scripts and run-time libraries and other system files of the SL4 or SL5 installation, rather than the native installation. So they are operating in a virtual application environment. Those libraries may call a system kernel interface to provide certain facilities, and in this case that kernel is the native Fedora 12 kernel, not a SL4 or SL5 kernel. However, the assumption is that the kernel facilities are backward-compatible between Fedora 12, SL5 and SL4.

As an analogy, consider a fully-resolved (static) program. If and when a program is compiled on one system with all library references resolved, and so with no run-time libraries, and you copy it to a second different system, you would expect it to work flawlessly, if the second system was later but backwards-compatible with the first system. It would be a very brave or foolhardy vendor or kernel developer who altered the kernel interfaces to cause such a static program to fail. In our case, we are not fully-resolving the library references but instead we are providing all the run-time libraries of the first system too; we rely equally on the backward-compatibility of the kernel interfaces, and so for the same reason we expect that method to work flawlessly too.

What the syschroot method preserves

it preserves all environment variables, including Torque ones like PBS_JOBID and PBS_O_WORKDIR,
it accepts and passes on the supplied arguments,
it can be used in a script #!interpreter line,
it preserves the current directory,
it preserves the current umask

The extra mounts required

The alternative system(s) need to have additional mounts in place in order for things to work as expected. For example, the users' files will almost certainly need to be available within the chrooted system, and so will need to be mounted there. This mounting can be done at native-system boot time: there is no need to do it later at the time the user(s) actually switch to the alternative system.

The --bind option of mount is invaluable here (similar to -o bind). Without bind, you might have to mount mount your NFS file-systems at several different places, which for me (from monitoring packets for a mount with a 2.6.38 kernel) is different from and less efficient than bind mounting, and, without bind, it's possible that some file-systems like GPFS would not be able to be replicated within the chrooted system.

So the following might be added in the /etc/rc.d/rc.local file of the native system:

ch=/sysroot/SL48
nfsmounts=... customise ...
for f in /dev /dev/pts /dev/shm /proc /sys /tmp /selinux; do
  mount --bind $f $ch/$f
done

for f in $nfsmounts; do
  mkdir -p $ch/$f
  mount --bind $f $ch/$f
done

This needs customising for a particular scenario:

Mounting all of /dev gives us easy access to the /dev/null, /dev/zero and /dev/random special files, if these are not already populated in the alternative image, and to the /dev/log socket (see syslogd section below). Some may prefer to populate the /dev directory with a particular small set of device-files. If you choose the latter, then note that:

Mounting /dev/pts, /dev/shm, /proc and /sys can be done without using bind: eg for /proc: mount -t proc myproc $ch/proc
/dev/ptmx and directory /dev/pts gives access to pseudo-terminals, without which commands like xterm cannot be executed in the chrooted environment
/dev/tty is required for certain commands to work (eg klog)
/dev/fuse is required for fusermount to work

/proc is required for certain commands to work (...).
/tmp might be included if this is a separate large file-space, so that a similarly large space will be available in the chrooted environment.
The /tmp/.X11-unix/X0 special file needs to be available in an image for X applications to work in the chrooted environment on (say) a desktop PC, either by bind-mounting all of /tmp as above, or bind-mounting just that particular special file/socket.
/selinux has been included above, but in fact selinux needs to be tailored, or off, in order for a chrooted bash to work, as of SL5.2. (...).
For Torque/PBS worker nodes, nodefile and jobfile directories /var/spool/torque/aux and /var/spool/torque/mom_priv/jobs (or whatever the naming convention is on your system) need to be bind-mounted to the directories of the same name in the image(s). You can achieve that most easily by bind-mounting just /var/spool/torque, and that works fine.

Other considerations

There may be lots of other things that need to be considered; some of them are:

/etc/resolv.conf needs to be consistent between native and chrooted images (so do a copy at boot time), for any DNS lookups to work.
/etc/nsswitch.conf also needs to be consistent, for network-based authentication and look-ups.
/etc/hosts needs to be consistent, if populated in the base system.
/etc/passwd and /etc/group files should be consistent, unless the authentication method is network-based: LDAP for example.
/etc/pam.d/system-auth, if customised for longer / stronger passwords, also needs to customised in the images, for consistency of enforcement.
/etc/mail/submit.cf needs to be consistent if you want mail commands to work. In addition, for the rare cases where an email transaction initially fails and gets deferred, you might consider running a sendmail client daemon per image, or do a bind-mount of /var/spool/clientmqueue directories, or have a cron task to move files in those directories into the base system (which is what I do). Watch out for different group-ids for smmsp in different systems, though, which if true requires your cron task to chgrp the moved files. (Fedora 12 had that issue; Fedora 14/15 reverts to uid:gid 51:51 like RHEL).
/etc/mtab, where this is a regular file (rather than a sym-link to /proc/mounts), can be made consistent, eg stream-edited, into the images, so that a df command shows available file-systems.
/var/lib/dbus/machine-id can be copied into the images, so that certain facilities work (more details?).
/etc/cups/client.conf can be made consistent in the images, to give the same CUPS client set-up.

By consistent, I don't necessary mean identical, but giving the same effect.

In a system where one is trying to achieve isolation between the different virtual systems (we're not!) then of course the above remarks don't necessarily apply.

Security

The system administrators need to keep sub-systems up-to-date as far as non-kernel security fixes are concerned, just as they would do if those systems were stand-alone, because (for example) a setuid program which is discovered to have an exploit which gives uncontrolled root access is a security hazard, whether it be in a native or chrooted environment. So, for example, a glibc vulnerable to CVE-2010-3847 would be exploitable. It would be incorrect to assume that the chrooted environment provides some protection from such hazards because, as is well known, it's not hard for a root user to break out into the native environment, and in any case a malicious root user can do damage enough without breaking out. On the other hand, kernel-exploit vulnerabilities of the sub-systems are not of concern, as we only use the native system's kernel, and vulnerabilities in daemons (like sshd or ntpd) are unlikely to have an impact since there is usually no reason to run non-native versions of those daemons. With use of kernel capabilities, particularly a capability bounding set in kernel versions 2.6.25 onwards, it should be possible to restrict what privileges a user can acquire in their sessions, anyway.

Examples of using SL4 in a login session

For a local user desktop, users are provided with icons to invoke terminal sessions in the various environments: here's an example showing a Fedora 12 KDE panel with terminal icons for Fedora 12, SL5, SL4, plus various ssh calls to remote machines, with further icons for the latest versions of Firefox, ICAClient, EVO, Skype, and Google Earth running under Fedora 12:

The following comments are for non-local sessions, and are a bit historical.

For a login session in SL4, after logging in type in: /bin4/bash. To drop back to the native environment, type exit.

To run just one particular script in SL4, you can do this without going into an SL4 login session. Let's say the first line of the script is #!/bin/sh. You can either enter:

         /bin4/sh myscript    any args

or put #!/bin4/sh as the first line of that script, and then run it simply by:

         myscript    any args

The following are available to our users at the time of writing: /bin4/sh, /bin4/bash, /bin4/ksh, /bin4/zsh, /bin4/csh, /bin4/tcsh.

If you choose to do this /bin4/ script modification, note that inside the SL4 environment, the /bin4/ binaries also exist, but are simply soft-links to the /bin/ binaries, so behave in a consistent way, and so there is no need to keep a second copy of the script which has the conventional invocation as its first line.

Scripts which are invoked by other scripts already in an SL4 environment are run in that same SL4 environmnent. That applies in an interactive session and in a job. So there is no need to invoke those in a special way or go to the trouble of modifying them.

Using SL4 in a submitted job

Our users are now provided with qsub4 and qsub5 commands, which make use of the -D option of qsub, so the following old notes are just included for interest.

Users can submit a job to run in SL4 from within a SL4 or SL5 session. The method is exactly the same: jobs do NOT inherit the operating system of the submitting system (that is, the system doing the qsub).

To run a job totally within SL4, submit it with the qsub option -S /bin4/bash. That is, either use that option on the qsub command line, or put this in the submitted job script:

         #PBS -S /bin4/bash

Alternatively, you can choose that the script only (and not the initialisation of the job) is run under SL4, using one of the techniques described above for login sessions: using the /bin4/ binaries either to invoke a script, or as the first line of a script, possibly the job-script.

An alternative is to use the -D option of the qsub command. Torque has its own chroot facility built in. The -D option should specify the root directory of the alternative system, eg:

         qsub -D /sysroot/SL48 myjob.sh

Why implement it that way?

Why implement SL4 or SL5 as an image inside another system, when you could simply have a different system image for each node? The cluster we wanted this facility for originally uses the ClusterVision OS infrastructure, which easily allows for different workers having different images.

Well, having worker nodes which run SL4 natively was certainly an option, and is still an option for us.

One thing in favour of the chosen method is that we don't need to consider the numbers game, of how many nodes of each system. A worker node can run a mixture of processes from different distros, rather than being dedicated to one or another. Another benefit is running with a single kernel version, which the vendor believed would improve GPFS stability.

Another scenario could be to run the different systems virtually, using Xen, KVM, or another virtualisation technique. However, for our uni cluster, that would involve GPFS running on different kernel versions, with two or more instances per worker node, and the cluster vendor was not in favour of that approach, having already declared a preference for having one kernel version for GPFS throughout the cluster. There is also the question of the comparative efficiency or inefficiency of having fully-virtualized environments compared with our environment: the scheduling between different VMs, additional kernels in memory, multiple NFS mounts of each filesystem per PC, multiple virtual network interfaces, sharing of local-disk filesystems like /tmp, and so on.

Also this method adapts very easily to use on an individual person's desktop PC, which allows the system administrator to provide the latest flavour of Linux operating system and so provide the latest snazzy applications, while preserving the backward compatibility the user often requires to run their own analysis applications.

Knowing what system you're on

So you always know which release you are on, you can add the following to your $HOME/.bashrc:

         release=$(lsb_release -r -s)
         echo You are running on release $release >&2
         PS1=$release-$PS1

(The redirection on the echo is important, as always inside a .bashrc, if you want scp and sftp to continue to work). Remember that in a job this won't give the intended information if you have used the #!/bin4/bash method of invoking the job script, because in that case your login scripts will run under the native system, not the alternative system.

Application configure/make/cmake considerations

Some program applications which are built by users may detect the current environment in a configure script, like distribution and architecture, in order to build with appropriate options, and this needs to work in a chrooted environment too.

Various methods may be used to detect the environment: the lsb_release command, the /etc/redhat-release file and similar, a gcc --version command, an rpm query for the gcc package in various architectures, the existence of a /lib64 directory, and so on. All of those work well in a chrooted environment too.

If it's likely that a configure script might use the uname -m command, and if the environment is a 32-bit distro running under a 64-bit kernel, then on the face of it there's a difficulty. Ideally, the command which calls chroot(2) can also use syscall(2) or personality(2) directly to set the execution domain to 32-bit where required, and that's what my current imageswitch command does. Or the ordinary user can invoke the configure or make script prefixed by the linux32 command, to change the apparent environment. As yet another choice, the sysadmin can rename the supplied uname command to uname.bin, and add one the following uname scripts below.

     #!/bin/sh
     uname.bin "${@}" | sed 's/x86_64/i686/g'

or, making use of the setarch or i386 or linux32 command:

     #!/bin/sh
     linux32 uname.bin "${@}"

Logging to syslog from within a chrooted environment

If you use the logger command from within the chrooted environment, or if you have daemons running within that environment which use the syslog system call, then the logging won't actually get recorded anywhere, in a vanilla chrooted environment. This is because those facilities rely on writing to a socket /dev/log which is opened for input by the syslogd/rsyslogd daemon, and this daemon is by default only running in the native base system. There are several ways around this (you only need one of these!):

Mount the whole /dev directory over the /chrooted/dev directory with the --bind option, as suggested in the earlier section on mounts.

Mount the specific /dev/log file over a /mychroot/dev/log file; for example:

         touch /mychroot/dev/log
	 mount --bind /dev/log /mychroot/dev/log

For rsyslogd, add a file to directory /etc/rsyslog.d/ containing one or more $AddUnixListenSocket /mychroot/dev/log lines. In other/older syslogd environments, modify /etc/sysconfig/syslog and add one or more -a /mychroot/dev/log strings to the definition of the variable SYSLOGD_OPTIONS. In either case, there is no need to create the extra socket(s): the daemon will do it when it starts up.

Forward compatibility for new systems

The method applies best where the chrooted systems are older (or the same) operating systems compared with the native system. But if you chroot from the native system to a newer system (one that normally uses a more recent kernel), this may be rejected with the following message, arising from /lib[64]/ld*so: FATAL: kernel too old. For example, an original Fedora 12 kernel 2.6.31 can't chroot to a Fedora 14 system which normally uses kernel 2.6.35, because of that rejection. However, an end-of-life-cycle Fedora 12 kernel 2.6.32 didn't have that problem, and so for example I was able to use a Fedora 14 image on a Fedora 12 native system, before I moved to Fedora 14 as my native system.

Other methods and implementations

As it turns out, after the event, we and ClusterVision are not alone in using these techniques! (Not surprising, as the basic chroot facility has been built into Unix/Linux since the early days. However, it's the much more recent introduction of bind mounting in kernel 2.4 that has transformed the possibilities in this area). Some of these other methods are quite elaborate, with special kernels, and can provide a degree of isolation between the different virtual systems that we ourselves are not seeking. Here are some references:

Linux Virtualization Wiki and Technical Overview and Technical comparison (see Containers column)
OpenVZ
Pushing Torque jobs in a chroot environment
CHOS - chroot OS
PMV

L.S.Lowe
Birmingham Particle Physics Group