We have a cluster and a set of desktop PCs whose base system is
Fedora 12 or Fedora 15, but which also carry complete SL4.8 and SL5.4
sub-systems (SL is Scientific Linux, a distribution similar to CentOS
and based on Red Hat Enterprise Linux).
If we wanted, we could extend this to a larger set of different
sub-systems, including 32-bit and 64-bit versions of the same system.
Additional filesystems which you would normally expect to see in a
native system
(like the filesystem that contains your $HOME directory, and optionally
the /tmp filesystem)
are specially mounted inside the alternative images too.
Users switch to one of the alternative sub-systems using a specially written binary, such as my imageswitch command invoked as sl4 or sl5, or (earlier) /bin4/bash, which performs the required chroot into the alternative system. On our desktop systems, users get SL4 and SL5 icons which open a window in the corresponding environment. For jobs, users use the qsub4 or qsub5 commands, which make use of the -D option of the qsub command in Torque, since Torque provides its own perfectly usable internal chroot facility (since version 1.1.0p3).
This is a bit different from the normal use of the chroot system call, which is usually employed to give less access to system facilities, rather than similar access to a different system: for example, a chrooted environment used in a web server for security purposes. See the reference links at the bottom of the page for more examples of container virtualisation.

When you enter one of the special commands like /bin4/sh, the command changes the filesystem root to the root of the alternative installation, e.g. SL4, changes the current directory to the same directory but in the SL4 image, and then invokes the command of the same name in the SL4 image. So /bin4/bash invokes bash in the changed-root system. Because your $HOME and other files are mounted in the SL4 image too, you can continue to see them.
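As a sketch of the mechanism (the real imageswitch command is a compiled setuid-root binary, and the function and variable names here are illustrative), the wrapper's job amounts to a chroot, a chdir back to the equivalent working directory, and an exec of the same-named shell. This dry-run version only prints the command it would run:

```shell
# Dry-run sketch of what a /bin4/bash-style wrapper does.
# In real use the printed command must be run as root (or via setuid).
image_shell() {
    root=$1     # e.g. /sysroot/SL48
    shell=$2    # e.g. /bin/bash
    echo "chroot $root /bin/sh -c 'cd $PWD && exec $shell'"
}

image_shell /sysroot/SL48 /bin/bash
```

Because $HOME and the other bind-mounted filesystems exist inside the image, the cd back to the original working directory normally succeeds.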
Your programs will therefore use the scripts and run-time libraries and other system files of the SL4 or SL5 installation, rather than the native installation. So they are operating in a virtual application environment. Those libraries may call a system kernel interface to provide certain facilities, and in this case that kernel is the native Fedora 12 kernel, not a SL4 or SL5 kernel. However, the assumption is that the kernel facilities are backward-compatible between Fedora 12, SL5 and SL4.
As an analogy, consider a fully-resolved (static) program.
If a program is compiled on one system with all library references
resolved, and so with no run-time libraries, and you copy it to a
second, different system, you would expect it to work flawlessly,
provided the second system is later than but backwards-compatible
with the first.
It would be a very brave or foolhardy vendor or kernel developer who
altered the kernel interfaces in a way that caused such a static
program to fail.
In our case, we are not fully-resolving the library references
but instead we are providing all the run-time libraries of the first
system too;
we rely equally on the backward-compatibility of the kernel interfaces,
and so for the same reason we expect that method to work flawlessly too.
The --bind option of mount is invaluable here (similar to -o bind). Without bind, you might have to mount your NFS file-systems at several different places, which for me (from monitoring packets for a mount with a 2.6.38 kernel) is different from and less efficient than bind mounting; and without bind, some file-systems like GPFS might not be replicable within the chrooted system at all.
So the following might be added
in the /etc/rc.d/rc.local file
of the native system:
ch=/sysroot/SL48
nfsmounts="... customise ..."   # this needs customising for a particular scenario
for f in /dev /dev/pts /dev/shm /proc /sys /tmp /selinux; do
    mount --bind $f $ch/$f
done
for f in $nfsmounts; do
    mkdir -p $ch/$f
    mount --bind $f $ch/$f
done
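For completeness, the corresponding teardown has to unmount in reverse order, so that nested mounts like /dev/pts go before /dev. This is a hedged sketch (the function name is my own); it only prints the umount commands, which would then be run as root:

```shell
# Print umount commands for the bind mounts, deepest-nested first.
# Dry-run sketch: pipe the output to sh as root to actually unmount.
teardown_binds() {
    ch=$1; shift
    rev=
    for f in "$@"; do rev="$f${rev:+ }$rev"; done   # reverse the list
    for f in $rev; do echo "umount $ch$f"; done
}

teardown_binds /sysroot/SL48 /dev /dev/pts /dev/shm /proc /sys /tmp /selinux
```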
In a system where one is trying to achieve isolation between the
different virtual systems (we're not!) then of course the above remarks
don't necessarily apply.
The following comments are for non-local sessions, and are a bit historical.
For a login session in SL4, after logging in type in: /bin4/bash. To drop back to the native environment, type exit.
To run just one particular script in SL4, you can do this without going into an SL4 login session. Let's say the first line of the script is #!/bin/sh. You can either enter:
/bin4/sh myscript any args
or put #!/bin4/sh as the first line of that script, and then run it simply by:
myscript any args
The following are available to our users at the time of writing: /bin4/sh, /bin4/bash, /bin4/ksh, /bin4/zsh, /bin4/csh, /bin4/tcsh.
If you choose to make this /bin4/ script modification, note that inside the SL4 environment the /bin4/ binaries also exist, but are simply soft-links to the /bin/ binaries, so they behave consistently; there is no need to keep a second copy of the script with the conventional invocation as its first line.
Scripts which are invoked by other scripts already in an SL4 environment run in that same SL4 environment. That applies both in an interactive session and in a job, so there is no need to invoke them in a special way or go to the trouble of modifying them.
Users can submit a job to run in SL4 from within an SL4 or SL5 session. The method is exactly the same: jobs do NOT inherit the operating system of the submitting system (that is, the system doing the qsub).
To run a job totally within SL4, submit it with the qsub option -S /bin4/bash. That is, either use that option on the qsub command line, or put this in the submitted job script:
#PBS -S /bin4/bash
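Putting this together, a complete SL4 job script might look like the following sketch (the resource request and the application name are placeholders, not part of our actual setup):

```shell
#!/bin/bash
#PBS -S /bin4/bash          # run the whole job, including initialisation, under SL4
#PBS -l nodes=1             # placeholder resource request
cd "$PBS_O_WORKDIR"
cat /etc/redhat-release     # confirms which image the job ran in
./myanalysis input.dat      # placeholder application
```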
Alternatively, you can choose that the script only (and not the
initialisation of the job)
is run under SL4, using one of the techniques described above for login
sessions: using the /bin4/ binaries either to invoke a script, or as
the first line of a script,
possibly the job-script.
An alternative is to use the -D
option of the qsub command. Torque has its own chroot facility built
in. The -D option should
specify the root directory of the alternative system, eg:
qsub -D /sysroot/SL48 myjob.sh
Well, having worker nodes which run SL4 natively was certainly
an option, and is still an option for us.
One thing in favour of the chosen method is that we don't need to play the numbers game of deciding how many nodes run each system: a worker node can run a mixture of processes from different distros, rather than being dedicated to one or another. Another benefit is running with a single kernel version, which the vendor believed would improve GPFS stability.
Another scenario could be to run the different systems virtually, using Xen, KVM, or another virtualisation technique. However, for our uni cluster, that would involve GPFS running on different kernel versions, with two or more instances per worker node, and the cluster vendor was not in favour of that approach, having already declared a preference for having one kernel version for GPFS throughout the cluster. There is also the question of the comparative efficiency or inefficiency of having fully-virtualized environments compared with our environment: the scheduling between different VMs, additional kernels in memory, multiple NFS mounts of each filesystem per PC, multiple virtual network interfaces, sharing of local-disk filesystems like /tmp, and so on.
Also this method adapts very easily to use on an individual person's
desktop PC, which allows the system administrator to provide the latest
flavour of Linux operating system and so provide the latest snazzy
applications, while preserving the backward compatibility the user
often requires to run their own analysis applications.
To see which environment you are in, you might put something like this in your .bashrc:
release=$(lsb_release -r -s)
echo You are running on release $release >&2
PS1=$release-$PS1
(The redirection on the echo is important, as always inside a .bashrc, if you want scp and sftp to continue to work). Remember that in a job this won't give the intended information if you have used the #!/bin4/bash method of invoking the job script, because in that case your login scripts will run under the native system, not the alternative system.
Various methods may be used to detect the environment: the lsb_release command, the /etc/redhat-release file and similar, a gcc --version command, an rpm query for the gcc package in various architectures, the existence of a /lib64 directory, and so on. All of those work well in a chrooted environment too.
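As an illustrative sketch, a detection helper might try some of those methods in order (the function name is my own; lsb_release and /etc/redhat-release are the methods mentioned above, and uname -r is only a last resort, since it reports the native kernel rather than the image):

```shell
# Report which release/environment we appear to be running in.
# Works inside a chrooted image too, since the first two methods
# consult the image's own files and commands.
detect_release() {
    if command -v lsb_release >/dev/null 2>&1; then
        lsb_release -r -s                  # e.g. "5.4"
    elif [ -r /etc/redhat-release ]; then
        cat /etc/redhat-release            # e.g. "Scientific Linux ..."
    else
        uname -r                           # last resort: the NATIVE kernel release
    fi
}

detect_release
```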
If it's likely that a configure script might use the uname -m command, and if the environment is a 32-bit distro running under a 64-bit kernel, then on the face of it there's a difficulty. Ideally, the command which calls chroot(2) can also use syscall(2) or personality(2) directly to set the execution domain to 32-bit where required, and that's what my current imageswitch command does. Alternatively, the ordinary user can invoke the configure or make script prefixed by the linux32 command, to change the apparent environment. As yet another choice, the sysadmin can rename the supplied uname command to uname.bin, and add one of the following uname scripts.
#!/bin/sh
uname.bin "${@}" | sed 's/x86_64/i686/g'
or, making use of the setarch or i386 or linux32 command:
#!/bin/sh
linux32 uname.bin "${@}"
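The rename-and-wrap step can be sketched as a small helper (the function name is my own, and /sysroot/SL48/bin is only an example path; on the real image this would be run as root):

```shell
# Replace uname in an image's bin directory with the sed-based wrapper
# above, so that uname -m reports i686 rather than x86_64.
# uname.bin is found via PATH, as in the original wrapper script.
wrap_uname() {
    dir=$1                                 # e.g. /sysroot/SL48/bin
    mv "$dir/uname" "$dir/uname.bin"
    printf '#!/bin/sh\nuname.bin "${@}" | sed "s/x86_64/i686/g"\n' \
        > "$dir/uname"
    chmod 755 "$dir/uname"
}
```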
If you use the logger command from within the chrooted environment, or if you have daemons running within that environment which use the syslog(3) call, then in a vanilla chrooted environment the logging won't actually get recorded anywhere. This is because those facilities rely on writing to a socket, /dev/log, which is opened for input by the syslogd/rsyslogd daemon, and by default this daemon is only running in the native base system. There are several ways around this (you only need one of them!):
One option is simply to bind-mount the native socket into the image:
touch /mychroot/dev/log
mount --bind /dev/log /mychroot/dev/log
In rsyslogd environments, modify /etc/rsyslog.conf and add one or more
$AddUnixListenSocket /mychroot/dev/log
lines.
In other/older syslogd environments, modify /etc/sysconfig/syslog and add one or more
-a /mychroot/dev/log
strings to the definition of the variable SYSLOGD_OPTIONS.
In the latter two cases, there is no need to create the extra socket(s):
the daemon will do it when it starts up.
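Whichever route you take, the result is easy to check; this hypothetical helper just tests whether the log socket exists inside the image (/mychroot/dev/log is the example path from above):

```shell
# Report whether a path exists and is a unix-domain socket,
# as /dev/log should be once the daemon or bind mount is in place.
check_log_socket() {
    if [ -S "$1" ]; then
        echo "ok: $1 is a socket"
    else
        echo "missing: $1"
    fi
}

check_log_socket /mychroot/dev/log   # example path
```

After that, a quick `logger hello` from inside the chroot should show up in the native system's logs.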
As it turns out, after the event, we and ClusterVision are not alone in using these techniques!
(Not surprising, as the basic chroot facility has been built into Unix/Linux since the early days.
However, it's the much more recent introduction of bind mounting in kernel 2.4 that has transformed
the possibilities in this area).
Some of these other methods are quite elaborate,
with special kernels, and can provide a degree of isolation between the
different virtual systems that we ourselves are not seeking. Here are
some references: