Saturday, July 9, 2011

Berkeley Lab Checkpoint/Restart (BLCR) Installation and Run Procedure

What is BLCR?

BLCR (Berkeley Lab Checkpoint/Restart) allows programs running on Linux to be "checkpointed" (written entirely to a file), and then later "restarted". BLCR can be found at http://ftg.lbl.gov/checkpoint.

Web Links

https://ftg.lbl.gov/projects/CheckpointRestart/
https://ftg.lbl.gov/CheckpointRestart/CheckpointDownloads.shtml
https://ftg.lbl.gov/CheckpointRestart/downloads/blcr-0.8.2.tar.gz
https://ftg.lbl.gov/CheckpointRestart/downloads/blcr-0.8.2-1.src.rpm
https://upc-bugs.lbl.gov//blcr/doc/html/BLCR_Admin_Guide.html
https://upc-bugs.lbl.gov//blcr/doc/html/BLCR_Users_Guide.html
https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html

Installation Procedure

# cd Desktop
# wget https://ftg.lbl.gov/CheckpointRestart/downloads/blcr-0.8.2.tar.gz
# tar xzvf blcr-0.8.2.tar.gz
# cd blcr-0.8.2
# mkdir builddir
# cd builddir/
# ../configure --with-linux=/usr/src/kernels/2.6.18-128.el5-x86_64/ --with-system-map=/boot/System.map-2.6.18-128.1.6.el5_lustre.1.8.0.1smp --with-vmlinux=/boot/vmlinuz-2.6.18-128.1.6.el5_lustre.1.8.0.1smp --enable-multilib --enable-testsuite --enable-init-script

*******************************************************************
***** WARNING WARNING WARNING WARNING WARNING WARNING WARNING *****
*******************************************************************
* The kernel source does not match currently the running kernel.  *
* Compilation will produce modules unsuitable for the currently   *
* running kernel, which may not be what you intended.             *
*******************************************************************
***** WARNING WARNING WARNING WARNING WARNING WARNING WARNING *****
*******************************************************************
======================================================================
Please review the following configuration information:
  Kernel source directory = /usr/src/kernels/2.6.18-128.el5-x86_64/
  Kernel build directory = /usr/src/kernels/2.6.18-128.el5-x86_64/
  Kernel symbol table = /boot/System.map-2.6.18-128.1.6.el5_lustre.1.8.0.1smp/boot/vmlinuz-2.6.18-128.1.6.el5_lustre.1.8.0.1smp
  Kernel version probed from kernel build = 2.6.18-128.el5
  Kernel running currently = 2.6.18-128.1.6.el5_lustre.1.8.0.1smp
======================================================================

Warning: Proceeding past this warning leads to installation failure, because the kernel source (2.6.18-128.el5) does not match the running kernel (2.6.18-128.1.6.el5_lustre.1.8.0.1smp).

This can be fixed with the following procedure. BLCR needs to examine a configured Linux kernel source tree, and that configuration must match the kernel you will run BLCR against. If you do not have a configured Linux kernel source tree, you may be able to create one fairly easily: many distributions provide a 'config' file, which is all you need to produce a configured kernel source tree.

# uname -r
2.6.18-128.1.6.el5_lustre.1.8.0.1smp
 
# cp -a /usr/src/linux-2.6.18-128.1.6.el5_lustre.1.8.0.1/ /tmp/
# cd /tmp/linux-2.6.18-128.1.6.el5_lustre.1.8.0.1/
# cp configs/kernel-2.6.18-2.6-rhel5-x86_64-smp.config .config
# make prepare-all scripts
# cd /state/partition1/blcr-0.8.2/builddir/
# ../configure --with-linux=/tmp/linux-2.6.18-128.1.6.el5_lustre.1.8.0.1/ --with-system-map=/boot/System.map-2.6.18-128.1.6.el5_lustre.1.8.0.1smp --with-vmlinux=/boot/vmlinuz-2.6.18-128.1.6.el5_lustre.1.8.0.1smp --enable-multilib --enable-testsuite --enable-init-script

*******************************************************************
***** WARNING WARNING WARNING WARNING WARNING WARNING WARNING *****
*******************************************************************
* The kernel source does not match currently the running kernel.  *
* Compilation will produce modules unsuitable for the currently   *
* running kernel, which may not be what you intended.             *
*******************************************************************
***** WARNING WARNING WARNING WARNING WARNING WARNING WARNING *****
*******************************************************************
======================================================================
Please review the following configuration information:
  Kernel source directory = /tmp/linux-2.6.18-128.1.6.el5_lustre.1.8.0.1/
  Kernel build directory = /tmp/linux-2.6.18-128.1.6.el5_lustre.1.8.0.1/
  Kernel symbol table = /boot/System.map-2.6.18-128.1.6.el5_lustre.1.8.0.1smp/boot/vmlinuz-2.6.18-128.1.6.el5_lustre.1.8.0.1smp
  Kernel version probed from kernel build = 2.6.18-128.1.6.el5_lustre.1.8.0.1custom
  Kernel running currently = 2.6.18-128.1.6.el5_lustre.1.8.0.1smp
======================================================================

Warning: The kernel version probed from the kernel build (2.6.18-128.1.6.el5_lustre.1.8.0.1custom) still does not match the running kernel (2.6.18-128.1.6.el5_lustre.1.8.0.1smp). Proceeding past this warning leads to installation failure.

To fix this, change the kernel version in the Makefile of the Linux kernel source directory copied to /tmp.
# cd /tmp/linux-2.6.18-128.1.6.el5_lustre.1.8.0.1/
# vi Makefile

Handy Hint: Change the line "EXTRAVERSION = -128.1.6.el5_lustre.1.8.0.1custom" to "EXTRAVERSION = -128.1.6.el5_lustre.1.8.0.1smp". That is, replace the tag "custom" with "smp".
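Handy Hint: The same change can be made without opening an editor. The following is only a minimal sketch, assuming GNU sed, a Makefile with a single EXTRAVERSION line, and a kernel release of the form VERSION.PATCHLEVEL.SUBLEVEL-EXTRA as in the `uname -r` output above. The helper names extraversion_from_release and set_extraversion are made up for illustration; they are not part of BLCR or the kernel build system. Back up the Makefile before trying it.

```shell
# Derive the EXTRAVERSION value from a kernel release string such as the
# output of `uname -r`: everything from the first "-" onwards.
# Assumption: the release looks like 2.6.18-<extra>.
extraversion_from_release() {
    release="$1"
    printf '%s\n' "-${release#*-}"
}

# Rewrite the EXTRAVERSION line of a kernel Makefile in place (GNU sed).
# Hypothetical helper, not something shipped with BLCR.
set_extraversion() {
    makefile="$1"
    extraversion="$2"
    sed -i "s|^EXTRAVERSION[[:space:]]*=.*|EXTRAVERSION = ${extraversion}|" "$makefile"
}

# Example, run inside the copied kernel source directory:
#   set_extraversion Makefile "$(extraversion_from_release "$(uname -r)")"
```

Deriving the value from `uname -r` avoids retyping the long version string, which is where a mismatch would most likely creep in.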

# cp configs/kernel-2.6.18-2.6-rhel5-x86_64-smp.config .config
# make prepare-all scripts
# cd /state/partition1/blcr-0.8.2/builddir/

Configuring BLCR

# ../configure --with-linux=/tmp/linux-2.6.18-128.1.6.el5_lustre.1.8.0.1/ --with-system-map=/boot/System.map-2.6.18-128.1.6.el5_lustre.1.8.0.1smp --with-vmlinux=/boot/vmlinuz-2.6.18-128.1.6.el5_lustre.1.8.0.1smp --enable-multilib --enable-testsuite --enable-init-script
 
======================================================================
Please review the following configuration information:
  Kernel source directory = /tmp/linux-2.6.18-128.1.6.el5_lustre.1.8.0.1/
  Kernel build directory = /tmp/linux-2.6.18-128.1.6.el5_lustre.1.8.0.1/
  Kernel symbol table = /boot/System.map-2.6.18-128.1.6.el5_lustre.1.8.0.1smp/boot/vmlinuz-2.6.18-128.1.6.el5_lustre.1.8.0.1smp
  Kernel version probed from kernel build = 2.6.18-128.1.6.el5_lustre.1.8.0.1smp
  Kernel running currently = 2.6.18-128.1.6.el5_lustre.1.8.0.1smp
====================================================================== 

Compiling BLCR

# make

Testing the Build

# make insmod check
 
======================
All 58 tests passed
(2 tests were not run)
======================
Make sure the BLCR modules are loaded by grepping for blcr in the lsmod output. There should be two modules, "blcr" and "blcr_imports".
# lsmod | grep blcr
blcr                  139268  0
blcr_imports           46208  1 blcr
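
The same check can be scripted so that it fails loudly when one of the two modules is missing. This is a small sketch, not something shipped with BLCR; blcr_modules_loaded is a hypothetical helper that reads lsmod-style output on standard input and looks only at the first column.

```shell
# Succeeds (exit status 0) only if both "blcr" and "blcr_imports" appear
# as module names in the lsmod-style text read from standard input.
blcr_modules_loaded() {
    awk '$1 == "blcr"         { have_blcr = 1 }
         $1 == "blcr_imports" { have_imports = 1 }
         END { exit !(have_blcr && have_imports) }'
}

# Example:
#   lsmod | blcr_modules_loaded && echo "BLCR modules are loaded"
```

Matching on the first column avoids a false positive from the "Used by" column, where blcr also appears on the blcr_imports line.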

Note: "make insmod check" loads the BLCR kernel modules before running the checks. Loading them again with insmod would therefore fail, and there is no need for it.
If only "make check" is used when building the package, the BLCR kernel modules need to be loaded separately. These modules need to be loaded in the order shown below.

# insmod /usr/local/lib64/blcr/2.6.18-128.1.6.el5_lustre.1.8.0.1smp/blcr_imports.ko
# insmod /usr/local/lib64/blcr/2.6.18-128.1.6.el5_lustre.1.8.0.1smp/blcr.ko

Installing BLCR

# make install

Useful Information: By default BLCR will install into /usr/local.

Loading the kernel modules by default at boot time

Useful Information: Adding '--enable-init-script' to the configure flags installs the BLCR init script in /usr/local/etc/init.d/blcr. To make it work as a boot-time service, copy this script to /etc/init.d/, modify it as described below, and register it with chkconfig.

# cp /usr/local/etc/init.d/blcr /etc/init.d/blcr

Copy the BLCR kernel modules from /usr/local/lib64/blcr/`uname -r`/ to /lib/modules/`uname -r`/kernel/drivers/misc/ and rebuild the module dependency list.
# cp /usr/local/lib64/blcr/`uname -r`/*.ko /lib/modules/`uname -r`/kernel/drivers/misc/
# depmod -a

Then modify the script as follows and save it.
# vi /etc/init.d/blcr

Modify line 10:  module_dir=
                             to
                 module_dir=/usr/local/lib64/blcr/2.6.18-128.1.6.el5_lustre.1.8.0.1smp

Note: Next to module_dir= add the path of the directory containing blcr kernel modules.

Modify line 38:  modprobe $1 || (do_checkmod $1 || insmod ${module_dir}/${1}.ko)
                             to
                 modprobe $1 > /dev/null 2>&1 || (do_checkmod $1 || insmod ${module_dir}/${1}.ko)

Modify line 43:  modprobe -r $1 || (do_checkmod $1 && rmmod $1)
                             to
                 modprobe -r $1 > /dev/null 2>&1 || (do_checkmod $1 && rmmod $1)

Modify line 88:  if [ "x$rc1$rc2" != "x111" ] ; then
                             to
                 if [ "x$rc1$rc2" != "x11" ] ; then

Note: Appending " > /dev/null 2>&1" to the modprobe commands is optional. Even when modprobe fails, insmod still loads the BLCR modules. However, because the script tries modprobe first, it prints "FATAL: Module blcr_imports not found" and "FATAL: Module blcr not found" before insmod loads the modules successfully and reports OK. Redirecting modprobe's output removes these confusing messages.

If you don't want to copy the BLCR kernel modules to /lib/modules/`uname -r`/kernel/drivers/misc/, skip the cp and depmod steps and make the same modifications to /etc/init.d/blcr (lines 10, 38, 43 and 88) as shown above. In that case modprobe cannot find the modules, and the script falls back to insmod using the path in module_dir.

Note: Modifying lines 38 and 43 is optional. The modules are still loaded through insmod, so the change only matters if you care about the error messages from modprobe. Either way, I believe line 88 needs to be modified.
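
The edits to lines 10, 38 and 43 can also be scripted instead of made by hand in vi. This is only a sketch, assuming GNU sed and an init script whose module_dir= and modprobe lines look exactly like the ones quoted above; set_module_dir and quiet_modprobe are hypothetical helper names. Line 88 is deliberately left for manual editing. Keep a backup copy of the script before running anything like this.

```shell
# Set module_dir= (line 10 in the script above) to the directory that
# holds the BLCR kernel modules.
set_module_dir() {
    script="$1"
    dir="$2"
    sed -i "s|^module_dir=.*|module_dir=${dir}|" "$script"
}

# Append " > /dev/null 2>&1" to the "modprobe $1" and "modprobe -r $1"
# calls (lines 38 and 43 above). Lines already edited no longer match
# the pattern, so running this twice does not redirect twice.
quiet_modprobe() {
    sed -i 's#\(modprobe \(-r \)\{0,1\}\$1\) || (#\1 > /dev/null 2>\&1 || (#' "$1"
}

# Example:
#   set_module_dir /etc/init.d/blcr "/usr/local/lib64/blcr/$(uname -r)"
#   quiet_modprobe /etc/init.d/blcr
```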

# chkconfig --add blcr
# chkconfig --list blcr
blcr               0:off    1:off    2:off    3:on    4:on    5:on    6:off
# service blcr status
BLCR subsytem is active
# lsmod | grep blcr
blcr                  139268  0
blcr_imports           46208  1 blcr
# service blcr stop
Unloading BLCR:                                            [  OK  ]
# lsmod | grep blcr
# service blcr start
Loading BLCR:                                              [  OK  ]
# lsmod | grep blcr
blcr                  139268  0
blcr_imports           46208  1 blcr
# service blcr reload
Unloading BLCR:                                            [  OK  ]
Loading BLCR:                                              [  OK  ]
# lsmod | grep blcr
blcr                  139268  0
blcr_imports           46208  1 blcr

Useful Information

1) If you did not use the --enable-init-script configure option, a template init script, blcr.rc, is provided in the BLCR source directory under blcr-0.8.2/etc/. Copy it to /etc/init.d/blcr and modify it as shown above to suit your system.

    # cp /state/partition1/blcr-0.8.2/etc/blcr.rc /etc/init.d/blcr
    # chmod 755 /etc/init.d/blcr
    # vi /etc/init.d/blcr

2) Line 10 should set module_dir= to the path of the BLCR kernel modules. In my case that is module_dir=/usr/local/lib64/blcr/2.6.18-128.1.6.el5_lustre.1.8.0.1smp.
3) Modify all other lines as shown above.

Updating ld.so.cache

Nearly all Linux distributions use a caching mechanism for resolving dynamic library dependencies. If you have installed BLCR's shared library in a directory that is cached by the mechanism, then you will need to update this cache. To do so, run the ldconfig command as root; no command-line arguments are needed.

Handy Hint: If configured with --enable-multilib, add the line "/usr/local/lib64" to the file /etc/ld.so.conf, or create a file under /etc/ld.so.conf.d/ containing that line. If configured without --enable-multilib, use /usr/local/lib instead.

# vi /etc/ld.so.conf
# more /etc/ld.so.conf
/lib64
/usr/lib64
/usr/kerberos/lib64
/opt/nmi/lib
/usr/lib64/qt-3.1/lib
/usr/lib64/mysql
/usr/X11R6/lib64
/usr/local/lib64
# ldconfig

Note: If configured without --enable-multilib replace the line /usr/local/lib64 with /usr/local/lib.

Note: If you configured with --prefix= or --libdir= options that install BLCR's shared library (libcr.so) somewhere other than /lib, /usr/lib, or a directory listed in /etc/ld.so.conf or in a file under /etc/ld.so.conf.d/, there is no need to run the ldconfig command, although running it is always safe.

Note: If you passed no --prefix= or --libdir= options to BLCR's configure script, check /etc/ld.so.conf and /etc/ld.so.conf.d/ for /usr/local/lib (the default location) to determine whether you actually need to run the ldconfig command.
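
Handy Hint: Checking whether a library directory is already listed can be scripted. This is a minimal sketch; dir_in_ld_conf is a hypothetical helper that only does a whole-line match on the files you pass it and does not follow "include" lines itself, so pass the files under /etc/ld.so.conf.d/ explicitly.

```shell
# Succeeds if the directory given as the first argument is listed, as a
# whole line and ignoring trailing slashes and whitespace, in any of the
# ld.so.conf-style files given as the remaining arguments.
dir_in_ld_conf() {
    dir="${1%/}"
    shift
    for conf in "$@"; do
        [ -f "$conf" ] || continue
        if sed 's|/*[[:space:]]*$||' "$conf" | grep -qxF "$dir"; then
            return 0
        fi
    done
    return 1
}

# Example:
#   dir_in_ld_conf /usr/local/lib64 /etc/ld.so.conf /etc/ld.so.conf.d/*.conf \
#       || echo "/usr/local/lib64 is not listed yet"
```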

Note: If the --prefix= or --libdir= options you passed to BLCR's configure script install BLCR's shared library (libcr.so) outside /lib, /usr/lib, and every directory listed in /etc/ld.so.conf or in a file under /etc/ld.so.conf.d/, then you need to create a file such as blcr.sh in /etc/profile.d/ with permissions 755 (-rwxr-xr-x) that sets LD_LIBRARY_PATH.

# cd /etc/profile.d/
# more blcr.sh
#!/bin/sh
export LD_LIBRARY_PATH=/usr/local/lib/:/usr/local/lib64/
# chmod 755 blcr.sh
# source /etc/profile.d/blcr.sh
# echo $LD_LIBRARY_PATH
/usr/local/lib/:/usr/local/lib64/

Building a binary RPM from source RPMS

We can build RPMs from a source RPM (with a .src.rpm suffix) rather than from the .tar.gz version of the BLCR distribution. Source RPMs are available on the BLCR website. They are configured to build for the running kernel with --prefix=/usr, and with --enable-multilib on 64-bit platforms. Built RPMs are placed in a subdirectory of /usr/src/redhat/RPMS.

Warning: Building binary RPMs from the source RPM needs a little tweaking on our systems, because the kernel version probed from the kernel build (2.6.18-128.1.6.el5_lustre.1.8.0.1custom) doesn't match the running kernel (2.6.18-128.1.6.el5_lustre.1.8.0.1smp). Proceeding with this mismatch leads to installation failure.

Handy Hint: The trick is to create links to the vmlinuz, the system map and the kernel build in their respective directories, with the tag custom in place of the original tag smp.

Follow this procedure to build the RPMs.
# cd /lib/modules/
# ln -s 2.6.18-128.1.6.el5_lustre.1.8.0.1smp 2.6.18-128.1.6.el5_lustre.1.8.0.1custom
# cd /boot/
# ln -s System.map-2.6.18-128.1.6.el5_lustre.1.8.0.1smp System.map-2.6.18-128.1.6.el5_lustre.1.8.0.1custom
# ln -s vmlinuz-2.6.18-128.1.6.el5_lustre.1.8.0.1smp vmlinuz-2.6.18-128.1.6.el5_lustre.1.8.0.1custom
# rpmbuild --rebuild --define 'kernel_ver 2.6.18-128.1.6.el5_lustre.1.8.0.1custom' blcr-0.8.2-1.src.rpm --target `uname -p`
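
Handy Hint: The three link names above follow one pattern: the trailing smp tag of the running kernel release is replaced with custom. The sketch below only prints the corresponding ln commands for review and does not run them. print_custom_links is a hypothetical helper, and it assumes the release string ends in smp, as it does on our systems.

```shell
# Print (but do not execute) the ln commands that create the "custom"
# aliases for a kernel release ending in "smp".
# If the release does not end in "smp", "custom" is simply appended.
print_custom_links() {
    rel="$1"
    custom="${rel%smp}custom"
    printf 'cd /lib/modules && ln -s %s %s\n' "$rel" "$custom"
    printf 'cd /boot && ln -s System.map-%s System.map-%s\n' "$rel" "$custom"
    printf 'cd /boot && ln -s vmlinuz-%s vmlinuz-%s\n' "$rel" "$custom"
}

# Example:
#   print_custom_links "$(uname -r)"
```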

Note: If installed from RPMs, the executables are in /usr/bin and the libraries are in /usr/lib64 (64-bit) as well as /usr/lib (32-bit). Most probably /usr/lib64 is already listed in /etc/ld.so.conf; if it is not, add it as a separate line. There is no need to add /usr/lib, as it is always in the system path, and moreover we only need the 64-bit libraries since our machines are 64-bit.

Running BLCR

$ vi blcr.c
$ more blcr.c
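#include <stdio.h>
#include <unistd.h>     /* sleep() */

int main(int argc, char *argv[])
{
    int i;

    /* Print one counter value per second so that progress across a
       checkpoint and a restart is visible in the output. */
    for (i = 0; i < 100; i++)
    {
        printf("i = %d\n", i);
        fflush(stdout);
        sleep(1);
    }
    return 0;
}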
$ gcc blcr.c -o blcr
$ cr_run ./blcr > output.txt &
[1] 17830
$ tail -f output.txt       # or run 'more output.txt' before the checkpoint and after the restart to compare the output
$ ps | grep blcr | grep -v grep
17830 pts/0 00:00:00 blcr
$ cr_checkpoint --term 17830       # creates a context.<pid> file and kills the process
[1]+ Terminated       cr_run ./blcr >output.txt
$ ls context.*
context.17830
$ cr_restart context.17830 &       # voilà! resumes from where it was checkpointed
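
The checkpoint and restart steps above can be wrapped in one small function. This is only a sketch built from the commands shown in this post; checkpoint_then_restart is a hypothetical helper, and it assumes the context file is written to the current directory under the name context.<pid>, as in the listing above.

```shell
# Checkpoint a process started under cr_run, terminating it, then restart
# it in the foreground from the context file that was just written.
checkpoint_then_restart() {
    pid="$1"
    cr_checkpoint --term "$pid" || return 1
    # Assumption: the context file is context.<pid> in the current directory.
    [ -f "context.$pid" ] || return 1
    cr_restart "context.$pid"
}

# Example:
#   cr_run ./blcr > output.txt &
#   checkpoint_then_restart $!
```

Checking that the context file exists before calling cr_restart makes a failed checkpoint obvious, since the original process has already been asked to terminate by then.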

Sreedhar Manchu
