There are several public iscsi-target implementation cookbook recipes; however, this one attempts to provide some unique features, including:
For simplicity of failover and management it was decided to use an active/passive storage controller arrangement. Both storage controllers must have access to the same set of disks (commonly served using Fibre Channel technology, or using a physical SCSI bus where each HBA is configured NOT to reset the bus).
There are two reasons for converting Fibre Channel to iSCSI:
Cluster configuration consistency is provided by rsync. This project provides tools to dynamically update the running ietd environment while maintaining the static configuration files (ietd.conf, initiators.allow, initiators.deny). New targets and LUNs are added automatically. The tools also detect LUN size changes for LUNs served from LVM logical volumes; the client sees the resized LUN without ietd being restarted.
The following sections detail the hardware setup, operating system installation, iscsitarget, iscsitarget-tools and lvm-tools:
The purpose of the hardware is to provide SP (storage processor) services to clients. Since a single physical storage processor is a single point of failure, it was decided to use two physical storage processors, one active and the other passive. Together they act as a single logical storage processor.
          ^  ^                                  ^  ^
          |  |        to iSCSI clients          |  |
          |  |                                  |  |
  +--------------------+              +--------------------+
  |    Gig Ethernet    |              |    Gig Ethernet    |
  +--------------------+              +--------------------+
          |  Ethernet (Gig EtherChannel)        |
          |                                     |
  +--------------------+              +--------------------+   Single Logical
  | iscsi-target-node1 |              | iscsi-target-node2 |   Storage Processor
  |     PSU1, PSU2     |              |     PSU1, PSU2     |   (node1, node2)
  +--------------------+              +--------------------+
          |  FC                                 |  FC
          +---------------+     +---------------+
                          |     |
                  +--------------------+
                  |     FC Switch      |
                  +--------------------+
                          |  FC
                  +--------------------+
                  |    Disk Storage    |
                  +--------------------+

  +--------------------+              +--------------------+
  |     Rack PDUs      |              |     Rack PDUs      |
  +--------------------+              +--------------------+
  (each rack PDU feeds one PSU in each node)
Details about the above diagram starting from the bottom going up:
Several open source packages are used to make the iscsi storage processor work:
CentOS4 is the recommended OS. Fedora Core 4 and 5 have been tried previously without success: snapshots are not stable, and putting the storage processor under load crashes the nodes (take 5 clients, have each of them run dd if=/dev/zero of=/dev/sdg bs=1M, and set up snapshots on each of the logical volumes exposed to the clients - it will break!).
Heartbeat is used to provide clustering services, such as a virtual IP, activation/deactivation of LVM volume groups, and starting and stopping of the iscsi-target service.
The iscsitarget package is used to provide iscsi services. Since the current setup involves VMware ESX v3, it requires at least svn revision 78. Additional patches are included to display the VPD (SCSI Vital Product Data) pages that identify the disk serial and disk id numbers associated with LUNs served by ietd (/proc/net/iet/vpd). Another patch is included to display the current state of the non-persistent reservations (/proc/net/iet/reservation).
The iscsitarget-tools package provides the tools to dynamically manage ietd without restarting it.
The lvm-tools package provides tools to copy LVM logical volumes.
Read the next sections which detail the setup and configuration of each of the above packages. At the end there is a section to download all of the above source code.
Install the latest version of the OS and apply all updates.
There are 2 physical interfaces which are bonded and carry VLANs for management, cluster services and iscsi traffic. The configuration files are located in /etc/sysconfig/network-scripts.
# ifcfg-eth0
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
# ifcfg-eth1
DEVICE=eth1
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes

# ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
MTU=9000		<- enable MTU 9000 on the switch gear also!

# ifcfg-bond0.900	<- vlan 900 is for iscsi
VLAN=yes
DEVICE=bond0.900
BOOTPROTO=none
ONBOOT=yes
NETWORK=#.#.#.#		<- adapt to your network
NETMASK=#.#.#.#
IPADDR=#.#.#.#
BROADCAST=#.#.#.#
# ifcfg-bond0.901	<- vlan 901 is for management
VLAN=yes
DEVICE=bond0.901
BOOTPROTO=none
ONBOOT=yes
NETWORK=#.#.#.#		<- adapt to your network
NETMASK=#.#.#.#
IPADDR=#.#.#.#
BROADCAST=#.#.#.#
# ifcfg-bond0.902	<- vlan 902 is for cluster communications
VLAN=yes
DEVICE=bond0.902
BOOTPROTO=none
ONBOOT=yes
NETWORK=#.#.#.#		<- adapt to your network
NETMASK=#.#.#.#
IPADDR=#.#.#.#
BROADCAST=#.#.#.#
Remember to configure the bonding mode. Mode 2 (balance-xor) appears to stream data more consistently than the default mode 0 (balance-rr).
# /etc/modprobe.conf
alias bond0 bonding
options bonding mode=2 miimon=100
Ensure that all services like iscsi-target are configured to not start by default. They will be managed by the heartbeat cluster services.
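On CentOS this can be done with chkconfig; a minimal sketch (the exact service names depend on what is installed, and heartbeat is started from rc.local as shown later):

# do not autostart cluster-managed services; heartbeat controls them
chkconfig iscsi-target off
chkconfig heartbeat off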
It is assumed the reader understands Linux LVM. The critical thing during the startup of a storage processor node is to NOT activate the volume groups that will be served, but rather let heartbeat activate them.
There appears to be a problem with this logic: the rc.sysinit script inherently activates all volume groups. However, since no process accesses them at that point (only the device mapper entries are mapped), they can be deactivated in rc.local. After deactivation heartbeat can be started, and heartbeat can then manage the volume groups.
The rc.local file looks like:
#
# disable volume groups
#
vgchange -an

#
# then start heartbeat (this guarantees that we don't have the vg open
# on two nodes during heartbeat's startup)
#
service heartbeat start

touch /var/lock/subsys/local
Deactivating all volume groups automatically skips volume groups that are in use, like VolGroup00.
To ensure timely failover between node 1 and node 2, keep the size of snapshot logical volumes reasonably small.
For example, if a volume group contains a 250GB snapshot (actual snapshot usage), it takes about 6 minutes to activate the volume group. The failover from node 1 to node 2 must complete in under 45 seconds for the clients to continue functioning properly.
The kernel has to not only read through that snapshot, but also must allocate the snapshot data structures. Monitoring the FC interface during that time reveals that the FC port throughput was about 6MB/s (normally the disks can provide at least 90-130MB/s).
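To keep an eye on snapshot growth, something like the following lvs invocation can be used (a sketch; snap_percent is the LVM2 report field on CentOS4-era releases):

# show snapshot origin and fill percentage for all logical volumes
lvs -o vg_name,lv_name,lv_size,origin,snap_percent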
The CentOS distribution provides heartbeat version 1 and version 2. Either version could be used; however, the configuration files below reflect version 1.
There are 3 configuration files that must be configured for heartbeat to function.
The authkeys file contains the authentication between the 2 cluster nodes. The following configuration is used:
auth 2
2 sha1 somesecretpreferablelong
The ha.cf file contains the heartbeat communication configuration items; the most important ones are listed below:
bcast bond0.902
auto_failback off
stonith_host iscsi-targetX-node2.amherst.edu external foo /etc/ha.d/shoot-iscsi-targetX-node1.sh
node iscsi-targetX-node1.amherst.edu
node iscsi-targetX-node2.amherst.edu
ping #.#.#.#
respawn hacluster /usr/lib/heartbeat/ipfail
The broadcast method is the simplest one to define, as it allows ha.cf to be identical on both nodes. Use a small subnet (netmask 255.255.255.252) to limit the broadcasts to a small network range.
Auto failback is off: if there has been a failover from node 1 to node 2, it is not smart to have the resources fail back and forth between the nodes. If node 2 cannot handle the load, there are far bigger problems. Node 2 is also needed when changing FC disk configurations and upgrading kernels, both of which require a reboot of a storage processor.
The stonith_host definition allows node 2 to shoot node 1 in the head if node 2 thinks node 1 is dead. The STONITH operation uses the rack PDUs to power off node 1. The other method would be to fence off the FC storage path for node 1; it is, however, smarter to simply kill node 1 to avoid confusion. This can be implemented as follows:
snmpset -v 1 -c private $device1 PowerNet-MIB::rPDUOutletControlOutletCommand.$port i 2
snmpset -v 1 -c private $device2 PowerNet-MIB::rPDUOutletControlOutletCommand.$port i 2
Warning: just putting the above lines into the stonith shell script is a bad idea. It is recommended to create a secure stonith proxy host that alone has access to the rack PDUs. The storage processors can then ssh into the stonith proxy using a non-root account, passing a key as the command (this prevents an accidental kill if you simply ssh to the stonith proxy from node 2):
ssh stonith@stonithhost a098sdfsad8f90asdf09s8adf08as
On that stonith proxy host, set up .ssh/authorized_keys2 as follows (this should be one line):
from="#.#.#.#",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty, command="./stonith.sh iscsi-targetX-node1" ssh-rsa AAAAB3NzaC1.....hk0= root@iscsi-targetX-node2.amherst.edu
This allows only node 2 to ssh in from #.#.#.#. It forcibly executes stonith.sh in the home directory of the stonith user with the first parameter iscsi-targetX-node1. By doing this, a specific node can power off only one other specific node, and not any unrelated cluster node. The stonith.sh script is summarized as follows:
sys=$1
key=$SSH_ORIGINAL_COMMAND

case "$sys" in
	iscsi-targetX-node1)
		if [ "$key" == "a098sdfsad8f90asdf09s8adf08as" ]; then
			snmpset -v 1 -c private $device1 ...
			snmpset -v 1 -c private $device2 ...
		fi
		;;
esac
The ping IP address should be the default gateway of the VLAN that provides iscsi services. Without it, if node 2 cannot ping node 1 on the iscsi VLAN, node 2 will kill node 1. With it, if node 2 can ping neither node 1 nor the ping address (the default gateway), it will not bring down node 1, since there is evidently a wider communication problem on the network.
The haresources file defines which resources are managed by the heartbeat cluster software:
The following is an example of this file.
#
# The IPaddr2 script is required, because on CentOS the name of the resulting
# cluster interface bond0.900:0 is too long and does not appear in the ifconfig
# listing. It appears as bond0.900 just like the main interface for vlan 900.
# Use IPaddr2 to get around this - to get all IP addresses type: ip addr.
#
iscsi-targetX-node1.amherst.edu IPaddr2::#.#.#.# \
	amherst_lvm::vg_diskset0_vol0 \
	amherst_lvm::vg_diskset1_vol0 \
	iscsi-target
The amherst_lvm resource script is located under /etc/ha.d/resource.d and is a copy of the LVM script provided by heartbeat, modified to get around the LVM_VERSION detection problem. CentOS4 does not provide /sbin/lvmiopversion. Comment out the version detection code and set LVM_VERSION="200":
#LVM_VERSION=`/sbin/lvmiopversion`
LVM_VERSION="200"
#rc=$?
#if
#	[ $rc -ne 0 ]
#then
#	ha_log "ERROR: LVM: $1 could not determine LVM version"
#	return $rc
#fi
The /usr/lib/heartbeat/hb_takeover script can be used to manually take over the services from the other node. It is recommended to initially configure ONLY the virtual IP address as a resource, then add the volume groups, and finally add the iscsi-target service.
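A typical manual failover test then looks like this (a sketch; run it on the node that should become active):

# pull the resources over to this node, then verify the virtual IP arrived
/usr/lib/heartbeat/hb_takeover
ip addr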
Use rsync to keep the configuration consistent. The following shell script is used to synchronize the configuration from node 1 to node 2:
#
# rsync files spec'd in cluster_sync.conf, source is / (root)
# on this system, destination is nfs-node2:/ (root)
#
rsync -avrR \
	--delete \
	--files-from=amh_cluster_sync.conf \
	/ iscsi-targetX-node2:/
and the configuration file that defines the --files-from parameter:
# startup scripts
/etc/rc.d/rc.local

# cluster
/etc/ha.d/

# iscsi-target
/etc/iscsi-target/
/etc/ietd.conf
/etc/initiators.allow
/etc/initiators.deny
Install the iscsitarget RPM and iscsitarget-kernel[-smp] module. The download section provides source code and binaries for CentOS4.
The following enhancements have been made:
The proc interface for VPD (Vital Product Data) displays the LUN's scsi_id and scsi_sn defined in ietd.conf:
tid:1 name:iqn.1990-01.edu.amherst.iscsi-target:target_test
	lun:0 path:/dev/vg_test/lv_test
		vpd_83_scsi_id: 49 45 54 00 00 00 00 00 00 00 00 00 02 00 00 00 09 11 00 00 0d 00 00 00  IET.....................
		vpd_80_scsi_sn: AMHDSK-061207-02
The proc interface for reservations displays the LUN's SCSI RESERVE/RELEASE status and includes the initiator that is holding the lock:
tid:1 name:iqn.1990-01.edu.amherst.iscsi-target:target_test
	lun:0 path:/dev/vg_test/lv_test
		reserve:10 release:10 reserved:0 reserved_by:none
The reserve and release counters should normally increment in sync. The reserved value indicates how many times ietd was unable to honor a RESERVE command, and if the resource is currently reserved, the initiator name is displayed.
The init.d script has been modified to block iscsi traffic while shutting down ietd and to allow iscsi traffic again as the daemon starts. This prevents clients from seeing the TCP connection drop; instead, they simply hang for a few seconds, discover that the connection does not function, and reconnect (this is what happens during the failover process from node 1 to node 2).
In theory this patch should not be necessary; however, Windows clients do get ugly when their storage disappears. If one sees red events in the Event Viewer from the Plug and Play manager complaining that a disk disappeared, then it is too late. This patch prevents that event.
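A rough sketch of the idea behind the patch (not the actual init.d code), assuming the default iSCSI portal port 3260:

# stop) block iscsi traffic first, so initiators retransmit and hang briefly
#       instead of seeing a TCP reset, then shut down ietd
iptables -I INPUT -p tcp --dport 3260 -j DROP
killall ietd

# start) bring ietd back up first, then let iscsi traffic flow again
/usr/sbin/ietd
iptables -D INPUT -p tcp --dport 3260 -j DROP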
The iscsitarget-tools package contains the tools to configure and manage the configuration files of ietd.
Install the iscsitarget-tools RPM provided in the download section. It delivers a set of shell scripts in /etc/iscsi-target.
The common.conf file is actually a shell script which defines several variables and functions. The functions are called while update.sh is executed.
function global_options
{
	option IncomingUser "portalusername portalpassword"
}
function target_common_option
{
	option MaxRecvDataSegmentLength 131072
	option MaxXmitDataSegmentLength 131072
}
function lun_callback_pre
{
	> $ROOT/lv-backup.sh
}

function lun_callback
{
	lun_path=$1

	source_dev=$(echo $lun_path | cut -f2 -d/)
	source_vg=$(echo $lun_path | cut -f3 -d/)
	source_lv=$(echo $lun_path | cut -f4 -d/)

	target_dev=$source_dev
	target_vg=vg_backup
	target_lv=${source_lv}_bk

	line="lvm_snap_copy -s lvm:$lun_path -t lvm:/$target_dev/$target_vg/$target_lv"
	echo $line >> $ROOT/lv-backup.sh
}

function lun_callback_post
{
	return
}
The target.conf contains the definitions of targets and their LUNs. A sample is shown below:
LUN_TYPEIO=blockio

function scsi_idsn
{
	echo "ScsiId=$1,ScsiSN=$1"
}

#
# Generate targets and lun assignments and localized options
#
function build_targets_luns
{
	target clustertarget1 #.#.#.#,#.#.#.#,...
		lun 0 /dev/vg_test/lv_test1_0 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 1 /dev/vg_test/lv_test1_1 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 2 /dev/vg_test/lv_test1_2 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 3 /dev/vg_test/lv_test1_3 $(scsi_idsn AMHDSK-YYMMDD-nn)

	target clustertarget2 #.#.#.#,#.#.#.#,...
		lun 0 /dev/vg_test/lv_test2_0 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 1 /dev/vg_test/lv_test2_1 $(scsi_idsn AMHDSK-YYMMDD-nn)
}
Since svn revision 96 there is the blockio type, which avoids the Linux cache when reading data (it requires fast disks; we have seen a 2MB/s increase on reads). LUN_TYPEIO is set by default to fileio; override it as required for the targets/LUNs defined below.
The scsi_idsn function provides the parameters used by ietd to define the SCSI ID (VPD 0x83) and SCSI Serial Number (VPD 0x80).
The build_targets_luns function consists of target and lun statements that create the targets and their LUN assignments.
The update.sh command script dynamically updates and maintains ietd.conf, initiators.allow and initiators.deny, and keeps the running ietd in sync with them.
The tool can be executed in one of two modes:
The offline mode is useful when ietd cannot be started; this happens in particular when iscsitarget is initially installed. There are initially no targets, so ietd does not start, yet at least one target must be defined before it can.
[root@iscsi-targetX iscsi-target]# ./update.sh -h
./update.sh [online|offline]
	Default runmode is online. Choose offline if IETD is not running
	and you still want to update the ietd.conf file
Detailed features:
One must first use lvextend or lvresize to increase or decrease (ouch!) the size of a logical volume, and then run the update.sh script to adjust the running ietd process so that the client sees the resized LUN.
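A hedged example of the resize sequence, assuming a served logical volume named /dev/vg_test/lv_test1_0 (the name is illustrative only):

# grow the backing logical volume by 50GB
lvextend -L +50G /dev/vg_test/lv_test1_0

# let update.sh propagate the new size into the running ietd
cd /etc/iscsi-target
./update.sh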
While executing the update.sh script, it is generally wise to keep an eye on the following commands in another terminal window:
watch -d cat /proc/net/iet/volume
watch -d cat /proc/net/iet/session
Currently the update.sh script will NOT remove LUNs no longer defined in target.conf from the running environment. A future release might incorporate this feature.
Install the lvm-tools RPM provided in the download section. It delivers a set of shell scripts to copy logical volumes:
The lvm_tools configuration file adjusts the following default parameters:
# default snap name suffix
DEFAULT_SNAP_SUFFIX=snap

# in percent (SM < 100GB, XL >= 100GB)
DEFAULT_SNAP_SPACE_SM=20
DEFAULT_SNAP_SPACE_XL=15

# enable md5 checks when doing remote or locally targeted snapshots
ENABLE_MD5=0

#
# enable raw devices on local copy sources (/dev/raw/rawN) (CentOS4 tested)
# this feature is incompatible with TOOL_PV and compression of LVs
#
ENABLE_RAW=1

#
# Expected minimum transfer rate (MB/s)
#
MIN_TRANSFER_RATE=4
It is not recommended to enable MD5 checks in a production environment, as it causes the lvm_copy command to read the source once using md5sum, then transfer the data, and then read the target using md5sum.
lvm_copy is basically a glorified dd command; however, it can handle a mixture of files, LVM logical volumes and devices (like /dev/sdg) as sources and destinations. The command syntax:
/usr/sbin/lvm_copy -s source_device -t target_device [-z] [-p]

lvm_copy will copy from a source lvm device to a target, the target
will be created based on the source lvm device size.

	-s	source device
		source syntax:	file:/tmp/somefile
				lvm:/dev/vg/lv
				dev:/dev/sda

	-t	target device
		target syntax:	file:/tmp/somefile
				lvm:/dev/vg/lv
				dev:/dev/sda

				file:host:/tmp/somefile
				lvm:host:/dev/vg/lv
				dev:host:/dev/sda

	-z	compress with gzip (target must be a file)
	-p	prepare target but do NOT copy the data (useful only for lvm)

	The host field is used by ssh.
lvm_copy determines the size of the source, then validates the target device. If the target device is a file, it will be overwritten. If the target is a logical volume, lvm_copy creates it if it does not exist and extends it if the size does not match the source. If the target is a device, lvm_copy verifies that the device is large enough to hold the contents of the source.
lvm_copy supports reading from raw devices when handling dev or lvm source types. This eliminates the read caching in the Linux kernel and speeds up the copy process.
lvm_copy uses dd to transport the data, optionally to a remote host (using ssh; public/private key authentication is assumed).
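A hypothetical invocation copying a local LV to an LV on a remote host (backuphost, vg_backup and the LV names are made-up examples):

# copy a local logical volume to a remote logical volume over ssh
lvm_copy -s lvm:/dev/vg_test/lv_test1_0 -t lvm:backuphost:/dev/vg_backup/lv_test1_0_bk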
lvm_snap_create creates a snapshot of a logical volume. The command simplifies snapshot creation by automatically sizing the snapshot as a percentage of the origin logical volume.
/usr/sbin/lvm_snap_create -s source_device [-t target_suffix] [-n snap_size]

	-s	source lvm device
	-t	target suffix, by default: snap
	-n	snap size, by default 20% of LV < 100GB, 15% of LV >= 100GB
lvm_snap_remove removes a snapshot logical volume. The command ensures that the logical volume being removed is actually a snapshot and not accidentally the origin volume.
/usr/sbin/lvm_snap_remove -s source_device

	-s	source lvm device (must be a snapshot device)
lvm_snap_copy automates the snap copy process by invoking lvm_snap_create, lvm_copy and lvm_snap_remove to snap copy a logical volume.
/usr/sbin/lvm_snap_copy -s source_device -t target_device [-z]

lvm_snap_copy first creates a snapshot of a lvm source device, then
invokes lvm_copy to copy the snapshot content to the target lvm or
target file, then it removes the snapshot. Refer to lvm_copy for more
details about the source and target parameters. The source parameter
can ONLY be of type lvm, since you cannot create a snapshot of a
device (like /dev/sda) or file.

	-s	source lvm device
	-t	target lvm or other device
	-z	compress (with gzip), only works if target is not lvm
		(i.e. file or remote file)
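For example (hypothetical names; the remote host and path are placeholders):

# snapshot the source LV, stream it compressed to a file on a remote host,
# then remove the snapshot again
lvm_snap_copy -s lvm:/dev/vg_test/lv_test1_0 -t file:backuphost:/backups/lv_test1_0.img.gz -z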
lvm_iostat is modelled after iostat and displays logical volume IO throughput: the number of reads and writes, reads/s, writes/s, IO queue size and IO queue time. The standard iostat displays device names like dm-3; lvm_iostat translates those names into the actual volume group and logical volume names.
/usr/sbin/lvm_iostat [-h] [-n name] -r|-d

	-h	help
	-r	realtime data
	-d	dump data (for rrd)
	-n	device name (ex.: 'hd' would match hda, hdb; 'dm' would match
		device mapper devices; default matches hd[a-z]|sd[a-z]|dm-).
		Specify a regular expression in quotes, like dm-.* to refer
		to all device mapper devices.
Example view:
device  displayname        read                   write                   read/s  write/s  #io queue  #io qtime[ms]
hda     hda                0            0         0              0        0       0        0          0
sda     sda                115150848    109M      292581842944   272G     0       0        0          0
sdb     sdb                33843712000  31G       1247439607616  1.1T     0       0        0          0
sdc     sdc                653681664    623M      8757610496     8.2G     0       0        0          0
dm-8    vg_raidweb_0-lv_1  188416       184k      0              0        0       0        0          0
dm-9    vg_raidweb_0-lv_2  188416       184k      0              0        0       0        0          0
dm-7    vg_raidweb_0-lv_3  188416       184k      0              0        0       0        0          0
Use watch -d to get a realtime view of the IO stats.
The ddless tool copies data from a source to a destination block device. It will initially copy all data and subsequently only copy the changes. This can be helpful if you would like to create snapshots at the destination side of the replication, because only the actually changed blocks are written. This is accomplished by reading 1MB chunks of data and segmenting them into 16K pieces, for each of which a CRC32 and Google's MurmurHash are calculated; together that makes a 64-bit checksum. If the source's checksum for a 16K segment differs from the previous run, that 16K segment is written to the destination device. For performance reasons, multiple neighboring 16K segments are written as one larger segment.
ddless operates in several different modes:
Make sure to test ddless with the -d (direct IO enabled) parameter. It helps performance by bypassing the kernel buffers. One can control the read rate by specifying the -r switch, measured in MB/s. ddless has been tested on 32- and 64-bit CentOS 4 and 5 platforms. The largest source/destination device used was 15TB.
The name of the tool is a pun on more/less :)
ddless by Steffen Plotner, release date: 2008.08.03

Copy source to target keeping track of the segment checksums. Subsequent
copies are faster because we assume that not all of the source blocks change.

	ddless	[-d] -s source [-r read_rate_mb_s] -c checksum [-b]
		-t target [-m max_change_gigabytes -i cmd] [-v]

Produce a checksum file using the specified device. Hint: the device could
be source or target. Use the target and a new checksum file, then compare
it to the existing checksum file to ensure data integrity of the target.

	ddless	[-d] -s source -c checksum [-v]

Determine disk read speed zones, outputs data to stdout.

	ddless	[-d] -s source [-v]

Outputs the built in parameters

	ddless	-p

Parameters
	-d	direct io enabled (i.e. bypasses buffer cache)
	-s	source device
	-r	read rate of source device in megabytes/sec
	-c	checksum file
	-b	bail out with exit code 3 because a new checksum file is
		required, no data is copied from source to target
	-t	target device
	-m	max number of bytes to be written to target device, when
		reached invoke the command defined by parameter -i
	-i	interrupt writing when -m limit is reached and invoke this
		command to, for example, lvextend snap or lvremove snap of
		target device
	-p	display parameters (segment size is known as chunksize in LVM2)
	-v	verbose
	-vv	verbose+debug

Exit codes:
	0	successful
	1	a runtime error code, unable to complete task (detailed perror
		and logical error messages are output via stderr)
	2	max number of gigabytes to be written limit was reached,
		operation successful
	3	new checksum file required, only returned if -b is specified
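A hedged replication sketch (the device and checksum file paths are made up; the checksum file is created on the first run):

# first run copies everything and records per-16K-segment checksums;
# subsequent runs only write segments whose checksums changed
ddless -d -s /dev/vg_test/lv_test1_0 -c /root/ddless/lv_test1_0.ck -t /dev/vg_backup/lv_test1_0_bk

# optional integrity check: checksum the target into a new file and compare
ddless -d -s /dev/vg_backup/lv_test1_0_bk -c /root/ddless/lv_test1_0_verify.ck
cmp /root/ddless/lv_test1_0.ck /root/ddless/lv_test1_0_verify.ck && echo "target matches source"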
The tools mentioned above are available for download in source and binary form and are licensed under the GPL.
Last update: 2008-10-13
Use the tools at your own risk.
Steffen Plotner
Systems Administrator/Programmer
Systems & Networking
Amherst College
Amherst, MA 01002
Email: swplotner {at} amherst.edu
Copyright (C) 2007-2008, Steffen Plotner, Amherst College
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License v2 as published by the Free Software
Foundation.