Linux iscsitarget+tools

There are several public iscsi-target implementation cookbook recipes; this one, however, attempts to provide some unique features, including:

For simplicity of failover and management it was decided to use an active/passive storage controller arrangement. Both storage controllers must have access to the same set of disks (commonly served using Fiber Channel technology, or using a physical SCSI bus where each HBA is configured to NOT reset the bus).

There are two reasons for converting Fiber Channel to iSCSI:

  1. Delivering iSCSI to hosts is cheaper. To get decent performance, one should consider an iSCSI HBA. The cost of an FC HBA is comparable to that of an iSCSI HBA; however, the additional costs of fiber port licenses and SFPs change that. In many cases a software iSCSI initiator is sufficient.
  2. The disks delivered via FC have no built-in snapshot support or remote replication capability. It was therefore decided to use LVM for snapshots and lvm_tools for remote replication. Hot replication using drbd was not considered because it has the same issue as a RAID1 setup: if you delete data, the data is also gone on the mirror (i.e. the remote site). It would be better to have 2 sets of remote disks, one for drbd and another for snapshots.

Cluster configuration consistency is provided by rsync. This project provides tools to dynamically update the running ietd environment while maintaining the static configuration files (like ietd.conf, initiators.allow, initiators.deny). New targets and LUNs are automatically added. The tools will detect LUN size changes if they are served via LVM logical volumes. The client will see the resized LUN without restarting ietd.

The following sections detail the hardware setup, operating system installation, iscsitarget, iscsitarget-tools and lvm-tools:

Hardware Setup

The purpose of the hardware is to provide SP (storage processor) services to clients. Since a single physical storage processor is a single point of failure, it was decided to use two physical storage processors, one active the other passive. Together they act as a single logical storage processor.

        ^   ^                                        ^   ^
        |   |                                        |   |
        |   |              to iSCSI clients          |   |
        |   |                                        |   |
+--------------------+                       +--------------------+
|    Gig Ethernet    |                       |    Gig Ethernet    |
+--------------------+                       +--------------------+
        |   |                                        |   |
        |   | Ethernet (Gig EtherChannel)            |   | Ethernet (Gig EtherChannel)
        |   |                                        |   |
+--------------------+                       +--------------------+  Single Logical
| iscsi-target-node1 |	                     | iscsi-target-node2 |  Storage Processor
+--------------------+	                     +--------------------+  (node1, node2)
   | |  |                                            |      | |
   | |  +--------------------+ FC +------------------+      | |
   | |                       |    |                         | |
   | |PSU1, PSU2      +--------------------+                | |PSU1, PSU2
   | |                |      FC Switch     |                | |
   | |                +--------------------+                | |
   | |                       |    |                         | |
   | |                       | FC |                         | |
   | |                       |    |                         | |
   | |                +--------------------+                | |
   | |                |    Disk Storage    |-+              | |
   | |                +--------------------+ |              | |
   | |                 +---------------------+              | |
   | |                                                      | |
   | +-------------------------------------------+          | |
   |                                             |          | |
   | +-------------------------------------------^----------+ |
   | |                                           |            |
+--------------------+                       +--------------------+
|    Rack PDUs       |                       |    Rack PDUs       |
+--------------------+                       +--------------------+

Details about the above diagram starting from the bottom going up:

Software Setup

Several open source packages are used to make the iscsi storage processor work:

CentOS4 is the recommended OS. Fedora Core 4 and 5 have been used previously without success: snapshots are not stable, and putting the storage processor under load crashes the nodes (take 5 clients, have each one do a dd if=/dev/zero of=/dev/sdg bs=1M, and set up snapshots on each of the logical volumes exposed to the clients - it will break!).

Heartbeat is used to provide clustering services, such as a virtual IP, activating/deactivating of LVM volume groups, starting and stopping iscsi-target services.

The iscsitarget package is used to provide iscsi services. Since the current setup involves VMware ESX v3, it requires at least svn revision 78. Additional patches are included to display VPD (SCSI Vital Product Data) pages, which identify the disk serial and disk id numbers associated with LUNs served by ietd (/proc/net/iet/vpd). Another patch is included to display the current state of the non-persistent reservations (/proc/net/iet/reservation).

The iscsitarget-tools package provides the tools to dynamically manage ietd without restarting it.

The lvm-tools package provides tools to copy LVM logical volumes.

Read the next sections which detail the setup and configuration of each of the above packages. At the end there is a section to download all of the above source code.


Install the latest version of the OS, install all updates.

Ethernet Networking

There are 2 physical interfaces which are bonded and carry VLANs for management, cluster services and iscsi traffic. The configuration files are located in /etc/sysconfig/network-scripts.

# ifcfg-eth0
# ifcfg-eth1

# ifcfg-bond0
MTU=9000		<- enable MTU 9000 on the switch gear also!

# ifcfg-bond0.900	<- vlan 900 is for iscsi
NETWORK=#.#.#.#		<- adapt to your network
# ifcfg-bond0.901	<- vlan 901 is for management
NETWORK=#.#.#.#		<- adapt to your network
# ifcfg-bond0.902	<- vlan 902 is for cluster communications
NETWORK=#.#.#.#		<- adapt to your network

Remember to configure the bonding method. Method 2 (balance-xor) appears to stream data more consistently than the default of 0 (balance-rr).

# /etc/modprobe.conf
alias bond0 bonding
options bonding mode=2 miimon=100
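
As a sketch, a complete VLAN sub-interface file might look like the following (all values are placeholders and assumptions; adapt names and addresses to your network):

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0.900 (hypothetical values)
DEVICE=bond0.900
VLAN=yes
ONBOOT=yes
BOOTPROTO=static
IPADDR=#.#.#.#        # iscsi VLAN address of this node
NETMASK=#.#.#.#
MTU=9000              # must match the bond interface and the switch gear
```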

Services Startup and Shutdown

Ensure that all services like iscsi-target are configured to not start by default. They will be managed by the heartbeat cluster services.
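
On CentOS this is done with chkconfig; a minimal sketch, assuming the service names match the installed packages:

```shell
# let the heartbeat cluster manage these; do not start them at boot
chkconfig iscsi-target off
chkconfig heartbeat off

# verify: should show "off" in all runlevels
chkconfig --list iscsi-target
```

Heartbeat itself is started from rc.local (see the next section), which is why it is also disabled here.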

Disks and LVM

It is assumed the reader understands Linux LVM. The critical thing during the startup of a storage processor node is to NOT activate the volume groups that will be served, but rather let heartbeat activate them.

There appears to be a problem with this logic: the rc.sysinit scripts inherently activate all volume groups. However, since no process accesses them (only the device mapper entries are mapped), they can be deactivated in rc.local. After deactivation, heartbeat can be started and can then manage the volume groups.

The rc.local file looks like:

# disable volume groups
vgchange -an

# then start heartbeat (this guarantees that we don't have the vg open
# on two nodes during heartbeat's startup)
service heartbeat start

touch /var/lock/subsys/local

Deactivating all volume groups automatically skips volume groups that are in use, like VolGroup00.

To ensure timely failover between node 1 and node 2 ensure that the size of snapshot logical volumes is not too large.

For example, if a volume group contains a 250GB snapshot (actual snapshot usage) it takes about 6 minutes to activate the volume group. The failover from node 1 to node 2 must be under 45 seconds for the clients to continue functioning properly.

The kernel not only has to read through that snapshot, it must also allocate the snapshot data structures. Monitoring the FC interface during that time reveals a port throughput of about 6MB/s (normally the disks can provide at least 90-130MB/s).


Heartbeat

The CentOS distribution provides heartbeat version 1 and version 2. Either version could be used; however, the configuration files below reflect version 1.

There are 3 configuration files that must be configured for heartbeat to function.


The authkeys file contains the authentication between the 2 cluster nodes. The following configuration is used:

auth 2
2 sha1 somesecretpreferablelong


The ha.cf file contains the heartbeat communication configuration items; the most important ones are listed below:

bcast bond0.902
auto_failback off
stonith_host external foo /etc/ha.d/
ping #.#.#.#
respawn hacluster /usr/lib/heartbeat/ipfail

The broadcast method is the simplest one to define, as it allows the configuration to be identical on both nodes. Ensure that the netmask is very small to limit the broadcast to a small network range.

Auto failback is off: if there was a failover from node 1 to node 2, it is not smart to have services fail back and forth between the nodes. If node 2 cannot handle the load, then there are much bigger problems. Node 2 is also needed when changing FC disk configurations and upgrading kernels, which require a reboot of the storage processor.

The stonith_host definition allows node 2 to shoot node 1 in the head if node 2 thinks node 1 is dead. The STONITH operation uses the rack PDUs to power off node 1. The other method would be to fence off the FC storage path of node 1; it is, however, smarter to just kill node 1 to avoid confusion. This can be implemented as follows:

snmpset -v 1 -c private $device1 PowerNet-MIB::rPDUOutletControlOutletCommand.$port i 2
snmpset -v 1 -c private $device2 PowerNet-MIB::rPDUOutletControlOutletCommand.$port i 2

Warning: just putting the above lines into the stonith shell script is a bad idea. It is recommended to create a secure stonith proxy host that has access to the rack PDUs only. The storage processors can then ssh into the stonith proxy using a non-root account, followed by a key (this prevents accidental death if you ssh to the stonith proxy from node 2):

ssh stonith@stonithhost a098sdfsad8f90asdf09s8adf08as

On that stonith proxy host, setup .ssh/authorized_keys2 as follows (this should be one line):

command="./ iscsi-targetX-node1"
ssh-rsa AAAAB3NzaC1.....hk0=

This allows only node 2 to ssh in from #.#.#.#. It forcibly executes the command in the stonith user's home directory with the first parameter iscsi-targetX-node1. By doing this, only one specific node can power off another specific node, and no unrelated cluster node. The script is summarized as follows:


case "$sys" in
	iscsi-targetX-node1)
		if [ "$key" == "a098sdfsad8f90asdf09s8adf08as" ]; then
			snmpset -v 1 -c private $device1 ...
			snmpset -v 1 -c private $device2 ...
		fi
		;;
esac

The ping IP address should be the default gateway of the VLAN that provides iscsi services. Without this, if node 2 cannot ping node 1 on the iscsi VLAN, then node 2 will kill node 1. With this, if node 2 cannot ping node 1 AND node 2 cannot ping the ping address (the default gateway), it will not bring down node 1, since there is some other communication problem on the network.


The resource file defines which resources are managed by the heartbeat cluster software:

The following is an example of this file.

# The IPaddr2 script is required, because on CentOS the name of the resulting
# cluster interface bond0.900:0 is too long and does not appear in the ifconfig
# listing. It appears as bond0.900 just like the main interface for vlan 900.
# Use IPaddr2 to get around this - to get all IP addresses type: ip addr.
# IPaddr2::#.#.#.# \
	amherst_lvm::vg_diskset0_vol0 \
	amherst_lvm::vg_diskset1_vol0 \

The amherst_lvm resource script is located under /etc/ha.d/resource.d and is a copy of the LVM script provided by heartbeat, modified to get around the LVM_VERSION detection problem. CentOS4 does not provide /sbin/lvmiopversion. Comment out the version detection code and set LVM_VERSION="200":

# [ $rc -ne 0 ]
# ha_log "ERROR: LVM: $1 could not determine LVM version"
# return $rc

The /usr/lib/heartbeat/hb_takeover script can be used to manually take over the services from the other node. It is recommended to initially configure ONLY the virtual IP address as a resource, then add the volume groups, and finally add the iscsi-target service.

Cluster Configuration Synchronization

Use rsync to keep the configuration consistent. The following shell script is used to synchronize the configuration from node 1 to node 2:

# rsync files spec'd in amh_cluster_sync.conf, source is / (root)
# on this system, destination is iscsi-targetX-node2:/ (root)
rsync -avrR \
	--delete \
	--files-from=amh_cluster_sync.conf \
	/ iscsi-targetX-node2:/

and the configuration file that defines the --files-from parameter:

# startup scripts

# cluster

# iscsi-target
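
As an illustration, a hypothetical amh_cluster_sync.conf covering the categories above might look like this (all paths are assumptions; adapt to your installation):

```shell
# startup scripts
/etc/rc.d/rc.local
/etc/modprobe.conf

# cluster
/etc/ha.d/authkeys
/etc/ha.d/ha.cf
/etc/ha.d/haresources
/etc/ha.d/resource.d/amherst_lvm

# iscsi-target
/etc/iscsi-target/
```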


iscsitarget

Install the iscsitarget RPM and the iscsitarget-kernel[-smp] module. The download section provides source code and binaries for CentOS4.

The following enhancements have been made:

The proc interface for VPD (Vital Product Data) displays the LUN's scsi_id and scsi_sn defined by ietd.conf:

	lun:0 path:/dev/vg_test/lv_test
		vpd_83_scsi_id: 49 45 54 00 00 00 00 00 00 00 00 00 02 00 00 00 09 11 00 00 0d 00 00 00 IET.....................
		vpd_80_scsi_sn: AMHDSK-061207-02

The proc interface for reservations displays the LUN's SCSI RESERVE/RELEASE status, including the initiator holding the lock:

	lun:0 path:/dev/vg_test/lv_test
		reserve:10 release:10 reserved:0 reserved_by:none

The reserve/release counters should normally increment in sync. The reserved value indicates how many times ietd was unable to honor a RESERVE command, and if the resource is currently reserved, the name of the holding initiator is displayed.

The init.d script has been modified to block iscsi traffic while shutting down ietd and allowing iscsi traffic as the daemon starts. This prevents clients from seeing the TCP connection drop, instead they simply hang for a few seconds, then discover that the connection does not function and they will reconnect (this is what happens during the failover process from node 1 to node 2).
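
The blocking can be sketched with iptables rules on the standard iSCSI port (3260). This is a simplified illustration of the idea, not the modified init.d script itself:

```shell
# block iscsi traffic before stopping ietd; clients see a stalled
# connection instead of a TCP reset
iptables -I INPUT -p tcp --dport 3260 -j DROP

# ... stop (or start) ietd here ...

# remove the block once ietd is ready to accept connections again
iptables -D INPUT -p tcp --dport 3260 -j DROP
```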

In theory this patch should not be necessary; however, Windows clients do get ugly when their storage disappears. If one sees red events in the event viewer from the Plug-and-Play manager complaining that a disk disappeared, then it is too late. This patch prevents that event.

The iscsitarget-tools package contains the tools to configure and manage the configuration files of ietd.


iscsitarget-tools

Install the iscsitarget-tools RPM provided in the download section. It delivers a set of shell scripts in /etc/iscsi-target.


The common.conf file is actually a shell script which defines several variables and functions. The functions are called while the is executed.

function global_options
	option IncomingUser "portalusername portalpassword"
function target_common_option
	option MaxRecvDataSegmentLength 131072
	option MaxXmitDataSegmentLength 131072
function lun_callback_pre
	> $ROOT/

function lun_callback

	source_dev=$(echo $lun_path | cut -f2 -d/)
	source_vg=$(echo $lun_path | cut -f3 -d/)
	source_lv=$(echo $lun_path | cut -f4 -d/)


	line="lvm_snap_copy -s lvm:$lun_path -t lvm:/$target_dev/$target_vg/$target_lv"
	echo $line >> $ROOT/

function lun_callback_post


The target.conf contains the definitions of targets and their LUNs. A sample is shown below:


function scsi_idsn
	echo "ScsiId=$1,ScsiSN=$1"

# Generate targets and lun assignments and localized options
function build_targets_luns
	target clustertarget1 #.#.#.#,#.#.#.#,...
		lun 0 /dev/vg_test/lv_test1_0 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 1 /dev/vg_test/lv_test1_1 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 2 /dev/vg_test/lv_test1_2 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 3 /dev/vg_test/lv_test1_3 $(scsi_idsn AMHDSK-YYMMDD-nn)

	target clustertarget2 #.#.#.#,#.#.#.#,...
		lun 0 /dev/vg_test/lv_test2_0 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 1 /dev/vg_test/lv_test2_0 $(scsi_idsn AMHDSK-YYMMDD-nn)

Since svn revision 96 there is the blockio type, which avoids the Linux page cache when reading data (it requires fast disks; we have seen a 2MB/s increase doing reads). LUN_TYPEIO is set by default to fileio; override it as required by the targets/LUNs defined below.

The scsi_idsn function provides the parameters used by ietd to define the SCSI ID (VPD 0x83) and SCSI Serial Number (VPD 0x80).
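
Since target.conf is plain shell, the helper can be exercised standalone to see exactly what ietd receives as LUN parameters (the serial number below is the sample from the vpd listing above):

```shell
# scsi_idsn as defined in target.conf, restated as a standalone function
scsi_idsn() {
	echo "ScsiId=$1,ScsiSN=$1"
}

scsi_idsn AMHDSK-061207-02   # -> ScsiId=AMHDSK-061207-02,ScsiSN=AMHDSK-061207-02
```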

The build_targets_luns function consists of statements to create targets and LUNs:


This command script updates and maintains the running ietd.conf, initiators.allow, initiators.deny dynamically.

The tool can be executed in one of two modes:

  1. online: updates configuration files and running ietd process (default)
  2. offline: updates configuration files only

The offline mode is useful when ietd cannot be started; this happens in particular when iscsitarget is initially installed, since there are no targets yet and ietd does not start without at least one target defined.

[root@iscsi-targetX iscsi-target]# ./ -h
./ [online|offline]
Default runmode is online. Choose offline if IETD is not running
and you still want to update the ietd.conf file

Detailed features:

One must first use lvextend or lvresize to increase or decrease (ouch!) the size of a logical volume. Then use the script to adjust the running ietd process. The LUN resizing process works as follows:

  1. using iptables block iscsi communication of the IP addresses defined by the target statement
  2. close all iscsi sessions of the target
  3. remove the LUN from ietd
  4. add the LUN back to ietd (now it knows the new size)
  5. using iptables unblock iscsi communications previously blocked in step 1
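
Steps 2-4 above could be performed by hand with the ietadm utility shipped with iscsitarget; a hedged sketch (tid, lun, sid/cid and path values are placeholders, the script automates all of this):

```shell
# close sessions of the target (sid/cid values come from /proc/net/iet/session)
ietadm --op delete --tid=1 --sid=$sid --cid=$cid

# remove the LUN, then add it back so ietd re-reads the (new) device size
ietadm --op delete --tid=1 --lun=0
ietadm --op new --tid=1 --lun=0 --params Path=/dev/vg_test/lv_test,Type=fileio
```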

While executing the script, it is generally wise to keep an eye in another terminal window on the following commands:

watch -d cat /proc/net/iet/volume
watch -d cat /proc/net/iet/session

Currently the script will NOT remove LUNs no longer defined in target.conf from the running environment. A future release might incorporate this feature.


lvm-tools

Install the lvm-tools RPM provided in the download section. It delivers a set of shell scripts to copy logical volumes:


The lvm_tools configuration file adjusts the following default parameters:

# default snap name suffix

# in percent (SM < 100GB, XL >= 100GB)

# enable md5 checks when doing remote or locally targeted snapshots

# enable raw devices on local copy sources (/dev/raw/rawN) (CentOS4 tested)
# this feature is incompatible with TOOL_PV and compression of LVs

# Expected minimum transfer rate (MB/s)

It is not recommended to enable MD5 checks in a production environment, as they cause the lvm_copy command to read the source once using md5sum, then transfer the data, then read the target using md5sum.


lvm_copy

lvm_copy is basically a glorified dd command; however, it can handle a mixture of files, lvm volumes and plain devices (like /dev/sdg) as source and destination. The command syntax:

/usr/sbin/lvm_copy -s source_device -t target_device [-z] [-p]

lvm_copy will copy from a source lvm device to a target, the target will
be created based on the source lvm device size.

-s	source device

	source syntax:

-t	target device

	target syntax:

-z compress with gzip (target must be a file)

-p prepare target but do NOT copy the data (useful only for lvm)

The host field is used by ssh.

lvm_copy determines the size of the source, then validates the target device. If the target device is a file, it will be overwritten. If the target is a logical volume, lvm_copy creates it if it does not exist and will extend it if the size does not match the source. If the target is a device, lvm_copy verifies that the device is large enough to hold contents of the source.

lvm_copy supports reading from raw devices when handling dev or lvm source types. This eliminates the read caching in the Linux kernel and speeds up the copy process.

lvm_copy uses dd to transport the data, optionally to a remote host (using ssh, public/private key cryptography is assumed).
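
A hypothetical local invocation, using the lvm:<path> source/target form seen in the lun_callback example earlier (volume group and logical volume names are placeholders):

```shell
# copy one logical volume into another volume group; the target LV is
# created (or extended) automatically based on the source size
lvm_copy -s lvm:/dev/vg_test/lv_test -t lvm:/dev/vg_backup/lv_test

# prepare the target LV only, without copying data
lvm_copy -s lvm:/dev/vg_test/lv_test -t lvm:/dev/vg_backup/lv_test -p
```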


lvm_snap_create

lvm_snap_create creates a snapshot of a logical volume. The command simplifies snapshot creation by automatically choosing a snapshot size based on a percentage of the origin logical volume.

/usr/sbin/lvm_snap_create -s source_device [-t target_suffix] [-n snap_size]
-s source lvm device
-t target suffix, by default, snap
-n snap size, by default 20% of LV < 100GB, 15% of LV >= 100GB
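
The default sizing rule can be sketched as a small shell function; this is an illustration of the documented defaults (20% under 100GB, 15% at or above), not the script's actual code:

```shell
# default snapshot size: 20% of origins under 100GB, 15% otherwise
snap_size_gb() {
	local origin_gb=$1
	if [ "$origin_gb" -lt 100 ]; then
		echo $(( origin_gb * 20 / 100 ))
	else
		echo $(( origin_gb * 15 / 100 ))
	fi
}

snap_size_gb 50    # -> 10 (GB snapshot for a 50GB origin)
snap_size_gb 200   # -> 30 (GB snapshot for a 200GB origin)
```

Keeping the percentage lower for large origins matters for failover time, as noted in the Disks and LVM section: large allocated snapshots slow down volume group activation considerably.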


lvm_snap_remove

lvm_snap_remove removes a snapshot logical volume. The command ensures that the logical volume being removed is actually the snapshot and not, accidentally, the origin volume.

/usr/sbin/lvm_snap_remove -s source_device
-s source lvm device (must be a snapshot device)


lvm_snap_copy

lvm_snap_copy automates the snap copy process by invoking lvm_snap_create, lvm_copy and lvm_snap_remove to snap copy a logical volume.

/usr/sbin/lvm_snap_copy -s source_device -t target_device [-z]

lvm_snap_copy first creates a snapshot of a lvm source device, then invokes
lvm_copy to copy the snapshot content to the target lvm or target file, then
it removes the snapshot.

Refer to lvm_copy for more details about the source and target parameters.
The source parameter can ONLY be of type lvm, since you cannot create a
snapshot of a device (like /dev/sda) or a file.

-s source lvm device
-t target lvm or other device
-z compress (with gzip) only works if target is not lvm (i.e. file or
remote file)


lvm_iostat

lvm_iostat is modelled after iostat and displays logical volume IO throughput: the number of reads and writes, reads/s, writes/s, IO queue size and IO queue time. The standard iostat displays device names like dm-3; lvm_iostat translates those names into the actual volume group and logical volume names.

/usr/sbin/lvm_iostat [-h] [-n name] -r|-d

-h help
-r realtime data
-d dump data (for rrd)

-n device name (ex.: 'hd' would match hda, hdb; 'dm' would match device
mapper devices; default matches hd[a-z]|sd[a-z]|dm-). Specify a
regular expression in quotes, like dm-.* to refer to all device
mapper devices.

Example view:

device  displayname                  read                write  read/s  write/s  #io queue  #io qtime[ms]
hda     hda                      0     0             0      0       0        0          0              0
sda     sda              115150848  109M  292581842944   272G       0        0          0              0
sdb     sdb            33843712000   31G 1247439607616   1.1T       0        0          0              0
sdc     sdc              653681664  623M    8757610496   8.2G       0        0          0              0
dm-8    vg_raidweb_0-lv_1   188416  184k             0      0       0        0          0              0
dm-9    vg_raidweb_0-lv_2   188416  184k             0      0       0        0          0              0
dm-7    vg_raidweb_0-lv_3   188416  184k             0      0       0        0          0              0 

Use watch -d to get a realtime view of the IO stats.


ddless

The ddless tool copies data from a source to a destination block device. It initially copies all data and subsequently copies only the changes. This is helpful if you would like to create snapshots at the destination side of the replication, because only the actually changed blocks are written. It is accomplished by reading 1MB chunks of data and segmenting them into 16K pieces, for each of which a CRC32 and Google's MurmurHash are calculated; together these make a 64-bit checksum. If the checksum of a 16K segment differs from the previous run, that segment is written to the destination device. For performance reasons, multiple neighboring 16K segments are written as one larger segment.
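
The arithmetic behind the segmenting, as a quick check:

```shell
# each 1MiB read chunk holds 64 checksum segments of 16KiB
segments_per_chunk=$(( (1024 * 1024) / (16 * 1024) ))
echo "$segments_per_chunk"   # -> 64

# each segment carries an 8-byte (CRC32 + MurmurHash) checksum,
# i.e. 512 bytes of checksum data per 1MiB of device data
echo $(( segments_per_chunk * 8 ))   # -> 512
```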

ddless operates in several different modes:

Make sure to test ddless with the -d (direct IO) parameter. It helps performance by bypassing the kernel buffers. One can control the read rate with the -r switch, measured in MB/s. ddless has been tested on 32/64 bit CentOS 4 and 5 platforms; the largest source/destination device was 15TB.

The name of the tool is a pun on more/less :)

ddless by Steffen Plotner, release date: 2008.08.03

Copy source to target keeping track of the segment checksums. Subsequent
copies are faster because we assume that not all of the source blocks change.

        ddless  [-d] -s source [-r read_rate_mb_s] -c checksum [-b] -t target
                [-m max_change_gigabytes -i cmd] [-v]

Produce a checksum file using the specified device. Hint: the device could be
source or target. Use the target and a new checksum file, then compare it to
the existing checksum file to ensure data integrity of the target.

        ddless  [-d] -s source -c checksum [-v]

Determine disk read speed zones, outputs data to stdout.

        ddless  [-d] -s source [-v]

Outputs the built in parameters

        ddless  -p

        -d      direct io enabled (i.e. bypasses buffer cache)

        -s      source device
        -r      read rate of source device in megabytes/sec
        -c      checksum file
        -b      bail out with exit code 3 because a new checksum file is
                required, no data is copied from source to target
        -t      target device

        -m      max number of bytes to be written to target device, when
                reached invoke the command defined by parameter -i
        -i      interrupt writing when -m limit is reached and invoke this
                command to, for example, lvextend snap or lvremove snap
                of target device

        -p      display parameters (segment size is known as chunksize in LVM2)
        -v      verbose
        -vv     verbose+debug

Exit codes:
        0       successful
        1       a runtime error code, unable to complete task (detailed perror
                and logical error message are output via stderr)
        2       max number of gigabytes to be written limit was reached,
                operation successful
        3       new checksum file required, only returned if -b is specified
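
A sketch of a replication cycle based on the usage above (all paths are hypothetical; in practice the source would be a snapshot created with lvm_snap_create):

```shell
# initial run: builds the checksum file and copies everything
ddless -d -s /dev/vg_src/lv0_snap -c /var/ddless/lv0.cks -t /dev/vg_dst/lv0 -v

# subsequent runs: only 16K segments whose checksums changed are written,
# which keeps destination-side snapshots small
ddless -d -s /dev/vg_src/lv0_snap -c /var/ddless/lv0.cks -t /dev/vg_dst/lv0 -v
```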


The tools mentioned above are available for download in source and binary form and are licensed under the GPL.

Last update: 2008-10-13

Use the tools at your own risk.


Steffen Plotner
Systems Administrator/Programmer
Systems & Networking Amherst College
Amherst, MA 01002
Email: swplotner {at}

Copyright (C) 2007-2008, Steffen Plotner, Amherst College

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License v2 as published by the Free Software Foundation.