There are several public iscsi-target implementation cookbook recipes; however, this one attempts to provide some unique features, including:
For simplicity of failover and management it was decided to use an active/passive storage controller arrangement. Both storage controllers must have access to the same set of disks (commonly served using Fibre Channel technology, or using a physical SCSI bus where each HBA is configured NOT to reset the bus).
There are two reasons for converting Fibre Channel to iSCSI:
Cluster configuration consistency is provided by rsync. This project provides tools to dynamically update the running ietd environment while maintaining the static configuration files (ietd.conf, initiators.allow, initiators.deny). New targets and LUNs are added automatically. The tools also detect LUN size changes for LUNs served from LVM logical volumes; the client sees the resized LUN without ietd being restarted.
The following sections detail the hardware setup, operating system installation, iscsitarget, iscsitarget-tools and lvm-tools:
The purpose of the hardware is to provide SP (storage processor) services to clients. Since a single physical storage processor is a single point of failure, it was decided to use two physical storage processors, one active and the other passive. Together they act as a single logical storage processor.
          ^  ^                                  ^  ^
          |  |        to iSCSI clients          |  |
          |  |                                  |  |
  +--------------------+              +--------------------+
  |    Gig Ethernet    |              |    Gig Ethernet    |
  +--------------------+              +--------------------+
          |  Ethernet (Gig EtherChannel)        |
          |                                     |
  +--------------------+              +--------------------+   Single Logical
  | iscsi-target-node1 |              | iscsi-target-node2 |   Storage Processor
  |     PSU1, PSU2     |              |     PSU1, PSU2     |   (node1, node2)
  +--------------------+              +--------------------+
          |  FC                                 |  FC
          +---------------+     +---------------+
                          |     |
                  +--------------------+
                  |     FC Switch      |
                  +--------------------+
                          |  FC
                  +--------------------+
                  |    Disk Storage    |
                  +--------------------+

  +--------------------+              +--------------------+
  |     Rack PDUs      |              |     Rack PDUs      |
  +--------------------+              +--------------------+
  (each rack PDU feeds one PSU in each node)
Details about the above diagram starting from the bottom going up:
Several open source packages are used to make the iscsi storage processor work:
CentOS4 is the recommended OS. Fedora Core 4 and 5 have been tried previously without success: snapshots are not stable, and putting the storage processor under load crashes the nodes (take 5 clients, have each of them run dd if=/dev/zero of=/dev/sdg bs=1M, and set up snapshots on each of the logical volumes exposed to the clients - it will break!).
Heartbeat is used to provide clustering services, such as a virtual IP, activation/deactivation of LVM volume groups, and starting and stopping of the iscsi-target service.
The iscsitarget package is used to provide iscsi services. Since the current setup involves VMware ESX v3, it requires at least svn revision 78. Additional patches are included to display the VPD (SCSI Vital Product Data) pages that identify the disk serial and disk id numbers associated with LUNs served by ietd (/proc/net/iet/vpd). Another patch is included to display the current state of the non-persistent reservations (/proc/net/iet/reservation).
The iscsitarget-tools package provides the tools to dynamically manage ietd without restarting it.
The lvm-tools package provides tools to copy LVM logical volumes.
Read the next sections which detail the setup and configuration of each of the above packages. At the end there is a section to download all of the above source code.
Install the latest version of the OS and apply all updates.
There are 2 physical interfaces which are bonded and carry VLANs for management, cluster services and iscsi traffic. The configuration files are located in /etc/sysconfig/network-scripts.
# ifcfg-eth0
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
# ifcfg-eth1
DEVICE=eth1
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes

# ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
MTU=9000		<- enable MTU 9000 on the switch gear also!

# ifcfg-bond0.900	<- vlan 900 is for iscsi
VLAN=yes
DEVICE=bond0.900
BOOTPROTO=none
ONBOOT=yes
NETWORK=#.#.#.#		<- adapt to your network
NETMASK=#.#.#.#
IPADDR=#.#.#.#
BROADCAST=#.#.#.#
# ifcfg-bond0.901	<- vlan 901 is for management
VLAN=yes
DEVICE=bond0.901
BOOTPROTO=none
ONBOOT=yes
NETWORK=#.#.#.#		<- adapt to your network
NETMASK=#.#.#.#
IPADDR=#.#.#.#
BROADCAST=#.#.#.#
# ifcfg-bond0.902	<- vlan 902 is for cluster communications
VLAN=yes
DEVICE=bond0.902
BOOTPROTO=none
ONBOOT=yes
NETWORK=#.#.#.#		<- adapt to your network
NETMASK=#.#.#.#
IPADDR=#.#.#.#
BROADCAST=#.#.#.#
Remember to configure the bonding mode. Mode 2 (balance-xor) appears to stream data more consistently than the default mode 0 (balance-rr).
# /etc/modprobe.conf
alias bond0 bonding
options bonding mode=2 miimon=100
Ensure that all services like iscsi-target are configured to not start by default. They will be managed by the heartbeat cluster services.
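On CentOS this can be done with chkconfig; a minimal sketch (the exact service names depend on what is installed, and heartbeat is started from rc.local as shown later):

# do not autostart cluster-managed services; heartbeat controls them
chkconfig iscsi-target off
chkconfig heartbeat off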
It is assumed the reader understands Linux LVM. The critical thing during the startup of a storage processor node is to NOT activate the volume groups that will be served, but rather let heartbeat activate them.
There appears to be a problem with this logic: the rc.sysinit script inherently activates all volume groups. However, since no process accesses them at that point (only the device mapper entries are mapped), they can be deactivated in rc.local. After deactivation heartbeat can be started, and heartbeat can then manage the volume groups.
The rc.local file looks like:
#
# disable volume groups
#
vgchange -an

#
# then start heartbeat (this guarantees that we don't have the vg open
# on two nodes during heartbeat's startup)
#
service heartbeat start

touch /var/lock/subsys/local
Deactivating all volume groups automatically skips volume groups that are in use, like VolGroup00.
To ensure timely failover between node 1 and node 2, keep the size of snapshot logical volumes reasonably small.
For example, if a volume group contains a 250GB snapshot (actual snapshot usage), it takes about 6 minutes to activate the volume group. The failover from node 1 to node 2 must complete in under 45 seconds for the clients to continue functioning properly.
The kernel has to not only read through that snapshot, but also must allocate the snapshot data structures. Monitoring the FC interface during that time reveals that the FC port throughput was about 6MB/s (normally the disks can provide at least 90-130MB/s).
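To keep an eye on snapshot growth, something like the following lvs invocation can be used (a sketch; snap_percent is the LVM2 report field on CentOS4-era releases):

# show snapshot origin and fill percentage for all logical volumes
lvs -o vg_name,lv_name,lv_size,origin,snap_percent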
The CentOS distribution provides heartbeat version 1 and version 2. Either version could be used; however, the configuration files below reflect version 1.
There are 3 configuration files that must be configured for heartbeat to function.
The authkeys file contains the authentication between the 2 cluster nodes. The following configuration is used:
auth 2
2 sha1 somesecretpreferablelong
The ha.cf file contains the heartbeat communication configuration items; the most important ones are listed below:
bcast bond0.902
auto_failback off
stonith_host iscsi-targetX-node2.amherst.edu external foo /etc/ha.d/shoot-iscsi-targetX-node1.sh
node iscsi-targetX-node1.amherst.edu
node iscsi-targetX-node2.amherst.edu
ping #.#.#.#
respawn hacluster /usr/lib/heartbeat/ipfail
The broadcast method is the simplest one to define, as it allows ha.cf to be identical on both nodes. Use a small subnet (netmask 255.255.255.252) to limit the broadcasts to a small network range.
Auto failback is off: if there has been a failover from node 1 to node 2, it is not smart to have the resources fail back and forth between the nodes. If node 2 cannot handle the load, there are far bigger problems. Node 2 is also needed when changing FC disk configurations and upgrading kernels, both of which require a reboot of a storage processor.
The stonith_host definition allows node 2 to shoot node 1 in the head if node 2 thinks node 1 is dead. The STONITH operation uses the rack PDUs to power off node 1. The other method would be to fence off the FC storage path for node 1; it is, however, smarter to simply kill node 1 to avoid confusion. This can be implemented as follows:
snmpset -v 1 -c private $device1 PowerNet-MIB::rPDUOutletControlOutletCommand.$port i 2
snmpset -v 1 -c private $device2 PowerNet-MIB::rPDUOutletControlOutletCommand.$port i 2
Warning: just putting the above lines into the stonith shell script is a bad idea. It is recommended to create a secure stonith proxy host that alone has access to the rack PDUs. The storage processors can then ssh into the stonith proxy using a non-root account, passing a key as the command (this prevents an accidental kill if you simply ssh to the stonith proxy from node 2):
ssh stonith@stonithhost a098sdfsad8f90asdf09s8adf08as
On that stonith proxy host, set up .ssh/authorized_keys2 as follows (this should be one line):
from="#.#.#.#",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty, command="./stonith.sh iscsi-targetX-node1" ssh-rsa AAAAB3NzaC1.....hk0= root@iscsi-targetX-node2.amherst.edu
This allows only node 2 to ssh in from #.#.#.#. It forcibly executes stonith.sh in the home directory of the stonith user with the first parameter iscsi-targetX-node1. By doing this, a specific node can power off only one other specific node, and not any unrelated cluster node. The stonith.sh script is summarized as follows:
sys=$1
key=$SSH_ORIGINAL_COMMAND

case "$sys" in
	iscsi-targetX-node1)
		if [ "$key" == "a098sdfsad8f90asdf09s8adf08as" ]; then
			snmpset -v 1 -c private $device1 ...
			snmpset -v 1 -c private $device2 ...
		fi
		;;
esac
The ping IP address should be the default gateway of the VLAN that provides iscsi services. Without it, if node 2 cannot ping node 1 on the iscsi VLAN, node 2 will kill node 1. With it, if node 2 can ping neither node 1 nor the ping address (the default gateway), it will not bring down node 1, since there is evidently a wider communication problem on the network.
The haresources file defines which resources are managed by the heartbeat cluster software:
The following is an example of this file.
#
# The IPaddr2 script is required, because on CentOS the name of the resulting
# cluster interface bond0.900:0 is too long and does not appear in the ifconfig
# listing. It appears as bond0.900 just like the main interface for vlan 900.
# Use IPaddr2 to get around this - to get all IP addresses type: ip addr.
#
iscsi-targetX-node1.amherst.edu IPaddr2::#.#.#.# \
	amherst_lvm::vg_diskset0_vol0 \
	amherst_lvm::vg_diskset1_vol0 \
	iscsi-target
The amherst_lvm resource script is located under /etc/ha.d/resource.d and is a copy of the LVM script provided by heartbeat, modified to get around the LVM_VERSION detection problem. CentOS4 does not provide /sbin/lvmiopversion. Comment out the version detection code and set LVM_VERSION="200":
#LVM_VERSION=`/sbin/lvmiopversion`
LVM_VERSION="200"
#rc=$?
#if
#	[ $rc -ne 0 ]
#then
#	ha_log "ERROR: LVM: $1 could not determine LVM version"
#	return $rc
#fi
The /usr/lib/heartbeat/hb_takeover script can be used to manually take over the services from the other node. It is recommended to initially configure ONLY the virtual IP address as a resource, then add the volume groups, and finally add the iscsi-target service.
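A typical manual failover test then looks like this (a sketch; run it on the node that should become active):

# pull the resources over to this node, then verify the virtual IP arrived
/usr/lib/heartbeat/hb_takeover
ip addr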
Use rsync to keep the configuration consistent. The following shell script is used to synchronize the configuration from node 1 to node 2:
#
# rsync files spec'd in cluster_sync.conf, source is / (root)
# on this system, destination is nfs-node2:/ (root)
#
rsync -avrR \
	--delete \
	--files-from=amh_cluster_sync.conf \
	/ iscsi-targetX-node2:/
and the configuration file that defines the --files-from parameter:
# startup scripts
/etc/rc.d/rc.local

# cluster
/etc/ha.d/

# iscsi-target
/etc/iscsi-target/
/etc/ietd.conf
/etc/initiators.allow
/etc/initiators.deny
Install the iscsitarget RPM and iscsitarget-kernel[-smp] module. The download section provides source code and binaries for CentOS4.
The following enhancements have been made:
The proc interface for VPD (Vital Product Data) displays the LUN's scsi_id and scsi_sn defined in ietd.conf:
tid:1 name:iqn.1990-01.edu.amherst.iscsi-target:target_test
	lun:0 path:/dev/vg_test/lv_test
		vpd_83_scsi_id: 49 45 54 00 00 00 00 00 00 00 00 00 02 00 00 00 09 11 00 00 0d 00 00 00  IET.....................
		vpd_80_scsi_sn: AMHDSK-061207-02
The proc interface for reservations displays the LUN's SCSI RESERVE/RELEASE status and includes the initiator that is holding the lock:
tid:1 name:iqn.1990-01.edu.amherst.iscsi-target:target_test
	lun:0 path:/dev/vg_test/lv_test
		reserve:10 release:10 reserved:0 reserved_by:none
The reserve and release counters should normally increment in sync. The reserved value indicates how many times ietd was unable to honor a RESERVE command, and if the resource is currently reserved, the initiator name is displayed.
The init.d script has been modified to block iscsi traffic while shutting down ietd and to allow iscsi traffic again as the daemon starts. This prevents clients from seeing the TCP connection drop; instead, they simply hang for a few seconds, discover that the connection does not function, and reconnect (this is what happens during the failover process from node 1 to node 2).
In theory this patch should not be necessary; however, Windows clients do get ugly when their storage disappears. If one sees red events in the Event Viewer from the Plug and Play manager complaining that a disk disappeared, then it is too late. This patch prevents that event.
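A rough sketch of the idea behind the patch (not the actual init.d code), assuming the default iSCSI portal port 3260:

# stop) block iscsi traffic first, so initiators retransmit and hang briefly
#       instead of seeing a TCP reset, then shut down ietd
iptables -I INPUT -p tcp --dport 3260 -j DROP
killall ietd

# start) bring ietd back up first, then let iscsi traffic flow again
/usr/sbin/ietd
iptables -D INPUT -p tcp --dport 3260 -j DROP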
The iscsitarget-tools package contains the tools to configure and manage the configuration files of ietd.
Install the iscsitarget-tools RPM provided in the download section. It delivers a set of shell scripts in /etc/iscsi-target.
The common.conf file is actually a shell script which defines several variables and functions. The functions are called while update.sh is executed.
function global_options
{
	option IncomingUser "portalusername portalpassword"
}
function target_common_option
{
	option MaxRecvDataSegmentLength 131072
	option MaxXmitDataSegmentLength 131072
}
function lun_callback_pre
{
	> $ROOT/lv-backup.sh
}

function lun_callback
{
	lun_path=$1

	source_dev=$(echo $lun_path | cut -f2 -d/)
	source_vg=$(echo $lun_path | cut -f3 -d/)
	source_lv=$(echo $lun_path | cut -f4 -d/)

	target_dev=$source_dev
	target_vg=vg_backup
	target_lv=${source_lv}_bk

	line="lvm_snap_copy -s lvm:$lun_path -t lvm:/$target_dev/$target_vg/$target_lv"
	echo $line >> $ROOT/lv-backup.sh
}

function lun_callback_post
{
	return
}
The target.conf contains the definitions of targets and their LUNs. A sample is shown below:
LUN_TYPEIO=blockio

function scsi_idsn
{
	echo "ScsiId=$1,ScsiSN=$1"
}

#
# Generate targets and lun assignments and localized options
#
function build_targets_luns
{
	target clustertarget1 #.#.#.#,#.#.#.#,...
		lun 0 /dev/vg_test/lv_test1_0 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 1 /dev/vg_test/lv_test1_1 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 2 /dev/vg_test/lv_test1_2 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 3 /dev/vg_test/lv_test1_3 $(scsi_idsn AMHDSK-YYMMDD-nn)

	target clustertarget2 #.#.#.#,#.#.#.#,...
		lun 0 /dev/vg_test/lv_test2_0 $(scsi_idsn AMHDSK-YYMMDD-nn)
		lun 1 /dev/vg_test/lv_test2_1 $(scsi_idsn AMHDSK-YYMMDD-nn)
}
Since svn revision 96 there is the blockio type, which avoids the Linux cache when reading data (it requires fast disks; we have seen a 2MB/s increase on reads). LUN_TYPEIO is set by default to fileio; override it as required for the targets/LUNs defined below.
The scsi_idsn function provides the parameters used by ietd to define the SCSI ID (VPD 0x83) and SCSI Serial Number (VPD 0x80).
The build_targets_luns function consists of target and lun statements that create the targets and their LUN assignments.
The update.sh command script dynamically updates and maintains ietd.conf, initiators.allow and initiators.deny, and keeps the running ietd in sync with them.
The tool can be executed in one of two modes:
The offline mode is useful when ietd cannot be started; this happens in particular when iscsitarget is initially installed. There are initially no targets, so ietd does not start, yet at least one target must be defined before it can.
[root@iscsi-targetX iscsi-target]# ./update.sh -h
./update.sh [online|offline]
	Default runmode is online. Choose offline if IETD is not running
	and you still want to update the ietd.conf file
Detailed features:
One must first use lvextend or lvresize to increase or decrease (ouch!) the size of a logical volume, and then run the update.sh script to adjust the running ietd process so that the client sees the resized LUN.
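A hedged example of the resize sequence, assuming a served logical volume named /dev/vg_test/lv_test1_0 (the name is illustrative only):

# grow the backing logical volume by 50GB
lvextend -L +50G /dev/vg_test/lv_test1_0

# let update.sh propagate the new size into the running ietd
cd /etc/iscsi-target
./update.sh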
While executing the update.sh script, it is generally wise to keep an eye on the following commands in another terminal window:
watch -d cat /proc/net/iet/volume
watch -d cat /proc/net/iet/session
Currently the update.sh script will NOT remove LUNs no longer defined in target.conf from the running environment. A future release might incorporate this feature.
Install the lvm-tools RPM provided in the download section. It delivers a set of shell scripts to copy logical volumes:
The lvm_tools configuration file adjusts the following default parameters:
# default snap name suffix
DEFAULT_SNAP_SUFFIX=snap

# in percent (SM < 100GB, XL >= 100GB)
DEFAULT_SNAP_SPACE_SM=20
DEFAULT_SNAP_SPACE_XL=15

# enable md5 checks when doing remote or locally targeted snapshots
ENABLE_MD5=0

#
# enable raw devices on local copy sources (/dev/raw/rawN) (CentOS4 tested)
# this feature is incompatible with TOOL_PV and compression of LVs
#
ENABLE_RAW=1

#
# Expected minimum transfer rate (MB/s)
#
MIN_TRANSFER_RATE=4
It is not recommended to enable MD5 checks in a production environment, as it causes the lvm_copy command to read the source once using md5sum, then transfer the data, and then read the target using md5sum.
lvm_copy is basically a glorified dd command; however, it can handle a mixture of files, LVM logical volumes and devices (like /dev/sdg) as sources and destinations. The command syntax:
/usr/sbin/lvm_copy -s source_device -t target_device [-z] [-p]

lvm_copy will copy from a source lvm device to a target, the target
will be created based on the source lvm device size.

	-s	source device
		source syntax:	file:/tmp/somefile
				lvm:/dev/vg/lv
				dev:/dev/sda

	-t	target device
		target syntax:	file:/tmp/somefile
				lvm:/dev/vg/lv
				dev:/dev/sda

				file:host:/tmp/somefile
				lvm:host:/dev/vg/lv
				dev:host:/dev/sda

	-z	compress with gzip (target must be a file)
	-p	prepare target but do NOT copy the data (useful only for lvm)

	The host field is used by ssh.
lvm_copy determines the size of the source, then validates the target device. If the target device is a file, it will be overwritten. If the target is a logical volume, lvm_copy creates it if it does not exist and extends it if the size does not match the source. If the target is a device, lvm_copy verifies that the device is large enough to hold the contents of the source.
lvm_copy supports reading from raw devices when handling dev or lvm source types. This eliminates the read caching in the Linux kernel and speeds up the copy process.
lvm_copy uses dd to transport the data, optionally to a remote host (using ssh; public/private key authentication is assumed).
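A hypothetical invocation copying a local LV to an LV on a remote host (backuphost, vg_backup and the LV names are made-up examples):

# copy a local logical volume to a remote logical volume over ssh
lvm_copy -s lvm:/dev/vg_test/lv_test1_0 -t lvm:backuphost:/dev/vg_backup/lv_test1_0_bk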
lvm_snap_create creates a snapshot of a logical volume. The command simplifies snapshot creation by automatically sizing the snapshot as a percentage of the origin logical volume.
/usr/sbin/lvm_snap_create -s source_device [-t target_suffix] [-n snap_size]

	-s	source lvm device
	-t	target suffix, by default: snap
	-n	snap size, by default 20% of LV < 100GB, 15% of LV >= 100GB
lvm_snap_remove removes a snapshot logical volume. The command ensures that the logical volume being removed is actually a snapshot and not accidentally the origin volume.
/usr/sbin/lvm_snap_remove -s source_device

	-s	source lvm device (must be a snapshot device)
lvm_snap_copy automates the snap copy process by invoking lvm_snap_create, lvm_copy and lvm_snap_remove to snap copy a logical volume.
/usr/sbin/lvm_snap_copy -s source_device -t target_device [-z]

lvm_snap_copy first creates a snapshot of a lvm source device, then
invokes lvm_copy to copy the snapshot content to the target lvm or
target file, then it removes the snapshot. Refer to lvm_copy for more
details about the source and target parameters. The source parameter
can ONLY be of type lvm, since you cannot create a snapshot of a
device (like /dev/sda) or file.

	-s	source lvm device
	-t	target lvm or other device
	-z	compress (with gzip), only works if target is not lvm
		(i.e. file or remote file)
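For example (hypothetical names; the remote host and path are placeholders):

# snapshot the source LV, stream it compressed to a file on a remote host,
# then remove the snapshot again
lvm_snap_copy -s lvm:/dev/vg_test/lv_test1_0 -t file:backuphost:/backups/lv_test1_0.img.gz -z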
lvm_iostat is modelled after iostat and displays logical volume IO throughput: the number of reads and writes, reads/s, writes/s, IO queue size and IO queue time. The standard iostat displays device names like dm-3; lvm_iostat translates those names into the actual volume group and logical volume names.
/usr/sbin/lvm_iostat [-h] [-n name] -r|-d

	-h	help
	-r	realtime data
	-d	dump data (for rrd)
	-n	device name (ex.: 'hd' would match hda, hdb; 'dm' would match
		device mapper devices; default matches hd[a-z]|sd[a-z]|dm-).
		Specify a regular expression in quotes, like dm-.* to refer
		to all device mapper devices.
Example view:
device  displayname        read                   write                   read/s  write/s  #io queue  #io qtime[ms]
hda     hda                0            0         0              0        0       0        0          0
sda     sda                115150848    109M      292581842944   272G     0       0        0          0
sdb     sdb                33843712000  31G       1247439607616  1.1T     0       0        0          0
sdc     sdc                653681664    623M      8757610496     8.2G     0       0        0          0
dm-8    vg_raidweb_0-lv_1  188416       184k      0              0        0       0        0          0
dm-9    vg_raidweb_0-lv_2  188416       184k      0              0        0       0        0          0
dm-7    vg_raidweb_0-lv_3  188416       184k      0              0        0       0        0          0
Use watch -d to get a realtime view of the IO stats.
The ddless tool copies data from a source to a destination block device. It will initially copy all data and subsequently only copy the changes. This can be helpful if you would like to create snapshots at the destination side of the replication, because only the actually changed blocks are written. This is accomplished by reading 1MB chunks of data and segmenting them into 16K pieces, for each of which a CRC32 and Google's MurmurHash are calculated; together that makes a 64-bit checksum. If the source's checksum for a 16K segment differs from the previous run, that 16K segment is written to the destination device. For performance reasons, multiple neighboring 16K segments are written as one larger segment.
ddless operates in several different modes:
Make sure to test ddless with the -d (direct IO enabled) parameter. It helps performance by bypassing the kernel buffers. One can control the read rate by specifying the -r switch, measured in MB/s. ddless has been tested on 32- and 64-bit CentOS 4 and 5 platforms. The largest source/destination device used was 15TB.
The name of the tool is a pun on more/less :)
ddless by Steffen Plotner, release date: 2008.08.03

Copy source to target keeping track of the segment checksums. Subsequent
copies are faster because we assume that not all of the source blocks change.

	ddless	[-d] -s source [-r read_rate_mb_s] -c checksum [-b]
		-t target [-m max_change_gigabytes -i cmd] [-v]

Produce a checksum file using the specified device. Hint: the device could
be source or target. Use the target and a new checksum file, then compare
it to the existing checksum file to ensure data integrity of the target.

	ddless	[-d] -s source -c checksum [-v]

Determine disk read speed zones, outputs data to stdout.

	ddless	[-d] -s source [-v]

Outputs the built in parameters

	ddless	-p

Parameters
	-d	direct io enabled (i.e. bypasses buffer cache)
	-s	source device
	-r	read rate of source device in megabytes/sec
	-c	checksum file
	-b	bail out with exit code 3 because a new checksum file is
		required, no data is copied from source to target
	-t	target device
	-m	max number of bytes to be written to target device, when
		reached invoke the command defined by parameter -i
	-i	interrupt writing when -m limit is reached and invoke this
		command to, for example, lvextend snap or lvremove snap of
		target device
	-p	display parameters (segment size is known as chunksize in LVM2)
	-v	verbose
	-vv	verbose+debug

Exit codes:
	0	successful
	1	a runtime error code, unable to complete task (detailed perror
		and logical error messages are output via stderr)
	2	max number of gigabytes to be written limit was reached,
		operation successful
	3	new checksum file required, only returned if -b is specified
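A hedged replication sketch (the device and checksum file paths are made up; the checksum file is created on the first run):

# first run copies everything and records per-16K-segment checksums;
# subsequent runs only write segments whose checksums changed
ddless -d -s /dev/vg_test/lv_test1_0 -c /root/ddless/lv_test1_0.ck -t /dev/vg_backup/lv_test1_0_bk

# optional integrity check: checksum the target into a new file and compare
ddless -d -s /dev/vg_backup/lv_test1_0_bk -c /root/ddless/lv_test1_0_verify.ck
cmp /root/ddless/lv_test1_0.ck /root/ddless/lv_test1_0_verify.ck && echo "target matches source"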
The tools mentioned above are available for download in source and binary form and are licensed under the GPL.
Last update: 2008-10-13
Use the tools at your own risk.
Steffen Plotner
Systems Administrator/Programmer
Systems & Networking
Amherst College
Amherst, MA 01002
Email: swplotner {at} amherst.edu
Copyright (C) 2007-2008, Steffen Plotner, Amherst College
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License v2 as published by the Free Software
Foundation.