First topic: core changes in the iSCSI software initiator from ESX 3.x.
The ESX software iSCSI initiator was completely rewritten for vSphere 4. This was done primarily for performance reasons, but also because the vSphere 4 compatibility base for Linux drivers transitioned from the 2.4 kernel to the 2.6 kernel. Remember that while the vmkernel doesn’t “run” Linux, or “run on Linux” – the core driver stack has common elements with Linux. Along with the service console running a Linux variant, these are the two common sources of the “VMware runs on Linux” theory – which is decidedly incorrect.
As an aside, there is also an interest in publishing an iSCSI HBA DDK, allowing HBA vendors to write and supply their own drivers, decoupled from ESX releases. The changes could also allow storage vendors to write and supply components to manage sessions to make better use of the pluggable multipathing capability delivered in ESX4. (Neither the HBA DDK nor the session capability has been released yet. Development, documentation and certification suites are still underway.)
Some of the goodness that was in ESXi 3.5 has also made it into all ESX versions:
- The requirement for a Console OS port on your iSCSI network has been removed. All iSCSI control path operations are done through the same vmkernel port used for the data path. This compares with ESX 3.x where iSCSI control operations required a console port. This is a very good thing: no console port needed for ESX4 – all versions.
- Enabling the iSCSI service also automatically configures all the firewall properties needed.
Performance is improved several ways:
- Storage paths are more efficient and keep copying and potentially blocking operations to a minimum.
- Systems using Intel Nehalem processors can offload digest calculation to the processors' built-in CRC calculation engine.
- However, the biggest performance gain is allowing the storage system to scale to the number of NICs available on the system. The idea is that the storage multipath system can make better use of the multiple paths it has available to it than NIC teaming at the network layer.
- If each physical NIC on the system looks like a port to a path to storage, the storage path selection policies can make better use of them.
Second topic: iSCSI Multipathing
This is perhaps the most important change in the vSphere iSCSI stack.
iSCSI Multipathing is sometimes also referred to as "port binding." However, this term is ambiguous enough (it often makes people think, incorrectly, of “link aggregation”) that we should come up with a better term…
By default, iSCSI multipathing is not enabled in vSphere4. The ESX iSCSI initiator uses vmkernel networking similarly to ESX 3.5, out of the box. The initiator presents a single endpoint and NIC teaming through the ESX vSwitch takes care of choosing the NIC. This allows easy upgrades from 3.5 and simple configuration of basic iSCSI setups.
Setting up iSCSI multipathing requires some extra effort because of the additional layer of virtualization provided by the vSwitch. The ESX vmkernel networking stack, used by the iSCSI initiator, communicates with virtual vmkernel NICs, or vmkNICs. The vmkNICs are attached to a virtual switch, or vswitch, that is then attached to physical NICs.
Once iSCSI multipathing is set up, each port on the ESX system has its own IP address, but they all share the same iSCSI initiator iqn. name.
So – setup in 4 easy steps:
Step 1 – configure multiple vmkNICs
Ok, the first obvious point (but we’re not making any assumptions) is that you will need to configure multiple physical Ethernet interfaces, and multiple vmkernel NIC (vmkNIC) ports, as shown in the screenshot below.
You do this by navigating to the Properties dialog for a vSwitch and selecting “add”, or by simply clicking on “add Networking” and adding additional vmkNICs.
This can also be done via the command line:
esxcfg-vmknic --server -a -i 10.11.246.51 -n 255.255.255.0
Note: certain vmkNIC parameters (such as jumbo frame configuration) can only be done as the vmkNIC is being initially configured. Changing them subsequently requires removing and re-adding the vmkNIC. For the jumbo frame example, see that section later in this post.
Step 2 – configure explicit vmkNIC-to-vmNIC binding.
To make sure the vmkNICs used by the iSCSI initiator are actual paths to storage, ESX configuration requires that the vmkNIC be connected to a portgroup that has only one active uplink and no standby uplinks. This way, if the uplink is unavailable, the storage path is down and the storage multipathing code can choose a different path. Let’s be REALLY clear about this – you shouldn’t use link aggregation techniques with iSCSI – you should/will use MPIO (which defines end-to-end paths from initiator to target). This isn’t stating that link aggregation techniques are bad (they are often needed in the NFS datastore use case) – but remember that block storage models use MPIO in the storage stack, not the networking stack, for multipathing behavior.
Setting up the vmkNICs to use only a single uplink can be done through the UI, as shown below – just select the adapter in the “active” list and move it down to “unused adapters”, such that each vmkNIC used for iSCSI has only one active physical adapter.
Instructions for doing this are found in Chapter 3 of the iSCSI SAN Configuration Guide, currently page 32.
Step 3 – configuring the iSCSI initiator to use the multiple vmkNICs
Then the final step requires command line configuration. This step is where you assign, or bind, the vmkNICs to the ESX iSCSI initiator. Once the vmkNICs are assigned, the iSCSI initiator uses these specific vmkNICs as outbound ports, rather than the vmkernel routing table. Get the list of the vmkNICs used for iSCSI (in the screenshot below, this was done using the vicfg-vmknic --server -l command).
Then, explicitly tell the iSCSI software initiator to use all the appropriate iSCSI vmkNICs using the following command:
esxcli --server swiscsi nic add -n -d
To identify the vmhba name, navigate to the “Configuration” tab in the vSphere client, and select “Storage Adapters”. You’ll see a screen like the one below. In the screenshot below, the vmhba_name is “vmhba38”. Note also in the screenshot below, the 2 devices have four paths.
The end result of this configuration is that you end up with multiple paths to your storage. How many depends on your particular iSCSI target (storage vendor type). The iSCSI initiator will log in to each iSCSI target reported by the “Send targets” command issued to the iSCSI target listed in the “Dynamic Discovery” dialog box from each iSCSI vmkNIC.
Before we proceed we need to introduce the storage concept of a storage portal to those who may not be familiar with iSCSI. At a high level an iSCSI portal is the IP address(es) and port number of a SCSI storage target. Each storage vendor may implement storage portals in slightly different manners.
In storage nomenclature you will see a device’s “runtime name” represented in the following format: vmhba#:C#:T#:L#. The C represents the storage channel, the T is the SCSI target, and the L represents the LUN.
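As a small illustration of that naming format, here is a minimal Python sketch that splits a runtime name into its parts. The function and class names are hypothetical (they are not part of any VMware tool), the C# field is labelled as the storage channel number following VMware's documentation, and the sample value simply reuses the “vmhba38” adapter name mentioned earlier with made-up C/T/L numbers.

```python
import re
from typing import NamedTuple


class RuntimeName(NamedTuple):
    adapter: str   # e.g. "vmhba38"
    channel: int   # the C# field
    target: int    # the T# field (SCSI target)
    lun: int       # the L# field (LUN)


# Matches the vmhba#:C#:T#:L# layout described above.
_RUNTIME_NAME = re.compile(r"^(vmhba\d+):C(\d+):T(\d+):L(\d+)$")


def parse_runtime_name(name: str) -> RuntimeName:
    """Split a runtime name into adapter, channel, target and LUN."""
    match = _RUNTIME_NAME.match(name)
    if match is None:
        raise ValueError(f"not a vmhba#:C#:T#:L# runtime name: {name!r}")
    adapter, channel, target, lun = match.groups()
    return RuntimeName(adapter, int(channel), int(target), int(lun))


# Hypothetical sample value, for illustration only.
print(parse_runtime_name("vmhba38:C0:T1:L4"))
```

Two runtime names that differ only in their C# or T# fields but share the same L# are typically different paths to the same LUN, which is what the path counts in the following paragraphs are about.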
With single-portal storage, such as EqualLogic or LeftHand systems, you'll get as many paths to the storage as you have vmkNICs (Up to the ESX maximum of 8 per LUN/Volume) for iSCSI use. These storage systems only advertise a single storage port, even though connections are redirected to other ports, so ESX establishes one path from each server connection point (the vmkNICs) to the single storage port.
A single-portal variation is an EMC Celerra iSCSI target. In the EMC Celerra case, a large number of iSCSI targets can be configured, but a LUN exists behind a single target – and the Celerra doesn’t redirect in the way EqualLogic or LeftHand do. In the EMC Celerra case, configure an iSCSI target network portal with multiple IP addresses. This is done by simply assigning multiple logical (or physical) interfaces to a single iSCSI target. ESX will establish one path from each server connection (the vmkNICs) to all the IP addresses of the network portal.
Yet other storage advertises multiple ports for the storage, either with a separate target iqn. name or with different target portal group tags (pieces of information returned to the server from the storage during initial discovery). These multi-portal storage systems, such as EMC CLARiiON, NetApp FAS, and IBM N-Series, allow paths to be established between each server NIC and each storage portal. So, if your server has three vmkNICs assigned for iSCSI and your storage has two portals, you'll end up with six paths.
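The path arithmetic above can be sketched in a few lines of Python. This is only a counting model under stated assumptions: the vmkNIC names and portal addresses are hypothetical, port 3260 is the standard registered iSCSI port, and the limit of 8 is the per-LUN maximum quoted earlier in this post.

```python
from itertools import product

MAX_PATHS_PER_LUN = 8  # ESX per-LUN maximum for iSCSI quoted earlier in this post


def enumerate_paths(vmknics, portals):
    """Pair every iSCSI-bound vmkNIC with every advertised storage portal."""
    return list(product(vmknics, portals))


def path_count(num_vmknics, num_portals, limit=MAX_PATHS_PER_LUN):
    """Number of paths to a LUN, capped at the per-LUN maximum."""
    return min(num_vmknics * num_portals, limit)


# Hypothetical multi-portal example: three vmkNICs, two storage portals.
vmknics = ["vmk1", "vmk2", "vmk3"]
portals = ["10.2.0.8:3260", "10.2.0.9:3260"]

for nic, portal in enumerate_paths(vmknics, portals):
    print(f"{nic} -> {portal}")

print("multi-portal paths:", path_count(len(vmknics), len(portals)))
print("single-portal paths:", path_count(len(vmknics), 1))
```

With a single-portal array the second factor is 1, so the count is simply the number of vmkNICs, which matches the EqualLogic/LeftHand description above.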
These variations shouldn’t be viewed as intrinsically better/worse (at least for the purposes of this multivendor post – let’s leave positioning to the respective sales teams). Each array has a different model for how iSCSI works.
There are some limitations for multiple-portal storage that require particular consideration. For example, EMC CLARiiON currently only allows a single login to each portal from each initiator iqn. Since all of the initiator ports have the same iqn., this type of storage rejects the second login. (You can find log messages about this with logins failing reason 0x03 0x02, "Out of Resources."). You can work around this problem by using the subnet configuration described here. Details on the CLARiiON iSCSI target configuration and multipathing state can be seen in the EMC Storage Viewer vCenter plugin.
By default storage arrays from NetApp, including the IBM N-Series, provide an iSCSI portal for every IP address on the controller. This setting can be modified by implementing access lists and / or disabling iSCSI access on physical Ethernet ports. The NetApp Rapid Cloning Utility provides an automated means to configure these settings from within vCenter.
Note that iSCSI Multipathing is not currently supported with Distributed Virtual Switches, either the VMware offering or the Cisco Nexus 1000V. Changes are underway to fix this and allow any virtual switch to be supported.
Step 4 – Enabling Multipathing via the Pluggable Storage Architecture
Block storage multipathing is handled by the MPIO part of the storage stack, and selects paths (for both performance and availability purposes) based on an end-to-end path.
This is ABOVE the SCSI portion of the storage stack (which is above iSCSI which in turn is above the networking stack). Visualize the “on-ramp” to a path as the SCSI initiator port. More specifically in the iSCSI case, this is based on the iSCSI session – and after step 3, you will have multiple iSCSI sessions. So, if you have multiple iSCSI sessions to a single target (and by implication all the LUNs behind that target), you have multiple ports, and MPIO can do its magic across those ports.
This next step is common across iSCSI, FC, & FCoE.
When it comes to path selection, bandwidth aggregation and link resiliency in vSphere, customers have the option to use one of VMware's Native Multipathing (NMP) Path Selection Policies (PSP), 3rd party PSPs, or 3rd party Multipathing Plug-ins (MPP) such as PowerPath V/E from EMC.
All vendors on this post support all of the NMP PSPs that ship with vSphere, so we’ll put aside the relative pros/cons of 3rd party PSPs and MPPs in this post, and assume use of NMP.
NMP is included in all vSphere releases at no additional cost. NMP is supported in turn by two “pluggable modules”. The Storage Array Type Plugin (SATP) identifies the storage array and assigns the appropriate Path Selection Plugin (PSP) based on the recommendations of the storage partner.
VMware ships with a set of native SATPs, and 3 PSPs: Fixed, Most Recently Used (MRU), & Round Robin (RR). Fixed and MRU options were available in VI3.x and should be familiar to readers. Round Robin was experimental in VI3.5, and is supported for production use in vSphere (all versions).
Configuring NMP to use a specific PSP (such as Round Robin) is simple and easy. You can do it in the vSphere Client under configuration, storage adapter, select the devices, and right click for properties. That shows this dialog box (note that Fixed or MRU are always the default, and with those policies, depending on your array type – you may have many paths as active or standby, only one of them will be shown as “Active (I/O)”):
You can change the Path Selection Plugin with the pull down in the dialog box. Note that this needs to be done manually for every device, on every vSphere server when using the GUI. It’s important to do this consistently across all the hosts in the cluster. Also notice that when you switch the setting in the pull-down, it takes effect immediately – and doesn’t wait for you to hit the “close” button.
You can also configure the PSP for any device using this command:
esxcli --server nmp device setpolicy --device --psp
Alternatively, vSphere ESX/ESXi 4 can be configured to automatically choose round robin for any device claimed by a given SATP. To make all new devices that use a given SATP automatically use round robin, configure ESX/ESXi to use it as the default path selection policy from the command line.
esxcli --server corestorage claiming unclaim --type location
esxcli --server nmp satp setdefaultpsp --satp --psp VMW_PSP_RR
esxcli --server corestorage claimrule load
esxcli --server corestorage claimrule run
Three Additional Questions and Answers on the Round Robin PSP…
Question 1: “When shouldn’t I configure Round Robin?”
Answer: While configuring the interface you may note that Fixed and MRU are always the default PSP associated with the native SATP options – across all arrays. This is a protective measure in case you have VMs running Microsoft Cluster Services (MSCS). Round Robin can interfere with applications that use SCSI reservations for sharing LUNs among VMs and thus is not supported with the use of LUNs with MSCS. Otherwise, there’s no particular reason not to use NMP Round Robin, with the additional exception of the note below (your iSCSI array requires the use of ALUA, and for one reason or another you cannot change that).
Question 2: “If I’m using an Active/Passive array – do I need to use ALUA?”
Answer: There is another important consideration if you are using an array that has an “Active/Passive” LUN ownership model when using iSCSI. With these arrays, Round-Robin can result in path thrashing (where a storage target bounces between storage processors in a race condition with vSphere) if the storage target is not properly configured.
Of the vendors on this list, EMC CLARiiON and NetApp traditionally are associated with an “Active/Passive” LUN ownership model – but it’s important to note that the NetApp iSCSI target operates in this regard more like the EMC Celerra iSCSI target and is “Active/Active” (fails over with the whole “brain” when the cluster itself fails over rather than the LUN transiting from one “brain” to another; the “brain” in Celerra-land is called a Datamover, and in NetApp-land a cluster controller).
Conversely, the CLARiiON iSCSI target LUN operates the same as a CLARiiON FC target LUN – and trespasses from one storage processor to another. So – ALUA configuration is important for CLARiiON with iSCSI or Fibre Channel/FCoE connected hosts, and for NetApp when using Fibre Channel/FCoE connectivity (beyond the scope of this post). So – if you’re not a CLARiiON iSCSI customer, or a customer using CLARiiON or NetApp with Fibre Channel/FCoE (since this multipathing section also applies to FC/FCoE), you can skip to the third Round Robin question.
An Active/Passive LUN ownership model, in VMware lingo, doesn’t mean that one storage processor (or “brain” of the array) is idle – rather that a LUN is “owned” by (basically “behind”) one of the two storage processors at any given moment. If using EMC CLARiiON CX4 (or a NetApp array with Fibre Channel) and vSphere, the LUNs should be configured for Asymmetric Logical Unit Access (ALUA). When ALUA is configured, the ports on the “non-owning storage processor” show up in the vSphere client as “active” rather than “standby”. They will not be used for I/O in a normal state, though, as the ports on the “non-owning storage processor” are “non-optimized” paths (there is a slower, more convoluted path for I/O via those ports). This is shown in the diagram below.
On each platform – configuring ALUA entails something specific.
On an EMC CLARiiON array when coupled with vSphere (which implements ALUA support specifically with SCSI-3 commands, not SCSI-2), you need to be running the latest FLARE 28 version (specifically 04.28.000.5.704 or later). This in turn currently implies CX4 only, not CX3, and not AX. You then need to run the Failover Wizard and configure the hosts in the vSphere cluster to use failover mode 4 (ALUA mode). This is covered in the CLARiiON/VMware Applied Tech guide (the CLARiiON/vSphere bible) here, and is also discussed on this post here.
Question 3: “I’ve configured Round Robin – but the paths aren’t evenly used”
Answer: The Round Robin policy doesn’t issue I/Os in a simple “round robin” between paths in the way many expect. By default the Round Robin PSP sends 1000 commands down each path before moving to the next path; this is called the IO Operation Limit. In some configurations, this default configuration doesn't demonstrate much path aggregation because quite often some of the thousand commands will have completed before the last command is sent. That means the paths aren't full (even though the queue at the storage array might be). When using 1Gbit iSCSI, the physical path is quite often the limiting factor on throughput, and making use of multiple paths at the same time shows better throughput.
You can reduce the number of commands issued down a particular path before moving on to the next path all the way to 1, thus ensuring that each subsequent command is sent down a different path. In a Dell/EqualLogic configuration, Eric has recommended a value of 3.
You can make this change by using this command:
esxcli --server nmp roundrobin setconfig --device --iops --type iops
Note that cutting down the number of iops does present some potential problems. With some storage arrays caching is done per path. By spreading the requests across multiple paths, you are defeating any caching optimization at the storage end and could end up hurting your performance. Luckily, most modern storage systems don't cache per port. There's still a minor path-switch penalty in ESX, so switching this often probably represents a little more CPU overhead on the host.
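To make the effect of the IO Operation Limit concrete, here is a toy Python model. It is not how the Round Robin PSP is implemented; it only assumes that commands are numbered in issue order and that the policy moves to the next path after every `iops_limit` commands. The burst of 32 commands and the two paths are hypothetical values chosen for illustration.

```python
from collections import Counter


def path_for_command(index, num_paths, iops_limit):
    """Path chosen for the index-th command when the policy switches
    to the next path after every iops_limit commands."""
    return (index // iops_limit) % num_paths


def commands_per_path(num_commands, num_paths, iops_limit):
    """Count how many of the first num_commands land on each path."""
    return Counter(
        path_for_command(i, num_paths, iops_limit) for i in range(num_commands)
    )


# Hypothetical burst: 32 commands issued back to back over 2 paths.
for limit in (1000, 3, 1):
    spread = commands_per_path(32, 2, limit)
    print(f"IO Operation Limit {limit:>4}: {dict(spread)}")
```

In this model a burst shorter than the limit never leaves the first path, which is the behavior described in the answer above: with the default of 1000 the second path sits idle, while small values spread the same burst across both paths.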
That’s it!
If you go through these steps, you will see a screen that looks like this one. Notice that Round Robin is the Path Selection configuration, and the multiple paths to the LUN are both noted as “Active (I/O)”. With an ALUA-configured CLARiiON, the paths to the “non-owning” storage processor ports will show as “Active” – meaning they are active, but not being used for I/O.
This means you’re driving traffic down multiple vmknics (and under the vSphere client performance tab, you will see multiple vmknics chugging away, and if you look at your array performance metrics, you will be driving traffic down multiple target ports).
Now, there are a couple of other important notes – so let’s keep reading :-)
Third topic: Routing Setup
With iSCSI Multipathing via MPIO, the vmkernel routing table is bypassed in determining which outbound port to use from ESX. As a result of this, VMware officially says that routing is not possible in iSCSI SANs using iSCSI Multipathing. Further – routing iSCSI traffic via a gateway is generally a bad idea. This will introduce unnecessary latency – so this is being noted only academically. We all agree on this point – DO NOT ROUTE iSCSI TRAFFIC.
But, for academic thoroughness, you can provide minimal routing support in vSphere because a route look-up is done when selecting the vmknic for sending traffic. If your iSCSI storage network is on a different subnet AND your iSCSI Multipathing vmkNICs are on the same subnet as the gateway to that network, routing to the storage works. For example look at this configuration:
- on the vSphere ESX/ESXi server:
  - vmk0 10.0.0.3/24 General purpose vmkNIC
  - vmk1 10.1.0.14/24 iSCSI vmkNIC
  - vmk2 10.1.0.15/24 iSCSI vmkNIC
  - Default route: 10.0.0.1
- on the iSCSI array:
  - iSCSI Storage port 1: 10.2.0.8/24
  - iSCSI Storage port 2: 10.2.0.9/24
In this situation, vmk1 and vmk2 are unable to communicate with the two storage ports because the only route to the storage is accessible through vmk0, which is not set up for iSCSI use. If you add the route:
Destination: 10.2.0.0/24 Gateway: 10.1.0.1 (and have a router at the gateway address)
then vmk1 and vmk2 are able to communicate with the storage without interfering with other vmkernel routing setup.
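The reasoning in this example can be checked with a short Python sketch using the standard `ipaddress` module. This is a simplified model, not the vmkernel's actual route lookup: it assumes a longest-prefix match over the route list and treats a vmkNIC as usable only if the storage address, or the chosen gateway, is on that vmkNIC's own subnet. The function names are hypothetical; the addresses are the ones from the example above.

```python
import ipaddress

# Interfaces from the example above.
VMKNICS = {
    "vmk0": ipaddress.ip_interface("10.0.0.3/24"),   # general purpose
    "vmk1": ipaddress.ip_interface("10.1.0.14/24"),  # iSCSI
    "vmk2": ipaddress.ip_interface("10.1.0.15/24"),  # iSCSI
}

DEFAULT_ROUTE = (ipaddress.ip_network("0.0.0.0/0"), ipaddress.ip_address("10.0.0.1"))
STORAGE_ROUTE = (ipaddress.ip_network("10.2.0.0/24"), ipaddress.ip_address("10.1.0.1"))


def vmknics_reaching(dest, routes):
    """Return the vmkNICs that can reach dest, directly or via the
    gateway of the most specific matching route."""
    dest_ip = ipaddress.ip_address(dest)
    direct = [name for name, iface in VMKNICS.items() if dest_ip in iface.network]
    if direct:
        return direct
    matching = [(net, gw) for net, gw in routes if dest_ip in net]
    if not matching:
        return []
    _, gateway = max(matching, key=lambda route: route[0].prefixlen)
    return [name for name, iface in VMKNICS.items() if gateway in iface.network]


print(vmknics_reaching("10.2.0.8", [DEFAULT_ROUTE]))
print(vmknics_reaching("10.2.0.8", [DEFAULT_ROUTE, STORAGE_ROUTE]))
```

With only the default route, the storage port is reachable through vmk0 alone; once the 10.2.0.0/24 route via 10.1.0.1 is added, the lookup lands on vmk1 and vmk2 instead.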
Fourth topic: vSwitch setup
There are no best practices for whether vmkNICs should be on the same or different vswitches for iSCSI Multipathing. Provided the vmkNIC only has a single active uplink, it doesn't matter if there are other iSCSI vmkNICs on the same switch or not.
Configuration of the rest of your system should help you decide the best vswitch configuration. For example, if the system is a blade with only two NICs that share all iSCSI and general-purpose traffic, it makes best sense for both uplinks to be on the same vswitch (to handle teaming policy for the general, non-iSCSI traffic). Other configurations might be best configured with separate vswitches.
Either configuration works.
Fifth topic: Jumbo frames
Jumbo frames are supported for iSCSI in vSphere 4. There was confusion about whether or not they were supported with ESX 3.5 – the answer is no, they are not supported for vmkernel traffic (but are supported for virtual machine traffic).
Jumbo frames simply means that the size of the largest Ethernet frame passed between one host and another on the Ethernet network is larger than the default. By default, the "Maximum Transmission Unit" (MTU) for Ethernet is 1500 bytes. Jumbo frames are often set to 9000 bytes, the maximum available for a variety of Ethernet equipment.
The idea is that larger frames represent less overhead on the wire and less processing on each end to segment and then reconstruct Ethernet frames into the TCP/IP packets used by iSCSI. Note that recent Ethernet enhancements TSO (TCP Segmentation Offload) and LRO (Large Receive Offload) lessen the need to save host CPU cycles, but jumbo frames are still often configured to extract any last benefit possible.
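As a rough, back-of-the-envelope illustration of the "less overhead on the wire" point, the Python sketch below compares MTU 1500 with MTU 9000. The assumptions are mine rather than measurements: standard Ethernet framing (preamble, header, FCS and inter-frame gap, no VLAN tag), IPv4 and TCP headers without options, and iSCSI PDU headers ignored.

```python
import math

# Per-frame bytes on the wire that carry no payload (assumes no VLAN tag):
# preamble + start delimiter (8), Ethernet header (14), FCS (4), inter-frame gap (12).
ETHERNET_WIRE_OVERHEAD = 8 + 14 + 4 + 12
# IPv4 header (20) + TCP header (20), assuming no options.
IP_TCP_HEADERS = 20 + 20


def tcp_payload_per_frame(mtu):
    """TCP payload bytes carried by one full-sized frame."""
    return mtu - IP_TCP_HEADERS


def wire_efficiency(mtu):
    """Fraction of bytes on the wire that are TCP payload."""
    return tcp_payload_per_frame(mtu) / (mtu + ETHERNET_WIRE_OVERHEAD)


def frames_needed(num_bytes, mtu):
    """Frames required to carry num_bytes of TCP payload."""
    return math.ceil(num_bytes / tcp_payload_per_frame(mtu))


for mtu in (1500, 9000):
    print(
        f"MTU {mtu}: efficiency {wire_efficiency(mtu):.1%}, "
        f"{frames_needed(1024 * 1024, mtu)} frames per MiB"
    )
```

Under these assumptions the gain in wire efficiency is a few percentage points, while the number of frames each end has to process drops by roughly a factor of six, which is where most of the CPU saving comes from when TSO and LRO are not doing that work already.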
Note that jumbo frames must be configured end-to-end to be useful. This means the storage, Ethernet switches, routers and host NIC all must be capable of supporting jumbo frames – and Jumbo frames must be correctly configured end-to-end on the network. If you miss a single Ethernet device, you will get a significant number of Ethernet layer errors (which are essentially fragmented Ethernet frames that aren’t correctly reassembled).
Inside ESX, jumbo frames must be configured on the physical NICs, on the vswitch and on the vmkNICs used by iSCSI. The physical uplinks and vswitch are set by configuring the MTU of the vswitch. Once this is set, any physical NICs that are capable of passing jumbo frames are also configured. For iSCSI, the vmkNICs must also be configured to pass jumbo frames.
Unfortunately, the vSwitch and the vmkNICs must be added (or, if already existing, removed and re-created) from the command line to provide jumbo frame support. Note that this will disconnect any active iSCSI connections, so it should be done as a maintenance operation while the VMs residing on the Datastores/RDMs are running on other ESX hosts. (I know this sounds like an “of course”, but it’s a good warning.)
Below is an example:
# esxcfg-vmknic --server -l|cut -c 1-161
Interface Port Group/DVPort IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type
vmk1 iSCSI2 IPv4 10.11.246.51 255.255.255.0 10.11.246.255 00:50:56:7b:00:08 1500 65535 true STAT
vmk0 iSCSI1 IPv4 10.11.246.50 255.255.255.0 10.11.246.255 00:50:56:7c:11:fd 9000 65535 true STAT
# esxcfg-vmknic --server -d iSCSI2
# esxcfg-vmknic --server -a -i 10.11.246.51 -n 255.255.255.0 -m 9000 iSCSI2
# esxcfg-vmknic --server -l|cut -c 1-161
Interface Port Group/DVPort IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type
vmk0 iSCSI1 IPv4 10.11.246.50 255.255.255.0 10.11.246.255 00:50:56:7c:11:fd 9000 65535 true STAT
vmk1 iSCSI2 IPv4 10.11.246.51 255.255.255.0 10.11.246.255 00:50:56:7b:00:08 9000 65535 true STAT
If the vmkNICs are already set up as iSCSI Multipath vmkNICs, you must remove them from the iSCSI configuration before deleting them and re-adding them with the changed MTU.
Sixth topic: Delayed ACK
Delayed ACK is a TCP/IP method of allowing segment acknowledgements to piggyback on each other or other data passed over a connection with the goal of reducing IO overhead.
If your storage system is capable of supporting delayed ACK, verify with your vendor if delayed ACK should be enabled.
Seventh topic: other configuration recommendations:
Most of the original multivendor iSCSI post “general recommendations” are as true as ever. When setting up the Ethernet network for iSCSI (or NFS datastores) use – don’t think of it as “it’s just on my LAN”, but rather “this is the storage infrastructure that is supporting my entire critical VMware infrastructure”. IP-based storage needs the same sort of design thinking traditionally applied to FC infrastructure – and when you do, it can have the same availability envelope as traditional FC SANs. Here are some things to think about:
- Are you separating your storage and network traffic on different ports? Could you use VLANs for this? Sure. But is that “bet the business” thinking? It’s defensible if you have a blade, and a limited number of high bandwidth interfaces, but think it through… do you want a temporarily busy LAN to swamp your storage (and vice-versa) for the sake of a few NICs and switch ports? So if you do use VLANs, make sure you are thorough and implement QoS mechanisms. If you’re using 10GbE using VLANs can make a lot of sense and cut down on your network interfaces, cables, and ports, sure – but GbE – not so much.
- Think about Flow-Control (should be set to receive on switches and transmit on iSCSI targets)
- Either disable spanning tree protocol (only on the most basic iSCSI networks) – or enable it only with either RSTP or portfast enabled. Alternatively, if you share the network switches with the LAN, you can filter / restrict bridge protocol data units on storage network ports.
- If at all possible, use Cat6a cables rather than Cat5e (and don’t use Cat5). Yes, Cat5e can work – but remember – this is “bet the business”, right? Are you sure you don’t want to buy that $10 cable?
- Things like cross-stack Etherchannel trunking can be handy in some configurations where iSCSI is used in conjunction with NFS (see the “Multivendor NFS post” here)
- Each Ethernet switch also varies in its internal architecture – for mission-critical, network intensive Ethernet purposes (like VMware datastores on iSCSI or NFS), amount of port buffers, and other internals matter – it’s a good idea to know what you are using.
In closing.....
We would suggest that anyone considering iSCSI with vSphere should feel very confident that their deployments can provide high performance and high availability. You would be joining many, many customers enjoying the benefits of VMware and advanced storage that leverages Ethernet.
With the new iSCSI initiator, the enablement of multiple TCP sessions per target, and the multipathing enhancements in vSphere ESX 4 it is possible to have highly available and high-performing storage using your existing Ethernet infrastructure. The need for some of the workarounds discussed here for ESX 3.5 can now be parked in the past.
To make your deployment a success, understand the topics discussed in this post, but most of all ensure that you follow the best practices of your storage vendor and VMware.