Wednesday, 25 April 2012

Why is the device status on ESX shown "degraded"?


Today I would like to look at one specific field in the output of "esxcli storage core device list": the "Status" of the device.

~ # esxcli storage core device list
naa.90090160c1e01de150060160c1e01de1
   Display Name: DGC iSCSI Disk (naa.90090160c1e01de150060160c1e01de1)
   Has Settable Display Name: true
   Size: 0
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.90090160c1e01de150060160c1e01de1
   Vendor: EFG
   Model: ABD
   Revision: 0326
   SCSI Level: 4
   Is Pseudo: true
   Status: degraded
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
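To pull the Status field out of output like the above on many devices at once, the text can be parsed with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes the layout shown above (an unindented device ID followed by indented "Key: value" lines) and that only the command output, without the shell prompt line, is passed in.

```python
def parse_device_list(output):
    """Parse `esxcli storage core device list` text into {device_id: {field: value}}."""
    devices = {}
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new device record.
            current = line.strip()
            devices[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only, so values such as paths stay intact.
            key, _, value = line.strip().partition(":")
            devices[current][key.strip()] = value.strip()
    return devices


def devices_not_on(output):
    """Return {device_id: status} for every device whose Status is not "on"."""
    return {
        dev: fields.get("Status")
        for dev, fields in parse_device_list(output).items()
        if fields.get("Status") != "on"
    }


SAMPLE = """\
naa.90090160c1e01de150060160c1e01de1
   Display Name: DGC iSCSI Disk (naa.90090160c1e01de150060160c1e01de1)
   Size: 0
   Status: degraded
   Is Local: false
"""

print(devices_not_on(SAMPLE))
```

A device with no Status line at all is reported with a value of None rather than being silently dropped.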

The Status field is reserved for indicating the path status of the device. When the device has more than one path to the target (the storage array), the device status is "on". When all the paths to the target are down (either off or dead), the device status is "dead". If there is only one path to the target, the status is "degraded".

Also note that when the device is in PDL (Permanent Device Loss, a scenario that arises when a device is unmapped from the storage array while ESX is still using it), the status of the device is "not connected".

If ESX fails to recognize the state of the device (that is, none of the scenarios above applies), the device status is "unknown".
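The rules above can be summarized as a small decision function. This is a toy model of the description in this post, not VMware's actual implementation: the function name, the path-state strings ("active", "off", "dead"), and the reading of "only one path" as "exactly one working path" are all assumptions made for illustration.

```python
def device_status(path_states, pdl=False):
    """Map a device's path states to the Status string described in the post.

    path_states: list of per-path state strings. "active" counts as a working
    path; "off" and "dead" count as down. These names are assumptions.
    pdl: True when the device is in Permanent Device Loss.
    """
    if pdl:
        return "not connected"
    known = {"active", "off", "dead"}
    if not path_states or any(state not in known for state in path_states):
        # None of the described scenarios applies.
        return "unknown"
    working = sum(1 for state in path_states if state == "active")
    if working == 0:
        return "dead"
    if working == 1:
        return "degraded"
    return "on"


print(device_status(["active", "dead"]))
```

With one working path and one dead path the sketch reports "degraded", which matches the sample output at the top of this post.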

Permanent Device Loss - Planned/Unplanned



Planned versus unplanned PDL

A planned PDL occurs when there is an intent to remove a device presented to the ESXi host. The datastore must first be unmounted, and the device must then be detached, before the storage device is unpresented at the storage array.
1.  esxcli storage filesystem list --> get the VMFS datastore name
2.  esxcli storage filesystem unmount -l Data_Store_Name --> unmount the datastore
3.  esxcli storage core device set --state=off -d naa.xxxx --> set the device state to off (detach) so the device can be removed safely
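If the planned removal is scripted, the order of the three steps is the important part. Here is a minimal sketch that only builds the command lines in that order; the function name is hypothetical, and the datastore label and device ID are the same placeholders used in the steps above. Running the commands (and checking each result before moving on) is left to the caller.

```python
def planned_removal_commands(datastore_label, device_id):
    """Return the planned-PDL commands, in order, as argv lists."""
    return [
        # 1. List filesystems to confirm the VMFS datastore name.
        ["esxcli", "storage", "filesystem", "list"],
        # 2. Unmount the datastore by its label.
        ["esxcli", "storage", "filesystem", "unmount", "-l", datastore_label],
        # 3. Detach the device by setting its state to off.
        ["esxcli", "storage", "core", "device", "set", "--state=off", "-d", device_id],
    ]


for argv in planned_removal_commands("Data_Store_Name", "naa.xxxx"):
    print(" ".join(argv))
```

Only after all three have succeeded should the LUN be unpresented at the storage array.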

An unplanned PDL occurs when the storage device is unexpectedly unmapped from the storage array without the unmount and detach being executed on the ESXi host. To clean up after an unplanned PDL:
  1. Power off all running virtual machines on the affected datastore and unregister them. An easy way to power them off is to log in to the ESXi host and kill the vmx process of each VM.
  2. From the vSphere Client, go to the Configuration tab of the ESXi host and click Storage.
  3. Right-click the datastore being removed and choose Unmount.

    A Confirm Datastore Unmount window is displayed. When the prerequisite criteria have been passed, the OK button appears.
  4. Perform a rescan on all of the ESXi hosts that had visibility to the LUN.
  5. After the device is removed, the paths to it still show as "dead". To remove the dead paths, rescan with "esxcfg-rescan -d" or "esxcli storage core adapter rescan --adapter vmhbaxxx".

    Note: If there are active references to the device or pending I/O, the ESXi host still lists the device after the rescan.

Use of the -b (--boot) option in esxcli storage nmp satp rule remove


The -b (boot) option is used to remove system SATP rules (the default SATP rules) so that user-defined rules can be added in their place.
A user cannot add a SATP rule with the same vendor/model as a default SATP rule, because ESXi does not allow duplicate rules. So a user who still wants to add a rule with the same vendor/model plus extra options has to delete the system SATP rule first.

~ # esxcli storage nmp satp rule remove
Error: Missing required parameter -s|--satp

Usage: esxcli storage nmp satp rule remove [cmd options]

Description:
  remove                Delete a rule from the list of claim rules for the given SATP.

Cmd options:
  -b|--boot             This is a system default rule added at boot time. Do not modify esx.conf or add to host profile.
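The behaviour described above (duplicates rejected, system rule removed with -b before a custom rule can be added) can be illustrated with a toy model. This is not how ESXi stores its rules: the class, its methods, the requirement that boot=True be passed to remove a system rule, and the option string "my_option" are all assumptions made to mirror the description in this post.

```python
class DuplicateRule(Exception):
    """Raised when a rule with the same vendor/model already exists."""


class SatpRuleTable:
    """Toy rule table keyed by (vendor, model)."""

    def __init__(self):
        self.rules = {}

    def add(self, satp, vendor, model, options="", boot=False):
        key = (vendor, model)
        if key in self.rules:
            raise DuplicateRule("rule for %s/%s already exists" % key)
        self.rules[key] = {"satp": satp, "options": options, "boot": boot}

    def remove(self, satp, vendor, model, boot=False):
        key = (vendor, model)
        rule = self.rules.get(key)
        if rule is None or rule["satp"] != satp:
            raise KeyError(key)
        if rule["boot"] and not boot:
            # Models the need for -b when the rule is a system default.
            raise PermissionError("system default rule; remove it with boot=True (-b)")
        del self.rules[key]


table = SatpRuleTable()
table.add("VMW_SATP_CX", "EFG", "ABD", boot=True)       # system default rule
try:
    table.add("VMW_SATP_CX", "EFG", "ABD", options="my_option")
except DuplicateRule as err:
    print("rejected:", err)
table.remove("VMW_SATP_CX", "EFG", "ABD", boot=True)    # like "rule remove -b"
table.add("VMW_SATP_CX", "EFG", "ABD", options="my_option")
print(table.rules[("EFG", "ABD")]["options"])
```

The second add only succeeds once the system rule for the same vendor/model is gone, which is the situation the -b option exists for.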

How to find supported esxcli commands in ESXi 5.0

Get all esxcli commands supported in ESXi 5.0

~ # localcli esxcli command list
Namespace                                                   Command
---------------------------------------------------------------
storage.core.device                                        setconfig
storage.core.device.smart                               get
storage.core.device.stats                                get
storage.core.plugin.registration                        remove
storage.nfs                                                     list
storage.nmp.psp.generic.deviceconfig             set
storage.nmp.psp                                             list
storage.nmp.psp.roundrobin.deviceconfig        get
storage.nmp.psp.roundrobin.deviceconfig        set
storage.nmp.satp.generic.deviceconfig             get
storage.nmp.satp.generic.deviceconfig             set
storage.nmp.satp.generic.pathconfig                get
storage.nmp.satp.generic.pathconfig                set
storage.nmp.satp                                            list

....etc....
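Each row of this listing maps to a command line: the dots in the namespace become spaces and the command name goes last, so "storage.nfs" plus "list" is "esxcli storage nfs list". A small sketch of that conversion follows; the function name is made up, and the parser assumes the two-column layout shown above, skipping the header, the separator, and anything else that does not look like a namespace/command pair.

```python
import re

_NAMESPACE = re.compile(r"^[a-z0-9_]+(\.[a-z0-9_]+)*$")
_COMMAND = re.compile(r"^[a-z0-9_]+$")


def to_esxcli_commands(listing):
    """Turn 'namespace  command' rows into full esxcli command strings."""
    commands = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # blank lines, separators, elision markers
        namespace, command = parts
        if _NAMESPACE.match(namespace) and _COMMAND.match(command):
            commands.append("esxcli " + namespace.replace(".", " ") + " " + command)
    return commands


SAMPLE = """\
Namespace                                                   Command
---------------------------------------------------------------
storage.core.device                                        setconfig
storage.nfs                                                     list
storage.nmp.psp.roundrobin.deviceconfig        get
"""

for cmd in to_esxcli_commands(SAMPLE):
    print(cmd)
```

The header row is rejected because of its capital letters, so no special-casing is needed for it.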

Friday, 20 May 2011

Unclaiming a device from ESX.


The need to unclaim an ESX device usually arises when you want to change the plugin claiming the device or the paths to the device. For example, if you want to mask a device, you first need to add the claimrules and then unclaim the device, so that the claimrules currently acting on it no longer apply.

Note that path-, adapter-, and plugin-based unclaims (and so on) succeed only when the device is free. In other words, the device must not be actively servicing I/O. If VMs are powered on, or I/O is being issued to an RDM disk, the command is bound to fail. Unclaim often fails on local disks, because a scratch partition or dump partition may be configured on them.

There are different ways to unclaim a device.
You can unclaim on a per-device basis as follows:
~ # esxcli corestorage claiming unclaim  -t  device --device naa.6009999999999284000064c349cc3cd9

Devices can also be unclaimed on the basis of the device vendor name.
~ # esxcli corestorage claiming unclaim  -t  vendor --vendor IBM

In ESX you can also unclaim based on path.
~ # esxcli corestorage claiming unclaim  -t  path --path vmhba2:C0:T0:L111

Less popular variants are:
Driver-based unclaim.
~ # esxcli corestorage claiming unclaim  -t  driver --driver qla2xxx

Plugin-based unclaim.
There is also a provision to unclaim devices on the basis of plugin name.
~ # esxcli corestorage claiming unclaim  -t  plugin --plugin MASK_PATH

If all these variants are hard to remember, you can try to unclaim all the devices in ESX.
ESX will try to unclaim all non-busy devices. Please note that this command returns device-busy messages in most cases, because it also tries to unclaim the local disk, where swap, dump, and scratch partitions may be configured.
~ # esxcli corestorage claiming unclaim  -t location
Errors:
Unable to perform unclaim.  Error message was : Unable to unclaim paths.  Busy or in use devices detected.  See VMkernel logs for more information.

After unclaiming, do not forget to load and run the new claimrules. The load and run operations read the /etc/vmware/esx.conf file and apply the claimrules to the unclaimed devices.
~ # esxcli corestorage claimrule load
~ # esxcli corestorage claimrule run
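The unclaim variants in this post differ only in the type passed to -t and the matching value flag, and each is followed by the same load and run. Below is a minimal sketch that builds the three command lines for any of the variants; the function name is hypothetical and the flag table is taken from the examples above, nothing more.

```python
# Value flag for each unclaim type used in this post; "location" takes no value.
_VALUE_FLAG = {
    "device": "--device",
    "vendor": "--vendor",
    "path": "--path",
    "driver": "--driver",
    "plugin": "--plugin",
}


def unclaim_sequence(kind, value=None):
    """Return [unclaim, claimrule load, claimrule run] as argv lists."""
    unclaim = ["esxcli", "corestorage", "claiming", "unclaim", "-t", kind]
    if kind == "location":
        if value is not None:
            raise ValueError("location unclaim takes no value in this sketch")
    elif kind in _VALUE_FLAG:
        if not value:
            raise ValueError("a value is required for -t " + kind)
        unclaim += [_VALUE_FLAG[kind], value]
    else:
        raise ValueError("unknown unclaim type: " + kind)
    return [
        unclaim,
        ["esxcli", "corestorage", "claimrule", "load"],
        ["esxcli", "corestorage", "claimrule", "run"],
    ]


for argv in unclaim_sequence("path", "vmhba2:C0:T0:L111"):
    print(" ".join(argv))
```

The sketch does not check whether the device is busy; as noted above, the unclaim itself fails in that case and the VMkernel logs have the details.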

Saturday, 30 April 2011

Interesting storage startups

While VMware has captured a large share of the server virtualization market, there appear to be gaps in the I/O and storage virtualization areas. With more and more companies emerging in the storage virtualization space, it's time to look at the innovations that could define the future of storage. To get a glimpse of storage virtualization, let's look at some startups and their innovations.

StorSimple develops hybrid cloud storage solutions for Windows and VMware infrastructure.
StorSimple has developed application-optimized hybrid cloud storage appliances for businesses and organizations that want to integrate the cloud securely and transparently into their existing on-premises applications. The StorSimple 5010 and 7010 appliances have recently achieved VMware Ready™ status, passing rigorous VMware testing and interoperability criteria for use in production environments, and are now listed on the VMware Partner Product Catalog.
See what StorSimple has to offer here.

Xsigo is the technology leader in data center I/O virtualization, helping organizations reduce costs and improve business agility. Xsigo's I/O Director consolidates server connectivity with a solution that provides unprecedented management simplicity and interoperability with open standards. Xsigo was actively used in VMware's VMworld demos to support I/O for thousands of workloads. Xsigo helps enterprises scale their virtual infrastructure dynamically. It also reduces maintenance cost by reducing the number of failure points.

Fusion-io has been in the enterprise market for quite some time now. It mainly deals with flash-based PCIe cards. Enterprise server vendors like IBM, HP, and Dell are already shipping Fusion-io based servers. All the latest updates on Fusion-io products are available here.

Apris aims to maximize application performance and minimize infrastructure costs in a data center by addressing I/O bottlenecks at the server and the storage array. The company offers a simple approach to provisioning and managing I/O resources by enabling the PCI Express protocol to traverse the Ethernet data center fabric. You can find more information here.

VDI workloads are still a bottleneck in the desktop user experience. People see a lot of performance degradation with desktop boot storms. IO-Turbine's main focus is to build innovation around VDI to reduce I/O bottlenecks and application latency.

TINTRI:
A company that offers storage solutions for storing VMs. It is a VM-centric storage appliance built for VMware. It also has inline de-dupe and compression, and it actively uses SSDs for a performance boost.
More info about Tintri.

Sunday, 24 April 2011

Different ways of configuring PSP RR for your ESX devices.

PSP RR is one of the best PSPs (Path Selection Plugins) if you want to leverage multipathing in your SAN environment. PSP_RR can help you gain higher throughput by scheduling I/O through multiple paths.

I will not discuss performance benefits here, as there is already a blog post on that. I would like to share information on the different ways of configuring PSP RR for your SAN devices.

Currently there are three ways of configuring PSP RR in ESX 4.1.
1. Change the default PSP of the SATP claiming your SAN device.  
2. Add a new SATP claimrule with PSP RR for device vendor and model.
3. Add a SATP claimrule with PSP RR with your device name.  

1. Change the default PSP of the SATP claiming your SAN device.
~ #  esxcli nmp satp setdefaultpsp  --satp=VMW_SATP_CX  --psp=VMW_PSP_RR                     

This is the easiest way of configuring PSP_RR for your devices. By changing the default PSP of the SATP claiming your devices, you configure all the devices claimed by that SATP to use PSP_RR. But this has a side effect: there might be devices from a different array vendor claimed by the same SATP, which may lead to unexpected performance problems. You should use this option only when you know what devices will be connected to your ESX host. This method is also documented in a VMware KB article.
Cleanup:
Run the same command with the default PSP name for the particular SATP.
~ # esxcli nmp satp setdefaultpsp --satp=VMW_SATP_CX --psp=VMW_PSP_MRU

2. Add a new SATP claimrule with PSP_RR for device vendor and model.
~ # esxcli nmp satp addrule  --satp=VMW_SATP_CX --psp=VMW_PSP_RR --vendor=myVendor  --model=mymodel
This is one of the best options, as it lets the user select a PSP for a specific target. Let's say you have two arrays, Array1 and Array2, both claimed by the same SATP. If you want the devices from Array1 to be configured with PSP RR without affecting the devices from Array2, this is the right option. It is one way to mass-configure devices with PSP RR for a specific target. It does not change the default PSP for the SATP, but inserts a new SATP rule into the SATP rule list for the target with the specific vendor and model name.
Cleanup can be done using:
 ~ # esxcli nmp satp deleterule  --satp=VMW_SATP_CX --psp=VMW_PSP_RR --vendor=myVendor  --model=mymodel 

3. Add a SATP-PSP claimrule with your device name.
When you have configured MSCS on your ESX host and are using some of the LUNs for the MSCS cluster, the above two options are not the right ones. The reason is the SCSI-3 reservations used by MSCS. There is a VMware KB article on this. When you want to configure only a few specific devices with PSP RR, you can run
~ # esxcli nmp satp addrule  --satp=VMW_SATP_CX --psp=VMW_PSP_RR --device=naa.600a0b8000479284000004f04c8ddfa5
For cleanup:
~ # esxcli nmp satp deleterule  --satp=VMW_SATP_CX --psp=VMW_PSP_RR --device=naa.600a0b8000479284000004f04c8ddfa5
A few things to note:
For the SATP rules to take effect, unclaim the devices, then load and run the claimrules.
~ # esxcli corestorage claiming unclaim -t location
~ # esxcli corestorage claimrule load
~ # esxcli corestorage claimrule run
Newly added rules will be visible in the satp rule list.
~ # esxcli nmp satp listrules         
                                                                                                                      
The rules are permanently added to the /etc/vmware/esx.conf file, so the changes persist across reboots. To undo the changes, use the esxcli commands mentioned above.

Without host profiles, you can use the GUI to configure PSP RR on a per-device basis only [Option 3].
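To keep the three options and their cleanup commands straight, here is a small sketch that builds the command line for each. The helper names are hypothetical; the commands and flags are only the ones shown in this post (ESX 4.1 syntax), and the sketch does not run anything.

```python
BASE = ["esxcli", "nmp", "satp"]


def set_default_psp(satp, psp="VMW_PSP_RR"):
    """Option 1: change the default PSP of a SATP (pass the old PSP to undo)."""
    return BASE + ["setdefaultpsp", "--satp=" + satp, "--psp=" + psp]


def rr_rule(satp, vendor=None, model=None, device=None, delete=False):
    """Options 2 and 3: add (or delete) a SATP rule that selects VMW_PSP_RR.

    Give either vendor and model (option 2) or device (option 3), not both.
    """
    verb = "deleterule" if delete else "addrule"
    argv = BASE + [verb, "--satp=" + satp, "--psp=VMW_PSP_RR"]
    if device is not None:
        if vendor or model:
            raise ValueError("use either device or vendor/model, not both")
        argv.append("--device=" + device)
    elif vendor and model:
        argv += ["--vendor=" + vendor, "--model=" + model]
    else:
        raise ValueError("need vendor and model, or device")
    return argv


print(" ".join(set_default_psp("VMW_SATP_CX")))
print(" ".join(rr_rule("VMW_SATP_CX", vendor="myVendor", model="mymodel")))
print(" ".join(rr_rule("VMW_SATP_CX", device="naa.600a0b8000479284000004f04c8ddfa5")))
```

Passing delete=True to rr_rule gives the matching cleanup command, and set_default_psp("VMW_SATP_CX", "VMW_PSP_MRU") gives the cleanup for option 1 as shown earlier.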