Wednesday 25 April 2012

Why is the device status on ESX shown as "degraded"?


Today I would like to touch upon a specific field in the output of "esxcli storage core device list": the "Status" of the device.

~ # esxcli storage core device list
naa.90090160c1e01de150060160c1e01de1
   Display Name: DGC iSCSI Disk (naa.90090160c1e01de150060160c1e01de1)
   Has Settable Display Name: true
   Size: 0
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.90090160c1e01de150060160c1e01de1
   Vendor: EFG
   Model: ABD
   Revision: 0326
   SCSI Level: 4
   Is Pseudo: true
   Status: degraded
   Is RDM Capable: true
   Is Local: false
   Is Removable: false

This field is reserved specifically for indicating the path status of the device. When the device has more than one path to the target (storage array), the device status is "on". When all the paths to the target are down (either off or dead), the device status is "dead". If there is only one path to the target, the status is "degraded".
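To check this field across many devices, the command output can be parsed into a dictionary. The following is a minimal Python sketch, assuming the text passed in is the raw command output without the shell prompt line; the function name parse_device_list and the shortened sample are illustrative and not part of any VMware tool.

```python
def parse_device_list(text):
    """Parse `esxcli storage core device list` output into {device_id: {field: value}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate device records
        if not line[0].isspace():
            # An unindented line starts a new record and holds the device identifier.
            current = line.strip()
            devices[current] = {}
        elif current is not None and ":" in line:
            # Indented "Key: value" line; split on the first colon only.
            key, _, value = line.strip().partition(":")
            devices[current][key.strip()] = value.strip()
    return devices


# Shortened sample taken from the listing above.
SAMPLE_OUTPUT = """naa.90090160c1e01de150060160c1e01de1
   Display Name: DGC iSCSI Disk (naa.90090160c1e01de150060160c1e01de1)
   Multipath Plugin: NMP
   Status: degraded
   Is Local: false
"""

for device_id, fields in parse_device_list(SAMPLE_OUTPUT).items():
    print(device_id, fields.get("Status"))
```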

Also note that when the device is in PDL (Permanent Device Loss; this scenario arises when a device is unmapped from the storage array while ESX was using it), the status of the device is "not connected".

If ESX fails to recognize the state of the device (none of the scenarios above apply), the device status is "unknown".
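The rules above can be summarised as a small decision function. This is only a sketch of the behaviour described in this post, not ESX's actual implementation. The path state labels ("active", "off", "dead") and the function name are assumptions, and treating a device that has several paths but only one still working as "degraded" is my reading of the one-path rule.

```python
KNOWN_PATH_STATES = {"active", "off", "dead"}  # hypothetical labels for this sketch


def device_status(path_states, pdl=False):
    """Return the device status implied by its path states, per the rules in this post."""
    if pdl:
        # Permanent Device Loss takes precedence over the path count.
        return "not connected"
    if not path_states or any(s not in KNOWN_PATH_STATES for s in path_states):
        # No scenario applies: the state cannot be recognised.
        return "unknown"
    working = [s for s in path_states if s == "active"]
    if not working:
        return "dead"        # every path is off or dead
    if len(working) == 1:
        return "degraded"    # a single path, so no redundancy (assumption: working paths)
    return "on"              # more than one path to the target


print(device_status(["active", "active"]))  # on
print(device_status(["active"]))            # degraded
print(device_status(["dead", "off"]))       # dead
```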

Permanent Device Loss - Planned/Unplanned



Planned versus unplanned PDL

A planned PDL occurs when there is an intent to remove a device presented to the ESXi host. The datastore must first be unmounted, and then the device must be detached, before the storage device is unpresented at the storage array.
1. esxcli storage filesystem list --> get the VMFS datastore name
2. esxcli storage filesystem unmount -l Data_Store_Name --> unmount the datastore
3. esxcli storage core device set --state=off -d naa.xxxx --> set the device state to off (detach) so that the device can be safely removed
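If this sequence has to be repeated for several datastores, it can help to generate the commands in the right order rather than typing them. The sketch below only builds the three command lines listed above and does not execute anything. The helper name planned_removal_commands is hypothetical, and Data_Store_Name and naa.xxxx are the same placeholders used in the steps.

```python
import shlex


def planned_removal_commands(datastore_label, device_id):
    """Return the planned-PDL commands from the steps above, in the order they must run."""
    return [
        # 1. List file systems to confirm the VMFS datastore name.
        ["esxcli", "storage", "filesystem", "list"],
        # 2. Unmount the datastore by its label.
        ["esxcli", "storage", "filesystem", "unmount", "-l", datastore_label],
        # 3. Detach the device by setting its state to off.
        ["esxcli", "storage", "core", "device", "set", "--state=off", "-d", device_id],
    ]


for argv in planned_removal_commands("Data_Store_Name", "naa.xxxx"):
    # shlex.join quotes arguments where needed (for example labels with spaces).
    print(shlex.join(argv))
```

Only after these steps succeed on the host should the device be unpresented at the storage array.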

An unplanned PDL occurs when the storage device is unexpectedly unmapped from the storage array without the unmount and detach being executed on the ESXi host. To clean up after an unplanned PDL:
  1. All running virtual machines on the datastore must be powered off and unregistered from the affected datastore. An easy way to power them off is to log in to the ESXi host and kill the vmx process of the respective VMs.
  2. From the vSphere Client, go to the Configuration tab of the ESXi host, click Storage.
  3. Right-click on the datastore being removed, and choose Unmount.

    A Confirm Datastore Unmount window displays. When the prerequisite criteria have been passed, the OK button appears.
  4. Perform a rescan on all of the ESXi hosts that had visibility to the LUN.
  5. After the device is removed, the paths to the device still show as "dead". To remove the dead paths, rescan with "esxcfg-rescan -d" or "esxcli storage core adapter rescan --adapter vmhbaxxx".

    Note: If there are active references to the device or pending I/O, the ESXi host still lists the device after the rescan.

Use of the -b|--boot option in esxcli storage nmp satp rule remove


The -b (--boot) option is used to remove the system SATP rules (the default SATP rules) so that user-defined rules can be added in their place.
A user cannot add a SATP rule with the same vendor/model as a default SATP rule, because ESXi does not allow duplicate rules. So if the user still wants to add a rule with the same vendor/model but with extra options, the user has to delete the system SATP rule first.
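The remove-then-add order can be illustrated with a toy model of the rule table. This is a sketch of the behaviour described above, not of ESXi internals: the class SatpRuleTable, the (vendor, model) key, the requirement that system rules are only removed with boot=True, and the option string "example_option" are all assumptions made for illustration. The vendor and model values are the ones from the device listing earlier in this post.

```python
class SatpRuleTable:
    """Toy model of SATP claim rules, keyed by (vendor, model)."""

    def __init__(self):
        self._rules = {}

    def add(self, vendor, model, satp, options="", system=False):
        key = (vendor, model)
        if key in self._rules:
            # Mirrors the "no duplicate rules" behaviour described above.
            raise ValueError("duplicate rule for %s/%s" % key)
        self._rules[key] = {"satp": satp, "options": options, "system": system}

    def remove(self, vendor, model, boot=False):
        rule = self._rules[(vendor, model)]
        if rule["system"] and not boot:
            # Assumption: a system default rule is only matched when -b|--boot is given.
            raise ValueError("system default rule: pass boot=True (models -b|--boot)")
        del self._rules[(vendor, model)]

    def get(self, vendor, model):
        return self._rules.get((vendor, model))


table = SatpRuleTable()
table.add("EFG", "ABD", "VMW_SATP_DEFAULT_AA", system=True)  # default rule present at boot
try:
    table.add("EFG", "ABD", "VMW_SATP_DEFAULT_AA", options="example_option")
except ValueError as err:
    print("rejected:", err)
table.remove("EFG", "ABD", boot=True)  # delete the system rule first
table.add("EFG", "ABD", "VMW_SATP_DEFAULT_AA", options="example_option")
print(table.get("EFG", "ABD"))
```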

~ # esxcli storage nmp satp rule remove
Error: Missing required parameter -s|--satp

Usage: esxcli storage nmp satp rule remove [cmd options]

Description:
  remove                Delete a rule from the list of claim rules for the given SATP.

Cmd options:
  -b|--boot             This is a system default rule added at boot time. Do not modify esx.conf or add to host profile.

How to find supported esxcli commands in ESXi 5.0

Get all esxcli commands supported in ESXi 5.0

~ # localcli esxcli command list
Namespace                                  Command
---------------------------------------------------
storage.core.device                        setconfig
storage.core.device.smart                  get
storage.core.device.stats                  get
storage.core.plugin.registration           remove
storage.nfs                                list
storage.nmp.psp.generic.deviceconfig       set
storage.nmp.psp                            list
storage.nmp.psp.roundrobin.deviceconfig    get
storage.nmp.psp.roundrobin.deviceconfig    set
storage.nmp.satp.generic.deviceconfig      get
storage.nmp.satp.generic.deviceconfig      set
storage.nmp.satp.generic.pathconfig        get
storage.nmp.satp.generic.pathconfig        set
storage.nmp.satp                           list

....etc....
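Each row of this listing maps to a full command line: the dots in the namespace become spaces and the command name is appended. Below is a small Python sketch that does this conversion, assuming the two-column layout shown above (a real host may print additional columns, which this sketch does not handle). The function name parse_command_list is illustrative.

```python
def parse_command_list(text):
    """Turn 'namespace command' rows into full esxcli command lines."""
    commands = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skips the prompt, the separator row, and blank or elided lines
        namespace, command = parts
        if namespace == "Namespace":
            continue  # header row
        commands.append("esxcli " + namespace.replace(".", " ") + " " + command)
    return commands


# A few rows copied from the listing above.
SAMPLE_LISTING = """Namespace                                  Command
---------------------------------------------------
storage.core.device                        setconfig
storage.nfs                                list
storage.nmp.psp.roundrobin.deviceconfig    get
storage.nmp.satp                           list
"""

for cmd in parse_command_list(SAMPLE_LISTING):
    print(cmd)
```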