SAN Troubleshooting Netapp Storage

A couple of years ago, I wrote a basic troubleshooting guide for a SAN environment, but it was never shared.
I believe it is a good idea to have a list of considerations at hand, especially when there are a ton of variables in play, so I've decided to share this list on my blog.
If you work in IT as a Storage/Unix/Windows admin, you know how things are during a production issue, especially with managers and international teams involved, when no one knows what is really going on and someone important is screaming on the phone waiting for answers.
This list has truly helped me ask the right questions to quickly identify what the problem is, and being prepared for those critical situations also improves response time.

So this guide is pretty much based on past experiences and vendor recommendations.

Let's check the list

  1. Do not freak out (very important)
  2. Calm yourself down (breathe)

And we are ready to get the information

  • What exactly is the problem? (you will need to understand the situation first)
  • What has been changed recently? (sometimes it was just a firmware update/upgrade on the host)
  • What is the scope of the problem?
  • Is only one device, application, or cluster affected?
  • Are any specific protocols or applications related to the problem?
  • Is the problem more prevalent at a given time of day or load condition?
  • Is this issue impacting reads, writes, both, or something else?
  • Is this issue persistent or only happening occasionally?
  • Once you have all the information above, you should have a clear view of what is going on.
    Ask for a few minutes to perform the next steps. Take your time and remember to stay calm.

    Next Steps

    •	Identify the storage controllers, volumes, and LUNs involved
    
    •	Based on the timeline given by the sysadmins or application teams, generate latency vs. IOPS graphs for the volumes/LUNs involved.
    
    •	Look at the Fibre Channel switch port statistics for the respective server aliases
    
    •	CRC (cyclic redundancy check) errors: a high number of CRC errors can indicate a problem with GBIC/SFP connectors or with the physical cabling for a given port.
    
    •	Examining port utilization can help you understand throughput and identify whether there is a bottleneck.
    
    •	Logs: check the SAN logs and compare the running configuration to the documentation.
    
    •	Is the SAN reporting events or errors that may be related?
    
    •	Have any recent storage changes occurred?
    
    •	Are there any SAN-related messages in the hosts' system message logs?
    
    •	Can other hosts see the storage controller involved?
    
    •	Is the storage port logged into the FC switch?
    
    •	Create a report and include your findings in it
    
    •	Create a case with the Storage vendor if needed.
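
The CRC and port utilization checks above come down to comparing two readings of the same port counters. Here is a minimal sketch of that arithmetic. The `PortSample` fields are hypothetical names, not any switch vendor's output format, and the 800 MB/s figure is an assumption for an 8 Gb/s Fibre Channel link in one direction; adjust it for your hardware.

```python
from dataclasses import dataclass


@dataclass
class PortSample:
    """One reading of a switch port's counters (field names are hypothetical)."""
    timestamp_s: float  # when the counters were read, in seconds
    crc_errors: int     # cumulative CRC error counter
    tx_bytes: int       # cumulative bytes transmitted
    rx_bytes: int       # cumulative bytes received


def _elapsed(first: PortSample, second: PortSample) -> float:
    elapsed = second.timestamp_s - first.timestamp_s
    if elapsed <= 0:
        raise ValueError("second sample must be taken after the first")
    return elapsed


def crc_errors_per_minute(first: PortSample, second: PortSample) -> float:
    """Rate of new CRC errors between two samples."""
    return (second.crc_errors - first.crc_errors) * 60.0 / _elapsed(first, second)


def utilization_percent(first: PortSample, second: PortSample,
                        usable_bytes_per_s: float) -> float:
    """Busier direction (tx or rx) as a percentage of usable bandwidth."""
    elapsed = _elapsed(first, second)
    tx_rate = (second.tx_bytes - first.tx_bytes) / elapsed
    rx_rate = (second.rx_bytes - first.rx_bytes) / elapsed
    return 100.0 * max(tx_rate, rx_rate) / usable_bytes_per_s


# Made-up readings taken five minutes apart on the same port.
before = PortSample(timestamp_s=0, crc_errors=10, tx_bytes=0, rx_bytes=0)
after = PortSample(timestamp_s=300, crc_errors=25,
                   tx_bytes=120_000_000_000, rx_bytes=30_000_000_000)

ASSUMED_LINK_BYTES_PER_S = 800_000_000  # assumption: 8 Gb/s FC, one direction

print(crc_errors_per_minute(before, after))
print(utilization_percent(before, after, ASSUMED_LINK_BYTES_PER_S))
```

A CRC counter that keeps growing between readings matters more than a large but static total, which is why the sketch works on deltas rather than absolute values.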
    
    

    So, where is the problem really?

    I got the image from ... honestly, I do not remember; it was two years ago, but it explains very well where to look.

    [Image: SAN_Troubleshooting]

    HOST

    Application, Filesystems, Volume manager, Virtual memory, HBA driver

    •	Filesystems: check iSCSI, LUNs, and NFS status, including daemons
    •	HBA drivers: review drivers and test with SANsurfer, check the multipathing configuration and how it is managed (MPP?), and check HBA status
    

    STORAGE UNIT

    Fibre Channel, SAN/LAN switch, Storage controller, Cache, Backend, Disks

    •	Fibre Channel: in case of many CRC errors, request a review of the fiber jumpers, trunks, and patch panels.
    
    •	SAN/LAN switch: review logs for failed ports, CRC errors,
    	ISL disconnections, multipathing, disabled ports, and power and fan redundancy.
    	Save statistics and reset counters; check GBIC status, that the clocks on both switches are in sync, and zone status.
    
    •	Storage controller: review redundant power supplies, fans, connectivity issues to the switch,
    	and that HA is enabled and working fine.
    
    •	Cache: review failed memory banks, battery modules, and disabled modules
    
    •	Back end: Review failed paths, LUN status
    
    •	Disks: Review failed disks, alarm on shelves, connectivity issues, general alarm, firmware problems.
    

    A lot of things can go wrong in a complex Storage Area Network. Based on the symptoms, narrowing a problem down to a probable cause in one of the following areas should speed up troubleshooting and resolution.
    (Links are based on NetApp storage appliances, but basically the same kind of logic applies to other vendors such as IBM, EMC or HP.)

    Compatibility issues

    • Check the NetApp or SAN vendor compatibility matrix against the host
    • Check the host vendor's software (volume manager, cluster services, etc.)

    https://support.netapp.com/NOW/products/interoperability

    Volume Manager Issues

    • Work with the sysadmins to find out which volume manager and version is in use
    • Disk layout: concatenated vs. striped
    • A concatenated disk layout is well known to cause performance problems
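
To see why a concatenated layout can hurt, it helps to look at where logical blocks land. The following is a simplified sketch, not any volume manager's actual mapping: the function names, the disk count, the disk size, and the stripe unit are all illustrative assumptions.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def concat_disk(block: int, blocks_per_disk: int) -> int:
    """Concatenation fills disk 0 completely, then disk 1, and so on."""
    return block // blocks_per_disk


def stripe_disk(block: int, stripe_unit_blocks: int, disk_count: int) -> int:
    """Striping rotates to the next disk every stripe_unit_blocks blocks."""
    return (block // stripe_unit_blocks) % disk_count


def blocks_per_disk_hit(blocks: Iterable[int],
                        locate: Callable[[int], int]) -> Dict[int, int]:
    """Count how many of the accessed blocks fall on each disk."""
    return dict(sorted(Counter(locate(b) for b in blocks).items()))


# Illustrative numbers: 4 disks of 1000 blocks each, and a busy region
# covering the first 400 logical blocks of the volume.
hot_region = range(400)

concat_load = blocks_per_disk_hit(hot_region, lambda b: concat_disk(b, 1000))
stripe_load = blocks_per_disk_hit(hot_region, lambda b: stripe_disk(b, 16, 4))

print("concat:", concat_load)  # → concat: {0: 400}
print("stripe:", stripe_load)  # → stripe: {0: 112, 1: 96, 2: 96, 3: 96}
```

With concatenation the whole busy region sits on one disk while the others idle; with striping the same region is spread across all four.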

    A RAID theory
    https://www.cuddletech.com/veritas/raidtheory.html

    From an SQL admin
    https://sqlblog.com/blogs/merrill_aldrich/archive/2009/07/26/san-disk-array-performance-beware-lun-concatenation.aspx

    From Oracle
    https://docs.oracle.com/cd/E19683-01/806-6111/6jf2ve3ga/

    LUN alignment

    I have A LOT to say about LUN alignment issues, since I spent almost a year fixing production environments with LUNs created in the wrong format. Because I'm done with it, this is a copy/paste from Netapp.com:

    File system misalignment is a storage industry problem that generates an un-optimized workload on a storage system.
    Please refer to the following guides for a complete walkthrough of how to address LUN alignment problems in a SAN environment:

    How to identify LUNs misaligned
    https://kb.netapp.com/support/index?page=content&id=1014109&actp=search&viewlocale=en_US&searchid=null

    How to create aligned partitions
    https://kb.netapp.com/support/index?page=content&id=1010717&locale=en_US

    Best Practices for File System Alignment in Virtual Environments
    https://media.netapp.com/documents/tr-3747.pdf
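
The core of an alignment check is simple arithmetic: a partition is aligned when its starting byte offset is a multiple of the storage block size. A minimal sketch follows, assuming 512-byte sectors and a 4 KB storage block (the block size NetApp's WAFL uses). The function names are my own, and the start sector is whatever your partitioning tool reports.

```python
SECTOR_BYTES = 512          # assumption: classic 512-byte sectors
STORAGE_BLOCK_BYTES = 4096  # assumption: 4 KB blocks on the storage side


def partition_offset_bytes(start_sector: int,
                           sector_bytes: int = SECTOR_BYTES) -> int:
    """Byte offset at which the partition begins on the LUN."""
    return start_sector * sector_bytes


def is_aligned(start_sector: int,
               sector_bytes: int = SECTOR_BYTES,
               block_bytes: int = STORAGE_BLOCK_BYTES) -> bool:
    """True when the partition starts exactly on a storage block boundary."""
    return partition_offset_bytes(start_sector, sector_bytes) % block_bytes == 0


# Start sector 63 was a common legacy default; 64 and 2048 are typical
# aligned choices.
for sector in (63, 64, 2048):
    print(sector, partition_offset_bytes(sector), is_aligned(sector))
```

Sector 63 gives an offset of 32256 bytes, which is not a multiple of 4096, so each 4 KB host I/O straddles two storage blocks. Sectors 64 (32768 bytes) and 2048 (1048576 bytes) both land on a boundary.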

    Multipath configuration

    A bad multipath configuration can lead to disconnections on the host side and make the storage controller work harder.
    Recommendation:
    Ask the sysadmins to contact their vendor and review the multipath configuration to make sure it supports high availability.

    How to verify HP-UX fibre channel configurations with multipathing I/O (MPIO)
    https://kb.netapp.com/support/index?page=content&id=1010434

    How to verify VMware ESX Fibre Channel configurations with Multipathing I/O (MPIO)
    https://kb.netapp.com/support/index?page=content&id=1011577

    How to verify Windows fibre channel configurations with multipathing I/O (MPIO)
    https://kb.netapp.com/support/index?page=content&id=1012650
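
Whatever the host OS, the question behind a multipath review is the same: does every LUN still have at least two working paths that share neither an HBA nor a storage port? The sketch below shows that check on a hand-built path list. The `Path` record and its field values are hypothetical; in practice you would fill it from your multipathing tool's output.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Path:
    """One host-to-LUN path (hypothetical record, not a tool's real format)."""
    lun: str
    host_hba: str
    target_port: str
    state: str  # "active" or "failed"


def luns_without_redundancy(paths: Iterable[Path]) -> List[str]:
    """LUNs whose active paths do not span two HBAs and two target ports."""
    hbas = defaultdict(set)
    targets = defaultdict(set)
    luns = set()
    for p in paths:
        luns.add(p.lun)  # remember the LUN even if all its paths are failed
        if p.state == "active":
            hbas[p.lun].add(p.host_hba)
            targets[p.lun].add(p.target_port)
    return sorted(l for l in luns
                  if len(hbas[l]) < 2 or len(targets[l]) < 2)


# Made-up example: lun1 has lost one of its two paths.
example = [
    Path("lun0", "hba0", "0a", "active"),
    Path("lun0", "hba1", "0b", "active"),
    Path("lun1", "hba0", "0a", "active"),
    Path("lun1", "hba1", "0b", "failed"),
]

print(luns_without_redundancy(example))  # → ['lun1']
```

A LUN flagged here is still reachable, but one more failure on the remaining HBA or storage port would take it offline.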

    Partnerpath configuration

    • The storage controller works harder
    • Performance issues when the controller has to reach the LUN through the partner

    What does partner path configuration mean?
    https://kb.netapp.com/support/index?page=content&id=3010111&actp=LIST_POPULAR

    Incorrect configuration or zoning

    If needed, validate your zoning configuration with NetApp or your SAN vendor.
    Fibre Channel and iSCSI best practices
    https://library.netapp.com/ecm/ecm_get_file/ECMM1280844

    Exceeding the capacity limits

    Check the fabric zoning and the host vendor's software. If clustering or tuning is in use on the host side, investigate further with the sysadmins.

    Please feel free to contribute your experience and perspective on how to address these kinds of problems.