27 November 2014 / troubleshooting

SAN Troubleshooting Netapp Storage

SAN Troubleshooting

A couple of years ago I wrote a basic troubleshooting guide for a SAN environment, but it was never shared.
I believe it is a good idea to have a list of things to check, especially when there are a ton of variables to consider, so I've decided to share this list on my blog.
If you work in IT as a Storage/Unix/Windows admin, you know how things are during a production issue, especially with managers and international teams involved, when no one knows what is really going on and someone important is screaming on the phone waiting for answers.
This list has truly helped me ask the right questions to quickly identify what the problem is, and it also improves response time if you have it ready for those critical situations.

So pretty much I based this guide on past experiences and vendor recommendations.

Let's check the list

Do not freak out (very important)
Calm yourself down (breathe)

And we are ready to get the information

What exactly is the problem? (you need to understand the situation first)

What has been changed recently? (sometimes it was just a firmware update/upgrade on the host)

What is the scope of the problem?

Is only one device, application or cluster affected?

Are any specific protocols or applications related to the problem?

Is the problem more prevalent at a given time of day or load condition?

Is this issue impacting reads, writes, both, or something else?

Is this issue persistent or only happening occasionally?

Once you have all the information above, you should have a clear view of what is going on.
Ask for a few minutes to perform the next steps. Take your time and remember to stay calm.

Next Steps

•	Identify the storage controllers, volumes and LUNs involved

•	Based on the timeline given by the sysadmins/application teams, generate Latency vs IOPS graphs for the volumes/LUNs involved.

•	Take a look at the Fibre Channel switch port stats for the respective server aliases

•	CRC (cyclic redundancy check) errors: a high number of CRC errors can indicate a problem with the GBIC/SFP connectors or with the physical cabling for a given port.

•	Examining port utilization can help you understand throughput and identify whether there is a bottleneck here.

•	Logs: check the SAN logs and compare the running configuration to the documentation.

•	Is the SAN reporting events or errors that may be related?

•	Have any recent storage changes occurred?

•	Are there any SAN-related messages in the host's system message logs?

•	Can other hosts see the storage controller involved?

•	Is the storage port logged into the FC switch?

•	Create a report and provide your feedback on it

•	Create a case with the Storage vendor if needed.
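
For the Latency vs IOPS step, here is a minimal sketch in Python of what I mean. It assumes you have already exported per-interval samples for the LUN or volume from whatever performance tool you use. The sample values, the tuple layout and the 20 ms threshold are made up for illustration and are not the output of any NetApp tool.

```python
from datetime import datetime

# Hypothetical samples: (timestamp, IOPS, average latency in ms).
samples = [
    (datetime(2014, 11, 27, 9, 0), 1200, 4.1),
    (datetime(2014, 11, 27, 9, 5), 1350, 4.8),
    (datetime(2014, 11, 27, 9, 10), 5200, 31.5),
    (datetime(2014, 11, 27, 9, 15), 5400, 38.2),
    (datetime(2014, 11, 27, 9, 20), 1300, 4.5),
]

def in_window(samples, start, end):
    """Keep only the samples inside the timeline reported by the sysadmins."""
    return [s for s in samples if start <= s[0] <= end]

def summarize(samples, latency_limit_ms=20.0):
    """Return peak IOPS, peak latency and the samples above the latency limit."""
    peak_iops = max(s[1] for s in samples)
    peak_latency = max(s[2] for s in samples)
    slow = [s for s in samples if s[2] > latency_limit_ms]
    return peak_iops, peak_latency, slow

# Timeline given by the application team (assumed for this example).
window = in_window(samples,
                   datetime(2014, 11, 27, 9, 5),
                   datetime(2014, 11, 27, 9, 20))
peak_iops, peak_latency, slow = summarize(window)

print(f"peak IOPS={peak_iops}, peak latency={peak_latency} ms, "
      f"slow intervals={len(slow)}")
for ts, iops, lat in slow:
    print(f"  {ts:%H:%M} iops={iops} latency={lat} ms")
```

If the slow intervals line up with the IOPS peaks, you are probably looking at a load problem. If latency is high while IOPS stay flat, look further down the stack.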

So, where is the problem really?

I got the image from ... honestly I do not remember, it was two years ago, but it explains very well where to look.

SAN_Troubleshooting

HOST

Application, Filesystems, Volume manager, Virtual memory, HBA driver

•	Filesystems: check the status of iSCSI, LUNs and NFS, including their daemons
•	HBA drivers: review the drivers and test with SANsurfer, check the multipathing configuration and how it is managed (MPP?), and check the HBA status

STORAGE UNIT

Fibre Channel, SAN/LAN switch, Storage controller, Cache, Back end, Disks

•	Fibre Channel: in case of many CRC errors, request a review of the fiber jumpers, trunks and patch panel.

•	SAN/LAN switch: review the logs for failed ports, CRC errors,
	ISL disconnections, multipathing, disabled ports, and power and fan redundancy.
	Save the stats and reset the counters, check GBIC status, make sure the clocks show the same time on both switches, and check zone status.

•	Storage controller: review redundant power supplies, fans, connectivity issues to the switch,
	and that HA is enabled and working fine.

•	Cache: review failed memory banks, battery modules, disabled modules.

•	Back end: review failed paths, LUN status.

•	Disks: review failed disks, alarms on shelves, connectivity issues, general alarms, firmware problems.
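
Saving the stats before resetting the counters matters because a raw CRC count tells you little. What you want is how much it grew during the incident. Here is a rough Python sketch of that idea, together with a port utilization estimate. The port names, the counter names and the two snapshots are invented for the example. The 800 MB/s figure is my assumption for the usable bandwidth of one direction of an 8 Gb FC link, so adjust it for your own link speed.

```python
# Two hypothetical snapshots of the same switch ports, taken 300 seconds apart.
before = {
    "port1": {"crc_errors": 10, "tx_bytes": 1_000_000_000, "rx_bytes": 2_000_000_000},
    "port2": {"crc_errors": 0, "tx_bytes": 5_000_000_000, "rx_bytes": 1_000_000_000},
}
after = {
    "port1": {"crc_errors": 250, "tx_bytes": 2_200_000_000, "rx_bytes": 2_300_000_000},
    "port2": {"crc_errors": 0, "tx_bytes": 185_000_000_000, "rx_bytes": 31_000_000_000},
}

def port_report(before, after, interval_s, link_mb_per_s=800.0):
    """Per port: CRC errors added in the interval and utilization in percent.

    Utilization uses the busier direction, since the link is full duplex
    and each direction has its own bandwidth.
    """
    report = {}
    for port in sorted(after):
        crc_delta = after[port]["crc_errors"] - before[port]["crc_errors"]
        tx_mb_s = (after[port]["tx_bytes"] - before[port]["tx_bytes"]) / interval_s / 1e6
        rx_mb_s = (after[port]["rx_bytes"] - before[port]["rx_bytes"]) / interval_s / 1e6
        busiest = max(tx_mb_s, rx_mb_s)
        report[port] = {
            "crc_delta": crc_delta,
            "utilization_pct": round(100.0 * busiest / link_mb_per_s, 1),
        }
    return report

report = port_report(before, after, interval_s=300)
for port, stats in report.items():
    print(port, stats)
```

In this made-up data, port1 is almost idle but keeps collecting CRC errors, which points to cabling or SFP. port2 is clean but busy, which points to a possible bottleneck.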

Common problems, with deeper explanations and useful links

A lot of things can go wrong in a complex Storage Area Network. Based on the symptoms, narrowing a problem down to a probable cause in one of the following areas should speed up troubleshooting and resolution.
(The links are for NetApp storage appliances, but basically the same logic applies to other vendors such as IBM, EMC or HP.)

Compatibility issues

Check the NetApp or SAN vendor compatibility matrix against the host
Check the host vendor's software (volume manager, cluster services, etc.)

https://support.netapp.com/NOW/products/interoperability

Volume Manager Issues

Go further with the sysadmins and find out which volume manager and version is in use
Disk layout: concatenated vs striped
It is well known that a concatenated disk layout can cause performance problems
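
To see why, here is a simplified Python sketch of how a logical block number maps to a disk (or LUN) in each layout. This is a toy model with made-up sizes, not how any particular volume manager implements it.

```python
from collections import Counter

def concat_disk(block, blocks_per_disk):
    """Concatenation: fill disk 0 completely, then disk 1, and so on."""
    return block // blocks_per_disk

def stripe_disk(block, stripe_blocks, n_disks):
    """Striping: rotate across the disks every `stripe_blocks` blocks."""
    return (block // stripe_blocks) % n_disks

# A "hot" region: the first 1024 logical blocks of a 4-disk volume.
hot_blocks = range(1024)

concat_hits = Counter(concat_disk(b, blocks_per_disk=10_000) for b in hot_blocks)
stripe_hits = Counter(stripe_disk(b, stripe_blocks=16, n_disks=4) for b in hot_blocks)

print("concat :", dict(concat_hits))
print("striped:", dict(stripe_hits))
```

With concatenation, every access to the hot region lands on the first disk while the others sit idle. With striping, the same accesses are spread evenly over all four.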

A RAID theory
https://www.cuddletech.com/veritas/raidtheory.html

From an SQL admin
https://sqlblog.com/blogs/merrill_aldrich/archive/2009/07/26/san-disk-array-performance-beware-lun-concatenation.aspx

From Oracle
https://docs.oracle.com/cd/E19683-01/806-6111/6jf2ve3ga/

LUN alignment

I have A LOT to say about LUN alignment issues, since I spent almost a year fixing production environments with LUNs created in the wrong format. Because I'm done with it, this is a copy/paste from Netapp.com

File system misalignment is a storage industry problem that generates an un-optimized workload on a storage system.
Please refer to the following guides for a complete walkthrough of how to address LUN alignment problems in a SAN environment:

How to identify misaligned LUNs
https://kb.netapp.com/support/index?page=content&id=1014109&actp=search&viewlocale=en_US&searchid=null

How to create aligned partitions
https://kb.netapp.com/support/index?page=content&id=1010717&locale=en_US

Best Practices for File System Alignment in Virtual Environments
https://media.netapp.com/documents/tr-3747.pdf
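
The core of the alignment check is simple arithmetic: the byte offset where the partition starts has to be a multiple of the storage block size. Here is a minimal Python sketch, assuming 512-byte sectors on the host and a 4 KB block on the storage side (which is what the NetApp guides are built around, so check the value for your vendor).

```python
SECTOR_BYTES = 512   # assumed host sector size
BLOCK_BYTES = 4096   # assumed storage block size

def is_aligned(start_sector, sector_bytes=SECTOR_BYTES, block_bytes=BLOCK_BYTES):
    """True if the partition start falls on a storage block boundary."""
    return (start_sector * sector_bytes) % block_bytes == 0

# 63 is the classic old MBR default, 64 and 2048 are common aligned starts.
for start in (63, 64, 2048):
    offset = start * SECTOR_BYTES
    print(f"start sector {start:5d} -> byte offset {offset:8d} "
          f"aligned={is_aligned(start)}")
```

A partition starting at sector 63 begins at byte 32256, which is not a multiple of 4096, so host blocks straddle two storage blocks and the storage has to do extra work for them.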

Multipath configuration

A bad multipath configuration can lead to disconnections on the host side and make the storage controller work harder.
Recommendation:
Ask the sysadmins to contact their vendor and review the multipath configuration to make sure it supports high availability.

How to verify HP-UX fibre channel configurations with multipathing I/O (MPIO)
https://kb.netapp.com/support/index?page=content&id=1010434

How to verify VMware ESX Fibre Channel configurations with Multipathing I/O (MPIO)
https://kb.netapp.com/support/index?page=content&id=1011577

How to verify Windows fibre channel configurations with multipathing I/O (MPIO)
https://kb.netapp.com/support/index?page=content&id=1012650

Partnerpath configuration

The storage controller works harder
Performance issues when the controller has to reach the LUN through the partner

What does partner path configuration mean?
https://kb.netapp.com/support/index?page=content&id=3010111&actp=LIST_POPULAR
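
To put a number on how much traffic is taking the long way around, here is a small Python sketch that computes the share of I/O going through the partner. The path names, the "primary"/"partner" labels and the counts are hypothetical. In practice you would fill them in from whatever per-path statistics your host or controller gives you.

```python
# Hypothetical per-path I/O counts for one LUN.
paths = [
    {"path": "hba0->ctrlA", "type": "primary", "ios": 90_000},
    {"path": "hba1->ctrlA", "type": "primary", "ios": 85_000},
    {"path": "hba0->ctrlB", "type": "partner", "ios": 25_000},
]

def partner_share(paths):
    """Percentage of all I/O that reached the LUN through the partner."""
    total = sum(p["ios"] for p in paths)
    if total == 0:
        return 0.0
    partner = sum(p["ios"] for p in paths if p["type"] == "partner")
    return 100.0 * partner / total

share = partner_share(paths)
print(f"{share:.1f}% of I/O goes through the partner path")
```

With a healthy multipath setup I would expect this to stay close to zero while the primary paths are up. A steady, non-trivial share is a reason to review the host multipath configuration.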

Incorrect configuration or zoning

If needed, go ahead and validate your zoning configuration with NetApp or your SAN vendor
Fibre Channel and iSCSI best practices
https://library.netapp.com/ecm/ecm_get_file/ECMM1280844

Exceeding the capacity limits

Check the fabric zoning and the host vendor's software. If clustering or tuning on the host side is involved, investigate further with the sysadmins.

Please feel free to contribute your experience and perspective on how to address these kinds of problems.