
EMC2 EMC CLARiiON

Support Troubleshooting CX Series Boot Problems


Procedure
CLAR-PSP-093
Date Last Modified - August 19, 2003

Introduction
When a CX series storage processor fails to boot, a diagnosis must be made to determine whether some minor corrective action is
sufficient to rectify the situation, whether hardware must be replaced, or whether the storage processor “boot-drives” must be re-imaged.
This document describes the procedure by which a CX series array can be analyzed to make that decision. The actual procedure
for re-imaging an array is beyond the scope of this document.

Diagnosis
The primary goal of this diagnosis is to get the customer’s storage array running at 100% functionality as rapidly as possible.
A secondary goal is to retrieve necessary and sufficient information so that the root cause of the failure can be determined
off-site.

Certain types of failures will require re-imaging a pair of SP boot drives by using the documented procedure for a CX
series Recovery Drive. Other types of failures will require an SP replacement, and others may require re-cabling or
reconfiguration of the array.

Note that SPA boots from a mirrored boot drive on disks 0 and 2. SPB boots from a mirrored boot drive on disks
1 and 3. If a drive fails physically, the SP will boot from the survivor. If the image becomes corrupted, the corruption will
likely affect both boot drives for an SP since they are a mirrored pair in the “Boot area”.

Main areas of Array function which could cause an array to not boot
• Storage Processor Power-on Self Test (POST) failure
• Storage Processor BIOS Test Failure
• Inaccessibility to the Boot Drives in the chassis where they reside
• Chassis with boot drives is not on back-end Bus 0
• Chassis with boot drives is not “enclosure 0”.

Activities where you may find an SP un-bootable.


• First time installation
• Following the removal and insertion of drives 0_0 through 0_3. These drives are SPECIFIC to their slots and, if relocated,
will leave the array un-bootable.
• During an NDU procedure. Panics that occur between the deactivate and the activate phases of an NDU can leave an SP
in a state that is non-bootable. The SP can be caught with the registry entries for drivers required for a boot from Fibre
deleted.

CLAR-PSP-093 Page 1 of 10
Determine the circumstances of the failure
Ask the customer the circumstances of the failure. What, if anything, was the customer doing when the SP failed? Determine
if one or both of the SPs fail to boot.

Perform a Visual Inspection of the Array


Are there any loose cables on the back of the array? Are the enclosures wired correctly? Are the enclosure numbers set
correctly? Correct any problems and reboot the SP.
Observe the SP’s Amber BOOT LED next to the LAN connector on the air-dam of the SP
Determine the blink rate of the amber LED on the air dam of the CX series storage processor. The rate identifies how
far the SP has progressed through the boot process.

Blink Rate                Meaning

1/4 Hz                    Power-up and BIOS Initialization Phase
(once every 4 seconds)    If the SP continues blinking at this rate, REPLACE the SP.

1 Hz                      Extended POST Testing Phase
(once every second)       When the previous phase finishes, the LED begins to blink once per
                          second to signify that the SP is in POST testing. If the LED never
                          enters the 4 Hz rate and stays at this 1 Hz rate, the SP has not
                          successfully finished the POST phase. For a CX600, if the XPE chassis
                          cannot locate the Boot chassis, you will see a constant 1 Hz blink
                          rate. If everything is connected properly, or the array is not a
                          CX600, replace the SP.

4 Hz                      Boot Phase
(4 times per second)      If this rate starts, POST has completed and the boot phase has begun.
                          If the Boot LED does not go out and stay out, the SP has not
                          successfully booted. See the following sections of this document to
                          diagnose.

Steady Off                Boot Success, ready for I/O (NDUMON)
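
The blink-rate table can be expressed as a small lookup for anyone scripting notes from a site visit. The following is a minimal sketch only: the function and enum names are hypothetical, the 0.5 Hz and 2 Hz cut-offs are assumed midpoints between the documented rates, and the 8-second minimum observation window is an assumption chosen so that a 1/4 Hz blink is seen at least twice.

```python
from enum import Enum


class BootPhase(Enum):
    BIOS_INIT = "Power-up and BIOS initialization (1/4 Hz)"
    EXTENDED_POST = "Extended POST testing (1 Hz)"
    BOOT = "Boot phase (4 Hz)"
    BOOTED = "Boot success, ready for I/O (steady off)"


def phase_from_blinks(blink_count: int, window_seconds: float) -> BootPhase:
    """Map blinks counted over an observation window to a boot phase.

    Thresholds (0.5 Hz and 2 Hz) are assumed midpoints between the
    documented 1/4 Hz, 1 Hz and 4 Hz rates, not values from the procedure.
    """
    if window_seconds < 8:
        # Assumption: watch long enough to see a 1/4 Hz blink at least twice.
        raise ValueError("observe the LED for at least 8 seconds")
    if blink_count == 0:
        return BootPhase.BOOTED
    hz = blink_count / window_seconds
    if hz < 0.5:
        return BootPhase.BIOS_INIT
    if hz < 2.0:
        return BootPhase.EXTENDED_POST
    return BootPhase.BOOT
```

For example, counting 48 blinks over 12 seconds gives 4 Hz, which maps to the boot phase; the LED must then go out and stay out for the boot to be considered successful.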

LED at 4 Hz Rate - Ping the SP that Fails to Boot


An SP can be pinged from a service laptop or any management station that is connected via LAN by typing the following
from an MS-DOS command prompt: ping <IP address>. This form of the command is appropriate for a static test after the
boot has completed.
An SP can be pinged dynamically by the following command: ping <IP address> -t. This will continuously ping the SP until
the command prompt window is closed. After issuing this command, the SP can be rebooted to see if it becomes ping-able at
any time during the boot process.

• It Is Ping-able
If you are able to ping the faulty SP after the boot process has completed, this is a sign that NT has booted and the
network interface drivers have loaded. It is not likely that there are any hardware issues with the Storage Processor
at this time.

• While monitoring the dynamic ping, the SP alternates between answering the ping and timing out.
If the SP was pinged dynamically, you may see the SP become ping-able for a while during the boot process and
then become un-ping-able again. This may repeat several times until the SP finally remains ping-able. This indicates
that there has been a repeated panic/failure or reboot in one of the core software components, or that the peer SP is
resetting the SP being pinged. The SP may remain ping-able once the reboot count is exhausted after 4 unsuccessful
reboots. The SP may never become ping-able if the peer SP is constantly resetting it.
An indication of an SP that boots NT from its FC boot drives and then panics is rapid disk activity on a pair of boot
drives (0/2 or 1/3) during the boot that follows the panic. If this panic/reboot occurs 4 times, the reboot
counter will be tripped and the SP will remain with Flare not running, although it may be running NT. The BOOT LED
on the SP air dam will be flashing at a rapid (4 Hz) rate, indicating that the Flare driver/application has not
successfully loaded and started. Call EMC/CLARiiON Support.
An SP may also reboot constantly. This indicates that the SP reboot counter is being reset, which prevents the
“4 reboot counter” from being reached. Call EMC/CLARiiON Support.

• The SP Remains Ping-able


Attempt to SymRemote into the faulty SP when it appears that the SP has finished its reboot process (the SP
remains ping-able).

If the attempt to SymRemote into the faulty SP is successful, involve the CLARiiON Technical Support group
to help use debug techniques to root-cause the failure and take appropriate action. Determine what drivers have
started, look at event logs, collect any core dumps that are available, etc.

If the attempt to SymRemote into the faulty SP fails, suspect a user space issue. There have been cases where
NT has started but a user space component of the core software has hung and SymRemote cannot access the SP.
In that case the only option is to re-image the affected pair of boot drives via the Boot disk recovery procedure.

Note that the reboot count protects an SP from failures in the majority of components of the core
software. If any core software component that is not required for boot fails repeatedly, the reboot count
will eventually be exhausted and NT will boot without attempting to load the remainder of the core software. Several
device drivers are required to boot from Fibre and thus are not subject to this reboot count. A failure
of these drivers will make an SP un-bootable.

BOOT LED at a Constant 4 Hz Rate & The SP is not Ping-able


If you can’t ping the SP after allowing sufficient time for reboots to be exhausted then you can assume that NT has not
booted sufficiently to start the network drivers.
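
The three ping outcomes described in these sections (stays ping-able, alternates, never ping-able) can be told apart from a saved `ping <IP address> -t` transcript. The sketch below is illustrative only: the function names are hypothetical, and the line matching assumes the usual Windows ping wording ("Reply from ... bytes=" and "Request timed out"), which may differ by OS version or locale.

```python
from typing import Iterable, List, Sequence


def replies_from_ping_output(lines: Iterable[str]) -> List[bool]:
    """Turn ping transcript lines into True (reply) / False (time-out).

    Assumes Windows-style ping wording; unrecognized lines are skipped.
    """
    replies = []
    for line in lines:
        text = line.strip()
        if text.startswith("Reply from") and "bytes=" in text:
            replies.append(True)
        elif text.startswith("Request timed out") or "unreachable" in text:
            replies.append(False)
    return replies


def classify_ping_history(replies: Sequence[bool]) -> str:
    """Classify a ping history taken across an SP reboot.

    "not_pingable": NT has not booted far enough to start network drivers.
    "pingable":     replies began and never dropped out again.
    "alternating":  replies dropped out at least once after starting,
                    suggesting repeated panics/reboots or peer resets.
    """
    if not any(replies):
        return "not_pingable"
    # Count transitions from answering to timing out.
    drops = sum(1 for prev, cur in zip(replies, replies[1:]) if prev and not cur)
    return "alternating" if drops else "pingable"
```

A history classified as "alternating" that ends with a long run of replies is consistent with the reboot count having been exhausted; one that keeps alternating indefinitely is consistent with the reboot counter being reset.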

Observe the Console Port Output


Attach the service laptop COM port to the console port of the failed SP and observe the output via HyperTerminal. This
output should be captured for off-line analysis. The console port output provides more information about the nature of an SP
failure.

BIOS/POST Failures

Extended POST
Failures of the Extended POST diagnostics will result in an error code being displayed in the console output. The error
codes generated by the Extended POST diagnostics will attempt to isolate the fault to a field replaceable unit (FRU).
These error codes are documented in the Chameleon II & X1 Power On Self Test (POST) Functional Specifications.
Replace the FRU specified by the error code and restart the SP.
There are also non-fatal warning messages that are displayed by the Extended POST diagnostics on the console output.

If there are no Extended POST diagnostic failures then the storage processor should have access to the disks that make up the
mirrored boot drive. An SP could still fail to boot due to an inability to read from the boot drives. This could be due to
• backend loop failure
• faulty physical drive
• incorrect cabling.

The following diagram shows a CX600 that can not find the Boot chassis. This is due to a cabling error.

DDBS Failures
DDBS (Data Directory Boot Service) is a facility that is called by Extended POST to determine which half of the
mirrored boot drive the storage processor should boot from.

The DDBS console output in the table below shows that both halves of the mirrored boot drive are valid for boot. There
were no inconsistencies that would cause DDBS to disqualify either half of the mirror for booting NT. Extended POST
found the NT image and declared that disk 0 and 2 are both valid for booting this SP (SPA).

If DDBS finds any inconsistencies that cause it to disqualify a disk for boot then error messages are generated. It is
acceptable for one half of the mirror to be disqualified since a rebuild may be required.
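
A captured console log can be checked for the DDBS "valid for boot" lines with a few lines of script. This is a sketch under stated assumptions: the function name is hypothetical, it recognizes only the two success messages shown in Appendix A, and it treats the absence of a message as "not confirmed valid" because the wording of the DDBS error messages is not reproduced in this document.

```python
from typing import Dict


def ddbs_valid_halves(console_text: str) -> Dict[str, bool]:
    """Report which halves of the mirrored boot drive DDBS declared valid.

    Only the success lines from Appendix A are matched; a half that is
    not reported valid may have been disqualified or simply not logged.
    """
    valid = {"first": False, "second": False}
    for line in console_text.splitlines():
        text = line.strip().lower()
        if not text.startswith("ddbs:"):
            continue
        if "first disk is valid for boot" in text:
            valid["first"] = True
        elif "second disk is valid for boot" in text:
            valid["second"] = True
    return valid


def can_boot(console_text: str) -> bool:
    """True if at least one half of the mirror was declared valid."""
    return any(ddbs_valid_halves(console_text).values())
```

Because one disqualified half is acceptable (a rebuild may be pending), `can_boot` returns True when either half is reported valid.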

The diagram below shows a CX600 chassis with non-bootable drives in slots 1 & 2 and no drives in slots 2 & 3 (same as
non-bootable drives).
If the boot drives are in the correct slots, there is likely a Utility partition for each SP. This would allow the drives to be
re-imaged. See CLAR-PSP-078.

Too few reads from the boot disk


During the boot process reads of the selected boot disk are performed to retrieve code and data from disk. These reads
are performed via the int13 interface. Extended POST keeps track of the number of reads performed and periodically
writes a line to the console output. The last such line from appendix A is as follows:

int13 - READ PARAMETERS (1735) (typical # of reads)


The number in parentheses (1735) is the number of reads performed. Although we can’t say exactly how many int13
reads will be performed during a boot, the “typical” number of reads is noteworthy.

Recently observed number of reads:
CX200    int13 - READ PARAMETERS (1852)
CX400    int13 - READ PARAMETERS (1810)
CX600    int13 - READ PARAMETERS (1710)

If an SP fails to boot and significantly fewer reads have been performed than expected, the boot process has hung at
some point prior to completion. The SP boot drives should be re-imaged.
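
The read-count comparison can be automated against a captured console log. The sketch below is a hedged illustration: the function names are hypothetical, the per-model figures are the "recently observed" counts quoted above, and the 0.9 fraction used to decide what counts as "significantly fewer" is an assumption, since the procedure gives no numeric threshold.

```python
import re
from typing import Optional

READ_LINE = re.compile(r"int13 - READ PARAMETERS \((\d+)\)")

# Recently observed counts quoted in this procedure.
TYPICAL_READS = {"CX200": 1852, "CX400": 1810, "CX600": 1710}


def last_read_count(console_text: str) -> Optional[int]:
    """Return the counter from the last READ PARAMETERS line, if any."""
    counts = READ_LINE.findall(console_text)
    return int(counts[-1]) if counts else None


def reads_look_short(console_text: str, model: str, fraction: float = 0.9) -> bool:
    """True if the last read count is well below the typical figure.

    `fraction` is an assumed cut-off, not a value from the procedure.
    A log with no READ PARAMETERS lines is treated as zero reads.
    """
    typical = TYPICAL_READS[model]
    observed = last_read_count(console_text) or 0
    return observed < fraction * typical
```

With these assumptions, a CX600 log that stops at `int13 - READ PARAMETERS (1211)` is flagged as short, while one that reaches 1643, as in Appendix A, is not.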

No Console Output Failures
If there are no failures in the console output, you can ping the SP, and you can SymRemote onto the SP, but the SP appears
to be unmanaged by Navisphere and restarting the management server from ipaddress/setup does not resolve this,
call EMC/CLARiiON Support.

Appendix A – Typical Console Port Output for a Successful Boot


Console parameters for the CX-series
9600 baud
8-bit
no parity
1 stop-bit

The following is the typical console port output that is observed when booting from Fibre. Comments are in a blue italicized
font. Additional details can be found in the C2 and X1 Power On Self Test (POST) Functional Specifications.

CX600

┌──────────────────────────────────────────────────────────────────────────┐
│                        PhoenixBIOS Setup Utility                         │
│                                                                          │
│ CPU Type        : Intel(R) XEON(TM)  System ROMz  : E9DB - FFFF          │
│ CPU Speed       : 2000 MHz           BIOS Date    : 11/04/02             │
│                                                                          │
│ System Memory   : 640 KB             COM Ports    : 03F8 02F8 0300 0308  │
│ Extended Memory : 4119552 KB         LPT Ports    : 03BC                 │
│ Shadow Ram      : 384 KB             Display Type : EGA \ VGA            │
│ Cache Ram       : 512 KB             PS/2 Mouse   : Not Installed        │
│                                                                          │
│ Hard Disk 0     : None                                                   │
│ Hard Disk 1     : None                                                   │
│ Hard
└──────────────────────────────────────────────────────────────────────────┘

Press Any Key to Continue

Copyright (c) EMC Corporation , 2002


Disk Array Subsystem Controller
Model: CX600
DiagName: Extended POST
DiagRev: Rev. 02.57
Build Date: Wed Nov 13 10:07:27 2002
StartTime: 12/06/2002 21:59:08
SaSerialNo: LKE00022906788

AabcdBCDabEabcdFGHabIabcJabcKabLabMabcNabOabPabQabRabSabTabUabVabWabXYZAA

POST RAN without error


A failure here will typically indicate a replacement SP is required

Initializing back end FIBRE...

PCI Config Reg: 2.4.1 0x0157

FCDMTL 0 [2.4.1] Dual Mode Fibre init - OSW DB PTR 0x20000000

FCDMTL 0 [2.4.1] Cached memory - 0xF5E67 bytes @ 0x200006A8

FCDMTL 0 [2.4.1] Noncached memory - 0xBF3BF bytes @ 0x200F650F (0x200F650F phys)

FCDMTL 0 [2.4.1] DVM Initialized

FCDMTL 0 [2.4.1] IMQ base ptr = 20170000; IMQ length = 8000

Dualmode fibre init completed

FCDMTL 0 [2.4.1] TPM Notify: st=0xA000000, flg=0x4, cmd=0x1

FCDMTL 0 [2.4.1] TPM Hndle API Event: cntx=0x200004C4, evnt=0x4002, info=0x0

FCDMTL 0 [2.4.1] TPM Lnk Up: state=0xA000000, flg=0x84

FCDMTL 0 [2.4.1] DVM Disc Comp- Dev List Size: 3

FCDMTL 0 [2.4.1] TPM Notify: st=0xA000000, flg=0x208, cmd=0x0

Link Event: 0x00030001

Link Event: 0x00030005

FCDMTL 0 [2.4.1] TPM Notify: st=0xA000002, flg=0x200, cmd=0x4

Device Event (0xFFFFFC): 0x00030015

FCDMTL 0 [2.4.1] TPM Notify: st=0xA000000, flg=0x200, cmd=0x2

FCDMTL 0 [2.4.1] TPM Hndle API Event: cntx=0x200004C4, evnt=0x4004, info=0x0

FCDMTL 0 [2.4.1] TPM Lnk Dwn: st=0xA000001, flg=0x201, evnt=0x4004

FCDMTL 0 [2.4.1] TPM Hndle API Event: cntx=0x200004C4, evnt=0x4002, info=0x0

FCDMTL 0 [2.4.1] TPM Lnk Up: state=0xA000001, flg=0x8005

FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: E1


This is normal
FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: E2

FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: E8

FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: E4

FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: EF

Device Event (0xE1): 0x00030012

Device Event (0xE2): 0x00030012

Device Event (0xE4): 0x00030012

Device Event (0xE8): 0x00030012

Device Event (0xEF): 0x00030012

FCDMTL 0 [2.4.1] DVM Disc Comp- Dev List Size: 8

FCDMTL 0 [2.4.1] TPM Notify: st=0xA000001, flg=0x8209, cmd=0x0

Link Event: 0x00030001

FCDMTL 0 [2.4.1] TPM Notify: st=0xA000002, flg=0x200, cmd=0x5

FCDMTL 0 [2.4.1] TPM Resume SCSI Issued: tgdb = 0x200004C4

Device Event (0xEF): 0x00030016


Target 0 is online
Target 1 is online
Target 2 is online
Target 3 is online
Target 4 is online

Finding the first 4 drives in the chassis. This does not mean they are bootable, only found.

Relocating Data Directory Boot Service (DDBS)...

Autoflash POST?

POST/DIAG image located at sector LBA 0x00012048

Autoflash BIOS?

This is where the SP would update POST or BIOS if required. It would go back into a reboot at
this point if it needed to be updated.

BIOS image located at sector LBA 0x00011048

EndTime: 12/06/2002 21:59:41

int13 - RESET (1)


Finding Valid Boot Drives. This is a good sign. If this does not display there is no valid boot image found.
DDBS: MDB read from both disks.
DDBS: Chassis and disk WWN seeds match.
DDBS: First disk is valid for boot.
DDBS: Second disk is valid for boot.

NT FLARE image (0x00400009) located at sector LBA 0x0002284B

Disk Set: 1 3

Total Sectors: 0x005821A1

Relative Sectors: 0x0000003F

Calculated mirror drive geometry:


Sectors: 63
Heads: 240
Cylinders: 382
Capacity: 5775840 sectors

Total Sectors: 0x005821A1

Relative Sectors: 0x0000003F

Calculated mirror drive geometry:


Sectors: 63
Heads: 240
Cylinders: 382
Capacity: 5775840 sectors

int13 - READ PARAMETERS (19)

int13 - READ PARAMETERS (22)

int13 - DRIVE TYPE (57)

int13 - READ PARAMETERS (58)


Normal for every array
int13 - DRIVE TYPE (59)

Error : Invalid Drive ID - 0x81

int13 - CHECK EXTENSIONS PRESENT (61)

int13 - GET DRIVE PARAMETERS (Extended) (62)

int13 - READ PARAMETERS (63)

int13 - READ PARAMETERS (65)


If the console output stalls in this area, NT will most likely be ping-able and SymRemote may work, but the Flare
application is not running and the Boot LED is likely stuck in a 4 Hz blink state. Re-imaging the boot pair is the only option.
int13 - READ PARAMETERS (1140)

int13 - READ PARAMETERS (1177)

int13 - READ PARAMETERS (1211)

int13 - READ PARAMETERS (1245)

int13 - READ PARAMETERS (1283)

int13 - READ PARAMETERS (1423)


int13 - READ PARAMETERS (1456)

int13 - READ PARAMETERS (1489)

int13 - READ PARAMETERS (1544)

int13 - READ PARAMETERS (1575)

int13 - READ PARAMETERS (1643)

--------------------------end-----------------------
