NetBackup Tape Troubleshooting

Problem
Troubleshooting Drive/Library Issues in NetBackup
This document describes various tape drive issues that may be encountered whilst
using NetBackup, and how to resolve them.

Solution
It is important to understand that NetBackup does not write data directly to a tape
drive. For example: when using Solaris, NetBackup relies on the operating system to
write the data to the tape using the st tape driver. The only slight involvement with
NetBackup is that it specifies the block size to use - but this is still passed to the
operating system. Other operating systems work in a similar manner.
The SCSI pass-through driver (the sg driver on Solaris) allows SCSI commands to be
passed directly to the drive. For example, the 'test-unit-ready' SCSI command is
used when mounting a tape. On occasion, it is necessary to
recreate/rebuild the pass-through driver. The most common symptom that involves
the pass-through driver is the scan command not showing all expected
devices. Other issues involving the pass-through driver are very rare.
The majority of drive/tape issues have a cause outside of NetBackup. When
troubleshooting these issues it is advisable to start the troubleshooting process at
the hardware/firmware level.
It should always be considered that although NetBackup reports an error, it does not
mean that NetBackup is the cause.
Common drive issues include:
Scan command

TAPE_ALERT
ASC/ASCQ
Missing Path
Positioning errors
Read/ Write errors
I/O Errors
External event has caused rewind
Tapes not reaching capacity (for example, 300GB of data is written to a tape with a
400GB native capacity)
Tapes being incorrectly marked as 'read only'
Library Inventory Issues
Robot load issue - "Error bptm error requesting media TpErrno = Robot operation
failed"
Missing drives, or drives disappearing and reappearing
Tapes failing to mount in NetBackup, but visible and usable by operating system
commands
Issues moving tapes to/ from slots or drives
Issues with Cartridge memory
Cleaning tape

In the first instance, it is always worth power cycling the library or drives reporting an
issue, as well as rebooting the associated servers. Many of the errors referenced in
this tech note can sometimes be cleared this way. In the event this does not clear
the issue, it has at least been eliminated from being the cause.

Scan Command

Scenario: The scan command shows no devices at all, or some or all of the devices
disappear and reappear when the command is run repeatedly.
Firstly, it must be confirmed that the operating system can see and communicate
correctly with the tape drives.
The devices appearing in (for example) 'Device Manager' (Windows) or cfgadm
(Solaris) is NOT necessarily sufficient confirmation that the devices are correctly
configured to the operating system.

It has been seen that although devices appear to be visible to the operating system,
SAN issues prevented full/correct communication, and as a result, the scan
command failed.
Two things need to be checked before further troubleshooting is carried out:
1. Ensure no backups are running on the drives (only applicable if the drives are
shared). A SCSI reservation of a drive due to a backup may prevent the drive from
responding to, and thus appearing in the output of the scan command.
2. Rebuild the passthrough driver (Unix only). If the drive/operating system
configuration has not changed, then this is very unlikely to be the issue. However, it
can be eliminated from being the cause by recreating the passthrough links and files.
See the device configuration guide for information on how to do this.
Aside from the exceptions above, issues with the scan command are not caused by
NetBackup. When it is understood how the scan command works, it is clear that the
root of the issues is external to NetBackup.
Although the scan command is supplied by Symantec, it does not issue any
NetBackup commands, or interact with NetBackup in any way. When run, it issues
operating system level SCSI commands to the devices configured in the operating
system, and the output of the command is sent from the devices themselves. There
are no settings, tuning or troubleshooting that can be performed on the scan
command.
Windows servers do not require a passthrough driver. Providing that there are no
backups running on other servers that may share the drives, then the problem will be
caused by either an issue regarding the SAN, firmware, hardware or drivers.
Consideration should be given to SAN infrastructure (e.g. switches), HBAs or the
physical drive/library.
Unix servers require a passthrough driver. For example, on Solaris this is called the
sg driver. This is required as the SCSI commands issued to query the device cannot
be passed to the devices via the regular operating system driver.
If the scan command shows devices disappearing and reappearing, then the
passthrough driver is not the cause. If the device(s) permanently disappear, it may
be worth reconfiguring the passthrough driver. If the issue is not resolved, then the
issue will be as per Windows servers, that is, SAN infrastructure (e.g. switches),
HBAs or the physical drive/library. Consideration should also be given to HBA
configuration files, as incorrect settings in these have been seen to prevent output
from the scan command being returned.

Providing the passthrough driver is configured, Symantec recommends consulting
your hardware vendors and/or Operating System/SAN Administrators to further
investigate scan command issues.
Known Issues:
Some 6Gb SAS HBAs are not compatible with the mpt_sas driver, as detailed in Oracle's
TechNote: http://docs.oracle.com/cd/E19253-01/821-0382/821-0382.pdf
TapeAlert/Tape Alert

A tape alert message is a critical, warning, or informational alert that occurs due to a
tape drive or robotic library hardware event. These "tape alert" messages are stored
on the tape drive or robotic library. Applications like NetBackup query the tape device
or robotic library for these "tape alert" messages and display the "tape alerts" to the
user. "Tape alert" messages are reported in the NetBackup bptm log. The tape alert
technology detects and logs hardware and media errors.
It is important to remember that while NetBackup displays these "tape alerts", the
alerts occur due to a tape drive or robotic library hardware event. Check the Event
Viewer/system log for any hardware related errors. Contact the Original Equipment
Manufacturer (OEM) for support.
As a TapeAlert is sent from the drive itself, it is impossible that this can be caused by
NetBackup.
For example:
Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] TapeAlert
Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive TLD0_LTO4_DRIVE1
(index 4), Media Id R0TP01
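
When many such messages need to be reviewed, the fields of a TapeAlert entry can be extracted programmatically. The following is a minimal sketch in Python, assuming the message layout matches the single example above; the function name parse_tape_alert is hypothetical, and other NetBackup versions may format the message differently.

```python
import re

# Pattern modelled on the one bptm syslog example in this TechNote (an assumption);
# it is not taken from NetBackup documentation.
TAPE_ALERT_RE = re.compile(
    r"TapeAlert\s+Code:\s*(?P<code>0x[0-9a-fA-F]+),\s*"
    r"Type:\s*(?P<type>[^,]+),\s*"
    r"Flag:\s*(?P<flag>[^,]+),\s*"
    r"from drive\s+(?P<drive>\S+)\s+\(index\s+(?P<index>\d+)\),\s*"
    r"Media Id\s+(?P<media_id>\S+)"
)


def parse_tape_alert(line):
    """Return the TapeAlert fields as a dict, or None if the line has no TapeAlert."""
    # Collapse whitespace so a message wrapped over several lines still matches.
    match = TAPE_ALERT_RE.search(" ".join(line.split()))
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"], 16)
    fields["index"] = int(fields["index"])
    return fields


sample = (
    "Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] TapeAlert\n"
    "Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive TLD0_LTO4_DRIVE1\n"
    "(index 4), Media Id R0TP01"
)
print(parse_tape_alert(sample))
```

Grouping the parsed entries by drive or by media id can help show whether the alerts follow a particular drive or a particular cartridge before the hardware vendor is contacted.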

To further investigate TapeAlert issues, Symantec recommends contacting your
hardware vendor.
A link to the TechNote "Description of Tape Alerts and code definitions" is provided at
the bottom of this TechNote.

ASC/ ASCQ

SCSI sense keys describe the 'state' of a device. They are returned when a command
completes with a 'check condition' status.
In this example, robtest was failing to load a tape into a drive.
Initiating MOVE_MEDIUM from address 1000 to 500
move_medium failed, CHECK CONDITION
sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED
The analysis can be broken down as follows :
Sense Key 0x5 - Illegal Request
ASC/ASCQ 0x30/00 - Incompatible Medium Installed
In a similar manner to Tape Alerts, SCSI Sense Keys are produced by the device,
not by NetBackup.
As ASC/ASCQ alerts are sent from the hardware, it is impossible for them to be
caused by NetBackup.
It has been seen that a power cycle of the drive (not soft reset) can sometimes clear
ASC/ASCQ errors.
Further information on these values can be found at http://www.t10.org
To further investigate ASC/ASCQ issues, Symantec recommends contacting your
hardware vendor.
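
The breakdown of the robtest output shown above can be sketched in Python. This is a minimal illustration only: the sense key names are those defined in the SCSI standards, but the ASC/ASCQ table contains just the two pairs quoted in this TechNote, and the function name decode_sense is hypothetical. The complete list of ASC/ASCQ assignments is published at t10.org.

```python
import re

# Sense key names as defined in the SCSI Primary Commands standard.
SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
    0x7: "Data Protect",
}

# Deliberately incomplete: only the pairs quoted in this TechNote.
ASC_ASCQ = {
    (0x30, 0x00): "Incompatible Medium Installed",
    (0x15, 0x01): "Mechanical Positioning Error",
}

SENSE_RE = re.compile(
    r"sense key = (0x[0-9a-fA-F]+), asc = (0x[0-9a-fA-F]+), ascq = (0x[0-9a-fA-F]+)"
)


def decode_sense(line):
    """Decode a robtest-style sense line; return None if the line has no sense data."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(value, 16) for value in match.groups())
    return {
        "sense_key": SENSE_KEYS.get(key, "not in this table"),
        "asc_ascq": ASC_ASCQ.get((asc, ascq), "not in this table"),
    }


print(decode_sense("sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED"))
```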

Note
If hardware encryption is in use via NetBackup KMS, an issue with the service may
cause the drives to send out ASC/ASCQ errors relating to encryption. In this
instance, although the drive is sending the message, the cause may be the KMS
service, and so this should be given consideration.

Missing Path

Missing path means that the Operating System has lost connectivity to the drives. At
this point, you will find that the devices are also missing from the scan output, simply
because the scan command only communicates with devices found at the Operating
System level.
For this issue, NetBackup is not the cause. However, when the issue is resolved, it
may be the case that the paths to the devices change, thus making the NetBackup
config incorrect. If this is the case, the devices will need to be deleted and
reconfigured within NetBackup. If the devices come back with the same operating
system paths, then no further action should be required.

Positioning Errors

Positioning errors occur when the operating system is unable to position, fast-forward or rewind the tape.
The error message seen may differ slightly, depending on when the error occurs.
Example 1
<2> write_data: block position check: actual 62504, expected 31254

Example 2
1/11/2010 7:50:13 AM - Error bptm(pid=3364) ioctl (MTREW) failed on media
id W00229, drive index 0, The I/O bus was reset. (1111) (bptm.c.8039)

NetBackup requests the operating system to position the tape, at various points of
the backup. Failure to correctly position, although detected by NetBackup, is most
commonly caused by:
1. Hardware error
2. Tape error
3. Driver issue
4. Firmware issue

As NetBackup does not directly position tapes, to further investigate positioning
errors, Symantec recommends contacting your hardware vendor.
Note

a) One known issue can be seen in the bptm log, affecting NBU 6.5.6 to 7.0.1.
Error bptm (pid=2164) ioctl (MTWEOF) failed on media id V01497, drive index
0, The physical end of the tape has been reached.

EEB 2182228 resolves this issue.
If the issue is not resolved by this EEB, or you see this issue on an earlier or later
version of NetBackup (before 6.5.6 or after 7.0.1), then the issue is related to
firmware or hardware.
b) Between NetBackup 6.5.6 and 7.1.0.3, duplications of MPX backups may result in a
positioning error / status 94. To investigate this, Symantec suggests logging a call and
quoting eTrack 2229875.
Read/ Write errors

The reading or writing operation is performed at the operating system/driver level.
Therefore, although this issue is detected and reported in the NetBackup logs, it is
not caused by NetBackup.
The cause of read/write errors is usually an issue with the tape drive or media
cartridge.
For example:
Example 1
write_data: cannot write image to media id XXXXXX, drive index #, Data
error (cyclic redundancy check).
Example 2
io_write_block: write error on media id MIR107, drive index 0, writing header
block, 1117
Example 3
Error bptm(pid=5268) cannot read image from media id 500507, drive index 1, err =
234

Note
a) McAfee anti-virus software is known to be a possible cause of Status 84 errors
on Windows Media Servers.
b) Cyclic redundancy check errors indicate faulty hardware.
c) MSEO is not compatible with Asynchronous Tapemarks, which were introduced in
NetBackup 7.1. Symptoms include write and/or read errors on tapes encrypted with
MSEO. Creating the empty file
/usr/openv/netbackup/db/config/DISABLE_IMMEDIATE_WEOF
will resolve the issue.

I/O Error

I/O errors are caused at a hardware level, and are only detected by NetBackup.
For example:
11:20:18.246 [8504.5292] <4> write_data: WriteFile failed with: The request could
not be performed because of an I/O device error. (1117); bytes written = 65536; size
=0
To further investigate I/O Errors, Symantec recommends contacting your hardware
vendor.

Known issues

open failed in io_open I/O error


This exact error can be caused by misconfiguration of the drives, so this should be
checked in the first instance. If the issue remains after confirmation that the
configuration is correct, then the issue should be further investigated as a
hardware/firmware issue.

External event has caused rewind

This issue is (potentially) serious and requires immediate investigation, as data can
be lost. NetBackup will display this error if the block position calculated by
NetBackup does not match the position reported by the drive. It will not be certain
that a full rewind has occurred (impossible to tell from a simple block check), but it
will mean that the position check has failed, and most likely that the position
reported by the drive is less than the expected position.
The error will look similar to the following:
<2> io_terminate_tape: block position check: actual 4, expected 5
<16> write_data: FREEZING media id XXXXXX, External event caused rewind
during write, all data on media is lost
NetBackup keeps track of how much data it is sending to the operating system to
write to the device. NetBackup will ask the tape device for its position as an integrity
check after the end of each write. If this position does not match what NetBackup
has calculated the position should be, then the job will fail with a media write error.
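
The comparison described above can be illustrated with a short Python sketch that reads a "block position check" line from the bptm log and reports how the drive's position differs from the expected one. The labels "behind" and "ahead" and the function name check_block_position are hypothetical; they are inferred from the log examples in this TechNote, not from NetBackup itself.

```python
import re

POSITION_RE = re.compile(r"block position check: actual (\d+), expected (\d+)")


def check_block_position(line):
    """Return (status, difference) for a bptm position check line, or None."""
    match = POSITION_RE.search(line)
    if match is None:
        return None
    actual, expected = int(match.group(1)), int(match.group(2))
    difference = actual - expected
    if difference == 0:
        status = "ok"
    elif difference < 0:
        # The drive is behind the calculated position, which is consistent
        # with a rewind or reposition caused by an external event.
        status = "behind"
    else:
        # The drive reports more blocks than NetBackup calculated.
        status = "ahead"
    return status, difference


print(check_block_position("<2> io_terminate_tape: block position check: actual 4, expected 5"))
```

Either non-zero result indicates that something outside NetBackup has moved the tape or altered the block count, so the difference is a starting point for the hardware investigation, not a diagnosis.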
If a full rewind has occurred, this will overwrite the NetBackup header on the tape,
making it unreadable. If this has happened, the data on the media is lost. The most
common cause of this is a SCSI reset on the SAN, which causes a rewind of the
drive(s) whilst they are being written to. This event is undetectable by NetBackup,
and is only discovered after the event, when the block position check is made.
NetBackup cannot cause SCSI resets on the SAN because the tape positioning and
read/write operations are all controlled by the Operating System itself.
If the issue is a position error (as opposed to a 'Full' rewind) a message similar to the
following will be seen upon inspection of the bptm log.
<2> write_data: block position check: actual 62504, expected 31254
<16> write_data: FREEZING media id XXXXXX, too many data blocks written, check
tape/driver block size configuration
The possible causes are numerous, and most commonly include:
Tape driver issue
Tape drive firmware issue
SAN fault
HBA driver or firmware issue, or other fault
Switch Fault

If the drives are attached to a NDMP device, it must be ensured that the SCSI
reservation on the NDMP device is set to match the SCSI reservation type of
NetBackup.
To further investigate "External event has caused rewind" issues, Symantec
recommends contacting your hardware/operating system support vendor.
Note
The SCSI reservation is set/held by the Host Bus Adaptor. However, NetBackup
sends the reserve command through the SCSI pass-thru path for the device, so this
needs to be configured correctly.
Known Issues:

NDMP

If the issue is occurring on drives that are shared (SSO) between an NDMP filer and
NetBackup, and the drives are zoned directly to the filer, then the issue can be
caused if the SCSI reservation type set in NetBackup is not the same as the SCSI
reservation type set on the filer.
If this is the case, the issue can be resolved by following these steps:
1. In the 'Host Properties' > 'Media Type' tab in NetBackup, check which SCSI
reservation type is set: SPC2 or SCSI persistent.
2. Change the type of SCSI reservation on the filer to match the type you have set in
NetBackup.
3. Reboot the Robotic Library to break all the current reservations.
The following TechNote has a detailed explanation of SCSI reservation:
http://www.symantec.com/docs/HOWTO32767

HP-UX 11.31 IA64 / atdd driver

Scenario: BPTM block position check fails one block short using IBM atdd driver
6.0.0.96 on HP-UX 11.31 IA64.
This issue is actually caused by the HP ATDD driver writing the EOT mark
incorrectly. However, Symantec has produced a NetBackup 7.0.1 EEB to
work around this issue (ETrack 2142743 / TECH155113).
Using the ATDD driver with NetBackup 7.0.1 and later on HP-UX 11.31 IA64 requires
atdd driver 6.0.2.8 or later. Upgrading to the newer ATDD driver resolves the issue.

Tapes not reaching capacity

Scenario: 300 GB of Data is written to a 400 GB capacity tape


NetBackup passes data to the OS, one block at a time, to be written to the tape
drive. NetBackup has no understanding of tape capacity. In theory, it would keep
writing to the same tape "forever".
When the tape physically passes the logical end-of-tape, this is detected by the tape
drive firmware. The tape drive firmware then sets a 'flag' in the tape driver (this would
be the st driver in the case of Solaris). There is still enough physical space on the
tape for the current block to be written, so this completes successfully. NetBackup
then attempts to send the next block of data (via the operating system) but now the
tape driver refuses, as the 'tape full' flag is set. The st driver then passes this 'tape
full' message to the operating system, which passes it to NetBackup. Only when this
has happened will NetBackup request the tape to be changed.
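
The sequence described above can be modelled with a small Python simulation. It is a sketch only: SimulatedTapeDriver is a hypothetical stand-in for the operating system tape driver, and the block counts are arbitrary. It shows that the writer has no notion of capacity and stops only when the driver refuses a block.

```python
class TapeFull(Exception):
    """Raised by the simulated driver once the 'tape full' flag is set."""


class SimulatedTapeDriver:
    """Hypothetical stand-in for the OS tape driver (for example st on Solaris)."""

    def __init__(self, logical_end_block):
        self.logical_end_block = logical_end_block
        self.blocks_written = 0
        self.tape_full = False

    def write_block(self, data):
        if self.tape_full:
            raise TapeFull("end of tape reached")
        self.blocks_written += 1
        # The drive firmware detects logical end-of-tape and sets the flag;
        # the block being written at that moment still completes.
        if self.blocks_written >= self.logical_end_block:
            self.tape_full = True


def write_until_full(driver, blocks):
    """Write blocks one at a time with no knowledge of capacity; return the count written."""
    written = 0
    for block in blocks:
        try:
            driver.write_block(block)
        except TapeFull:
            break  # only at this point would a new tape be requested
        written += 1
    return written


print(write_until_full(SimulatedTapeDriver(logical_end_block=3), [b"x"] * 10))
```

In this model, the amount of data on a tape is determined entirely by when the drive raises the flag, which is why a tape that fills early points to the drive, its firmware or the media.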
Common causes of this issue are tape drive firmware, or faulty hardware.
There are no settings in NetBackup that influence tape capacity. To further
investigate Tape Capacity issues, Symantec recommends contacting your hardware
vendor.

Tapes being incorrectly marked as 'read only'

NetBackup has no understanding of 'read only'. This state is set by the tape drive,
usually by means of a small, physical switch on the tape cartridge.
Therefore, if a tape is being reported as 'read only' this issue cannot be the fault of
NetBackup.
'Read only' is reported by the firmware of the tape drive and logged by NetBackup.
This is seen as a TapeAlert:
0x09: 'Cartridge write protected'
It has been seen on occasion that firmware issues of the tape drive have caused
tape media to be incorrectly reported as read only.

Library Inventory Issues

NetBackup does not directly 'Inventory' a library. Instead it queries the library and
waits to be told what tapes (via their barcodes) are located in which element address
(slots/drives). If, for example, NetBackup cannot 'see' a particular cartridge(s) it is
because the library is 'hiding' the location, not because of any setting within
NetBackup.
For example, common symptoms of library issues include tapes appearing in the
incorrect/wrong slot, and tapes/slots not appearing at all. It is impossible for this to
be caused by NetBackup.
To further investigate Library issues, Symantec recommends contacting your
hardware vendor.
Note
Issues involving NetBackup and the Virtual I/O slots on the IBM 3500 series libraries
where ALMS/Virtual I/O are enabled are occasionally seen.
Problems involving Virtual I/O slots cannot be caused by NetBackup because there
are no settings in NetBackup that can influence the behavior of the Virtual I/O slots.

It has been found that the library setting "Queued Exports" should be set to 'HIDE'
from within the IBM web console to allow tapes to be moved from the virtual I/O slots
to the slots within the logical library.

Robot load issue - "Error bptm error requesting media TpErrno = Robot
operation failed"

This error is seen in the bptm log, and depending on the logging set, may be
referenced in the .../volmgr/debug log, and possibly also the operating system event
log.
An excellent way to check this is to use the robtest command. A link to a TechNote
for documentation on Robtest is available at the end of the TechNote.
The robtest command does not issue any NetBackup commands. It only sends
operating system level SCSI commands to the library, and the output seen from the
command is sent from the library firmware. Given this description, it is clear to see
that Robtest failures cannot be caused by NetBackup.
For example, this robtest command issues a move media request from slot 86 to
drive 2:
m s86 d2
move_medium failed
sense key = 0x4, asc = 0x15, ascq = 0x1, MECHANICAL POSITIONING ERROR
As robtest has only sent a SCSI move request, straight away this failure can be seen
to not be caused by NetBackup.
Further, the error is referencing an 'ASC/ASCQ' error, which, as explained in the
"ASC/ASCQ" section of this tech note, is never caused by NetBackup.
To further investigate robotic operation issues, Symantec recommends involving the
Library's vendor.

Missing drives, or drives disappearing and reappearing

Scenario: tpautoconf -report_disc shows inconsistent numbers of missing devices
when the command is run at different times.
tpautoconf -report_disc will report "Missing Device" if a device that is configured and
available within NetBackup has become undetectable from the Operating System.
For example:
======================= Missing Device (Drive) ======================
Inquiry = "IBM Ultrium 3-SCSI
Serial Number = HM74536FFS
Drive Path = /dev/rmt/0cbn
Drive Name = DRV_F2D3_LTO5
In this case, NetBackup is only reporting that the Operating System cannot find a
device that was previously available.
If a different number of devices are missing at different times (that is, the devices
'disappear' and 'reappear') this is very likely a SAN issue.
NetBackup has no control over the communication between the devices and the
operating system.
If a device is showing as missing, then there must be an issue outside of NetBackup.
Problems on the SAN are a very common cause of this issue.
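
To check whether devices are disappearing and reappearing, the output of two runs of tpautoconf -report_disc can be compared. The following Python sketch assumes the report layout matches the example above; the function names parse_missing_devices and flapping_serials are hypothetical, and the output text must be captured separately.

```python
import re

# Header layout taken from the example in this TechNote (an assumption for other versions).
HEADER_RE = re.compile(r"^=+\s*Missing Device \((?P<kind>\w+)\)\s*=+$")


def parse_missing_devices(report):
    """Parse 'Missing Device' blocks into a list of dicts keyed by field name."""
    devices = []
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = {"Kind": header.group("kind")}
            devices.append(current)
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip().strip('"')
    return devices


def flapping_serials(first_run, second_run):
    """Serial numbers reported missing in one run but not in the other."""
    first = {d["Serial Number"] for d in parse_missing_devices(first_run) if "Serial Number" in d}
    second = {d["Serial Number"] for d in parse_missing_devices(second_run) if "Serial Number" in d}
    return first ^ second


run_one = """======================= Missing Device (Drive) ======================
Inquiry = "IBM Ultrium 3-SCSI
Serial Number = HM74536FFS
Drive Path = /dev/rmt/0cbn
Drive Name = DRV_F2D3_LTO5
"""
run_two = ""  # a later run in which no device is reported missing
print(flapping_serials(run_one, run_two))
```

A non-empty result means the set of missing devices changed between the two runs, which matches the 'disappear' and 'reappear' pattern described above.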

Tapes failing to mount in NetBackup, but visible and usable by operating
system commands

Cases have been seen in which tapes are physically loaded into the tape drive, and
are accessible and respond correctly to operating system commands such as mt and
dd, but NetBackup is unable to mount the tape. The job hangs on the tape mount,
failing with status 98 after some time.

Understandably, this could be seen to suggest NetBackup is at fault. However, upon
investigation it was found that the fault was caused by the tape drive firmware.

Issues moving tapes to/from slots or drives


Failure to move tapes to/from slots or drives will have a cause outside of
NetBackup. Moving tapes is achieved via industry standard SCSI commands, not
NetBackup commands.
Various messages could be seen, depending on the exact fault, for example :
Auto empty media export request rejected by TLDCD; Cannot move from media
access port
Here it is seen that an operation to empty the CAP/MAP during an inventory is
failing.
Attempting to move the tape using robtest produced the following error:
m p1 s28
Attempting to move the tape in port 1 of the CAP to slot 28
Initiating MOVE_MEDIUM from address 10 to 1027
move_medium failed
sense key = 0x4, asc = 0x40, ascq = 0x1, UNKNOWN ERROR, KEY: 0x04, ASC:
0x40, ASCQ: 0x01
As seen in the ASC/ASCQ section earlier in this tech note, errors such as this cannot
be caused by NetBackup.
In this case, the cause of the issue was that the robot was unable to
access its own slots.
To investigate issues moving media within the robot, Symantec recommends
contacting the hardware vendor.

Issues with Cartridge memory


LTO tapes contain a small EEPROM chip, known as LTO-CM.
This has multiple uses. For example, it is used by the drive to determine the LTO
tape generation, and it holds an 'error log' and the manufacturer details of the tape.
It also contains information on the position of data contained on the tape, which
allows for fast block positioning.
Errors will be reported if the cartridge memory fails.
For example, the following messages were reported by the Library, when a LTO-CM
chip failed in a cartridge:
Description: The memory in the tape cartridge has failed.
Description: The tape drive encountered a problem while loading a tape cartridge.
Description: The tape drive detected an internal hardware problem.
Description: The tape drive has an error which requires the tape cartridge to be
ejected for error recovery
These issues are related to hardware, and Symantec recommends contacting the
hardware vendor for further investigation.

Cleaning Tape

An unusual issue has been seen in NetBackup 7.5. On occasion, a cleaning cycle
run by NetBackup will fail.
The symptoms may differ slightly:
A. The tape cannot be unloaded. The /var/adm/messages log will show:
Mar 14 12:49:38 server02 tldcd[19756]: [ID 559682 daemon.notice] TLD(2)
closing/unlocking robotic path
Mar 14 12:49:38 server02 tldcd[9536]: [ID 919746 daemon.notice] inquiry() function
processing library ADIC Scalar i2000 607A:
Mar 14 12:49:38 server02 tldd[9524]: [ID 583323 daemon.notice] DecodeClean:
TLD(2) drive 5, Actual status: Unable to SCSI unload drive
Mar 14 12:49:39 server02 ltid[9497]: [ID 512328 daemon.notice] LTID - received
ROBOT MESSAGE, Type=55, LongParam=0, Param1=1, Param2=10
Mar 14 12:49:39 server02 ltid[9497]: [ID 581313 daemon.error] Cleaning for drive 1
failed, status = Unable to SCSI unload drive
Mar 14 12:49:48 server02 bptm[19765]: [ID 946237 daemon.warning] TapeAlert
Code: 0x0b, Type: Informational, Flag: CLEANING MEDIA, from drive PER-i2000Drive5 (index 1), Media Id CLN001
Mar 14 12:49:49 server02 ltid[9497]: [ID 560358 daemon.notice] LTID - Sent
ROBOTIC request, Type=3, Param2=1
B. Once the tape drive is cleaned a new tape is loaded and reloaded repeatedly,
the /usr/openv/volmgr/debug/robots log will show:
12:43:53.753 [3016] <4> AddTldLtiReqEntry: Processing ROBOT_CLEAN request...
12:43:53.753 [3016] <5> CleanDrive: TLD(0) Cleaning Tape 4TP012 on drive 5, from
slot 41
12:43:53.758 [3424] <4> io_open:
Drive Path = /dev/rmt/6cbn
...
12:43:54.018 [3026] <5> tldcd:mount_unmount_drive: Processing MOUNT, TLD(0)
drive 5, slot 41, barcode CLN4TP012L4 , vsn 4TP012
...
12:46:01.764 [3026] <5> tldcd:mount_unmount_drive: Processing UNMOUNT,
TLD(0) drive 5, slot 41, barcode CLN4TP012L4 , vsn 4TP012
...
12:46:12.789 [3016] <5> GetResponseStatus: DecodeClean: TLD(0) drive 5, Actual
status: Unable to SCSI unload drive

This issue is caused by the 'access bit' being set to 1 on the tape drive.
The issue is resolved with EEB 2714761.
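
To confirm whether a server is affected, the system log can be searched for the cleaning failure message shown in symptom A. The following Python sketch assumes each syslog entry is on a single line and matches the wording of the example above; the function name cleaning_failures is hypothetical.

```python
import re

# Wording taken from the ltid entry in the example log above (an assumption for other versions).
CLEAN_FAIL_RE = re.compile(r"Cleaning for drive (\d+) failed, status = (.+)$")


def cleaning_failures(lines):
    """Return (drive number, status text) for every failed cleaning found in the lines."""
    failures = []
    for line in lines:
        match = CLEAN_FAIL_RE.search(line.strip())
        if match:
            failures.append((int(match.group(1)), match.group(2)))
    return failures


log_lines = [
    "Mar 14 12:49:38 server02 tldcd[19756]: [ID 559682 daemon.notice] TLD(2) closing/unlocking robotic path",
    "Mar 14 12:49:39 server02 ltid[9497]: [ID 581313 daemon.error] Cleaning for drive 1 "
    "failed, status = Unable to SCSI unload drive",
]
print(cleaning_failures(log_lines))
```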