Contents
Methods of Analyzing and Locating Faults
Classified Troubleshooting Analysis
The course describes the general troubleshooting procedure and the methods of rectifying
the common faults.
The general principles for fault locating can be summarized as "external first, then internal;
station first, then board; high-severity alarms first, then low-severity alarms." The three
principles cannot be applied in isolation; they should be used together.
External first, then internal
During fault locating, first confirm that the external conditions are normal, for
example, that the line optical fibers are correctly connected and that there is no
power failure or switching equipment fault.
Station first, then board
Most faults are caused by board failures in the subrack, so first find the affected
NE, and then locate the fault to a specific board.
High-severity alarms first, then low-severity alarms
Analyze high-severity alarms first, for example, critical alarms and
major alarms. Then move on to low-severity alarms, such as minor alarms and
warnings.
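The "high-severity alarms first" principle can be illustrated with a minimal Python sketch. The record layout (ne, name, severity) and the severities assigned to the sample alarms are assumptions made for illustration; they are not taken from the NM.

```python
# Severity ranks follow the order given in the text:
# critical and major first, then minor and warning.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "warning": 3}


def order_alarms(alarms):
    """Return the alarms sorted from highest to lowest severity (stable sort)."""
    return sorted(alarms, key=lambda alarm: SEVERITY_RANK[alarm["severity"]])


# Hypothetical alarm records; the severities are illustrative only.
alarms = [
    {"ne": "NE2", "name": "MW_RDI", "severity": "minor"},
    {"ne": "NE1", "name": "MW_LOF", "severity": "critical"},
    {"ne": "NE1", "name": "RPS_INDI", "severity": "major"},
]

for alarm in order_alarms(alarms):
    print(alarm["severity"], alarm["ne"], alarm["name"])
```

Sorting only decides the order of analysis. The other two principles still apply when each alarm is examined.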
The most common method of locating hardware faults can be summarized as "Analyze
first, then loopback, and finally replace the board."
That is, when a fault occurs, first determine the possible faulty points by analyzing the alarm
events, performance data, and signal flow. Then locate the fault to a particular NE by
looping back station by station. Finally, clear the fault by replacing the faulty board.
Besides the alarms, in the RTN 980L system it is also important and useful to query the
transmit power and the receive power.
The advantages and disadvantages of locating faults by querying fault information through
the NM are as follows:
Advantages:
Comprehensive: fault information can be obtained for the whole network.
Accurate: the current alarms, the alarm generation time, and the history alarms
can be obtained, as well as the specific values of the performance events.
Disadvantages:
If there are too many alarms and performance events, it is difficult to find a
starting point for the analysis.
The method depends entirely on the normal operation of the computer, the software, and
the communication equipment. If any of the three is faulty, the ability to query
fault information is reduced or even lost.
On the OptiX RTN 980L, there are running and alarm indicators in different colors that
reflect the current running status of the equipment or the severities of existing alarms.
The HARD_BAD is an alarm indicating hardware errors. The board that reports the alarm
fails to work. If the board is configured with the 1+1 protection, the protection switching
may be triggered.
The NESF_LOST is an alarm indicating that the NE software is lost. This alarm is reported
when the system control, cross-connect, and timing board detects that the NE software is
lost.
The NO_BD_SOFT is an alarm indicating that the board software is lost. If the board
software is lost, the board fails to work normally.
The FAN_FAIL is an alarm indicating that the fan is faulty. When the FAN_FAIL alarm
occurs, the heat dissipation of the system is affected.
The POWER_ALM is an alarm indicating that the power module is abnormal. If the alarm is
reported by a board of the IDU, the possible causes are as follows: Cause 1: The input
power or the PIU is abnormal. Cause 2: The power module is abnormal. If the alarm is
reported by the RFU/ODU, the cause is as follows: The power module of the
RFU/ODU is faulty.
The BD_STATUS is an alarm indicating that the board is not in position. When the
BD_STATUS alarm occurs, the board that reports the alarm fails to work.
The MW_LOF is an alarm indicating that the radio frame is lost. The services are
interrupted by MW_LOF. If the system is configured with protection, protection switching
may be triggered.
The MW_CFG_MISMATCH is an alarm of configuration mismatch on radio links. This alarm
occurs when an NE detects configuration mismatch on both ends of a radio link. For
example, the number of E1 signals, the number of STM-1 signals, AM enabling, 1588
overhead enabling, modulation mode may be configured differently on both ends of a
radio link.
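As an illustration of the kind of comparison behind MW_CFG_MISMATCH, the following Python sketch compares the settings of the two ends of a radio link. The parameter names and values are hypothetical; they mirror the examples listed above and are not read from the equipment.

```python
def find_mismatches(local_cfg, remote_cfg):
    """Return {parameter: (local_value, remote_value)} for every differing parameter."""
    keys = sorted(set(local_cfg) | set(remote_cfg))
    return {
        key: (local_cfg.get(key), remote_cfg.get(key))
        for key in keys
        if local_cfg.get(key) != remote_cfg.get(key)
    }


# Hypothetical settings for both ends of one radio link.
local_end = {
    "e1_count": 16,
    "stm1_count": 0,
    "am_enabled": True,
    "ieee1588_overhead_enabled": False,
    "modulation": "16QAM",
}
remote_end = dict(local_end, am_enabled=False)

print(find_mismatches(local_end, remote_end))
```

An empty result means that the compared parameters are the same on both ends.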
The CONFIG_NOSUPPORT is an alarm indicating that the configuration is not supported.
This alarm is reported if the ODU detects that the specified parameters do not meet the
requirements of the ODU.
The RADIO_RSL_LOW is an alarm indicating that the radio receive power that comes from
opposite side is very low. This alarm is reported if the detected receive power is equal to or
lower than the lower threshold of the ODU (-90 dBm).
The RADIO_RSL_HIGH is an alarm indicating that the radio receive power that comes from
opposite side is very high. This alarm is reported if the detected receive power is equal to
or higher than the upper threshold of the ODU (-20 dBm). The service transmission is
affected. If the system is configured with 1+1 protection, protection switching may be
triggered.
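The two receive power thresholds quoted above (-90 dBm and -20 dBm) can be expressed as a small Python check. This is a sketch of the comparison only; the function name is hypothetical and the actual detection is performed by the ODU.

```python
RSL_LOW_THRESHOLD_DBM = -90.0   # lower threshold of the ODU quoted in the text
RSL_HIGH_THRESHOLD_DBM = -20.0  # upper threshold of the ODU quoted in the text


def rsl_alarm(rsl_dbm):
    """Return the alarm name for a receive power value, or None if no alarm applies."""
    if rsl_dbm <= RSL_LOW_THRESHOLD_DBM:
        return "RADIO_RSL_LOW"
    if rsl_dbm >= RSL_HIGH_THRESHOLD_DBM:
        return "RADIO_RSL_HIGH"
    return None


for value in (-95.0, -45.0, -15.0):
    print(value, rsl_alarm(value))
```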
The RADIO_MUTE is an alarm indicating that radio transmitter is mute. The transmitter of
ODU does not transmit services.
The IF_CABLE_OPEN is an alarm indicating that the IF cable is open. When the
IF_CABLE_OPEN alarm occurs, the service on the IF port that reports the alarm is
interrupted.
The MW_LIM is an alarm indicating that a mismatched radio link identifier is detected. This
alarm is reported if an IF board detects that the link ID in the radio frame overheads is
inconsistent with the specified link ID.
The MW_RDI is an alarm indicating that there are defects at the remote end of the radio
link. This alarm is reported when the IF board detects an RDI in the radio frame overheads.
The RPS_INDI is an alarm indicating that the radio protection switching is detected.
The LOOP_ALM is an alarm indicating that a loop occurs. When the LOOP_ALM alarm
occurs, the looped port or path cannot carry services.
The TEMP_ALARM alarm indicates that the board temperature crosses the threshold.
In the case of SDH boards, the R_LOS is an alarm indicating that the signals on the receive
line side are lost. In the case of IF boards, the R_LOS is an alarm indicating that the radio
frames on the receive line side are lost. The services are interrupted. If the system is
configured with protection, the protection switching may be triggered.
The ETH_LOS is an alarm of the loss of Ethernet port connection. When the ETH_LOS
alarm occurs, the service at the port that reports the alarm is interrupted.
The T_ALOS is an alarm indicating that the 2 Mbit/s analog signal is lost at a specific port.
The RTN 900 cannot access the 2 Mbit/s services.
The TU_AIS is an alarm indicating that the TU path has interruption. This alarm is reported
if a board detects that the TU pointer is all 1s.
Analysis
There are three alarms in the system. The highest-severity alarm is MW_LOF
on NE1. It means that the received radio signal has lost its frame, which is
equivalent to receiving no signal. The MW_RDI alarm is evidently caused by the
MW_LOF alarm. Finally, the RPS_INDI alarm indicates that 1+1 protection switching
has taken place on the microwave link or equipment. Because there are no other
alarms on the services, the services most probably recovered after the automatic
protection switching.
Based on the above analysis, the key to the fault is the cause of the
MW_LOF alarm on NE1. According to the alarm definition, the possible
causes are listed below:
The IF cable or the IF board on NE1 is faulty. This can be checked with the
loopback operations that are introduced later.
This method is usually applied to clear external problems or to locate
interconnection problems.
If you suspect that the power supply is abnormal, use a multimeter to measure the input voltages.
If you suspect that poor interconnection between the microwave equipment and
other equipment is due to the grounding, use a multimeter to measure the voltage
between the shielding layers of the coaxial ports of the transmitter and the receiver on the
interconnection path. If the voltage exceeds 0.5 V, there is a grounding
problem. If you suspect that poor interconnection is due to an incorrect
signal, use appropriate analyzers to observe whether the frame signals are normal,
whether the overhead bytes are normal, and whether there are any alarms.
This method provides highly accurate results. However, it depends heavily on
meters and professional knowledge.
Sometimes a running board enters an abnormal state because of a power supply
transient, low voltage, strong external electromagnetic interference, and so on. The
resulting service interruption or inband DCN communication interruption may or may not be
accompanied by corresponding alarms, and the configuration data may be correct. In
this case, the fault can be cleared and normal service can be resumed in time by
resetting the board, restarting the station, re-sending the configuration, or switching the
service to the standby path.
The main disadvantage of this method is uncertainty: the cause of the problem is not fully
known, and the alarm may persist after a board reset or even a power reset.
This method is not recommended.
Note:
Normally, the warm reset of boards does not affect the running services. The cold
reset affects the running services.
The cold reset takes a longer time than the warm reset. After the reset, data of
boards is not lost.
Based on the preceding purposes, the RMON defines a series of statistic formats and
functions for data exchange between control stations and detection stations
that comply with the RMON standards. To meet the requirements of different networks,
the RMON provides flexible detection modes and control mechanisms. In addition, the
RMON provides error diagnosis, planning, and collection of the performance
events of the entire network. The RMON complies with standards such as RFC
1757 and RFC 2819.
The transmit power is abnormal in the following two cases. In the first case, the transmit
power is outside the range that the ODU supports. In the second case, the difference between
the transmit power and the set value is more than 2 dB when ATPC is disabled. The relevant
alarms and performance events are as follows:
RADIO_TSL_HIGH
RADIO_TSL_LOW
TSL_CUR
TSL_MAX, TSL_MIN
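The two transmit power conditions described above can be sketched in Python as follows. The function and parameter names are hypothetical, and the ODU power range in the example is an assumed value; the real range depends on the ODU type.

```python
def tsl_is_abnormal(tsl_dbm, odu_min_dbm, odu_max_dbm, set_dbm, atpc_enabled):
    """Apply the two conditions from the text to a measured transmit power."""
    # Case 1: the transmit power is outside the range that the ODU supports.
    out_of_range = tsl_dbm < odu_min_dbm or tsl_dbm > odu_max_dbm
    # Case 2: ATPC is disabled and the deviation from the set value exceeds 2 dB.
    off_set_value = (not atpc_enabled) and abs(tsl_dbm - set_dbm) > 2.0
    return out_of_range or off_set_value


# Assumed ODU range of -6 dBm to 25 dBm, set value 17 dBm, ATPC disabled.
print(tsl_is_abnormal(20.0, -6.0, 25.0, 17.0, atpc_enabled=False))
print(tsl_is_abnormal(18.0, -6.0, 25.0, 17.0, atpc_enabled=False))
```

When ATPC is enabled, the transmit power is expected to deviate from the set value, so only the range check applies in this sketch.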
The RSL is abnormal in the following two cases. In the first case, the receive power is
lower than the ideal value (Ideal value = Planned value - 3 dB). In the second case, the
receive power is lower than the receiver sensitivity or higher than the free-space receive
power because of fading. The relevant alarms and performance events are as follows:
RADIO_RSL_HIGH
RADIO_RSL_LOW
RSL_CUR
RSL_MAX, RSL_MIN
In the case of the radio link whose AM function is enabled, the receiver sensitivity
is the specific receiver sensitivity at the guaranteed capacity.
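The RSL conditions above can be written out as a short Python sketch. The function name, the wording of the findings, and the example power values are assumptions for illustration. For a link with AM enabled, pass the receiver sensitivity at the guaranteed capacity.

```python
def rsl_findings(rsl_dbm, planned_dbm, sensitivity_dbm, free_space_dbm):
    """Return the list of abnormal conditions that a measured RSL matches."""
    findings = []
    ideal_dbm = planned_dbm - 3.0  # Ideal value = Planned value - 3 dB
    if rsl_dbm < ideal_dbm:
        findings.append("below ideal value")
    if rsl_dbm < sensitivity_dbm:
        findings.append("below receiver sensitivity")
    if rsl_dbm > free_space_dbm:
        findings.append("above free-space receive power")
    return findings


# Assumed link: planned RSL -45 dBm, sensitivity -80 dBm, free-space RSL -40 dBm.
print(rsl_findings(-50.0, -45.0, -80.0, -40.0))
print(rsl_findings(-46.0, -45.0, -80.0, -40.0))
```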
Generally, external interference is classified into co-channel interference and adjacent
channel interference.
Co-channel interference refers to crosstalk from two different radio transmitters
that use the same frequency channel. Hence, the entire spectrum may be affected.
Adjacent channel interference refers to signal impairment to one frequency, due to
presence of another signal on a nearby frequency. Hence, a part of the spectrum is
affected.
Interference is closely related to the frequency. Hence, if interference exists on
the radio link, the link may be faulty in only one direction.
The IF bit errors refer to the bit errors that the Hybrid IF board detects through the self-
defined overhead byte in the microwave frame. The related alarms and performance
events are as follows:
MW_BER_EXC, MW_BER_SD, IFBBE, IFES, IFSES, IFCSES, IFUAS
The RS bit errors refer to the bit errors that the line processing unit or the IF board that
works in SDH mode detects through the B1 overhead byte in the RS overhead. The related alarms
and performance events are as follows:
B1_EXC, B1_SD, RS_CROSSTR, RSBBE, RSES, RSSES, RSCSES, RSUAS
The IF board that works in PDH mode may also detect the previous RS bit error
alarms and performance events. In this case, the IF board detects bit error alarms
and performance events in the PDH microwave frame through the self-defined B1
byte.
The MS bit errors refer to the bit errors that the line board detects through the B2 byte in
the MS overhead. The related alarms and performance events are as follows:
B2_EXC, B2_SD, MS_CROSSTR, MSBBE, MSES, MSSES, MSCSES, MSUAS
The HP bit errors refer to the bit errors that the line processing unit or the IF board that
works in SDH mode detects through the B3 byte in the HP overhead. The related alarms and
performance events are as follows:
B3_EXC, B3_SD, HP_CROSSTR, HPBBE, HPES, HPSES, HPCSES, HPUAS
The LP bit errors refer to the bit errors that the tributary board or Hybrid IF board detects
through the V5 byte in the VC-12 overhead. The related alarms and performance events
are as follows:
BIP_EXC, BIP_SD, LP_CROSSTR, LPBBE, LPES, LPSES, LPCSES, LPUAS
The VC-12 numbering method of the OptiX equipment differs from the numbering
method used by the equipment of certain other vendors. The OptiX equipment uses the
timeslot numbering method. The numbering formula is:
VC-12 number = TUG-3 number + (TUG-2 number - 1) x 3 + (TU-12 number - 1) x 21
This method is also called numbering by order.
Certain equipment uses the line numbering method. The numbering formula is:
VC-12 number = (TUG-3 number - 1) x 21 + (TUG-2 number - 1) x 3 + TU-12 number
This method is also called the interleaved method.
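The two numbering formulas can be checked against each other with a short Python sketch. It assumes the standard VC-4 structure of 3 TUG-3s, 7 TUG-2s per TUG-3, and 3 TU-12s per TUG-2; the function names are hypothetical.

```python
def _check(tug3, tug2, tu12):
    # Assumed ranges for a VC-4: TUG-3 1..3, TUG-2 1..7, TU-12 1..3.
    if not (1 <= tug3 <= 3 and 1 <= tug2 <= 7 and 1 <= tu12 <= 3):
        raise ValueError("TUG-3, TUG-2 or TU-12 number out of range")


def vc12_timeslot_number(tug3, tug2, tu12):
    """Timeslot numbering (numbering by order), as used by the OptiX equipment."""
    _check(tug3, tug2, tu12)
    return tug3 + (tug2 - 1) * 3 + (tu12 - 1) * 21


def vc12_line_number(tug3, tug2, tu12):
    """Line numbering (interleaved method), as used by certain other equipment."""
    _check(tug3, tug2, tu12)
    return (tug3 - 1) * 21 + (tug2 - 1) * 3 + tu12


def timeslot_to_line(number):
    """Convert a timeslot-method VC-12 number (1..63) to the line-method number."""
    if not 1 <= number <= 63:
        raise ValueError("VC-12 number must be in 1..63")
    # Invert the timeslot formula to recover the zero-based indices.
    tu12_index, remainder = divmod(number - 1, 21)
    tug2_index, tug3_index = divmod(remainder, 3)
    return vc12_line_number(tug3_index + 1, tug2_index + 1, tu12_index + 1)


# TUG-3 = 2, TUG-2 = 1, TU-12 = 1
print(vc12_timeslot_number(2, 1, 1))  # 2
print(vc12_line_number(2, 1, 1))      # 22
print(timeslot_to_line(2))            # 22
```

The same VC-12 therefore carries different numbers under the two methods, which matters when a service is interconnected with equipment that uses the other method.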
The overhead bytes (J0, J1, C2, J2, V5) at both ends are inconsistent. Pay special
attention to the following alarms:
J0_MM, HP_TIM, LP_TIM, HP_SLM, LP_SLM
The indexes of the SDH interfaces do not meet the requirements. Common indexes
of the optical interfaces are as follows:
According to the fault causes, the operator can perform the following checks:
Check the impedance of the E1 path. Ensure that the impedance of the E1 path is
consistent with the cable type.
Check whether all the equipment and the DDF in the equipment room are jointly
grounded.
Check whether the shielding layers of the coaxial cable connectors on the DDF are
connected to the protection ground.
Check whether the shielding layers of coaxial cables are grounded in the same
manner.
Check whether the wires of the cable are correctly connected.
Check whether the cable is broken or pressed.
Check whether the cable signal is subject to interference (for example, when the trunk
cable is bundled with the power cable, the power signal interferes with the cable signal).
Checking the cables involves checking the cables from the DDF to the client side
and checking the cables from the DDF to the transmission equipment side.
Check the following indexes:
The Ethernet service interruption indicates that the Ethernet service is completely
interrupted.
The Ethernet service deterioration indicates that the Ethernet service is abnormal. For
example, the network access speed is low, the equipment delay is long, packet loss
occurs, or incorrect packets exist in the received or transmitted data.
According to the fault causes, the operator can perform the following checks:
Check whether a loopback is set for the Ethernet port or the transmission line.
Check whether the parameter settings of the Ethernet port, such as the port
enabled state, working mode, and flow control, are the same as the parameter
settings of the Ethernet port on the interconnected equipment.
Check whether the Ethernet protocol and the Ethernet service configurations
(especially the attributes of the Ethernet port) are correct.
Pay special attention to the following equipment alarms:
POWER_ALM, FAN_FAIL, HARD_BAD, BD_STATUS, NESF_LOST, TEMP_ALARM,
RADIO_RSL_HIGH, RADIO_RSL_LOW, RADIO_TSL_HIGH, RADIO_TSL_LOW,
IF_INPWR_AB, AM_DOWNSHIFT
Pay special attention to the following line alarms:
MW_LIM, MW_LOF, MW_BER_EXC, MW_BER_SD, MW_RDI, MW_FEC_UNCOR
Check the RMON performance events and alarms.
Fault Causes:
Incorrect operations are performed.
Service configuration data is inconsistent between the local end and the
opposite end.
According to the fault causes, the operator can perform the following checks:
Check whether the ring current switch "RING" on the phone set is set to "ON".
Check whether the dialing mode switch is set to "T", that is, the dual-tone
multi-frequency mode. An orderwire phone set should be in the on-hook state when no
call is in progress, and the red indicator in the upper-right corner of the front of
the phone set should be off. If the red indicator is on, the phone set is in the
off-hook state. Press the "TALK" button on the front of the phone set to hang it up.
Sometimes maintenance personnel press the "TALK" button by mistake. As a result, the
phone set stays in the off-hook state and orderwire calls from other NEs cannot
get through.
Check whether all orderwire phone numbers on a subnet are of the same length.
Check whether all orderwire phone numbers on a subnet are unique.
Check whether the overhead bytes of all the NEs on a subnet are the same.
Check whether the orderwire port is set correctly.
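The two phone number checks in the list above (same length, unique) are easy to automate once the numbers have been collected. The following Python sketch assumes a hypothetical mapping from NE name to orderwire phone number; it does not query the equipment.

```python
from collections import Counter


def check_orderwire_numbers(numbers):
    """Check a {NE name: phone number} mapping and return a list of problems."""
    problems = []
    # All orderwire phone numbers on a subnet should have the same length.
    lengths = {len(number) for number in numbers.values()}
    if len(lengths) > 1:
        problems.append("phone numbers differ in length")
    # All orderwire phone numbers on a subnet should be unique.
    counts = Counter(numbers.values())
    duplicates = sorted(number for number, count in counts.items() if count > 1)
    if duplicates:
        problems.append("duplicate phone numbers: " + ", ".join(duplicates))
    return problems


# Hypothetical subnet with one duplicated number and one number of a different length.
subnet = {"NE1": "101", "NE2": "101", "NE3": "1003"}
print(check_orderwire_numbers(subnet))
```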