Solutions to RNC Emergencies (INTERNAL)
WCDMA RAN
Target Audience: For internal use
Product Version: Refer to product versions
Prepared by: Maintenance Dept.
Document Version: V1.3
Reviewed by:        Date:
Approved by:        Date:
Revision Records
2009-3-13 V1.3 Modified the method of collecting CHR logs. Xu Zhijie 43183
Contents
1 Overview
1 Overview
Two types of emergencies (incidents) typically occur in the live network:
- Emergencies of the access type
- Emergencies of the KPI deterioration type
This document provides practical solutions to these emergencies, aiming at:
- Restoring services quickly
- Shortening the period in which services are affected (from the time when the
  impact begins to the time when services are restored)
- Improving customer satisfaction
- Improving the skills of front-line GTS and related support personnel
In an access emergency, users typically cannot use services such as AMR voice or PS
services. That is, the UE cannot access the network by dial-up.
Except for transmission problems, services are usually required to be restored on site
within one hour. Follow these three steps to analyze and handle the problem:
- View related alarms.
- Trace and analyze related signaling.
- Analyze traffic statistics.
The on-site engineers need to check the following items:
- Type of the affected services: CS or PS?
- Can the UE register normally?
- In the CS case, are originated calls or terminating calls affected?
- In the PS case, can the UE attach, can the service be set up successfully, and is
  the rate normal?
- Since when the service has been affected. This is usually taken as the time of the
  first user complaint.
- Range of the affected cells. Based on the complaints, judge whether a cluster of
  cells or all cells are affected.
The on-site engineers can analyze alarms, signaling traces, and traffic statistics to
check:
- Which interface (IU/IUB/IUR) affects the service.
- The range of affected users. Is a single subsystem, a subrack, or the entire RNC
  affected? If a subsystem causes the failure, swap or reset the SPU, or reset the
  corresponding FMR/DPU board.
- Whether an interface board causes the failure. If yes, swap or reset the interface
  board.
While asking the customer about the operations performed, query alarms and trace
signaling at the same time, so that the emergency can be resolved in time.
If the following transmission alarms are present, the alarm time is consistent with the
failure time, and the alarms seriously affect the IU/IUR interface or the IUB interface
(many E1s under the interface board are abnormal, which affects many Node Bs
connected to the interface board), take the following measures to restore the service:
Optical interface alarms
ALM-901 Optical port loss of signal
ALM-906 Optical port loss of cell delineation
ALM-954 Tributary unit alarm indication signal
ALM-955 Tributary unit loss of pointer
ALM-981 The CSU reference out of lock
ALM-984 Ingress tributary unit loss of pointer
ALM-985 Egress tributary unit loss of pointer
ALM-988 Optical port out of cell delineation
F4F5 alarms
E1/T1 alarms
ALM-1102 E1/T1 loss of frame alignment
ALM-1104 E1/T1 alarm indication signal
Link alarms
ALM-1005 IMA link remote TX unusable
ALM-1007 IMA/UNI link loss of cell delineation
ALM-1015 Fractional IMA link loss of frame
ALM-1027 FRAC IMA link is blocked
ALM-2602 PPP/MLPPP link down
ALM-2603 PPP/MLPPP link loop
FE port alarms
ALM-851 FE/GE link down
ALM-853 FE/GE link receive defect indication
MSP alarms
ALM-2501 MSP K1/K2 mismatch alarm
ALM-2502 MSP K2 mismatch alarm
ALM-2506 MSP unit-bid mode mismatch alarm
ALM-9272 WRSS MSP K1/K2 mismatch alarm
If MSP alarms are present, you must collect MSP logs from the related boards before and
after swapping or resetting the interface board. Use the DSP MSPREP command to
collect the logs, and then feed them back.
The above alarms indicate that the optical interface, the intermediate transmission
devices, or the optical fiber is faulty, or that the data negotiated between the local
end and the peer end (with MSP enabled) is inconsistent. Judge whether the unit that
reported the alarms is associated with the affected service and whether the alarm time
is consistent with the failure time. If yes, take the following measures.
Handling measures:
(1) If a backup interface board is available, swap the interface board and check
whether the service is restored. If no backup board is available or the service is
not restored, go to step 2.
(2) Reset the interface board. If the failure still persists, go to step 3.
(3) Notify the user to check whether the intermediate transmission device or the
peer device is faulty.
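The alarm screening described above can be sketched in Python. This is only an illustrative sketch: the alarm IDs and groups are copied from the lists above, while the function names and the 10-minute tolerance used to decide that an alarm time is "consistent" with the failure time are assumptions, not values given in this document.

```python
from datetime import datetime, timedelta

# Alarm IDs grouped as in the transmission alarm lists of this document.
TRANSMISSION_ALARMS = {
    "optical": {901, 906, 954, 955, 981, 984, 985, 988},
    "e1t1": {1102, 1104},
    "link": {1005, 1007, 1015, 1027, 2602, 2603},
    "fe": {851, 853},
    "msp": {2501, 2502, 2506, 9272},
}


def alarm_group(alarm_id):
    """Return the transmission alarm group of an alarm ID, or None if it is not listed."""
    for group, ids in TRANSMISSION_ALARMS.items():
        if alarm_id in ids:
            return group
    return None


def is_time_consistent(alarm_time, failure_time, tolerance=timedelta(minutes=10)):
    """Assumed rule: the alarm was raised within `tolerance` of the failure time."""
    return abs(alarm_time - failure_time) <= tolerance


def needs_msp_logs(alarm_ids):
    """MSP logs must be collected before and after swapping or resetting the board."""
    return any(alarm_group(alarm_id) == "msp" for alarm_id in alarm_ids)
```

For example, `needs_msp_logs([901, 2501])` is true, which is a reminder to run DSP MSPREP before touching the interface board.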
If the following alarms are present and the alarm time is consistent with the failure
time, take the following measures to restore the service:
ALM-110 Board voltage abnormity
ALM-113 High ambient temperature alarm
ALM-120 Board fault-abnormal voltage 1.30V
ALM-121 Board fault-abnormal voltage 1.50V
ALM-122 Board fault-abnormal voltage 1.80V
ALM-124 Board fault-abnormal voltage 3.30V
ALM-128 PIU chip selftest error
ALM-130 Subboard PLL status abnormal
ALM-223 CPLD 33M clock alarm
ALM-379 The board logic fault
ALM-601 ATM switching module failure
ALM-654 Inter board HiGig link fault
ALM-655 Intra board HiGig link fault
ALM-659 Failure of high-speed communication link on the backplane of service boards
ALM-661 GESW Trunk group link fault alarm
ALM-662 GE Trunk group link fault alarm
ALM-666 DSP communication link failure
ALM-667 GESW subboard GE link descend
ALM-668 SPU GESW panel GE link descend
ALM-751 GE/FE conversion unit fault
ALM-9029 APC chip fault alarm
ALM-9242 WRSS fan fault alarm
The above alarms indicate that the device is faulty or the ambient environment is
abnormal. You need to reset or replace the board, or improve the ambient
environment.
Handling measures:
(4) If the above alarms are present, follow the handling measures in the alarm help
to restore the service as soon as possible.
(5) If the service cannot be restored by taking the measures in the alarm help and
the board cannot be reset, run the RST BRD command to reset the
corresponding board.
(6) If the problem persists after the board is reset and a backup board is
available, run the INH BRD command to inhibit the board.
(7) Replace the board at the local end. If the situation is not urgent, replace the
board at night.
(8) If the fan subrack failure alarm is present, provide cooling devices, such as an
electric fan, to cool down the equipment.
(9) If the water alarm is present, improve the environment of the equipment room.
If the following alarms about the IU interface signaling plane are present and the
alarm time is consistent with the failure time, take the following measures to restore
the service.
ALM-1404 MTP-3b signaling route unavailable ALM-1615 AAL2 adjacent node unavailable
If the above alarms are present, view the alarm parameters to check whether the
signaling connected with the IU interface is disconnected or intermittent. The failure
may be caused by the interface board, the intermediate transmission device, or the
inconsistent data (with MSP enabled) negotiated between the local end and the peer
end.
Handling measures:
(10) If a single link is intermittent, you can block the link to solve the issue
temporarily:
- DEA MTP3BLNK
- DEA M3LNK
(11) Swap the IU interface board, and check whether the service is restored.
(12) If not, swap the SPU board.
(13) If the problem still persists, reset the active and standby IU interface boards at
the same time, and then check whether the service is restored. If not, reset the
active and standby SPU boards at the same time.
(14) If the problem still persists, trace the messages on the faulty link
(SAAL/SCTP/SCCP), check whether the link is one-way or packets are dropped, and
thus determine whether the intermediate transmission device or the peer device is
faulty. If you are sure that the intermediate transmission device or the peer device
is faulty, notify the customer to troubleshoot the transmission devices.
(15) If the problem still persists, collect the messages at the
SAAL/SCTP/MTP3B/M3UA/SCCP layers, the alarm logs, and the CHR logs for the
failure period, and then feed them back to R&D personnel.
If the following alarms about the IU interface user plane are present and the alarm
time is consistent with the failure time, take the following measures to restore the
service.
ALM-1603 AAL2 path blocked by peer end ALM-1711 Path forward congested
If the above alarms are not present, use the following methods to check the
networking:
ATM networking:
If the peer device supports the LB function and the RNC version is not V29, check
the path status by using the LOP:VCL command (select AAL2Path). If the path status
is UP, the path is normal. If the path status is DOWN, delete the faulty path, add it
again, and then query the path status with the LOP:VCL command. If the status is
still DOWN, swap or reset the interface board. If the problem still persists, the
intermediate transmission device or the CN device is faulty. Notify the customer to
troubleshoot the transmission devices.
If the peer device does not support the LB function or the RNC version is V29, run
the DSP AALVCCPFM command to query the packets sent and received on the
AAL2Path. If the path both sends and receives packets, the path is normal. If the
path can only receive packets and cannot send packets, the local device is faulty.
Delete the faulty path and then add it again to restore the service. If the problem
still persists, swap or reset the IU interface board, and then observe the packets
sent and received on the path. If the path can send packets but cannot receive
packets, the intermediate transmission device or the CN device is faulty. Notify the
customer to troubleshoot the transmission devices.
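The interpretation of the DSP AALVCCPFM packet counters can be summarized as a small decision function. This is a sketch under assumptions: the function name is hypothetical, the two inputs are assumed to be the increase in sent and received packets between two queries of the same AAL2Path, and the case where the path neither sends nor receives is not covered by this document, so it is reported as undetermined.

```python
def diagnose_aal2_path(tx_packets, rx_packets):
    """Classify an AAL2Path from the packets sent and received between two queries."""
    sends = tx_packets > 0
    receives = rx_packets > 0
    if sends and receives:
        return "normal"
    if receives and not sends:
        # Local side: delete and re-add the path, then swap or reset the IU board.
        return "local fault"
    if sends and not receives:
        # Peer side: ask the customer to check the transmission or CN device.
        return "transmission or CN fault"
    return "undetermined"  # not described in this document
```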
After detecting the faulty AAL2Path in either of the above two ways, you can block
the AAL2Path to solve the problem temporarily. If no faulty path is detected, use
the following methods:
Block the AAL2Paths (BLK AAL2PATH) of all IU interfaces, and then unblock them
(UBL AAL2PATH) one by one. After unblocking a path, check whether the CS service
is normal. If yes, unblock the next AAL2Path. Keep only one AAL2Path active on the
IU interface at a time. If the CS service is abnormal, the AAL2Path is faulty. Block
it to mitigate the problem temporarily.
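The block-and-unblock isolation procedure can be sketched as a loop. The three callables are hypothetical placeholders for whatever the operator uses to issue BLK AAL2PATH and UBL AAL2PATH and to test a CS call; they are not part of any real tool interface. Unblocking the healthy paths at the end is an assumption about how the procedure is completed.

```python
def isolate_faulty_paths(path_ids, block, unblock, cs_service_ok):
    """Test AAL2Paths one at a time and return the IDs of the faulty ones.

    block/unblock take a path ID; cs_service_ok() returns True when a CS test
    call succeeds. All three are supplied by the caller (hypothetical hooks).
    """
    for path_id in path_ids:
        block(path_id)                  # start with every path blocked
    faulty = []
    for path_id in path_ids:
        unblock(path_id)                # only this path is active now
        if not cs_service_ok():
            faulty.append(path_id)
        block(path_id)                  # keep one active path at a time
    for path_id in path_ids:
        if path_id not in faulty:
            unblock(path_id)            # restore healthy paths, leave faulty ones blocked
    return faulty
```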
If it is not possible to block all the AAL2Paths on site, you can identify and block
the faulty AAL2Paths by querying idle CIDs. Run the DSP AAL2PATH command to
query the CID and bandwidth usage. For an abnormal AAL2Path, the number of idle
CIDs is almost 248.
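The idle-CID check can be expressed as a simple filter. This sketch assumes that the idle CID count of each path has already been read from the DSP AAL2PATH output into a dictionary; the function name and the margin of 2 used to interpret "almost 248" are assumptions. The result is only meaningful while other paths on the same interface are carrying traffic, because a path on an idle interface also shows nearly all CIDs idle.

```python
MAX_IDLE_CIDS = 248  # value quoted in this document for an abnormal path


def suspect_paths(idle_cids, margin=2):
    """Return the path IDs whose idle CID count is within `margin` of 248.

    idle_cids maps path ID -> number of idle CIDs reported for that path.
    """
    return sorted(
        path_id
        for path_id, idle in idle_cids.items()
        if idle >= MAX_IDLE_CIDS - margin
    )
```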
If the IUPS user plane is faulty, run the PING IP command to check whether the IP
address of the SGSN interface board (which supports the ping function) and the
GTPU address of the SGSN are reachable. If pinging the IP address of the peer
interface board times out, the IPOA link is faulty. Swap or reset the IU interface
board to rule out the RNC. If the problem still persists, notify the customer to
troubleshoot the intermediate transmission device or the SGSN. If pinging the
GTPU address of the SGSN times out, notify the customer to troubleshoot the
SGSN.
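The two ping checks for the IUPS user plane lead to different actions, which the following sketch makes explicit. The function name and the returned labels are illustrative assumptions; the inputs are the results of the two PING IP checks described above.

```python
def diagnose_iups_user_plane(board_ping_ok, gtpu_ping_ok):
    """Map the two ping results to the next action described in this document."""
    if not board_ping_ok:
        # IPOA link fault: swap or reset the IU interface board first; if the
        # problem persists, the customer checks the transmission device or SGSN.
        return "ipoa link fault"
    if not gtpu_ping_ok:
        # Interface board reachable but GTPU address is not: SGSN side.
        return "sgsn fault"
    return "no fault found by ping"
```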
IP networking:
Handling measures:
(16) Block (BLK) the paths on which alarms are present, and then check whether the
service is restored.
(17) Swap the IU interface board, and then check whether the service is restored.
(18) If the problem still persists, reset the IU interface board.
If the following alarms are present and the alarm time is consistent with the failure
time, take the following measures to restore the service.
If many of the above alarms are present, the traffic on the link is too heavy.
Sometimes the link is not congested: the RNC sends many SCCP-layer CR messages,
but the CN returns CREF messages instead of CC messages. If the IU interface cannot
initiate RAB assignment normally, trace the SCCP-layer messages of the IU interface
on site to check whether the CN returns CREF messages instead of CC messages. If
yes, try deactivating and then activating the MTP3B link set to restore the service.
Handling measures:
(19) View the RNC-level traffic statistics on site. If the equivalent CS/PS Erlang is
lower than 70% of the usual level, block some of the cells or NodeBs to reduce the
traffic.
(20) Enable flow control at the IU interface. Flow control of the IU interface is
controlled by a switch in the following versions:
- V29B072SP05 and above patch
- V210B061SP02 and above patch
- V110B061SP01 and above patch
The command is as follows:
- SET FCSW: BT=SPU, FCSW=ON, PRINTSW=OFF;
If PRINTSW is OFF, IU flow control is enabled. If PRINTSW is ON (the default), IU
flow control is disabled.
(21) If the alarm SCCP DSP unavailable (ALM-1506) is present, you can deactivate
and activate the link to restore the service.
- DEA MTP3BLKS
- ACT MTP3BLKS
- DEA M3LNK
- ACT M3LNK
(22) If a path congestion alarm is present, decrease the factor of the IU interface.
In addition, add more paths.
If the following alarms are present and the alarm time is consistent with the failure
time, take the following measures to restore the service.
If many of the above alarms are present, the IUB interface is faulty. From the alarms,
you can identify the interface board of the corresponding Node B or the control
subsystem where the Node B resides. This type of problem may be caused by a failure
of the interface board, the intermediate transmission device, or the SPU. If many
cell unavailable alarms (ALM-2006) are present, run the DSP CELL command to check
why the cells are unavailable, and then remove the faults based on the specific
reasons.
Handling measures:
(23) Check whether the alarm of public channel setup failure is present in many cells
and whether the affected cells terminate on the same interface board or the same
SPU board subsystem:
- If the failure is associated with the public channel, swap or reset the interface
board on which the cells terminate.
- In case of other reasons, swap the SPU board.
- If the problem cannot be solved by the two measures above, reset the subrack.
(24) If the IUB interfaces of a large number of faulty Node Bs are connected through
the same interface board, swap or reset it.
(25) If a large number of faulty Node Bs are terminated on the same SPU, swap or
reset the active and standby SPU boards.
(26) If the problem still persists, you need to trace the corresponding SAAL link to
check whether any packet is dropped by the intermediate transmission device,
and ask for help from the headquarters.
(27) If no signaling can be traced at the IU interface, all the CS/PS services of the
RNC are interrupted. You need to reset the RNC.
(28) If you can trace only initial direct-transfer messages, but no messages from the
CN, trace the SCCP messages at the IU interface.
- If a large number of CR messages sent to CN are traced, but no CC
message is traced, swap the SPU board, and then deactivate/activate the
MTP3B link set.
- If both a large number of CR messages sent to CN and CREF messages
are traced, it indicates that LAC, SAC, or RAC at the CN side is not correctly
configured. Contact the CN personnel to locate the problem.
(29) When tracing the IU interface messages, if the RNC receives the ERR IND message
from the CN after sending the initial direct-transfer message, run the RST IU
command. If the problem still persists, reset the IU interface board.
(30) If a large number of messages are traced, judge whether the registration or
attach flow can be accomplished by viewing the direct-transfer message at the
IU interface.
- If the registration or attach flow is rejected by CN, check whether the
LAC/SAC/RAC is configured and activated.
(31) Check whether RAB ASSIGNMENT REQ exists.
- If not, cooperate with CN personnel to locate the problem.
- If yes, view the result in the RAB ASSIGNMENT RSP message.
  If it is successful and RNC does not release the message within five seconds,
  the failure has nothing to do with RNC. RNC judges whether the messages belong
  to the same user based on the user ID contained in the messages. If RNC
  releases the message immediately, the AAL2Path/IPPATH of the IU interface may
  not be connected. You can troubleshoot paths by referring to IU Interface User
  Plane Alarms.
  If it is unsuccessful because the IU interface transmission resources failed to
  be established, the transmission resources of the IU interface are faulty. If no
  alarm is given, reset the AAL2PATH (RST AAL2PATH: PATHID=0;). If the service
  cannot be restored, block all AAL2Paths, and then unblock them one by one.
  If it is unsuccessful because the IUB resources or air interfaces fail or the
  cause is unspecified, continue tracing the IUB interface or IOS/CDT/IFTS.
(32) If checking IOS/CDT/IFTS shows that the IUB interface is faulty, RL does not
respond, or RL setup fails, the SPU board may be faulty. Swap the SPU board. If the
service cannot be restored, swap the IUB interface board. If the problem still
persists, reset the faulty subrack.
(33) If checking IOS/CDT/IFTS shows that the air interface is faulty and RB does not
respond, deactivate and then activate the cell first. If the problem still persists,
reset all the FMR/DPU boards in the faulty subrack. If the problem still persists,
reset the entire subrack.
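The SCCP check in step (28) amounts to comparing how many CR, CC, and CREF messages appear in the trace. The sketch below assumes the trace has been reduced to a list of message type names; the function name, the result labels, and the threshold of 20 for "a large number" are assumptions, since this document gives no figure.

```python
from collections import Counter


def classify_sccp_trace(message_types, many=20):
    """Classify an IU SCCP trace from the counts of CR, CC, and CREF messages."""
    counts = Counter(message_types)
    cr, cc, cref = counts["CR"], counts["CC"], counts["CREF"]
    if cr < many:
        return "inconclusive"
    if cref >= many:
        # CN refuses the connections: LAC, SAC, or RAC at the CN side is suspect.
        return "check_cn_lac_sac_rac"
    if cc == 0 and cref == 0:
        # CR messages go unanswered: swap the SPU board, then deactivate and
        # activate the MTP3B link set.
        return "swap_spu_then_cycle_linkset"
    return "inconclusive"
```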
Before resetting FMR/DPU boards, determine which ones to reset based on the
subsystem where the faulty cells are located. The mapping between FMR/DPU boards
and SPU boards differs between versions.
V1 (V18 and V110):
Subsystem 0 corresponds to FMR boards in odd slots.
Subsystem 1 corresponds to FMR boards in even slots.
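The V1 rule is a parity check on the slot number, as the following sketch shows. The function name is hypothetical, and the caller must supply the slot numbers that actually hold FMR boards in the subrack, because this document does not list them.

```python
def v1_fmr_slots(subsystem, fmr_slots):
    """Return the FMR slots that belong to a V1 SPU subsystem (0 or 1).

    Subsystem 0 uses odd slots and subsystem 1 uses even slots.
    """
    if subsystem not in (0, 1):
        raise ValueError("V1 subsystem must be 0 or 1")
    wanted_parity = 1 if subsystem == 0 else 0
    return [slot for slot in fmr_slots if slot % 2 == wanted_parity]
```

For example, with FMR boards in slots 0 to 5, a fault in subsystem 0 points to the boards in slots 1, 3, and 5.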
V2 (V29)
Only SPM 2 and SPM 4 exist in the RSS subrack. The following figure shows the
corresponding SPU/DPU board slots:
Only SPM 0, SPM 2 and SPM 4 exist in the RBS subrack. The following figure
shows the corresponding SPU/DPU board slots:
V2 (V210)
To share resources inside RNC, V2 adopts MPU to manage the relations between
SPU subsystems and DPUs. The users of all the SPU subsystems are likely to be
assigned to DSPs of each DPU.
KPI deterioration problems include a low RRC establishment rate, a low RAB
establishment rate, and a high call drop ratio, all of which greatly affect customer
satisfaction.
View the ID of the DSP where faults occurred, as shown in the following figure:
Handling measures:
(34) If error codes about MACC/MACD/RLC are displayed many times with the same
CPU ID, or the CPU IDs displayed belong to the same FMR board, reset the DSP or
the FMR/DPU board.
(35) If most faults are present on the same DSP or on the same board, reset the DSP
or the FMR/DPU board.
The CPU ID can be converted with the following special tools:
- For V1 (V17/V18/V110), use the following tool to convert the CPU/DSP ID:
CPUid.exe
- For V2 (V29/V210), use the following tool to convert the CPU/DSP ID:
cpuid_V29.exe
(36) If only one DSP is faulty, reset it. If the problem still persists, disable the DSP.
(37) If only one FMR/DPU board is faulty, reset it. If the problem still persists,
disable the FMR/DPU board.
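Deciding whether most faults sit on one DSP or one board is a counting exercise. The sketch below assumes that each fault record has already been converted (for example with the CPU ID tools above) into a (board slot, DSP number) pair; the function name and the rule that "most" means more than half of the records are assumptions.

```python
from collections import Counter


def dominant_fault_source(faults, share=0.5):
    """Find the DSP or board that accounts for more than `share` of the faults.

    faults is a list of (board_slot, dsp_number) pairs.
    Returns ("dsp", board, dsp), ("board", board), or None if faults are spread out.
    """
    if not faults:
        return None
    (board, dsp), count = Counter(faults).most_common(1)[0]
    if count / len(faults) > share:
        return ("dsp", board, dsp)       # reset this DSP first
    board, count = Counter(b for b, _ in faults).most_common(1)[0]
    if count / len(faults) > share:
        return ("board", board)          # reset this FMR/DPU board
    return None
```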
5.1 Emergency
In an emergency, that is, before the services are restored, the on-site personnel
must obtain the necessary logs and send them to the headquarters for fault location.
The logs must be as small as possible and must contain the fault points. For example,
a CDT/IOS tracing log must contain the call drop points. CHR logs and the other logs
described in this document must be those recorded during the fault period. On-site
personnel must filter the logs before providing them to the headquarters so that the
logs are short, contain the fault points, and can be sent quickly. All logs must be
compressed with RAR or ZIP to reduce the file size.
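Filtering logs to the fault period and compressing them can be scripted. The following is a minimal sketch using the Python standard library; it assumes that the file modification time is a usable proxy for when a log was recorded, which may not hold for every log type, and the function name is hypothetical.

```python
import zipfile
from datetime import datetime
from pathlib import Path


def zip_logs_in_window(log_dir, start, end, archive_path):
    """Write the files in log_dir modified between start and end to a ZIP archive.

    Returns the names of the files that were added.
    """
    archive = Path(archive_path)
    selected = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(log_dir).iterdir()):
            if not path.is_file() or path.resolve() == archive.resolve():
                continue  # skip subdirectories and the archive itself
            modified = datetime.fromtimestamp(path.stat().st_mtime)
            if start <= modified <= end:
                zf.write(path, arcname=path.name)
                selected.append(path.name)
    return selected
```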
For access emergencies:
CHR logs recorded during the fault period, especially the subrack number and
specific time. (Required)
Alarm logs recorded during the fault period, especially specific time. (Required)
IFTS/CDT/IOS tracing logs (CDT/IFTS must contain the statistics of L2 user
plane and L2 data reported within 100 seconds). The on-site personnel must
check that the logs to be sent contain fault information. (Required)
Traffic statistics data collected from two hours before the fault occurred to the
fault period. (Optional)
RNC configuration script. (Optional)
Text logs recorded during the fault period, especially the subrack number and
specific time. (Optional)
For KPI deterioration emergencies:
Traffic statistics data from two hours before the fault occurred to the fault period.
(Required)
CHR logs recorded during the fault period, especially the subrack number and
specific time. (Required)
IFTS/CDT/IOS tracing logs (CDT/IFTS must contain the statistics of L2 user
plane and L2 data reported within 100 seconds). The on-site personnel must
check that the logs to be sent contain fault information. (Required)
RNC configuration script. (Required)
Text logs recorded during the fault period, especially the subrack number and
specific time. (Optional)
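The two checklists above can be kept as a small lookup table so that missing required items are easy to spot before sending. The item names and Required/Optional flags are taken from the lists above; the dictionary layout and function name are illustrative assumptions.

```python
# True = Required, False = Optional, as listed in this section.
LOG_CHECKLIST = {
    "access": {
        "CHR logs": True,
        "Alarm logs": True,
        "IFTS/CDT/IOS tracing logs": True,
        "Traffic statistics": False,
        "RNC configuration script": False,
        "Text logs": False,
    },
    "kpi": {
        "Traffic statistics": True,
        "CHR logs": True,
        "IFTS/CDT/IOS tracing logs": True,
        "RNC configuration script": True,
        "Text logs": False,
    },
}


def missing_required(emergency_type, collected):
    """Return the required items for this emergency type that are not yet collected."""
    checklist = LOG_CHECKLIST[emergency_type]
    return [name for name, required in checklist.items()
            if required and name not in collected]
```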
When tracing IFTS, the subsystem must be that corresponding to the cell being
traced. You can run the LST CELL command to obtain the corresponding SPU
subsystem ID. You can fill RRC EST Cause as required. In the CS service case,
select Originating Conversational Call/Terminating Conversational Call; in
the PS service case, select Originating Interactive Call/Originating
Background Call/Terminating Interactive Call/Terminating Background Call.
Traffic Type is optional.
(43) CDT tracing with internal print message.
Find the following directory from the PC where LMT is located:
D:\HW LMT\adaptor\clientadaptor\RNC\BSC6810V200R010C01B051\style\defaultstyle\locale\en_US\rnctest
The version part (BSC6810V200R010C01B051) and the locale part (en_US) of the path
vary with the RNC version and the language type. The path above takes the English
version of V210B051 as the example.
Find the RncTestConfig.xml file, and then open it with UE or Notepad. Then find
the following part:
<DESC descname="CDTMSGTYPE">
<PARAS>
</PARAS>
</DESC>
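Locating the CDTMSGTYPE block in a large configuration file is easier with a script than by scrolling in an editor. The sketch below uses Python's standard xml.etree.ElementTree. Only the DESC and PARAS elements and the descname attribute come from this document; the CONFIG root element and the second DESC entry in the sample are assumptions made so that the example is self-contained, and the real RncTestConfig.xml may be structured differently.

```python
import xml.etree.ElementTree as ET


def find_desc(root, descname="CDTMSGTYPE"):
    """Return every DESC element under root whose descname attribute matches."""
    return [elem for elem in root.iter("DESC") if elem.get("descname") == descname]


# Hypothetical sample; the real file layout outside the DESC block is unknown.
sample = """<CONFIG>
  <DESC descname="OTHER"><PARAS></PARAS></DESC>
  <DESC descname="CDTMSGTYPE"><PARAS></PARAS></DESC>
</CONFIG>"""

matches = find_desc(ET.fromstring(sample))
print(len(matches), [child.tag for child in matches[0]])
```

For the real file, `ET.parse(path).getroot()` would replace `ET.fromstring(sample)`. Back up the file before editing it.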
For data transmission problems, make sure that the problems reoccur within 100
seconds after tracing begins.
In the following figure, CHR and text logs of V29 are obtained. CHR logs are saved in
the famlog directory, while text logs are saved in the FamLogFmt directory.
In the following figure, CHR and text logs of V210 are obtained. CHR logs are saved
in the fmt directory, while text logs are saved in the txt directory.
To feed back text logs and CHR logs quickly, on-site personnel need to know the
specific time when the faults occurred and the number of the faulty subrack, based on
which they can obtain valid logs. Log files should be as small as possible to shorten
the delivery time.