Target Audience
This training is intended for engineers who support the E-series platform in the field. Basic familiarity with the product and good network troubleshooting skills are expected.
Agenda
Why does an E-series router crash?
What happens after a crash?
What can I do once it crashes?
What information does JTAC need?
Types of crashes
Software panics
Software not designed to handle a specific condition.
Processor Exceptions
Processor on the SRP or LM hits a violation while processing data.
Detector crashes
Recovery and Detection mechanism implemented to address forwarding fault conditions
Software panics
Software panics (example)
time of reset: THU NOV 01 2007 00:57:26 CDT
run state: primary
image type: application
location: slot (6)
build date: 0x46392b64 THU MAY 03 2007 00:23:00 UTC
reset type: panic
task: cliActor
file: osSemaphore.cc
line: 153
arg: 38516744
last errno: 0x3d0001
pc: 0x1c480f74: fatalPanic__Fv +0x8
lr: 0x1c532eec: take__11OsSemaphore +0x1c0
<output truncated>
The SRP crashed when an improperly terminated SSH session was cleared with the 'clear line vty' command from another SSH session.
The fix changed how the SSH application behaves when a VTY session is cleared.
KB 31413
Software panics (example)
time of reset: Thu Aug 2 01:17:18 2007
run state: unknown (0)
image type: application
image id: 0x464c3086
build date: 0x464c3086 Thu May 17 2007 10:37:58 GMT
location: internal slot (1), processor 0, boardId 0x33, boardRev 0x3
reset type: panic
file: dhcpDemux.cc
line: 221
task: scheduler
last errno: 0x30065
pc: 0x3a1f5c -> fatalPanic(void) offset: 0x8
lr: 0x709c4 -> DhcpDemux::receivePacket(Uid, Uid, OsBuffer &, unsigned char)
<output truncated>
The crash was seen on line modules running the DHCP relay proxy application after an SRP failover.
The DHCP application on the LM received a DHCP packet before it was ready.
The fix discards DHCP packets until the DHCP application is ready.
KB 29531
Processor Exceptions
Processor Exceptions (example)
time of reset: Tue Mar 19 18:35:45 200
location: slot 9 (a), processor 0, boardId 0x19, boardRev 0x0
image type: application
build date: 0x3c7608b1 (Fri Feb 22 09:00:33 2002)
reset type: processor exception 0x200 (machine check)
task: icc
pc: 0x37f2ccc -> memPartAlignedAlloc offset: 0x108
lr: 0x37f2c9c -> memPartAlignedAlloc offset: 0xd8
dar: 0x00000000
cr: 0x24000080
xer: 0x00000000
fpcsr: 0x0369beec
srr1: 0x0010b030
dsisr: 0x00000000
ctr: 0x00000000
<output truncated>
SRP crash due to an L2 cache memory parity error.
The SRP CPU encounters a parity error when reading data from the L2 data cache.
No software fix is available for this problem. Historically, the crash has not recurred on the same SRP.
KB 2443
Processor Exceptions (example)
time of reset: Thu Aug 23 20:48:09 2007
run state: primary
image type: application
image id: 0x462e5408
build date: 0x462e5408 Tue Apr 24 2007 19:01:28 GMT
location: internal slot (7), processor 0, boardId 0x3e, boardRev 0
reset type: processor exception 0x300 (data access: protection violation (read attempt))
task: IpSubscriberMana
pc: 0x4d0b8724 -> OsPooledObject::operator delete(void *) offset: 0x74
lr: 0x490518f4 -> IpSubscriberMgr::SubscriberEntry::~SubscriberEntry(void) offset: 0x24
<output truncated>
SRP crash due to a data access violation.
The IPSM application wrote data to an incorrect location.
The SRP CPU cannot access data at that location, so it crashes while trying to read from it.
The IPSM application was fixed so that it no longer writes to the invalid location.
KB 29909
Detector crashes
Internal mechanism developed by Juniper to detect and recover from forwarding faults.
The objective is to minimize the forwarding impact on the router.
The system initially tries to recover from the fault without any external impact. If that is not possible, it forces a crash.
It also records information about these faults for troubleshooting purposes.
KB 16800: Enhanced PFTE support on E-series
2 basic mechanisms:
Run by SRP
Commonly known as PIMTE (Ping/ICC Monitoring Threshold Exceeded) or PFTE (Ping Failure Threshold Exceeded).
The SRP frequently polls (pings) the line modules to check their health.
Thresholds are defined for applications interacting between the SRP and the LM.
If thresholds are exceeded, the SRP decides what action should be taken.
Additional information is written to a file with the extension .tsa.
TSA file generation does not always mean there was a crash.
These crashes have a generic crash signature.
The crash can occur on the standby SRP or on a line module.
Reboot.hty, the core dump, and the TSA file (if present) are required in each case to analyse the root cause.
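The threshold idea behind the SRP-run detector can be illustrated with a minimal Python sketch. This is not Juniper's implementation: the class name, the use of consecutive missed replies as the counted quantity, and the threshold value are all assumptions made for illustration.

```python
from collections import defaultdict


class PingMonitor:
    """Counts consecutive missed ping replies per slot (illustrative only)."""

    def __init__(self, failure_threshold=3):
        # failure_threshold is a made-up value, not a documented default.
        self.failure_threshold = failure_threshold
        self.consecutive_failures = defaultdict(int)

    def record(self, slot, replied):
        """Record one poll result; return True once the threshold is exceeded."""
        if replied:
            # A successful reply resets the failure count for that slot.
            self.consecutive_failures[slot] = 0
            return False
        self.consecutive_failures[slot] += 1
        return self.consecutive_failures[slot] > self.failure_threshold


monitor = PingMonitor(failure_threshold=2)
for replied in (False, False, False):
    exceeded = monitor.record(5, replied)
print(exceeded)  # True: the third consecutive miss exceeds a threshold of 2
```

In this sketch the caller decides what to do when `record` returns True, which mirrors the point above that exceeding a threshold triggers a decision rather than an automatic crash.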
Detector crashes: PIMTE (example)
time of reset: Fri Nov 10 00:13:00 2006
run state: unknown (0)
image type: application
image id: 0x4550dd8b
build date: 0x4550dd8b Tue Nov 07 2006 19:24:59 GMT
location: internal slot (4), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ping/ICC monitoring threshold exceeded"
file: ontrolNetwork.cc
line: 775
task: scheduler
last errno: 0x110001
pc: 0x9e4228 -> fatalPanic(void) offset: 0x8
lr: 0x122088 -> Hw2Ic1ControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &, CbusReplyDoneAction *&) offset: 0x1308
<output truncated>

time of reset: Tue Aug 29 11:21:41 2006
run state: standby
image type: application
image id: 0x44ed842e
build date: 0x44ed842e Thu Aug 24 2006 10:49:18 Eastern Standard Time
location: internal slot (9), processor 0, boardId 0x3e, boardRev 0
reset type: unknown software error signature (0xfadead4), msg "Ping/ICC monitoring threshold exceeded"
file: ontrolNetwork.cc
line: 752
task: cbusSlave
last errno: 0x380003
pc: 0x4cc98f10 -> fatalPanic(void) offset: 0x8
lr: 0x4ce26e7c -> Hw1SrpSlaveControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &) offset: 0x1548
<output truncated>
Detector crashes: PFTE (example)
time of reset: Mon Apr 24 12:44:33 2006
run state: unknown (0)
image type: boot
image id: 0x4390bdc4
build date: 0x4390bdc4 Fri Dec 02 2005 21:33:56 GMT
location: internal slot (5), processor 0, boardId 0x33, boardRev 0x3
reset type: panic, msg "ping failure threshold exceeded"
file: ontrolNetwork.cc
line: 1182
task: scheduler
last errno: 0
pc: 0x19235d4 -> fatalPanic(void) offset: 0x8
lr: 0x1999d90 -> Hw1Ic1ControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &, CbusReplyDoneAction *&) offset: 0xbf8
<output truncated>
2 basic mechanisms that run:
Run by the Line module
Commonly known as ic1Detector crashes.
Various components on the line module inform the IC (line module CPU) about any forwarding faults.
Based on the severity of the problem, the line module decides what action should be taken.
The IC initially attempts to recover the affected component. If the component cannot be recovered, or if the problem recurs, the line module is crashed.
ic1Detector crashes are seen only on line modules.
They have a generic crash signature.
Reboot.hty and the core dump are required in each case to analyse the root cause.
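The "recover first, crash if recovery fails or the fault recurs" behaviour can be sketched in Python as follows. This is only a rough model under stated assumptions: the class name, the per-component recovery limit, and the component name "fabric" are hypothetical and do not come from the ic1Detector code.

```python
class FaultHandler:
    """Illustrative escalation policy: try recovery once, then force a crash."""

    def __init__(self, max_recoveries=1):
        # max_recoveries is an assumed limit, not a documented value.
        self.max_recoveries = max_recoveries
        self.recoveries = {}

    def handle(self, component, recover):
        """Handle one fault report; `recover` is a callable returning True on success."""
        attempts = self.recoveries.get(component, 0)
        if attempts >= self.max_recoveries:
            return "crash"  # the fault recurred after an earlier recovery attempt
        self.recoveries[component] = attempts + 1
        if recover():
            return "recovered"
        return "crash"  # the component could not be recovered


handler = FaultHandler()
print(handler.handle("fabric", lambda: True))  # recovered
print(handler.handle("fabric", lambda: True))  # crash
```

The sketch ignores fault severity, which the real mechanism also takes into account when choosing an action.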
Detector crashes: ic1Detector (example)
time of reset: Mon Aug 28 14:55:44 2006
run state: unknown (0)
image type: application
image id: 0x44a294a3
build date: 0x44a294a3 Wed Jun 28 2006 14:39:31 GMT
location: internal slot (5), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ic1Detector::requestRecovery executed forced IC crash"
file: ic1Detector.cc
line: 718
task: scheduler
last errno: 0x110001
pc: 0x9b528c -> fatalPanic(void) offset: 0x8
lr: 0x1286544 -> Ic1Detector::requestRecovery(Ic1Detector *, unsigned long, int, unsigned short, bool, bool, char const *, bool) offset: 0xa1c
<output truncated>

time of reset: Tue Jan 22 11:32:39 2008
run state: unknown (0)
image type: application
image id: 0x46d571bf
build date: 0x46d571bf Wed Aug 29 2007 13:16:47 GMT
location: internal slot (15), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ic1Detector::requestRecovery executed forced IC crash"
file: ic1Detector.cc
line: 738
task: scheduler
last errno: 0x110001
pc: 0xa6b740 -> fatalPanic(void) offset: 0x8
lr: 0x147dc24 -> Ic1Detector::requestRecovery(Ic1Detector *, unsigned long, int, unsigned short, bool, bool, char const *, bool) offset: 0xa54
<output truncated>
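As a rough triage aid, the "reset type" field of the examples in this section can be mapped to the three crash types with a few string checks. The Python sketch below relies only on the message strings shown in those examples; the function name and the "unknown" fallback are assumptions, and other releases may use different wording.

```python
# Message fragments taken from the detector crash examples (lower-cased).
DETECTOR_MESSAGES = (
    "ping/icc monitoring threshold exceeded",
    "ping failure threshold exceeded",
    "ic1detector::requestrecovery",
)


def classify_reset(reset_type):
    """Map a 'reset type' value to one of the crash types (best effort)."""
    text = reset_type.lower()
    # Detector crashes are checked first because they also report "panic".
    if any(message in text for message in DETECTOR_MESSAGES):
        return "detector crash"
    if text.startswith("processor exception"):
        return "processor exception"
    if text.startswith("panic"):
        return "software panic"
    return "unknown"


print(classify_reset("processor exception 0x200 (machine check)"))
print(classify_reset('panic, msg "ping failure threshold exceeded"'))
```

A result from this sketch is only a starting point; the root cause still has to be analysed from reboot.hty and the core dump.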
Standby SRP
Make a copy of the standby SRP's reboot.hty file on the primary SRP's flash:
copy standby:reboot.hty <filename>.hty
Or copy the standby SRP's reboot.hty file to an FTP server:
copy standby:reboot.hty <FTPserver>:<path>/<filename>.hty
time of reset: Identifies the time the reset took place.
run state: Relevant only for the SRP. Identifies whether the SRP was in primary or standby state when the reset occurred. Set to unknown for line modules.
image type: Identifies the type of image the SRP or line module was running when it reloaded. This can be a boot, diag, or application image.
image id: Internal ID used by Juniper to identify the release.
build date: Identifies the date when the release on this router was built.
task: Identifies the task that was running on the SRP or line module when it was reset. On the line module this is always set to scheduler, as that is the only task that runs on the line module.
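To pull these fields out of a saved reset record, a small parser is often enough. The Python sketch below assumes the record has been saved as text with one "name: value" field per line; the function name is hypothetical, and the sample is an abbreviated copy of the DHCP panic example shown earlier.

```python
def parse_reset_record(text):
    """Split a reset record into a dict, assuming one 'name: value' per line."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, separator, value = line.partition(":")
        if not separator:
            continue  # skips blank lines and markers such as "<output truncated>"
        record[key.strip()] = value.strip()
    return record


sample = """\
time of reset: Thu Aug 2 01:17:18 2007
run state: unknown (0)
image type: application
reset type: panic
file: dhcpDemux.cc
line: 221
task: scheduler
<output truncated>
"""

record = parse_reset_record(sample)
print(record["task"], record["file"], record["line"])  # scheduler dhcpDemux.cc 221
```

Splitting on the first colon keeps timestamps and symbol names intact, since both contain colons of their own.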
Contact JTAC
It is recommended to contact JTAC whenever there is a crash on the router
http://www.juniper.net/kb
- Core dumps (if any)
- Files with extension .tsa (if any)
- system.log file
- Copy of the router configuration in CNF and SCR format
- Depending on the problem, JTAC may require other outputs. These are needed in special cases only, and JTAC will provide all necessary information in such cases.
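Before opening a case, it can help to check a collected set of files against this list. The Python sketch below is only an illustration: the .cnf and .scr extensions are assumed from the format names, reboot.hty is included because the earlier sections require it, core dumps are left out because their file naming is not given here, and .tsa files are treated as optional because they exist only in some cases.

```python
from fnmatch import fnmatchcase

# Item description -> filename pattern (patterns are assumptions, see above).
EXPECTED_FILES = {
    "reboot history": "reboot.hty",
    "system log": "system.log",
    "configuration (CNF)": "*.cnf",
    "configuration (SCR)": "*.scr",
}


def missing_items(filenames):
    """Return the descriptions of expected items with no matching file."""
    names = [name.lower() for name in filenames]
    return [
        description
        for description, pattern in EXPECTED_FILES.items()
        if not any(fnmatchcase(name, pattern) for name in names)
    ]


collected = ["reboot.hty", "system.log", "router1.cnf"]
print(missing_items(collected))  # ['configuration (SCR)']
```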
Summary
Crashes can be a good thing
Assess the severity of the crash and its impact on the network
Work closely with JTAC to analyze the root cause
A good understanding of E-series behavior helps build customer confidence
Questions?
www.juniper.net