Deep Dive: Troubleshooting 101

With a single operating system, Junos OS, powering a large portfolio of physical and virtual networking and security products, Juniper has become an important player in the global market.

  #Network as a Service   #Deep Dive   #Juniper Networks  
Diana Stucki
+41 58 510 13 54
diana.stucki@umb.ch

According to Wikipedia:

  • Juniper is the third largest market-shareholder overall for routers and switches used by ISPs.
  • It is the fourth largest market-shareholder for edge routers and second largest market-shareholder for core routers with 25% of the core market.
  • It is also the second largest market-shareholder for firewall products with a 24.8% share of the firewall market.
  • In data center security appliances, Juniper is the second-place market-share holder behind Cisco.
  • In WLAN, where Juniper used to hold a more marginal market share, it is now expanding through its acquisition of Mist Systems, a visionary in WLAN.

As a result, Juniper products are present in hundreds of thousands of organizations across the globe, running their networking operations in a reliable, secure and flexible way.

When everything goes well, life is great, but what should you do if your Junos device is experiencing unexpected or unexplained issues? First of all, don’t panic: UMB, as a Juniper partner, together with JTAC (the Juniper Technical Assistance Center), is here to support you. If you don’t have a support contract, or you would like to start the troubleshooting process on your own, in this article I am going to share some useful tips and tricks to help you find the root cause of the problem.

 

Mindset

The Hippocratic school of medicine famously teaches: “Practice two things in your dealings with disease: either help or do not harm the patient.”

Problems in a production network environment can sometimes be transient, and there are cases where testing might cause more disruption than the problem itself. If a transient issue has cleared, it is better to plan for long-term monitoring, with testing taking place when the problem next manifests itself.

 

Purpose

The purpose of troubleshooting is to identify the root cause of an issue through root cause analysis (RCA).

By using a logical approach, we can break troubleshooting down into the steps below:

  1. Define success: define a specific, recognizable and desirable endpoint
  2. Isolate the problem: isolate the component preventing success
  3. Identify a solution: verify that the fix does not cause other problems and that it survives a reboot
  4. Implement the solution: follow change control processes and use maintenance windows

 

Collect information

Ask additional questions, for example:

  • “When did this start happening?”
  • “Is there any service impact?”
  • “Has this ever worked?”
  • “When did this last work as desired?”
  • “What has changed?”
  • “What troubleshooting steps and actions have been tried already?”

 

RSI and system logs

    The most common sources of information to start the troubleshooting process are the RSI (request support information) and the content of /var/log. Below is an example of how to collect the files:

     

    RSI

    root@yoda> request support information | save /var/log/rsi-yy-mm-dd.txt

     

    Logs

    root@yoda> file archive compress source /var/log/* destination /var/tmp/yy-mm-dd_LOGRSI_DeviceName.tgz

    sftp: get /var/tmp/yy-mm-dd_LOGRSI_DeviceName.tgz
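    Alternatively, if the device can reach a file server over SSH, the collected files can be pushed directly from the CLI with the file copy command; a minimal sketch, where the user name, server and remote path are placeholders:

    root@yoda> file copy /var/log/rsi-yy-mm-dd.txt scp://user@server/tmp/

    root@yoda> file copy /var/tmp/yy-mm-dd_LOGRSI_DeviceName.tgz scp://user@server/tmp/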

    The request support information command provides the output of most of the commands required to analyze the health of the device.

     

    RSI

    In the following sections, I will focus on the output provided by the RSI and describe some examples of common issues.

    Software version

    root@yoda> show version detail no-forwarding

    Hostname: yoda

    Model: srx240

    Junos: 18.2R3.4

    If the reported problem is an abnormal behavior, there is a chance that it is caused by a known issue documented in a problem report (PR). Simply checking the Junos version and browsing through the known PRs might help you find the RCA and save you a lot of time.

    You can do a PR search by version and keywords at: https://prsearch.juniper.net

     

    Example:

    The customer is observing a sharp increase in the memory utilization of RE1 (the backup routing engine) on an MX240 router running Junos 18.4R2.7. Nothing relevant can be found in the logs and there is no recent configuration change.

    Using keywords in the PR Search tool, PR1459384 can be identified as the RCA.

    The solution is a software upgrade.
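    For reference, the upgrade itself is performed with request system software add; a minimal sketch, assuming the target image has already been copied to /var/tmp (the package file name is a placeholder):

    root@yoda> request system software add /var/tmp/junos-install-package.tgz

    root@yoda> request system reboot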

     

    Core Files

    In case of a panic or other serious malfunction, a core dump file that contains the program’s environment in the form of memory pointers, instructions and register data might be created.

     

    Example: 

    The customer is reporting a spontaneous reboot of their device. In the RSI, core dumps can be seen:

    root@yoda> show system core-dumps no-forwarding

    -rw-rw----  1 root  wheel    4307560 Apr 21   2021 /var/tmp/pfed.core-tarball.0.tgz

    -rw-r--r--  1 root  wheel  1035087872 Apr 7   2021 /var/tmp/smgd.core.live.1

    ….

    /var/tmp/pics/*core*: No such file or directory

    /var/crash/kernel.*: No such file or directory

    total files: 12

    Followed by logs:

    Apr 21 14:27:40.890 2021  Yoda_RE0 kernel: Dumping 2040 out of 49108 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

    Apr 21 14:27:40.890 2021  Yoda_RE0 kernel: Dump complete

    Apr 21 14:27:40.890 2021  Yoda_RE0 kernel: Automatic reboot in 15 seconds - press a key on the console to abort

    Apr 21 14:27:40.890 2021  Yoda_RE0 kernel: Rebooting…

    We can see that the trigger for the reboot was a kernel core.

    A software engineer using a debugger and a version of the executable containing debugging symbols can analyze the resulting core file and find the sequence of events that led to the crash. Once this information is known, corrective actions can be taken.
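    To quickly check for core files outside of the RSI, the standard core locations can be listed from the CLI; a minimal sketch:

    user@yoda> file list /var/crash detail

    user@yoda> file list /var/tmp/*core* detail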

    Alarms

    If there are conditions that prevent the device, the chassis or the system software from operating normally, checking the alarms is always a good idea.
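    Note that chassis alarms and system alarms are reported separately, so it is worth checking both:

    user@yoda> show chassis alarms

    user@yoda> show system alarms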

     

    Example 1 

    root@yoda_RE0> show chassis alarms no-forwarding

    1 alarms currently active

    Alarm time               Class  Description

    2021-06-18 13:33:28 UTC  Major  PEM 4 Not Powered

    Followed by:

    user@yoda_RE0> show chassis power

    PEM 4:

    State: Present

    Input: Absent

    This points to a hardware failure of the PEM, which will require creating an RMA.

     

    Example 2:

    user@yoda> show chassis alarms

    1 alarms currently active

    Alarm time               Class  Description

    2021-06-10 12:24:09 CEST Major  FPC 2 Major Errors

    Solution: when this alarm is seen, the FPC errors can usually be cleared by reseating the card (physically removing it and then inserting it back); see also the sketch below.
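    On platforms that support it, the FPC can also be restarted from the CLI before attempting a physical reseat; a minimal sketch, using the slot number from the alarm above:

    user@yoda> request chassis fpc slot 2 restart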

     

    Example 3:

    user@yoda> show chassis alarms

    2 alarms currently active

    Alarm time               Class  Description

    2021-04-27 09:02:06 CET  Minor  Loss of communication with Backup RE

    2021-04-21 09:01:34 CET  Minor  Backup RE Active

    Followed by logs:

    Apr 04 08:14:17 CHASSISD_MASTER_LOG_MSG: - No response from the other routing engine for the last 360 seconds.

    Apr 04 08:14:17 CHASSISD_MASTER_LOG_MSG: - No response from the other routing engine for the last 2 seconds.

    Apr 04 08:14:17 CHASSISD_MASTER_LOG_MSG: - Keepalive timeout of 2 seconds expired. Assuming RE mastership.

    In this example, we can see a mastership change because keepalives were missed between the two REs. To bring the backup RE back, a reseat is the next recommended step. If the reseat is unsuccessful, an RMA might be created.

     

    Processes

    user@yoda> show system processes extensive no-forwarding

    last pid: 23839;  load averages:  0.46,  0.59,  0.49  up 0+00:49:27    15:15:35

    328 processes: 8 running, 272 sleeping, 1 zombie, 47 waiting

    Mem: 2802M Active, 5196M Inact, 2346M Wired, 1572M Buf, 37G Free

    Swap: 3072M Total, 3072M Free

    The following could indicate a problem:

    • Too many zombie processes
    • Extensive use of swap memory
    • Not enough available memory
    • Unexpectedly high CPU utilization for specific processes (except idle); see the filtering sketch below
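    To spot the heavy consumers quickly, the output can be filtered with the CLI pipe; a minimal sketch that hides the lines reporting 0.00% CPU:

    user@yoda> show system processes extensive no-forwarding | except 0.00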

    Chassis routing-engine

    • Check the memory and CPU utilization, uptime and reboot reason, as in the sketch below.
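    A minimal sketch of the command, with the output trimmed to the relevant fields (the values shown are illustrative):

    user@yoda> show chassis routing-engine

    Routing Engine status:
      Slot 0:
        Memory utilization          34 percent
        CPU utilization:
          Idle                      97 percent
        Uptime                      4 days, 2 hours, 10 minutes
        Last reboot reason          0x1:power cycle/failure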

     

    Example:

    The customer is reporting that one of their switches rebooted abnormally. There are no useful logs, no core dumps and no alarms. Looking at the processes, there is high CPU utilization.

    Checking the chassis routing engine, we can see that the last reboot reason was: Router rebooted : 0x2000: hypervisor reboot.

    Because high CPU utilization was noticed before the switch started having issues, this could explain why the internal KVM hypervisor rebooted unexpectedly.

     

    Note:

    On MX Series Routing Engines, the reboot reason code can also be determined from the shell by using the following shell command:

    % sysctl hw.re.reboot_reason

    Bit 0 is set when there is a reboot due to power failure or power cycle.
    Bit 1 is set when there is a reboot triggered by hardware watchdog.
    Bit 2 is set when a reboot is initiated by the reset button.
    Bit 3 is set when there is a power button press.
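    For example, a returned value of 2 (bit 1 set) would point to a reboot triggered by the hardware watchdog; the value below is illustrative:

    % sysctl hw.re.reboot_reason

    hw.re.reboot_reason: 2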

     

    LOGS 

    Interpreting Syslog Messages 

    When using the standard syslog format, each log entry written to the messages file consists of the following fields:

    • Timestamp: the time the message was logged
    • Name: the configured system name
    • Process name or PID: the name of the process (or the process ID when a name is not available) that generated the log entry
    • Message-code: a code that identifies the general nature and purpose of the message, for example CHASSISD_FRU_EVENT
    • Message-text: additional information related to the message code

    Check the System Log Messages Reference documentation for a full description of the various message codes and their meaning: https://www.juniper.net/documentation/partners/ibm/junos11.4-oemlitedocs/syslog-messages.pdf
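    When chasing a specific message code, the messages file can also be filtered directly on the device; a minimal sketch using the example code from the list above:

    user@yoda> show log messages | match CHASSISD_FRU_EVENT | last 20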

    To determine the details of any log type, use the command:

    user@yoda> help syslog log-name

     

    Examples of “strange logs”:

    • Jul  5 11:29:53  Yoda fpc0 brcm_pw_get_egr_stats: Egr Pkt Counter fetch failed. 

    → The PFE is trying to fetch some statistics. These messages are harmless and informational. A syslog filter can be applied to suppress these logs (see the sketch below). The issue is fixed by PR1491819.
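    A minimal configuration sketch for such a filter, assuming the logs go to the default messages file; the regular expression is only an illustration matching the text above:

    user@yoda# set system syslog file messages match "!(.*Egr Pkt Counter fetch failed.*)"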

    • Jul 10 07:19:08 2020  Yoda fpc0 BRCM_SALM:brcm_salm_periodic_clear_pending(),153: Failed to delete Pending entres for unit = 0, modid = 0, port = 27, err code = 

    →  Match PR1475005

    • Oct 22 15:21:05  Yoda xntpd[9279]: kernel time sync enabled 2001
    • Oct 22 16:48:06 Yoda eventd: sendto: No route to host
    • Oct 22 16:48:06  QFX51-4.ZH4 eventd[16886]: SYSTEM_ABNORMAL_SHUTDOWN: System abnormally shut

    → Match PR1459384

    • re0 chassisd[17080]: CHASSISD_I2CS_READBACK_ERROR: Readback error from I2C slave for FPC 5 ([0x17, 0x21] →0x0)

    → These logs are typically seen when the FPC is faulty, which makes the chassis incapable of reading the FPC state through the I2C connections between the CB and the FPC, or when there is a communication issue caused by the midplane.

    Try reseating the FPC; if this does not work, an RMA might be created.

    • May 5 06:34:12 2020  Yoda fpc0 BRCM_SALM:brcm_salm_periodic_clear_pending(),126: Failed to delete Pending entres for unit = 0, tgid = 2,  err code = -9

           → Match PR1371092

    • Jun 24 09:13:29 lab-re0 /kernel: rt_pfe_veto: Too many delayed route/nexthop unrefs. Op 2 err 55, rtsm_id 5:-1, msg type 2
      Jun 24 09:13:29 lab-re0 /kernel: rt_pfe_veto: Memory usage of M_RTNEXTHOP type = (0) Max size possible for M_RTNEXTHOP type = (16711550976) Current delayed unref = (40151), Current unique delayed unref = (40000), Max delayed unref on 

    → Check: KB36114

    The rt_pfe_veto message means that the FPC is overloaded and is sending a “veto” message to the RE, asking it not to send any more routes to be processed. The FPC load can be checked as in the sketch below.
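    A minimal sketch of checking FPC CPU and memory utilization (the values shown are illustrative):

    user@yoda> show chassis fpc

                         Temp  CPU Utilization (%)   Memory    Utilization (%)
    Slot State            (C)  Total  Interrupt      DRAM (MB) Heap     Buffer
      0  Online            41     95          2       2048       78       24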

    • pio_read_u64() failed and jspec_pio_read_u256() failed

    → Check: KB24641

    This error comes from the memory integrity check, which is performed periodically by the LU driver. When the read encounters an error during the check, the LU driver will try to repair it by writing the data from its shadow copy to the same IDMEM/GUMEM location. If the error is repaired, the subsequent memory checks will pass and the error messages will stop.