Nexus and Indeni (Scenarios)

A few examples of what only Indeni Server can monitor, alert on, and provide remediation for in a Nexus Data Center network

This section discusses nine common scenarios and describes how introducing the Indeni R. 5.9 Monitoring Server into your network can eliminate severe network outages and reduce the operating cost of a Cisco Nexus Data Center.

A remediation is provided with each alert in order to eliminate network downtime and greatly reduce troubleshooting time. Finally, note that the alerting and remediation for the issues described in scenarios 6 and 7 are on the roadmap for the coming Indeni release.

Scenario-1

Problem: Service outage after a scheduled reboot of a Nexus Core Switch.

Description: You recently applied the latest configuration changes to a Nexus Data Center Core Switch. You are an L3 standby engineer and you receive a call at night: several services are not operational after a reboot of the Nexus Core Switch during a scheduled Maintenance Window. You troubleshoot the case and realize that the latest configuration changes are missing from the configuration. They had never been saved, so the reboot discarded them.

How can Indeni R 5.9 overcome this common problem?

Indeni R. 5.9 Server regularly checks that the latest configuration changes have been saved on all the Nexus switches of the DC infrastructure.
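The check described above can be sketched as a comparison between the running and the startup configuration. This is a minimal illustration in Python, not Indeni's implementation: the function names are hypothetical, the two configurations are assumed to have been collected already (for example with "show running-config" and "show startup-config"), and lines beginning with "!" are assumed to be comment or timestamp lines that can be ignored.

```python
import difflib


def config_lines(text):
    """Return significant config lines, skipping blank lines and '!' comments."""
    return [
        line.rstrip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("!")
    ]


def unsaved_changes(running, startup):
    """List lines that differ between the running and the startup config.

    '+ ' marks a line present only in the running config (not yet saved),
    '- ' marks a line present only in the startup config.
    """
    diff = difflib.ndiff(config_lines(startup), config_lines(running))
    return [entry for entry in diff if entry.startswith(("+ ", "- "))]


# Hypothetical sample data: VLAN 30 was configured but never saved.
startup = "!Time: (sample)\nvlan 10\nvlan 20\n"
running = "!Time: (sample)\nvlan 10\nvlan 20\nvlan 30\n"

for change in unsaved_changes(running, startup):
    print("ALERT unsaved change:", change)
```

An empty result means the running configuration matches what will be loaded after the next reboot.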

Scenario-2

Problem: The N9K-X9732C-EX line card and the N9K-C9508-FM-E fabric modules do not boot up and are not displayed in the switch inventory after a scheduled reboot of the Nexus 9000 Series switches of a Data Center.

Description: Two Nexus 9500 Core Switches were rebooted during a Maintenance Window for a software upgrade. Both switches booted up as expected and were upgraded to the new NX-OS release. A couple of minutes later you receive a call about complaints of a service outage in the network. You troubleshoot the case. Suspecting that the software upgrade caused the problem, you downgrade the switches, but the problem is not resolved! You then notice that the N9K-X9732C-EX line card and the N9K-C9508-FM-E fabric modules installed in both 9500 Core Switches are not displayed in the switch inventory! You open a priority 1 case with Cisco TAC and are notified that a Field Notice (FN-64251) for these modules and line cards was publicly announced in February 2017. In particular, you are informed that "Nexus 9000 Series N9K-C9504-FM-E/N9K-C9508-FM-E/N9K-X9732C-EX Might Fail After 18 Months or Longer Due to Clock Signal Component Failure with Replacement Available After July 1, 2017".

How can Indeni R 5.9 proactively inform you of this serious problem with Nexus 9000 Series Switches?

Indeni R. 5.9 Server periodically examines the Version ID (VID) of the fabric modules and line cards installed in the Nexus 9k series. If a VID matches one covered by the field notice, an alert is generated referring to Cisco Field Notice FN-64251 so that the replacement of the affected modules or line cards can be arranged proactively. This problem affects only the Nexus 9k Series switches.
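As a rough sketch of such a VID check, the Python below scans inventory text for PID/VID pairs and flags those found in a lookup table. Everything here is an assumption for illustration: the function name, the simplified inventory layout, the fictitious serial numbers, and above all the VID values in the table, which are placeholders and not the values published in FN-64251.

```python
import re

# Matches the "PID: ..., VID: ..., SN: ..." line of an inventory entry.
INVENTORY_RE = re.compile(r"PID:\s*(\S+)\s*,\s*VID:\s*(\S+)\s*,\s*SN:\s*(\S+)")


def affected_modules(inventory_text, affected_vids):
    """Return (pid, vid, serial) for modules whose VID is listed as affected.

    affected_vids maps a product ID to the set of VIDs that should be flagged.
    """
    hits = []
    for pid, vid, serial in INVENTORY_RE.findall(inventory_text):
        if vid in affected_vids.get(pid, set()):
            hits.append((pid, vid, serial))
    return hits


# Placeholder VIDs for illustration only; take the real list from FN-64251.
AFFECTED_VIDS = {
    "N9K-X9732C-EX": {"V01"},
    "N9K-C9508-FM-E": {"V01"},
}

# Simplified, made-up inventory output (serial numbers are fictitious).
SAMPLE_INVENTORY = """\
NAME: "Slot 1", DESCR: "Line card"
PID: N9K-X9732C-EX , VID: V01 , SN: SAMPLE00001
NAME: "Slot 22", DESCR: "Fabric module"
PID: N9K-C9508-FM-E , VID: V03 , SN: SAMPLE00002
"""

for pid, vid, serial in affected_modules(SAMPLE_INVENTORY, AFFECTED_VIDS):
    print(f"ALERT {pid} (VID {vid}, SN {serial}): see Cisco FN-64251")
```

Keeping the affected VIDs in a table separate from the parsing logic means the same scan could be reused for other field notices.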

Scenario-3

Problem: The Nexus Switches deployed in the Data Center failed the Network Security Assessment performed by an auditor. The company faces the risk of failing ISO compliance for the Data Center.

Description: A trial version of a Network Monitoring System (NMS) was installed by a junior network engineer in order to test the product. The engineer temporarily configured several Nexus switches with unrestricted SNMPv2 RW access using the default community, plus a telnet user, so that the NMS could access the Nexus switches. The pen tester easily collected all the configuration files of the Nexus switches by exploiting SNMP. In addition, the admin-level credentials of the telnet user, which are sent unencrypted, were revealed.

How can Indeni R 5.9 proactively inform you of this major security issue?

Indeni R. 5.9 Server periodically verifies that network security best practices have been configured on the Nexus switches. If the configuration is not aligned with these practices, an alarm is generated along with a remediation message.
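To make the idea concrete, here is a small Python sketch that scans a configuration for the two weaknesses from this scenario: a default or writable SNMP community and enabled telnet. The rule set, the function name, and the sample configuration are hypothetical, and the command syntax is assumed to follow the NX-OS style ("snmp-server community NAME ...", "feature telnet"). A real audit would cover far more rules.

```python
import re

DEFAULT_COMMUNITIES = {"public", "private"}


def security_findings(config_text):
    """Return human-readable findings for a few illustrative security rules."""
    findings = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        match = re.match(r"snmp-server community (\S+)", line)
        if match:
            name = match.group(1)
            options = line[match.end():]  # everything after the community name
            if name.lower() in DEFAULT_COMMUNITIES:
                findings.append(f"default SNMP community '{name}' is configured")
            if re.search(r"\b(rw|network-admin)\b", options):
                findings.append(f"SNMP community '{name}' has write access")
        if line == "feature telnet":
            findings.append("telnet is enabled; use SSH instead")
    return findings


# Made-up configuration fragment reproducing the scenario.
SAMPLE_CONFIG = """\
feature telnet
snmp-server community public group network-admin
snmp-server community s3cr3t group network-operator
"""

for finding in security_findings(SAMPLE_CONFIG):
    print("ALERT", finding)
```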

Scenario-4

Problem: A network outage was caused by an NX-OS vPC misconfiguration and the resulting inconsistency between the vPC ports.

Description: A network engineer was assigned the task of adding new VLANs to the allowed VLAN list of the Nexus vPC ports. By mistake, the engineer omitted the "vlan add" option on one of the two vPC ports, so the existing allowed VLAN list on that port was replaced by the new VLAN list. Later you receive a call about service downtime caused by the VLAN inconsistency between the vPC ports.

How can Indeni R 5.9 proactively inform you of this serious problem with Nexus Series Switches?

Indeni R. 5.9 Server regularly examines the vPC for inconsistencies. An alarm is generated along with a remediation message in case of a vPC inconsistency.
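One of the inconsistencies from this scenario, a differing allowed VLAN list on the two vPC ports, can be detected with a simple set comparison. The sketch below is illustrative only: the function names are hypothetical, and the VLAN specifications are assumed to have been extracted already from each port's "switchport trunk allowed vlan" line.

```python
def expand_vlans(spec):
    """Expand a VLAN list such as '10,20-22' into the set {10, 20, 21, 22}."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            vlans.update(range(int(first), int(last) + 1))
        else:
            vlans.add(int(part))
    return vlans


def vlan_mismatch(peer_a_spec, peer_b_spec):
    """Return (VLANs only on peer A, VLANs only on peer B), each sorted."""
    a, b = expand_vlans(peer_a_spec), expand_vlans(peer_b_spec)
    return sorted(a - b), sorted(b - a)


# Hypothetical state after the mistake: VLAN 30 was added on peer A,
# but on peer B the whole list was replaced by VLAN 30.
only_a, only_b = vlan_mismatch("10,20-22,30", "30")
if only_a or only_b:
    print("ALERT vPC VLAN inconsistency:", only_a, only_b)
```

Both returned lists are empty when the two ports allow exactly the same VLANs.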

Scenario-5

Problem: Network outage due to license expiration on the Nexus switches.

Description: The 120-day grace period was enabled on the Nexus switch in order to activate advanced Layer 3 features and configure BGP. Delivery of the Layer 3 Advanced NX-OS license by Cisco was delayed. The grace-period license expired and the BGP configuration applied to the Nexus was disabled. A network outage followed until a workaround was implemented.

How can Indeni R 5.9 proactively inform you of this serious problem with Nexus Series Switches?

Indeni R. 5.9 Server regularly examines the activated features and licenses of the Nexus switches. An alarm is generated along with a remediation message in case of license expiration. In addition, an alarm is generated if an NX-OS feature or license mismatch is identified between the vPC Nexus peer switches.
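Both checks reduce to simple arithmetic once the data has been collected. The Python sketch below assumes the grace-period start date and the enabled feature lists of the two vPC peers are already known; the function names, the 30-day warning threshold, and the sample values are hypothetical.

```python
from datetime import date, timedelta

GRACE_PERIOD_DAYS = 120   # grace period mentioned in the scenario
WARNING_DAYS = 30         # hypothetical threshold for an early warning


def grace_days_left(started, today, grace_days=GRACE_PERIOD_DAYS):
    """Days until the grace period ends (negative once it has expired)."""
    return (started + timedelta(days=grace_days) - today).days


def feature_mismatch(peer_a_features, peer_b_features):
    """Features enabled on exactly one of the two vPC peers, sorted."""
    return sorted(set(peer_a_features) ^ set(peer_b_features))


# Hypothetical sample values.
remaining = grace_days_left(date(2017, 1, 1), date(2017, 4, 11))
if remaining <= WARNING_DAYS:
    print(f"ALERT grace period ends in {remaining} days")

mismatch = feature_mismatch(["bgp", "lacp", "vpc"], ["lacp", "vpc"])
if mismatch:
    print("ALERT feature mismatch between vPC peers:", mismatch)
```

Warning some time before the expiry date, rather than on it, leaves room to chase the delayed license delivery.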

Scenario-6

Problem: BGP reconvergence to the secondary MPLS WAN link takes too long when the primary WAN link of the CE fails.

Description: A customer's HQ site is multihomed with two IP/MPLS WAN links in order to provide highly available services to the remote branch offices. Service unavailability has been noticed at the remote sites whenever the HQ primary WAN link fails. In particular, service outages of 2-3 minutes have been reported. This reconvergence time to the backup link is too high to meet the strict SLA requirements. You troubleshoot the case by checking the IGP and HSRP configuration, and everything is tuned for optimum reconvergence time. Investigating further, you notice that the BGP sessions between the SP and the HQ use the default Cisco BGP keepalive (60 sec) and holddown (180 sec) timers. You ask the Service Provider to reduce the BGP keepalive and holddown timers.

How can Indeni R 5.9 proactively inform you of slow BGP reconvergence in case of failure?

Indeni R. 5.9 Server periodically examines the BGP keepalive and holddown timers on all nodes with a BGP configuration. An alarm is proactively generated along with a remediation message if the BGP timers are so high that BGP needs more than 60 sec to reconverge after a link or node failure.

NOTE: Supported in a future release of Indeni Server
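A timer check of this kind could look roughly like the Python sketch below. It assumes that a peer that silently disappears is only declared down when the holddown timer expires, so the holddown value is used as the worst-case detection time. The function names are hypothetical, the "timers bgp KEEPALIVE HOLDDOWN" syntax is assumed, and the defaults and the 60-second limit are the values quoted in the scenario.

```python
import re

DEFAULT_KEEPALIVE = 60     # default keepalive quoted in the scenario (seconds)
DEFAULT_HOLDDOWN = 180     # default holddown quoted in the scenario (seconds)
MAX_DETECTION_SECONDS = 60 # alert limit quoted in the scenario


def bgp_timers(config_text):
    """Return (keepalive, holddown); fall back to defaults if not configured."""
    match = re.search(r"^\s*timers bgp (\d+) (\d+)\s*$", config_text, re.MULTILINE)
    if match:
        return int(match.group(1)), int(match.group(2))
    return DEFAULT_KEEPALIVE, DEFAULT_HOLDDOWN


def slow_failure_detection(config_text, limit=MAX_DETECTION_SECONDS):
    """True if a silent peer failure could go undetected for longer than limit."""
    _keepalive, holddown = bgp_timers(config_text)
    return holddown > limit


# Made-up configuration fragments.
untuned = "router bgp 65000\n  neighbor 192.0.2.1\n"
tuned = "router bgp 65000\n  timers bgp 10 30\n"

for name, config in (("untuned", untuned), ("tuned", tuned)):
    if slow_failure_detection(config):
        print(f"ALERT {name}: BGP holddown is {bgp_timers(config)[1]} sec")
```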

Scenario-7

Problem: BGP routing loops and slow reconvergence have been noticed in case of link or node failure.

Description: The HQ has dual links connected via BGP to the Service Provider. In addition, a backdoor link is used for direct communication with a remote site. Delays and severe service degradation have been noticed at the HQ in case of link or node failure, although the HQ uses multiple links and nodes. Troubleshooting further, you identify an L3 loop between the backdoor link and the dual WAN links. The problem was resolved by applying an AS-PATH list to BGP to prevent routing loops and by tuning the BGP timers (holddown, advertisement-interval and scantime).

How can Indeni R 5.9 proactively inform you of this serious problem with Nexus Series Switches?

Indeni R. 5.9 Server periodically examines whether the BGP best practices have been applied. An alarm is proactively generated along with a remediation message if the BGP timers are too high (e.g. scantime, advertisement-interval, keepalive, holddown), if no BGP security practices have been applied (e.g. TTL security), or if no L3 loop prevention mechanisms (e.g. AS-PATH lists, prefix-lists etc.) are configured.

NOTE: Supported in a future release of Indeni Server

Scenario-8

Problem: Service degradation due to forgotten NX-OS ‘debug’ commands.

Description: An incident occurred at night and Level 3 TAC engineers connected remotely to troubleshoot the issue. Extensive use of NX-OS ‘debug’ commands was required to identify the problem. The issue was resolved, but the engineers forgot to disable the debugging. The next day, service degradation was reported during the peak hour. The forgotten debug statements are processed at a higher priority than other network traffic, so they had degraded the performance of the network device.

How can Indeni R 6.0 inform you of this serious misconfiguration on Nexus Series Switches?

Indeni R. 6.0 Server regularly examines whether any debug command is active on the Nexus switches and alerts if one of the debug mechanisms on a device is enabled. Enabled debugging can be detected only by logging in to the device or by monitoring syslog messages, provided that debug-level logging has been enabled. This important information is not available via SNMP.

Scenario-9

Problem: An access port transitions to the err-disable state and the user cannot access the network.

Description: A customer is in the meeting room and needs to connect multiple users to the internal network. He connects a low-end switch to an available interface in the meeting room. None of the guests connected to the new switch can access the internal network. The customer removes the switch and connects his laptop to the same interface, but still cannot reach the network. He calls IT support to troubleshoot the case. The network engineers check the cable and the laptop's network settings without success. Further troubleshooting reveals that BPDU Guard, one of the features that protect STP from several types of problems or attacks, was activated and put the interface into the err-disable state when the switch was connected to the network.

How can Indeni R 6.0 instantly inform you of this problem on Nexus Series Switches?

Indeni R. 6.0 Server regularly checks for ports that are in the err-disable state. In particular, the script logs in to the Cisco Nexus switch using SSH and retrieves the err-disable status of the interfaces. An alarm is generated along with a remediation message if an interface has moved to the err-disable state.
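The parsing half of such a script might look like the Python sketch below; the SSH collection step is left out. The function name is hypothetical, and the sample is a simplified, made-up interface status table with four columns (port, name, status, reason) in which interface names contain no spaces. Real device output may be laid out differently.

```python
import re


def err_disabled_ports(status_text):
    """Return (port, reason) for every interface whose reason shows err-disable."""
    ports = []
    for line in status_text.splitlines():
        fields = line.split(None, 3)  # port, name, status, free-text reason
        if len(fields) == 4 and re.match(r"(Eth|Po)\d", fields[0]):
            port, _name, _status, reason = fields
            if "errdisable" in reason.lower():
                ports.append((port, reason.strip()))
    return ports


# Simplified, made-up status table reproducing the scenario.
SAMPLE_STATUS = """\
Port      Name    Status   Reason
Eth1/1    --      down     BPDUGuard errDisable
Eth1/2    --      down     Link not connected
Eth1/3    --      up       none
"""

for port, reason in err_disabled_ports(SAMPLE_STATUS):
    print(f"ALERT {port} is err-disabled ({reason})")
```

Returning the reason together with the port lets the alert point directly at the cause, here BPDU Guard.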