Saturday, November 10, 2012

You have to make the right balance between the convergence time and MTU

Lately I'm getting the impression that Cisco is releasing new products without proper internal testing.

I'm going to talk about two recent examples, ASR1001 and ASR901, devices that are excellent value for money but (as usual) hide limitations that you unfortunately find out about only after exhaustive testing.

ASR1001 is a fine router, a worthy replacement for the 7200, which can be used for various purposes. Of course, like every new platform from every vendor these days, it fully supports jumbo frames, and that's a nice thing. At least you get that impression until you try to use the large MTU for control/routing protocols, where you might run into an equally "nice" surprise.

MTU 1500 (~0.4 sec)

17:22:35.415: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from FULL to DOWN, Neighbor Down: Interface down or detached
17:22:35.539: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from DOWN to INIT, Received Hello
17:22:35.539: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from INIT to 2WAY, 2-Way Received
17:22:35.539: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from 2WAY to EXSTART, AdjOK?
17:22:35.643: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXSTART to EXCHANGE, Negotiation Done
17:22:35.823: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXCHANGE to LOADING, Exchange Done
17:22:35.824: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from LOADING to FULL, Loading Done

MTU 9216 (~48 sec!)

17:43:07.923: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from FULL to DOWN, Neighbor Down: Interface down or detached
17:43:08.001: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from DOWN to INIT, Received Hello
17:43:08.001: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from INIT to 2WAY, 2-Way Received
17:43:08.001: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from 2WAY to EXSTART, AdjOK?
17:43:08.098: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXSTART to EXCHANGE, Negotiation Done
17:43:08.241: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXCHANGE to LOADING, Exchange Done
17:43:55.942: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from LOADING to FULL, Loading Done
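
The difference between the two runs can be read straight off the timestamps. Below is a minimal Python sketch that does this, assuming the log format shown above and that all timestamps fall on the same day. The function name and the shortened sample are illustrative only, not part of any Cisco tool.

```python
import re
from datetime import datetime

# Matches IOS lines such as:
# 17:43:08.241: %OSPF-5-ADJCHG: Process 1, Nbr ... from EXCHANGE to LOADING, Exchange Done
ADJCHG_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{3}): %OSPF-5-ADJCHG: .* "
    r"from (?P<old>\w+) to (?P<new>\w+),"
)


def seconds_to_full(log_text):
    """Seconds from the first transition to DOWN until the next transition to FULL.

    Returns None if the log does not contain both events.
    Assumption: no midnight rollover between the two events.
    """
    down_at = None
    for line in log_text.splitlines():
        match = ADJCHG_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group("ts"), "%H:%M:%S.%f")
        if match.group("new") == "DOWN" and down_at is None:
            down_at = stamp
        elif match.group("new") == "FULL" and down_at is not None:
            return (stamp - down_at).total_seconds()
    return None


# Three lines taken from the MTU 9216 log above.
SAMPLE = """\
17:43:07.923: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from FULL to DOWN, Neighbor Down: Interface down or detached
17:43:08.241: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXCHANGE to LOADING, Exchange Done
17:43:55.942: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from LOADING to FULL, Loading Done
"""

print(seconds_to_full(SAMPLE))
```

Almost all of that time is spent between LOADING and FULL, which is where the LSAs themselves are requested and transferred.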

While trying to use MTU 9216 in a test environment (the same issue was observed even with smaller MTUs), we ran into an interesting issue during the exchange of large OSPF databases between ASR1001s (running 15.1S and 15.2S). Packets carrying LSAs were being dropped inside the router, because the underlying driver of LSMPI (Linux Shared Memory Punt Interface) could not handle such packets at high rates. These large packets were internally fragmented into smaller ones (512 bytes each), transmitted and then reassembled, which multiplied the pps rate between the involved subsystems (by up to 9216/512 = 18 times).
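
To make the arithmetic concrete, here is a small Python sketch of the multiplication effect. The 512-byte piece size comes from the description above. The external packet rate is a hypothetical number chosen only for illustration, and any internal headers are ignored.

```python
import math

# From the text: punted packets are split internally into 512-byte pieces.
FRAGMENT_SIZE = 512


def internal_fragments(packet_bytes, fragment_size=FRAGMENT_SIZE):
    """Number of internal pieces needed to carry one punted packet."""
    return math.ceil(packet_bytes / fragment_size)


def internal_pps(external_pps, packet_bytes):
    """Internal piece rate produced by a given rate of punted packets."""
    return external_pps * internal_fragments(packet_bytes)


# Hypothetical rate of full-size OSPF packets arriving from neighbors.
HYPOTHETICAL_EXTERNAL_PPS = 1000

for mtu in (1500, 9216):
    print(mtu, internal_fragments(mtu), internal_pps(HYPOTHETICAL_EXTERNAL_PPS, mtu))
```

With full-size packets, the same external rate costs six times more internal pieces at MTU 9216 (18 per packet) than at MTU 1500 (3 per packet).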

To be more exact, packets punted from the ESP to the RP are received by the Linux kernel of the RP. The Linux kernel then sends those packets to the IOSD process through LSMPI, as you can see in the following diagram.


So, this is the complete punt path on the ASR1000 Series router:

QFP <==> RP Linux Kernel <==> LSMPI <==> Fast-Path Thread <==> Cisco IOS Thread

Since there is a built-in limit on the pps rate that LSMPI can handle, fragmenting large packets internally raises the internal pps rate, and some sub-packets get discarded. That leads to complete packets being dropped and then retransmitted by the neighboring routers, which in turn leads to longer convergence times... if convergence can be accomplished at all after such packet losses (there were cases where there was no convergence even after minutes, or the adjacency seemed FULL but the RIB didn't have any entries from the LSADB). Things can get messier if you also run BGP (with path MTU discovery) and there is a large number of updates that need to be processed. At some point in the past, IOS included code that internally retransmitted only the lost 512-byte packets, so OSPF never noticed it was losing packets, but because it caused other issues (probably overloading LSMPI even more) it was removed.
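
A rough way to see why losing sub-packets hurts large packets disproportionately: if each internal piece were dropped independently with the same probability, a packet survives only when every piece survives. The sketch below works under that simplifying assumption (real drops under overload are likely bursty, so this is not a model of the actual LSMPI behavior), and the 1% per-piece loss is a made-up figure.

```python
def whole_packet_loss(fragment_loss, fragments):
    """Probability that at least one of `fragments` pieces is dropped.

    Assumption: each piece is dropped independently with probability
    `fragment_loss`, and one lost piece loses the whole packet.
    """
    if not 0.0 <= fragment_loss <= 1.0:
        raise ValueError("fragment_loss must be between 0 and 1")
    if fragments < 1:
        raise ValueError("fragments must be at least 1")
    return 1.0 - (1.0 - fragment_loss) ** fragments


# Hypothetical per-piece drop probability, for illustration only.
HYPOTHETICAL_FRAGMENT_LOSS = 0.01

for fragments in (3, 18):  # roughly MTU 1500 vs MTU 9216 with 512-byte pieces
    loss = whole_packet_loss(HYPOTHETICAL_FRAGMENT_LOSS, fragments)
    print(fragments, round(loss, 4))
```

Under these assumptions an 18-piece packet is several times more likely to be lost than a 3-piece one, and every such loss costs an OSPF retransmission.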

So this leads to the question: at what layer should the router internally handle control-plane packet loss? As low as possible, in order to hide it from the actual protocol, or should everything be left to the protocol itself?

You can use the following command in order to check for issues in the LSMPI path (look out for "Device xmit fail").

ASR1001#show platform software infrastructure lsmpi

LSMPI Driver stat ver: 3

Packets:
         In: 17916
        Out: 4713

Rings:
         RX: 2047 free    0    in-use    2048 total
         TX: 2047 free    0    in-use    2048 total
     RXDONE: 2047 free    0    in-use    2048 total
     TXDONE: 2046 free    1    in-use    2048 total

Buffers:
         RX: 6877 free    1317 in-use    8194 total

Reason for RX drops (sticky):
     Ring full        : 0
     Ring put failed  : 0
     No free buffer   : 0
     Receive failed   : 0
     Packet too large : 0
     Other inst buf   : 0
     Consecutive SOPs : 0
     No SOP or EOP    : 0
     EOP but no SOP   : 0
     Particle overrun : 0
     Bad particle ins : 0
     Bad buf cond     : 0
     DS rd req failed : 0
     HT rd req failed : 0
Reason for TX drops (sticky):
     Bad packet len   : 0
     Bad buf len      : 0
     Bad ifindex      : 0
     No device        : 0
     No skbuff        : 0
     Device xmit fail : 103
     Device xmit rtry : 0
     Tx Done ringfull : 0
     Bad u->k xlation : 0
     No extra skbuff  : 0
     Consecutive SOPs : 0
     No SOP or EOP    : 0
     EOP but no SOP   : 0
     Particle overrun : 0
     Other inst buf   : 0
...
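
If you collect this output regularly, a few lines of Python can flag the non-zero drop counters for you. This is a minimal sketch that assumes the text layout shown above ("Reason for RX/TX drops" headers followed by "name : value" lines). The function name is illustrative, and the layout may differ between software releases.

```python
import re

SECTION_RE = re.compile(r"^Reason for (?P<direction>RX|TX) drops")
COUNTER_RE = re.compile(r"^\s*(?P<name>[^:]+?)\s*:\s*(?P<value>\d+)\s*$")


def nonzero_drops(output):
    """Return {(direction, counter_name): value} for every non-zero drop counter."""
    section = None
    found = {}
    for line in output.splitlines():
        header = SECTION_RE.match(line.strip())
        if header:
            section = header.group("direction")
            continue
        counter = COUNTER_RE.match(line)
        # Counters seen before the first "Reason for ..." header are ignored.
        if section and counter and int(counter.group("value")) > 0:
            found[(section, counter.group("name"))] = int(counter.group("value"))
    return found


# Shortened excerpt of the output above.
SAMPLE = """\
Packets:
         In: 17916
        Out: 4713

Reason for RX drops (sticky):
     Ring full        : 0
     No free buffer   : 0
Reason for TX drops (sticky):
     No skbuff        : 0
     Device xmit fail : 103
     Bad u->k xlation : 0
"""

for (direction, name), value in nonzero_drops(SAMPLE).items():
    print(direction, name, value)
```

Since the counters are marked as sticky, comparing two snapshots taken some time apart is more telling than a single reading.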

Keep in mind that ICMP echoes cannot be used to verify this behavior, because ICMP replies are handled by the ESP/QFP, so you won't notice this issue.

Note: Cisco has an excellent doc describing all cases of packet drops on the ASR1k platform here.
Also, there is a relevant bug (CSCtz53398) that is supposed to provide a workaround.

Cisco's answer? "You have to make the right balance between the convergence time and MTU"!

I tend to agree with them, but until now I had the impression that a larger MTU would lower the convergence time (as long as the CPU could keep up). Well, time to reconsider...


 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Greece License.