VPP-2112

MTU agnostic tunnel substrate using Packet Vectorisation


    • Type: Epic
    • Resolution: Unresolved
    • Priority: Medium
    • Packet Vectorisation

      The SD-WAN/SASE/SSE set of mechanisms uses encapsulation to direct traffic to intermediate nodes in the network. This reduces the effective payload MTU available to end-user applications. Similarly, traffic between intermediate nodes in the solution uses encapsulation for security, to carry metadata, and to direct traffic.

       

      Inside the data centre the MTU is high enough (9K) to support arbitrarily long encapsulation. The path MTU between the end user and the data centre, and between data centres, is typically limited to 1500 bytes.

       

      Today, the problem is dealt with by a combination of:

      • manual configuration of a smaller MTU at the end-user site
      • TCP MSS clamping
      • IP fragmentation
      • Path MTU discovery (sending ICMP Packet Too Big messages back to the host)

       

      These mechanisms all have undesired side effects. ICMP error messages are not robust and are often not acted upon by the application. TCP MSS clamping works fine, but only for TCP. IP fragmentation means the 5-tuple isn't available in every packet, which increases drop probability, affects ECMP, and requires stateful devices to do virtual reassembly.

       
      References:
      https://datatracker.ietf.org/doc/draft-templin-intarea-parcels/

      https://datatracker.ietf.org/doc/html/draft-templin-6man-omni-interface

      https://datatracker.ietf.org/doc/rfc9347/

       
      Traffic between data centres and between end users and a data centre is carried over point-to-point tunnels. The tunnel acts as a pseudo-wire. The idea is to provide a substrate within the tunnel that offers a higher MTU (at least 1500) than the underlying path MTU.

       

      If one thinks of the underlying path MTU as a fixed cell size, e.g. 1280 bytes, then the idea is to use a combination of packet coalescing and tunnel substrate segmentation and reassembly to fill these cells. The hypothesis is that doing so makes it possible to provide a higher inner MTU on the link without increasing the number of packets.

       

      Given that the tunnel substrate runs between two endpoints, there is no issue with mis-identification of fragments as one has with 'normal' IP fragmentation.

       

      The approach is uniquely suited to VPP, where the mechanism, called "Packet Vectorisation", runs on the interface output vector.

       

      If the underlying path MTU is 1280 bytes, the tunnel overhead is 40 bytes, and the output vector consists of 5 packets of sizes {40, 40, 1500, 1500, 40}, then each outer packet carries up to 1240 bytes of inner data, and the 3120 bytes of inner data are sent as 3 tunnel-encapsulated packets of sizes {1280, 1280, 680} (chunk headers not counted).
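
      The cell-filling arithmetic of this example can be sketched as follows. This is a simplified model, not the VPP implementation: it assumes the 40-byte tunnel overhead is paid once per outer packet and ignores the per-chunk headers; the function name `outer_packet_sizes` is hypothetical. Under these assumptions the two full cells carry 2480 of the 3120 inner bytes, and the remaining 640 bytes plus overhead form a 680-byte packet.

```python
def outer_packet_sizes(inner_sizes, path_mtu, overhead):
    """Sizes of the outer packets when inner bytes are packed into cells.

    Simplified model: `overhead` is charged once per outer packet and
    per-chunk headers are ignored (both are assumptions of this sketch).
    """
    capacity = path_mtu - overhead          # inner bytes per outer packet
    if capacity <= 0:
        raise ValueError("overhead must be smaller than the path MTU")
    full_cells, remainder = divmod(sum(inner_sizes), capacity)
    sizes = [path_mtu] * full_cells         # completely filled cells
    if remainder:
        sizes.append(remainder + overhead)  # last, partially filled cell
    return sizes


print(outer_packet_sizes([40, 40, 1500, 1500, 40], path_mtu=1280, overhead=40))
```

      Five inner packets leave as three outer packets, which is the effect the hypothesis above relies on.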

       

      The tunnel sublayer is carried within a UDP packet. UDP is used to ensure ECMP load-balancing: the destination port is fixed and the source port is randomized per vector.

       

      There is an initial tunnel sublayer header containing a 32-bit sequence number, and a flag indicating whether the very first chunk contains a fragment or not.

      For each packet segment there is a chunk header containing a total length field.

      All the chunks except the first and last one are expected to be full packets rather than fragments.

      Whether a given inner chunk contains a full packet is determined by parsing the inner packet, using the IPv4 total length or IPv6 payload length field from its header (subject to change).
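
      A minimal sketch of this framing, for illustration only. The text fixes just the 32-bit sequence number; the 8-bit flags field, the 16-bit chunk length, the flag bit position, and all names below are assumptions of this sketch. The full-packet check reads the IPv4 Total Length field, or the IPv6 Payload Length field plus the 40-byte fixed header.

```python
import struct

# Assumed layouts: only the 32-bit sequence number is stated in the text.
SUBLAYER_HDR = struct.Struct("!IB")  # sequence number, flags (8-bit width assumed)
CHUNK_HDR = struct.Struct("!H")      # chunk length (16-bit width assumed)
FLAG_FIRST_IS_FRAGMENT = 0x01        # hypothetical bit position


def build_payload(seq, first_is_fragment, chunks):
    """Serialise the sublayer header followed by length-prefixed chunks."""
    flags = FLAG_FIRST_IS_FRAGMENT if first_is_fragment else 0
    out = bytearray(SUBLAYER_HDR.pack(seq & 0xFFFFFFFF, flags))
    for chunk in chunks:
        out += CHUNK_HDR.pack(len(chunk)) + chunk
    return bytes(out)


def inner_packet_length(data):
    """Length the inner IP header claims for the whole packet, or None
    if the version is unknown or the header is too short to tell."""
    if not data:
        return None
    version = data[0] >> 4
    if version == 4 and len(data) >= 4:
        return struct.unpack_from("!H", data, 2)[0]       # IPv4 Total Length
    if version == 6 and len(data) >= 6:
        return 40 + struct.unpack_from("!H", data, 4)[0]  # header + Payload Length
    return None


def chunk_is_full_packet(chunk):
    """True when the chunk holds exactly one complete inner IP packet."""
    return inner_packet_length(chunk) == len(chunk)
```

      Because the inner header already states the packet length, the chunk header does not need a separate "more fragments" marker in this model.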

      Implementation:

      There are two packet paths: transmit and receive. The transmit path implements a tunnel interface similar to that of WireGuard and others. On receiving a buffer to send, it adds the chunk header and temporarily holds the block so that additional packets can be appended to it. Once processing of the current vector is done, all pending blocks get the encapsulation header and are sent out, so the added delay is minimal. Each tunnelled packet gets its own sequence number.
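
      The transmit-side coalescing can be sketched as below. This is a hedged illustration rather than the VPP node: it works on Python byte strings instead of VPP buffers, the header widths and flag bit are assumed (only the 32-bit sequence number comes from the description), and `payload_budget` is a hypothetical parameter meaning the bytes left for the tunnel sublayer after the outer IP/UDP headers.

```python
import struct

SUBLAYER_HDR = struct.Struct("!IB")  # 32-bit sequence number, flags (width assumed)
CHUNK_HDR = struct.Struct("!H")      # chunk length (16-bit width assumed)
FLAG_FIRST_IS_FRAGMENT = 0x01        # hypothetical bit position


def vectorise(packets, payload_budget, first_seq):
    """Pack one output vector of inner packets into tunnel payloads.

    Returns (payloads, next_seq). Each payload is at most payload_budget
    bytes; a packet that does not fit is split across payloads.
    """
    per_payload = payload_budget - SUBLAYER_HDR.size
    if not CHUNK_HDR.size < per_payload <= 0xFFFF:
        raise ValueError("payload budget out of range for this sketch")
    payloads, seq = [], first_seq & 0xFFFFFFFF
    body, flags = bytearray(), 0

    for pkt in packets:
        offset = 0
        while offset < len(pkt):
            room = per_payload - len(body) - CHUNK_HDR.size
            if room <= 0:
                # Pending block is full: add the sublayer header and emit it.
                payloads.append(SUBLAYER_HDR.pack(seq, flags) + bytes(body))
                seq = (seq + 1) & 0xFFFFFFFF
                body, flags = bytearray(), 0
                continue
            if not body and offset > 0:
                flags = FLAG_FIRST_IS_FRAGMENT  # block starts mid-packet
            take = min(room, len(pkt) - offset)
            body += CHUNK_HDR.pack(take) + pkt[offset:offset + take]
            offset += take

    if body:
        # End of vector: flush whatever is still pending.
        payloads.append(SUBLAYER_HDR.pack(seq, flags) + bytes(body))
        seq = (seq + 1) & 0xFFFFFFFF
    return payloads, seq
```

      Flushing at the end of the vector, rather than waiting for more traffic, is what keeps the added delay bounded by the processing time of a single vector.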

      The receive path performs the reverse operation. It first checks the received sequence number against the last received sequence number. If the two are not strictly adjacent, any fragment chunks pending reassembly are discarded, and the first chunk of the newly received packet is discarded as well. Otherwise a reassembly attempt is made. If reassembly does not succeed and there are further chunks in the packet, the unreassembled data is discarded and the reassembly state is reset.

      After this, all subsequent inner chunks are decapsulated, with a check that each forms a full packet. The first chunk that does not form a full upper-layer PDU is stored as a fragment in progress and terminates the processing.
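
      The receive-side steps can be sketched as a small state machine. This is one reading of the description, not the actual VPP code. Assumptions: the header widths and flag bit are invented for illustration, the first chunk is treated as a continuation only when the header flag says so, and a partial reassembly is kept when the tunnel packet carries no further chunks (so that an inner packet may span more than two tunnel packets). `Receiver` and `split_chunks` are hypothetical names.

```python
import struct

SUBLAYER_HDR = struct.Struct("!IB")  # 32-bit sequence number, flags (width assumed)
CHUNK_HDR = struct.Struct("!H")      # chunk length (16-bit width assumed)
FLAG_FIRST_IS_FRAGMENT = 0x01        # hypothetical bit position


def inner_packet_length(data):
    """Packet length claimed by the inner IPv4/IPv6 header, or None."""
    if data and data[0] >> 4 == 4 and len(data) >= 4:
        return struct.unpack_from("!H", data, 2)[0]
    if data and data[0] >> 4 == 6 and len(data) >= 6:
        return 40 + struct.unpack_from("!H", data, 4)[0]
    return None


def split_chunks(payload):
    """Parse one tunnel payload into (seq, flags, chunks); reject bad lengths."""
    if len(payload) < SUBLAYER_HDR.size:
        raise ValueError("truncated sublayer header")
    seq, flags = SUBLAYER_HDR.unpack_from(payload, 0)
    pos, chunks = SUBLAYER_HDR.size, []
    while pos < len(payload):
        if pos + CHUNK_HDR.size > len(payload):
            raise ValueError("truncated chunk header")
        (length,) = CHUNK_HDR.unpack_from(payload, pos)
        pos += CHUNK_HDR.size
        if length == 0 or pos + length > len(payload):
            raise ValueError("bad chunk length")
        chunks.append(payload[pos:pos + length])
        pos += length
    return seq, flags, chunks


class Receiver:
    def __init__(self):
        self.last_seq = None  # sequence number of the previous tunnel packet
        self.pending = b""    # fragment in progress

    def receive(self, payload):
        """Process one tunnel payload; return the complete inner packets."""
        seq, flags, chunks = split_chunks(payload)
        adjacent = (self.last_seq is not None
                    and seq == ((self.last_seq + 1) & 0xFFFFFFFF))
        self.last_seq = seq
        head, self.pending = self.pending, b""
        delivered = []
        if flags & FLAG_FIRST_IS_FRAGMENT and chunks:
            first = chunks.pop(0)
            if adjacent and head:  # otherwise both pieces are dropped
                candidate = head + first
                total = inner_packet_length(candidate)
                if total == len(candidate):
                    delivered.append(candidate)
                elif not chunks and (total is None or total > len(candidate)):
                    self.pending = candidate  # still incomplete, keep waiting
        for chunk in chunks:
            if inner_packet_length(chunk) == len(chunk):
                delivered.append(chunk)
            else:
                self.pending = chunk  # fragment in progress ends processing
                break
        return delivered
```

      A gap in the sequence numbers costs at most the one inner packet that was split across the lost tunnel packet, plus whatever the lost packet itself carried.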

      Unit testing:

      Since the unit testing framework uses Scapy, the testing of a new custom packet format protocol requires the development of a new packet dissector / assembler for Scapy. The relevant documentation is at https://scapy.readthedocs.io/en/latest/build_dissect.html

      The code for testing the TX path needs to parse the packets sent by the VPP data plane and verify the sanity of the packet format for various combinations and sizes of payloads. For the RX path, the testing code needs to build various combinations of tunnelled payloads in Scapy, inject them into the VPP data path, and verify that the decapsulation process behaves as expected. There also need to be negative tests, e.g. ones supplying incorrect chunk lengths, to ensure that the tunnel code is robust against accidentally malformed data.

      Security & spoofing:

      The presented protocol design assumes a trusted underlay, insofar as it does not attempt to verify whether a packet is spoofed. In this it is no different from other "simple" tunnelling protocols such as GRE, VXLAN, or Geneve. The protocol is specifically designed to be layered atop any other tunnelling protocol to augment the overall tunnelling properties.

            Unassigned Unassigned
            otroan Ole Trøan