<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">

<rfc category="info" docName="sp-tcp-mapping-01">

  <front>

    <title abbrev="TCP mapping for SPs">
    TCP Mapping for Scalability Protocols
    </title>

    <author fullname="Martin Sustrik" initials="M." role="editor"
            surname="Sustrik">
      <address>
        <email>sustrik@250bpm.com</email>
      </address>
    </author>

    <date month="March" year="2014" />

    <area>Applications</area>
    <workgroup>Internet Engineering Task Force</workgroup>

    <keyword>TCP</keyword>
    <keyword>SP</keyword>

    <abstract>
      <t>This document defines the TCP mapping for scalability protocols.
         The main purpose of the mapping is to turn a stream of bytes
         into a stream of messages. The mapping also performs some
         checks during the connection establishment phase.</t>
    </abstract>

  </front>

  <middle>

    <section title = "Underlying protocol">

      <t>This mapping should be layered directly on top of TCP.</t>

      <t>There's no fixed TCP port to use for the communication. Instead, port
         numbers are assigned to individual services by the user.</t>

    </section>

    <section title = "Connection initiation">

      <t>As soon as the underlying TCP connection is established, both parties
         MUST send the protocol header (described in detail below) immediately.
         Both endpoints MUST then wait for the protocol header from the peer
         before proceeding.</t>

      <t>The goal of this design is to keep connection establishment as
         fast as possible by avoiding any additional protocol handshakes,
         i.e. network round-trips. Specifically, the protocol headers
         can be bundled with the last packets of the TCP handshake
         and thus have virtually zero performance impact.</t>
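
      <t>As a non-normative illustration, the exchange described above could
         be sketched in Python as follows. The helper name exchange_headers
         is hypothetical, and the header is treated here as an opaque 8-byte
         value. The sketch sends the local header without waiting for the
         peer and only then blocks until the peer's complete header has
         arrived.</t>

      <figure>
        <artwork><![CDATA[
```python
import socket

HEADER_SIZE = 8  # length of the protocol header in bytes


def exchange_headers(sock, local_header):
    """Send the local header at once, then wait for the peer's header.

    Hypothetical helper: 'sock' is assumed to be a connected,
    blocking TCP (or TCP-like) socket.
    """
    # Send first, without waiting for anything from the peer.
    sock.sendall(local_header)

    # TCP is a byte stream, so the header may arrive in pieces.
    received = b""
    while len(received) != HEADER_SIZE:
        chunk = sock.recv(HEADER_SIZE - len(received))
        if not chunk:
            raise ConnectionError("peer closed before sending a header")
        received += chunk
    return received
```
]]></artwork>
      </figure>

      <t>Because neither side waits for the other before sending, both
         headers can be in flight at the same time and no extra round-trip
         is introduced.</t>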

      <t>The protocol header is 8 bytes long and looks like this:</t>

      <figure>
        <artwork>
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      0x00     |      0x53     |      0x50     |    version    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             type              |           reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        </artwork>
      </figure>

      <t>The first four bytes of the protocol header are used to make sure that
         the peer's protocol is compatible with the protocol used by the local
         endpoint. Keep in mind that this protocol is designed to run on an
         arbitrary TCP port, so the standard compatibility check -- if it runs
         on port X and protocol Y is assigned to X by IANA, it speaks protocol Y
         -- does not apply. We have to use an alternative mechanism.</t>

      <t>The first four bytes of the protocol header MUST be set to 0x00, 0x53,
         0x50 and 0x00 respectively. If the protocol header received from the
         peer differs, the TCP connection MUST be closed immediately.</t>

      <t>The fact that the first byte of the protocol header is binary zero
         rules out any text-based protocols that were accidentally connected
         to the endpoint. The subsequent two bytes make the check even more
         rigorous. At the same time they can be used as a debugging hint to
         indicate that the connection is supposed to use one of the scalability
         protocols -- the ASCII representation of these bytes is 'SP', which
         can be easily spotted when capturing network traffic. Finally,
         the fourth byte rules out any incompatible versions of this
         protocol.</t>
      
      <t>The fifth and sixth bytes of the header form a 16-bit unsigned integer
         in network byte order representing the type of SP endpoint on the
         layer above. The value SHOULD NOT be interpreted by the mapping;
         rather, the interpretation should be delegated to the scalability
         protocol above the mapping. For informational purposes, it should be
         noted that the field encodes information such as SP protocol ID,
         protocol version and the role of the endpoint within the protocol.
         Individual values are assigned by IANA.</t>

      <t>Finally, the last two bytes of the protocol header are reserved for
         future use and must be set to binary zeroes. If the protocol header
         from the peer contains anything other than zeroes in this field, the
         implementation MUST close the underlying TCP connection.</t>
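
      <t>The header layout and the checks described above could be
         expressed, purely as a non-normative sketch, with Python's struct
         module. The names encode_header and decode_header are hypothetical,
         and the endpoint type is passed through as an opaque 16-bit number
         because its interpretation belongs to the layer above.</t>

      <figure>
        <artwork><![CDATA[
```python
import struct

# 0x00, 'S', 'P', version 0x00, 16-bit type, 16-bit reserved field.
# "!" selects network byte order and disables padding.
HEADER_FORMAT = "!4sHH"
HEADER_PREFIX = b"\x00SP\x00"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 8 bytes


def encode_header(endpoint_type):
    """Build the protocol header for a 16-bit endpoint type."""
    return struct.pack(HEADER_FORMAT, HEADER_PREFIX, endpoint_type, 0)


def decode_header(data):
    """Return the peer's endpoint type.

    Raises ValueError when the header is not acceptable, in which
    case the caller is expected to close the TCP connection.
    """
    if len(data) != HEADER_SIZE:
        raise ValueError("protocol header must be exactly 8 bytes")
    prefix, endpoint_type, reserved = struct.unpack(HEADER_FORMAT, data)
    if prefix != HEADER_PREFIX:
        raise ValueError("not an SP peer, or incompatible version")
    if reserved != 0:
        raise ValueError("reserved field is not zero")
    return endpoint_type
```
]]></artwork>
      </figure>

      <t>For example, encode_header(16) yields the bytes 0x00 0x53 0x50 0x00
         0x00 0x10 0x00 0x00. The value 16 is an arbitrary placeholder used
         for illustration, not a value defined by this document.</t>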

    </section>

    <section title = "Message delimitation">

      <t>Once the protocol header is accepted, the endpoint can send and
         receive messages. A message is an arbitrarily large chunk of binary
         data. Every message starts with a 64-bit unsigned integer in network
         byte order representing the size, in bytes, of the remaining part of
         the message. Thus, the message payload can be from 0 to 2^64-1 bytes
         long. The payload of the specified size follows directly after the
         size field:</t>

      <figure>
        <artwork>
+------------+-----------------+
| size (64b) |     payload     |
+------------+-----------------+ 
        </artwork>
      </figure>
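
      <t>A minimal, non-normative sketch of this framing in Python is shown
         below. The function names are hypothetical, and the stream argument
         is assumed to be any binary file-like object with a read() method.
         A real implementation would likely also limit the size it is
         willing to buffer, which this sketch does not attempt.</t>

      <figure>
        <artwork><![CDATA[
```python
import io
import struct

SIZE_FORMAT = "!Q"  # 64-bit unsigned integer, network byte order
SIZE_LENGTH = struct.calcsize(SIZE_FORMAT)  # 8 bytes


def encode_message(payload):
    """Prefix the payload with its size in bytes."""
    return struct.pack(SIZE_FORMAT, len(payload)) + payload


def read_exactly(stream, count):
    """Read exactly 'count' bytes or raise EOFError."""
    chunks = []
    remaining = count
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream ended in the middle of a message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_message(stream):
    """Read one size-prefixed message and return its payload."""
    (size,) = struct.unpack(SIZE_FORMAT, read_exactly(stream, SIZE_LENGTH))
    return read_exactly(stream, size)


# Two messages back to back on the same byte stream,
# the second one with an empty payload.
wire = io.BytesIO(encode_message(b"hello") + encode_message(b""))
first = read_message(wire)
second = read_message(wire)
```
]]></artwork>
      </figure>

      <t>With a connected socket, a file-like object such as the one
         returned by Python's socket.makefile("rb") could play the role of
         the stream.</t>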

      <t>It may seem that a 64-bit message size is excessive and consumes too
         much valuable bandwidth, especially given that most scenarios call for
         relatively small messages, on the order of bytes or kilobytes.</t>

      <t>A variable-length field may seem like a better solution. However, our
         experience is that a variable-length size field doesn't provide any
         performance benefit in the real world.</t>

      <t>For large messages, the 64 bits used by the field form a negligible
         portion of the message and the performance impact is not even
         measurable.</t>
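
      <t>To put a number on the share taken by the size field, the following
         non-normative Python sketch computes the fraction of the bytes sent
         by this mapping that the field occupies. The function name is
         hypothetical, and the calculation deliberately ignores TCP and IP
         header overhead.</t>

      <figure>
        <artwork><![CDATA[
```python
SIZE_FIELD_BYTES = 8  # the 64-bit size field


def size_field_overhead(payload_bytes):
    """Fraction of a message's bytes taken by the size field."""
    return SIZE_FIELD_BYTES / (SIZE_FIELD_BYTES + payload_bytes)
```
]]></artwork>
      </figure>

      <t>For a payload of 1 MiB the fraction is below 0.001%, whereas for an
         8-byte payload it is exactly one half.</t>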

      <t>For small messages, the overall throughput is heavily CPU-bound, never
         I/O-bound. In other words, the CPU processing associated with each
         individual message limits the message rate in such a way that the
         network bandwidth limit is never reached. We expect this to be even
         more true in the future: network bandwidth is going to grow faster
         than CPU speed. All in all, some performance improvement could be
         achieved using a variable-length size field with huge streams of very
         small messages on very slow networks. We consider that scenario to be
         a corner case that's almost never seen in the real world.</t>

      <t>On the other hand, it may be argued that limiting the messages to
         2^64-1 bytes can prove insufficient in the future. However,
         extrapolating the message size growth seen in the past indicates
         that a 64-bit size should be sufficient for the expected lifetime of
         the protocol (30-50 years).</t>

      <t>Finally, it may be argued that chaining an arbitrary number of smaller
         data chunks can yield unlimited message size. The downside of this
         approach is that the message payload cannot be contiguous on the wire;
         it has to be interleaved with chunk headers. That typically requires
         one more copy of the data in the receiving part of the stack, which
         may be a problem for very large messages.</t>

    </section>

    <section title = "Note on multiplexing">

      <t>Several modern general-purpose protocols built on top of TCP provide
         multiplexing capability, i.e. a way to transfer multiple independent
         message streams over a single TCP connection. This mapping deliberately
         opts to provide no such functionality. Instead, independent message
         streams should be implemented as different TCP connections. This
         section provides the rationale for the design decision.</t>

      <t>First of all, multiplexing is typically added to protocols to avoid
         the overhead of establishing additional TCP connections. This need
         arises in environments where the TCP connections are extremely
         short-lived, often used only for a single handshake between the peers.
         Scalability protocols, on the other hand, use long-lived
         connections, which makes the feature unnecessary.</t>

      <t>At the same time, multiplexing on top of TCP, while doable, is inferior
         to the real multiplexing done using multiple TCP connections.
         Specifically, TCP's head-of-line blocking means that a single
         lost TCP packet will hinder delivery for all the streams on top of
         the connection, not just the one the missing packet belonged to.</t>

      <t>In addition, implementing multiplexing is a non-trivial matter
         and results in increased development cost, more bugs and a larger
         attack surface.</t>

      <t>Finally, for multiplexing to work properly, large messages have to be
         split into smaller data chunks interleaved with chunk headers, which
         makes the receiving stack less efficient, as already discussed
         above.</t>

    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This memo includes no request to IANA.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>The mapping isn't intended to provide any security beyond what TCP
         itself provides. DoS concerns are addressed within
         the specification.</t>
    </section>

  </middle>

</rfc>