<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<rfc category="info" docName="sp-request-reply-01">
<front>
<title abbrev="Request/Reply SP">
Request/Reply Scalability Protocol
</title>
<author fullname="Martin Sustrik" initials="M." role="editor"
surname="Sustrik">
<address>
<email>sustrik@250bpm.com</email>
</address>
</author>
<date month="August" year="2013" />
<area>Applications</area>
<workgroup>Internet Engineering Task Force</workgroup>
<keyword>Request</keyword>
<keyword>Reply</keyword>
<keyword>REQ</keyword>
<keyword>REP</keyword>
<keyword>stateless</keyword>
<keyword>service</keyword>
<keyword>SP</keyword>
<abstract>
<t>This document defines a scalability protocol used for distributing
processing tasks among an arbitrary number of stateless processing nodes
and returning the results of the processing.</t>
</abstract>
</front>
<middle>
<section title = "Introduction">
<t>One of the most common problems in distributed applications is how to
delegate work to another processing node and get the result back to
the original node. In other words, the goal is to utilise the CPU
power of a remote node.</t>
<t>There's a wide range of RPC systems addressing the problem. However,
instead of relying on a simple RPC algorithm, we will aim at solving a
more general version of the problem. First, we want to issue processing
requests from multiple clients, not just a single one. Second, we want
to distribute the tasks to any number of processing nodes instead of a
single one, so that the processing can be scaled up by adding new
processing nodes as necessary.</t>
<t>Solving the generalised problem requires that the algorithm
executing the task in question -- also known as the "service" -- is
stateless.</t>
<t>To put it simply, the service is called "stateless" when there's no
way for the user to distinguish whether a request was processed by
one instance of the service or another one.</t>
<t>So, for example, a service which accepts two integers and multiplies
them is stateless. A request for "2x2" is always going to produce "4",
no matter which instance of the service has computed it.</t>
<t>A service that accepts empty requests and produces the number
of requests processed so far (1, 2, 3 etc.), on the other hand, is
not stateless. To prove that, you can run two instances of the service.
The first reply, no matter which instance produces it, is going to be 1.
The second reply, though, is going to be either 2 (if processed by the
same instance as the first one) or 1 (if processed by the other
instance). You can thus distinguish which instance produced the result
and, according to the definition, the service is not stateless.</t>
<t>Despite the name, being "stateless" doesn't mean that the service has
no state at all. Rather it means that the service doesn't retain any
business-logic-related state in-between processing two subsequent
requests. The service is, of course, allowed to have state while
processing a single request. It can also have state that is unrelated
to its business logic, say statistics about the processing that are
used for administrative purposes and never returned to the clients.</t>
<t>Also note that "stateless" doesn't necessarily mean "fully
deterministic". For example, a service that generates random numbers is
non-deterministic. However, the client, after receiving a new random
number, cannot tell which instance has produced it; thus, the service
can be considered stateless.</t>
<t>While stateless services are often implemented by passing the entire
state inside the request, they are not required to do so. Especially
when the state is large, passing it around in each request may be
impractical. In such cases, it's typically just a reference to the
state that's passed in the request, such as an ID or a path. The state
itself can then be retrieved by the service from a shared database,
a network file system or similar storage mechanism.</t>
<t>Requiring services to be stateless serves a specific purpose.
It allows for using any number of service instances to handle
the processing load. After all, the client won't be able to tell the
difference between replies from instance A and replies from instance B.
You can even start new instances on the fly and get away with it.
The client still won't be able to tell the difference. In other
words, statelessness is a prerequisite to make your service cluster
fully scalable.</t>
<t>Once it is ensured that the service is stateless, there are several
topologies that a request/reply system can form. What follows are
the most common:
<list style = "numbers">
<t>One client sends a request to one server and gets a reply.
The common RPC scenario.</t>
<t>Many clients send requests to one server and get replies. The
classic client/server model. Think of a database server and
database clients. Alternatively, think of a messaging broker and
messaging clients.</t>
<t>One client sends requests to many servers and gets replies.
The load-balancer model. Think of HTTP load balancers.</t>
<t>Many clients send requests to be processed by many servers.
The "enterprise service bus" model. In the simplest case the bus
can be implemented as a simple hub-and-spokes topology. In complex
cases the bus can span multiple physical locations or multiple
organisations with intermediate nodes at the boundaries connecting
different parts of the topology.</t>
</list>
</t>
<t>In addition to distributing tasks to processing nodes, the
request/reply model comes with full end-to-end reliability. The
reliability guarantee can be defined as follows: As long as the client
is alive and there's at least one server accessible from the client,
the task will eventually get processed and the result will be delivered
back to the client.</t>
<t>End-to-end reliability is achieved, similarly to TCP, by re-sending
the request if the client believes the original instance of the request
has failed. Typically, a request is believed to have failed when there's
no reply received within a specified time.</t>
<t>Note that, unlike with TCP, the reliability algorithm is resistant to
a server failure. Even if a server fails while processing a request, the
request will be re-sent and eventually processed by a different
instance of the server.</t>
<t>As can be seen from the above, one request may be processed multiple
times. For example, a reply may be lost on its way back to the client.
The client will assume that the request was not processed yet and will
resend it, thus causing duplicate execution of the task.</t>
<t>Some applications may want to prevent duplicate execution of tasks. It
often turns out that hardening such applications to be idempotent is
relatively easy as they already possess the tools to do so. For
example, a payment processing server already has access to a shared
database which it can use to verify that the payment with the specified
ID was not yet processed.</t>
<t>On the other hand, many applications don't care about occasional
duplicate processing of tasks. Therefore, the request/reply protocol
does not require the service to be idempotent. Instead, the idempotence
issue is left to the user to decide on.</t>
<t>Finally, it should be noted that this specification discusses several
features that are of little use in simple topologies and are rather
aimed at large, geographically or organisationally distributed
topologies. Features like channel prioritisation and loop avoidance
fall into this category.</t>
</section>
  141. <section title = "Underlying protocol">
  142. <t>The request/reply protocol can be run on top of any SP mapping,
  143. such as, for example, <xref target='SPoverTCP'>SP TCPmapping</xref>.
  144. </t>
  145. <t>Also, given that SP protocols describe the behaviour of entire
  146. arbitrarily complex topology rather than of a single node-to-node
  147. communication, several underlying protocols can be used in parallel.
  148. For example, a client may send a request via WebSocket, then, on the
  149. edge of the company network an intermediary node may retransmit it
  150. using TCP etc.</t>
  151. <figure>
  152. <artwork>
  153. +---+ WebSocket +---+ TCP +---+
  154. | |-------------| |-----------| |
  155. +---+ +---+ +---+
  156. | |
  157. +---+ IPC | | SCTP +---+ DCCP +---+
  158. | |---------+ +--------| |-----------| |
  159. +---+ +---+ +---+
  160. </artwork>
  161. </figure>
  162. </section>
<section title = "Overview of the algorithm">
<t>The request/reply protocol defines two different endpoint types:
the requester or REQ (the client) and the replier or REP (the
service).</t>
<t>A REQ endpoint can be connected only to a REP endpoint. A REP
endpoint can be connected only to a REQ endpoint. If the underlying
protocol indicates that there's an attempt to create a channel to an
incompatible endpoint, the channel MUST NOT be used. In the case of
the TCP mapping, for example, the underlying TCP connection MUST
be closed.</t>
<t>When creating more complex topologies, REQ and REP endpoints are
paired in the intermediate nodes to form a forwarding component, a
so-called "device". The device receives requests from the REP endpoint
and forwards them to the REQ endpoint. At the same time it receives
replies from the REQ endpoint and forwards them to the REP
endpoint:</t>
<figure>
<artwork>
           --- requests --&gt;
+-----+   +-----+-----+   +-----+-----+   +-----+
|     |--&gt;|     |     |--&gt;|     |     |--&gt;|     |
| REQ |   | REP | REQ |   | REP | REQ |   | REP |
|     |&lt;--|     |     |&lt;--|     |     |&lt;--|     |
+-----+   +-----+-----+   +-----+-----+   +-----+
           &lt;-- replies ---
</artwork>
</figure>
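<t>To illustrate, the core of such a device can be sketched in a few
lines of Python. The sketch is illustrative only and not part of this
specification; the recv() and send() methods stand in for whatever API
the actual REP and REQ endpoint implementations expose:</t>
<figure>
<artwork>
def device_step(rep, req):
    # Forward one request from the REP side towards the services...
    request = rep.recv()
    if request is not None:
        req.send(request)
    # ...and one reply from the REQ side back towards the clients.
    reply = req.recv()
    if reply is not None:
        rep.send(reply)

def device(rep, req):
    while True:
        device_step(rep, req)
</artwork>
</figure>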
<t>Using devices, arbitrarily complex topologies can be built. The rest
of this section explains how requests are routed through a topology
towards processing nodes and how replies are routed back from
processing nodes to the original clients, as well as how
reliability is achieved.</t>
<t>The idea for routing requests is to implement a simple coarse-grained
scheduling algorithm based on the pushback capabilities of the
underlying transport.</t>
<t>The algorithm works by interpreting pushback on a particular channel
as "the part of the topology accessible through this channel is busy at
the moment and doesn't accept any more requests."</t>
<t>Thus, when a node is about to send a request, it can choose to send
it only to one of the channels that don't report pushback at the
moment. To implement approximately fair distribution of the workload,
the node chooses a channel from that pool using the round-robin
algorithm.</t>
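<t>A minimal sketch of such channel selection in Python follows; it is
illustrative only, with the writable() method standing in for the
transport's pushback indication:</t>
<figure>
<artwork>
def pick_channel(channels, last):
    # Round-robin over the channels, skipping those that report
    # pushback. Returns the chosen channel and its index, or None
    # if all parts of the topology are busy at the moment.
    n = len(channels)
    for step in range(1, n + 1):
        i = (last + step) % n
        if channels[i].writable():
            return channels[i], i
    return None, last
</artwork>
</figure>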
<t>As for delivering replies back to the clients, it should be understood
that the client may not be directly accessible (say using TCP/IP) from
the processing node. It may be behind a firewall, have no static IP
address etc. Furthermore, the client and the processing node may not
even speak the same transport protocol -- imagine a client connecting to
the topology using WebSockets and a processing node via SCTP.</t>
<t>Given the above, it becomes obvious that the replies must be routed
back through the existing topology rather than directly. In fact, a
request/reply topology may be thought of as an overlay network on
top of the underlying transport mechanisms.</t>
<t>As for routing replies within the request/reply topology, the
protocol is designed in such a way that each reply contains the whole
routing path, rather than containing just the address of the
destination node, as is the case with, for example, TCP/IP.</t>
<t>The downside of the design is that replies are a little bit longer
and that if an intermediate node gets restarted, all the requests
that were routed through it will fail to complete and will have to be
resent by the request/reply end-to-end reliability mechanism.</t>
<t>The upside, on the other hand, is that the nodes in the topology don't
have to maintain any routing tables besides the simple table of
adjacent channels along with their IDs. There's also no need for any
additional protocols for distributing routing information within
the topology.</t>
<t>The most important reason for adopting the design, though, is that
there's no propagation delay and any node becomes accessible
immediately after it is started. Given that some nodes in the topology
may be extremely short-lived, this is a crucial requirement. Imagine
a database client that sends a query, reads the result and terminates.
It makes no sense to delay the whole process until the routing tables
are synchronised between the client and the server.</t>
<t>The algorithm thus works as follows: When a request is routed from the
client to the processing node, every REP endpoint determines which
channel it was received from and adds the ID of the channel to the
request. Thus, when the request arrives at the ultimate processing
node, it already contains a full backtrace stack, which in turn contains
all the info needed to route a message back to the original client.</t>
<t>After processing the request, the processing node attaches the
backtrace stack from the request to the reply and sends it back
to the topology. At that point every REP endpoint can check the
traceback and determine which channel it should send the reply to.</t>
<t>In addition to routing, the request/reply protocol takes care of
reliability, i.e. it ensures that every request will be eventually
processed and the reply will be delivered to the user, even when
facing failures of processing nodes, intermediate nodes and network
infrastructure.</t>
<t>Reliability is achieved by simply re-sending the request if the reply
is not received within a certain timeframe. To make that algorithm
work flawlessly, the client has to be able to filter out any stray
replies (delayed replies to requests that a reply has already been
received for).</t>
<t>The client thus adds a unique request ID to the request. The ID gets
copied from the request to the reply by the processing node. When the
reply gets back to the client, it can simply check whether the request
in question is still being processed and, if not, it can ignore
the reply.</t>
<t>To implement all the functionality described above, messages (both
requests and replies) have the following format:</t>
<figure>
<artwork>
+-+------------+-+------------+   +-+------------+-------------+
|0| Channel ID |0| Channel ID |...|1| Request ID |   payload   |
+-+------------+-+------------+   +-+------------+-------------+
</artwork>
</figure>
<t>The payload of the message is preceded by a stack of 32-bit tags. The
most significant bit of each tag is set to 0, except for the very last
tag, where it is set to 1. That allows the algorithm to find out where
the tags end and where the message payload begins.</t>
<t>As for the remaining 31 bits, they are either a request ID (in the
last tag) or a channel ID (in all the remaining tags). The first channel
ID is added and processed by the REP endpoint closest to the processing
node. The last channel ID is added and processed by the REP endpoint
closest to the client.</t>
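<t>A short Python sketch (illustrative only; the helper names are
invented for the example) shows how the tag stack is built as a request
travels from the client towards the service, using the values from the
example pictured below:</t>
<figure>
<artwork>
import struct

MSB = 0x80000000   # set only on the bottom-of-stack (request ID) tag

def make_request(request_id, payload):
    # Added by the client: request ID tag with the MSB set to 1.
    return struct.pack('&gt;I', MSB | (request_id &amp; 0x7fffffff)) + payload

def push_channel_id(message, channel_id):
    # Added by each REP endpoint on the way: channel ID tag, MSB 0.
    return struct.pack('&gt;I', channel_id &amp; 0x7fffffff) + message

msg = make_request(823, b'Hello')   # 1|823|Hello
msg = push_channel_id(msg, 299)     # 0|299|1|823|Hello
msg = push_channel_id(msg, 446)     # 0|446|0|299|1|823|Hello
</artwork>
</figure>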
<t>The following picture shows an example of a request saying "Hello"
being routed from the client through two intermediate nodes to the
processing node and the reply "World" being routed back. It shows
what messages are passed over the network at each step of the
process:</t>
<figure>
<artwork>
                      client
              Hello  |                 World
                     |   +-----+   ^
                     |   | REQ |   |
                     V   +-----+   |
        1|823|Hello  |                 1|823|World
                     |   +-----+   ^
                     |   | REP |   |
                     |   +-----+   |
                     |   | REQ |   |
                     V   +-----+   |
  0|299|1|823|Hello  |                 0|299|1|823|World
                     |   +-----+   ^
                     |   | REP |   |
                     |   +-----+   |
                     |   | REQ |   |
                     V   +-----+   |
0|446|0|299|1|823|Hello |              0|446|0|299|1|823|World
                     |   +-----+   ^
                     |   | REP |   |
                     V   +-----+   |
              Hello  |                 World
                     service
</artwork>
</figure>
</section>
<section title = "Hop-by-hop vs. End-to-end">
<t>All endpoints implement so-called "hop-by-hop" functionality. It's
the functionality concerned with sending messages to the immediately
adjacent components and receiving messages from them.</t>
<t>In addition to that, the endpoints on the edge of the topology
implement so-called "end-to-end" functionality that is concerned
with issues such as, for example, reliability.</t>
<figure>
<artwork>
                   end to end
   +-----------------------------------------+
   |                                         |
+-----+   +-----+-----+   +-----+-----+   +-----+
|     |--&gt;|     |     |--&gt;|     |     |--&gt;|     |
| REQ |   | REP | REQ |   | REP | REQ |   | REP |
|     |&lt;--|     |     |&lt;--|     |     |&lt;--|     |
+-----+   +-----+-----+   +-----+-----+   +-----+
   |         |     |         |     |         |
   +---------+     +---------+     +---------+
   hop by hop      hop by hop      hop by hop
</artwork>
</figure>
<t>To make an analogy with the TCP/IP stack, IP provides hop-by-hop
functionality, i.e. routing of the packets to the adjacent node,
while TCP implements end-to-end functionality such as resending of
lost packets.</t>
<t>As a rule of thumb, raw hop-by-hop endpoints are used to build
devices (intermediary nodes in the topology) while end-to-end
endpoints are used directly by the applications.</t>
<t>To prevent confusion, the specification of the endpoint behaviour
below will discuss hop-by-hop and end-to-end functionality in
separate chapters.</t>
</section>
<section title = "Hop-by-hop functionality">
<section title = "REQ endpoint">
<t>The REQ endpoint is used by the user to send requests to the
processing nodes and receive the replies afterwards.</t>
<t>When the user asks the REQ endpoint to send a request, the endpoint
should send it to one of the associated outbound channels (TCP
connections or similar). The request sent is exactly the message
supplied by the user. The REQ socket MUST NOT modify an outgoing
request in any way.</t>
<t>If there's no channel to send the request to, the endpoint won't send
the request and MUST report the backpressure condition to the user.
For example, with the BSD socket API, backpressure is reported as an
EAGAIN error.</t>
<t>If there are associated channels but none of them is available for
sending, i.e. all of them are already reporting backpressure, the
endpoint won't send the message and MUST report the backpressure
condition to the user.</t>
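<t>For illustration, a client using a BSD-socket-style API might handle
the reported backpressure as follows (a sketch only, assuming send()
fails with EAGAIN as described above):</t>
<figure>
<artwork>
import errno, time

def send_with_retry(endpoint, request, interval=0.1):
    while True:
        try:
            endpoint.send(request)
            return
        except OSError as e:
            if e.errno != errno.EAGAIN:
                raise
            time.sleep(interval)   # topology congested; back off
</artwork>
</figure>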
<t>Backpressure is used as a means to redirect the requests from the
congested parts of the topology to the parts that are still
responsive. It can be thought of as a crude scheduling algorithm.
Crude as it is, though, it's probably still the best you can get
without knowing estimates of execution time for individual tasks,
CPU capacity of individual processing nodes etc.</t>
<t>Alternatively, backpressure can be thought of as a congestion control
mechanism. When all available processing nodes are busy, it slows
down the client application, i.e. it prevents the user from sending
any more requests.</t>
<t>If the channel is not capable of reporting backpressure (e.g. DCCP),
the endpoint SHOULD consider it as always available for sending new
requests. However, such channels should be used with care because when
congestion hits they may suck in a lot of requests just to discard
them silently and thus cause re-transmission storms later on. The
implementation of the REQ endpoint MAY choose to prohibit the use
of such channels altogether.</t>
<t>When there are multiple channels available for sending the request,
the endpoint MAY use any prioritisation mechanism to decide which
channel to send the request to. For example, it may use classic
priorities attached to channels and send the message to the channel
with the highest priority. That allows for routing algorithms such as:
"Use local processing nodes if any are available. Send the requests to
remote nodes only if there are no local ones available." Alternatively,
the endpoint may implement weighted priorities ("send 20% of the
requests to node A and 80% to node B"). The endpoint also may not
implement any prioritisation strategy and treat all channels as
equal.</t>
<t>Whatever the case, two rules must apply.</t>
<t>First, by default the priority settings for all channels MUST be
equal. Creating a channel with a different priority MUST be triggered
by an explicit action by the user.</t>
<t>Second, if there are several channels with equal priority, the
endpoint MUST distribute the messages among them in a fair fashion
using the round-robin algorithm. The round-robin implementation MUST
also take care not to become unfair when new channels are added or old
ones are removed on the fly.</t>
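<t>One possible way to keep the round-robin fair while channels come and
go is to keep the channels in a rotating ring, as in the following
Python sketch (illustrative only; writable() and send() are assumed
channel methods):</t>
<figure>
<artwork>
from collections import deque

class FairSender:
    def __init__(self):
        self.ring = deque()
    def add(self, channel):
        self.ring.append(channel)    # newcomer waits for its turn
    def remove(self, channel):
        self.ring.remove(channel)
    def send(self, msg):
        # Visit each channel at most once, skipping busy ones.
        for _ in range(len(self.ring)):
            ch = self.ring[0]
            self.ring.rotate(-1)     # move it to the back of the ring
            if ch.writable():
                ch.send(msg)
                return True
        return False                 # all busy: report backpressure
</artwork>
</figure>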
<t>As for incoming messages, i.e. replies, the REQ endpoint MUST
fair-queue them. In other words, if there are replies available on
several channels, it MUST receive them in a round-robin fashion. It
must also take care not to compromise the fairness when new channels
are added or old ones removed.</t>
<t>In addition to providing basic fairness, the goal of fair-queueing is
to prevent DoS attacks where a huge stream of fake replies from one
channel would be able to block the real replies coming from different
channels. Fair queueing ensures that messages from every channel are
received at approximately the same rate. That way, a DoS attack can
slow down the system but it can't entirely block it.</t>
<t>Incoming replies MUST be handed to the user exactly as they were
received. The REQ endpoint MUST NOT modify the replies in any way.</t>
</section>
<section title = "REP endpoint">
<t>The REP endpoint is used to receive requests from the clients and send
replies back to the clients.</t>
<t>First of all, the REP socket is responsible for assigning unique
31-bit channel IDs to the individual associated channels.</t>
<t>The first ID assigned MUST be random. The next one is computed by
adding 1 to the previous one, with potential overflow to 0.</t>
<t>The implementation MUST ensure that the random number is different
each time the endpoint is re-started, the process that contains
it is restarted or similar. So, for example, using a pseudo-random
generator with a constant seed won't do.</t>
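<t>A sketch of such an ID assignment in Python, using the operating
system's entropy source as one way to avoid a constant seed:</t>
<figure>
<artwork>
import os

class ChannelIds:
    def __init__(self):
        # Different on every restart of the endpoint or process.
        self.next_id = int.from_bytes(os.urandom(4), 'big') &amp; 0x7fffffff
    def assign(self):
        channel_id = self.next_id
        # Add 1, overflowing from 2^31 - 1 back to 0.
        self.next_id = (self.next_id + 1) &amp; 0x7fffffff
        return channel_id
</artwork>
</figure>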
<t>The goal of the algorithm is to spread the possible channel ID
values and thus minimise the chance that a reply is routed to an
unrelated channel, even in the face of intermediate node
failures.</t>
<t>When receiving a message, the REP endpoint MUST fair-queue among the
channels available for receiving. In other words, it should
round-robin among such channels and receive one request from
a channel at a time. It MUST also implement the round-robin
algorithm in such a way that adding or removing channels doesn't
break its fairness.</t>
<t>In addition to guaranteeing basic fairness in access to computing
resources, the above algorithm makes it impossible for a malevolent
or misbehaving client to completely block the processing of requests
from other clients by issuing a steady stream of requests.</t>
<t>After getting hold of the request, the REP socket should prepend it
with a 32-bit value, consisting of 1 bit set to 0 followed by the
31-bit ID of the channel the request was received from. The extended
request is then handed to the user.</t>
<t>The goal of adding the channel ID to the request is to be able to
route the reply back to the original channel later on. Thus, when
the user sends a reply, the endpoint strips the first 32 bits off and
uses the value to determine where it is to be routed.</t>
<t>If the reply is shorter than 32 bits, it is malformed and
the endpoint MUST ignore it. Also, if the most significant bit of the
32-bit value isn't set to 0, the reply is malformed and MUST
be ignored.</t>
<t>Otherwise, the endpoint checks whether its table of associated
channels contains the channel with the corresponding ID. If so, it
sends the reply (with the first 32 bits stripped off) to that channel.
If the channel is not found, the reply MUST be dropped. If the
channel is not available for sending, i.e. it is applying
backpressure, the reply MUST be dropped.</t>
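<t>The reply-routing rules above condense into a few lines; the sketch
below is illustrative only, with 'channels' being a mapping from
channel ID to an assumed channel object:</t>
<figure>
<artwork>
import struct

def route_reply(reply, channels):
    if len(reply) &lt; 4:
        return                       # malformed: too short
    tag = struct.unpack('&gt;I', reply[:4])[0]
    if tag &amp; 0x80000000:
        return                       # malformed: MSB must be 0
    channel = channels.get(tag)
    if channel is None or not channel.writable():
        return                       # unknown or busy channel: drop
    channel.send(reply[4:])          # forward with the tag stripped
</artwork>
</figure>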
<t>Note that when the reply is unroutable, two things might have
happened. Either there was some kind of network disruption, in which
case the request will be re-sent later on, or the original client
has failed or been shut down. In that case the request won't be
resent; however, it doesn't really matter because there's no one to
deliver the reply to any more anyway.</t>
<t>Unlike with requests, there's no pushback applied to replies; they
are simply dropped. If the endpoint blocked and waited for the channel
to become available, all the subsequent replies, possibly destined for
different, unblocked channels, would be blocked in the meantime. That
would allow for a DoS attack simply by firing a lot of requests and not
receiving the replies.</t>
</section>
</section>
<section title = "End-to-end functionality">
<t>End-to-end functionality is built on top of hop-by-hop functionality.
Thus, an endpoint on the edge of a topology contains all the
hop-by-hop functionality, but also implements additional
functionality of its own. This end-to-end functionality acts
basically as a user of the underlying hop-by-hop functionality.</t>
<section title = "REQ endpoint">
<t>End-to-end functionality for REQ sockets is concerned with re-sending
requests in case of failure and with filtering out stray or
outdated replies.</t>
<t>To be able to do the latter, the endpoint must tag the requests with
unique 31-bit request IDs. The first request ID is picked at random.
All subsequent request IDs are generated by adding 1 to the last
request ID, with possible overflow to 0.</t>
<t>To improve the robustness of the system, the implementation MUST
ensure that the random number is different each time the endpoint, the
process or the machine is restarted. A pseudo-random generator with a
fixed seed won't do.</t>
<t>When the user asks the endpoint to send a message, the endpoint
prepends a 32-bit value to the message, consisting of a single bit set
to 1 followed by a 31-bit request ID, and passes it on in the standard
hop-by-hop way.</t>
<t>If the hop-by-hop layer reports a pushback condition, the end-to-end
layer considers the request unsent and MUST report the pushback
condition to the user.</t>
<t>If the request is successfully sent, the endpoint stores the request,
including its request ID, so that it can be resent later on if
needed. At the same time it sets up a timer to trigger the
re-transmission in case the reply is not received within a specified
timeout. The user MUST be allowed to specify the timeout interval.
The default timeout interval must be 60 seconds.</t>
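<t>The following Python sketch illustrates the storing and re-sending
logic (not normative; endpoint.send() is an assumed hop-by-hop
interface):</t>
<figure>
<artwork>
import struct, threading

MSB = 0x80000000

class Requester:
    def __init__(self, endpoint, timeout=60.0):
        self.endpoint, self.timeout = endpoint, timeout
        self.pending = {}            # request ID -&gt; stored message

    def send_request(self, request_id, payload):
        msg = struct.pack('&gt;I', MSB | request_id) + payload
        self.endpoint.send(msg)      # may report pushback instead
        self.pending[request_id] = msg
        self._arm_timer(request_id)

    def _arm_timer(self, request_id):
        timer = threading.Timer(self.timeout, self._resend,
                                [request_id])
        timer.daemon = True
        timer.start()

    def _resend(self, request_id):
        msg = self.pending.get(request_id)
        if msg is not None:          # still unanswered: try again
            self.endpoint.send(msg)
            self._arm_timer(request_id)
</artwork>
</figure>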
<t>When a reply is received from the underlying hop-by-hop
implementation, the endpoint should strip off the first 32 bits from
the reply to check whether it is a valid reply.</t>
<t>If the reply is shorter than 32 bits, it is malformed and the
endpoint MUST ignore it. If the most significant bit of the 32-bit
value is set to 0, the reply is malformed and MUST be ignored.</t>
<t>Otherwise, the endpoint should check whether the request ID in
the reply matches any of the request IDs of the requests being
processed at the moment. If not, the reply MUST be ignored.
It is either a stray message or a duplicate reply.</t>
<t>Please note that the endpoint can support either a single request or
multiple requests being processed in parallel. Which one is the case
depends on the API exposed to the user and is not part of this
specification.</t>
<t>If the ID in the reply matches one of the requests in progress, the
reply MUST be passed to the user (with the 32-bit prefix stripped
off). At the same time the stored copy of the original request as
well as the re-transmission timer must be deallocated.</t>
<t>Finally, the REQ endpoint MUST make it possible for the user to
cancel a particular request in progress. Technically, that means
deleting the stored copy of the request and cancelling the associated
timer. Thus, once the reply arrives, it will be discarded by the
algorithm above.</t>
<t>Cancellation allows, for example, the user to time out a request.
They can simply post a request and, if there's no answer within a
specific timeframe, cancel it.</t>
</section>
<section title = "REP endpoint">
<t>End-to-end functionality for REP endpoints is concerned with turning
requests into corresponding replies.</t>
<t>When the user asks to receive a request, the endpoint gets the next
request from the hop-by-hop layer and splits it into the traceback
stack and the message payload itself. The traceback stack is stored and
the payload is returned to the user.</t>
<t>The algorithm for splitting the request is as follows: Strip the
32-bit tags from the message one by one. Once the most significant
bit of a tag is set, we've reached the bottom of the traceback
stack and the splitting is done. If the end of the message is reached
without finding the bottom of the stack, the request is malformed and
MUST be ignored.</t>
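<t>The splitting algorithm, expressed as an illustrative Python sketch
(returning None for malformed requests, which are then ignored):</t>
<figure>
<artwork>
import struct

def split_request(request):
    stack, rest = [], request
    while len(rest) &gt;= 4:
        tag, rest = struct.unpack('&gt;I', rest[:4])[0], rest[4:]
        stack.append(tag)
        if tag &amp; 0x80000000:         # bottom of the stack reached
            return stack, rest       # (traceback stack, payload)
    return None                      # malformed: no bottom tag found
</artwork>
</figure>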
<t>Note that the payload produced by this procedure is the same as the
request payload sent by the original client.</t>
<t>Once the user processes the request and sends the reply, the endpoint
prepends the reply with the stored traceback stack and sends it on
using the hop-by-hop layer. At that point the stored traceback stack
MUST be deallocated.</t>
<t>Additionally, the REP endpoint MUST support cancelling any request
being processed at the moment. Technically, it means that the
state associated with the request, i.e. the traceback stack stored
by the endpoint, is deleted and the reply to that particular
request is never sent.</t>
<t>The most important use of cancellation is allowing the service
instances to ignore malformed requests. If the application-level
part of the request doesn't conform to the application protocol,
the service can simply cancel the request. In such a case the reply
is never sent. Of course, if the application wants to send an
application-specific error message back to the client, it can do so
by not cancelling the request and sending a regular reply.</t>
</section>
</section>
<section title = "Loop avoidance">
<t>It may happen that a request/reply topology contains a loop. It
becomes increasingly likely as the topology grows out of the scope of a
single organisation and there are multiple administrators involved
in maintaining it. An unfortunate interaction between two perfectly
legitimate setups can cause a loop to be created.</t>
<t>With no additional guards against loops, it's likely that
requests will be caught inside the loop, rotating there forever,
each message gradually growing in size as new prefixes are added to it
by each REP endpoint on the way. Eventually, a loop can cause
congestion and bring the whole system to a halt.</t>
<t>To deal with the problem, REQ endpoints MUST check the depth of the
traceback stack for every outgoing request and discard any request
where it exceeds a certain threshold. The threshold should be defined
by the user. The default value is suggested to be 8.</t>
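<t>A sketch of the check in Python (illustrative; the threshold would
come from user configuration):</t>
<figure>
<artwork>
MAX_HOPS = 8                         # suggested default threshold

def too_deep(request, max_hops=MAX_HOPS):
    # Count tags up to and including the bottom-of-stack tag.
    depth, offset = 0, 0
    while offset + 4 &lt;= len(request):
        depth += 1
        if request[offset] &amp; 0x80:   # MSB of this 32-bit tag is set
            break
        offset += 4
    return depth &gt; max_hops
</artwork>
</figure>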
</section>
<section anchor="IANA" title="IANA Considerations">
<t>New SP endpoint types REQ and REP should be registered by IANA. For
now, the value of 16 should be used for REQ endpoints and the value of
17 for REP endpoints.</t>
</section>
<section anchor="Security" title="Security Considerations">
<t>The mapping is not intended to provide any additional security to the
underlying protocol. DoS concerns are addressed within
the specification.</t>
</section>
</middle>
<back>
<references>
<reference anchor='SPoverTCP'>
<front>
<title>TCP mapping for SPs</title>
<author initials='M.' surname='Sustrik' fullname='M. Sustrik'/>
<date month='August' year='2013'/>
</front>
<format type='TXT' target='sp-tcp-mapping-01.txt'/>
</reference>
</references>
</back>
</rfc>