Date:	Tue, 12 Jul 2016 11:08:20 -0700
From:	Benjamin Poirier <benjamin.poirier@...il.com>
To:	Ursula Braun <ubraun@...ux.vnet.ibm.com>
Cc:	Dave Miller <davem@...emloft.net>, netdev@...r.kernel.org,
	linux-s390@...r.kernel.org,
	Martin Schwidefsky <schwidefsky@...ibm.com>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	Utz Bacher <utz.bacher@...ibm.com>
Subject: Re: Fwd: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory
 Communications - RDMA

On 2016/07/06 17:29, Ursula Braun wrote:
> Dave,
> 
> we would still like to see SMC-R included in a future Linux kernel. After
> we answered your first two questions, there has been no further response.
> What should we do next?
> - Keep waiting for an answer from you?
> - Resend the whole SMC-R patch series, this time with the cover letter
> adapted to the changes you requested?

^^^ I would suggest sending v2 of the patch series with the changes
that were requested.

> - Put SMC-R development on hold, and concentrate first on another
> s390-specific SMC solution (not RDMA-based) that also makes use of the
> SMC socket family.
> - Anything else?
> 
> Kind regards, Ursula
> 
> -------- Forwarded Message --------
> Subject: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory
> Communications - RDMA
> Date: Tue, 21 Jun 2016 16:02:59 +0200
> From: Ursula Braun <ubraun@...ux.vnet.ibm.com>
> To: davem@...emloft.net
> CC: netdev@...r.kernel.org, linux-s390@...r.kernel.org,
> schwidefsky@...ibm.com, heiko.carstens@...ibm.com, utz.bacher@...ibm.com
> 
> Dave,
> 
> the SMC-R patches submitted on 2016-06-03 show up in state "Changes
> Requested" on patchwork:
> https://patchwork.ozlabs.org/project/netdev/list/?submitter=2266&state=*&page=1
> 
> You had requested a change to the SMC-R description in the cover letter.
> We came up with the response below. Do you need anything else from us?
> 
> Kind regards,
> Ursula Braun
> 
> -------- Forwarded Message --------
> Subject: Re: [PATCH net-next 00/15] net/smc: Shared Memory
> Communications - RDMA
> Date: Thu,  9 Jun 2016 17:36:28 +0200
> From: Ursula Braun <ubraun@...ux.vnet.ibm.com>
> To: davem@...emloft.net
> CC: netdev@...r.kernel.org, linux-s390@...r.kernel.org,
> schwidefsky@...ibm.com, heiko.carstens@...ibm.com
> 
> On Tue, 2016-06-07 at 15:07 -0700, David Miller wrote:
> > In case my previous reply wasn't clear enough, I require that you provide
> > a more accurate description of what the implications of this feature are.
> > 
> > Namely, that important _CORE_ networking features are completely bypassed
> > and unusable when SMC applies to a connection.
> > 
> > Specifically, all packet shaping, filtering, traffic inspection, and
> > flow management facilities in the kernel will not be able to see nor
> > act upon the data flow of these TCP connections once established.
> > 
> > It is always important, and in my opinion required, to list the
> > negative aspects of your change and not just the "wow, amazing"
> > positive aspects.
> > 
> > Thanks.
> > 
> > 
> Correct, the SMC-R data stream bypasses TCP and thus cannot benefit from
> its features. This is the price for leveraging the TCP application
> ecosystem and reducing CPU load.
> 
> When a load balancer allows the TCP handshake to take place between a
> worker node and the TCP client, RDMA will be used between these two
> nodes. So anything based on TCP connection establishment (including a
> firewall) applies to SMC-R, too. To be clear -- yes, the data flow
> later on is no longer subject to these features. At least the VLAN
> isolation of the TCP part can be leveraged for the RDMA traffic. From
> our experience and discussions, that tradeoff seems acceptable in a
> classical data center environment.
> 
> Improving our cover letter would result in the following new introductory
> motivation at the beginning and a slightly modified list of planned
> enhancements at the end:
> 
> On Fri, 2016-06-03 at 17:26 +0200, Ursula Braun wrote:
> 
> > These patches are the initial part of the implementation of the
> > "Shared Memory Communications-RDMA" (SMC-R) protocol. The protocol is
> > defined in RFC7609 [1]. It allows transformation of TCP connections
> > using the "Remote Direct Memory Access over Converged Ethernet" (RoCE)
> > feature of specific communication hardware for data center environments.
> > 
> > SMC-R inherits TCP qualities such as reliable connections, host-based
> > firewall packet filtering (on connection establishment) and unmodified
> > application of communication encryption such as TLS (transport layer
> > security) or SSL (secure sockets layer). It is transparent to most existing
> > TCP connection load balancers that are commonly used in the enterprise data
> > center environment for multi-tier application workloads.
> > 
> > Being designed for the data center network switched fabric environment, it
> > does not need congestion control and thus reaches line speed right away
> > without having to go through slow start as TCP does. This can be beneficial
> > for short-lived flows, including request-response patterns that require
> > reliability. A full SMC-R implementation also provides the seamless high
> > availability and load balancing demanded by enterprise installations.
> > 
> > SMC-R does not require an RDMA communication manager (RDMA CM). Its use of
> > RDMA provides CPU savings transparently for unmodified applications.
> > For instance, when running 10 parallel connections with uperf, we measured
> > a decrease of 60% in CPU consumption with SMC-R compared to TCP/IP
> > (with throughput and latency comparable;
> > measured on x86_64 with the same RoCE card and port).
> > 
> These patches are the initial part of the implementation of the
> "Shared Memory Communications-RDMA" (SMC-R) protocol as defined in
> RFC7609 [1]. While SMC-R does not aim to replace TCP, it enables the
> wealth of existing data center TCP socket applications to become more
> efficient without the need to rewrite them.
> SMC-R uses RDMA over Converged Ethernet (RoCE) to save CPU consumption.
> For instance, when running 10 parallel connections with uperf, we measured
> a decrease of 60% in CPU consumption with SMC-R compared to TCP/IP
> (with throughput and latency comparable;
> measured on x86_64 with the same RoCE card and port).
> 
> SMC-R does not require an RDMA communication manager (RDMA CM).
> 
> SMC-R inherits TCP qualities such as reliable connections, host-based
> firewall packet filtering (on connection establishment) and unmodified
> application of communication encryption such as TLS (transport layer
> security) or SSL (secure sockets layer). Since original TCP is used to
> establish SMC-R connections, load balancers and packet inspection based
> on TCP/IP connection establishment continue to work for SMC-R.
> 
> On the other hand, using SMC-R implies:
> - either involving a preload library when invoking the unchanged
>   TCP application, or slightly modifying the source by simply changing
>   the socket family in the socket() call
> - accepting extra overhead and latency in connection establishment due
>   to the SMC Connection Layer Control (CLC) handshake
> - explicit coupling of RoCE ports with Ethernet ports
> - not routable, since it is currently built on RoCE v1
> - bypassing of packet-based networking features
>     - filtering (netfilter)
>     - sniffing (libpcap, packet sockets, (E)BPF)
>     - traffic control (scheduling, shaping)
> - bypassing of IP-header based socket options
> - bypassing of memory buffer (pressure) management
> - unusable together with IPsec
> 
> > 
> > Overview of the SMC-R Protocol described in informational RFC 7609
> > 
> > SMC-R is an open protocol that provides RDMA capabilities over RoCE
> > transparently to applications using TCP sockets.
> > A new socket protocol family PF_SMC is introduced.
> > There are no changes required to applications using the sockets API for TCP
> > stream sockets other than the specification of the new socket family AF_SMC.
> > Unmodified applications can be used by means of a dynamic preload shared
> > library which rewrites the socket API call
> > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) into
> > socket(AF_SMC,  SOCK_STREAM, IPPROTO_TCP).
> > SMC-R re-uses the address family AF_INET for all addressing purposes around
> > struct sockaddr.
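> > 
> > As an illustration -- a minimal sketch, not the actual preload library,
> > and with the AF_SMC value assumed rather than taken from the patches --
> > such a shim can be built with LD_PRELOAD:
> > 
> >   /* smc_preload.c - hypothetical socket() interposer (sketch).
> >    * Build: gcc -shared -fPIC -o smc_preload.so smc_preload.c -ldl
> >    * Use:   LD_PRELOAD=./smc_preload.so ./unmodified_tcp_app
> >    */
> >   #define _GNU_SOURCE
> >   #include <dlfcn.h>
> >   #include <netinet/in.h>
> >   #include <sys/socket.h>
> > 
> >   #ifndef AF_SMC
> >   #define AF_SMC 43  /* assumption: family value used by the patches */
> >   #endif
> > 
> >   int socket(int domain, int type, int protocol)
> >   {
> >           static int (*real_socket)(int, int, int);
> > 
> >           if (!real_socket)  /* look up the real socket() in libc */
> >                   real_socket = (int (*)(int, int, int))
> >                                 dlsym(RTLD_NEXT, "socket");
> > 
> >           /* rewrite TCP stream sockets to the SMC family; all
> >            * addressing still uses AF_INET in struct sockaddr */
> >           if (domain == AF_INET && type == SOCK_STREAM &&
> >               protocol == IPPROTO_TCP)
> >                   domain = AF_SMC;
> > 
> >           return real_socket(domain, type, protocol);
> >   }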
> > 
> > 
> > SMC-R system architecture layers:
> > 
> > +=============================================================================+
> > |                                      | unmodified TCP application           |
> > | native SMC application               +--------------------------------------+
> > |                                      | dynamic preload shared library       |
> > +=============================================================================+
> > |                                 SMC socket                                  |
> > +-----------------------------------------------------------------------------+
> > |                    | TCP socket (for connection establishment and fallback) |
> > | IB verbs           +--------------------------------------------------------+
> > |                    | IP                                                     |
> > +--------------------+--------------------------------------------------------+
> > | RoCE device driver | some network device driver                             |
> > +=============================================================================+
> > 
> > 
> > Terms:
> > 
> > A link group is determined by an ordered peer pair of TCP client and TCP
> > server (IP addresses and subnet). Reversed client/server roles result in a
> > separate link group.
> > A link is a logical point-to-point connection based on an InfiniBand
> > reliable connected queue pair (RC-QP) between two RoCE ports
> > (MACs and GIDs) of a peer pair.
> > A link group can have 1..8 links for failover and load balancing.
> > This initial Linux implementation always has 1 link per link group.
> > Each link group on a peer can have 1..255 remote memory buffers (RMBs).
> > If more RMBs are needed, a peer can open another link group (as this
> > initial Linux implementation does) or fall back to TCP.
> > Each RMB has its own particular size and its own (R)DMA mapping and credentials
> > (rtoken consisting of rkey and RDMA "virtual address").
> > This initial Linux implementation uses physically contiguous memory for RMBs
> > but we are working towards scattered memory because of memory fragmentation.
> > Each RMB has 1..255 RMB elements (RMBEs) of equal size
> > to provide multiplexing of connections within an RMB.
> > An RMBE is the RDMA Write destination, organized as a wrapping ring buffer
> > for the data transmitted by a particular connection in one direction
> > (duplex by means of mirror symmetry as with TCP).
> > This initial Linux implementation always has 1 RMBE per RMB
> > and thus an individual RMB for each connection.
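> > 
> > Expressed as data structures, the terms above nest roughly as follows
> > (an illustrative sketch; the names are hypothetical, not taken from
> > the patch sources):
> > 
> >   #include <stddef.h>
> > 
> >   #define SMC_MAX_LINKS  8    /* 1..8 links per link group   */
> >   #define SMC_MAX_RMBES  255  /* 1..255 RMB elements per RMB */
> > 
> >   struct smc_rtoken {         /* RDMA credentials of an RMB */
> >           unsigned int  rkey;
> >           unsigned long vaddr;         /* RDMA "virtual address" */
> >   };
> > 
> >   struct smc_rmb {            /* remote memory buffer */
> >           size_t            size;
> >           struct smc_rtoken rtoken;
> >           unsigned int      num_rmbes; /* initial impl.: always 1 */
> >           void             *rmbe[SMC_MAX_RMBES]; /* ring buffers */
> >   };
> > 
> >   struct smc_link {           /* one RC-QP between two RoCE ports */
> >           unsigned char peer_mac[6];
> >           unsigned char peer_gid[16];
> >           /* ... queue pair state ... */
> >   };
> > 
> >   struct smc_link_group {     /* per ordered (client, server) pair */
> >           struct smc_link links[SMC_MAX_LINKS]; /* initial impl.: 1 */
> >           unsigned int    num_links;
> >           struct smc_rmb *rmbs;        /* 1..255 RMBs */
> >           unsigned int    num_rmbs;
> >   };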
> > 
> > 
> > SMC-R connection establishment with subsequent data transfer:
> > 
> >    CLIENT                                                   SERVER
> > 
> > TCP three-way handshake:
> >                          regular TCP SYN
> >       -------------------------------------------------------->
> >                        regular TCP SYN ACK
> >       <--------------------------------------------------------
> >                          regular TCP ACK
> >       -------------------------------------------------------->
> > 
> > SMC Connection Layer Control (CLC) handshake
> > exchanges RDMA credentials between peers:
> >              via above TCP connection: SMC CLC Proposal
> >       -------------------------------------------------------->
> >               via above TCP connection: SMC CLC Accept
> >       <--------------------------------------------------------
> >              via above TCP connection: SMC CLC Confirm
> >       -------------------------------------------------------->
> > 
> > SMC Link Layer Control (LLC) (only once per link, i.e. 1st conn. of link group):
> >                  RoCE RC-QP: SMC LLC Confirm Link
> >       <========================================================
> >              RoCE RC-QP: SMC LLC Confirm Link response
> >       ========================================================>
> > 
> > SMC data transmission (incl. SMC Connection Data Control (CDC) message):
> >                        RoCE RC-QP: RDMA Write
> >       ========================================================>
> >              RoCE RC-QP: SMC CDC message (flow control)
> >       ========================================================>
> >                           ...
> > 
> >                        RoCE RC-QP: RDMA Write
> >       <========================================================
> >              RoCE RC-QP: SMC CDC message (flow control)
> >       <========================================================
> >                           ...
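> > 
> > From the application's point of view, the entire sequence above is
> > driven by ordinary socket calls. A minimal client (sketch; the AF_SMC
> > value is an assumption, and 192.0.2.1 is a placeholder address) differs
> > from a TCP client only in the family passed to socket():
> > 
> >   #include <stdio.h>
> >   #include <unistd.h>
> >   #include <arpa/inet.h>
> >   #include <netinet/in.h>
> >   #include <sys/socket.h>
> > 
> >   #ifndef AF_SMC
> >   #define AF_SMC 43  /* assumption: family value used by the patches */
> >   #endif
> > 
> >   int main(void)
> >   {
> >           /* addressing re-uses AF_INET, as described above */
> >           struct sockaddr_in addr = {
> >                   .sin_family = AF_INET,
> >                   .sin_port   = htons(5001),
> >           };
> >           inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);
> > 
> >           int fd = socket(AF_SMC, SOCK_STREAM, IPPROTO_TCP);
> >           if (fd < 0) {
> >                   perror("socket");
> >                   return 1;
> >           }
> > 
> >           /* TCP three-way handshake and SMC CLC handshake both
> >            * happen underneath this single connect() */
> >           if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
> >                   perror("connect");
> >                   close(fd);
> >                   return 1;
> >           }
> > 
> >           send(fd, "hello", 5, 0);  /* data flows via RDMA Write */
> >           close(fd);
> >           return 0;
> >   }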
> > 
> > 
> > Data flow within an established connection:
> > 
> > +----------------------------------------------------------------------------
> > |            SENDER
> > | sendmsg()
> > |    |
> > |    | produces into sndbuf [sender's process context]
> > |    v
> > | +--------+
> > | | sndbuf | [ring buffer]
> > | +--------+
> > |    |
> > |    | consumes from sndbuf and produces into receiver's RMBE [any context]
> > |    | by sending RDMA Write followed by SMC CDC message over RoCE RC-QP
> > |    |
> > +----|-----------------------------------------------------------------------
> >      |
> > +----|-----------------------------------------------------------------------
> > |    v       RECEIVER
> > | +------+
> > | | RMBE | [ring buffer, can have size different from sender's sndbuf]
> > | |      | [RMBE represents rcvbuf, no further de-coupling as on sender side]
> > | +------+
> > |    |
> > |    | consumes from RMBE [receiver's process context]
> > |    v
> > | recvmsg()
> > +----------------------------------------------------------------------------
> > 
> > 
> > Flow control ("cursor" updates) by means of SMC CDC messages:
> > 
> >                SENDER                            RECEIVER
> > 
> >         sends updates via CDC-------------+   sends updates via CDC
> >         on consuming from sndbuf          |   on consuming from RMBE
> >         and producing into RMBE           |   by means of recvmsg()
> >                                           |            |
> >                                           |            |
> >       +-----------------------------------|------------+
> >       |                                   |
> >    +--v-------------------------+      +--v-----------------------+
> >    | receiver's consumer cursor |      | sender's producer cursor----+
> >    +----------------|-----------+      +--------------------------+  |
> >                     |                                                |
> >                     |                        receiver's RMBE         |
> >                     |                  +--------------------------+  |
> >                     |                  |                          |  |
> >                     +--------------------------------+            |  |
> >                                        |             |            |  |
> >                                        |             v            |  |
> >                                        |             +------------|  |
> >                                        |-------------+////////////|  |
> >                                        |//RDMA data written by////|  |
> >                                        |////sender that is////////|  |
> >                                        |/available to be consumed/|  |
> >                                        |///////// +---------------|  |
> >                                        |----------+^              |  |
> >                                        |           |              |  |
> >                                        |           +-----------------+
> >                                        |                          |
> >                                        +--------------------------+
> > 
> > Sending updates of the producer cursor is immediate for low latency;
> > something like Nagle's algorithm (absence of TCP_NODELAY) is optional and
> > currently not part of this initial Linux implementation.
> > Sending updates of the consumer cursor is conditional to avoid the
> > silly window syndrome.
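> > 
> > The cursor arithmetic is that of an ordinary wrapping ring buffer.
> > A sketch (hypothetical names, not the functions from the patches) of
> > how filled and free space follow from the two cursors:
> > 
> >   /* each cursor counts laps around the RMBE plus an offset */
> >   struct smc_cursor {
> >           unsigned int wrap;  /* wrap-around (lap) count */
> >           unsigned int off;   /* offset into the buffer  */
> >   };
> > 
> >   /* bytes written by the sender, not yet consumed by recvmsg() */
> >   static unsigned int rmbe_filled(const struct smc_cursor *prod,
> >                                   const struct smc_cursor *cons,
> >                                   unsigned int rmbe_size)
> >   {
> >           if (prod->wrap == cons->wrap)   /* same lap */
> >                   return prod->off - cons->off;
> >           /* producer is one lap ahead of the consumer */
> >           return rmbe_size - cons->off + prod->off;
> >   }
> > 
> >   /* bytes the sender may still RDMA-write without overrunning */
> >   static unsigned int rmbe_free(const struct smc_cursor *prod,
> >                                 const struct smc_cursor *cons,
> >                                 unsigned int rmbe_size)
> >   {
> >           return rmbe_size - rmbe_filled(prod, cons, rmbe_size);
> >   }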
> > 
> > 
> > Normal connection termination:
> > 
> > Normal connection termination starts from socket state ACTIVE and
> > transitions via either "Active Close" or "Passive Close".
> > 
> > shutdown rdwr               +-----------------+
> > or close,   +-------------->|  INIT / CLOSED  |<-------------+
> > send PeerCon|nClosed        +-----------------+              | PeerConnClosed
> >             |                       |                        | received
> >             |            connection | established            |
> >             |                       V                        |
> >     +----------------+     +-----------------+     +----------------+
> >     |AppFinCloseWait |     |     ACTIVE      |     |PeerFinCloseWait|
> >     +----------------+     +-----------------+     +----------------+
> >             |                   |         |                   |
> >             |     Active Close: |         |Passive Close:     |
> >             |     close or      |         |PeerConnClosed or  |
> >             |     shutdown wr or|         |PeerDoneWriting    |
> >             |     shutdown rdwr |         |received           |
> >             |                   V         V                   |
> >  PeerConnClo|sed    +--------------+   +-------------+        | close or
> >  received   +--<----|PeerCloseWait1|   |AppCloseWait1|--->----+ shutdown rdwr,
> >             |       +--------------+   +-------------+        | send
> >             |  PeerDoneWri|ting                | shutdown wr, | PeerConnClosed
> >             |  received   |            send Pee|rDoneWriting  |
> >             |             V                    V              |
> >             |       +--------------+   +-------------+        |
> >             +--<----|PeerCloseWait2|   |AppCloseWait2|--->----+
> >                     +--------------+   +-------------+
> > 
> > In state CLOSED, the socket can only be destructed once the application
> > has issued a close().
> > 
> > Abnormal connection termination:
> > 
> >                             +-----------------+
> >             +-------------->|  INIT / CLOSED  |<-------------+
> >             |               +-----------------+              |
> >             |                                                |
> >             |           +-----------------------+            |
> >             |           |     Any state         |            |
> >  PeerConnAbo|rt         | (before setting       |            | send
> >  received   |           |  PeerConnClosed       |            | PeerConnAbort
> >             |           |  indicator in         |            |
> >             |           |  peer's RMBE)         |            |
> >             |           +-----------------------+            |
> >             |                   |         |                  |
> >             |     Active Abort: |         | Passive Abort:   |
> >             |     problem,      |         | PeerConnAbort    |
> >             |     send          |         | received,        |
> >             |     PeerConnAbort,|         | ECONNRESET       |
> >             |     ECONNABORTED  |         |                  |
> >             |                   V         V                  |
> >             |       +--------------+   +--------------+      |
> >             +-------|PeerAbortWait |   | ProcessAbort |------+
> >                     +--------------+   +--------------+
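> > 
> > Read as code, the states of the two diagrams form one enumeration
> > (a sketch; the names follow the diagrams, not necessarily the patch
> > sources):
> > 
> >   enum smc_sock_state {
> >           SMC_INIT_CLOSED,      /* INIT / CLOSED */
> >           SMC_ACTIVE,
> >           /* active close path */
> >           SMC_PEERCLOSEWAIT1,   /* close/shutdown issued             */
> >           SMC_PEERCLOSEWAIT2,   /* PeerDoneWriting received          */
> >           /* passive close path */
> >           SMC_APPCLOSEWAIT1,    /* peer closed or done writing       */
> >           SMC_APPCLOSEWAIT2,    /* shutdown wr, sent PeerDoneWriting */
> >           SMC_APPFINCLOSEWAIT,
> >           SMC_PEERFINCLOSEWAIT,
> >           /* abnormal termination */
> >           SMC_PEERABORTWAIT,    /* active abort, ECONNABORTED        */
> >           SMC_PROCESSABORT,     /* PeerConnAbort rcvd, ECONNRESET    */
> >   };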
> > 
> > 
> > Implementation notes beyond RFC 7609:
> > 
> > A PNET table in sysfs provides the mapping between network device names and
> > RoCE InfiniBand device names for the transparent switch of data
> > communication. A PNET table can contain an arbitrary number of PNETIDs.
> > Each PNETID contains exactly one (Ethernet) network device name
> > and one or more RoCE InfiniBand device names.
> > Each device name can exist in at most one PNETID (no overlapping).
> > This initial Linux implementation allows at most one RoCE InfiniBand device
> > name per PNETID.
> > After a new TCP connection is established, the network device name
> > used for egress traffic with the TCP connection's local source IP address
> > serves as the key to look up the unique PNETID, and the RoCE InfiniBand
> > device of this PNETID is used to switch data communication from TCP to
> > RDMA during the SMC CLC handshake.
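> > 
> > The lookup described above amounts to a small table scan. An
> > illustrative sketch (hypothetical structures, not the sysfs interface
> > itself):
> > 
> >   #include <string.h>
> > 
> >   struct smc_pnetid {
> >           char pnetid[16];
> >           char eth_name[16];   /* exactly one Ethernet device  */
> >           char roce_name[16];  /* initial impl.: one RoCE dev. */
> >   };
> > 
> >   /* map the egress Ethernet device of a new TCP connection to its
> >    * RoCE device; NULL means: no PNETID matches, stay on plain TCP */
> >   static const char *pnet_roce_for(const struct smc_pnetid *tab,
> >                                    size_t n, const char *egress_eth)
> >   {
> >           for (size_t i = 0; i < n; i++)
> >                   if (strcmp(tab[i].eth_name, egress_eth) == 0)
> >                           return tab[i].roce_name;
> >           return NULL;
> >   }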
> > 
> > 
> > Problem determination:
> > 
> > A protocol dissector for decoding SMC-R-related RoCE LAN traffic is
> > available in upstream Wireshark.
> > [https://code.wireshark.org/review/gitweb?p=wireshark.git;a=blob;f=epan/dissectors/packet-smcr.c]
> > 
> > 
> > We are working on enhancing the Linux implementation to cover:
> > 
> > - Improve default socket closing asynchronicity
> > - Address corner cases with many parallel connections
> > - Load balancing and fail-over
> > - Urgent data
> > - Splice and sendpage support
> > - Keepalive
> > - More socket options
> > - IPv6 support
> > - Tracing
> > - Statistics support
> > 
> - Improve default socket closing asynchronicity
> - Address corner cases with many parallel connections
> - Tracing
> - Integrated load balancing and fail-over within a link group
> - Splice and sendpage support
> - IPv6 addressing support
> - Keepalive, Cork
> - Namespaces support
> - Urgent data
> - More socket options
> - Diagnostics
> - Statistics support
> - SNMP support
> 
> > 
> > References:
> > 
> > [1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609
> 
> Do you agree with this changed cover letter?
> 
> Kind regards,
> Ursula Braun
> 
