lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Thu, 18 Aug 2011 15:08:54 -0700
From:	San Mehat <san@...gle.com>
To:	davem@...emloft.net, mst@...hat.com, rusty@...tcorp.com.au
Cc:	linux-kernel@...r.kernel.org,
	virtualization@...ts.linux-foundation.org, netdev@...r.kernel.org,
	digitaleric@...gle.com, mikew@...gle.com, miche@...gle.com,
	maccarro@...gle.com
Subject: Re:

Pls disregard in favor of the one with an actual subject line :P

-san

On Thu, Aug 18, 2011 at 3:07 PM, San Mehat <san@...gle.com> wrote:
>
> TL;DR
> -----
> In this RFC we propose the introduction of the concept of hardware socket
> offload to the Linux kernel. Patches will accompany this RFC in a few days,
> but we felt we had enough on the design to solicit constructive discussion
> from the community at-large.
>
> BACKGROUND
> ----------
> Many applications within enterprise organizations suitable for virtualization
> neither require nor desire a connection to the full internal Ethernet+IP
> network.  Rather, some specific socket connections -- for processing HTTP
> requests, making database queries, or interacting with storage -- are needed,
> and IP networking in the application may typically be discouraged for
> applications that do not sit on the edge of the network. Furthermore, removing
> the application's need to understand where its inputs come from / go to within
> the networking fabric can make save/restore/migration of a virtualized
> application substantially easier - especially in large clusters and on fabrics
> which can't handle IP re-assignment.
>
> REQUIREMENTS
> ------------
>  * Allow VM connectivity to internal resources without requiring additional
>   network resources (IPs, VLANs, etc).
>  * Easy authentication of network streams from a trusted domain (vmm).
>  * Protect host-kernel & network-fabric from direct exposure to untrusted
>   packet data-structures.
>  * Support for multiple distributions of Linux.
>  * Minimal third-party software maintenance burden.
>  * To be able to co-exist with the existing network stack and ethernet virtual
>   devices in the event that an applications specific requirements cannot be
>   met by this design.
>
> DESIGN
> ------
> The Berkeley sockets coprocessor is a virtual PCI device which has the ability
> to offload socket activity from an unmodified application at the BSD sockets
> layer (Layer 4).  Offloaded socket requests bypass the local operating systems
> networking stack entirely via the card and are relayed into the VMM
> (Virtual Machine Manager) for processing. The VMM then passes the request to a
> socket backend for handling. The difference between a socket backend and a
> traditional VM ethernet backend is that the socket backend receives layer 4
> socket (STREAM/DGRAM) requests instead of a multiplexed stream of layer 2
> packets (ethernet) that must be interpreted by the host. This technique also
> improves security isolation as the guest is no longer constructing packets which
> are evaluated by the host or underlying network fabric; packet construction
> happens in the host.
>
> Lastly, pushing socket processing back into the host allows for host-side
> control of the network protocols used, which limits the potential congestion
> problems that can arise when various guests are using their own congestion
> control algorithms.
>
> ================================================================================
>
>           +-----------------------------------------------------------------+
>           |                                                                 |
>  guest    |                      unmodified application                     |
> userspace  +-----------------------------------------------------------------+
>           |                         unmodified libc                         |
>           +-----------------------------------------------------------------+
>                            |                             / \
>                            |                              |
> =========================== | ============================ | ===================
>                            |                              |
>                           \ /                             |
>                 +------------------------------------------------------+
>                 |                       socket core                    |
>                 +----+============+------------------------------------+
>                      |    INET    |                   |         / \
>  guest               +-----+------+                   |          |
>  kernel              | TCP | UDP  |                   |          |
>                      +-----+------+                   | L4 reqs  |
>                      |   NETDEV   |                   |          |
>                      +------------+                   |          |
>                      | virtio_net |                  \ /         |
>                      +------------+               +------------------+
>                          |   / \                  |    hw_socket     |
>                          |    |                   +------------------+
>                          |    |                   |  virtio_socket   |
>                          |    |                   +------------------+
>                          |    |                        |       / \
> ========================= | == | ====================== | ====== | =============
>                         \ /   |                       \ /       |
>  host           +---------------------+        +------------------------+
> userspace        |  virito net device  |        |  virtio socket device  |
>  (vmm)          +---------------------+        +------------------------+
>                 |  ethernet backend   |        |     socket backend     |
>                 +---------------------+        +------------------------+
>                        |     / \                      |        / \
>                 L2     |      |                       |         |     L4
>               packets  |      |                      \ /        |  requests
>                        |      |                +-----------------------+
>                        |      |                |    Socket Handlers    |
>                        |      |                +-----------------------+
>                        |      |                       |        / \
> ======================= | ==== | ===================== | ======= | =============
>                        |      |                       |         |
>   host                \ /     |                      \ /        |
>  kernel
>
> ================================================================================
>
> One of the most appealing aspects of this design (to application developers) is
> that this approach can be completely transparent to the application, provided
> we're able to intercept the application's socket requests in such a way that we
> do not impact performance in a negative fashion, yet retain the API semantics
> the application expects. In the event that this design is not suitable for an
> application, the virtual machine may be also fitted with a normal virtual
> ethernet device in addition to the co-processor (as shown in the diagram above).
>
> Since we wish to allow these paravirtualized sockets to coexist peacefully with
> the existing Linux socket system, we've chosen to introduce the idea that a
> socket can at some point transition from being managed by the O/S socket system
> to a more enlightened 'hardware assisted' socket. The transition is managed by
> a 'socket coprocessor' component which intercepts and gets first right of
> refusal on handling certain global socket calls (connect, sendto, bind, etc...).
> In this initial design, the policy on whether to transition a socket or not is
> made by the virtual hardware, although we understand that further measurement
> into operation latency is warranted.
>
> In the event the determination is made to transition a socket to hw-assisted
> mode, the socket is marked as being assisted by hardware, and all socket
> operations are offloaded to hardware.
>
> The following flag values have been added to struct socket (only visible within
> the guest kernel):
>
>  * SOCK_HWASSIST
>    Indicates socket operations are handled by hardware
>
> In order to support a variety of socket address families, addresses are
> converted from their native socket family to an opaque string. Our initial
> design formats these strings as URIs. The currently supported conversions are:
>
> +-----------------------------------------------------------------------------+
> |   Domain   |      Type     |  URI example conversion                        |
> |  AF_INET   |  SOCK_STREAM  |  tcp://x.x.x.x:yyyy                            |
> |  AF_INET   |  SOCK_DGRAM   |  udp://x.x.x.x:yyyy                            |
> |  AF_INET6  |  SOCK_STREAM  |  tcp6://aaaa:b:cccc:d:eeee:ffff:gggg:hhhh/ii   |
> |  AF_INET6  |  SOCK_DGRAM   |  udp6://aaaa:b:cccc:d:eeee:ffff:gggg:hhhh/ii   |
> |  AF_IPX    |  SOCK_DGRAM   |  ipx://xxxxxxxx.yyyyyyyyyy.zzzz                |
> +-----------------------------------------------------------------------------+
>
> In order for the socket coprocessor to take control of a socket, hooks must be
> added to the socket core. Our initial implementation hooks a number of functions
> in the socket-core (too many), and after consideration we feel we can reduce it
> down considerably by managing the socket 'ops' pointers.
>
> ALTERNATIVE STRATEGIES
> ----------------------
>
> An alternative strategy for providing similar functionality involves either
> modifying glibc or using LD_PRELOAD tricks to intercept socket calls. We were
> forced to rule this out due to the complexity (and fragility) involved with
> attempting to maintain a general solution compatible accross various
> distributions where platform-libraries differ.
>
> CAVEATS
> -------
>
>  * We're currently hooked into too many socket calls. We should be able to
>   reduce the number of hooks to 3 (__sock_create(), sys_connect(), sys_bind()).
>
>  * Our 'hw_socket' component should be folded into a netdev so we can leverage
>   NAPI.
>
>  * We don't handle SOCK_SEQPACKET, SOCK_RAW, SOCK_RDM, or SOCK_PACKET sockets.
>
>  * We don't currently have support for /proc/net. Our current plan is to
>   add '/proc/net/hwsock' (filename TBD) and add support for these sockets
>   to the net-tools packages (netstat & friends), rather than muck around with
>   plumbing hardware-assisted socket info into '/proc/net/tcp' and
>   '/proc/net/udp'.
>
>  * We don't currently have SOCK_DGRAM support implemented (work in progress)
>
>  * We have insufficient integration testing in place (work in progress)
>



-- 
San Mehat | Staff Software Engineer | san@...gle.com | 415-366-6172
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ