netdev - Re: OVS Offload Decision Proposal

Message-ID: <CA+mtBx9U-HUXcfk+oOuNRwpKq74WXmbkWvQDrYURCavfWZjQaQ@mail.gmail.com>
Date:	Wed, 4 Mar 2015 08:45:56 -0800
From:	Tom Herbert <therbert@...gle.com>
To:	Simon Horman <simon.horman@...ronome.com>
Cc:	"dev@...nvswitch.org" <dev@...nvswitch.org>,
	Linux Netdev List <netdev@...r.kernel.org>
Subject: Re: OVS Offload Decision Proposal

Hi Simon, a few comments inline.

On Tue, Mar 3, 2015 at 5:18 PM, Simon Horman <simon.horman@...ronome.com> wrote:
> [ CCed netdev as although this is primarily about Open vSwitch userspace
>   I believe there are some interested parties not on the Open vSwitch
>   dev mailing list ]
>
> Hi,
>
> The purpose of this email is to describe a rough design for driving Open
> vSwitch flow offload from user-space. But before getting to that I would
> like to provide some background information.
>
> The proposed design is for "OVS Offload Decision": a proposed component of
> ovs-vswitchd. In short the top-most red box in the first figure in the
> "OVS HW Offload Architecture" document edited by Thomas Graf[1].
>
> [1] https://docs.google.com/document/d/195waUliu7G5YYVuXHmLmHgJ38DFSte321WPq0oaFhyU/edit#heading=h.116je16s8xzw
>
> Assumptions
> -----------
>
> There is currently a lively debate on various aspects of flow offloads
> within the Linux networking community. As of writing the latest discussion
> centers around the "Flows! Offload them." thread[2] on the netdev mailing
> list.
>
> [2] http://thread.gmane.org/gmane.linux.network/351860
>
> My aim is not to preempt the outcome of those discussions. But rather to
> investigate what offloads might look like in ovs-vswitchd. In order to make
> that investigation concrete I have made some assumptions about facilities
> that may be provided by the kernel in future. Clearly if the discussions
> within the Linux networking community end in a solution that differs from
> my assumptions then this work will need to be revisited. Indeed, I entirely
> expect this work to be revised and refined and possibly even radically
> rethought as time goes on.
>
> That said, my working assumptions are:
>
> * That Open vSwitch may manage flow offloads from user-space. This is as
>   opposed to them being transparently handled in the datapath. This does
>   not preclude the existence of transparent offloading in the datapath.
>   But rather limits this discussion to a mode where offloads are managed
>   from user-space.
>
> * That Open vSwitch may add flows to hardware via an API provided by the
>   kernel. In particular my working assumption is that the Flow API proposed
>   by John Fastabend[3] may be used to add flows to hardware, while the
>   existing netlink API may be used to add flows to the kernel datapath.
>
Doesn't this imply two entities independently managing the same
physical resource? If so, this raises the question of how the resource
would be partitioned between them. How are conflicting requests
between the two reconciled?

> * That there will be an API provided by the kernel to allow the discovery
>   of hardware offload capabilities by user-space. Again my working
>   assumption is that the Flow API proposed by John Fastabend[3] may be used
>   for this purpose.
>
> [3] http://thread.gmane.org/gmane.linux.network/347188
>
> Rough Design
> ------------
>
> * Modify flow translation so that the switch parent id[4] of the flow is
>   recorded as part of its translation context. The switch parent id was
>   recently added to the Linux kernel and provides a common identifier for
>   all netdevices that are backed by the same underlying switch hardware for
>   some very loose definition of switch. In this scheme if the input and all
>   output ports of a flow belong to the same switch hardware then the switch
>   id of the translation context would be set accordingly, indicating
>   offload of the flow may occur to that switch.
>
>   [4] https://github.com/torvalds/linux/blob/master/Documentation/networking/switchdev.txt
>
>   At this time this excludes flows that either span multiple switch
>   devices or use vports that are not backed directly by netdevices, for
>   example tunnel vports. While important, I believe these are topics for
>   further work.
>
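A minimal sketch of the check described above, to make the rule
concrete. The function name, the port-to-switch-id mapping, and the
example port names are all hypothetical; the proposal does not specify
how the translation context would store this.

```python
from typing import Dict, Iterable, Optional


def flow_switch_id(port_switch_ids: Dict[str, Optional[str]],
                   in_port: str,
                   out_ports: Iterable[str]) -> Optional[str]:
    """Return the switch id shared by all ports of a flow, else None.

    port_switch_ids maps a port name to its switch parent id, or to
    None for ports not backed directly by switch hardware (for example
    a tunnel vport in this toy model).
    """
    switch_id = port_switch_ids.get(in_port)
    if switch_id is None:
        return None  # input port has no backing switch hardware
    for port in out_ports:
        if port_switch_ids.get(port) != switch_id:
            return None  # flow spans switches or uses a software port
    return switch_id


# Hypothetical ports: two on switch "sw-a", one on "sw-b", one tunnel.
ports = {"eth0": "sw-a", "eth1": "sw-a", "eth2": "sw-b", "vxlan0": None}
same_switch = flow_switch_id(ports, "eth0", ["eth1"])
spans_switches = flow_switch_id(ports, "eth0", ["eth1", "eth2"])
via_tunnel = flow_switch_id(ports, "eth0", ["vxlan0"])
```

Only the first flow would be marked as an offload candidate; the other
two fall into the excluded cases mentioned in the quoted text.
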
> * At the point where a flow is to be added to the datapath ovs-vswitchd
>   should determine if it should be offloaded and if so translate it to a
>   flow for the hardware offload API and queue this translated flow up to be
>   added to hardware as well as the datapath.
>
>   The translation to hardware flows could be performed along with the
>   translation that already occurs from OpenFlow to ODP flows. However, that
>   translation is already quite complex and called for a variety of reasons
>   other than to prepare flows to be added to the datapath. So I think it
>   makes some sense to keep the new translation separate from the existing
>   one.
>
>   The determination mentioned above could first check if the switch id is
>   set and then may make further checks: for example that there is space in
>   the hardware for a new flow, that all the matches and actions of the flow
>   may be offloaded.
>
>   There seems to be ample scope for complex logic to determine which flows
>   should be offloaded. And I believe that one motivation for handling
>   offloads in user-space is for such complex logic to live in user-space.

I think there needs to be more thought around the long term
ramifications of this model. Aside from the potential conflicts with
the kernel that I mentioned above, as well as the inevitable
replication of functionality between kernel and userspace, I don't see
that we have any good precedents for dynamically managing a HW offload
from user space like this. AFAIK, all current networking offloads are
managed by the kernel or the device, and I believe iSCSI, RDMA qp's,
and even TOE offloads were all managed in the kernel. The basic problem
of choosing the best M of N total flows to offload really isn't
fundamentally different from some other kernel mechanisms, such as how
we need to manage the memory allocated to the page cache.
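
To make the "best M of N" problem concrete, here is a rough sketch.
It assumes that "best" means highest recent packet rate, which is only
one possible metric and is not something this thread has settled; the
function and flow names are hypothetical.

```python
import heapq
from typing import Dict, List


def pick_flows_to_offload(packets_per_sec: Dict[str, float],
                          capacity: int) -> List[str]:
    """Return up to `capacity` flow names, busiest first.

    packets_per_sec maps a flow name to its measured rate. Flows that
    do not make the cut stay in the software datapath only.
    """
    if capacity <= 0:
        return []
    # nlargest iterates over the dict's keys and ranks them by rate.
    return heapq.nlargest(capacity, packets_per_sec,
                          key=packets_per_sec.get)


# N = 4 candidate flows, M = 2 hardware slots.
rates = {"f1": 10.0, "f2": 900.0, "f3": 45.0, "f4": 300.0}
chosen = pick_flows_to_offload(rates, 2)  # ["f2", "f4"]
```

Whoever runs this selection, kernel or user space, has to re-run it as
rates change, which is much the same eviction problem a cache faces.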

>   However, in order to keep things simple in the beginning I propose some
>   very simple logic: offload all flows that the hardware supports up until
>   the hardware runs out of space.
>
>   This seems like a reasonable start keeping in mind that all flows will
>   also be added to the datapath and that ovs-vswitchd constructs flows such
>   that they do not overlap.
>
Again, who will enforce this?
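
For reference, the non-overlap property itself is cheap to state. A
sketch, assuming each flow match is reduced to a single (value, mask)
pair of integers; real datapath flows match on many fields, so this is
a simplification, and the names are hypothetical.

```python
from typing import NamedTuple


class Match(NamedTuple):
    value: int  # field bits to match
    mask: int   # 1 bits are significant, 0 bits are wildcarded


def overlaps(a: Match, b: Match) -> bool:
    """True if at least one packet could match both a and b.

    Two masked matches overlap exactly when their values agree on
    every bit that both masks treat as significant.
    """
    common = a.mask & b.mask
    return ((a.value ^ b.value) & common) == 0


net_10 = Match(0x0A000000, 0xFF000000)    # 10.0.0.0/8
net_10_1 = Match(0x0A010000, 0xFFFF0000)  # 10.1.0.0/16
net_11 = Match(0x0B000000, 0xFF000000)    # 11.0.0.0/8

nested = overlaps(net_10, net_10_1)   # True: the /16 is inside the /8
disjoint = overlaps(net_10, net_11)   # False: first octet differs
```

The check is the easy part; the open question is which component runs
it against everything already installed in the hardware.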

>   A more conservative version of this simple rule would be to remove all
>   flows from hardware if a flow is encountered that is not to be added to
>   hardware. That is, ensure either all flows that are in hardware are also
>   in software or no flows are in hardware at all. This is the approach
>   being initially taken for L3 offloads in the Linux kernel[5].
>
That approach is a non-starter for real deployment anyway. Graceful
degradation is a fundamental requirement.
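
The difference between the two quoted rules can be sketched as below.
This is a toy model under stated assumptions: flows are plain names,
`supported` stands in for whatever capability check the hardware API
would offer, and both function names are hypothetical.

```python
from typing import Callable, Iterable, Set


def offload_until_full(flows: Iterable[str],
                       supported: Callable[[str], bool],
                       capacity: int) -> Set[str]:
    """Simple rule: offload each supported flow while space remains."""
    in_hw: Set[str] = set()
    for flow in flows:
        if len(in_hw) < capacity and supported(flow):
            in_hw.add(flow)
    return in_hw


def offload_all_or_nothing(flows: Iterable[str],
                           supported: Callable[[str], bool],
                           capacity: int) -> Set[str]:
    """Conservative rule: one flow that cannot go to hardware
    (unsupported, or no room) empties the hardware table."""
    candidates = list(flows)
    if len(candidates) > capacity:
        return set()
    if not all(supported(f) for f in candidates):
        return set()
    return set(candidates)


def no_tunnels(flow: str) -> bool:
    # Stand-in capability check: tunnel flows cannot be offloaded.
    return not flow.startswith("tunnel")


flows = ["a", "b", "tunnel-c", "d"]
graceful = offload_until_full(flows, no_tunnels, 3)   # {"a", "b", "d"}
cliff = offload_all_or_nothing(flows, no_tunnels, 3)  # set()
```

With the conservative rule a single unsupported flow pushes all
traffic back to software, which is the cliff I am objecting to.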

>   [5] http://thread.gmane.org/gmane.linux.network/352481/focus=352658
>
> * It seems to me that a somewhat tricky problem is how to manage flows in
>   hardware. As things stand ovs-vswitchd generally manages flows in the
>   datapath by dumping flows, inspecting the dumped flows to see how
>   recently they have been used and removing idle flows from the datapath.
>   Unfortunately this approach may not be well suited to flows offloaded to
>   hardware as dumping flows may be prohibitively expensive. As such I would
>   like some consideration given to three approaches. Perhaps in the end all
>   will need to be supported. And perhaps there are others:
>
>   1. Dump Flows
>      This is the approach currently taken to managing datapath flows. As
>      stated above my feeling is that this will not be well suited to much
>      hardware. However, for simplicity it may be a good place to start.
>
>   2. Notifications
>      In this approach flows are added to hardware with a soft timeout and
>      hardware removes flows when they time out, sending a notification when
>      that occurs. Notifications would be relayed up to user space from the
>      driver in the kernel. Some effort may be required to mitigate
>      notification storms if many flows are removed in a short space of
>      time. It is also of note that there is likely to be hardware that
>      can't generate notifications on flow removal.
>
>   3. Aging in hardware
>      In this approach flows are added to hardware with a soft timeout and
>      hardware removes the flows when they time out. However no notification
>      is generated. And thus ovs-vswitchd has no way of knowing if a flow is
>      still present in hardware or not. From a hardware point of view this
>      seems to be the simplest to support. But I suspect that it would
>      present some significant challenges to ovs-vswitchd in the context of
>      its current implementation of flow management. Especially if flows are
>      also to be present in the datapath as proposed above.
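
A toy model of soft-timeout aging may help compare the three options.
Everything here is hypothetical (class, method names, timestamps as
plain floats); it only illustrates idle expiry plus batching of removal
notifications, not any real driver or ovs-vswitchd interface.

```python
from typing import Dict, List


class HwFlowTable:
    """Toy hardware table with one soft (idle) timeout for all flows."""

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout
        self.last_used: Dict[str, float] = {}

    def add(self, flow: str, now: float) -> None:
        self.last_used[flow] = now

    def hit(self, flow: str, now: float) -> None:
        # A matching packet refreshes the flow's idle timer.
        if flow in self.last_used:
            self.last_used[flow] = now

    def expire(self, now: float) -> List[str]:
        """Remove idle flows and return their names, sorted."""
        expired = sorted(f for f, t in self.last_used.items()
                         if now - t >= self.idle_timeout)
        for flow in expired:
            del self.last_used[flow]
        return expired


def batch(removed: List[str], max_batch: int) -> List[List[str]]:
    """Group removals so one message can carry several flows
    (max_batch must be positive)."""
    return [removed[i:i + max_batch]
            for i in range(0, len(removed), max_batch)]


table = HwFlowTable(idle_timeout=10.0)
for name in ("a", "b", "c"):
    table.add(name, now=0.0)
table.hit("b", now=8.0)
removed = table.expire(now=10.0)   # ["a", "c"]; "b" was used recently
messages = batch(removed, 2)       # [["a", "c"]]
```

In option 2 the result of expire() is what would be relayed to user
space, batched to blunt notification storms. In option 3 it is simply
dropped, and user space no longer knows what is left in the table.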
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html