Message-ID: <20161214220438.4608f2bb@redhat.com>
Date: Wed, 14 Dec 2016 22:04:38 +0100
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: John Fastabend <john.fastabend@...il.com>
Cc: David Miller <davem@...emloft.net>, cl@...ux.com,
rppt@...ux.vnet.ibm.com, netdev@...r.kernel.org,
linux-mm@...ck.org, willemdebruijn.kernel@...il.com,
bjorn.topel@...el.com, magnus.karlsson@...el.com,
alexander.duyck@...il.com, mgorman@...hsingularity.net,
tom@...bertland.com, bblanco@...mgrid.com, tariqt@...lanox.com,
saeedm@...lanox.com, jesse.brandeburg@...el.com, METH@...ibm.com,
vyasevich@...il.com, brouer@...hat.com
Subject: Re: Designing a safe RX-zero-copy Memory Model for Networking
On Wed, 14 Dec 2016 08:32:10 -0800
John Fastabend <john.fastabend@...il.com> wrote:
> On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 13 Dec 2016 12:08:21 -0800
> > John Fastabend <john.fastabend@...il.com> wrote:
> >
> >> On 16-12-13 11:53 AM, David Miller wrote:
> >>> From: John Fastabend <john.fastabend@...il.com>
> >>> Date: Tue, 13 Dec 2016 09:43:59 -0800
> >>>
> >>>> What does "zero-copy send packet-pages to the application/socket that
> >>>> requested this" mean? At the moment on x86 page-flipping appears to be
> >>>> more expensive than memcpy (I can post some data shortly) and shared
> >>>> memory was proposed and rejected for security reasons when we were
> >>>> working on bifurcated driver.
> >>>
> >>> The whole idea is that we map all the active RX ring pages into
> >>> userspace from the start.
> >>>
> >>> And just as Jesper's page pool work will avoid DMA map/unmap,
> >>> it will also avoid changing the userspace mapping of the pages
> >>> as well.
> >>>
> >>> Thus avoiding the TLB/VM overhead altogether.
> >>>
> >
> > Exactly. It is worth mentioning that pages entering the page pool need
> > to be cleared (measured cost 143 cycles), in order not to leak any
> > kernel info. The primary focus of this design is to make sure we do not
> > leak kernel info to userspace, but an "exclusive" mode also supports
> > isolation between applications.
> >
> >
> >> I get this but it requires applications to be isolated. The pages from
> >> a queue cannot be shared between multiple applications in different
> >> trust domains. And the application has to be cooperative, meaning it
> >> can't "look" at data that has not been marked by the stack as OK. In
> >> these schemes we tend to end up with something like virtio/vhost or
> >> af_packet.
> >
> > I expect 3 modes when enabling RX-zero-copy on a page_pool. The first
> > two would require CAP_NET_ADMIN privileges. All modes have a trust
> > domain id that needs to match, e.g. when the page reaches the socket.
>
> Even mode 3 should require CAP_NET_ADMIN; we don't want userspace to
> grab queues off the NIC without it IMO.
Good point.
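
For the setup call, I imagine something roughly like the sketch below.
Nothing of this exists yet; the enum, the page_pool fields and the
function name are all invented for illustration:

#include <linux/capability.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Illustration only -- none of these names exist yet */
enum pp_zc_mode {
        PP_ZC_SHARED,           /* mode-1 */
        PP_ZC_SINGLE_USER,      /* mode-2 */
        PP_ZC_EXCLUSIVE,        /* mode-3 */
};

static int page_pool_enable_zc(struct page_pool *pool,
                               enum pp_zc_mode mode, u32 domain_id)
{
        /* All three modes require CAP_NET_ADMIN, as agreed above */
        if (!capable(CAP_NET_ADMIN))
                return -EPERM;

        pool->zc_mode = mode;                   /* hypothetical field */
        pool->zc_domain_id = domain_id;         /* hypothetical field */
        return 0;
}
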
> >
> > Mode-1 "Shared": The application chooses the lowest isolation level,
> > allowing multiple applications to mmap the VMA area.
>
> My only point here is that applications can read each other's data, and
> all applications need to cooperate; for example, one app could try to
> write continuously to read-only pages, causing faults and what not. This
> is all non-standard and doesn't play well with cgroups and "normal"
> applications. It requires a new orchestration model.
>
> I'm a bit skeptical of the use case but I know of a handful of reasons
> to use this model. Maybe take a look at the ivshmem implementation in
> DPDK.
>
> Also this still requires a hardware filter to push "application" traffic
> onto reserved queues/pages as far as I can tell.
>
> >
> > Mode-2 "Single-user": The application requests to be the only user of
> > the RX queue. This blocks other applications from mmap'ing the VMA
> > area.
> >
>
> Assuming data is read-only, sharing with the stack is possibly OK :/. I
> guess you would need two pools of memory, for data and skb, so you don't
> leak skb into user space.
Yes, as described in the original email and here[1]: "once an
application requests zero-copy RX, then the driver must use a specific
SKB allocation mode and might have to reconfigure the RX-ring."
The SKB allocation mode is "read-only packet page", which is the
current default mode (also described in the document[1]) of using
skb-frags.
[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
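
To make that allocation mode concrete, a rough sketch of the driver RX
path is below. The function name and the 128-byte header copy-break are
just illustration; the SKB helpers are the existing ones. The headers
get copied into the SKB's own writable data area, while the payload
stays in the (read-only, userspace-mappable) page as a frag:

#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/mm.h>

static struct sk_buff *rx_zc_build_skb(struct napi_struct *napi,
                                       struct page *page, unsigned int off,
                                       unsigned int len)
{
        /* Copy-break size for headers; 128 is only an example value */
        unsigned int hlen = min(len, 128U);
        struct sk_buff *skb;

        skb = napi_alloc_skb(napi, hlen);
        if (unlikely(!skb))
                return NULL;

        /* Headers are copied into the SKB's own (writable) data area */
        memcpy(__skb_put(skb, hlen), page_address(page) + off, hlen);

        /* Payload stays in the page, attached as a read-only frag, so
         * the page itself can later be mapped read-only into userspace */
        if (len > hlen)
                skb_add_rx_frag(skb, 0, page, off + hlen, len - hlen,
                                PAGE_SIZE);
        return skb;
}
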
> The devil's in the details here. There are lots of hooks in the kernel
> that can, for example, push the packet with a 'redirect' tc action. And
> letting an app "read" data or impact performance of an unrelated
> application is wrong IMO. Stacked devices also provide another set of
> details that are a bit difficult to track down; see all the hardware
> offload efforts.
>
> I assume all these concerns are shared between mode-1 and mode-2
>
> > Mode-3 "Exclusive": The application requests to own the RX queue.
> > Packets are no longer allowed for normal netstack delivery.
> >
>
> I have patches for this mode already but haven't pushed them due to
> an alternative solution using VFIO.
Interesting.
> > Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> > still allowed to travel the netstack and can thus contain packet data
> > from other normal applications. This is part of the design, to share
> > the NIC between the netstack and an accelerated userspace application
> > using RX zero-copy delivery.
> >
>
> I don't think this is acceptable to be honest. Letting an application
> potentially read/impact other arbitrary applications on the system
> seems like a non-starter even with CAP_NET_ADMIN. At least this was
> the conclusion from bifurcated driver work some time ago.
I thought the bifurcated driver work was rejected because it could leak
kernel info in the pages. This approach cannot.
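
To be specific: the pages get cleared once, when they are admitted to a
pool that can be mapped to userspace, which is where the ~143 cycles
mentioned above go. A minimal sketch (the function name is invented,
the helpers are existing kernel APIs):

#include <linux/gfp.h>
#include <linux/highmem.h>

/* Sketch: a page is cleared once, when it enters a pool that can be
 * mapped to userspace, so no stale kernel data can ever leak */
static struct page *pp_zc_alloc_page(gfp_t gfp)
{
        struct page *page = alloc_page(gfp);

        if (page)
                clear_highpage(page);   /* the ~143 cycle cost measured */
        return page;
}
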
> >> Any ACLs/filtering/switching/headers need to be done in hardware or
> >> the application trust boundaries are broken.
> >
> > The software solution outlined allows the application to make the
> > choice of what trust boundary it wants.
> >
> > The "exclusive" mode-3 makes most sense together with HW filters.
> > Already today, we support creating a new RX queue based on an ethtool
> > ntuple HW filter; then you simply attach your application to that
> > queue in mode-3, and have full isolation.
> >
>
> I'm still pretty fuzzy on why mode-1 and mode-2 do not need HW filters.
> Without hardware filters we have no way of knowing who/what data is
> put in the page.
For sockets, an SKB carrying an RX zero-copy-able page can be steered
(as normal) into a given socket. Then we check whether the socket
requested zero-copy, and verify that the domain-id matches between the
page_pool and the socket.
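
As a sketch of that check at delivery time (the zc_* fields and the
helper are invented; nothing like this exists yet):

/* Sketch only: illustrates the intended domain-id check, not real code */
static bool sock_zc_allowed(const struct sock *sk,
                            const struct page_pool *pool)
{
        /* Socket must have explicitly opted in to RX zero-copy */
        if (!sk->sk_zc_requested)               /* hypothetical field */
                return false;

        /* Only expose the page if the trust domains match */
        return pool->zc_domain_id == sk->sk_zc_domain_id; /* hypothetical */
}

On a mismatch, I would expect delivery to simply fall back to normal
copy-based delivery rather than dropping the packet.
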
You can also use XDP to filter and steer the packet (which will be
faster than using the normal steering code).
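
E.g. classify on the UDP destination port in an XDP program; a sketch
below. Note that XDP today only knows PASS/DROP/TX/ABORTED, so the
"steer into the zero-copy queue" return code is purely hypothetical,
and the port number is just an example:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>

#define SEC(NAME) __attribute__((section(NAME), used))

/* Hypothetical action: steer the packet to the zero-copy RX queue/domain.
 * NOT a real XDP return code today -- illustration only. */
#define XDP_STEER_ZC    4

SEC("xdp")
int xdp_zc_steer(struct xdp_md *ctx)
{
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph;
        struct udphdr *udp;

        if ((void *)(eth + 1) > data_end ||
            eth->h_proto != __constant_htons(ETH_P_IP))
                return XDP_PASS;

        iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end || iph->protocol != IPPROTO_UDP)
                return XDP_PASS;

        /* Assumes no IP options, for brevity */
        udp = (void *)(iph + 1);
        if ((void *)(udp + 1) > data_end)
                return XDP_PASS;

        /* Example: the accelerated application owns UDP port 6000 */
        if (udp->dest == __constant_htons(6000))
                return XDP_STEER_ZC;    /* hypothetical steering action */

        return XDP_PASS;
}
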
> >
> >> If the above cannot be met then a copy is needed. What I am trying
> >> to tease out is the above comment along with other statements like
> >> this "can be done without HW filter features".
> >
> > Does this address your concerns?
> >
>
> I think we need to enforce strong isolation. An application should not
> be able to read data or impact other applications. I gather this is
> the case per the comment about normal applications in mode-2. A slightly
> weaker statement would be to say applications can only impact/read
> data of other applications in their domain. This might be OK as well.
I think this approach covers the "weaker statement", because only pages
within the pool are "exposed". Thus, the domain is the NIC (possibly
restricted to a single RX queue).
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer