Open Source and information security mailing list archives
Date:   Fri, 9 Mar 2018 17:05:20 +0100
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Jason Wang <jasowang@...hat.com>
Cc:     netdev@...r.kernel.org,
        Björn Töpel <bjorn.topel@...el.com>,
        magnus.karlsson@...el.com, eugenia@...lanox.com,
        John Fastabend <john.fastabend@...il.com>,
        Eran Ben Elisha <eranbe@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>, galp@...lanox.com,
        Daniel Borkmann <borkmann@...earbox.net>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Tariq Toukan <tariqt@...lanox.com>, brouer@...hat.com
Subject: Re: [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API
 and use in cpumap

On Fri, 9 Mar 2018 21:04:23 +0800
Jason Wang <jasowang@...hat.com> wrote:

> On 2018年03月09日 17:35, Jesper Dangaard Brouer wrote:
> > On Fri, 9 Mar 2018 15:24:10 +0800
> > Jason Wang <jasowang@...hat.com> wrote:
> >  
> >> On 2018年03月08日 21:07, Jesper Dangaard Brouer wrote:  
> >>> Introduce an xdp_return_frame API, and convert over cpumap as
> >>> the first user, given it has a queued XDP frame structure to leverage.
> >>>
> >>> Signed-off-by: Jesper Dangaard Brouer <brouer@...hat.com>
> >>> ---
> >>>    include/net/xdp.h   |   32 +++++++++++++++++++++++++++
> >>>    kernel/bpf/cpumap.c |   60 +++++++++++++++++++++++++++++++--------------------
> >>>    net/core/xdp.c      |   18 +++++++++++++++
> >>>    3 files changed, 86 insertions(+), 24 deletions(-)
> >>>
> >>> diff --git a/include/net/xdp.h b/include/net/xdp.h
> >>> index b2362ddfa694..3cb726a6dc5b 100644
> >>> --- a/include/net/xdp.h
> >>> +++ b/include/net/xdp.h
> >>> @@ -33,16 +33,48 @@
> >>>     * also mandatory during RX-ring setup.
> >>>     */
> >>>    
> >>> +enum mem_type {
> >>> +	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
> >>> +	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
> >>> +	// Possible new ideas for types:
> >>> +	// MEM_TYPE_PAGE_POOL,    /* Will be added later  */
> >>> +	// MEM_TYPE_AF_XDP,
> >>> +	// MEM_TYPE_DEVICE_OWNED -- invoking an dev->ndo?
> >>> +	MEM_TYPE_MAX,
> >>> +};  
> >> So if we plan to support dev->ndo, it looks to me like the two types
> >> AF_XDP and DEVICE_OWNED are sufficient? The driver can do what it wants
> >> (e.g. page pool or an ordinary page allocator) in the ndo or whatever
> >> other callbacks.  
> > So, the design is open to go both ways; we can figure it out later.
> >
> > The reason I'm not calling page_pool from a driver-level dev->ndo is
> > that I'm trying to avoid invoking code in a module, as the driver
> > module code can be (in the process of being) unloaded from the kernel
> > while another driver is calling xdp_return_frame.  The current design
> > in patch 10 uses the RCU grace period to guarantee that the allocator
> > pointer is valid as long as the ID lookup succeeded.
> >
> > To do what you propose, we also need to guarantee that the net_device
> > cannot disappear, so it is safe to invoke dev->ndo.  To do so, the
> > driver likely has to add an rcu_barrier before unloading.  I'm also
> > thinking that for MEM_TYPE_DEVICE_OWNED the allocator pointer ("under
> > protection") could be the net_device pointer, and we could take a
> > refcnt on the dev (dev_hold/dev_put).  
> 
> Then it looks like a device (e.g. tun) can lock another device for an 
> indefinite time.
> 
> > Thus, I think it is doable, but let's
> > figure this out later.
> >
> > Another important design consideration, is that the xdp core need to
> > know how to release memory in case the ID lookup failed.  
> 
> Can we simply use put_page() in this case?

For now, yes. The (future) use case is when pages are kept DMA mapped;
then the page also needs to be DMA unmapped.  To DMA unmap, we need to
know the struct device (not the net_device).  (And then the discussion
starts around how to know the struct device is still valid.)


> >   This is
> > another argument for keeping the memory release code inside the xdp
> > core, and not leave too much freedom to the drivers.
> >
> >  
> >>> +struct xdp_mem_info {
> >>> +	u32 type; /* enum mem_type, but known size type */
> >>> +	u32 id; // Needed later (to lookup struct xdp_rxq_info)
> >>> +};
> >>> +
> >>>    struct xdp_rxq_info {
> >>>    	struct net_device *dev;
> >>>    	u32 queue_index;
> >>>    	u32 reg_state;
> >>> +	struct xdp_mem_info mem;
> >>>    } ____cacheline_aligned; /* perf critical, avoid false-sharing */
> >>>    
> >>> +
> >>> +static inline
> >>> +void xdp_return_frame(void *data, struct xdp_mem_info *mem)
> >>> +{
> >>> +	if (mem->type == MEM_TYPE_PAGE_SHARED)
> >>> +		page_frag_free(data);
> >>> +
> >>> +	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
> >>> +		struct page *page = virt_to_page(data); /* Assumes order0 page*/
> >>> +
> >>> +		put_page(page);
> >>> +	}
> >>> +}
> >>> +
> >>>    int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
> >>>    		     struct net_device *dev, u32 queue_index);
> >>>    void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
> >>>    void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
> >>>    bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
> >>> +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
> >>> +			       enum mem_type type, void *allocator);
> >>>    
> >>>    #endif /* __LINUX_NET_XDP_H__ */
> >>> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> >>> index a4bb0b34375a..3e4bbcbe3e86 100644
> >>> --- a/kernel/bpf/cpumap.c
> >>> +++ b/kernel/bpf/cpumap.c  
> > [...]  
> >>>    static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> >>>    {
> >>>    	atomic_inc(&rcpu->refcnt);
> >>> @@ -188,6 +168,10 @@ struct xdp_pkt {
> >>>    	u16 len;
> >>>    	u16 headroom;
> >>>    	u16 metasize;
> >>> +	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> >>> +	 * while mem info is valid on remote CPU.
> >>> +	 */  
> >> Can we simply move the xdp_mem_info to xdp_buff to avoid conversion?  
> > No, xdp_buff is a stack-allocated piece of memory, thus we do need a
> > conversion into another piece of memory.
> >  
> 
> Right, but just duplicating the xdp_buff content is sufficient in this 
> case, or is there anything else I missed?

That is what I'm doing (just with state compression).


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
