Message-ID: <20210416154523.3b1fe700@carbon>
Date:   Fri, 16 Apr 2021 15:45:23 +0200
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Martin KaFai Lau <kafai@...com>
Cc:     Toke Høiland-Jørgensen <toke@...hat.com>,
        Hangbin Liu <liuhangbin@...il.com>, <bpf@...r.kernel.org>,
        <netdev@...r.kernel.org>, Jiri Benc <jbenc@...hat.com>,
        Eelco Chaudron <echaudro@...hat.com>, <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Lorenzo Bianconi <lorenzo.bianconi@...hat.com>,
        David Ahern <dsahern@...il.com>,
        Andrii Nakryiko <andrii.nakryiko@...il.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        John Fastabend <john.fastabend@...il.com>,
        Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
        Björn Töpel 
        <bjorn.topel@...il.com>, brouer@...hat.com,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Subject: Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush
 instead of bulk enqueue

On Thu, 15 Apr 2021 17:39:13 -0700
Martin KaFai Lau <kafai@...com> wrote:

> On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> > Jesper Dangaard Brouer <brouer@...hat.com> writes:
> >   
> > > On Thu, 15 Apr 2021 10:35:51 -0700
> > > Martin KaFai Lau <kafai@...com> wrote:
> > >  
> > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:  
> > >> > Hangbin Liu <liuhangbin@...il.com> writes:
> > >> >     
> > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:    
> > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> > >> > >> >  {
> > >> > >> >  	struct net_device *dev = bq->dev;
> > >> > >> > -	int sent = 0, err = 0;
> > >> > >> > +	int sent = 0, drops = 0, err = 0;
> > >> > >> > +	unsigned int cnt = bq->count;
> > >> > >> > +	int to_send = cnt;
> > >> > >> >  	int i;
> > >> > >> >  
> > >> > >> > -	if (unlikely(!bq->count))
> > >> > >> > +	if (unlikely(!cnt))
> > >> > >> >  		return;
> > >> > >> >  
> > >> > >> > -	for (i = 0; i < bq->count; i++) {
> > >> > >> > +	for (i = 0; i < cnt; i++) {
> > >> > >> >  		struct xdp_frame *xdpf = bq->q[i];
> > >> > >> >  
> > >> > >> >  		prefetch(xdpf);
> > >> > >> >  	}
> > >> > >> >  
> > >> > >> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> > >> > >> > +	if (bq->xdp_prog) {    
> > >> > >> bq->xdp_prog is used here
> > >> > >>     
> > >> > >> > +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> > >> > >> > +		if (!to_send)
> > >> > >> > +			goto out;
> > >> > >> > +
> > >> > >> > +		drops = cnt - to_send;
> > >> > >> > +	}
> > >> > >> > +    
> > >> > >> 
> > >> > >> [ ... ]
> > >> > >>     
> > >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> > >> > >> > -		       struct net_device *dev_rx)
> > >> > >> > +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
> > >> > >> >  {
> > >> > >> >  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> > >> > >> >  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> > >> > >> >  	/* Ingress dev_rx will be the same for all xdp_frame's in
> > >> > >> >  	 * bulk_queue, because bq stored per-CPU and must be flushed
> > >> > >> >  	 * from net_device drivers NAPI func end.
> > >> > >> > +	 *
> > >> > >> > +	 * Do the same with xdp_prog and flush_list since these fields
> > >> > >> > +	 * are only ever modified together.
> > >> > >> >  	 */
> > >> > >> > -	if (!bq->dev_rx)
> > >> > >> > +	if (!bq->dev_rx) {
> > >> > >> >  		bq->dev_rx = dev_rx;
> > >> > >> > +		bq->xdp_prog = xdp_prog;    
> > >> > >> bq->xdp_prog is assigned here and could be used later in bq_xmit_all().
> > >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> > >> > >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> > >> > >> 
> > >> > >> e.g. what if the devmap elem gets deleted.    
> > >> > >
> > >> > > Jesper knows better than me. From my view, based on the description of
> > >> > > __dev_flush():
> > >> > >
> > >> > > On devmap tear down we ensure the flush list is empty before completing to
> > >> > > ensure all flush operations have completed. When drivers update the bpf
> > >> > > program they may need to ensure any flush ops are also complete.    
> > >>
> > >> AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.

The bq->xdp_prog comes from the devmap "dev" element, and it is stored
temporarily in the "bq" structure that is only valid for this
softirq NAPI-cycle.  I'm slightly worried that we copied this pointer
to the xdp_prog here, more below (and Q for Paul).

> > >> > 
> > >> > Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> > >> > which also runs under one big rcu_read_lock(). So the storage in the
> > >> > bulk queue is quite temporary, it's just used for bulking to increase
> > >> > performance :)    
> > >>
> > >> I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
> > >> i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
> > >> in i40e_run_xdp() and it is fine.
> > >> 
> > >> In this patch, dst->xdp_prog is run outside of i40e_run_xdp() where the
> > >> rcu_read_unlock() has already done.  It is now run in xdp_do_flush_map().
> > >> or I missed the big rcu_read_lock() in i40e_napi_poll()?
> > >>
> > >> I do see the big rcu_read_lock() in mlx5e_napi_poll().  
> > >
> > > I believed/assumed xdp_do_flush_map() was already protected under an
> > > rcu_read_lock.  As the devmap and cpumap, which get called via
> > > __dev_flush() and __cpu_map_flush(), have multiple RCU objects that we
> > > are operating on.  
>
> What other rcu objects it is using during flush?

Look at the code:
 kernel/bpf/cpumap.c
 kernel/bpf/devmap.c

The devmap is filled with RCU code and complicated take-down steps.  
The devmap's elements are also RCU objects and the BPF xdp_prog is
embedded in this object (struct bpf_dtab_netdev).  The call_rcu
function is __dev_map_entry_free().


> > > Perhaps it is a bug in i40e?  
>
> A quick look into ixgbe falls into the same bucket.
> didn't look at other drivers though.

Intel drivers are very much in copy-paste mode.
 
> > >
> > > We are running in softirq in NAPI context, when xdp_do_flush_map() is
> > > call, which I think means that this CPU will not go-through a RCU grace
> > > period before we exit softirq, so in-practice it should be safe.  
> > 
> > Yup, this seems to be correct: rcu_softirq_qs() is only called between
> > full invocations of the softirq handler, which for networking is
> > net_rx_action(), and so translates into full NAPI poll cycles.  
>
> I don't know enough to comment on the rcu/softirq part, may be someone
> can chime in.  There is also a recent napi_threaded_poll().

CC added Paul. (link to patch[1][2] for context)

> If it is the case, then some of the existing rcu_read_lock() is unnecessary?

Well, in many cases, especially depending on how the kernel is compiled,
that is true.  But we want to keep these, as they also document the
intent of the programmer, and they allow us to make the kernel even more
preempt-able in the future.

> At least, it sounds incorrect to only make an exception here while keeping
> other rcu_read_lock() as-is.

Let me be clear:  I think you have spotted a problem, and we need to
add rcu_read_lock() at least around the invocation of
bpf_prog_run_xdp(), or earlier, around the if-statement that calls
dev_map_bpf_prog_run(). (Hangbin, please do this in V8).
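A minimal userspace sketch of the second placement mentioned above (the
read-side section wrapped around the if-statement that calls
dev_map_bpf_prog_run()).  This is not the kernel code and not the V8
patch: rcu_read_lock()/rcu_read_unlock() are stubs that only count
nesting depth, struct bpf_prog is replaced by a hypothetical verdict
callback, and bq_xmit_all_model() and pass_large() are names made up for
illustration.  It only shows where the lock/unlock pair would sit; it
says nothing about whether the stored pointer is still valid at flush
time, which is the open question further down.

```c
/* Userspace model only: every kernel primitive below is a stub. */
#include <assert.h>
#include <stddef.h>

static int rcu_depth;                       /* stub: read-side nesting depth */
static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { assert(rcu_depth > 0); rcu_depth--; }

struct xdp_frame { int len; };

/* Hypothetical stand-in for struct bpf_prog: returns nonzero to pass a frame. */
struct bpf_prog { int (*pass)(const struct xdp_frame *f); };

#define BULK_SIZE 16

struct xdp_dev_bulk_queue {
	struct xdp_frame *q[BULK_SIZE];
	unsigned int count;
	struct bpf_prog *xdp_prog;
};

/* Model of dev_map_bpf_prog_run(): keep passed frames at the front of
 * the array and return how many are left to send.
 */
static int dev_map_bpf_prog_run(struct bpf_prog *prog,
				struct xdp_frame **frames, int n)
{
	int i, nframes = 0;

	assert(rcu_depth > 0);              /* must run inside a read-side section */
	for (i = 0; i < n; i++)
		if (prog->pass(frames[i]))
			frames[nframes++] = frames[i];
	return nframes;
}

/* Model of the flush path; returns the number of frames that would be
 * handed to ndo_xdp_xmit().
 */
static int bq_xmit_all_model(struct xdp_dev_bulk_queue *bq)
{
	int cnt = bq->count;
	int to_send = cnt;

	if (!cnt)
		return 0;

	if (bq->xdp_prog) {
		rcu_read_lock();            /* the added read-side section */
		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt);
		rcu_read_unlock();
	}

	bq->count = 0;
	bq->xdp_prog = NULL;                /* storage is only valid for one cycle */
	return to_send;
}

/* Example verdict (hypothetical): drop frames shorter than 64 bytes. */
static int pass_large(const struct xdp_frame *f) { return f->len >= 64; }
```

With three queued frames of length 60, 128 and 1500 and pass_large() as
the program, the model reports two frames left to send and leaves the
read-side depth back at zero.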

Thank you Martin for reviewing the code carefully enough to find this
issue: some drivers don't have an RCU-section around the full XDP
code path in their NAPI-loop.

Question to Paul.  (I will attempt to describe in generic terms what
happens, but reference real function names).

We are running in softirq/NAPI context.  The driver will call a
bq_enqueue() function for every packet (if calling xdp_do_redirect).
Some drivers wrap this with an rcu_read_lock/unlock() section (others
have a large RCU-read section that includes the flush operation).

In the bq_enqueue() function we have a per_cpu_ptr (that stores the
xdp_frame packets) that will get flushed/sent in the call to
xdp_do_flush() (which ends up calling bq_xmit_all()).  This flush will
happen before we end our softirq/NAPI context.

The extension is that the per_cpu_ptr data structure (after this patch)
stores a pointer to an xdp_prog (which is an RCU object).  In the flush
operation (which we will wrap with an RCU-read section), we will use this
xdp_prog pointer.   I can see that it is in principle wrong to pass
this pointer between RCU-read sections, but I consider this safe as we
are running under softirq/NAPI and the per_cpu_ptr is only valid in
this short interval.

I claim an RCU grace period (quiescent state) cannot happen between these
two RCU-read sections, but I might be wrong? (especially in the future or
for RT).
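
The question above can be illustrated with a toy single-CPU model in
userspace C.  It is an assumption-laden sketch, not real RCU: a
"grace period" here ends at the first quiescent state with no reader
active, while real RCU needs every CPU to pass through one.  All names
(call_rcu_model(), quiescent_state(), flush_sees_freed_prog()) are
hypothetical.  The model shows the three cases discussed in this thread:
two separate read-side sections with no quiescent state in between (the
"safe in practice" claim), two separate sections with a quiescent state
in between (the in-principle bug), and one big section covering both
enqueue and flush.

```c
/* Toy single-CPU model of RCU; not the kernel implementation. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int readers;            /* active read-side sections */
static bool *pending_free;     /* object queued for deferred free */

static void rcu_read_lock(void)   { readers++; }
static void rcu_read_unlock(void) { assert(readers > 0); readers--; }

/* Model of call_rcu(): remember what to free after the grace period. */
static void call_rcu_model(bool *freed) { pending_free = freed; }

/* Model of a quiescent state: with no reader active the grace period
 * ends and the deferred free runs.
 */
static void quiescent_state(void)
{
	if (readers == 0 && pending_free) {
		*pending_free = true;
		pending_free = NULL;
	}
}

/* Returns true if the flush step would touch an already-freed program. */
static bool flush_sees_freed_prog(bool one_big_section, bool qs_between)
{
	bool freed = false;
	bool seen_freed;

	if (one_big_section)
		rcu_read_lock();       /* driver-wide section around the poll loop */

	rcu_read_lock();               /* enqueue: pointer copied into per-CPU bq */
	rcu_read_unlock();

	call_rcu_model(&freed);        /* devmap element deleted concurrently */

	if (qs_between)
		quiescent_state();     /* what must NOT happen inside the cycle */

	rcu_read_lock();               /* flush: the stored pointer is used */
	seen_freed = freed;
	rcu_read_unlock();

	if (one_big_section)
		rcu_read_unlock();

	quiescent_state();             /* end of cycle: drain the deferred free */
	return seen_freed;
}
```

In this model only the combination "separate sections plus a quiescent
state in between" lets the flush see a freed program; whether that
combination can occur in softirq/NAPI context (or under RT or threaded
NAPI) is exactly what is being asked.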

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://lore.kernel.org/netdev/20210414122610.4037085-2-liuhangbin@gmail.com/
[2] https://patchwork.kernel.org/project/netdevbpf/patch/20210414122610.4037085-2-liuhangbin@gmail.com/
