netdev - Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170313202300.GA52396@ast-mbp.thefacebook.com>
Date:   Mon, 13 Mar 2017 13:23:02 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     "David S . Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Tariq Toukan <tariqt@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Willem de Bruijn <willemb@...gle.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Eric Dumazet <eric.dumazet@...il.com>,
        Alexander Duyck <alexander.duyck@...il.com>
Subject: Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path

On Mon, Mar 13, 2017 at 11:58:05AM -0700, Eric Dumazet wrote:
> On Mon, Mar 13, 2017 at 11:31 AM, Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
> > On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
> >> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
> >> <alexei.starovoitov@...il.com> wrote:
> >> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
> >> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >> >>                       case XDP_PASS:
> >> >>                               break;
> >> >>                       case XDP_TX:
> >> >> +                             /* Make sure we have one page ready to replace this one */
> >> >> +                             npage = NULL;
> >> >> +                             if (!ring->page_cache.index) {
> >> >> +                                     npage = mlx4_alloc_page(priv, ring,
> >> >> +                                                             &ndma, numa_mem_id(),
> >> >> +                                                             GFP_ATOMIC | __GFP_MEMALLOC);
> >> >
> >> > did you test this with xdp2 test ?
> >> > under what conditions it allocates ?
> >> > It looks dangerous from security point of view to do allocations here.
> >> > Can it be exploited by an attacker?
> >> > we use xdp for ddos and lb and this is fast path.
> >> > If 1 out of 100s XDP_TX packets hit this allocation we will have serious
> >> > perf regression.
> >> > In general I dont think it's a good idea to penalize x86 in favor of powerpc.
> >> > Can you #ifdef this new code somehow? so we won't have these concerns on x86?
> >>
> >> Normal paths would never hit this point really. I wanted to be extra
> >> safe, because who knows, some guys could be tempted to set
> >> ethtool -G ethX  rx 512 tx 8192
> >>
> >> Before this patch, if you were able to push enough frames in TX ring,
> >> you would also eventually be forced to allocate memory, or drop frames...
> >
> > hmm. not following.
> > Into xdp tx queues packets don't come from stack. It can only be via xdp_tx.
> > So this rx page belongs to driver, not shared with anyone and it only needs to
> > be put onto tx ring, so I don't understand why driver needs to allocating
> > anything here. To refill the rx ring? but why here?
> 
> Because RX ring can be shared, by packets goind to the upper stacks (say TCP)
> 
> So there is no guarantee that the pages in the quarantine pool have
> their page count to 1.
> 
> The normal TX_XDP path will recycle pages in ring->cache_page .
> 
> This is exactly where I pick up a replacement.
> 
> Pages in ring->cache_page have the guarantee to have no other users
> than ourself (mlx4 driver)
> 
> You might have not noticed that current mlx4 driver has a lazy refill
> of RX ring buffers, that eventually
> removes all the pages from RX ring, and we have to recover with this
> lazy mlx4_en_recover_from_oom() thing
> that will attempt to restart the allocations.
> 
> After my patch, we have the guarantee that the RX ring buffer is
> always fully populated.
> 
> When we receive a frame (XDP or not), we drop it if we can not
> replenish the RX slot,
> in case the oldest page in quarantine is not a recycling candidate and
> we can not allocate a new page.

Got it. Could you please add above explanation to the commit log,
since it's a huge difference vs other drivers.
I don't think any other driver in the tree follows this strategy.
I think that's the right call, but it shouldn't be hidden in details.
If that's the right approach (probably is) we should do it
in the other drivers too.
Though I don't see you dropping "to the stack" packet in this patch.
The idea is to drop the packet (whether xdp or not) if rx
buffer cannot be replinshed _regardless_ whether driver
implements recycling, right?

> > rx 512 tx 8192 is meaningless from xdp pov, since most of the tx entries
> > will be unused.
> 
> Yes, but as an author of a kernel patch, I do not want to be
> responsible for a crash
> _if_ someone tries something dumb like that.
> 
> > why are you saying it will cause this if (!ring->page_cache.index) to trigger?
> >
> >> This patch does not penalize x86, quite the contrary.
> >> It brings a (small) improvement on x86, and a huge improvement on powerpc.
> >
> > for normal tcp stack. sure. I'm worried about xdp fast path that needs to be tested.
> 
> Note that TX_XDP in mlx4 is completely broken as I mentioned earlier.

Theory vs practice. We don't see this issue running xdp.
My understanding, since rx and tx napi are scheduled for the same cpu
there is a guarantee that during normal invocation they won't race
and the only problem is due to mlx4_en_recover_from_oom() that can
run rx napi on some random cpu.