netdev - Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iJRPmQcShBiPwcK_jaAQZsQOOStg5TmY-=AmORKSNPiOA@mail.gmail.com>
Date:   Wed, 15 Mar 2017 19:48:04 -0700
From:   Eric Dumazet <edumazet@...gle.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Tariq Toukan <tariqt@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Willem de Bruijn <willemb@...gle.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Alexander Duyck <alexander.duyck@...il.com>
Subject: Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path

On Wed, Mar 15, 2017 at 6:56 PM, Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
> On Wed, Mar 15, 2017 at 06:07:16PM -0700, Eric Dumazet wrote:
>> On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:
>>
>> > yes. and we have 'xdp_tx_full' counter for it that we monitor.
>> > When tx ring and mtu are sized properly, we don't expect to see it
>> > incrementing at all. This is something in our control. 'Our' means
>> > humans that setup the environment.
>> > 'cache empty' condition is something ephemeral. Packets will be dropped
>> > and we won't have any tools to address it. These packets are real
>> > people requests. Any drop needs to be categorized.
>> > Like there is 'rx_fifo_errors' counter that mlx4 reports when
>> > hw is dropping packets before they reach the driver. We see it
>> > incrementing depending on the traffic pattern though overall Mpps
>> > through the nic is not too high and this is something we
>> > actively investigating too.
>>
>>
>> This all looks nice, except that current mlx4 driver does not have a
>> counter of failed allocations.
>>
>> You are asking me to keep some inexistent functionality.
>>
>> Look in David net tree :
>>
>> mlx4_en_refill_rx_buffers()
>> ...
>>    if (mlx4_en_prepare_rx_desc(...))
>>       break;
>>
>>
>> So in case of memory pressure, mlx4 stops working and not a single
>> counter is incremented/reported.
>
> Not quite. That is exactly the case i'm asking to keep.
> In case of memory pressure (like in the above case rx fifo
> won't be replenished) and in case of hw receiving
> faster than the driver can drain the rx ring,
> the hw will increment 'rx_fifo_errors' counter.

In current mlx4 driver, if napi_get_frags() fails, no counter is incremented.

So you are describing quite a different behavior, where _cpu_ can not
keep up and rx_fifo_errors is incremented.

But in case of _memory_ pressure (and normal traffic), rx_fifo_errors
wont be incremented.

And even if xdp_prog 'decides' to return XDP_PASS, the fine packet
will be dropped anyway.


> And that's what we monitor already and what I described in previous email.
>
>> Is it really what you want ?
>
> almost. see below.
>
>>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   38 +++++++++++++----------------
>>  1 file changed, 18 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> index cc41f2f145541b469b52e7014659d5fdbb7dac68..e5ef8999087b52705faf083c94cde439aab9e2b7 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> @@ -793,10 +793,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>>                 if (xdp_prog) {
>>                         struct xdp_buff xdp;
>>                         struct page *npage;
>> -                       dma_addr_t ndma, dma;
>> +                       dma_addr_t dma;
>>                         void *orig_data;
>>                         u32 act;
>>
>> +                       /* Make sure we have one page ready to replace this one, per Alexei request */
>
> do you have to add snarky comments?

Is request a bad or offensive word ?

What would be the best way to say that you asked to move this code
here, while in my opinion it was better where it was  ?

>
>> +                       if (unlikely(!ring->page_cache.index)) {
>> +                               npage = mlx4_alloc_page(priv, ring,
>> +                                                       &ring->page_cache.buf[0].dma,
>> +                                                       numa_mem_id(),
>> +                                                       GFP_ATOMIC | __GFP_MEMALLOC);
>> +                               if (!npage) {
>> +                                       /* replace this by a new ring->rx_alloc_failed++
>> +                                        */
>> +                                       ring->xdp_drop++;
>
> counting it as 'xdp_drop' is incorrect.

I added a comment to make that very clear .

If you do not read the comment, what can I say ?

So the comment is :

 replace this by a new ring->rx_alloc_failed++

This of course will require other changes in other files (folding
stats at ethtool -S)
that are irrelevant for the discussion we have right now.

I wont provide full patch without knowing exactly what you are requesting.


> 'xdp_drop' should be incremented only when program actually doing it,
> otherwise that's confusing to the user.
> If you add new counter 'rx_alloc_failed' (as you implying above)
> than it's similar to the existing state.
> Before: for both hw receives too much and oom with rx fifo empty - we
> will see 'rx_fifo_errors' counter.
> After: most rx_fifo_erros would mean hw receive issues and oom will
> be covered by this new counter.
>
> Another thought... if we do this 'replenish rx ring immediately'
> why do it for xdp rings only? Let's do it for all rings?
> the above 'if (unlikely(!ring->page_cache.index)) ..alloc_page'
> can move before 'if (xdp_prog)' and simplify the rest?
>

Because non XDP paths attempt to use the page pool first.

_if_ the oldest page in page pool can not be recycled, then we
allocate a fresh page,
from a special pool (order-X preallocations) that does not fit the
page_cache order-0 model

Non XDP paths do not need to populate page_cache with one order-0
page, that would add extra useless code.