Message-ID: <20181101152716.GA13895@intel.com>
Date: Thu, 1 Nov 2018 23:27:16 +0800
From: Aaron Lu <aaron.lu@...el.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Paweł Staszewski <pstaszewski@...are.pl>,
Eric Dumazet <eric.dumazet@...il.com>,
netdev <netdev@...r.kernel.org>,
Tariq Toukan <tariqt@...lanox.com>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Yoel Caspersen <yoel@...knet.dk>,
Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote:
... ...
> Section copied out:
>
> mlx5e_poll_tx_cq
> |
> --16.34%--napi_consume_skb
> |
> |--12.65%--__free_pages_ok
> | |
> | --11.86%--free_one_page
> | |
> | |--10.10%--queued_spin_lock_slowpath
> | |
> | --0.65%--_raw_spin_lock
This call chain looks like it is freeing higher-order pages, not order-0:
__free_pages_ok() is only called for pages whose order is greater than 0.
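To illustrate the split: in the kernel, order-0 frees normally go to the
per-CPU page (pcp) lists and avoid the zone->lock, while higher-order frees
fall through to __free_pages_ok() -> free_one_page() under zone->lock, which
is the contended path in the profile above. A toy model of that dispatch
(names mirror the real functions, but this is a simulation, not kernel code):

```c
/* Toy model of the free-path dispatch described above.  Not kernel
 * code: just encodes which path a page of a given order takes. */
enum free_path {
	PCP_LIST,	/* free_unref_page(): per-CPU list, no zone->lock */
	ZONE_LOCKED,	/* __free_pages_ok() -> free_one_page(): zone->lock */
};

static enum free_path free_path_for_order(unsigned int order)
{
	if (order == 0)
		return PCP_LIST;	/* lockless fast path */
	return ZONE_LOCKED;		/* contended slow path */
}
```

So if the driver is freeing order-3 RX pages here, every free takes the
zone->lock, which matches the queued_spin_lock_slowpath hit above.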
> |
> |--1.55%--page_frag_free
> |
> --1.44%--skb_release_data
>
>
> Let me explain what (I think) happens. The mlx5 driver RX-page recycle
> mechanism is not effective in this workload, and pages have to go
> through the page allocator. The lock contention happens during mlx5
> DMA TX completion cycle. And the page allocator cannot keep up at
> these speeds.
>
> One solution is to extend the page allocator with a bulk free API. (This has
> been on my TODO list for a long time, but I don't have a
> micro-benchmark that tricks the driver page-recycle into failing). It should
> fit nicely, as I can see that kmem_cache_free_bulk() does get
> activated (bulk freeing SKBs), which means that DMA TX completion does
> have a bulk of packets.
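The win from such a bulk API would come from amortizing the zone->lock over
the batch that DMA TX completion already has. A toy sketch of the idea
(free_pages_bulk() and everything else here are hypothetical stand-ins, not
an existing kernel API):

```c
#include <stddef.h>

/* Hypothetical sketch of bulk page freeing.  The struct and function
 * names are illustrative only; the lock is modeled by a counter so
 * the cost difference is visible. */

struct page { unsigned int order; };

static unsigned long lock_acquisitions;	/* simulated zone->lock cost */

static void zone_lock(void)   { lock_acquisitions++; }
static void zone_unlock(void) { }
static void free_one_locked(struct page *p) { (void)p; }

/* Today's path: one lock round-trip per page. */
static void free_page_single(struct page *p)
{
	zone_lock();
	free_one_locked(p);
	zone_unlock();
}

/* Bulk variant: one lock round-trip for the whole batch. */
static void free_pages_bulk(struct page **pages, size_t n)
{
	size_t i;

	zone_lock();
	for (i = 0; i < n; i++)
		free_one_locked(pages[i]);
	zone_unlock();
}
```

With a 64-packet TX completion batch, that is 64 lock acquisitions versus
one, which is why bulk freeing pairs naturally with the existing
kmem_cache_free_bulk() batching of SKBs.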
>
> We can (and should) also improve the page recycle scheme in the driver.
> After LPC, I have a project with Tariq and Ilias (Cc'ed) to improve the
> page_pool, and we will (attempt) to generalize this, for both high-end
> mlx5 and more low-end ARM64-boards (macchiatobin and espressobin).
>
> The MM people are working in parallel to improve the performance of
> order-0 page returns. Thus, the explicit page bulk free API might
> actually become less important. I actually think (Cc.) Aaron has a
> patchset he would like you to test, which removes the (zone->)lock
> you hit in free_one_page().
Thanks Jesper.
Yes, the said patchset is in this branch:
https://github.com/aaronlu/linux no_merge_cluster_alloc_4.19-rc5
But as I said above, I think the lock contention here is for
order > 0 pages, so my current patchset will not work here, unfortunately.
BTW, Mel Gorman has suggested an alternative way to improve the page
allocator's scalability and I'm working on it right now; it will
improve the page allocator's scalability for pages of all orders. I might be
able to post it some time next week and will CC all of you when it's ready.