netdev - Re: [RFC 2/2] page_frag_cache: Store metadata in struct page

Message-ID: <20180319092718.071f4801@redhat.com>
Date:   Mon, 19 Mar 2018 09:27:18 +0100
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     brouer@...hat.com, Alexander Duyck <alexander.duyck@...il.com>,
        Alexander Duyck <alexander.h.duyck@...el.com>,
        linux-mm@...r.kernel.org, Netdev <netdev@...r.kernel.org>,
        Matthew Wilcox <mawilcox@...rosoft.com>,
        Paolo Abeni <pabeni@...hat.com>
Subject: Re: [RFC 2/2] page_frag_cache: Store metadata in struct page

On Fri, 16 Mar 2018 14:05:00 -0700 Matthew Wilcox <willy@...radead.org> wrote:

> I understand your concern about the cacheline bouncing between the
> freeing and allocating CPUs.  Is cross-CPU freeing a frequent
> occurrence?  From looking at its current usage, it seemed like the
> allocation and freeing were usually on the same CPU.

While we (the network stack) in many cases try to alloc and free on the
same CPU, in practical default setups it will be the common case to
alloc and free on different CPUs.  The scheduler moves processes
between CPUs, and irqbalance changes which CPU does the DMA TX
completion (in the case of forwarding).  I usually pin/align the NIC
IRQs manually (via proc smp_affinity_list) and manually pin/taskset the
userspace process (and make sure to test both local/remote alloc/free
cases when benchmarking).
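
As a rough illustration of the manual pinning described above, here is a
minimal C sketch.  The helper names (pin_self_to_cpu, irq_affinity_path,
set_irq_affinity) are hypothetical and only for illustration; the calls
they wrap, sched_setaffinity(2) and the /proc/irq/<N>/smp_affinity_list
file, are the standard Linux interfaces.  Writing the IRQ affinity
normally requires root, and irqbalance may overwrite it unless it is
stopped or told to leave that IRQ alone.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single CPU, similar in effect to
 * "taskset -c <cpu>".  Returns 0 on success, -1 on error. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Build the procfs path that holds the CPU list for one IRQ.
 * Returns the number of characters snprintf() would have written. */
static int irq_affinity_path(char *buf, size_t len, int irq)
{
    return snprintf(buf, len, "/proc/irq/%d/smp_affinity_list", irq);
}

/* Steer one IRQ to one CPU by writing the CPU number to procfs.
 * Returns 0 on success, -1 on error (e.g. not root, no such IRQ). */
static int set_irq_affinity(int irq, int cpu)
{
    char path[64];
    FILE *f;
    int ok;

    irq_affinity_path(path, sizeof(path), irq);
    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fprintf(f, "%d\n", cpu) > 0;
    if (fclose(f) != 0)   /* the kernel may reject the value at flush time */
        ok = 0;
    return ok ? 0 : -1;
}
```

For a local-vs-remote comparison one would, for example, steer the NIC
RX IRQ to one CPU with set_irq_affinity() and call pin_self_to_cpu()
in the receiver with either the same or a different CPU number.  The
IRQ numbers themselves are setup-specific and have to be looked up in
/proc/interrupts.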

I used to recommend that people pin the RX userspace process to the NAPI
RX CPU, but based on my benchmarking I no longer do that.  At least for
UDP (after Paolo Abeni's optimizations), there is a significant
performance advantage to running the UDP receiver on another CPU (in the
range from 800Kpps to 2200Kpps).  Plus, it avoids the softirq starvation
problem.

Mellanox even has a perf tuning tool that explicitly moves the DMA
TX-completion IRQ to run on a different CPU than RX.  Thus, I assume
that they have evidence/benchmarks that show this is an advantage.

More recently I implemented XDP cpumap redirect, which explicitly
moves the raw page/frame to be handled on a remote CPU, mostly to move
another MM alloc/free overhead away from the RX-CPU, namely the SKB
alloc/free overhead.  I'm working on an XDP return frame API, but for
now, performance depends on the page_frag recycle tricks (although, for
the sake of accuracy, it doesn't directly depend on the page_frag_cache
API, but on similar pagecnt_bias tricks).
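
For readers not familiar with the pagecnt_bias idea, the following is a
simplified userspace simulation in C, not kernel or driver code.  All
names (sim_page, sim_frag_cache, FRAG_BIAS_MAX and the sim_* helpers)
are hypothetical.  The sketch assumes the general shape of the trick:
the allocating side takes a large batch of page references up front
and then hands them out by decrementing a CPU-local counter, so only
the freeing side touches the shared atomic refcount.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FRAG_BIAS_MAX 64   /* hypothetical batch size of references */

struct sim_page {
    atomic_int refcount;   /* shared: decremented by whoever frees */
};

struct sim_frag_cache {
    struct sim_page *page;
    int pagecnt_bias;      /* local: references still held by the cache */
};

/* Take FRAG_BIAS_MAX references with a single atomic operation. */
static void sim_cache_init(struct sim_frag_cache *c, struct sim_page *p)
{
    atomic_store(&p->refcount, FRAG_BIAS_MAX);
    c->page = p;
    c->pagecnt_bias = FRAG_BIAS_MAX;
}

/* Hand one pre-taken reference to a new fragment.
 * No atomic operation on the shared page is needed here. */
static bool sim_frag_alloc(struct sim_frag_cache *c)
{
    if (c->pagecnt_bias == 0)
        return false;      /* batch exhausted */
    c->pagecnt_bias--;
    return true;
}

/* Drop one fragment reference, possibly from a remote CPU.
 * This is where the cacheline holding the refcount can bounce. */
static void sim_frag_free(struct sim_page *p)
{
    atomic_fetch_sub(&p->refcount, 1);
}

/* The page can be recycled when every remaining reference is one the
 * cache itself still holds, i.e. no fragment is outstanding. */
static bool sim_cache_can_recycle(struct sim_frag_cache *c)
{
    return atomic_load(&c->page->refcount) == c->pagecnt_bias;
}
```

In this model the alloc path is free of atomics, while each remote free
still writes the shared counter; that write is the cross-CPU cost under
discussion in this thread.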

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer