[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJ+HfNh0mGnDnQD0FZqza0oEDZpj+nh_DS=JvWvJMATwsOMJEA@mail.gmail.com>
Date: Fri, 20 Dec 2019 11:09:57 +0100
From: Björn Töpel <bjorn.topel@...il.com>
To: Magnus Karlsson <magnus.karlsson@...el.com>
Cc: Björn Töpel <bjorn.topel@...el.com>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Netdev <netdev@...r.kernel.org>,
Jonathan Lemon <jonathan.lemon@...il.com>,
bpf <bpf@...r.kernel.org>, Saeed Mahameed <saeedm@...lanox.com>,
Jeff Kirsher <jeffrey.t.kirsher@...el.com>,
"Fijalkowski, Maciej" <maciej.fijalkowski@...el.com>,
Maciej Fijalkowski <maciejromanfijalkowski@...il.com>
Subject: Re: [PATCH bpf-next v2 00/12] xsk: clean up ring access functions
On Thu, 19 Dec 2019 at 13:40, Magnus Karlsson <magnus.karlsson@...el.com> wrote:
>
> This patch set cleans up the ring access functions of AF_XDP in hope
> that it will now be easier to understand and maintain. I used to get a
> headache every time I looked at this code in order to really understand it,
> but now I do think it is a lot less painful.
>
> The code has been simplified a lot and as a bonus we get better
> performance in nearly all cases. On my new 2.1 GHz Cascade Lake
> machine with a standard default config plus AF_XDP support and
> CONFIG_PREEMPT on I get the following results in percent performance
> increases with this patch set compared to without it:
>
> Zero-copy (-N):
> rxdrop txpush l2fwd
> 1 core: -2% 0% 3%
> 2 cores: 4% 0% 3%
>
> Zero-copy with poll() (-N -p):
> rxdrop txpush l2fwd
> 1 core: 3% 0% 1%
> 2 cores: 21% 0% 9%
>
> Skb mode (-S):
> Shows a 0% to 5% performance improvement over the same benchmarks as
> above.
>
> Here 1 core means that we are running the driver processing and the
> application on the same core, while 2 cores means that they execute on
> separate cores. The applications are from the xdpsock sample app.
>
> On my older 2.0 Ghz Broadwell machine that I used for the v1, I get
> the following results:
>
> Zero-copy (-N):
> rxdrop txpush l2fwd
> 1 core: 4% 5% 4%
> 2 cores: 1% 0% 2%
>
> Zero-copy with poll() (-N -p):
> rxdrop txpush l2fwd
> 1 core: 1% 3% 3%
> 2 cores: 22% 0% 5%
>
> Skb mode (-S):
> Shows a 0% to 1% performance improvement over the same benchmarks as
> above.
>
> When a results says 21 or 22% better, as in the case of poll mode with
> 2 cores and rxdrop, my first reaction is that it must be a
> bug. Everything else shows between 0% and 5% performance
> improvement. What is giving rise to 22%? A quick bisect indicates that
> it is patches 2, 3, 4, 5, and 6 that are giving rise to most of this
> improvement. So not one patch in particular, but something around 4%
> improvement from each one of them. Note that exactly this benchmark
> has previously had an extraordinary slow down compared to when running
> without poll syscalls. For all the other poll tests above, the
> slowdown has always been around 4% for using poll syscalls. But with
> the bad performing test in question, it was above 25%. Interestingly,
> after this clean up, the slow down is 4%, just like all the other poll
> tests. Please take an extra peek at this so I have not messed up
> something.
>
> The 0% for several txpush results are due to the test bottlenecking on
> a non-CPU HW resource. If I eliminated that bottleneck on my system, I
> would expect to see an increase there too.
>
> Changes v1 -> v2:
> * Corrected textual errors in the commit logs (Sergei and Martin)
> * Fixed the functions that detect empty and full rings so that they
> now operate on the global ring state (Maxim)
>
> This patch has been applied against commit a352a82496d1 ("Merge branch 'libbpf-extern-followups'")
>
> Structure of the patch set:
>
> Patch 1: Eliminate the lazy update threshold used when preallocating
> entries in the completion ring
> Patch 2: Simplify the detection of empty and full rings
> Patch 3: Consolidate the two local producer pointers into one
> Patch 4: Standardize the naming of the producer ring access functions
> Patch 5: Eliminate the Rx batch size used for the fill ring
> Patch 6: Simplify the functions xskq_nb_avail and xskq_nb_free
> Patch 7: Simplify and standardize the naming of the consumer ring
> access functions
> Patch 8: Change the names of the validation functions to improve
> readability and also the return value of these functions
> Patch 9: Change the name of xsk_umem_discard_addr() to
> xsk_umem_release_addr() to better reflect the new
> names. Requires a name change in the drivers that support AF_XDP
> zero-copy.
> Patch 10: Remove unnecessary READ_ONCE of data in the ring
> Patch 11: Add overall function naming comment and reorder the functions
> for easier reference
> Patch 12: Use the struct_size helper function when allocating rings
>
> Thanks: Magnus
>
Very nice cleanup (and performance boost)!
For the series:
Reviewed-by: Björn Töpel <bjorn.topel@...el.com>
Tested-by: Björn Töpel <bjorn.topel@...el.com>
Acked-by: Björn Töpel <bjorn.topel@...el.com>
> Magnus Karlsson (12):
> xsk: eliminate the lazy update threshold
> xsk: simplify detection of empty and full rings
> xsk: consolidate to one single cached producer pointer
> xsk: standardize naming of producer ring access functions
> xsk: eliminate the RX batch size
> xsk: simplify xskq_nb_avail and xskq_nb_free
> xsk: simplify the consumer ring access functions
> xsk: change names of validation functions
> xsk: ixgbe: i40e: ice: mlx5: xsk_umem_discard_addr to
> xsk_umem_release_addr
> xsk: remove unnecessary READ_ONCE of data
> xsk: add function naming comments and reorder functions
> xsk: use struct_size() helper
>
> drivers/net/ethernet/intel/i40e/i40e_xsk.c | 4 +-
> drivers/net/ethernet/intel/ice/ice_xsk.c | 4 +-
> drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 4 +-
> .../net/ethernet/mellanox/mlx5/core/en/xsk/rx.c | 2 +-
> include/net/xdp_sock.h | 14 +-
> net/xdp/xsk.c | 62 ++--
> net/xdp/xsk_queue.c | 15 +-
> net/xdp/xsk_queue.h | 371 +++++++++++----------
> 8 files changed, 246 insertions(+), 230 deletions(-)
>
> --
> 2.7.4
Powered by blists - more mailing lists