Date:   Tue, 31 Jan 2017 08:39:19 -0800
From:   Stephen Hemminger <stephen@...workplumber.org>
To:     Nikolay Aleksandrov <nikolay@...ulusnetworks.com>
Cc:     netdev@...r.kernel.org, roopa@...ulusnetworks.com,
        davem@...emloft.net
Subject: Re: [PATCH RFC net-next 0/4] bridge: improve cache utilization

On Tue, 31 Jan 2017 16:31:54 +0100
Nikolay Aleksandrov <nikolay@...ulusnetworks.com> wrote:

> Hi all,
> This is the first set which begins to deal with the bad bridge cache
> access patterns. The first patch rearranges the bridge and port structs
> a little so the frequently (and closely) accessed members are in the same
> cache line. The second patch then moves the garbage collection to a
> workqueue, trying to improve system responsiveness under load (many fdbs)
> and, more importantly, removing the need to check whether the matched entry
> has expired in __br_fdb_get, which was a source of false-sharing.
> The third patch is a preparation for the final one, which adds a new option
> that allows turning off "used" updates on transmit to an fdb, the other
> source of false-sharing in the bridge. If properly configured, i.e.
> ports bound to CPUs (thus updating "updated" locally) and "used_enabled"
> set to 0, then the bridge's HitM goes from 100% to 0%, but even with
> "used_enabled" = 1 we still get a win because lookups that iterated over
> the hash chain caused false-sharing due to the first cache line being
> used for both the mac/vid and the used/updated fields.
> I'm not a fan of adding new options to the bridge, so I'm open to
> suggestions for this one. Things I've tried before that:
> - dynamically allocate a pool of percpu memory for used/updated fields
>   (works but it's more complicated as we need dynamic resizing, too)
> - dynamically allocate percpu memory for each fdb separately (I'm not a fan
>   of this one, but it's much simpler to implement)
> 
> Of these two obviously the first approach worked best, but the complexity
> it brings is considerable, while with this patchset we can achieve the same
> result with proper configuration. Any feedback on this one would be greatly
> appreciated.
> 
> Some results from tests I've run:
> (note that these were run in good conditions for the baseline, everything
>  ran on a single NUMA node and there were only 3 fdbs)
> 
> 1. baseline
> 100% Load HitM on the fdbs (between everyone who has done lookups and hit
>                             one of the 3 hash chains of the communicating
>                             src/dst fdbs)
> Overall 5.06% Load HitM for the bridge, first place in the list
> 
> bridge udp rr between 3 cores/ports/fdbs test:
>    0.35%  ksoftirqd/0  [k] br_fdb_update
>    0.30%  ksoftirqd/0  [k] __br_fdb_get
>    0.17%  ksoftirqd/0  [k] br_handle_frame_finish
> 
> 2. used = 1
> 0% Local Load HitM
> bridge udp rr between 3 cores/ports/fdbs test:
>    0.24%  ksoftirqd/0    [k] br_fdb_update
>    0.23%  ksoftirqd/0    [k] __br_fdb_get
>    0.12%  ksoftirqd/0    [k] br_handle_frame_finish
> 
> 3. used = 0
> 0% Local Load HitM
> bridge udp rr between 3 cores/ports/fdbs test:
>    0.23%  ksoftirqd/0    [k] __br_fdb_get
>    0.22%  ksoftirqd/0    [k] br_fdb_update
>    0.10%  ksoftirqd/0    [k] br_handle_frame_finish
> 
> 
> Thanks,
>  Nik
> 
> Nikolay Aleksandrov (4):
>   bridge: modify bridge and port to have often accessed fields in one
>     cache line
>   bridge: move to workqueue gc
>   bridge: move write-heavy fdb members in their own cache line
>   bridge: add ability to turn off fdb used updates
> 
>  include/uapi/linux/if_link.h |  1 +
>  net/bridge/br_device.c       |  2 ++
>  net/bridge/br_fdb.c          | 31 ++++++++++++++---------
>  net/bridge/br_if.c           |  2 +-
>  net/bridge/br_input.c        |  3 ++-
>  net/bridge/br_ioctl.c        |  2 +-
>  net/bridge/br_netlink.c      | 12 +++++++--
>  net/bridge/br_private.h      | 58 ++++++++++++++++++++++----------------------
>  net/bridge/br_stp.c          |  2 +-
>  net/bridge/br_stp_if.c       |  4 +--
>  net/bridge/br_stp_timer.c    |  2 --
>  net/bridge/br_sysfs_br.c     | 25 ++++++++++++++++++-
>  12 files changed, 92 insertions(+), 52 deletions(-)
> 

I agree with the first 3 patches, but not the last one.
Changing the API just for a performance hack is unnecessary. Instead,
make the algorithm smarter and use per-cpu values.
