lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <cover.1757004393.git.petrm@nvidia.com>
Date: Thu, 4 Sep 2025 19:07:17 +0200
From: Petr Machata <petrm@...dia.com>
To: "David S. Miller" <davem@...emloft.net>, Eric Dumazet
	<edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
	<pabeni@...hat.com>, <netdev@...r.kernel.org>
CC: Simon Horman <horms@...nel.org>, Nikolay Aleksandrov
	<razor@...ckwall.org>, Ido Schimmel <idosch@...dia.com>,
	<bridge@...ts.linux.dev>, Petr Machata <petrm@...dia.com>, <mlxsw@...dia.com>
Subject: [PATCH net-next 00/10] bridge: Allow keeping local FDB entries only on VLAN 0

The bridge FDB contains one local entry per port per VLAN, for the MAC of
the port in question, and likewise for the bridge itself. This allows
bridge to locally receive and punt "up" any packets whose destination MAC
address matches that of one of the bridge interfaces or of the bridge
itself.

The number of these local "service" FDB entries grows linearly with number
of bridge-global VLAN memberships, but that in turn will tend to grow
quadratically with number of ports and per-port VLAN memberships. While
that does not cause issues during forwarding lookups, it does make dumps
impractically slow.

As an example, with 100 interfaces, each on 4K VLANs, a full dump of FDB
that just contains these 400K local entries, takes 6.5s. That's _without_
considering iproute2 formatting overhead, this is just how long it takes to
walk the FDB (repeatedly), serialize it into netlink messages, and parse
the messages back in userspace.

This is to illustrate that with growing number of ports and VLANs, the time
required to dump this repetitive information blows up. Arguably 4K VLANs
per interface is not a very realistic configuration, but then modern
switches can instead have several hundred interfaces, and we have fielded
requests for >1K VLAN memberships per port among customers.

FDB entries are currently all kept on a single linked list, and then
dumping uses this linked list to walk all entries and dump them in order.
When the message buffer is full, the iteration is cut short, and later
restarted. Of course, to restart the iteration, it's first necessary to
walk the already-dumped front part of the list before starting dumping
again. So one possibility is to organize the FDB entries in different
structure more amenable to walk restarts.

One option is to walk directly the hash table. The advantage is that no
auxiliary structure needs to be introduced. With a rough sketch of this
approach, the above scenario gets dumped in not quite 3 s, saving over 50 %
of time. However hash table iteration requires maintaining an active cursor
that must be collected when the dump is aborted. It looks like that would
require changes in the NDO protocol to allow to run this cleanup. Moreover,
on hash table resize the iteration is simply restarted. FDB dumps are
currently not guaranteed to correspond to any one particular state: entries
can be missed, or be duplicated. But with hash table iteration we would get
that plus the much less graceful resize behavior, where swaths of FDB are
duplicated.

Another option is to maintain the FDB entries in a red-black tree. We have
a PoC of this approach on hand, and the above scenario is dumped in about
2.5 s. Still not as snappy as we'd like it, but better than the hash table.
However the savings come at the expense of a more expensive insertion, and
require locking during dumps, which blocks insertion.

The upside of these approaches is that they provide benefits whatever the
FDB contents. But it does not seem like either of these is workable.
However we intend to clean up the RB tree PoC and present it for
consideration later on in case the trade-offs are considered acceptable.

Yet another option might be to use in-kernel FDB filtering, and to filter
the local entries when dumping. Unfortunately, this does not help all that
much either, because the linked-list walk still needs to happen. Also, with
the obvious filtering interface built around ndm_flags / ndm_state
filtering, one can't just exclude pure local entries in one query. One
needs to dump all non-local entries first, and then to get permanent
entries in another run filter local & added_by_user. I.e. one needs to pay
the iteration overhead twice, and then integrate the result in userspace.
To get significant savings, one would need a very specific knob like "dump,
but skip/only include local entries". But if we are adding a local-specific
knobs, maybe let's have an option to just not duplicate them in the first
place.

All this FDB duplication is there merely to make things snappy during
forwarding. But high-radix switches with thousands of VLANs typically do
not process much traffic in the SW datapath at all, but rather offload vast
majority of it. So we could exchange some of the runtime performance for a
neater FDB.

To that end, in this patchset, introduce a new bridge option,
BR_BOOLOPT_FDB_LOCAL_VLAN_0, which when enabled, has local FDB entries
installed only on VLAN 0, instead of duplicating them across all VLANs.
Then to maintain the local termination behavior, on FDB miss, the bridge
does a second lookup on VLAN 0.

Enabling this option changes the bridge behavior in expected ways. Since
the entries are only kept on VLAN 0, FDB get, flush and dump will not
perceive them on non-0 VLANs. And deleting the VLAN 0 entry affects
forwarding on all VLANs.

This patchset is loosely based on a privately circulated patch by Nikolay
Aleksandrov.

The patchset progresses as follows:

- Patch #1 introduces a bridge option to enable the above feature. Then
  patches #2 to #5 gradually patch the bridge to do the right thing when
  the option is enabled. Finally patch #6 adds the UAPI knob and the code
  for when the feature is enabled or disabled.
- Patches #7, #8 and #9 contain fixes and improvements to selftest
  libraries
- Patch #10 contains a new selftest

The corresponding iproute2 support is at:
https://github.com/pmachata/iproute2/commits/fdb_local_vlan_0/

Petr Machata (10):
  net: bridge: Introduce BROPT_FDB_LOCAL_VLAN_0
  net: bridge: BROPT_FDB_LOCAL_VLAN_0: Look up FDB on VLAN 0 on miss
  net: bridge: BROPT_FDB_LOCAL_VLAN_0: On port changeaddr, skip per-VLAN
    FDBs
  net: bridge: BROPT_FDB_LOCAL_VLAN_0: On bridge changeaddr, skip
    per-VLAN FDBs
  net: bridge: BROPT_FDB_LOCAL_VLAN_0: Skip local FDBs on VLAN creation
  net: bridge: Introduce UAPI for BR_BOOLOPT_FDB_LOCAL_VLAN_0
  selftests: defer: Allow spaces in arguments of deferred commands
  selftests: defer: Introduce DEFER_PAUSE_ON_FAIL
  selftests: net: lib.sh: Don't defer failed commands
  selftests: forwarding: Add test for BR_BOOLOPT_FDB_LOCAL_VLAN_0

 include/uapi/linux/if_bridge.h                |   3 +
 net/bridge/br.c                               |  22 ++
 net/bridge/br_fdb.c                           | 114 +++++-
 net/bridge/br_input.c                         |   8 +
 net/bridge/br_private.h                       |   3 +
 net/bridge/br_vlan.c                          |  10 +-
 .../testing/selftests/net/forwarding/Makefile |   1 +
 .../net/forwarding/bridge_fdb_local_vlan_0.sh | 374 ++++++++++++++++++
 tools/testing/selftests/net/lib.sh            |  32 +-
 tools/testing/selftests/net/lib/sh/defer.sh   |  20 +-
 10 files changed, 559 insertions(+), 28 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/bridge_fdb_local_vlan_0.sh

-- 
2.49.0


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ