Message-ID: <87v871skwu.fsf@waldekranz.com>
Date: Tue, 06 Feb 2024 15:54:25 +0100
From: Tobias Waldekranz <tobias@...dekranz.com>
To: Vladimir Oltean <olteanv@...il.com>
Cc: davem@...emloft.net, kuba@...nel.org, atenart@...nel.org,
roopa@...dia.com, razor@...ckwall.org, bridge@...ts.linux.dev,
netdev@...r.kernel.org, jiri@...nulli.us, ivecera@...hat.com
Subject: Re: [PATCH v3 net] net: bridge: switchdev: Skip MDB replays of
pending events

On Mon, Feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@...il.com> wrote:
> On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
>> Before this change, generation of the list of events MDB to replay
>
> s/events MDB/MDB events/
>
>> would race against the IGMP/MLD snooping logic, which could concurrently
>
> logic. This could (...)
>
>> enqueue events to the switchdev deferred queue, leading to duplicate
>> events being sent to drivers. As a consequence of this, drivers which
>> reference count memberships (at least DSA), would be left with orphan
>> groups in their hardware database when the bridge was destroyed.
>
> Still missing the user impact description, aka "when would this be
> noticed by, and actively bother a user?". Something that would justify
> handling this via net.git rather than net-next.git.

Other than moving up the example to this point, what do you feel is
missing? Or are you saying that you don't think this belongs on net.git?
>>
>> Avoid this by grabbing the write-side lock of the MDB while generating
>> the replay list, making sure that no deferred version of a replay
>> event is already enqueued to the switchdev deferred queue, before
>> adding it to the replay list.
>
> The description of the solution is actually not very satisfactory to me.
> I would have liked to see a more thorough analysis.

Sorry to disappoint.
> The race has 2 components, one comes from the fact that during replay,
> we iterate using RCU, which does not halt concurrent updates, and the
> other comes from the fact that the MDB addition procedure is non-atomic.
> Elements are first added to the br->mdb_list, but are notified to
> switchdev in a deferred context.
>
> Grabbing the bridge multicast spinlock only solves the first problem: it
> stops new enqueues of deferred events. We also need special handling of
> the pending deferred events. The idea there is that we cannot cancel
> them, since that would lead to complications for other potential
> listeners on the switchdev chain. And we cannot flush them either, since
> that wouldn't actually help: flushing needs sleepable context, which is
> incompatible with holding br->multicast_lock, and without
> br->multicast_lock held, we haven't actually solved anything, since new
> deferred events can still be enqueued at any time.
>
> So the only simple solution is to let the pending deferred events
> execute (eventually), but during event replay on joining port, exclude
> replaying those multicast elements which are in the bridge's multicast
> list, but were not yet added through switchdev. Eventually they will be.

In my opinion, your three paragraphs above say pretty much the same
thing as my single paragraph. But sure, I will try to provide more
details in v4.
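
For concreteness, the add-side change I have in mind is roughly the
fragment below. This is only a sketch: the check against the switchdev
deferred queue is a hypothetical helper (name and signature made up for
illustration), and the call that queues the replay entry is simplified.

	spin_lock_bh(&br->multicast_lock);

	hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {
		/* If the addition is still sitting on the switchdev
		 * deferred queue, replaying it now would deliver the
		 * same event to the driver twice; skip it and let the
		 * deferred event do the work when it is processed.
		 * switchdev_mdb_add_is_deferred() is hypothetical.
		 */
		if (switchdev_mdb_add_is_deferred(dev, mp))
			continue;

		/* Queue the replay for this entry, as the replay code
		 * does today (call simplified here).
		 */
		err = br_switchdev_mdb_queue_one(&mdb_list, mp);
		if (err)
			break;
	}

	spin_unlock_bh(&br->multicast_lock);

Holding br->multicast_lock also makes the plain hlist_for_each_entry()
walk safe, instead of the RCU iteration used today.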
> (side note: the handling code for excluding replays on pending event
> deletion seems to not actually help, because)
>
> Event replays on a switchdev port leaving the bridge are also
> problematic, but in a different way. The deletion procedure is also
> non-atomic, they are first removed from br->mdb_list then the switchdev
> notification is deferred. So, the replay procedure cannot enter a
> condition where it replays the deletion twice. But, there is no
> switchdev_deferred_process() call when the switchdev port unoffloads an
> intermediary LAG bridge port, and this can lead to the situation where
> neither the replay, nor the deferred switchdev object, trickle down to a
> call in the switchdev driver. So for event deletion, we need to force a
> synchronous call to switchdev_deferred_process().

Good catch!

What does this mean for v4? Is this still one logical change that
ensures that MDB events are always delivered exactly once, or a series
where one patch fixes the duplicates and one ensures that no events are
missed?
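
If I read you right, the deletion side would then need something along
these lines on the unoffload path, so that a deferred deletion cannot be
lost when an intermediary LAG port is involved (sketch only, exact hook
to be determined; switchdev_deferred_process() needs sleepable context
with rtnl held):

	/* Before replaying deletions towards a port that stops
	 * offloading the bridge, force any switchdev events still on
	 * the deferred queue to reach the drivers synchronously.
	 */
	ASSERT_RTNL();

	switchdev_deferred_process();

	/* ...then generate and replay the deletion events as today. */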
> See how the analysis in the commit message changes the patch?

Suddenly the novice was enlightened.
>>
>> An easy way to reproduce this issue, on an mv88e6xxx system, was to
>
> s/was/is/
>
>> create a snooping bridge, and immediately add a port to it:
>>
>> root@...ix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
>> > ip link set dev x3 up master br0
>> root@...ix-06-0b-00:~$ ip link del dev br0
>> root@...ix-06-0b-00:~$ mvls atu
>> ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a
>> DEV:0 Marvell 88E6393X
>> 33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . .
>> 33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . .
>> ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a
>> root@...ix-06-0b-00:~$
>>
>> The two IPv6 groups remain in the hardware database because the
>> port (x3) is notified of the host's membership twice: once via the
>> original event and once via a replay. Since only a single delete
>> notification is sent, the count remains at 1 when the bridge is
>> destroyed.
>
> These 2 paragraphs, plus the mvls output, should be placed before the
> "Avoid this" paragraph.
>
>>
>> Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
>> Signed-off-by: Tobias Waldekranz <tobias@...dekranz.com>
>> ---