Message-ID: <20240207163614.irxeoowoapglbnxj@skbuf>
Date: Wed, 7 Feb 2024 18:36:14 +0200
From: Vladimir Oltean <olteanv@...il.com>
To: Tobias Waldekranz <tobias@...dekranz.com>
Cc: davem@...emloft.net, kuba@...nel.org, atenart@...nel.org,
roopa@...dia.com, razor@...ckwall.org, bridge@...ts.linux.dev,
netdev@...r.kernel.org, jiri@...nulli.us, ivecera@...hat.com
Subject: Re: [PATCH v3 net] net: bridge: switchdev: Skip MDB replays of
pending events
On Tue, Feb 06, 2024 at 10:58:18PM +0100, Tobias Waldekranz wrote:
> On tis, feb 06, 2024 at 21:31, Vladimir Oltean <olteanv@...il.com> wrote:
> > On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
> >> On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@...il.com> wrote:
> >> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
> >> >> Before this change, generation of the list of events MDB to replay
> >> >
> >> > s/events MDB/MDB events/
> >> >
> >> >> would race against the IGMP/MLD snooping logic, which could concurrently
> >> >
> >> > logic. This could (...)
> >> >
> >> >> enqueue events to the switchdev deferred queue, leading to duplicate
> >> >> events being sent to drivers. As a consequence of this, drivers which
> >> >> reference count memberships (at least DSA), would be left with orphan
> >> >> groups in their hardware database when the bridge was destroyed.
> >> >
> >> > Still missing the user impact description, aka "when would this be
> >> > noticed by, and actively bother a user?". Something that would justify
> >> > handling this via net.git rather than net-next.git.
> >>
> >> Other than moving up the example to this point, what do you feel is
> >> missing? Or are you saying that you don't think this belongs on net.git?
> >
> > I just don't have enough data to be able to judge (it's missing from the
> > commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
> > as a reference. It lists a series of guidelines from which I gather,
> > generally speaking, that there should exist a user impact on
> > functionality. The "mvls atu" command is a debug program, it just dumps
> > hardware tables. It is enough to point out that multicast entries leak,
> > but not what will be broken as a result of that.
>
> Fair enough.
>
> I originally discovered the issue while developing a kselftest for
> another improvement I want to make to switchdev multicast offloading
> (related to router ports). I thought my new code was causing these
> orphan entries, but then realized that the issue existed on a plain -net
> and -net-next kernel.
>
> I can imagine scenarios in which a user _could_ be impacted by this, but
> they are all quite esoteric. Maybe that is an indication that I should
> just target net-next instead.
Again, I'm not discouraging you from sending this patch to net.git. I've been
saying (since v1, apparently: https://lore.kernel.org/netdev/20240131135157.ddrtt4swvz5y3nbz@skbuf/)
that if there is a reason to do it, just say it in the commit message,
so that we're all on the same page.
Be verbose when calculating the cost/benefit ratio (preferably also in
the commit message). I would rate the complexity of the change as
"medium", since it does change the locking scheme to fix an
underdesigned mechanism. And the fact that mlxsw also uses replays,
through a slightly different code path, means there is something you
(probably) can't test, which makes the change potentially risky.
> >> > The race has 2 components, one comes from the fact that during replay,
> >> > we iterate using RCU, which does not halt concurrent updates, and the
> >> > other comes from the fact that the MDB addition procedure is non-atomic.
> >> > Elements are first added to the br->mdb_list, but are notified to
> >> > switchdev in a deferred context.
> >> >
> >> > Grabbing the bridge multicast spinlock only solves the first problem: it
> >> > stops new enqueues of deferred events. We also need special handling of
> >> > the pending deferred events. The idea there is that we cannot cancel
> >> > them, since that would lead to complications for other potential
> >> > listeners on the switchdev chain. And we cannot flush them either, since
> >> > that wouldn't actually help: flushing needs sleepable context, which is
> >> > incompatible with holding br->multicast_lock, and without
> >> > br->multicast_lock held, we haven't actually solved anything, since new
> >> > deferred events can still be enqueued at any time.
> >> >
> >> > So the only simple solution is to let the pending deferred events
> >> > execute (eventually), but during event replay on joining port, exclude
> >> > replaying those multicast elements which are in the bridge's multicast
> >> > list, but were not yet added through switchdev. Eventually they will be.
> >>
> >> In my opinion, your three paragraphs above say pretty much the same
> >> thing as my single paragraph. But sure, I will try to provide more
> >> details in v4.
> >
> > It's not about word count, I'm notoriously bad at summarizing. If you
> > could still say in a single paragraph what I did in 3, it would be great.
> >
> > The difference is that the above 3 paragraphs only explore the race on
> > MDB addition events. It is of a different nature than the one on deletion.
> > Your analysis clumps them together, and the code follows suit. There is
> > code to supposedly handle the race between deferred deletion events and
> > replay on port removal, but I don't think it does.
>
> That's the thing though: I did not lump them together, I simply did not
> think about the deletion issue at all. An issue which you yourself state
> should probably be fixed in a separate patch, presumably because you
> think of them as two very closely related, but ultimately separate,
> issues. Which is why I think it was needlessly harsh to say that my
> analysis of the duplicate-events-issue was too simplistic, because it
> did not address a related issue that I had not considered.
So in the comments for your v1, one of the things I told you was that
"there's one more race to deal with" on removal.
https://lore.kernel.org/netdev/20240131150642.mxcssv7l5qfiejkl@skbuf/
At the time, the idea had not crystallized in my mind either that it's
not "one more" race on removal, but that on removal it's _the only_
race.
You then submitted v2 and v3 without taking this comment into
consideration, or even exploring it. Exploring it thoroughly would have
revealed that
your approach - to apply the algorithm of excluding objects from replay
if events on them are pending in switchdev - is completely unnecessary
when "adding=false".
In light of my comment and the lack of a detailed analysis on your side
on removal, it _appears_, by looking at this change, that the patch
handles a race on removal, but the code actually runs through the
algorithm that handles a race that doesn't exist, and doesn't handle the
race that _does_ exist.
My part of the fault is doing spotty review, giving ideas piecemeal, and
getting frustrated you aren't picking all of them up. It's still going
to be a really busy few weeks/months for me ahead, and there's nothing I
can do about that. I'm faced with the alternatives of either doing a
shit job reviewing and leaving comments/ideas in the little time I have,
or not reviewing at all. I'll think about my options some more.