Date:   Fri, 24 Sep 2021 13:03:57 +0300
From:   Nikolay Aleksandrov <nikolay@...dia.com>
To:     Vladimir Oltean <vladimir.oltean@....com>,
        Florian Fainelli <f.fainelli@...il.com>
Cc:     Roopa Prabhu <roopa@...dia.com>, Andrew Lunn <andrew@...n.ch>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB

On 24/09/2021 01:49, Vladimir Oltean wrote:
> On Thu, Sep 23, 2021 at 03:29:07PM -0700, Florian Fainelli wrote:
>>> Does something like this have any chance of being accepted?
>>> https://patchwork.kernel.org/project/netdevbpf/cover/20210821210018.1314952-1-vladimir.oltean@nxp.com/
>>
>> Had not seen the link you just posted; in principle, speeding up the FDB
>> dump sounds good to me, especially given that we typically have these
>> slow buses to work with.
> 
> I didn't copy you because you were on vacation.
> 
>> These questions are probably super stupid and trivial, and I really
>> missed reviewing your latest work properly: how do we manage to keep the
>> bridge's FDB and the hardware FDB in sync, given that switches don't
>> typically tell us when they learn new addresses?
> 
> We don't, that's the premise, you didn't miss anything there. I mean in
> the switchdev world, I see that an interrupt is the primary mechanism,
> and we do have DSA switches which are theoretically capable of notifying
> us of new addresses, though not through interrupts but over Ethernet, of
> course. Ocelot for example supports sending "learn frames" to the CPU -
> these are just like flooded frames, except that a "flooded" frame is one
> with an unknown MAC DA, and a "learn" frame is one with an unknown MAC SA.
> I've been thinking about whether it's worth adding support for learn
> frames coming from Ocelot/Felix in DSA, and it doesn't really look like it.
> Anyway, when the DSA tagging protocol receives a "learn" frame, it needs
> to consume_skb() it, then trigger some sort of switchdev atomic notifier
> for that MAC SA (SWITCHDEV_FDB_ADD_TO_BRIDGE), but sadly that is only
> the beginning of a long series of complications, because we also need to
> know when the hardware has forgotten it, and it doesn't look like
> "forget" frames are a thing. So because the risk of having an address
> expire in hardware while it is still present in software is kind of
> high, the only option left is to make the hardware entry static, and
> (a) delete it manually when the software entry expires
> (b) set up a second alert which triggers when a MAC SA has been received on a
>     port other than the destination port pointed to by an existing FDB
>     entry. In short, a "station migration alert". Because the
>     FDB entry is static, we need to migrate it by hand, in software.
> So it all looks kind of clunky. Whereas what we do now is basically
> assume that the number of frames with an unknown MAC DA reaching the CPU
> via flooding is more or less equal to, and distributed similarly to, the
> frames with an unknown MAC SA reaching the CPU. I have not yet encountered
> a use case where the two are tragically different, in a way that could
> only be solved with SWITCHDEV_FDB_ADD_TO_BRIDGE events.
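
(For concreteness, the kind of tagger hook described above might look roughly
like the sketch below. It is purely illustrative: the function name, the zero
vid and the exact switchdev_notifier_fdb_info fields would all need checking
against the tree.)

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/switchdev.h>

/* Sketch of a hypothetical tagger hook for "learn" frames: report the
 * unknown MAC SA to the bridge over the atomic switchdev chain, then
 * drop the frame. Names are made up; not taken from any existing tagger.
 */
static void example_tagger_rx_learn_frame(struct sk_buff *skb,
                                          struct net_device *dev)
{
        struct switchdev_notifier_fdb_info info = {
                .addr = eth_hdr(skb)->h_source,
                .vid = 0,       /* would come from the tag/VLAN header */
        };

        call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_BRIDGE, dev,
                                 &info.info, NULL);

        /* The learn frame carries nothing else we care about. */
        consume_skb(skb);
}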
> 
> 
> Anyway, huge digression; the idea was that DSA doesn't synchronize FDBs,
> and that is the reason why we have an .ndo_fdb_dump implementation in the
> first place. Theoretically, if all hardware drivers were perfect,
> you'd only have the .ndo_fdb_dump implementation done for the bridge,
> vxlan, things like that. So that is why I asked Roopa whether hacking up
> rtnl_fdb_dump in this way, transforming it into a state machine even
> more complicated than it already is, is acceptable. We aren't its
> primary use case, I think.
> 

Hi Vladimir,
I glanced over the patches and the obvious first question is: have you tried pushing all of
this complexity closer to the driver which needs it?
I mean rtnl_fdb_dump() can already "resume" and passes all the necessary resume indices
to ndo_fdb_dump(), so it seems to me that all of this can be hidden away. I doubt
there will be many users overall, so it would be nice to avoid adding the complexity,
as you put it, and supporting it in the core. I'd imagine a hacked driver would simply cache
the dump for as long as it's needed (that's important to define well, more below) and just
return it for the next couple of devices which share it upon request. Basically you'd have the
same type of solution you've done here, just with the details hidden in the layer which needs it.
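
Roughly, what I have in mind is something along the lines of the sketch below.
It's completely untested and every name in it is made up; locking and cache
invalidation are hand-waved, it's only meant to show where the caching could live:

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/netdevice.h>
#include <linux/netlink.h>

/* Hypothetical per-switch state shared by all ports of one ASIC. */
struct example_switch {
        struct mutex     fdb_cache_lock;
        struct list_head fdb_cache;        /* cached hardware FDB entries */
        bool             fdb_cache_valid;
        int              ports_dumped;     /* ports that replayed the cache */
        int              num_ports;
};

static int example_port_fdb_dump(struct sk_buff *skb,
                                 struct netlink_callback *cb,
                                 struct net_device *dev,
                                 struct net_device *filter_dev,
                                 int *idx)
{
        /* All example_*() helpers below are made up for the sketch. */
        struct example_switch *sw = example_port_to_switch(dev);
        int err = 0;

        mutex_lock(&sw->fdb_cache_lock);

        /* "Prepare": read the hardware once, on the first port dumped. */
        if (!sw->fdb_cache_valid) {
                err = example_hw_read_fdb(sw, &sw->fdb_cache); /* slow bus */
                if (err)
                        goto out;
                sw->fdb_cache_valid = true;
                sw->ports_dumped = 0;
        }

        /* Replay only the cached entries that belong to this port. */
        err = example_replay_cache_for_port(sw, dev, skb, cb, idx);

        /* "Finish": drop the cache once the last port has been dumped. */
        if (++sw->ports_dumped == sw->num_ports) {
                example_free_fdb_cache(sw);
                sw->fdb_cache_valid = false;
        }
out:
        mutex_unlock(&sw->fdb_cache_lock);
        return err;
}

The signature above is just the current .ndo_fdb_dump one, so nothing in the
core would need to change for this part.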

Now the hard part seems to be figuring out when to finish in this case. Prepare should be a simple
check whether a shared fdb list is populated; finish would need to be inferred. One way to do that is
based on a transaction/dump id which is tracked for each shared device and the last dump. Another
is to pass a new argument to ndo_fdb_dump() saying whether it's dumping a single device or doing a
full dump, since that's the case that's difficult to differentiate. If you're doing a single
dump you obviously do a normal fdb dump without caching; if you're doing a full dump (all devices)
then you check whether the list is populated, and if it is and this is the last device, you free
the cached shared list (finish phase). That would make the core change very small and would push the
complexity to be maintained where it's needed. Actually, you have the netlink_callback struct passed
down to ndo_fdb_dump(), so probably that can be used to infer whether you're doing a single or
multi-device dump already, and there can be 0 core changes.
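
For instance, something as simple as the below might already be enough to tell
the two cases apart (sketch only; whether peeking at the request like this from
a driver is acceptable is exactly the kind of detail that would need checking):

#include <linux/neighbour.h>
#include <linux/netlink.h>
#include <net/netlink.h>

/* Sketch: guess whether this is a targeted (single-device) dump by
 * checking if the request carried an ifindex filter. Assumes the
 * strict, ndmsg-based request layout; a legacy request would need the
 * same handling that rtnl_fdb_dump() itself applies.
 */
static bool example_fdb_dump_is_filtered(const struct netlink_callback *cb)
{
        return nlmsg_find_attr(cb->nlh, sizeof(struct ndmsg),
                               NDA_IFINDEX) != NULL;
}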

Also, one current downside of this set is the multiple walks over the net device list for a single dump,
especially for setups with thousands of net devices. Actually, I might be missing something, but what
happens if a dump is terminated before it's finished? It looks like the finish phase will never be reached.
That would be a problem for both solutions; you might have to add a netlink ->done() callback anyway.
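
If it comes to that, the ->done() side itself would be tiny, something like the
sketch below. Again the names are invented, and it assumes the rtnetlink core
would be taught to pass a ->done() through struct netlink_dump_control for fdb
dumps, which as far as I can see it doesn't do today:

#include <linux/netlink.h>

/* Sketch of such a ->done() hook. netlink calls ->done() when a dump
 * ends, including when the requesting socket goes away mid-dump, which
 * is exactly the case where the "finish" phase would otherwise be missed.
 * The helpers are made up for illustration.
 */
static int example_fdb_dump_done(struct netlink_callback *cb)
{
        struct example_switch *sw = example_dump_cb_to_switch(cb); /* made up */

        if (sw)
                example_free_fdb_cache(sw);
        return 0;
}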

Cheers,
 Nik