netdev - Re: [PATCH ipsec v2 1/2] bond: Use xfrm_state

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aR79MCBdyx2oTcp2@krikkit>
Date: Thu, 20 Nov 2025 12:36:16 +0100
From: Sabrina Dubroca <sd@...asysnail.net>
To: Cosmin Ratiu <cratiu@...dia.com>, steffen.klassert@...unet.com
Cc: "andrew+netdev@...n.ch" <andrew+netdev@...n.ch>,
	"davem@...emloft.net" <davem@...emloft.net>,
	"ap420073@...il.com" <ap420073@...il.com>,
	"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
	Leon Romanovsky <leonro@...dia.com>,
	"jv@...sburgh.net" <jv@...sburgh.net>,
	"kuba@...nel.org" <kuba@...nel.org>,
	"horms@...nel.org" <horms@...nel.org>,
	"edumazet@...gle.com" <edumazet@...gle.com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	Jianbo Liu <jianbol@...dia.com>,
	"pabeni@...hat.com" <pabeni@...hat.com>
Subject: Re: [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs

2025-11-17, 12:48:20 +0000, Cosmin Ratiu wrote:
> On Fri, 2025-11-14 at 13:56 +0100, Sabrina Dubroca wrote:
> > 2025-11-13, 12:43:09 +0200, Cosmin Ratiu wrote:
> > > The bonding driver manages offloaded SAs using the following
> > > strategy:
> > > 
> > > An xfrm_state offloaded on the bond device with bond_ipsec_add_sa()
> > > uses
> > > 'real_dev' on the xfrm_state xs to redirect the offload to the
> > > current
> > > active slave. The corresponding bond_ipsec_del_sa() (called with
> > > the xs
> > > spinlock held) redirects the unoffload call to real_dev.
> > 
> > 
> > > Finally,
> > > cleanup happens in bond_ipsec_free_sa(), which removes the offload
> > > from
> > > the device. Since the last call happens without the xs spinlock
> > > held,
> > > that is where the real work to unoffload actually happens.
> > 
> > Not on all devices (some don't even implement xdo_dev_state_free).
> 
> You're right. Looking at what the stack does:
> xfrm_state_delete() immediately calls xdo_dev_state_delete(), but
> leaves xdo_dev_state_free() for when there are no more refs (in-flight
> tx packets are done).
> xfrm_dev_state_flush() forces an xfrm_dev_state_free() immediately
> after deleting xs. I guess the goal is to clean up *now* everything
> from 'dev'.

Yes, see 07b87f9eea0c ("xfrm: Fix unregister netdevice hang on hardware offload.")
(though I'm not sure it's still needed, now that TCP drops the secpath
early: 9b6412e6979f)

> All other callers of xfrm_state_delete() don't care about free, it will
> be done when there are no more refs.
> 
> So right now for devices that implement xdo_dev_state_free(), there's
> distinct behavior of what happens when xfrm_state_delete gets called
> 
> So right now, there's a difference in behavior for what happens with
> in-flight packets when xfrm_state_delete() is called:
> 1. On devs which delete the dev state in xdo_dev_state_free(), in-
> flight packets are not affected.
> 2. On devs which delete the dev state in xdo_dev_state_delete(), in-
> flight packets will see the xs yanked from underneath them.
> 
> This makes me ask the question: Is there a point to the
> xdo_dev_state_delete() callback any more? Couldn't we consolidate on
> having a single callback to free the offloaded xfrm_state when there
> are no more references to it? This would simplify the delete+free dance
> and would leave proper cleanup for the xs reference counting.
> 
> What am I missing?

I don't know. Maybe it's a leftover of the initial offload
implementation/drivers that we don't need anymore? Steffen?


[...]
> > > With the new approach, in-use states on old_dev will not be deleted
> > > until in-flight packets are transmitted.
> > 
> > How does this guarantee it? It would be good to describe how the new
> > approach closes the race with a bit more info than "use
> > xfrm_state_migrate".
> > 
> > And I don't think we currently guarantee that packets using offload
> > will be fully transmitted before xdo_dev_state_delete was called in
> > case of deletion. 
> 
> Apologies for leaving this part out, yeah, it's pretty important.
> I changed the descriptions for the next versions, here's what happens:
> In-flight offloaded tx packets hold a reference to the used xfrm_state
> via xfrm_output -> xfrm_state_hold which gets released when the
> completion arrives via napi_consume_skb -...-> skb_ext_put ->
> skb_ext_put_sp -> xfrm_state_put.
> 
> But this doesn't work on devices which do the dev state deletion in
> xdo_dev_state_delete(), because those might get their SAs yanked from
> the device during the xfrm_state_delete() added in this patch. I guess
> this ties to the previous point: Shouldn't there be only
> xdo_dev_state_delete which touches the device when refcount is 0?
> 
> 
> > But ok, the bond case is worse due to the add/delete
> > dance when we change the active slave (and there's still the possible
> > issue Steffen mentioned a while ago, that this delete/add dance may
> > not be valid at all depending on how the HW behaves wrt IVs).
> 
> I am aware of that issue, I am not trying to change any of that. Just
> trying to improve bond from a security perspective.

Sure. But I'm not sure we can make it really trustworthy...

> I don't think it's
> ok for it to send out unencrypted IPSec packets.

Agree.


> > > It also makes for cleaner
> > > bonding code, which no longer needs to care about xfrm_state
> > > management
> > > so much.
> > 
> > But using the migrate code for that feels kind of hacky, and the 2nd
> > patch in this set also looks quite hacky.
> 
> It's less hacky than the manual xfrm state management done so far. At
> least bonding no longer needs to care so much about the semantics of
> the xfrm dev state operations. And it no longer needs to acquire the
> xs->lock (what does bonding have to do with an internal xfrm_state lock
> anyway?)

To me, it's hacky in the sense that we're hijacking the migrate code
that isn't intended for that, and triggering core xfrm operations from
inside a driver (and without proper locking). But true, the current
code is also hacky.

I think a better solution might be to find a way to use the
"per-resource" SA code for bonding (currently implemented for
"per-CPU" SAs, but a resource could be a lower device). Then we don't
have to worry about moving states from one link to another, but it
requires userspace cooperation.


> > And doing all that without protection against admin operations on the
> > xfrm state objects doesn't seem safe.
> > 
> > Thinking about the migrate behavior, if we fail to create/offload the
> > new state:
> >  - old state will be deleted
> >  - new state won't be created
> > 
> > So any packet we send afterwards that would need to use this SA will
> > just get dropped? (while the old behavior was "no more offload until
> > we change the active slave again"?)
> 
> This is not ideal, I agree. Perhaps instead of giving up on the failed
> xs there could be an alternate migration path where we call
> xdo_dev_state_free+xdo_dev_state_add like before? Ick, I don't really
> like that.
> 
> Alternatively, I have implemented another fix to these races, which is
> to change xs.xso to be able to be offloaded on multiple devices at the
> same time (nothing fancy, just parameter changes to xdo ops) and
> changing the bonding driver to maintain a single offloaded xfrm_state
> on *all* slaves (using bonding data structs). Changing the active slave
> then becomes as simple as updating the esn on the new device (to get
> that device state up to speed).
> Leon and I discussed about that and he suggested it would be better to
> use xfrm_state_migrate, since that is an existing well-understood
> workflow.
> Do you think that approach is worth pursuing instead? I could send them
> patches as RFC for discussion.

You can go ahead and share them if you want, but the short description
above kind of puzzles/scares me.

This whole feature is really a mess :/

-- 
Sabrina