netdev - Re: [PATCH ipsec v2 1/2] bond: Use xfrm_state

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <88f2bf5ef1977fcdd4c87051cd54a4545db993da.camel@nvidia.com>
Date: Mon, 17 Nov 2025 12:48:20 +0000
From: Cosmin Ratiu <cratiu@...dia.com>
To: "sd@...asysnail.net" <sd@...asysnail.net>
CC: "andrew+netdev@...n.ch" <andrew+netdev@...n.ch>, "davem@...emloft.net"
	<davem@...emloft.net>, "ap420073@...il.com" <ap420073@...il.com>,
	"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>, Leon Romanovsky
	<leonro@...dia.com>, "steffen.klassert@...unet.com"
	<steffen.klassert@...unet.com>, "jv@...sburgh.net" <jv@...sburgh.net>,
	"kuba@...nel.org" <kuba@...nel.org>, "horms@...nel.org" <horms@...nel.org>,
	"edumazet@...gle.com" <edumazet@...gle.com>, "netdev@...r.kernel.org"
	<netdev@...r.kernel.org>, Jianbo Liu <jianbol@...dia.com>,
	"pabeni@...hat.com" <pabeni@...hat.com>
Subject: Re: [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs

On Fri, 2025-11-14 at 13:56 +0100, Sabrina Dubroca wrote:
> 2025-11-13, 12:43:09 +0200, Cosmin Ratiu wrote:
> > The bonding driver manages offloaded SAs using the following
> > strategy:
> > 
> > An xfrm_state offloaded on the bond device with bond_ipsec_add_sa()
> > uses
> > 'real_dev' on the xfrm_state xs to redirect the offload to the
> > current
> > active slave. The corresponding bond_ipsec_del_sa() (called with
> > the xs
> > spinlock held) redirects the unoffload call to real_dev.
> 
> 
> > Finally,
> > cleanup happens in bond_ipsec_free_sa(), which removes the offload
> > from
> > the device. Since the last call happens without the xs spinlock
> > held,
> > that is where the real work to unoffload actually happens.
> 
> Not on all devices (some don't even implement xdo_dev_state_free).

You're right. Looking at what the stack does:
xfrm_state_delete() immediately calls xdo_dev_state_delete(), but
leaves xdo_dev_state_free() for when there are no more refs (in-flight
tx packets are done).
xfrm_dev_state_flush() forces an xfrm_dev_state_free() immediately
after deleting xs. I guess the goal is to clean up *now* everything
from 'dev'.
All other callers of xfrm_state_delete() don't care about free, it will
be done when there are no more refs.

So right now for devices that implement xdo_dev_state_free(), there's
distinct behavior of what happens when xfrm_state_delete gets called

So right now, there's a difference in behavior for what happens with
in-flight packets when xfrm_state_delete() is called:
1. On devs which delete the dev state in xdo_dev_state_free(), in-
flight packets are not affected.
2. On devs which delete the dev state in xdo_dev_state_delete(), in-
flight packets will see the xs yanked from underneath them.

This makes me ask the question: Is there a point to the
xdo_dev_state_delete() callback any more? Couldn't we consolidate on
having a single callback to free the offloaded xfrm_state when there
are no more references to it? This would simplify the delete+free dance
and would leave proper cleanup for the xs reference counting.

What am I missing?

> > 
> > When the active slave changes to a new device a 3-step process is
> > used
> > to migrate all xfrm states to the new device:
> > 
> > 1. bond_ipsec_del_sa_all() unoffloads all states in bond-
> > >ipsec_list
> >    from the previously active device.
> > 2. The active slave is flipped to the new device.
> > 3. bond_ipsec_add_sa_all() offloads all states in bond->ipsec_list
> > to
> >    the new device.
> > 
> > This patch closes a race that could happen between xfrm_state
> > migration
> > and TX, which could result in unencrypted packets going out the
> > wire:
> > CPU1 (xfrm_output)                   CPU2
> > (bond_change_active_slave)
> > bond_ipsec_offload_ok -> true
> >                                      bond_ipsec_del_sa_all
> > bond_xmit_activebackup
> > bond_dev_queue_xmit
> > dev_queue_xmit on old_dev
> > 				     bond->curr_active_slave =
> > new_dev
> > 				     bond_ipsec_add_sa_all
> > 
> > So the packet makes it out to old_dev after the offloaded
> > xfrm_state is
> > deleted from it. The result: an unencrypted IPSec packet on the
> > wire.
> > 
> > With the new approach, in-use states on old_dev will not be deleted
> > until in-flight packets are transmitted.
> 
> How does this guarantee it? It would be good to describe how the new
> approach closes the race with a bit more info than "use
> xfrm_state_migrate".
> 
> And I don't think we currently guarantee that packets using offload
> will be fully transmitted before xdo_dev_state_delete was called in
> case of deletion. 

Apologies for leaving this part out, yeah, it's pretty important.
I changed the descriptions for the next versions, here's what happens:
In-flight offloaded tx packets hold a reference to the used xfrm_state
via xfrm_output -> xfrm_state_hold which gets released when the
completion arrives via napi_consume_skb -...-> skb_ext_put ->
skb_ext_put_sp -> xfrm_state_put.

But this doesn't work on devices which do the dev state deletion in
xdo_dev_state_delete(), because those might get their SAs yanked from
the device during the xfrm_state_delete() added in this patch. I guess
this ties to the previous point: Shouldn't there be only
xdo_dev_state_delete which touches the device when refcount is 0?


> But ok, the bond case is worse due to the add/delete
> dance when we change the active slave (and there's still the possible
> issue Steffen mentioned a while ago, that this delete/add dance may
> not be valid at all depending on how the HW behaves wrt IVs).

I am aware of that issue, I am not trying to change any of that. Just
trying to improve bond from a security perspective. I don't think it's
ok for it to send out unencrypted IPSec packets.

> > It also makes for cleaner
> > bonding code, which no longer needs to care about xfrm_state
> > management
> > so much.
> 
> But using the migrate code for that feels kind of hacky, and the 2nd
> patch in this set also looks quite hacky.

It's less hacky than the manual xfrm state management done so far. At
least bonding no longer needs to care so much about the semantics of
the xfrm dev state operations. And it no longer needs to acquire the
xs->lock (what does bonding have to do with an internal xfrm_state lock
anyway?)

> 
> And doing all that without protection against admin operations on the
> xfrm state objects doesn't seem safe.
> 
> Thinking about the migrate behavior, if we fail to create/offload the
> new state:
>  - old state will be deleted
>  - new state won't be created
> 
> So any packet we send afterwards that would need to use this SA will
> just get dropped? (while the old behavior was "no more offload until
> we change the active slave again"?)

This is not ideal, I agree. Perhaps instead of giving up on the failed
xs there could be an alternate migration path where we call
xdo_dev_state_free+xdo_dev_state_add like before? Ick, I don't really
like that.

Alternatively, I have implemented another fix to these races, which is
to change xs.xso to be able to be offloaded on multiple devices at the
same time (nothing fancy, just parameter changes to xdo ops) and
changing the bonding driver to maintain a single offloaded xfrm_state
on *all* slaves (using bonding data structs). Changing the active slave
then becomes as simple as updating the esn on the new device (to get
that device state up to speed).
Leon and I discussed about that and he suggested it would be better to
use xfrm_state_migrate, since that is an existing well-understood
workflow.
Do you think that approach is worth pursuing instead? I could send them
patches as RFC for discussion.

> 
> 
> > Fixes: ("ec13009472f4 bonding: implement xdo_dev_state_free and
> > call it after deletion")
> 
> nit: wrong formatting of the Fixes tag

Thanks, will fix in the next submission (if we decide there will be a
next one).

> 
> 
> [just one comment on the diff, I'll look at it again if we decide to
> proceed with this patch]
> 
> > @@ -533,36 +535,42 @@ static void bond_ipsec_add_sa_all(struct
> > bonding *bond)
> >  			slave_warn(bond_dev, real_dev,
> >  				   "%s: no slave
> > xdo_dev_state_add\n",
> >  				   __func__);
> > -		goto out;
> > +		return;
> >  	}
> >  
> > -	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> > -		/* If new state is added before ipsec_lock
> > acquired */
> > -		if (ipsec->xs->xso.real_dev == real_dev)
> > -			continue;
> > +	/* Prepare the list of xfrm_states to be migrated. */
> > +	mutex_lock(&bond->ipsec_lock);
> > +	list_splice_init(&bond->ipsec_list, &ipsec_list);
> > +	/* Add back states already offloaded on the new device
> > before the
> > +	 * lock was acquired and hold all remaining states to
> > avoid them
> > +	 * getting deleted during the migration.
> 
> Even with hold(), they could still be deleted (but not destroyed)?

xfrm_del_sa() might concurrently happen, but since we have a ref,
they'll not get destroyed, yes.

> 
> > +	 */
> > +	list_for_each_entry_safe(ipsec, tmp, &ipsec_list, list) {
> > +		if (unlikely(ipsec->xs->xso.real_dev == real_dev))
> > +			list_move_tail(&ipsec->list, &bond-
> > >ipsec_list);
> > +		else
> > +			xfrm_state_hold(ipsec->xs);
> > +	}
> > +	mutex_unlock(&bond->ipsec_lock);
>